"Artificial-Intelligently Challenged"

Preface

Hello everyone, it's me again.

Two years ago, I wrote an article, "Why are today's artificial intelligence assistants artificially unintelligent?". That one focused mainly on "smart assistants". This time the message is: "I'm not targeting anyone in particular; I mean that none of today's deep learning can handle conversational AI", and that covers "every conversational AI product you have seen. So what should we do?"

– Reading threshold –

  • Time: this article is really long (near 3). According to feedback from preview readers, the first read usually takes a lot of energy, and Part 3 is the essence (also the most brain-burning part). Please budget your reading time accordingly.
  • Readability: I will invite you to think along with the content (no professional knowledge needed), so it may not suit reading on a commute. What you gain will depend on how much you participate in the thinking along the way.
  • Suitable audience: conversational AI practitioners, AI PMs, investors who follow AI, friends with a strong interest in AI, and friends who worry about whether their jobs will be replaced by AI;
  • About the links: you do not need to open the links while reading this article; skipping them will not affect your understanding.

– About the term "artificial mental retardation" –

After the last article went out, a friend told me that the term "artificial mental retardation" in the title seemed a bit offensive. As a student of languages, let me explain the wording:

It started when I was chatting with a corporate consultant about the state of the artificial intelligence track. Since the conversation was in English, to express my view that "the current intelligent assistant industry is in an insurmountable predicament", I told her: "Currently all the digital assistants are Artificial-Intelligently challenged."

She laughed when she heard it. "Intellectually challenged" is the euphemistic English expression for mental retardation. Had she not known that piece of common knowledge, she might have missed the joke: she could still grasp the core meaning, just without finding anything funny in it, and some information would have been lost in transmission.

When I wrote the article, I translated this into Chinese and it became "artificial mental retardation". But owing to the characteristics of Chinese grammar, some information was lost in translation. For instance, what I actually expressed was "a state of being stuck in a predicament", not "a kind of thing".

(By the way, "mental retardation" in Chinese is actually a politically correct term; see the wording used by the Special Olympics.)

Why spend so many words explaining one term? Because different people take different understandings away from the same words. That is also one of the key points we want to discuss.

So let's get started.


Part 1: The performance of conversational intelligence: artificial mental retardation

Sophia in AI for Good Global Summit 2017. Source: ITU


In October 2017, the robot pictured above, Sophia, was formally granted citizenship by Saudi Arabia. Citizenship: an endorsement even stronger than passing the Turing test. And this in Saudi Arabia, which had only just allowed women to drive (decree issued in September 2017).

Sophia frequently attends conferences, "gives speeches" and "takes interviews": speaking in a dialogue at the UN, displaying speech remarkably close to a human's; shooting an MV with Will Smith; being interviewed by mainstream media such as Good Morning Britain; the founder of the company behind her even claimed, in all seriousness in an interview with Jimmy Fallon, that Sophia is "basically alive".

Basically alive. Mind you, the onlookers in the West grew up watching The Terminator, and had recently been watching Westworld. In their world model, the setting "machine intelligence will awaken" is bound to come true sooner or later.

The general public began to tremble. Not only did they start worrying about whether their jobs would be replaced; many began worrying about whether AI would come to rule humanity. "The future is here": many believed true artificial intelligence was close at hand.

However, some may have noticed something that didn't add up: "Wait, artificial intelligence is threatening humanity? Then why is my Siri still this stupid?"

Source: Dumb And Dumber: Comparing Alexa, Siri, Cortana And The Google Assistant, Forbes, May 2018


Let's look at where the field of conversational intelligence stood by the end of 2018.

" Not Japanese food "

At the end of 2016 I ran a test, putting a seemingly simple request to several smart assistants: "Recommend a restaurant, but not Japanese food." Every one of the AI assistants returned a pile of restaurant recommendations, all of them Japanese.

Two years have passed. Any progress on handling this request? We ran the test again:

The result: still unsolved. The word "not" was ignored by every assistant.

Why care about the word "not"? Some time ago I visited a very famous intelligent-voice startup, and when I brought up this problem, their PM looked puzzled: "What's the use of handling this logic? In our backend we see that users rarely phrase things this way."

When you hear a comment like that, you can basically confirm the company has not yet gone deep into professional service-domain dialogue.

On the scenario side, once you enter multi-round conversations in a service domain, you quickly run into expressions like "I don't want this one, is there anything cheaper?". If "we haven't seen it in the backend", that only means users haven't really started using the service. The scenario question comes down to the AI company's choice of domain.

But on the technology side, it matters enormously, because this is a core characteristic of genuine intelligence. We will discuss it in detail in Parts 2 and 3. For now, one conclusion: until this problem is solved, intelligent assistants will remain artificially unintelligent.

" From To C to To B "

Ever since deep learning caught fire among developers around 2015, companies large and small have wanted to build the general-purpose assistant for individual consumers, like the one in "Her" (the ultimate goal of a To C product). After a wave of hot money went to the most promising seed teams (the ones with fancy backgrounds), the wave collapsed completely. So far, every commercial To C product, whether from a giant or a startup, has failed to meet user expectations.

Intuitively, people assume that an "intelligent assistant" handling everyday tasks, with no professional expertise involved, should be easier to build than an "intelligent expert". This is our intuition about humans carried over: everyone can recommend a restaurant or plan a trip, while only a handful of professionally trained people can handle professional matters such as financial or medical consultations.

For today's AI, the opposite is true. We can now build an AI that beats Ke Jie at Go, but not one that can help Ke Jie manage his daily life.

With the collapse of the To C assistant track, To B or not To B is no longer the question, because there is no choice but To B. This is not a choice of business model but a technical limitation. At present To B, especially in a limited domain, is more feasible than To C: for one thing, the domain is relatively closed, so users, from their thinking to their language, are less likely to go off-script; for another, the data is sufficient.

The trouble is that To B companies are easily regarded as "outsourcing shops". Deals are negotiated customer by customer and projects delivered one by one, which means slow growth, reliance on headcount, and none of the exponential growth that compounding brings. Nobody is happy to hear that.

This "build a bot for you" business is a bit like "build a website for you" in the web era. Teams that pivoted to To B are often questioned by investors: "You're doing project work; how does this scale?"

Bear in mind that many of China's investment institutions and investment managers entered the business during China's mobile internet wave, in which "scalability" and "hyper-growth" are among the most important indicators. Project work is case by case; to grow you have to pile on people, and exponential growth is hard to come by. That is rather awkward.

"Don't worry, I have SaaS! No wait, AIaaS. I can build a platform, a set of tools that lets customers assemble the bots themselves."

However, none of these would-be skill-platform startups have succeeded, and they cannot succeed in the short term.

Yann LeCun's view of AIaaS


The main logic is this: you give the customer the chisel, but what he needs is the statue; a sculptor is missing in the middle. The evidence: those platforms that tried to open "dialogue-building tools" to long-tail developers and even service providers, promising "develop your own AI bot in 3 minutes" (no names named), went nowhere. If you can't build a satisfying product yourself, how can you abstract a paradigm and expect others to succeed inside your (non-working) framework?

That said, I do think MLaaS can succeed in the long run; it is simply too early, and the industry needs to mature further. We will analyze this in Part 5.

"The success of speakers and the failure of intelligence"

In the dialogue field, another hot track is the smart speaker.

Every major tech company has its own smart speaker: Tencent's Dingdang, Alibaba's Tmall Genie, the Xiaomi speaker; abroad, Amazon's Alexa, Google's speakers, and so on. As a hardware category this is actually a decent business, basically manufacturing.

Not only are shipments respectable; the speaker is also expected to become an ecosystem business. The core logic sounds full of promise:

  • Super-terminal: in the post-mobile era, everyone wants to seize the next iPhone-like entry point to users. Once users are accustomed to getting advice or services by voice, one can even sell hardware at a loss and make money on software, Xbox/PS style;
  • Voice as an OS: developers build all kinds of voice skills, and a mass of "indispensable skills" in turn feeds the OS's market share;
  • A developer platform: like Xcode, provide developers with tools for building and distributing applications, and provide traffic for the services built on it.

However, the actual usage of these skills looks like this:

Source: report by Statista


  • The much-anticipated killer app never appeared;
  • There are basically no commercial service applications;
  • Skill developers don't make money, and don't know how they would;
  • Most high-frequency skills have no commercial value; the one users invoke most is "check the weather";
  • No differentiation: in terms of intelligence, the speakers are basically indistinguishable.

" The emperor's new artificial intelligence "

Looking back at our Saudi citizen, Sophia: with so many companies, so much money and so many scientists having achieved only the above, how can this Sophia be such a blockbuster?

Because Sophia's "intelligence" is a scam.

You can quote Yann LeCun's comment on this directly: "This is complete nonsense."

Simply put, Sophia is a puppet with a loudspeaker: the conference speeches and interview answers are all written by humans in advance, then synthesized into speech for output. Yet she has been promoted as if this were the self-aware speech of her "artificial intelligence".

And this thing got "citizenship", which may be the worst devaluation human citizenship has ever suffered. It feels like watching my orange cat be awarded a bachelor's degree in civil engineering by a 985 university.

In fact, in dialogue systems, manually scripted content and template-based replies are the current state of the art (we will expand on this later).

But deliberately portraying a "non-intelligent" product as a display of "intelligence" is simply wrong.

Considering that most onlookers learn about technological progress through media channels, the hype-chasing media (such as the aforementioned Tech Insider) are accomplices in this scam. Those journalists, whether ignorant or unscrupulous, really did not do their investigative homework.

Recently this ill wind has also blown toward the domestic "leeks" (the gullible public).

Sophia appeared in Wang Leehom's AI-themed MV; then, in November 2018, she showed up on a major business platform.

Honestly, the people doing serious work in this industry should stand up and make it clearer to everyone where AI, or rather the boundary of machine learning, currently lies. Otherwise, Party A's bosses will take it all at face value, point at Sophia and tell you: "They can be that natural; give me exactly that."

What are you going to do then, hide a real person inside?

Speaking of which, that is happening for real now: using humans, disguised as artificial intelligence that simulates humans, to serve users.

Source: The Guardian


A typical domestic case is the bank lobby robot, which is actually a remote human voice (so-called telepresence). In the US there is X.ai, which does email-based schedule management; only this "AI" clocks off at 5 pm.

Of course, if I were the developer behind such scams, when questioned I could always yank the story back to artificial intelligence: "This is to accumulate real dialogue data, which will later be used to train a real AI dialogue system."

Perhaps the story can be made airtight. But those doing honest work in this industry should stand up, as Fu Sheng did, and point out that these practices are deceptive: "Nobody in the world can do it yet... and what cannot be done must not be faked."

The Saudis treat an AI as a person; these schemes treat people as AI. How is the public ever supposed to learn what AI really is?

" What (tmd) does 'artificial intelligence' even mean? "

On one side, if AI is currently this stupid, why does Elon Musk say "AI is very likely to destroy humanity", and why did Hawking state outright that "AI may be the worst event in the history of human civilization"?

On the other side, the chief scientists of Facebook and Google say that current AI is rubbish, that there is nothing to worry about at all, and that it may even need to be torn down and redone.

Whom should you believe? On one side, the man who wants to go to Mars and the man who may already have left for it; on the other, the leading scientists of two current tech giants.

In fact, they are all right, because the "artificial intelligence" they refer to are two different things.

The artificial intelligence that Musk and Hawking worry about is genuinely intelligent man-made intelligence: AGI (Artificial General Intelligence) and, beyond it, Superintelligence.

The artificial intelligence that Yann LeCun and Hinton refer to is the technology currently used to achieve "artificial intelligence effects", namely statistics-based machine learning. Their view is that continuing down this path will not produce real intelligence.

The two are entirely different in nature: one refers to the result, the other to the (current) process.

So what are we talking about when we talk about artificial intelligence?

John McCarthy


John McCarthy coined the term AI in 1956, together with Marvin Minsky, Nathaniel Rochester and Claude Shannon, at the Dartmouth workshop. Yet to this day there is no unified understanding of it in academia.

The most fundamental problem is that humanity's current definition of "intelligence" is itself not clear enough. Moreover, that humans are the best embodiment of intelligence is not a given; just think of some of the people you deal with every day :)

On the one hand, in the public's eyes, artificial intelligence is "artificial, human-like intelligence", such as Siri; and an AI's level depends on how human it seems. So when Sophia appears in public, ordinary people are easily taken in (she even "passes" the Turing test in their eyes).

Oracle's definition of AI is likewise "any technology that enables a computer to simulate human behavior counts!"

On the other hand, taking "Artificial Intelligence" literally, any man-made intelligent product should in theory count as artificial intelligence.

By that reading a handheld calculator, though nothing like a person, should count as an artificial intelligence product. Yet I suspect most people would not consider a calculator the artificial intelligence they have in mind.

These divergent interpretations lead to many differences in the expectations and assessments people have of AI applications.

Pile on the concepts of "deep learning, neural networks, machine learning" and the confusion grows: the public has heard all the words, but few know what each means or how they relate to one another.

"Doesn't matter, the leeks don't need to understand." But those who intend to cut the leeks had better figure it out. Yet some investors cannot tell the difference themselves; how do they judge and invest in projects? By gut feeling, of course.

That is the state of conversational artificial intelligence at the end of 2018: the assistants are still unintelligent; most To B bot businesses cannot scale; there is no product in dialogue as stunning as AlphaZero was in Go, and no sign of a large-scale commercial rise; and meanwhile some are muddying the waters and fishing in them.

Why is this happening? Why has artificial intelligence made rapid progress in image recognition, face recognition and Go, yet remains such a mess in conversational intelligence?

Since you have read this far, I trust you are a good comrade willing to dig into essentials. So let us look at what the essence of dialogue is, and what the essence of the current dialogue system is.


Part 2: The essence of the current dialogue system: filling in forms

" When AI thinks, humans laugh "

Source: The Globe and Mail


A flock of chicks hatched on a farm and lived there in peace.

A scientist emerged among them. It noticed a phenomenon: every morning, food automatically appeared in the trough.

As a good inductivist, this scientist chicken was in no hurry to draw conclusions. It began observing and recording thoroughly, trying to determine whether the phenomenon held under different conditions.

"It is so on Monday, and so on Tuesday; so when the leaves turn green, and so when they yellow; so in cold weather and so in hot; so in rain and so in sunshine!"

Each day's observation made it more excited: in its heart it was drawing ever closer to the truth. Until one day, the scientist chicken could find no new environmental variation left to observe. That morning, when the coop door opened, it walked to the trough and looked: food, as always!

The scientist chicken resolved to make an announcement to its companions: "I predict that every morning, there will be food in the trough. There will be food tomorrow morning! And every morning after! We need never fear starvation!"

After several days in which its companions verified the prophecy, the scientist chicken proudly and excitedly summed it up as "the early chicken gets the food theorem".

Just then the farmer passed by and saw an excited, clucking chicken. He smiled: "This one is nice and plump. Time to make a dish of it."

The scientist chicken died at lunchtime.

In this story, Russell's chicken (from Bertrand Russell's parable) only counts and summarizes phenomena; it does not reason about causes.

Mainstream statistics-based machine learning, deep learning in particular, likewise achieves the effect of "recognizing semantics" through a large number of examples, by classifying features of the text. This practice is Russell's chicken.

At present, this is the mainstream technical foundation of conversational artificial intelligence. Its main application is the dialogue system, or Agent. The assistants mentioned earlier, Siri, Cortana and Google Assistant, as well as the industry's intelligent customer service, are all applications of conversational intelligence.

" The black box of conversational intelligence "

These products interact through human natural language, not through a graphical interface.

Graphical interface (GUI) products, such as web pages or apps, are WYSIWYG: what you see is the interface and the functionality.

A conversational UI (CUI, Conversational UI) is a black box: the end user can perceive what he says (input) and the bot's answer (output), but cannot perceive the processing in between. It is like talking to a person: you do not know what he is thinking.

The black box of each dialogue system is the place where its developers play freely.

Although every black box differs, the underlying ideas barely vary. The core is two parts: understanding speech (recognition) and producing speech (dialogue management).

If you are a practitioner, please answer one question: is the dialogue management in your system slot filling? If so, you may skip this section (which mainly explains what slot filling is) and jump straight to the fifth section of this chapter, "Limitations of the current dialogue system".

" How does AI understand people? "

The dialogue system suddenly caught fire in 2015, mainly because one technology became popular: machine learning, deep learning in particular, applied not only to speech recognition but to NLU (natural language understanding), whose main job is recognizing what people are saying.

The spread of this technology let many teams master a key set of skills: intent recognition and entity extraction. What does that mean? Let's look at an example.

In daily life, if you want to book a flight, people have many natural ways of expressing it:

"Book a flight";

"Is there a flight to Shanghai?";

"Check the flights departing for New York next Tuesday";

"I'm going on a trip, help me look up tickets";

And so on.

We can say that "natural expression" has an infinite number of combinations (that is natural language) which all carry the intent of "booking a flight". Anyone who hears these expressions can accurately understand that they refer to booking a flight.

Understanding so many different expressions is a challenge for machines. In the past, machines could only handle "structured data" (keywords, say): if you wanted the machine to understand what you were saying, you had to issue precise instructions.

So whether you said "I want to travel" or "Help me look at flights to Beijing", as long as the utterance lacked the preset keyword "flight booking", the system could not handle it. And any utterance that did contain the keyword, such as "cancel my flight booking", which contains those words too, would be processed as the user wanting to book a flight.
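The keyword approach, with both of its failure modes, can be sketched in a few lines (a toy illustration; the keyword and test phrases are invented for the example, not taken from any real assistant):

```python
# Toy keyword matcher: it fires whenever the keyword substring appears,
# regardless of what the user actually means.
def keyword_match(utterance: str, keyword: str = "flight booking") -> bool:
    return keyword in utterance.lower()

# Miss: a valid request that happens not to contain the exact keyword.
print(keyword_match("I want to travel, help me look at flights"))  # False
# False positive: the user wants to CANCEL, but the keyword is present.
print(keyword_match("Cancel my flight booking"))                   # True
```

Both errors come from the same place: the matcher inspects the surface string, not the meaning.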

With natural language understanding, the machine can be trained to distinguish, among all kinds of natural expressions, which belong to this intent and which do not, no longer relying on rigid keywords. For example, after training, the machine can recognize that "recommend a restaurant nearby" is not an expression of the "book a flight" intent.

Moreover, through training, the machine can also automatically extract "Shanghai" from the sentence as the concept of the destination (that is, an entity), and "next Tuesday" as the departure time.

And so it seems that "the machine can understand people!"
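A deliberately crude sketch of these two skills, intent recognition and entity extraction. Real systems train statistical models on large corpora; here the "training data" is a handful of invented phrases and the "model" is mere word overlap, just to make the input/output shape concrete:

```python
# Toy "NLU": intent recognition by word overlap with a few labeled examples,
# entity extraction by dictionary lookup. All training phrases, city names
# and slot names below are invented for illustration.
TRAIN = {
    "book_flight": [
        "book a flight",
        "is there a flight to shanghai",
        "check the flights departing for new york next tuesday",
    ],
    "find_restaurant": [
        "recommend a restaurant nearby",
        "find me somewhere to eat",
    ],
}
CITIES = {"shanghai", "beijing", "new york"}

def classify(utterance: str) -> str:
    """Pick the intent whose examples share the most words with the input."""
    words = set(utterance.lower().split())
    def score(intent: str) -> int:
        return max(len(words & set(p.split())) for p in TRAIN[intent])
    return max(TRAIN, key=score)

def extract_entities(utterance: str) -> dict:
    """Pull out any known city (as destination) and a couple of date phrases."""
    text = utterance.lower()
    slots = {}
    for city in CITIES:
        if city in text:
            slots["destination"] = city
    for date in ("tomorrow", "next tuesday"):
        if date in text:
            slots["date"] = date
    return slots

print(classify("is there a flight to beijing"))          # book_flight
print(extract_entities("flight to shanghai tomorrow"))   # {'destination': 'shanghai', 'date': 'tomorrow'}
```

The real models are statistical rather than overlap counts, but the contract is the same: text in, an intent label and a bag of entities out.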

Why did this technology spread so quickly? Mainly because of the academic culture of the machine learning field: the important papers are basically all public. What each team has to do is weigh the implementation cost of its specific project.

The net effect is that, in recognizing natural language, everyone's basic tools are nearly identical: the accuracy of intent recognition and entity extraction differs by mere percentage points. Since the tool itself is not a core competitive edge (you can even use someone else's; there is plenty to choose from), the key question is what you can build with it.

“Due to the academic culture that ML comes from, pretty much all of the primary science is published as soon as it's created – almost everything new is a paper that you can read and build with. But what do you build? ”

- Benedict Evans (A16Z Partner)

Here, the most obvious value is freeing your hands. Voice-control products only need to understand the user's natural language and then execute the operation: to turn on the lights at home, you just say "turn on the lights" instead of pressing the switch; in the car, you say "open the sunroof" and the sunroof opens, with no need to hunt for the right button.

The focus of this type of system is hearing clearly which user is saying what. Hence microphone arrays, near-field and far-field noise suppression, voiceprint recognition of the speaker's identity, ASR (speech to text) and so on: hardware and software that keep being optimized toward that goal.

"What to say back" is not that important in this type of application. Usually the execution of the task is itself the feedback, such as the light coming on. Verbal feedback is only auxiliary, and optional.

However, task-oriented conversational intelligence usually involves more than the single round of interaction of voice control. If a user says "look up tomorrow's flights", that is perfectly normal, but it cannot be executed directly, because information necessary for execution is missing: 1) departing from where? 2) going where?

If we want the AI Agent to execute this task, it must obtain both pieces of information. A person completing this business would also have to ask the user questions to get the information. Often there is more than one such question, which means initiating multiple rounds of dialogue.

The same is true for AI.

To learn "where to" = the Agent asks the user "Where are you going?"

To learn "where from" = the Agent asks the user "Where are you departing from?"

This involves generating conversational language.

" How does AI talk like a human? "

Deciding what to say is the core of a dialogue system, whether silicon-based or carbon-based. And in this part, deep learning plays no role at all.

At present, the mainstream approach to the "what to say" problem is to let the so-called "dialogue management" system decide.

Although the "dialogue management" mechanism behind each dialogue system is different, and every team has its own understanding and design, none escapes one essence: in all current task-oriented dialogue systems, whether Google Duplex from a while back, or intelligent customer service, or intelligent assistants, there is exactly one core dialogue management method: "filling slots", that is, slot filling.

If you don't know the technology but need to quickly gauge the level of a conversational AI, and whether there is any black technology inside (say you are a friend who has just started looking at the AI field), you only need to ask one question: "Is it slot filling?"

  • If they (honestly) answer "yes", you can set your mind at rest: the black technology has not yet appeared. The discussion can then only be about product design, engineering implementation, and how to resolve the dilemma between experience and scale. Basically, what is unintelligent will remain unintelligent.
  • If they answer "no, not slot filling" and the product is still very good, then it is interesting and worth studying; or rather, please contact me at once :)

So what is this "slot filling"? Those who don't do development can simply understand it as "filling in a form": just as when you go to a bank to transact some business, you must first fill out a form.

If the required blanks on the form are left empty, the teller will not process it. She will circle them with a red pen: "These blanks must be filled in; that one you can leave alone." You fill them in, hand the form back, and she goes off to handle the business for you.

Remember the flight example just now? The user says "look up tomorrow's flights". To execute the flight lookup, the following steps must happen, in order:

1. ASR: convert the user's speech into text.


2. NLU semantic recognition: classify the text above into a (pre-defined) intent, here "book a flight"; then extract the entities in the text: "tomorrow", as the travel date, gets extracted.


3. Fill in the form: the intent is flight booking, so select the "book a flight" form to fill in; the form has three blanks, and the time slot gets "tomorrow".


(At this point, of the 3 required fields in the form, two are still missing: "departure" and "destination".)


4. A pre-written program starts to run: if the missing field is "departure", reply "Where are you departing from?"; if it is "destination", reply "Where are you going?" ("NLG" is in scare quotes here because this is not natural language generation in the true sense, just the application of a dialogue template.)


5. TTS: synthesize the reply text into speech and play it.

In the process above, steps 1 and 2 are where deep learning does the recognition. If a problem occurs at this stage, the errors cascade onward.

Loop through steps 1-5: as long as blanks remain on the form, keep asking the user until all required fields are filled. Then the form can be submitted to the teller (back-end processing).

The back end runs the query with the given conditions and returns the flights that meet them. The Agent then sends the query result back to the user, using the previously designed reply template.
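The loop just described can be sketched as follows (a minimal illustration of steps 2-4, with ASR and TTS omitted; the slot names, prompts and the stand-in "NLU" are all invented for the example):

```python
from typing import Optional

# Minimal slot-filling dialogue manager. The form is a dict of required
# slots; dialogue management is nothing more than asking about the first
# empty slot until none remain.
FORM = {"date": None, "departure": None, "destination": None}  # required slots
PROMPTS = {
    "date": "When are you flying?",
    "departure": "Where are you departing from?",
    "destination": "Where are you going?",
}

def nlu(utterance: str) -> dict:
    """Stand-in for the trained NLU: return whatever slots it can extract."""
    text = utterance.lower()
    slots = {}
    if "tomorrow" in text:
        slots["date"] = "tomorrow"
    for city in ("beijing", "shanghai"):
        if city in text:
            slots["destination"] = city
    return slots

def next_prompt(form: dict) -> Optional[str]:
    """Ask about the first unfilled slot; None means the form can be submitted."""
    for slot, value in form.items():
        if value is None:
            return PROMPTS[slot]  # template reply: the "NLG" in scare quotes
    return None  # all required fields filled: hand the form to the back end

form = dict(FORM)
form.update(nlu("look up flights to Beijing tomorrow"))
print(next_prompt(form))         # Where are you departing from?
form["departure"] = "Shanghai"   # suppose the user's answer fills this slot
print(next_prompt(form))         # None: form complete, query the back end
```

Note that "rounds of dialogue" here is simply the number of empty slots left, which is the point made below about round counting.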

Incidentally, we often hear people say, "Our multi-round dialogue can support N rounds; users can go up to N rounds with it." Now you know: in a task-oriented dialogue system, the number of rounds generated is determined by how many times the form has to be filled, so measuring product quality by "number of rounds" is completely meaningless for this kind of task-oriented dialogue.

If the metric must mean something, it should be: provided the goal is achieved without hurting the experience, the fewer the rounds, the better.

At present, anyone doing task-oriented multi-round dialogue basically cannot escape filling in forms.

In May, at Google I/O, the recorded demo of Duplex was released: Google Assistant called a restaurant on the user's behalf and negotiated a booking with the staff. It is worth noting that this was not a live demo.

Google's Assistant. Credit: Google

So how did Google's Assistant (hereafter IPA) know the user's specific needs? What cannot be avoided is that the user first has to fill out a form for Google Assistant, using dialogue to state the specific requirements, like this:

On the left, a user books a restaurant through Google Assistant. Real case from The Verge.


" Limitations of the current dialogue system "

I just spent two thousand words to explain the general idea of ​​the dialogue system. Next, point out the problem with this approach.

Remember the "no Japanese food" test mentioned earlier? Let's apply the same test kit to the "book a ticket" scenario. Try: "Check flights to Beijing tomorrow; anything other than China Eastern works." Then follow the steps:

1. ASR speech-to-text: no problem;

2. Semantic recognition: looks fine at first

– Intent: book a ticket, correct;

– Entity extraction: following the earlier training:

    – Time: tomorrow

    – Destination: Beijing

    – Starting point: the user didn't say; we'll have to ask him later...

Wait. He said "anything other than China Eastern". What is that? We never trained on airline-related expressions before.

No matter, we can add training for this expression: China Eastern = airline. Collect more expressions, so that whenever a user mentions any airline's name, it is trained to be extracted as an "airline" entity.

In addition, we can add an "airline" field to the form to be filled, like this (the yellow part):

(By the way, many teams doing to-B work fall into exactly this "we can always add it later" pit.)

However, after this perfectly natural bit of training, the entity extracted is "China Eastern", while what the user said was "other than China Eastern". So which airline should go into the form?

"Or how about some trick: pull out the logic of 'other than' and handle it manually?" If this problem could be handled so easily, do you think Siri and its kind would look the way they do now? The difficulty is not extracting "other than"; it is determining, after extraction, which entity the "other than" applies to.

At present, within "entity extraction", deep-learning-based NLU can do exactly that and no more: extract the entities.

A human, by contrast, understands that the user here means "any option excluding China Eastern". That is because, besides "entity extraction", a human also recognizes the logic carried by the surrounding context: "other than xx". The processing of that logic, which is reasoning, then runs automatically to work out what the other party really means (that is, to resolve the reference).
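To make the gap concrete, here is a toy sketch (the airline names, phrases, and functions are all invented for illustration). Plain entity extraction returns "China Eastern", the opposite of the user's intent; honoring the intent needs a separate, hand-coded scoping step, which is exactly the part that does not generalize:

```python
# Toy illustration: entity extraction alone vs. the missing logic step.
# Airline names and trigger phrases are invented for this sketch.

AIRLINES = {"China Eastern", "Air China", "China Southern"}

def extract_airline(utterance):
    """Naive entity extraction: return any airline name that appears."""
    return {a for a in AIRLINES if a in utterance}

def resolve_constraint(utterance):
    """The missing step: decide HOW the extracted entity applies.
    Detecting 'other than' is easy; knowing which entity it scopes
    over, and that it means exclusion, is the hard part."""
    entities = extract_airline(utterance)
    if "other than" in utterance or "except" in utterance:
        return AIRLINES - entities      # exclude the mentioned airline
    return entities or AIRLINES         # otherwise filter to it
```

The trick only works for the phrasings it was written for; every new logical expression ("except", "anything but", a plain negation) needs its own manual handling, which is precisely the pit described above.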

This process of logical reasoning exists nowhere in the steps designed earlier (1 through 5).

What is even more troublesome: logic affects not only "entities" but also "intents":

“hi Siri, don’t recommend restaurants” – it will still recommend restaurants for you;

“hi Siri, what else can you recommend besides recommending a restaurant?” – it will still recommend a restaurant for you. 

It is the same in Chinese and in English; and the same with Google Assistant.

Handling this requires more than identifying the "logic". The system must also correctly determine which entity the logic applies to, or whether it applies directly to an intent. How is that judgment made? With what? None of this is within the scope of current SLU.

If these problems are confined to some relatively closed scenarios, you can patch your way to solving most of them. But a fundamental, generalized approach, one treatment that hopes to solve the problem across all scenarios, does not exist to this day. In this respect Siri is like this, Google Assistant is like this; pick any one, it is like this.

Why is there no solution? Let's look at a test.

" Using the Turing test to measure dialogue systems is useless "

When it comes to testing artificial intelligence, most people's first reaction is the Turing test.

In May, around the Google I/O conference, our team was serving a Global 100 company, planning AI Agent-based services for them.

On the second day of the conference, I received a well-meant reminder from the customer's Tech Office: would Google's freshly revealed black technology overturn our existing technical solution? My answer was no.

The Google Duplex demo at the conference was genuinely impressive, and most people who watched it could not tell that it was not a human making the reservation call.

"This effect has, in a sense, passed the Turing test." 

Google's parent company stated that Google Duplex could be considered to have passed the Turing test.


Because the essence of the Turing test is deception (a game of deception; see Toby Walsh's paper for details), many people criticize it: it can only measure how well a machine deceives people, not intelligence. We will explain this further in Part 4, on the nature of dialogue.

The main reason people are fooled by this demo is that the synthesized speech sounds extremely real.

This is indeed Duplex's most impressive part: speech synthesis. I have to admit that its simulation of a human voice, including tone, pauses and fillers, is truly amazing. But speech synthesis, even pushed to the limit, is in essence a parrot; at most it can fool Alexa (so you see how important liveness detection is).

However, the dialogue system Google demonstrated cannot handle logical reasoning or reference resolution. Which means it cannot pass the Winograd Schema Challenge test.

Compared with the Turing test, this test aims straight at deep learning's weak point. When humans parse a sentence grammatically, they use real-world knowledge to understand what the words refer to. The goal of this test is to probe the common-sense reasoning ability that deep learning currently lacks.

If we use the Winograd Schema Challenge method to test the level of AI in the "restaurant recommendation" scenario, the topic would be something like this:

A. "Sichuan hotpot is better than Japanese food, because it is spicy"

B. "Sichuan hotpot is better than Japanese food, because it is not spicy"

The AI needs to be able to point out accurately that in sentence A, "it" refers to the Sichuan hotpot, while in sentence B, "it" refers to the Japanese food.

Remember the "no Japanese food" test mentioned in Part 1? I was really not trying to show off that "there are four ways to write the character hui". The essence of that test is whether the dialogue system can use simple logic to reason (about what is being referred to).

In the Winograd Schema Challenge, world knowledge (including common sense) is the basis on which the reasoning runs:

If the system does not know the relevant common sense (Sichuan hotpot is spicy; Japanese food is not spicy), it has no basis for reasoning at all, never mind performing the reasoning accurately.
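A toy sketch of why the reasoning is blocked (the "knowledge base" and the resolution rule below are invented purely for illustration). Without the missing fact, there is simply nothing to reason from:

```python
# Resolving "it" in "Sichuan hotpot is better than Japanese food,
# because it is (not) spicy" requires a fact that never appears in
# the dialogue itself. Toy knowledge base, purely illustrative.

COMMON_SENSE = {
    ("Sichuan hotpot", "spicy"): True,
    ("Japanese food", "spicy"): False,
}

def resolve_it(candidates, attribute, negated):
    """Return the candidate whose known attribute matches the clause."""
    target = not negated
    matches = [c for c in candidates
               if COMMON_SENSE.get((c, attribute)) == target]
    # Exactly one match -> a resolution; otherwise no basis to decide.
    return matches[0] if len(matches) == 1 else None
```

With the two facts present, "because it is spicy" resolves to the hotpot and "because it is not spicy" to the Japanese food; empty the knowledge base and both queries return nothing, which is exactly the situation the text describes.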

Some say we can solve this through context handling. Sorry: the common sense above never appears anywhere in the dialogue. If it is not in the "context", what is there to process?

For a detailed explanation of this part, see the next chapter (Part 3: the essence of dialogue).

Reference problems and logic problems already seem deadly enough for applications; but they are only some of deep learning's limitations.

Going even further: if some day an AI scores 100% on the Winograd Schema Challenge, we still cannot expect it to behave like a human in natural language processing, because more serious, more fundamental problems are waiting behind it.

" The bigger challenge for dialogue systems is not NLU "

Let's see where the problem is.

Now we know that when a person talks to today's AI, the AI "recognizes" what you say by using deep learning to classify your natural language into preset intents and to pick out the entities in the text.

And whether the AI answers you or asks you something depends basically on the forms in the "dialogue management" system behind it, and on which required fields are still unfilled. What to ask, and how to ask it, is done manually by the product manager and the programmers.

So who made this form?

Put differently: who decides what should be considered in the matter of "booking a ticket"? What information to collect? What questions to ask? How does the machine know all this?

Humans do. The product manager, to be precise.

Just as in the "book a ticket" case just now: when the user brought up "airline", a concept not designed into the form, the AI could not handle it.

For the AI to handle such new conditions, the "airline" column (the yellow part) has to be added to the "book a ticket" form. And this process is manual: the product manager designs it, then the engineers program the form accordingly.

So the AI does not truly, by studying cases, automatically understand what "booking a ticket" is and what factors it involves. As long as the form is designed and programmed by people, then at the product level, the moment the user says anything slightly outside the form, the "artificial mental retardation" naturally shows up.

So when Google Duplex appeared, I did not much care how human-like its pronunciation and pauses were. In fact, whenever I look at any dialogue system, I care about only one question:

"Who designed the form: a human, or the AI?"

In a dialogue system, all deep learning can do is recognize the sentence the user spoke, strictly according to manual training (supervised learning). As for everything else, such as what to say, or when to speak, it can do nothing.

But the way real people converse is nothing like the dialogue system described above; the two are a hundred thousand miles apart. How does human dialogue actually work? Where does the difference lie? Why is it so large? The thing deep learning can hardly obtain: how do humans obtain it? After all, on this planet we have 7 billion perfect natural language processing systems: ourselves.

We must understand a problem before we can solve it. In the field of dialogue, that means knowing what the nature of human conversation is. The next chapter is rather brain-burning: we will discuss how "thinking" drives human dialogue.


Part 3: The essence of human dialogue: thinking

" The ultimate goal of dialogue is to synchronize thinking "

You are our protagonist, a 30-year-old office worker. Every morning at 9:30 you pass through the revolving door of the office building, cross the lobby, badge into the elevator, and ride up to the 28th floor, where your office is. Today is January 6th, a bland day. You have just stepped into the elevator, the only person inside. Just as the doors are about to close, a man rushes in.

It is a courier. Entering the elevator and seeing only the two of you, he says "hello", then looks down at the floor buttons.

You naturally responded: "Hello," and then turned to one side.

Neither side has anything more to say. In fact, both parties to the dialogue judge that, in this situation, there is nothing that needs to be synchronized.

People use language to converse, and the ultimate goal is to keep both sides' models of the current situation in sync. (This understanding of the concept is enough for now; for more, see Toward a neural basis of interactive alignment in conversation.)

The interactive-alignment model (based on Pickering and Garrod, 2004)


In the picture above, all the dialogue between A and B unfolds in order to keep the two "situation models" in the red boxes in sync. The situation model can be loosely understood here as an understanding of all aspects of the current event, including the Context.

Many people who build dialogue systems think Context refers only to "the context within the conversation". What I want to point out is that, beyond that, Context should also include the scene in which the conversation takes place. This scene model covers all information perceived at the moment of conversation besides the plain text. For example, the weather at the time of the conversation, once perceived, goes into the Context and shapes how the dialogue develops.

A: "What do you think about this thing?" 

B: "Looks like rain today, let's go inside and talk" (even though the conversation never mentioned the weather).

For the same event, different people build different scene models in their minds. (For more, see Situation models in language comprehension and memory. Zwaan, R. A., & Radvansky, G. A. (1998).)

So: if the person rushing into the elevator is your project boss, and suppose what he and you (mostly he) care about is the recent progress of the new project, then there is plenty of dialogue you will need to make.

In the elevator you greet him: "Morning, Mr. Zhang!" He replies, "Morning. Right, about yesterday..."

Before he even finishes, excellent as you are, you can already guess that what Mr. Zhang wants to talk about is the new project. That is because you judge that Mr. Zhang's understanding of this "new project" differs from yours and needs to be synchronized. You can even reason, from the fact that he was out of the office yesterday and probably missed part of the project's progress, that you should reply directly to his specific concerns about it.

"You weren't here yesterday. Don't worry, the client has been handled. The payment is sorted too; it'll be done within 30 days." You see, you don't even wait for Mr. Zhang to finish his question before replying right to the point, thanks to your judgment of his model being correct.

Once you misjudge the other party's scene model, you may completely miss the point.

"I know. I came back to the company last night; Xiao Li told me. What I want to say is: when I came back to the office last night, why weren't you working overtime? Xiao Wang, you can't go on like this..."

So in conversation, people do not rely solely on the other party's previous sentence (the information carried by the plain text of the dialogue) to decide what to reply. This is utterly different from the response mechanism of current dialogue systems.

" Dialogue is the projection of thought from high dimension to low dimension "

Suppose that in another parallel universe, you are still in that office building.

Today is still January 6th, but two years ago today, you broke up with the girlfriend you had been with for five years. Since then you have never gotten over her, and you have not met anyone new.

As usual, you enter the elevator, and just as the doors close, someone rushes in and the closing doors open again. It is the ex-girlfriend you broke up with two years ago. Entering and seeing only the two of you, she glances up at you, then looks down at the elevator buttons and says, "Hello."

Isn't your head suddenly flooded with information? What should you answer? Is the feeling something like "not knowing where to begin"?

This feeling comes from the fact that (you believe) the scene models between you and her have diverged too far (two years since the breakup); you cannot even judge what information is missing. There is far too much information to synchronize, and you are trapped by the poverty of language.

In terms of information richness, language is barren and thought is rich: "Language is sketchy, thought is rich" (New perspectives on language and thought, Lila Gleitman, The Oxford Handbook of Thinking and Reasoning; for more related discussion see Fisher & Gleitman, 2002; Papafragou, 2007).

Someone once offered a metaphor: compared with the richness of thought, language is the tip of the iceberg. I think it is much more extreme than that: dialogue is the low-dimensional projection of thought.

If it were an iceberg, you could at least infer the underwater part from what shows above the surface: same dimension, just a different amount. The problem with language is that, from the text you hear alone, reconstructing the thought of the speaker involves severe distortion.

To make this dimensional difference easier to grasp, take 3D versus 2D as an example: thought is high-dimensional (a solid 3D shape), while dialogue is low-dimensional (its 2D shadow on a plane). Trying to infer, from the shape of the shadow on the plane, what object hangs above it is very hard: two shadows can have exactly the same shape while the 3D objects above them are completely different.

In language, the shadow is like the two "hello"s: literally identical, while the thoughts behind them are completely different. At the moment of that encounter, the difference is enormous:

You are thinking (the cylinder): I haven't seen her for over a year. Is she doing okay?

Your ex-girlfriend is thinking (the sphere): this person looks familiar, as if I know him...

" Challenge: Express high dimensionality with low dimensionality "

How hard is it to describe thought in words? It is like trying to restore a scene for a friend who was not there: how much can you actually recover?

Try describing in words how you spent this morning.

However completely you describe it in words, I will always be able to find some thing, some specific detail, that lies outside your verbal description yet genuinely existed in the time and space of your morning.


For example, you might mention to the friend that you had a bowl of noodles for breakfast; but you will not specifically describe what spices were in it. These details (information) are lost in transmission, and when the listener hears "a bowl of noodles", the image in their mind is certainly not the very bowl of noodles you ate this morning.

This is like asking someone to reconstruct the 3D shape from its 2D shadow on the plane. All you can do is multiply the angles of description, providing the listener with as many different 2D materials as possible from which to restore the 3D.

To explain the relationship between "language" and "thought" in the brain (that is, to synchronize scene models with you, the reader), I drew the comparison figure above to help convey the message. If instead I had to describe it accurately in words, keeping the information intact, I would need far more text to cover the details (the description above still has not mentioned the shadows' exact sizes, colors, areas, and so on).

And that was only a description of objective things. When people try to describe a more emotional, subjective feeling, concrete words are even more inadequate.

For example, when you watch a little girl like Angelina Jordan sing a song like I Put a Spell on You, try using language to describe your subjective feeling accurately. Hard, isn't it? You might manage something like "it gave me goosebumps". How much of what is in your brain do those words capture? 1%?

I hope that by now you understand more about "language is barren, thought is rich".

So: since language loses so much information in transmission, why do people seem to understand each other without too much trouble?

" Why are people's conversations easy?"

Suppose there were a way to transmit the feeling in your brain at this moment to another person with zero distortion. How much richer would that transmission be than the text above?

Unfortunately, we have no such tool. Our main communication tool is language, and we rely on dialogue to try to let the other person understand our situation.

Then, given that language is so imprecise, so full of logical holes, and carries so little information, how do people manage to understand each other, and even to build an entire civilization on top of it?

For example, in a restaurant, when the waiter says "the ham sandwich wants the check", we all know it means the same as "table 20 wants the check" (Nunberg, 1978). Why can expressions this far apart still convey information effectively?

That people can understand language effectively through dialogue relies on the ability to interpret; more specifically, on the consensus shared by both parties and on the ability to reason from that consensus.

When a person receives the low-dimensional language, they combine common sense with their own world model (more on this below) to reconstruct the model of thought that the language corresponds to. This is not a new idea: Kai-Fu Lee, whom everyone knows, already said in an interview back in 1991, when Apple was working on speech recognition, "Humans use common sense to help understand speech."

When both parties believe their understanding of something is the same, or very close, they do not need to talk about it. What needs to be communicated is the part where they differ.

When you hear the word "apple", the model you built in the past is invoked, with all its dimensions: green or red, sweet taste, roughly fist-sized, and so on. If you then hear someone say "blue apple", this conflicts (in color) with the apple model you built in the past; thinking raises a flag prompting you to synchronize or update the model: "Why is the apple blue?"

Remember the reference-resolution test we mentioned in Part 2, the Winograd Schema Challenge? The test's name comes from an example by Terry Winograd:

"The city councilmen refused the demonstrators a permit because they [feared/advocated] violence."


When [feared] appears in the sentence, "they" refers to the councilmen; when [advocated] appears, "they" refers to the demonstrators.


1. People can judge case by case because they reason from common sense: "councilmen fear violence; demonstrators advocate violence."


2. The speaker assumes this common sense is a consensus shared with the audience, and simply omits it.

In the same way, the common sense mentioned in Part 2 ("Sichuan hotpot is spicy; Japanese food is not") was also omitted from the expression. The total amount of such common sense (which is usually the consensus of most people) is uncountable, and in general it keeps growing as human society develops.

Example 1: if your world model already contains the "Hua Nong Brothers" (you have watched and understood their stories), you will find that my first example in Part 2 hides an in-joke (being made into a dish). But because the Hua Nong Brothers are not common sense known to most people, only a consensus within a specific group, when you see that sentence you receive more information than others do. People who do not get the joke receive no extra information from it; they just find the expression a bit odd.

Example 2: friends in the venture capital circle have surely heard of the elevator pitch: in 30 seconds, make clear what you want to do. The usual cases run "We are the Uber of the restaurant industry" or "We are the Airbnb for offices". The typical structure is "the XX of YY". For such a sentence to work, the premise is that the concepts XX and YY were already in the listener's model before the conversation took place. If I tell someone I am "the McKinsey of the conversational AI industry", then for them to understand, they must know both what conversational AI is and what McKinsey is.

" Reasoning based on world model "

The scene model is tied to a particular conversation: different conversations, different scene models. The world model is tied to a person, and is relatively long-term.

Perception of the world, including sensory feedback such as sound, sight, smell and touch, helps a person build a physical understanding of the world. Understanding of common sense, including the perception of various phenomena and laws, helps a person form a more complete model: the world model.

Setting aside accuracy or correctness, no two people's world models are exactly the same. Perhaps the information they observed differs; perhaps their reasoning abilities differ. The world model shapes a person's thinking itself, which in turn shapes thinking's low-dimensional projection: dialogue.

Let's start with an example. Suppose we now set out, together, to build a less stupid assistant. We want this assistant to recommend restaurants and bars, meeting the following need:

When the user says "I want something to drink", how should the system answer? After Part 2, I believe everyone knows: we can train an intent, "find a place for a drink", then retrieve the surrounding shops and reply with: "I found these choices near you."

Congratulations, we have reached the level of Siri!

However, we said at the start that we want an assistant that is not so mentally challenged. Should this "place for a drink" be a bubble tea shop or a coffee shop? Or do we just dump everything on the user?

Well, that involves reasoning. Let's simulate one manually. Suppose we have the user's profile data and use it like this: if the favorite drink in his preferences is coffee, give him coffee shops.

That way we can reply to him more "personally": "I found these coffee shops near you."

At this point, our AI has achieved the personalization concept so many "intelligent systems" love to advertise: "a thousand faces for a thousand users"!

Then let's see how stupid this concept is.

If a person likes coffee, must he drink coffee at every moment of his life? How does a human handle this? If the user asks at 1 p.m., replying as above is fine. But what if it is 11 p.m.? Do we still recommend a coffee shop, or should it be a bar?

Or beyond that: if today is his birthday, shouldn't we give him something different? If today is Christmas, shouldn't we recommend hot chocolate?

You see, time is a dimension, and different values along this dimension change what is right for the user.

The difference between time and user profile is:

1. The time dimension contains infinitely many values;

2. Every point on it is distinct: birthdays fall on the same date each year, yet no particular birthday ever repeats;

In addition to the time dimension, there is space.

Now superimpose the dimension of space onto time. You will find that if the user asks this question at home on a weekend (maybe he wants bubble tea delivered?), versus in the office during working hours (maybe he wants to step out for a change of scene), the reply we give him should again be different.

With time and space alone we have two dimensions of infinite combinations; the "if then" logic can no longer be written by hand. The tools we use today to build bots struggle hopelessly with this demand.
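A hypothetical sketch of where the hand-written "if then" road leads (every rule, threshold, and value below is invented for illustration). Each new context dimension multiplies the branches, and the list never closes:

```python
# Hand-writing context rules for "I want something to drink".
# Two dimensions are already unmanageable; real context has many more
# (relationships, history, weather, locality...). All values invented.

def recommend(hour, place, is_birthday=False, is_christmas=False):
    if is_christmas:
        return "hot chocolate"
    if is_birthday:
        return "something special"
    if hour >= 22:                      # late night overrides preference
        return "bar"
    if place == "home" and hour >= 18:
        return "bubble tea delivery"
    if place == "office":
        return "coffee shop"
    return "coffee shop"                # fall back to profile preference
    # ...and we have not touched weather, mood, companions, budget,
    # travel status: the rule list grows without bound.
```

Every newly observed situation demands another branch written by a human, which is exactly the "who designed the form" problem from Part 2 in a different costume.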

Moreover, time and space are just the two most obvious dimensions of the world model. There are more, more abstract dimensions that exist and directly influence the conversation with the user: relationships between people; personal experience; the weather; a person's relationship to the place (frequent business traveler, local native, first-time tourist), and so on. Shall we stop here: does this still look like a chat system, or a bit like a recommendation system?

Worse still, to do better, the factors along all these dimensions must be superimposed for causal reasoning before a result can be given to the user.

At this point, the information that shapes human dialogue, even before reasoning, includes at least these three parts: plain text (with its context) + scene model (Context) + world model.

An ordinary person does this work effortlessly. But deep learning can only process information based on plain text. Perceiving and generating scene models and world models, and reasoning over those models: there, deep learning is powerless.

This is also the essential reason why today's deep learning, however much data it is fed, cannot achieve true intelligence (AGI): it cannot do causal reasoning.

The effect of world-model-based reasoning shows up not only in dialogue; it applies to every project now labeled AI, autonomous driving for example.

Even a heavily trained self-driving car lacks training material for unexpected situations. A stroller that suddenly appears on the road and a trash can that suddenly rolls across it will both be treated as obstacles; but when the car cannot stop in time, which one must it hit?

Another example: for Douglas Hofstadter, "driving" includes choosing whether to speed when you are in a hurry to get somewhere, and whether to exit to escape a traffic jam or crawl along with the flow on the highway... These decisions are all part of driving. He says: "Everything in the world is part of the essence of 'driving'."

" The human brain has two systems: System 1 and System 2 "

For more on "System 1 and System 2", read Thinking, Fast and Slow by Daniel Kahneman, a very good book that analyzes in depth how human cognition unfolds. Here I will introduce the idea briefly for friends who do not know it yet, to support the points before and after in this article.

Psychologists hold that human thinking and cognitive work are divided between two systems:

  • System 1 is fast thinking: unconscious, quick, low mental effort, no reasoning
  • System 2 is slow thinking: it requires mobilizing attention, runs slower, takes effort, and involves reasoning
  • System 1 responds first; when it meets something uncertain, System 2 steps in to resolve it

Things System 1 does include: judging the distance between two objects, locating the source of a sound, filling in blanks ("I love Beijing Tian'an___"), and so on.

Incidentally, in chess, seeing at a glance that a move is good is also System 1 at work, provided you are a strong player.

Ask a Chinese student out of the blue, "what is 7 times 7?", and he will answer without thinking: "49!" That is System 1 at work: we all memorized the nine-nines multiplication table in primary school. The 49 comes not from calculation but from recitation (sheer repetition).

Correspondingly, if you ask "how much is 3287 x 2234?", the person now has to invoke the multiplication rules in their world model and apply them (calculate). That is System 2's work.
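As a loose programming analogy (not a model of cognition, just an aid to the contrast above), System 1 looks like a memorized lookup table and System 2 like an actual computation:

```python
# System 1 as memorized lookup, System 2 as rule application.
# A loose programming analogy, not a cognitive model.

MULTIPLICATION_TABLE = {(a, b): a * b          # the drilled 9x9 table
                        for a in range(1, 10)
                        for b in range(1, 10)}

def answer(a, b):
    if (a, b) in MULTIPLICATION_TABLE:
        return MULTIPLICATION_TABLE[(a, b)]    # System 1: instant recall
    # System 2: fall back to applying the rules of multiplication
    result = 0
    for _ in range(b):
        result += a                            # repeated addition
    return result
```

7 x 7 comes back from recall; 3287 x 2234 has to be worked out by applying the rules, which is slower and effortful, just as in the text.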

Also, in the world as System 1 has it set up, cats do not bark like dogs. When something violates System 1's world model, System 2 is likewise activated.

Where language is concerned, Yoshua Bengio believes System 1 does no language-related work; System 2 is responsible for language. Deep learning is better suited to System 1's kind of work; in fact, it has no System 2 functionality at all.

Regarding these two systems, it is worth mentioning that people can, through training, convert some things System 2 does into System 1 work. For example, Chinese students master the multiplication table through a "painful memorization process", rather than learning it slowly through natural lived experience.

But this System 2 to System 1 conversion has some interesting properties:

1. Once a problem becomes something System 1 handles, it saves energy. People tend to trust their own experience because thinking hard consumes a lot of energy, and relying on experience is the energy-saving practice.


2. Once it becomes System 1, dialectical power is sacrificed, because System 1 knows nothing about logic. "I have been doing it this way for decades": that kind of experience-driven thinking is the typical case.

Think about it: how are the cases you have accumulated over the years shaping your own judgment?

" Deep learning cannot learn language, not now, not in the future "

In the artificial intelligence industry, you often hear people say: "Although current technology cannot achieve ideal artificial intelligence, technology keeps evolving. As data accumulates, human-level artificial intelligence will eventually be realized."

If this statement means that deep learning alone, by endlessly accumulating data, will eventually turn the tables, that is a big mistake.

No matter how you optimize the core technology of the "carriage" (stronger horses, more horses), you cannot build a car with it (pictured).

For the public, "technology keeps evolving" is a macro-level statement about the relationship between humans and technology. But the engine did not evolve out of the carriage's key technology.

The "big three" of deep learning themselves believe that this path, deep learning alone, cannot lead to AGI. Interested readers can dig in this direction:

  • Yoshua Bengio's view: "If you have a good causal model of the world you deal with every day, you can abstract even unfamiliar situations. This is crucial... Machines cannot, because machines do not have these causal models. We can hand-craft these models, but that is far from enough. We need machines that can discover causal models."
  • Yann LeCun's view: "A learning predictive world model is what we're missing today, and in my opinion is the biggest obstacle to significant progress in AI."

As for the role deep learning will play in real intelligence in the future, I will quote Gary Marcus here: "I don't think that deep learning won't play a role in natural understanding, only that deep learning can't succeed on its own."

" Explaining the "artificial mental retardation" of current products "

Now we understand: the essence of human dialogue is the exchange of thoughts, far more than mere text recognition and recognition-based replies. Current artificial intelligence products cannot achieve this effect at all. So when a user brings the human world model and human reasoning ability into a natural-language interaction with a machine, the flaws show immediately.

  • Sophia is a technical scam (anyone who claims Sophia is true AI either does not understand it or is bluffing);
  • There is no real intelligence in current AI (reasoning ability simply does not exist, AlphaGo included);
  • As long as deep learning remains the mainstream, there is no need to worry about AI ruling humans;
  • Dialogue products feel mentally retarded because they try to skip thinking and simulate dialogue directly (and, for now, can only do that);
  • "The more you use it, the more data it gets; the more data, the stronger the intelligence; the stronger, the better the product; the better, the more you use it." For a task-oriented conversational product, this sounds cool but is actually unreliable;
  • For an AI agent, "how many turns of dialogue it can sustain" is a meaningless metric;
  • To-C assistant products do not work well because no one has solved "how to obtain the user's world-model data and put it to use";
  • Why do to-B dialogue intelligence companies struggle to scale? (Because the scene models are built by hand.)
  • Intelligence first, then language: natural-language dialogue in the true sense requires at least reasoning based on common sense and a world model. If that is ever achieved, then, as human beings, we may really need to start worrying about the intelligence mentioned above.
  • Do not use NLP to evaluate a conversational intelligence product. At year's end, some media start ranking AI companies, and many file these companies under "NLP". That is like rating a smartphone by its touch screen. I am not saying the touch screen, or NLP, is unimportant. On the contrary, precisely because it is so important it has become standard equipment for everyone, so the leading players have all basically maxed it out, differing by perhaps 1%.
  • For a conversational product, NLU, while important, should account for only about 5-10% of the whole. Moreover, even when the intent recognition and entity extraction come from a big vendor, the differences between products there are far smaller than in dialogue management. What really determines the product is the remaining 90% of the system.

At this point, do you feel a touch of despair? These academic and industry heavyweights have no solution, not even a confident hypothesis. Is building products like dialogue intelligence a dead end? Is this the ceiling?

No. A single technology may indeed hit its ceiling; but an application or a product is determined not by one technology but by a combination of many, and there is still plenty of room there.

As a product manager, let me switch perspectives: given that these are the tools in our hands, what can we build with them?


Part 4: The potential of AI products lies in design

" Let AI be AI, and products be products "

The Prestige (2006), still


There is a movie I like very much, The Prestige, about a magic trick of "instant teleportation": to the audience, the magician disappears from one spot and instantly reappears in another.

The first magician achieved this effect on stage. He walked through a door on the right side of the stage and, the moment he entered, came out of a door on the left side. To the audience, this matched their expectations perfectly.

The second magician, watching from the audience, was stunned; he could find no flaw at all. But he is a magician, and, like a product manager, he wants to figure out how this product is implemented. Yet in the magic business, the greatest taboo is revealing the secret.

At the end of the film he gets the answer (spoiler alert): all the engineering, the mechanisms and lifts, was indeed hidden under the stage, just as he had expected. But the real core is that the first magician had been hiding a twin brother all along. When one of them opened a door and dropped through a trapdoor beneath the stage, the other twin immediately rose onto the stage from the opposite side.

Seeing this, everyone suddenly gets it: "So that's how it works. Twins!"

Doesn't this feel familiar? In Part 2 of this article, we opened the black box of the dialogue system and found it was filling in a form. A similar feeling, no? A conversational AI product (a dialogue system) is like a magic trick: it is a black box, and the user judges its value by perception.

"I thought there was some black technology inside. If it's just twins, I could do that too."

Actually, it is not that easy. Set aside the engineering under the stage; the hardest part of this trick is making the other twin disappear completely from public view for the magician's entire life. If the audience knew the magician was a twin, they would very likely guess that the trick on stage was performed by two people. So this twin must never appear in the public's "world model".

To make the other twin vanish from public view, the two brothers paid a heavy price, mentally and physically, one the average person could not accept, such as sharing the same wife.

This is my suggestion: when the technology is not enough, design makes up the difference. If you are building AI products, do not expect intelligence to be handed to you; if real intelligence existed, what would they still need you for? An AI product manager needs to design a large system that includes the standard form-filling practice, of course the intent recognition and entity extraction that deep learning provides, and all kinds of dialogue management, context handling, logical reference resolution, and so on.

These parts are where product design and engineering power have room to work.

" The premise of the design approach "

Let me emphasize: here we are discussing how to think about AI products, not how to realize AI.

In designing conversational products on top of today's deep learning, semantic understanding should account for only 5-10% of the whole product. For everything else, do whatever it takes to simulate the effect of "teleportation"; after all, we all know this is a magic trick. If mere recognition eats up most of your product effort, you will not stand apart from the rest, and the result will basically be mentally retarded too.

In terms of product development, if the R&D team can provide tools that mix multiple technologies, it will greatly expand the development team's design space. This practice is the combination of DL (Deep Learning) + GOFAI (Good Old-Fashioned AI). GOFAI, a term first proposed by John Haugeland, refers to the symbolic AI from before the deep-learning boom: expert systems, the "if... then..." rules that most people in the AI field look down on.

This DL+GOFAI premise is the foundation of all the product-design ideas that follow.
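As a sketch of what the DL + GOFAI split can mean in practice (all names, intents, and rules here are invented for illustration, not a real framework): a learned component handles the fuzzy recognition, while hand-written if/then rules, the GOFAI part, drive everything downstream.

```python
# Illustrative DL + GOFAI split. The "DL" part is stubbed out with
# keyword matching; in a real system it would be a trained classifier.

def recognize_intent(utterance: str) -> str:
    """Stand-in for a deep-learning intent classifier (the ~10% 'DL' part)."""
    text = utterance.lower()
    if "flight" in text or "fly" in text:
        return "book_flight"
    if "weather" in text:
        return "get_weather"
    return "unknown"

def dialog_policy(intent: str, state: dict) -> str:
    """The 'GOFAI' part: explicit if/then rules over the dialogue state."""
    if intent == "book_flight":
        if "destination" not in state:
            return "Where would you like to fly to?"
        if "date" not in state:
            return f"When do you want to fly to {state['destination']}?"
        return "Booking your flight now."
    if intent == "get_weather":
        return "It is sunny today."
    return "Sorry, I can only help with flights and weather."

state = {"destination": "London"}
print(dialog_policy(recognize_intent("I need a flight"), state))
```

The design choice this sketch makes explicit: the learned component is small and replaceable, while the hand-written policy is where the product behavior actually lives.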

" Design principle: to be is to be perceived "

"To be is to be perceived" is the famous dictum of the 18th-century philosopher George Berkeley. The University of California, Berkeley is named in memory of this idealist master. The idea: if something cannot be perceived, it simply does not exist!

I regard "to be is to be perceived" as the design principle of conversational AI products: the intelligence behind a conversational product exists only insofar as the user perceives it. Until the day AI can replace product managers, all design should revolve around making users feel that the AI they are talking to is valuable to them, and therefore smart.

Be very clear about your purpose: you are designing an AI product, not AGI itself. Like the designer of a magic trick, given limited basic technical conditions, you assemble a product whose experience others can hardly imagine.

At the same time, deeply understand the product's limits. Magic is magic, not reality.

That means the trick on stage collapses if certain key conditions change. If the audience could climb above the stage and look straight down, they would see the trapdoors. Or if the "teleportation" were attempted not by one of the twins but by a spectator who ran up and said "let me try", the trick would be exposed.

Narrow AI products are the same. However good the experience inside the Domain you design, the moment the user steps outside its boundary, it collapses. So first set the product boundary, design the "feedback when the user crosses the border", and then, within the domain, simulate the magic effect as well as you possibly can.

Assuming the Domain boundary has been clearly set, where can design and engineering power add the most?

In fact, the thinking-related parts of "Part 3: The essence of dialogue" can all serve as starting points for design under the premise of a limited domain: you can use GOFAI to simulate the world model, simulate the scene model, fake the logical reasoning, fake the context handling, as long as they all pass muster inside the Domain.
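The "set the boundary first, then fake the magic inside it" idea might look like this in code. Everything here is a hypothetical illustration: a hand-built (GOFAI) stand-in for a world model that is only valid in-domain, plus explicit, designed feedback when the user crosses the border.

```python
# Illustrative domain-bounded product. The topics and canned knowledge
# are invented; the point is the boundary check, not the content.

IN_DOMAIN_TOPICS = {"parking", "charging", "booking"}   # the product's Domain

# A hand-crafted (GOFAI) stand-in for a world model, valid only in-domain.
FAKE_WORLD_MODEL = {
    "parking": "The garage on Level 2 usually has free spots after 7 pm.",
    "charging": "Chargers 3 and 4 support fast charging.",
    "booking": "You can book a spot up to 48 hours in advance.",
}

def reply(topic: str) -> str:
    if topic in IN_DOMAIN_TOPICS:
        return FAKE_WORLD_MODEL[topic]            # simulate the magic
    # Designed boundary feedback: keep user expectations inside the Domain
    # instead of letting the illusion collapse silently.
    return ("I can only help with parking, charging, and booking. "
            "For other questions, let me hand you to a human.")

print(reply("charging"))
print(reply("politics"))   # out of domain, triggers the designed fallback
```

Inside the boundary the product looks smart; outside it, the fallback is the product's honest edge rather than a crash.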

" Choosing the right Domain "

The cost (the amount of engineering and design) and the value to the user are not always proportional, and the ratio differs from Domain to Domain.

For example, I think all of today's chit-chat bots have little value. With an open Domain there are no goals, no limits, no boundaries: the user can bring up anything that comes to mind. Yet the bot's own "scene model" is blank, and it knows none of the common sense the user knows. The result is that the user hits a wall almost every time they try. I call this user experience "frustration on nearly every attempt".

That said, some Domains place little weight on the content of the reply, and therefore do not need a strong scene model or reasoning mechanism to generate it.

Suppose we build a "tree-hole robot": a product defined as a good listener, letting users confide the pressures weighing on them.

Human Counseling. Source: Bradley University Online


The boundary of this product must be very specific, and must harden into the user's scene model at first contact. The system mainly encourages the user to keep talking through short verbal feedback, and should not encourage the user to expect the dialogue system to produce much correct, valuable speech of its own. After the user makes a statement, the bot can follow up with generic, low-information prompts from the "scene model":

"I hadn't thought about it that way before. Why do you see it like that?"

"What do you know about this person?"

"Why do you think he is like this?"

……

In this way, the product's dependence on "natural language generation" drops dramatically, because the value of this product does not hinge on whether the specific content of each reply is precise or valuable. This in turn reduces the need for high-dimensional modules such as the "scene model", "world model", and "common-sense reasoning". The training material is counselors' conversation records in a particular sub-domain (the workplace, family, etc.). In product terms, this is a Companion product; it does not truly provide therapy.
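A toy sketch of the tree-hole listener described above (purely illustrative, not a real product design): the bot never tries to generate valuable content of its own; it only keeps the user talking with generic, low-commitment prompts.

```python
# Illustrative "tree-hole" listener: reply content is deliberately generic,
# so almost no language generation or reasoning is needed.
import random

GENERIC_PROMPTS = [
    "I hadn't thought about it that way. Why do you see it like that?",
    "What do you know about this person?",
    "Why do you think he acts this way?",
    "How did that make you feel?",
]

def listener_reply(user_utterance: str, rng: random.Random) -> str:
    """Pick a generic follow-up; the user's actual words are barely used."""
    if not user_utterance.strip():
        return "I'm here. Take your time."
    return rng.choice(GENERIC_PROMPTS)

rng = random.Random(0)
print(listener_reply("My manager criticized me in front of everyone.", rng))
```

Note how thin the "intelligence" is: the product's value comes from the interaction pattern, not from understanding, which is exactly why this Domain tolerates it.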

Of course, the above is not a real product design, just an example to illustrate the point: different Domains make different demands on the language interaction, and therefore on the "thinking ability" behind it. When choosing a product's Domain, stay away from scenarios where sustaining the conversation depends heavily on world models and common-sense reasoning.

Some may ask: isn't this exactly what Sophia does? No. The core issue with Sophia, it must be stressed, is deception: its makers want to fool the public into believing they have really built intelligence.

What I advocate here is telling the user plainly that this is a dialogue system, not pretending to have created real intelligence. This is also why, in my own product designs, when both humans and AI serve the user (a Hybrid Model, in product terms), we always prefer to let the user know which is which: when a human is serving, and when the robot is. The upside is that this controls the user's expectations and keeps them from wandering outside the designed domain; the downside is that it may "sound" less cool.

So when I say "to be is to be perceived", the emphasis is on the perception of value, not the perception of being "human-like".

" The core value of conversational intelligence: the content, not the interaction "

Years ago, while studying in the UK, I worked at a very famous private members' club with a long history. I was deeply impressed by the head butler who looked after the members' needs. Imagine her as a super concierge, like the American Express Black Card service. She had two superpowers:

1. Resourceful: members' most outlandish requests could all be met. One member in Frankfurt ran into an emergency in the middle of the night and suddenly needed to get back to London as soon as possible. With no flights at that hour, he called the butler for help. In the end she found another member's friend who lent a private jet, gave him a lift, and had him back in London in the early hours.

2. Mind-reading: she knew what a member wanted without being told:

"Oliver, I'd like something to drink..."

"Of course, no problem. I'll send it up shortly." She did not need to ask what to drink, or where to send it.

Everyone wants a butler like this. Batman needs Alfred; Iron Man needs Jarvis; Theodore needs Her (though that one went off the rails); the iPhone needs Siri. This brings us back to Part 1: AI's ultimate to-C product is the assistant.

But is the root of our need for such an assistant its ability to converse? There are already 7 billion natural-language dialogue systems in the world (namely, people). Why would we need to create more dialogue systems?

What we need is the thinking ability behind the dialogue system, the ability to solve problems. Dialogue is just the Conversational User Interface. If the assistant is smart enough to solve the problem before being asked, the user may not even need to say anything.

Let's look at an example. 

I know many product managers have long since memorized everything about the first-generation iPhone launch. Still, it makes an excellent example here: why did the iPhone replace the physical keyboard with a virtual one?

From the most intuitive angle, an ordinary user concludes: the screen gets bigger! The keyboard appears when needed and disappears when not. It also simplifies a seemingly complicated product and looks better. Many product managers think so too. In fact, this was never a hardware-design question at all. The real reason is shown in the figure below.

Jobs actually put it very clearly at the time: the core problem of the physical keyboard is that, as an interactive UI, it cannot change. Physical interaction (the keyboard) cannot vary with the software.

If you want the phone to carry all kinds of content, to host all kinds of software ecosystems, each piece of software with its own UI, yet the interaction stays pinned to one fixed form (the unchangeable physical keyboard), it will not work.

So what actually replaced the physical keyboard was not the virtual keyboard but the entire touch screen. Because the iPhone would later carry a rich ecosystem of software and content, it had to have an interaction method compatible with ideas that had not yet been invented.

In my view, all of the above serves rich content and services. Once again: the interaction itself is not the core; the content behind it is.

But when I first watched that keynote, I honestly did not get this point. At the time it was hard to imagine the entire mobile-Internet era to come, with countless apps each carrying its own UI for all kinds of services.

Think about it: what would it be like to operate Dianping, maps, Instagram, or any other app you know on a physical keyboard? More likely, as long as that remained the interaction method, those apps would never have been designed at all.

This also raises the reverse question: if a device carries no rich software and content ecosystem, should its physical controls be made touch-based and virtual? Should an excavator, say, be operated by touch screen? Or by a conversational interface?

" Dialogue intelligence solves repetitive thinking "

Similarly, the core value of a conversational intelligence product should lie in its ability to solve problems, not remain at the surface of the interaction. So how is this "content", this problem-solving ability, reflected?

The great value the Industrial Revolution brought humanity was the elimination of "repetitive manual labor".

According to economist Tyler Cowen, the more people an industry employs, the greater the business value created by disrupting that work. He describes it in his book Average Is Over:

"At the beginning of the 20th century, the largest occupation in the United States was farming. With post-war industrialization, the growth of the tertiary sector, and the women's liberation movement, the most labor-intensive jobs became auxiliary clerical work: secretaries, assistants, call centers, clerks, data entry. With personal computers in the 1980s and 90s and the spread of Office, a large share of those secretarial and assistant jobs disappeared."

The jobs mentioned here are all highly repetitive. And the evolution continues: from repetitive physical labor to repetitive mental labor.

From this perspective, AI products that do not command the "thinking ability" behind a scenario will themselves be replaced soon. First in line is "intelligent customer service" in its typical sense.

There are plenty of such intelligent customer-service teams in the market. They can build dialogue systems (see Part 2), but they know little about the professional thinking in the fields they serve.

I call "smart customer service" the "front-desk lady" (no offense intended): a receptionist's main job and profession-specific skills are unrelated. Her most important skill is conversation; more precisely, she uses conversation to "route": understand what users need, filter out inappropriate requests, and pass the rest to experts to solve.

But for a company, customer service is only the mouth and the ears; the expert is the brain, and the content is the value. How much is customer service worth? Think of all the call centers that get outsourced.

The counterpart to the customer-service robot is the expert robot. An expert can necessarily identify a user's needs; the reverse does not hold. Compare how much a company pays a customer-service rep versus an expert, and how long each needs to train before starting the job. Professional competence is the core of the organization; customer service is not.

Because of this, many people believe human call centers will be replaced by AI call centers. I think using AI as a call center is only a very short transitional solution. What will quickly replace human call centers, and even AI call centers, is an expert AI center with interactive capability. Here, "expert" matters more than "call".

After the productivity surge and scale effects brought by tooling, the two cost about the same, but the expert AI is professional: it links directly to the back-end supply systems, has domain-specific reasoning ability, and can also interact with the user directly.

What NLP solves in a dialogue system is the interaction problem.

In the field of AI products, given some time, a team that has mastered professional skills can build a dialogue system; it is far harder for a team that has mastered dialogue systems to acquire the professional skills. Remember the early days of the mobile Internet: app developers helped banks build their apps; a few years later, banks were developing their own apps, while the developers still could not do banking.

So, friends defining AI products: your product should ideally replace (or assist) a domain expert, not target transitional positions such as customer service.

From this perspective, the core value of a conversational intelligence product is to take over more of the user's repetitive thinking: work on the mind, not the mouth. Even when you are already solving the biggest problems, try to replace the user's System 2 work, not just the System 1 work.

In your product, add professional-grade reasoning; help users move between abstract concepts and concrete details; help users judge problems that appear in their model but that they have not voiced; consider their current environment model, the physical space and time in which the dialogue was initiated, and their past experience; infer their state of mind, their world model.

Solve the thinking problem first; then turn it into language as well as you can.


Part 5: AIPM

" What is missing? "

At the end of October 2018, I was doing on-site support for an enterprise customer in Munich. During that time we talked with the customer's various BUs, the marketing leadership, and their own R&D teams. As one of the world's top car brands, they are actively seeking applications of AI in their products and services. Here is what they do not lack:

  • No shortage of technical talent. A traditional-industry elephant might be assumed to be weak at AI, yet they are not short of NLP R&D. When I talked with their NLP team, it was basically staffed with PhDs from world-famous schools. Moreover, at their closed supplier conference, every major technology and consulting firm in the world was present. Even for what they cannot do themselves, plenty of people are eager to help.
  • Strong willingness to innovate. Among the big companies I have dealt with, especially the traditional giants, this company takes innovation remarkably seriously. Having lost ground in the mobile-Internet era, they genuinely want to claw some of it back and to lead the industry rather than follow others. This is not the usual harmless "big-company innovation", a POC done to tick the innovation department's KPI; they are genuinely aggressive in commercializing AI and have the courage to rethink their past relationships with tech providers. This impressed me; the details are skipped here due to confidentiality. (How an international giant borrows the hands of startup teams and new technology to attempt disruptive innovation is itself a very interesting topic; I may open it as a new one in the future.)
  • More data. The traditional giant's advantage is real business scenarios and real data. Every product sold is a terminal of theirs, fully networked and increasingly intelligent. Add the various offline channels and the massive customer-service operation, and they have both the capacity and the room to collect far more complete user-lifecycle data.

Of course, on the other side of the coin, a century-old brand naturally carries heavy historical baggage. Internal compliance, procurement processes, data control, data silos between BUs, and administrative barriers are all unavoidable. The trade-offs across these links greatly limit how well those advantages can be used.

But what is most missing is product-definition capability.

If the product definition of dialogue intelligence fails, even flawless execution afterwards yields a mentally retarded result. Some banks' AI robots are examples: six months for project approval, half a year for bidding, a year for development, and one month online before being taken down for being too stupid.

And this is actually not a trait of traditional industry but a problem for all current players: the dialogue AI products of Internet and technology companies cannot escape it either. Internet companies may still feel good about themselves; in product design, talent is supposedly the last thing they lack, since "everyone is a product manager" after all. But so far, the products we have seen from Internet companies all look much the same, as Part 2 covered at length.

Let's see where the difficulty lies.

How do you define an AI product, that is, what kind of product is needed to achieve the business goal? The technical department tends to focus on the technical implementation, not on business-result KPIs. Business-side colleagues, in turn, understand AI only so far, and easily raise inappropriate requirements.

The crux: when writing the product definition, you want to describe "I want an AI like this; it can say...", and then you discover that, because the interface is dialogue, you cannot enumerate the product's possibilities. Even one concrete detail, how to write the product documentation, is challenge enough.

" Management methods for dialogue AI products "

First, the conclusion: trying to manage conversational intelligence products with the methodology used to manage GUI products will not work.

From an industry perspective: without a large number of successful cases there is no assembly line; without an assembly line there is no pipeline-based project management.

Consider the timeline: the first modern car appeared in 1886, and the first assembly line in 1913, a span of 27 years. Later still, Toyota proposed The Toyota Way, combining rapid iteration (much like agile development) with Lean Management to avoid waste, i.e., Kaizen (continuous improvement); that was already 2001.

These past couple of days, talking with peers who also work with big companies, I heard of many failed product cases. Almost all come down to "the product Scope was never clearly defined", which drags the project out with no end in sight. And because the functions are tightly coupled, the pieces fail to connect (when a task that a contextual dialogue depends on is missing an intermediate link, the whole flow breaks). These are all signs of an industry in its early, immature stage.

" Design principles for dialogue AI products have not yet emerged "

Compared with visually oriented products, the conversational intelligence field differs in several ways:

1) It is far less mature than visual AI;

2) Although deep learning's role in the whole system is important, it is limited, far from enough to support a valuable dialogue system on its own;

3) Products are black boxes, and the industry has no widely shared design standards yet.

As apps matured, user habits formed and successful cases were copied across the industry, and some design consensus gradually emerged, such as the bottom tab bar and the red-dot notification badge circled in red on the far right:

Yet from the iPhone's release in 2007 to these mobile design conventions took six or seven years, and that was for a graphical interface.

By now, design standards for mobile products have matured; a designer who ignores the established patterns will leave users disoriented. For dialogue systems, however, it is still far too early to speak of design conventions.

Combining the two points above (neither the management methods nor the design conventions of dialogue AI products are mature) explains why smart speakers are not smart. Behind a smart speaker sits a "skill-building framework" offered to developers, in the hope that developers will use it to create all kinds of "skills".

But a "dialogue skill platform" cannot deliver this at present. Any scenario beyond plain text recognition must be modeled around its specific tasks and functions, then woven into multi-turn dialogue management. At the current level of product maturity, this cannot be abstracted into effective design conventions. What can be abstracted now is only very simple context management (remember "filling in the form" from Part 2?).
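The "filling in the form" context management that skill platforms can abstract is essentially slot filling. A minimal sketch, with the task, slots, and prompts all invented for illustration:

```python
# Minimal slot-filling ("fill in the form") context management.
# The "form" and its prompts are illustrative, not a real platform API.

SLOTS = ["origin", "destination", "date"]   # the form for a flight task
PROMPTS = {
    "origin": "Where are you leaving from?",
    "destination": "Where are you going?",
    "date": "What day do you want to travel?",
}

def next_action(form: dict) -> str:
    """Ask for the first empty slot; once the form is full, execute the task."""
    for slot in SLOTS:
        if slot not in form:
            return PROMPTS[slot]
    return (f"Searching flights {form['origin']} -> "
            f"{form['destination']} on {form['date']}.")

form = {"origin": "Munich"}
print(next_action(form))                    # asks for the destination
form["destination"] = "London"
form["date"] = "2019-01-15"
print(next_action(form))                    # form complete, runs the task
```

This is roughly the ceiling of what today's frameworks abstract; everything beyond it (task jumps, lifecycle, reasoning) falls on the developer.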

One example: most skill platforms have no concept of "user lifecycle management" at all. This is a separate matter from the service flow, and one of the many reasons so many robots are mentally retarded. It involves too many detailed and specialized parts, so I will not expand on it here.

There are exceptions: skills that are pure voice control, such as "turn off the light", "turn on the light", "set the air conditioner to 25 degrees". These skills, which rely mainly on plain text recognition, can indeed work reasonably well with the framework. The problem is that opening them to developers is pointless: such skills need no further productization, and developers can hardly make money building them; there is almost no commercial value.

The other exception is the MLaaS platforms run by big companies, which are genuinely valuable. They meet developers' lowest-level deep-learning needs: intent recognition, word segmentation, entity extraction, and so on. But the whole recognition part, as I argued in Parts 3 and 4, should account for only about 10% of a task-oriented dialogue system, no more. The remaining 90%, the work that really determines the product's value, developers must still do themselves.

What will they run into? A few simple examples (friends outside the industry can skip these):

  • If you need 1,000 sentences of material to train an intent, then "finding 100 people and having each write 10 sentences" trains far better than "finding 10 people and having each write 100 sentences";
  • Should intents be divided by scene, by semantics, or by predicate? The choice affects not only whether the robot can efficiently support jumps between "tasks," but also training efficiency and development cost;
  • Sometimes an intent is trained wrong because the trainer projected the contents of his own head into it;
  • The wording of the bot's utterances matters: it affects not only the user's comfort, but also determines what replies he might give, and what replies those replies might trigger; after all, every possible reply to each of the bot's sentences is something you then have to recognize and respond to in turn;
  • If you are building a skill for a movie theater, it is best to let the user pick a seat with a graphical interface rather than with language: "The empty seats are: row one, seats 1, 2, 3, 4...."
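The first point above, that 100 authors beat 10 authors for the same 1,000 sentences, comes down to phrasing variety, and variety is (crudely) measurable. A toy illustration with invented corpora, using the share of distinct words as a rough diversity proxy:

```python
# Toy illustration: more authors tend to mean more varied phrasings.
# Both corpora below are invented; the metric is deliberately crude.

def distinct_ratio(sentences):
    """Share of distinct words across a corpus -- a rough diversity proxy."""
    words = [w for s in sentences for w in s.lower().split()]
    return len(set(words)) / len(words)

many_authors = ["book a table", "reserve seats tonight", "find me dinner"]
one_author = ["book a table", "book a big table", "book a table now"]
```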

There are countless lessons and techniques in these areas, and the above is only the most superficial layer. As you can imagine, the design rules of dialogue intelligence still have a long way to go. And remember, every product is still a black box: even when one works well, you cannot see how it was designed.

"What makes a suitable AI PM"

When real artificial intelligence arrives, everything product managers need to think about will be replaceable by AI; that is why true artificial intelligence may be the last invention of mankind. Until that day, the job of the conversational-intelligence product manager is to use every means available to create the feeling of intelligence.

The AI PM must be very clear in his mind that "AI belongs to AI, product belongs to product." The starting points of building AI and building products are completely different: you use AI for the purpose of making a product; do not harbor the illusion that "the AI PM's job is to achieve AI."

We are all familiar with the idea that a PM needs to stand at the "crossroads of humanities and technology" to design products. The conversational AI PM may have to be even more extreme in this respect, to the point that it may even take two people working closely together within the product group. I think an excellent conversational-intelligence product manager needs to perform well in these three areas:

1. Understand the business: that is, understand value.

The value of a conversational product is never in the dialogue itself, but in using dialogue as an interaction method (CUI) to complete the task behind it or solve a specific problem. If an application is already very strong, don't think about redoing it with dialogue. Conversely, some problems that APP/WEB have not solved well are worth spending more time studying.

This was elaborated in the "core value of dialogue intelligence" section of Part 4, so I won't repeat it here.

2. Understand the technology: understand the tools in your hands (deep learning + GOFAI).

A chef should be familiar with the characteristics of his ingredients; a musician with the characteristics of his instrument; a sculptor with the chisel in his hand. Everyone's tools are similar; the result is entirely up to the artist.

Now that the AI PM has deep learning in hand, he should understand what it is good at and what it is not, to avoid raising too many ridiculous requirements and getting roasted by the engineers. Understanding the characteristics of deep learning directly helps us judge which product directions are more likely to produce results. For example, building an AI that recommends restaurants is much harder than building an AI that plays Go.

Go's success does not require humans to understand the process in order to accept the result. Recommending a restaurant to a user, by contrast, requires simulating the person's thinking so that he actually endorses the recommendation.

To recommend a restaurant to someone, you can understand his needs through dialogue. (But you can't ask too many questions, especially ones with obvious answers; for example, when it is already 5 o'clock, asking him whether he wants to eat at some point.)

For Go, each (single) input is one of at most 19×19 = 361 possibilities on the board. The course of a game is ever-changing, but we can hand all of it to the deep-learning black box; the information needed to decide the winner is fully presented in the moves on the board. The volume is large, but nothing outside the moves matters, and it all stays inside the (very large) black box. Finally, the output has only two possibilities: win or lose.

For recommending restaurants, the information entered each time does not actually contain all the information needed for the decision (there is no way to express all the relevant influencing factors in language; see the World Model section in Part 3). And the output is open-ended: whether a recommended restaurant is "right" can neither be quantified nor judged as absolutely correct or wrong.

Once you understand the characteristics of CUI, you should not force dialogue interaction where it does not belong. Some uses of dialogue are very costly, not robust, and deliver low user value at low frequency; these we must consider avoiding. We are here to make products, not to achieve real AI; be clear about that.

3. Understand people: psychology and language

This may be the most important part of today's conversational products, a core part of designing dialogue products, and perhaps even the second spring for middle-aged people making products.

Understanding psychology means understanding the model in the user's brain while he is talking. In English, "read the room" means observing the audience around you before speaking, trying to understand their psychology, and then speaking appropriately.

For example, has the audience started repeatedly checking their watches while you speak? That should directly affect how the conversation proceeds. Have you ever talked with someone who makes the conversation feel effortless? Such a person is not merely good at organizing language; more importantly, he has a grasp of the dialogue process in your brain, of your scene model, even of the confidence levels in your world model. He also knows how to word things so that you accept them more easily, and can even guide (manipulate) you into abandoning some topics or reinforcing others.

Designing a dialogue system is the same. Which of the points just mentioned can be modeled, and with what kinds of references? If it is a text interface, will the user scroll back to see the earlier content? If it is a voice interface, can the user still remember it? If he remembers it and you repeat it anyway, it feels redundant; if he has forgotten it and you don't repeat it, he feels lost.
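That repeat-or-omit trade-off can be expressed as a tiny heuristic. The thresholds below are invented purely for illustration; a real product would tune them from user research:

```python
def should_recap(turns_since_mention, is_voice):
    """Crude illustrative heuristic: voice has no scrollback, so repeat
    earlier context sooner. Both thresholds are invented numbers."""
    threshold = 3 if is_voice else 8  # text users can scroll back up
    return turns_since_mention > threshold
```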

Understanding language means understanding the characteristics of spoken language. I know Frederick Jelinek's famous line: "Every time I fire a linguist, the accuracy of the speech recognizer goes up." However, there is no real natural language generation (NLG) at all, because there is no real generation of thought behind it.

Therefore, the dialogue content of task-class products is not naturally generated, nor can deep learning generate it. For the AI PM, many language-specific questions remain: In a reply, is the content too long? What are the key points? Is the predicate clear; is the user told exactly what to do, and under what conditions? How many possible follow-up questions could this reply trigger? Could the wording be misleading (for example, audiences with different backgrounds may interpret it differently)?
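Since task-class replies are authored rather than generated, in practice they typically live in hand-reviewed templates that get checked against exactly the questions above. The template below is invented for illustration (result first, next action explicit, sentence short):

```python
# Hypothetical authored-reply template: the kind of unit an AI PM
# reviews for length, key points, and possible follow-up questions.

TEMPLATES = {
    "confirm_booking": ("Your table at {restaurant} is booked for {time}. "
                        "To change it, just say 'reschedule'."),
}

def render(template_id, **slots):
    """Fill a reviewed template with the slots gathered in dialogue."""
    return TEMPLATES[template_id].format(**slots)
```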

From this perspective, a good dialogue system must come from a highly communicative person or team: considerate and thoughtful toward others, efficient in the use of language, deeply aware of changes in people's psychology, familiar with the business, able to perceive changes in the user's context, and with a style that helps the user control the rhythm of the conversation so as to ultimately solve the specific problem.


Part 6: The visible future is a continuation of the status quo

 1) Transitional technology 

A few weeks ago, I discussed the future of the industry with the CEO of another company in the field. When I expressed the attitude that "deep learning is far from sufficient to build dialogue," he asked me: "If you are that pessimistic, how can the team keep moving forward with hope?"

In fact, I am not pessimistic; perhaps I am just being objective.

Since deep learning is insufficient by nature, is what we are using to implement dialogue AI merely a transitional technology? This is a good question.

My view: making AI products with current technology will last a long time, until the arrival of true intelligence.

If a technology is about to be replaced or subverted, one should not bet on it. If you could foresee the future, no one would have joined Kodak in the early days of the digital camera's rise, or invested heavily in rear-projection TVs just before LED TVs took off. And what is hard to predict is not only technology but also market trends: in China, for instance, the credit card, as a cashless payment method, never managed to cover enough payment scenarios before mobile payment leapfrogged it.

The technology used in dialogue intelligence is far from that stage.

Clayton M. Christensen describes the three phases of each technology in The Innovator's Dilemma:

  • The first stage, slowly climbing the slope;
  • In the second phase, development accelerates; but by the time it reaches the highlands (when progress decelerates), another disruptive technology may have quietly sprouted, repeating the first technology's path;
  • In the third stage, it enters the development bottleneck and is eventually subverted by new technologies.

The black part of the picture below is the original figure from the book:

The current technology of dialogue AI is in the first stage (the position of the blue flag): not yet in high-speed development, still in early exploration. The black-box situation will probably make this cycle (the first phase) last longer than it did in the mobile era.

Given the current direction of technological development, and the progress of both academia and industry, the second, disruptive technology has not yet appeared.

But precisely because deep learning plays only a small part in the dialogue system, most of the space is left for everyone to explore and grow. In other words, there is still a lot of room for development.

The premise is that we are discussing conversational products, not AI itself. Dialogue AI at this stage simply does not reach the level seen in the movies, where machines communicate freely in human language.

 2) The opportunity for service providers to rise 

Because of the technical characteristics described above, in the short term, data and design are the barriers for conversational-intelligence products; technology is not.

Note that the data mentioned here is not the data used for training. It is the data the provider needs to complete the service; the data that covers the user's entire lifecycle; the data beyond the plain text the user produces when the conversation happens; the environment model that shapes what is in the user's brain, the common-sense and inference data that affect task execution, and so on.
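The kinds of data just listed can be pictured as fields in a per-conversation record. A hypothetical sketch, with every field name invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class ConversationContext:
    """Hypothetical record of what a service provider holds per turn:
    everything beyond the plain text of the conversation itself."""
    utterance: str                                       # the plain text
    lifecycle_stage: str = "active"                      # whole-lifecycle data
    service_history: list = field(default_factory=list)  # orders, tickets...
    environment: dict = field(default_factory=dict)      # IoT / scene signals
    inference_facts: dict = field(default_factory=dict)  # common-sense data
```

Of all these fields, a skill platform typically sees only the first; the rest is exactly the data the service provider controls.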

With the development of IoT, service providers, as the party dealing directly with users, are in the best position to master this data. They can deploy IoT devices at various touch points to collect environmental data, and they decide whether or not to provide that data to the platform side.

However, the players in these industries often have long histories and move slowly. Their organizations are huge, with structures designed not for innovation but for making a massive body execute at high speed without thinking. And that is precisely the opportunity for Internet companies and startups.

 3) Hyper Terminal and Portal 

Conversational-intelligence products have to be mounted on hardware terminals. Many of the related hardware attempts are bets on which device will become the next hyper-terminal after the mobile phone, just as the smartphone, as a computing device, replaced the PC's position.

After all, in the mobile era, whoever grabbed the hyper-terminal grabbed the entrance to users; and on top of the entrance sit the applications.

If dialogue intelligence develops a good enough experience and comes to cover enough service areas, which terminal is most likely to become the next hyper-terminal? Smart speakers, speakers with screens, in-vehicle devices and even the car itself, wearables: all can be equipped with conversational intelligence. In the 5G era, more computing moves to the cloud, leaving a less power-hungry OS and infrastructure on the local device, with I/O handed to the microphone and audio playback.

Credit: Pixabay

Therefore, any networked device may gain the ability to interact and deliver services, further weakening the importance of any single hyper-terminal. That is, an individual user may obtain services on any networked device, as long as it has voice interaction and network connectivity; especially for services that depend on the scene, such as hotels, hospitals, and offices.

With the emergence of these portals, the traffic-centric business model of the mobile era may no longer hold, and a new model may be born. Imagine that every company and every brand has its own AI: one or several, generated according to different businesses; serving or assisting internal employees while also serving external customers, managing the entire lifecycle from the moment a user registers with the company to the (unfortunate) final termination of service.

It's just that the order of development is: first there is the service, then the dialogue system on top of it; just as with people, there are thoughts in the head first, and then dialogue to express them.

Conclusion

All the discussion of technology and products in this article emphasizes one point: a product is a combination of many technologies. I don't want to convey a wrong idea such as "deep learning is not important"; on the contrary, I hope every technology is correctly recognized. After all, while we are still some distance from real artificial intelligence, everything we can use now is valuable.

As an AI practitioner, one harbors an irrational hope of witnessing the arrival of true artificial intelligence soon. After all, if true intelligence emerges, the product manager (and many other jobs) may be completely liberated (or destroyed).

This may be the last invention of mankind.

This article was begun in Munich and finally finished in Beijing, taking three months on and off. During that time I talked with many people from big companies, entrepreneurs in the industry, and friends in capital. My thanks to all of them; I won't list names one by one.

This article is reproduced from the public account S. Original address