The last few years have seen a dramatic increase in the number of applications that expect people to interact with machines through a voice interface; that is to say, you tell it what to do by talking to it. Most likely, you don’t do this with your “computer.” It is far more likely that you do it with your phone, which is just a computer that you can put into your pocket. There are even products like Amazon Echo or the recently announced Google Home, where voice is intended to be the primary means of interaction. Recently I placed a call to a customer support line and spent quite a bit of time in dialog with a machine. It wasn’t simply a voice menu system. The interaction followed the same type of script that humans in call centers have used for years, and it did a really good job of understanding me.
What is really significant about this is that only a few years ago, voice interfaces were rather broadly ridiculed. There was a reason for this: they weren’t very good. That voice system on the phone really stood out to me because it was so much better than what I had previously experienced, where the software barely understood me saying numbers for menu options. The voice recognition on my phone, which I use occasionally to send text messages when it isn’t convenient to type, also generally does a very good job. The error rate is definitely much lower than it was just a few years ago.
So what’s changed? It turns out that the difference isn’t improvements in voice recognition specifically, so much as improvements in Artificial Intelligence (AI) and machine learning in particular.
AI isn’t new. We are in what one might call the third golden age of AI. The first came in the 1950s when digital computers were still young, and many people believed that human intelligence was just a collection of logical rules. They believed that if they could just figure out the right rules and put them into a computer, they could reproduce human-like intelligence. In the 1980s, AI became a hot topic again, and there was a lot of work on techniques like neural networks. Both of these previous golden ages died when reality failed to live up to their assumptions.
The most recent rise in AI has been fueled by machine learning, which is feeding off the vast quantities of data that we create on the internet and through our various connected devices. The specific form of machine learning that is behind the ability for you to talk to your computer is called “deep learning,” and it uses constructs called “deep neural networks.” Neural networks are based on the structure of neurons in biological brains. The recent addition to them is the “deep” part. Basically, people have made the neural networks a lot bigger. Neural networks are generally “trained” by feeding them inputs where we know what the output should be, and adjusting the connections between neurons to reflect whether the output is right or wrong. Bigger networks require more data to train on.
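The training loop described above can be sketched in a few lines of code. This is a toy illustration, not how production deep learning systems are built: a single artificial neuron (the smallest possible "network") learns the logical AND function by repeatedly comparing its output to the known answer and nudging its connection weights to reduce the error. The data, learning rate, and iteration count are all chosen for illustration.

```python
import math
import random

def sigmoid(x):
    # Squashes any number into the range (0, 1), mimicking a neuron
    # that is either mostly "off" or mostly "on".
    return 1.0 / (1.0 + math.exp(-x))

# Training data: inputs paired with known outputs (the AND truth table).
examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

random.seed(0)
weights = [random.uniform(-1, 1), random.uniform(-1, 1)]
bias = random.uniform(-1, 1)
rate = 0.5  # learning rate: how big each adjustment is

for _ in range(5000):
    for (x1, x2), target in examples:
        out = sigmoid(weights[0] * x1 + weights[1] * x2 + bias)
        # Nudge each weight in the direction that shrinks the error
        # between what the neuron said and what it should have said.
        grad = (target - out) * out * (1 - out)
        weights[0] += rate * grad * x1
        weights[1] += rate * grad * x2
        bias += rate * grad

# After training, the neuron reproduces the AND function.
predictions = [round(sigmoid(weights[0] * x1 + weights[1] * x2 + bias))
               for (x1, x2), _ in examples]
print(predictions)
```

A deep neural network stacks millions of such neurons in many layers, and the "nudging" step becomes the backpropagation algorithm, but the core idea is the same one the paragraph above describes: show the network inputs with known outputs and adjust the connections toward the right answer.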
If you want to train a neural network to understand what people are saying, you have to feed it a lot of audio clips where you know what is being said. Getting a hold of a large amount of such information might have been challenging in the past, but it has gotten a lot easier. How many hours of captioned video are available on YouTube alone?
The turning point for deep learning wasn’t actually related to giving voice commands to computers. It was in 2012, when a program using a deep neural network dramatically outperformed other approaches in the ImageNet challenge, which was a competition to recognize the objects in millions of digital images. To help advance the field of image processing, the creators of the ImageNet challenge created a database with over 10 million labeled images that could be used for training AIs. The 2012 winner had an error rate of 16 percent for identifying images it had never “seen” before; compare this to the 25 percent error rate of the previous year’s winner. By comparison, people have an error rate of 5.1 percent.
With the success in the ImageNet challenge, many technology companies have begun to invest heavily in their own deep learning projects. Earlier this year, a deep learning AI from Google made headlines for beating the world champion in Go. Microsoft’s 2015 entry into the ImageNet challenge used a deep learning AI to achieve an error rate under 5 percent.
The technology is used in more practical applications as well, including many products that you likely use today. Google Translate uses an AI that was trained on the vast amounts of text data in multiple languages available on the internet. Autonomous cars are largely driven by machine learning software that is constantly getting smarter as it drives. The work in image recognition has grown to include video recognition that not only identifies what objects are in the videos, but also what activities they are performing.
Deep learning also powers other applications, like IBM’s Watson. When Watson won Jeopardy in 2011, it wasn’t using deep learning. Now, nearly all the components in Watson use deep learning, and IBM has been investing heavily in the technology. Watson is being trained to do medical diagnosis and has already proven to be more capable than human doctors in some areas. It is also trained for call center use. For all I know, that might have been a Watson-based system I was talking to when I called that customer service line.
As with all new technologies, it isn’t clear what the full range of applications is for deep learning, but my guess is that we are only scratching the surface, and the full implications of this technology could be quite profound.
Mark Lewis is a professor of computer science.