How were the fundamental improvements in Voice Recognition achieved?

In summary, the thread discusses how machine learning has progressed over the years and how it has driven the advancement of voice recognition. It also covers how voice recognition has become more accessible to the public and why it is important for software to understand different voices if it is to be universally adopted.
  • #1
berkeman
Mentor
TL;DR Summary
There has been a fundamental improvement in Voice Recognition over the past couple of decades. What was the key to these breakthroughs?
I remember about 20 years ago a colleague had to start using the Dragon Voice Recognition software on his EE design PC because he had developed really bad carpal tunnel pain, and he had to train the software to recognize his voice and limited phrases. That was the state of the art not too long ago.

But now, Voice Recognition software has advanced to the point where multiple speakers can speak at normal speed and devices like cellphones, Alexa, and Siri can usually get the interpretation correct. What has led to this huge step? I remember studying Voice Recognition algorithms back in the early Dragon days and marvelling at how complex things like phoneme recognition were. Were the big advances due mostly to increased computing power? Or some other adaptive learning techniques?

This article seems to address part of my question, but I'm still not understanding the fundamental leap that got us from Dragon to Alexa... Thanks for your insights.



 
  • Like
Likes sysprog, 256bits and Greg Bernhardt
  • #2
I would not be surprised if the details of the algorithms are highly proprietary.
 
  • #3
I used to be an enthusiastic user of Google Voice Search (they called it GOOG-411). It was the perfect machine learning platform because people with many accents and backgrounds would ask similar questions. It would then get immediate feedback from the users on whether the recognition was correct. If the user replied "yes, connect the call", it was successful. If not, the user would retry.

From that, I assumed that it was just a neural net they were training, or perhaps multiple cooperating nets: one working on words, the other on semantics. "Please give me the number for pizza in Conway, South Carolina." The question has a predictable structure. It can be expected to have object words ("pizza"), location words ("Conway"), and noise words ("please give me"). So the word guesses could be reinforced by the semantic structure, roughly as in the toy sketch below.
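Here is that sketch in Python. The word scores, the slot prior, and the rescore function are all invented for illustration; this is not Google's actual pipeline.

Code:
# Toy sketch: combine per-word acoustic scores with a simple semantic prior
# for the <business> slot of a directory-assistance query. All numbers are
# made up for illustration.

# Acoustic model output: candidate words with confidences for one time slot
acoustic_hypotheses = {"pizza": 0.55, "visa": 0.30, "pita": 0.15}

# Semantic prior: in "number for <business> in <city>", some words are far
# more likely to fill the <business> slot than others
business_slot_prior = {"pizza": 0.6, "pita": 0.3, "visa": 0.1}

def rescore(acoustic, prior):
    """Multiply acoustic confidence by the slot prior and renormalize."""
    combined = {w: acoustic[w] * prior.get(w, 0.01) for w in acoustic}
    total = sum(combined.values())
    return {w: s / total for w, s in combined.items()}

print(rescore(acoustic_hypotheses, business_slot_prior))
# "pizza" wins even more clearly once the semantic structure is taken into account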

But after several years, they abruptly discontinued the service. A press statement said that they had enough data.

"Breakthroughs" in machine learning are usually of the nature that they learn faster. But if you have enough data, enough money, enough time, even pre-breakthrough learning projects can succeed.
 
  • Like
Likes 256bits and berkeman
  • #4
berkeman said:
Summary:: There has been a fundamental improvement in Voice Recognition over the past couple of decades. What was the key to these breakthroughs?

Were the big advances due mostly to increased computing power? Or some other adaptive learning techniques?
Sure, computing power is one of the key factors. But R&D in the area of recurrent neural networks probably had an even greater influence on the recent successes in speech recognition applications. The invention of the LSTM architecture especially deserves to be highlighted:

As of 2016, major technology companies including Google, Apple, and Microsoft were using LSTMs as fundamental components in new products.[12] For example, Google used LSTM for speech recognition on the smartphone,[13][14] for the smart assistant Allo[15] and for Google Translate.[16][17] Apple uses LSTM for the "Quicktype" function on the iPhone[18][19] and for Siri.[20] Amazon uses LSTM for Amazon Alexa.[21]
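For anyone curious what an LSTM actually computes, here is a minimal single-cell forward step in plain NumPy, with toy dimensions and random weights, just to show the gating that lets the network carry context across a long sequence of audio frames or words. Real speech systems use trained, stacked (often bidirectional) LSTMs, so treat this purely as a sketch.

Code:
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 8            # e.g. 4 acoustic features, 8 hidden units

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One weight matrix per gate: input (i), forget (f), output (o), candidate (g)
W = {g: rng.standard_normal((n_hidden, n_in + n_hidden)) * 0.1 for g in "ifog"}
b = {g: np.zeros(n_hidden) for g in "ifog"}

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    i = sigmoid(W["i"] @ z + b["i"])        # input gate
    f = sigmoid(W["f"] @ z + b["f"])        # forget gate
    o = sigmoid(W["o"] @ z + b["o"])        # output gate
    g = np.tanh(W["g"] @ z + b["g"])        # candidate cell state
    c = f * c_prev + i * g                  # cell state: gated memory
    h = o * np.tanh(c)                      # hidden state / output
    return h, c

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.standard_normal((10, n_in)):  # a fake 10-frame feature sequence
    h, c = lstm_step(x_t, h, c)
print(h.shape, c.shape)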
 
  • Informative
  • Like
Likes 256bits and berkeman
  • #5
Machine learning, yes, that has jumped ahead.
But what is the machine learning operating on - audio frequencies, phonetics, phrases, whole words?
There still has to be some electronics between the spoken word and the AI. What is it picking out to tell the difference between, say, the words "one" and "two", apparently with very little time delay?
A fast Fourier transform, or something different, to analyze the sound?

(PS. I just thought of this - has anybody done any tinkering with animal sounds - a cat's meow, a dog's bark, a pig's grunt, or any of the other everyday sounds we hear - a door slam, a car engine, a police siren?)
 
  • #6
phinds said:
I would not be surprised if the details of the algorithms are highly proprietary.
A closely guarded secret, no doubt.
Have they filed any patents? Others would then be able to find out what they are really up to.
 
  • #7
It is a combination of many factors, including cloud computing (in the case of smart speakers at home, for example, the recognition takes place back in the cloud rather than locally), machine learning, Bayesian training, specific heuristics, and last but not least, crowdsourcing (used by Google's language learning algorithms). Quite a few details of how Google does it can be learned if you go to their tech talks at conventions.

Note also that before voice recognition could become ubiquitous, language reading, recognition, and understanding by software had to improve. It all takes a lot of processing power, and it must be fast, which is why smart speakers and phones encourage you to "train" your recognizer for your particular voice.
 
  • Like
Likes sysprog
  • #8
Try this keynote talk by Li Deng at Interspeech, September 2014: https://www.isca-speech.org/archive/interspeech_2014/i14_3505.html
 
  • Like
Likes berkeman
  • #9
harborsparrow said:
It is a combination of many factors, including cloud computing (in the case of smart speakers at home, for example, the recognition takes place back in the cloud rather than locally), machine learning, Bayesian training, specific heuristics, and last but not least, crowdsourcing (used by Google's language learning algorithms). Quite a few details of how Google does it can be learned if you go to their tech talks at conventions.

Note also that before voice recognition could become ubiquitous, language reading, recognition, and understanding by software had to improve. It all takes a lot of processing power, and it must be fast, which is why smart speakers and phones encourage you to "train" your recognizer for your particular voice.
The Google Assistant app recognizes "Hey Google" locally. It has a Settings interface to train the necessary local recognition in case the default doesn't work well enough. The app listens constantly into a buffer a few seconds long, and sends those few seconds preceding "Hey Google" along with the few seconds of speech thereafter, up to a pause long enough that it determines the end of the message. If I don't say "Hey Google", the app doesn't send anything, but what it does with whatever I do send is not open to my scrutiny. I prefer the Google browser page behavior, wherein Google listens to nothing until I press the mic button. But so far, Google refuses to provide an option for that behavior in its phone app, so if I want that behavior from Google on my phone, I go to the Google page from a browser.
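A rough sketch of that rolling-buffer behavior as I understand it (my own illustration, not Google's code): audio stays in a short fixed-size buffer on the device, and only if a local wake-word detector fires is the buffered context handed off to the cloud recognizer. The wake_word_detected flag and send_to_recognizer function below are hypothetical placeholders.

Code:
from collections import deque

SAMPLE_RATE = 16000
BUFFER_SECONDS = 3
ring = deque(maxlen=SAMPLE_RATE * BUFFER_SECONDS)   # fixed-size rolling buffer

def on_audio_frame(samples, wake_word_detected):
    """Called for each short chunk of microphone samples."""
    ring.extend(samples)
    if wake_word_detected:
        # Snapshot the buffered context (plus whatever follows until a pause)
        # and only then hand it to the cloud recognizer.
        context = list(ring)
        send_to_recognizer(context)          # hypothetical upload function
        ring.clear()

def send_to_recognizer(audio):
    print(f"would upload {len(audio)} samples ({len(audio)/SAMPLE_RATE:.1f} s)")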
 
  • Informative
  • Like
Likes harborsparrow and berkeman
  • #10
lomidrevo said:
Sure, computing power is one of the key factors. But R&D in the area of recurrent neural networks probably had an even greater influence on the recent successes in speech recognition applications. The invention of the LSTM architecture especially deserves to be highlighted:

Yes, it was deep learning. The only minor point of disagreement is whether it was more deep learning than computing power, since deep learning and LSTMs are old. One additional factor is more data for training.

The technology before deep learning was hidden Markov models, and the first deep learning successes in speech recognition used hybrid deep learning - HMM architectures. I believe the current algorithms are pure deep learning algorithms.
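To illustrate the hybrid idea, here is a toy example in which a network's per-frame phoneme posteriors are decoded with an HMM-style Viterbi search. All the numbers are invented and the "states" are crude stand-ins; real systems use thousands of context-dependent states and proper pronunciation lexicons.

Code:
import numpy as np

states = ["sil", "w", "ah", "n"]          # crude states for the word "one"
# Pretend network output: rows = frames, columns = posterior over states
posteriors = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.20, 0.60, 0.10, 0.10],
    [0.10, 0.20, 0.60, 0.10],
    [0.05, 0.10, 0.25, 0.60],
])
# Left-to-right transition matrix: stay in a state or move to the next one
trans = np.array([
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 1.0],
])

def viterbi(post, trans):
    n_frames, n_states = post.shape
    score = np.full((n_frames, n_states), -np.inf)
    back = np.zeros((n_frames, n_states), dtype=int)
    score[0] = np.log(post[0] + 1e-12)
    score[0, 1:] = -np.inf                  # must start in the first state
    for t in range(1, n_frames):
        for s in range(n_states):
            cand = score[t - 1] + np.log(trans[:, s] + 1e-12)
            back[t, s] = int(np.argmax(cand))
            score[t, s] = cand[back[t, s]] + np.log(post[t, s] + 1e-12)
    path = [int(np.argmax(score[-1]))]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [states[s] for s in reversed(path)]

print(viterbi(posteriors, trans))          # ['sil', 'w', 'ah', 'n']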
 
  • Like
Likes lomidrevo
  • #11
Also very important for voice recognition technology (including MyTalk, which won the Computerworld Smithsonian Award for the first commercially successful voice recognition consumer product), and for many other technologies we now take for granted, were the many innovations of General Magic, a company which was the subject of a critically acclaimed 2018 documentary.
 
  • #12
atyy said:
Yes, it was deep learning. The only minor point of disagreement is whether it was more deep learning than computing power, since deep learning and LSTMs are old. One additional factor is more data for training.
You are right that neural networks are old. LSTM itself is also fairly old, though some updates to the architecture were made around 2000, and its modified version, the GRU, was introduced only in 2014. Another point is that today's neural networks can be composed of many, many hidden layers, and therefore perform better. That was not possible decades ago, as training such networks would have taken ages. The increased computing power indeed helped to overcome this issue, but the success came together (or in parallel) with advances in deep learning itself (for example, ReLU started to replace the sigmoid as the activation unit only after about 2009, which had a big impact on overall performance, AFAIK).
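As a rough numerical illustration of the ReLU point (a toy calculation, not a speech model): the sigmoid's derivative is at most 0.25, so a gradient pushed back through many sigmoid layers shrinks fast, while the ReLU derivative is 1 on the active side.

Code:
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_layers = 20
x = 0.5
grad_sigmoid, grad_relu = 1.0, 1.0
for _ in range(n_layers):
    s = sigmoid(x)
    grad_sigmoid *= s * (1 - s)        # sigmoid'(x) <= 0.25
    grad_relu *= 1.0                   # ReLU'(x) = 1 for x > 0

print(f"after {n_layers} layers: sigmoid grad ~ {grad_sigmoid:.1e}, ReLU grad ~ {grad_relu:.1e}")
# The sigmoid gradient has all but vanished; the ReLU gradient has not.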

Maybe I was too quick to say which of the factors had the greatest influence. I think we can agree that it was a combination of factors, especially computing power and advances in deep learning. Regarding the other factor, the amount of data for training: it is true that in general there is much more labeled data today than before, so deep learning models can generally be trained with a smaller generalization error. But in the case of speech recognition, I am not sure whether this was the key factor. I believe that many audio datasets including transcriptions were available decades ago. For example, the TIMIT dataset was created back in 1986 and is still used today as a benchmark when speech recognition solutions are compared. Just my opinion.
 
  • #13
256bits said:
Machine learning, yes, that has jumped ahead.
But what is the machine learning operating on - audio frequencies, phonetics, phrases, whole words?
There still has to be some electronics between the spoken word and the AI. What is it picking out to tell the difference between, say, the words "one" and "two", apparently with very little time delay?
A fast Fourier transform, or something different, to analyze the sound?

If I understand it right, the current methods are end-to-end deep learning methods, so no extra audio processing (like separate DSP stages) is needed, except perhaps a Fourier transform at the beginning to construct a spectrogram from the audio sequence; see this paper:
https://arxiv.org/abs/1303.5778

I don't know exactly how it is implemented in practice for real-time applications, but some version of the FFT, maybe the STFT, is probably involved. Neural networks are pretty fast at computing predictions (once they are trained), so that should not be an issue here. Just my opinion...
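Here is a minimal spectrogram sketch in plain NumPy (no DSP library), just to show the kind of time-frequency input an end-to-end model might see: slice the audio into short overlapping windows, apply a window function, and FFT each slice. Production systems typically use log-mel filterbank features instead of a raw magnitude spectrogram, so this is only illustrative.

Code:
import numpy as np

def spectrogram(audio, sample_rate=16000, frame_ms=25, hop_ms=10):
    frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 400 samples
    hop_len = int(sample_rate * hop_ms / 1000)       # e.g. 160 samples
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(audio) - frame_len + 1, hop_len):
        frame = audio[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))    # magnitude spectrum
    return np.array(frames)                          # shape: (n_frames, n_bins)

# Fake 1-second test signal: a 440 Hz tone plus noise
t = np.linspace(0, 1, 16000, endpoint=False)
audio = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(16000)
spec = spectrogram(audio)
print(spec.shape)   # roughly (98, 201)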
 
  • Informative
Likes 256bits

FAQ: How were the fundamental improvements in Voice Recognition achieved?

1. What is voice recognition and how does it work?

Voice recognition is a technology that allows a computer or device to interpret human speech and convert it into text or commands. It works by using algorithms and machine learning to analyze and interpret the patterns and frequencies of human speech.

2. What are the fundamental improvements that have been made in voice recognition?

The fundamental improvements in voice recognition include increased accuracy, faster processing speeds, improved language and accent recognition, and better noise cancellation capabilities. These advancements have been made through the use of more advanced algorithms and the integration of artificial intelligence.

3. How has machine learning contributed to the improvements in voice recognition?

Machine learning has played a critical role in the improvements in voice recognition by allowing the technology to continuously learn and adapt to different accents, languages, and speech patterns. This has led to increased accuracy and better performance in real-world scenarios.

4. What challenges have been faced in achieving these improvements?

One of the biggest challenges in improving voice recognition has been accurately interpreting and understanding different accents and languages. Another challenge is overcoming background noise and distinguishing between speech and other sounds. Additionally, improving processing speeds and reducing errors have also been challenges.

5. What is the future of voice recognition technology?

The future of voice recognition technology is very promising. With advancements in artificial intelligence and natural language processing, voice recognition will become even more accurate and versatile. It will also continue to be integrated into various devices and applications, making it an essential part of our daily lives.
