Once a sci-fi pipedream, Automatic Speech Recognition (ASR) is now a reality and is quickly becoming an integral part of daily life across the globe. Recent trends suggest uptake of ASR devices is growing, as consumers and businesses alike see the benefits of a once-novel but now increasingly refined technology. How far ASR can progress, however, and in which direction, remains to be seen. To understand its possible future, let’s look at where ASR is now, how it works, and what can still be done.
What’s the state of ASR in 2021?
Most people reading this will be within arm’s reach of at least two ASR devices: one on their phone, and likely one on their laptop. If they’re at home, they might also be within talking distance of a smart speaker or voice-activated appliance. They all operate on the same fundamental technology: receiving audio, converting the relevant bits to text, interpreting that text, and deciding whether to take some sort of action based on its meaning.
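For readers who like to see ideas in code, the four steps above can be sketched in a few lines of Python. This is a toy illustration only, not how any real assistant is built: the "audio" is already a list of words, and the wake word, the command phrases and every function name are made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

WAKE_WORD = "computer"  # hypothetical wake word, chosen only for this sketch


@dataclass
class Intent:
    name: str       # what the user wants done
    argument: str   # the detail attached to the request


def transcribe(audio_frames: List[str]) -> str:
    """Step 1 and 2 stand-in: pretend each audio frame is already a word.

    A real system would run an acoustic and language model here.
    """
    return " ".join(word.lower() for word in audio_frames)


def extract_command(transcript: str) -> Optional[str]:
    """Keep only the relevant bit: the words that follow the wake word."""
    words = transcript.split()
    if WAKE_WORD not in words:
        return None
    return " ".join(words[words.index(WAKE_WORD) + 1:])


def interpret(command: str) -> Optional[Intent]:
    """Step 3: map the text to a meaning using two made-up phrase patterns."""
    patterns = (("play ", "play_music"), ("set a timer for ", "set_timer"))
    for prefix, name in patterns:
        if command.startswith(prefix):
            return Intent(name, command[len(prefix):])
    return None


def handle(audio_frames: List[str]) -> str:
    """Step 4: decide whether to act, and on what."""
    command = extract_command(transcribe(audio_frames))
    if command is None:
        return "ignored"          # nobody addressed the device
    intent = interpret(command)
    if intent is None:
        return "not understood"   # addressed, but no known request
    return f"{intent.name}({intent.argument})"


print(handle(["Computer", "play", "some", "jazz"]))  # play_music(some jazz)
print(handle(["nice", "weather", "today"]))          # ignored
```

Each function corresponds to one stage, which is why a failure at any stage (mishearing a word, missing the wake word, misreading the request) changes what the device ends up doing.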
In the West, ASR technology is dominated by the usual big-tech suspects: Google, Apple, Amazon and Microsoft. In China, a homegrown ASR industry features its own heavyweights: Alibaba, Tencent, Baidu and Huawei.
Each of these companies offers products designed to smooth the simple processes of modern living (shopping, arranging schedules, listening to music, seeking answers to the questions swimming around our heads at 3:00 am) without once having to interact with another human or type a command. And they’re pretty good at it. Most ASR software now hovers around a 5% error rate, roughly equivalent to that of an average human listener.
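To make the 5% figure concrete: ASR accuracy is usually reported as word error rate (WER), the number of substituted, deleted and inserted words divided by the number of words actually spoken. Assuming that is the measure behind the figure above, a minimal Python sketch of the calculation looks like this. The function name and the sample sentences are ours, chosen for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    reference  -- what was actually said
    hypothesis -- what the ASR system wrote down
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # reference word was dropped
                curr[j - 1] + 1,                  # extra word was inserted
                prev[j - 1] + substitution_cost,  # word matched or was swapped
            ))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of five gives a 20% error rate.
print(word_error_rate("turn on the kitchen light",
                      "turn on a kitchen light"))  # 0.2
```

By this measure, a 5% error rate means roughly one word in twenty is misheard, dropped or added.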
But while the upmarket Amazon Echo, or the budget-friendly Xiaomi XiaoAI, may occupy a space in many of our living rooms, we have yet to see pervasive, society-wide uptake. We’re not yet embodying Joaquin Phoenix in the movie Her. Concerns remain about accuracy, safety and data privacy.
The idea that our devices are always listening, together with multiple scandals in recent years over the use of human contractors to review private conversations, has increased scepticism about whether tech companies can integrate into our private lives responsibly. Moreover, because the technology is not yet perfected, people perceive danger in using ASR in settings where a mistake might harm us or others: in cars, for example.
Nevertheless, trends show that ASR is here, and it is here to stay. Around one in five US households now has a voice-activated assistant, and the market for smart-home products in China is growing at around 20-30% per year. We are now seeing light switches, ovens, thermostats and even vacuum cleaners operated by voice.
Automotive companies are aggressively competing to offer advanced ASR systems to consumers. Not only do they provide convenience, but they are also seen as a safe alternative to activities that might distract drivers, such as looking away from the road to answer a call or adjust the stereo.
In business, ASR is revolutionizing laborious, expensive processes. Transcription is the big-hitter here. Services like Amazon Transcribe are designed for businesses to record and then automatically analyze customer service calls, meetings and even medical appointments. Integrating ASR with workflows is also transforming accessibility. People with physical impairments that might restrict their ability to use traditional digital interfaces are increasingly seeing this barrier removed.
So why isn’t ASR more pervasive, and what can we realistically expect of it in the coming years? Clues lie in the technology itself. The reality is that language and speech are both incredibly complicated. Not only are there different languages, dialects and accents to interpret, but we often form sentences with implied and contextual meaning. Couple this with the fact that we’re often in noisy environments, where our speech must be picked out from the surrounding sound, and you have a monumental challenge for the technology to deal with.
How does the technology cope? Put simply, by taking a huge amount of data and picking out patterns. Human brains learn from experience: from the moment we’re born, most of us are exposed to constant audio-visual stimuli, and we compare each moment against others to decipher meaning. ASR is guided by machine learning, which does much the same. The amount of data and processing power this requires is enormous. As the computational capacity we have on hand increases, so too will the accuracy and usability of ASR. We’re just not there yet.
With the possible applications of ASR already laid out, and uptake steadily increasing, it is simply a matter of time before we see ASR more deeply integrated into our processes and devices. The Amazon Echo was released just seven years ago. With steady advances in computing technology, and a competitive marketplace, over the next seven years we can reasonably expect ASR to become a permanent fixture in our work and home lives.