Local Governments Must Move Toward Vocal Computing to Improve Services (Industry Perspective)

Advances in speech recognition, coupled with machine learning that ultimately enables computers to provide context, will transform how we compute.

by Shawn DuBravac / December 15, 2016

For many of us, engaging with our local government offices means long waits, inscrutable forms and stacks of paperwork for basic services. City hall is one of the last holdouts against the digital revolution, but even it can embrace a digital destiny.
 
To better serve residents in this increasingly digital world, local governments are optimizing their IT infrastructure and maximizing their constrained budgets. And they’re transitioning their systems at a key point in human-machine history, as we phase out keyboards and the swipe-and-scroll interface of touchscreen devices and gear up for the next computing interface — namely, vocal computing. 
 
When I speak with municipalities about what they should be doing to better facilitate the flow of information to their residents and improve services across their jurisdictions, they are often thinking within the confines of an early 21st-century technological paradigm.
 
Today we still rely heavily on browsers and apps to deliver services and information. But the future of computing will be something decidedly different. It won’t be visual. It won’t be touch. We have likely witnessed some of the last visual interfaces of computing. The next technological interface will be vocal.
 
The digital computer age was born in 1942, with the introduction of the first digital computer. In the beginning, the way we communicated with computers was drastically different from how we communicated with one another. We had to speak the computer’s language, so we used techniques that computers could understand, such as punch cards or command prompts.
 
Each subsequent decade has brought innovations and advancements that have made computers more user-friendly and intuitive.

In 1984, Steve Jobs brought the graphical user interface to the masses and gave us the folders and file structure we are accustomed to today. In 1991, Tim Berners-Lee introduced the first Web browser, which once again improved the way we interact with computers. With each innovation came greater ease of use, moving computers up the continuum of the communication hierarchy, getting ever closer to how humans interact.
 
Interactive voice response was a successful precursor to the vocal interface. We might not like them, but the phone bots that ask us to “Please say ‘flight status’ or ‘reservations’” have worked for many years. They worked because they only had to pattern-match on a small, specific vocabulary.
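To make the fixed-vocabulary point concrete, here is a minimal Python sketch of how such a phone menu might route a caller once the speech has been turned into text. The menu phrases, the route names and the `route_call` function are all hypothetical, invented for illustration; real interactive voice response systems are considerably more involved.

```python
from typing import Optional

# Hypothetical menu: the phrases and route names are assumptions for illustration.
MENU = {
    "flight status": "FLIGHT_STATUS",
    "reservations": "RESERVATIONS",
}


def route_call(transcript: str) -> Optional[str]:
    """Return the route whose menu phrase appears in the transcript, if any."""
    # Normalize case and whitespace so "Flight  Status" matches "flight status".
    text = " ".join(transcript.lower().split())
    for phrase, route in MENU.items():
        if phrase in text:
            return route
    # Anything outside the fixed vocabulary is simply not understood.
    return None
```

A caller who says "Flight status, please" is routed correctly, while "I need to change my seat" returns nothing at all. That gap between matching a short list of phrases and understanding open-ended speech is the one vocal computing has to close.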

But reliable vocal-computing technology has eluded us, even after early appearances starting around 1994. These initial iterations of technology were laughably inaccurate.
 
The accuracy of voice recognition is measured by word error rate (WER), which was nearly 100 percent in those early days; that is, computers were interpreting virtually every word incorrectly. Computers were “hearing” the wrong words and as a result couldn’t perform the requested functions or interact with any measure of reliability.
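For readers who want to see how the measure works, WER is conventionally computed as the number of word substitutions, deletions and insertions needed to turn the computer's transcript into the correct one, divided by the number of words actually spoken. The Python sketch below follows that standard definition; the function name and the sample sentences are illustrative assumptions, not taken from any particular speech system.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions are needed to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

If a resident says "pay my parking ticket" and the computer hears "pay my barking ticket," one word in four is wrong, for a WER of 25 percent. Because inserted words also count as errors, the figure can exceed 100 percent when the transcript is longer than what was said.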

The technology simply didn’t work. But over time, the technology improved. By 2013, the WER hovered around 25 percent. Progress, yes, but still not good enough for everyday use.
 
And then something remarkable happened. In the past three years, the technology has improved so significantly that Microsoft — the same company that brought us some of the first vocal-computing prototypes in 1994 — declared that the technology has achieved “human parity.”

The past 30 months have brought greater improvement in vocal-interface technology than the first 30 years. 

This technology is already in many homes through our cable boxes, many of which have embedded vocal-interface technology into remote controls, letting you search by speaking.

But my vision of vocal computing extends beyond human parity. For this technology to gain mainstream appeal, it must be as simple to converse with a computer as it is to converse with your spouse, neighbors or close friends.
 
Imagine a dinner table discussion with your spouse, in which you start with a simple question that elicits a response. From there, a series of questions and answers flows, based on the questions asked and the answers given. A key element of this dinner table exchange is context. If you ask your spouse what time a certain movie starts, for example, he or she will probably intuitively know which theater you’re talking about.
 
Advances in speech recognition, coupled with machine learning that ultimately enables computers to provide context, will transform how we compute. Going back to the movie example, imagine if your spouse could understand you nearly perfectly, read context clues and also access all of the information housed on every website everywhere in every language.
 
Vocal computing offers added security elements too. Voice is a unique identifier, so in a vocal-computing environment, logging into secure sites could be done seamlessly and painlessly. No longer will you need to remember or reset passwords. You will simply speak your commands.
 
Planning for a trip to a foreign country? Vocal computing will allow you to speak what you are looking for and receive answers to your questions, even when the underlying source is available in a language different from the one you speak.

Vocal computing has the potential to be both easier and more efficient because it’s more intuitive.
 
And computers will soon be able to merge multiple call-and-response answers into a single voice reply. Eventually, vocal-computing technology will be implanted in all of the environments where we want access to information — our offices, our vehicles and throughout our homes.
 
As local governments prepare for a vocal-computing future, the technology will forever change how their services are rendered. You’ll be able to check your driving record with the DMV, update your information on file and even pay tickets without sitting in front of a screen and logging in.

You’ll register to vote using vocal computing and, in return, hear your polling location along with other relevant information, such as poll hours or historical wait times on past election days. And you can do all of this while you chop vegetables and steam rice for dinner.

In the very near future, I will want to ask my computer about government services such as where I can volunteer this weekend, current emergency-room wait times or when the next snowplow will service my street. 
 
Rather than simply offering a bunch of forms on their websites, municipalities must embrace a vocal-computing system that lets people fold the administrative to-dos on their lists into their daily routines and interactions.

Only then will they be positioned to provide residents with the best possible services and equip them for a 21st-century tomorrow.
 
Shawn DuBravac is chief economist of the Consumer Technology Association and the author of Digital Destiny: How the New Age of Data Will Transform the Way We Live, Work, and Communicate.