Speech Recognition vs. Speaker Recognition

"Wreck a nice beach?" or "Recognize speech?"

November 30, 2000
If you say both phrases quickly, you know the difference. But how can devices, specifically computers, keep track of which is which -- or is that witch?

Voice recognition -- the process of converting spoken words into computer text -- has been a viable technology since the early 1980s. But the difficulties surrounding this technology have kept its uses to a minimum -- until now. Today voice recognition technology is moving forward, and that's good news for people who spend the majority of their workday in front of a computer.

Voice recognition can be separated into two main categories: speech recognition and speaker recognition. Each has its own set of technologies and uses. In some implementations, however, the two categories work hand-in-hand to provide a rich set of "speaker-dependent" speech recognition capabilities.

Speech recognition is the ability of a device to recognize individual words or phrases from human speech. These words can be used to command the operation of a system, such as computer menus or industrial controls, or to enter speech directly into an application, as is the case with dictation software. Speech recognition systems can be speaker independent, typically with a limited vocabulary, or speaker dependent. The former is used when a limited vocabulary is expected within a known context. The latter allows for greater vocabulary size, but at the cost of "training" the system for each specific user. This training typically consists of the user uttering a specific series of words and phrases so the system can learn the user's pronunciation and speech patterns. The system then creates a template specifically for each user.

Voice and Telephony
If you have placed a call to directory assistance recently to inquire about a number in a large city, your request was probably handled by an ASR-based IVR system. What's with all the initials? IVR stands for Interactive Voice Response. If you've ever worked with a voice mail system or called a large company, you have interacted with one of these systems. Typically, IVR systems rely on interactivity via the touch-tone keypad on your telephone. They provide the user with the ability to search for employees, enter account numbers, answer multiple-choice questions, etc.

ASR stands for Automatic Speech Recognition. As its name implies, ASR allows for more natural human interaction with the telephony system, as its function is to recognize human speech. As previously mentioned, many directory assistance systems use ASR to handle incoming calls. Callers are asked to say the name of the city and the listing for which they require a telephone number. The system then attempts to interpret the spoken words as a city and a name. Due to the open-endedness of the required vocabularies -- look at the number of names in a large telephone book -- coupled with the limited frequency response of telephone lines and unpredictable noise levels, this is perhaps the most complex use of ASR/IVR technology in telephony systems. It relies heavily upon voice models that must cover the majority of the population while still rejecting spurious noise.

A more practical and attainable approach to ASR is implementation of a speech-driven, menu-based, IVR system. The limited-vocabulary, fixed-context nature of this type of system substantially decreases its complexity and cost. A number of vendors currently have this type of system available on the market. They range in capability from systems that, out of the box, can recognize spoken numbers so the user is not required to press touch-tone buttons on the telephone, to highly customizable systems which provide integration capabilities into existing call centers and back-end data centers.

Between these extremes are systems that will recognize words from fixed or programmable dictionaries. These dictionaries are created to include words that are specific to the task being performed, and can include employee and department names, answers to survey questions, etc. Several states are currently using this type of system to facilitate call handling into their offices, specifically for unemployment claims processing.
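The fixed-dictionary idea can be illustrated with a minimal sketch. It assumes the recognizer has already turned the caller's utterance into text, and it matches that text against a small task-specific word list, rejecting anything that is not close enough. The department list, the function name and the 0.75 cutoff are hypothetical values chosen for illustration, not settings from any vendor's product.

```python
from difflib import get_close_matches

# Hypothetical task-specific dictionary; a real deployment would load its own.
DEPARTMENTS = ["accounting", "human resources", "sales", "shipping", "technical support"]


def match_menu_entry(heard, vocabulary, cutoff=0.75):
    """Return the vocabulary entry closest to the recognized text, or None to reject it."""
    normalized = heard.lower().strip()
    # get_close_matches returns at most n entries whose similarity is >= cutoff.
    candidates = get_close_matches(normalized, vocabulary, n=1, cutoff=cutoff)
    return candidates[0] if candidates else None
```

Because the vocabulary is small and the context is fixed, a slightly garbled result can still be mapped to the intended entry, while unrelated speech is rejected so the system can ask the caller to repeat the request.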

Difficulty of Dictation

Dictation is perhaps the most difficult task for speech recognition systems to perform. Free-form speech is usually understandable by human beings. However, the very nature of the way we communicate with one another, using accent, inflection and emotion, makes it more difficult for the computer to discriminate the words being spoken. Nonetheless, numerous products have appeared on the market, indicating that, while the technology may not be perfect, with some training on the part of the user and the computer, these systems can be highly effective and useful in a controlled environment.

One example of a dictation system is Lernout & Hauspie's Voice Xpress. This product is available as packaged software, as well as preinstalled on systems such as Aqcess Technology's Qbe Personal Computing Tablet product. Other dictation system products include ViaVoice from IBM and NaturallySpeaking from Dragon Systems (now a unit of Lernout & Hauspie). ViaVoice is currently the only solution available for PC, Macintosh and Linux platforms as well as for embedded applications.

Voice recognition using such products gets substantially better as the system is trained to understand an individual. Also, during the dictation process, if the system does not understand a word or utterance, the user may be prompted to type the word or spell the misunderstood word verbally.

Microphone quality also plays a substantial role in defining the overall quality of the user experience. Poor-quality microphones, including many that are built into monitors and laptop computers, yield less-than-desirable results in dictation systems. The preferred type of microphone for this environment is a headset microphone, which can be placed in a fixed position in front of the user's mouth to provide consistent audio quality. A microphone with some form of noise cancellation is also preferred. Noise cancellation is the ability of a microphone to ignore unwanted noises. This type of microphone is typically sensitive in only one direction, and sounds reaching it from directions other than the speaker's are largely ignored or canceled.

But even the best microphones cannot compensate for too much background noise. For a dictation system to perform acceptably, both ambient noise levels and spurious noise must be kept to a minimum. If the noise level increases beyond the system's threshold, dictation accuracy diminishes rapidly.
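One common way to put a number on "too much background noise" is the signal-to-noise ratio, expressed in decibels as 20 times the base-10 logarithm of the speech level divided by the noise level. The sketch below is only a rough illustration: it assumes you have one list of samples recorded while the user speaks and another recorded during silence, and the 15 dB threshold is an assumed value for the example, not a figure from any dictation product.

```python
import math


def rms(samples):
    """Root-mean-square level of a non-empty sequence of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech, noise):
    """Rough signal-to-noise ratio in decibels; assumes the noise level is not zero."""
    return 20.0 * math.log10(rms(speech) / rms(noise))


# Assumed threshold for illustration only; real products define their own limits.
MIN_SNR_DB = 15.0


def quiet_enough(speech, noise, min_snr_db=MIN_SNR_DB):
    """True when the estimated ratio meets the assumed minimum for dictation."""
    return snr_db(speech, noise) >= min_snr_db
```

With speech ten times the level of the noise, the ratio works out to 20 dB and passes the assumed threshold. With speech only twice the noise level, it is about 6 dB and fails.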

Speaker Recognition

As both physical and data security become of greater concern, new methods of uniquely identifying people are emerging. These technologies rely largely upon the physical differences that make us individuals. Because every human being has a unique voice, voice can be used as a form of biometric user verification to physically secure an area, limit access to personnel files or verify a claimant's identity over the telephone for unemployment insurance processing.

This type of system works by enrolling each new user. The enrollment process consists of directing the user to repeat a series of numeric or verbal prompts. Once this is complete, the system generates a model of the user's vocal patterns. This model is unique to that individual. When used in conjunction with other forms of identification, such as username and password, a physical key or combination, voice biometrics provides a very high degree of confidence in verifying a user's identity.
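The enroll-then-verify flow can be sketched in a few lines. This is a toy illustration, not how any of the commercial products work: it assumes each utterance has already been reduced to a short list of numbers (a "feature vector"), builds the user's model by averaging the enrollment vectors, and compares a later attempt to that model with cosine similarity. The function names and the 0.95 threshold are hypothetical, and production systems rely on far richer acoustic features and statistical models.

```python
import math


def enroll(samples):
    """Average several feature vectors from one user into a single voice model."""
    count = len(samples)
    return [sum(values) / count for values in zip(*samples)]


def cosine_similarity(a, b):
    """Similarity of two non-zero vectors; 1.0 means they point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(model, attempt, password_ok, threshold=0.95):
    """Accept only if the other credential passed AND the voice is close to the model."""
    return password_ok and cosine_similarity(model, attempt) >= threshold
```

Requiring both the password check and the voice match mirrors the point above: the voice model adds confidence on top of another form of identification rather than replacing it.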

Companies such as VeriVoice, T-Netix, Keyware and others offer products that perform the task of speaker recognition. These companies offer applications that handle:

* physical access;

* time and attendance;

* network and data security;

* securing Web-based applications and data; and

* custom application development.

As such, speaker recognition technology can let you into the building, clock you in at the start of the workday, give you access to your files on the company network, allow you to modify your 401(k) enrollment via the human resources internal Web site, secure your work for the day and clock you out when it's time to go home. When tied into the company's telephone system, voice verification can even validate that you are "punching the clock" from your own desk.

The Improving Power of Voice

From a mere twinkle in the eyes of engineers at IBM and AT&T to an integral part of our day-to-day lives, the state of the art in voice recognition has come a long way in the last 50 years. Whether or not we realize it, we use the technology on a regular basis. The array of products and applications that has recently emerged demonstrates that these systems have improved to the point where they are usable for day-to-day dictation, control, logistics management and telephony applications.

Dictation systems have improved to the point that they can be used by individuals who may not be as effective with a keyboard as they could potentially be with their own voices. Virtually all dictation software also provides the user the ability to control the general operations of their computers. This capability offers physically or visually challenged people the ability to interact with the computer, as well. Systems for improving the quality of life for physically challenged people are relatively easy to implement if other electronic and electromechanical devices are already in place. In many cases, these existing systems can be augmented with voice-recognition capabilities.

Telephony applications are becoming more and more pervasive. Many companies and local governments are implementing systems that accept speech as well as "Touch-Tone" keys. Better recognition systems also allow voice-based input for forms completion, directory assistance, etc.

Speech recognition technology is making its way into our daily lives as we complete activities such as filling in unemployment insurance claim forms over the phone, controlling the living room lamp, opening the office door, controlling access to confidential records, "voice-surfing" the Web, taking dictation and inventorying pencils in the supply closet.

The products and vendors mentioned in this article by no means constitute all that is available on the market today. A simple search on the Internet for "speech recognition" yields a vast array of products and services geared toward the application of speech recognition technology. Chances are, if you have an idea of what you would like to do with speech technology, there's a product out there waiting to fit your requirement.

Voice recognition technology is rapidly improving and, as a result, is being used in more and more daily activities.


Pete Hermsen is president of Variant Technology Consulting, LLC, a firm that specializes in voice-over IP, e-commerce systems architecture and wireless LAN and WAN architecture and implementation.


Build Your Own

If you are inclined to build your own voice recognition system, building blocks for these systems are available from the following vendors:

Dialogic Corp. offers CT Media, various telephony interface hardware plug-in cards and continuous speech processing hardware for on-board speech recognition. Call 800/755-4444 or 973/993-3030.

Brooktrout Technology Inc. also provides various telephony interface hardware plug-in cards as well as New Network platform (CTI platform). Call 781/449-4100.

Natural Microsystems, Inc. has various telephony interface hardware plug-in cards and NaturalRecognition available. Call 800/533-6120 or 508/620-9300.

Lernout & Hauspie sells ASR 1500 -- a speech recognition development kit -- and Real-Speak Text-To-Speech Development tools. Call 781/203-5000.

IBM offers a number of development tools for telephony: IBM DirectTalk; IBM Message Center; and IBM CallPath. Call 800/IBM-4YOU or 914/499-1900.

Numerous additional companies, such as Ericsson, Locus Dialog, Nuance, Philips, SpeechWorks and others, offer hardware and software building blocks for developing ASR-based telephony platforms.