November 30, 2000 By Peter M. Hermsen
Voice recognition -- the process of converting spoken words into computer text -- has been a viable technology since the early 1980s. But the difficulties surrounding this technology have kept its uses to a minimum -- until now. Today voice recognition technology is moving forward, and that's good news for people who spend the majority of their workday in front of a computer.
Voice recognition can be separated into two main categories: speech recognition and speaker recognition. Each has its own set of technologies and uses. In some implementations, however, the two categories work hand in hand to provide rich "speaker-dependent" speech recognition capabilities.
Speech recognition is the ability of a device to recognize individual words or phrases from human speech. These words can be used to command the operation of a system, such as computer menus or industrial controls, or to enter speech directly into an application, as is the case with dictation software. Speech recognition systems can be speaker independent, typically with a limited vocabulary, or speaker dependent. The former suits applications in which a limited vocabulary is used within a known context. The latter allows for a larger vocabulary, but at the cost of "training" the system for each specific user. This training typically consists of a user uttering a specific series of words and phrases so the system can learn the user's pronunciation and speech patterns. The system then creates a template specific to each user.
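The per-user training described above can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not how any particular product works: the `UserTemplate` class is hypothetical, each utterance is assumed to have already been reduced to a short list of numbers (a feature vector), and recognition simply picks the trained word whose averaged vector lies closest to the new utterance.

```python
import math
from collections import defaultdict


class UserTemplate:
    """Toy per-user template: averaged feature vectors, one per trained word.

    Assumption: each utterance arrives as a fixed-length list of floats.
    Real systems extract such features from audio; that step is not shown.
    """

    def __init__(self):
        # word -> list of feature vectors collected during training
        self._samples = defaultdict(list)

    def train(self, word, features):
        """Record one training utterance of `word` by this user."""
        self._samples[word].append(list(features))

    def _centroid(self, word):
        """Average the training vectors for `word`, component by component."""
        samples = self._samples[word]
        return [sum(column) / len(samples) for column in zip(*samples)]

    def recognize(self, features):
        """Return the trained word closest to `features`, or None if untrained."""
        best_word, best_distance = None, math.inf
        for word in self._samples:
            distance = math.dist(self._centroid(word), features)
            if distance < best_distance:
                best_word, best_distance = word, distance
        return best_word


# Usage: the user repeats each word during training, then speaks normally.
template = UserTemplate()
template.train("yes", [1.0, 0.0])
template.train("yes", [0.8, 0.2])
template.train("no", [0.0, 1.0])
print(template.recognize([0.9, 0.1]))
```

Because the template is built only from one user's utterances, a different speaker's feature vectors may fall closer to the wrong word, which is why a speaker-dependent system has to be trained again for each user.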
Voice and Telephony
If you have placed a call to directory assistance recently to inquire about a number in a large city, your request was probably handled by an ASR-based IVR system. What's with all the initials? IVR stands for Interactive Voice Response. If you've ever worked with a voice mail system or called a large company, you have interacted with one of these systems. Typically, IVR systems rely on input from the touch-tone keypad on your telephone. They let the user search for employees, enter account numbers, answer multiple-choice questions, and so on.
ASR stands for Automatic Speech Recognition. As its name implies, ASR allows for more natural human interaction with the telephony system, as its function is to recognize human speech. As previously mentioned, many directory assistance systems use ASR to handle incoming calls. Callers are asked to say the name of the city and the listing for which they require a telephone number. The system then attempts to interpret the spoken words as a city and a name. Due to the open-endedness of the required vocabularies -- look at the number of names in a large telephone book -- coupled with the limited frequency response of telephone lines and unpredictable noise levels, this is perhaps the most complex use of ASR/IVR technology in telephony systems. It relies heavily upon voice models that must cover the majority of the population while still being able to reject spurious noise.
A more practical and attainable approach to ASR is the implementation of a speech-driven, menu-based IVR system. The limited-vocabulary, fixed-context nature of this type of system substantially decreases its complexity and cost. A number of vendors currently offer this type of system. They range in capability from systems that, out of the box, can recognize spoken numbers so the user is not required to press touch-tone buttons on the telephone, to highly customizable systems that integrate with existing call centers and back-end data centers.
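To see why a limited vocabulary and a fixed context keep such a system simple, consider the following Python sketch. It assumes a recognizer has already produced a word and a confidence score between 0 and 1. The `interpret` function, the menu words, the action names and the 0.6 threshold are all hypothetical and chosen only for illustration.

```python
from typing import Optional

# Hypothetical vocabulary for a single menu; words and actions are illustrative.
MENU_VOCABULARY = {
    "billing": "route_to_billing",
    "support": "route_to_support",
    "operator": "route_to_operator",
}

# Spoken digits mapped to the touch-tone key each one replaces.
SPOKEN_DIGITS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7",
    "eight": "8", "nine": "9",
}


def interpret(hypothesis: str, confidence: float,
              threshold: float = 0.6) -> Optional[str]:
    """Map a recognized word to a menu action, or return None to reject it.

    `hypothesis` and `confidence` stand in for the output of a recognizer,
    which is not modeled here.
    """
    if confidence < threshold:
        return None  # too uncertain: treat as noise and prompt the caller again
    word = hypothesis.strip().lower()
    if word in MENU_VOCABULARY:
        return MENU_VOCABULARY[word]
    if word in SPOKEN_DIGITS:
        return "digit_" + SPOKEN_DIGITS[word]
    return None  # outside the fixed vocabulary for this menu


print(interpret("Billing", 0.9))
print(interpret("seven", 0.8))
print(interpret("billing", 0.3))
```

Anything outside the menu's small dictionary, or recognized with low confidence, is simply rejected, so the system never has to cope with the open-ended vocabulary that makes directory assistance so difficult.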
Between these extremes are systems that will recognize words from fixed or programmable dictionaries. These dictionaries are created to include words that are specific to the task being performed, and can include