Beginpc.com - Speech Recognition

Speech Recognition

If you're not a fan of typing, or just find it a tedious activity, then this section is for you.

Speech recognition is not radically new, and it has been used in some industries for a while. But it is only in the past year or two that the general public has started using it for word processing, which is what most people use speech recognition software for: dictating essays, letters or memos. To use speech recognition you need a microphone that can be plugged into the microphone socket on the back of your sound card.

Once you have purchased your speech recognition software, the first thing to do after installation is to go through what is known as enrolment. This is where you dictate text that is already known to the computer for 30 minutes to an hour. The purpose of enrolment is to let the software create a table of vocal references specific to you. This table records the ways in which your pronunciation of phonemes varies from models of speech based on samples from hundreds of people. Phonemes are the smallest sound units that combine to form words, such as "duh", "aw" and "guh" in "dog". English has roughly forty to forty-eight phonemes, depending on how they are counted. The most important factors in speech recognition are the quality of the microphone and the computer's processing power. Once enrolled, you can start using the software by speaking into the microphone just as you would in a conversation.
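As a rough illustration of the table of vocal references, the sketch below stores, for each phoneme, how far your average measurements sit from a general model. The model values, the choice of two measurements (pitch and length), and the function name are all assumptions made for this example, not details of any real product.

```python
from statistics import mean

# Hypothetical model values per phoneme: (pitch in Hz, length in ms).
# A real product would derive these from samples of hundreds of speakers.
MODEL = {
    "d": (120.0, 60.0),
    "aw": (110.0, 140.0),
    "g": (115.0, 70.0),
}


def build_vocal_references(samples):
    """Return {phoneme: (pitch offset, length offset)} for one user.

    `samples` maps a phoneme to the (pitch, length) measurements
    collected while the user read the known enrolment text.
    """
    table = {}
    for phoneme, measurements in samples.items():
        model_pitch, model_length = MODEL[phoneme]
        avg_pitch = mean(pitch for pitch, _ in measurements)
        avg_length = mean(length for _, length in measurements)
        # Store how far this user sits from the general model.
        table[phoneme] = (avg_pitch - model_pitch, avg_length - model_length)
    return table
```

Calling `build_vocal_references({"d": [(130.0, 62.0), (134.0, 66.0)]})` records that this speaker's "duh" is on average 12 Hz higher and 4 ms longer than the model.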

Once you start talking, the analogue signal generated by the microphone is processed about every 15 milliseconds by an analogue-to-digital converter (ADC). The converter translates the analogue sound waves into binary code. The code depends on measurements of several factors: pitch, volume, frequency, length of phonemes, and silences. The binary code is then compressed, which allows the system to process it faster.
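The conversion step can be sketched in a few lines. The example below turns analogue values into whole numbers and cuts the result into 15 millisecond slices, then measures the volume of a slice. The sample rate of 16,000 samples per second, the 16-bit depth and the function names are assumptions chosen for illustration; real sound cards do this work in hardware.

```python
import math

SAMPLE_RATE = 16000  # assumed samples per second
FRAME_MS = 15        # slice length taken from the text above


def quantize(sample, bits=16):
    """Map an analogue value in [-1.0, 1.0] to a signed whole number."""
    max_level = 2 ** (bits - 1) - 1
    clipped = max(-1.0, min(1.0, sample))  # values outside the range are clipped
    return round(clipped * max_level)


def split_into_frames(samples, sample_rate=SAMPLE_RATE, frame_ms=FRAME_MS):
    """Cut a list of samples into consecutive slices of frame_ms milliseconds."""
    frame_len = sample_rate * frame_ms // 1000
    return [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]


def frame_volume(frame):
    """Root-mean-square level of one slice, a simple measure of volume."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


if __name__ == "__main__":
    # One second of a 440 Hz tone stands in for the microphone signal.
    analogue = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
                for n in range(SAMPLE_RATE)]
    digital = [quantize(s) for s in analogue]
    frames = split_into_frames(digital)
    print(len(frames), "frames, first frame volume:", frame_volume(frames[0]))
```

With these assumed settings each full slice holds 240 samples. Pitch, phoneme length and silences would need further measurements on each slice, which are left out here to keep the sketch short.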

After compression, the code is adjusted by the speech software's engine: the phoneme measurements are corrected using your table of vocal references. The code is then sent to the acoustic recognizer, which compares the phonemes to a binary tree database of compressed phoneme measurements compiled from sampling the speech of hundreds of individuals. The acoustic recognizer tries to find as near a match as possible for each measurement, such as pitch, volume, length and tremor. This matching process continues until the engine finds the model phoneme that most closely matches the sample phoneme across the entire range of measurements.
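The matching idea can be illustrated with a small sketch. It corrects a measured phoneme using the speaker's offsets and then picks the model phoneme with the smallest overall difference. Note that it checks every model phoneme in turn rather than searching a binary tree, and that the model values, the three measurements used and the function names are all invented for this example.

```python
# Hypothetical model phonemes: (pitch in Hz, length in ms, volume 0 to 1).
MODEL_PHONEMES = {
    "d": (120.0, 60.0, 0.7),
    "aw": (110.0, 140.0, 0.8),
    "g": (115.0, 70.0, 0.6),
}


def adjust(measurement, offsets):
    """Remove the speaker's personal offsets from a raw measurement."""
    return tuple(m - o for m, o in zip(measurement, offsets))


def distance(a, b):
    """Sum of squared differences across all measurements.

    The measurements are not rescaled here, so pitch and length
    dominate volume; a real engine would weight them.
    """
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest_phoneme(measurement, offsets=(0.0, 0.0, 0.0)):
    """Return the name of the model phoneme nearest to the measurement."""
    adjusted = adjust(measurement, offsets)
    return min(MODEL_PHONEMES,
               key=lambda name: distance(adjusted, MODEL_PHONEMES[name]))
```

For example, `closest_phoneme((130.0, 64.0, 0.7), offsets=(12.0, 4.0, 0.0))` first corrects the measurement to (118.0, 60.0, 0.7) and then settles on "d", the nearest of the three models.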

Once all the phonemes have been recognised, the speech engine compares groups of successive phonemes against a database of known words, known as the lexicon. The results from the lexicon are turned over to the natural language engine, which predicts the most likely combination of words based on the relative frequency of all the groups of three words in the language. For example, it knows that "going to go" is more frequent than "going too go" or "going two go", and therefore displays on screen the word combination that has the highest probability of matching what you said into the microphone only a few seconds ago.
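The three-word frequency idea can be shown with a toy example. The sketch below takes the candidate words the lexicon might return for each position and keeps the combination whose groups of three words are most frequent. The counts are made up for illustration, and adding up counts is a crude stand-in for the probability calculations a real natural language engine would use.

```python
from itertools import product

# Made-up counts of how often each group of three words appears in a language sample.
TRIGRAM_COUNTS = {
    ("going", "to", "go"): 950,
    ("going", "too", "go"): 3,
    ("going", "two", "go"): 1,
}


def score(words):
    """Add up the counts of every run of three consecutive words."""
    return sum(TRIGRAM_COUNTS.get(tuple(words[i:i + 3]), 0)
               for i in range(len(words) - 2))


def most_likely_phrase(candidates):
    """Pick one word per position so that the overall score is highest.

    `candidates` holds, for each position, the words that sound alike.
    """
    return max(product(*candidates), key=score)


if __name__ == "__main__":
    heard = [["going"], ["to", "too", "two"], ["go"]]
    print(" ".join(most_likely_phrase(heard)))
```

Trying every combination is only practical for short phrases; it is used here because it keeps the example easy to follow.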