Speech-To-Text - Simple voice recognition
Hi there,
I know this topic has been discussed a few times, but unfortunately a real solution was never found. Apparently it was misunderstood that a speech-to-text engine was being looked for, not a text-to-speech engine, which is rather straightforward.
What I am specifically looking for is a voice recognition system that can respond to a few spoken commands. These commands are "pre-entered" into the computer beforehand, along the lines of: "Please say such and such a word now." - "Please say it again." - "Thank you. I will now be able to recognise this command." And so on. I think it should also be possible to do this in a language-independent way.
Any idea how this could be accomplished?
The truth is never confined to a single number - especially scientific truth!
Froggerprogger
The main problem is how to define a comparison procedure that compares the recorded audio data with the database. It's like with pictures: there are different procedures to calculate the distance between two pictures. Here you have to calculate the distance between two audio recordings.
Surely it's not possible to compare sample value by sample value; at that low level a meaningful comparison will never be possible. More promising is to transform the audio data into a spectrum via FFT and to save the averaged volume progression over time for several frequency bands for later comparison. This should let you make first comparisons between different recordings of the same word, but it will not be very tolerant.
A further approach could then be to 'normalize' not just the volume but the main frequencies too, so the comparison would be more resistant to speaking at different pitches. (At least you could give it a try - though the formants (the characteristic, loudest frequencies) of a human voice stay the same even when you try to speak higher or lower, so speaking that way distorts the frequency balance of the recording.)
I think modern voice recognition systems try to recognize individual syllables rather than whole words. And they 'learn' while you speak by creating a voice profile that adjusts your voice to the recorded reference voice.
I think programming an excellent voice recognition system is very competitive, but a simpler one should be implementable in the way described above.
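A minimal sketch of that band-envelope idea in Python (untested against real audio; the naive O(n²) DFT, the frame size, the band count and the plain Euclidean distance are all placeholder choices - a real system would use an FFT library and some form of time alignment):

```python
import cmath

def dft_magnitudes(frame):
    """Naive DFT; returns the magnitude per frequency bin (O(n^2), fine for a sketch)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))) for k in range(n // 2)]

def band_envelope(samples, frame_size=64, bands=4):
    """Split the recording into frames; per frame, average the DFT magnitudes into coarse bands."""
    envelope = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        mags = dft_magnitudes(samples[start:start + frame_size])
        step = len(mags) // bands
        envelope.append([sum(mags[b * step:(b + 1) * step]) / step for b in range(bands)])
    return envelope

def distance(env_a, env_b):
    """Euclidean distance between two equal-length envelopes (no time warping)."""
    return sum((x - y) ** 2 for fa, fb in zip(env_a, env_b)
               for x, y in zip(fa, fb)) ** 0.5
```

Two recordings of the "same word" (similar spectra) should then come out closer to each other than to a recording with a different spectral content.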
%1>>1+1*1/1-1!1|1&1<<$1=1
I also think that it would be best to reduce the recorded words to a minimum of data. This would usually deteriorate quality (e.g. someone else could make your computer think he is you) but make it much more reliable. Maybe it would help to decrease the sampling rate (e.g. 8 kHz for spoken language) and decrease the bit depth, too?
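Reducing the sample rate could be sketched like this (a crude decimation by block averaging, just to illustrate the idea - a proper resampler would low-pass filter first to avoid aliasing):

```python
def downsample(samples, factor):
    """Crude decimation: average each block of `factor` samples into one.
    (A real resampler would apply a low-pass filter first to avoid aliasing.)"""
    return [sum(samples[i:i + factor]) / factor
            for i in range(0, len(samples) - factor + 1, factor)]
```

For example, going from 48 kHz to 8 kHz would use `factor=6`, shrinking the data sixfold.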
Froggerprogger
Maybe it would help to decrease the sampling rate (e.g. 8 kHz for spoken language) and decrease the bit depth, too?
No, I don't think so. You would lose useful information for the FFT.
I would say something like this:
1. record with high quality
2. do the FFT
3. from the 'start' of the word until its 'end', save the averaged/smoothed volume envelopes over that time for some selected frequency ranges
4. save this information in a data structure
But I never tested anything like that! It's just an idea. But it should work this way, at least in a rudimentary fashion.
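The steps above could be sketched roughly as follows (equally untested; the silence threshold, the frame count and the dictionary layout are made-up illustrations, and the envelope here is just smoothed amplitude rather than per-band FFT output):

```python
def trim_silence(samples, threshold=0.05):
    """Find the 'start' and 'end' of the word: drop leading/trailing quiet samples."""
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    return samples[loud[0]:loud[-1] + 1] if loud else []

def make_template(word, samples, frames=8):
    """Store a smoothed volume envelope (mean absolute amplitude per frame) under the word."""
    trimmed = trim_silence(samples)
    size = max(1, len(trimmed) // frames)
    envelope = [sum(abs(s) for s in trimmed[i:i + size]) / size
                for i in range(0, size * frames, size)]
    return {"word": word, "envelope": envelope}
```

Recognition would then mean building the same envelope for a new recording and picking the stored template with the smallest distance to it.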
"Fast Fourier Transform"
Better look it up on Google. And if you are searching for a definition of a word or anything else, put "define:" in front of it in the Google search box, for example "define:FFT", and you will find definitions.
http://www.google.com/search?hl=fi&q=define:FFT&lr=
simple, clever, handy
dell_jockey
An FFT is a calculation method that assumes a signal (any time series, that is) is composed of multiple sine waves, each with a different frequency, amplitude and phase shift. The FFT is a rather fast method - hence the name - for finding the component signals (and their properties) that make up the time series you're studying.
The FFT has some limits; for instance, you need a rather large sampling window, and it assumes that the component signals do not change over time, i.e. the content of the current sampling window would be roughly similar to the content of a sampling window some way further down the time series...
Other methods were derived for applications that need to analyze shorter time samples with changing base signal content, most notably MESA (Ph.D. thesis by John Parker Burg, Stanford University, 1975). MESA is short for Maximum Entropy Spectrum Analysis.
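That "composed of sine waves" claim is easy to see with a toy example (naive O(n²) DFT for illustration; an FFT library computes the same result faster). A signal built from two sines has magnitude peaks at exactly those two frequency bins:

```python
import cmath
import math

def dft(signal):
    """Discrete Fourier transform (naive O(n^2) form; the FFT computes the same result)."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

# A signal built from two sine waves: one at 3 cycles per window, one at 10...
n = 64
signal = [math.sin(2 * math.pi * 3 * t / n) + 0.5 * math.sin(2 * math.pi * 10 * t / n)
          for t in range(n)]
# ...whose DFT magnitudes peak only at bins 3 and 10 (and their mirror bins).
mags = [abs(c) for c in dft(signal)]
```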
Would it not be possible, in theory, to get the computer to recognise not certain pitches, but changes in pitch? For example, when someone speaks a word there will likely be a volume change, a pitch change and a frequency change over that word. They will never be identical between recordings, but there will be a change. You could get the program to learn to recognise certain patterns, such as:
Freq: Higher Lower Higher Lower Lower Lower Higher
Pitch: Higher Lower Higher Higher Lower Lower Higher
Something like that would surely work? Then if the program detects certain sequences (give or take one error or so), it could decode that as a specific word. It's worth a try, no?
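That Higher/Lower idea is cheap to try. A sketch (names are made up; ties count as "Lower", and the one-error allowance is a simple mismatch count):

```python
def to_pattern(values):
    """Reduce a sequence of measurements to 'H'/'L' steps: higher or lower than the previous value."""
    return "".join("H" if b > a else "L" for a, b in zip(values, values[1:]))

def matches(pattern, reference, tolerance=1):
    """True if the two patterns have the same length and differ in at most `tolerance` positions."""
    if len(pattern) != len(reference):
        return False
    return sum(p != r for p, r in zip(pattern, reference)) <= tolerance
```

So a measured sequence like 1, 3, 2, 4, 3, 2, 1, 5 becomes "HLHLLLH" (the Freq row above), and a new recording whose pattern differs in one position would still be accepted.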

