Speech-To-Text - Simple voice recognition
Hi there,
I know this topic has been discussed a few times, but unfortunately a real solution was never found. Apparently it was misunderstood that a speech-to-text engine was being looked for, not a text-to-speech engine, which is rather straightforward.
What I am specifically looking for is a voice recognition system that can respond to a few spoken commands. These commands are "pre-entered" into the computer beforehand, along the lines of: "Please say such and such a word now." - "Please say it again." - "Thank you. I will now be able to recognise this command." And so on. I think it should also be possible to do this in a language-independent way.
Any idea how this could be accomplished?
The truth is never confined to a single number - especially scientific truth!
Froggerprogger
The main problem is how to define a comparison procedure that compares the recorded audio data with the database. It's like with pictures: there are different procedures to calculate the distance between two pictures. Here you have to calculate the distance between two audio recordings.
Surely it's not possible to compare sample value by sample value; at that low level a meaningful comparison will never be possible. More promising is to transform the audio data into a spectrum via FFT and to save the averaged volume progression over time for several frequency bands for later comparison. This should let you make first comparisons between different recordings of the same word, but it will not be very tolerant.
A further approach could then be to 'normalize' not just the volume but the main frequencies too, so the comparison would be more resistant to speaking at different pitches. (At least you could give it a try - though the formants (the characteristic, loudest frequencies) of a human voice stay the same even when you try to speak higher or lower, so speaking that way distorts the frequency balance of the recording.)
I think modern voice recognition systems try to recognize individual syllables rather than whole words. And they 'learn' while you speak by creating a voice profile that adjusts your voice to the recorded reference voice.
I think programming an excellent voice recognition system is very competitive, but a simpler one should be implementable in the way described above.
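A minimal sketch of that band-envelope idea in Python (untested against real audio; the naive O(n²) DFT, the frame size, the band count and the plain Euclidean distance are all placeholder choices - a real system would use an FFT library and some form of time alignment):

```python
import cmath

def dft_magnitudes(frame):
    """Naive DFT; returns the magnitude per frequency bin (O(n^2), fine for a sketch)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))) for k in range(n // 2)]

def band_envelope(samples, frame_size=64, bands=4):
    """Split the recording into frames; per frame, average the DFT magnitudes into coarse bands."""
    envelope = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        mags = dft_magnitudes(samples[start:start + frame_size])
        step = len(mags) // bands
        envelope.append([sum(mags[b * step:(b + 1) * step]) / step for b in range(bands)])
    return envelope

def distance(env_a, env_b):
    """Euclidean distance between two equal-length envelopes (no time warping)."""
    return sum((x - y) ** 2 for fa, fb in zip(env_a, env_b)
               for x, y in zip(fa, fb)) ** 0.5
```

Two recordings of the "same word" (similar spectra) should then come out closer to each other than to a recording with a different spectral content.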
%1>>1+1*1/1-1!1|1&1<<$1=1
I also think that it would be best to reduce the recorded words to a minimum of data. This would usually deteriorate quality (e.g. someone else could make your computer think he is you) but make it much more reliable. Maybe it would help to decrease the sampling rate (e.g. 8 kHz for spoken language) and decrease the bit depth, too?
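Reducing the sample rate could be sketched like this (a crude decimation by block averaging, just to illustrate the idea - a proper resampler would low-pass filter first to avoid aliasing):

```python
def downsample(samples, factor):
    """Crude decimation: average each block of `factor` samples into one.
    (A real resampler would apply a low-pass filter first to avoid aliasing.)"""
    return [sum(samples[i:i + factor]) / factor
            for i in range(0, len(samples) - factor + 1, factor)]
```

For example, going from 48 kHz to 8 kHz would use `factor=6`, shrinking the data sixfold.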
Froggerprogger
Maybe it would help to decrease the sampling rate (e.g. 8 kHz for spoken language) and decrease the bit depth, too?
No, I don't think so. You would lose useful information for the FFT.
I would say something like this:
1. record with high quality
2. do the FFT
3. from the 'start' of the word until its 'end', save the averaged/smoothed volume envelopes over that time for some selected frequency ranges
4. save this information in a data structure
But I never tested anything like that! It's just an idea. But it should work this way, at least in a rudimentary fashion.
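The steps above could be sketched roughly as follows (equally untested; the silence threshold, the frame count and the dictionary layout are made-up illustrations, and the envelope here is just smoothed amplitude rather than per-band FFT output):

```python
def trim_silence(samples, threshold=0.05):
    """Find the 'start' and 'end' of the word: drop leading/trailing quiet samples."""
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    return samples[loud[0]:loud[-1] + 1] if loud else []

def make_template(word, samples, frames=8):
    """Store a smoothed volume envelope (mean absolute amplitude per frame) under the word."""
    trimmed = trim_silence(samples)
    size = max(1, len(trimmed) // frames)
    envelope = [sum(abs(s) for s in trimmed[i:i + size]) / size
                for i in range(0, size * frames, size)]
    return {"word": word, "envelope": envelope}
```

Recognition would then mean building the same envelope for a new recording and picking the stored template with the smallest distance to it.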
"Fast Fourier Transform"
Better look it up on Google. And if you are searching for a definition of a word or anything else, put "define:" in front of it in the Google search box, for example "define:FFT", and you will find definitions.
http://www.google.com/search?hl=fi&q=define:FFT&lr=
simple, clever, handy
dell_jockey
An FFT is a calculation method that assumes a signal (any time series, that is) is composed of multiple sine waves, each with a different frequency, amplitude and phase shift. The FFT is a rather fast method - hence the name - for finding the component signals (and their properties) that make up the time series you're studying.
The FFT has some limits; for instance, you need a rather large sampling window, and it assumes that the component signals do not change over time, i.e. the content of the current sampling window would be roughly similar to the content of a sampling window some way further down the time series...
Other methods were derived for applications that need to analyze shorter time samples with changing base signal content, most notably MESA (Ph.D. thesis by John Parker Burg, Stanford University, 1975). MESA is short for Maximum Entropy Spectrum Analysis.
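That "composed of sine waves" claim is easy to see with a toy example (naive O(n²) DFT for illustration; an FFT library computes the same result faster). A signal built from two sines has magnitude peaks at exactly those two frequency bins:

```python
import cmath
import math

def dft(signal):
    """Discrete Fourier transform (naive O(n^2) form; the FFT computes the same result)."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

# A signal built from two sine waves: one at 3 cycles per window, one at 10...
n = 64
signal = [math.sin(2 * math.pi * 3 * t / n) + 0.5 * math.sin(2 * math.pi * 10 * t / n)
          for t in range(n)]
# ...whose DFT magnitudes peak only at bins 3 and 10 (and their mirror bins).
mags = [abs(c) for c in dft(signal)]
```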
Would it not be possible, in theory, to get the computer to recognise not certain pitches, but changes in pitch? For example, when someone speaks a word there will likely be a volume change, a pitch change and a frequency change over that word. They will never be identical between recordings, but there will be a change. You could get the program to learn to recognise certain patterns, such as:
Freq: Higher Lower Higher Lower Lower Lower Higher
Pitch: Higher Lower Higher Higher Lower Lower Higher
Something like that would surely work? Then if the program detects certain sequences (give or take one error or so), it could decode that as a specific word. It's worth a try, no?
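That Higher/Lower idea is cheap to try. A sketch (names are made up; ties count as "Lower", and the one-error allowance is a simple mismatch count):

```python
def to_pattern(values):
    """Reduce a sequence of measurements to 'H'/'L' steps: higher or lower than the previous value."""
    return "".join("H" if b > a else "L" for a, b in zip(values, values[1:]))

def matches(pattern, reference, tolerance=1):
    """True if the two patterns have the same length and differ in at most `tolerance` positions."""
    if len(pattern) != len(reference):
        return False
    return sum(p != r for p, r in zip(pattern, reference)) <= tolerance
```

So a measured sequence like 1, 3, 2, 4, 3, 2, 1, 5 becomes "HLHLLLH" (the Freq row above), and a new recording whose pattern differs in one position would still be accepted.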

