Speech-To-Text - Simple voice recognition

merendo
Enthusiast
Enthusiast
Posts: 449
Joined: Sat Apr 26, 2003 7:24 pm
Location: Germany
Contact:

Post by merendo »

Hi there,

I know this topic has been discussed a few times, but unfortunately no real solution has ever been found. Apparently it was not understood that a speech-to-text engine was being looked for, and not text-to-speech, which is rather straightforward.

What I am looking for specifically is a voice recognition system that can respond to a few spoken commands. These commands are trained into the computer beforehand, like: "Please say such and such a word now." - "Please say it again." - "Thank you. I will now be able to recognise this command." and so on. Done this way, I think it should also be possible to make it language-independent.

Any idea how this could be accomplished?
The truth is never confined to a single number - especially scientific truth!
Froggerprogger
Enthusiast
Enthusiast
Posts: 423
Joined: Fri Apr 25, 2003 5:22 pm
Contact:

Post by Froggerprogger »

The main problem is how to define a comparison procedure that compares the recorded audio data with the database. It's like with pictures: there are different procedures to calculate the distance between two pictures. Here you have to calculate the distance between two audio recordings.

Surely it's not possible to compare sample value by sample value; at this low level a meaningful comparison will never be possible. More promising is to transform the audio data via FFT into a spectrum and to save the averaged volume progression over time for several frequencies for later comparison. This should let you do first comparisons between different recordings of the same word, but it will not be very tolerant.

A further approach could then be to 'normalize' not just the volume but the main frequencies too, so that the comparison is more robust against speaking at different pitches. (At least you could give it a try, though the formants (the characteristic and loudest frequencies) of a human voice always stay the same, even if you try to speak higher or lower, so speaking that way destroys the frequency balance of the recording.)

I think modern voice recognition systems try to recognize just syllables, not whole words. And they 'learn' while you speak by building a voice profile that maps your voice onto the recorded reference voice.

I think programming an excellent voice recognition system is very demanding, but a simpler one should be implementable in the way described above.
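
To make the comparison idea concrete, here is a minimal Python sketch of one possible distance measure between two recordings. It assumes each recording has already been reduced to a non-empty list of rows (one per time slice), each holding the volume in a few frequency ranges. The names `normalize`, `resample` and `distance` are made up for this example, and the plain Euclidean distance is just one choice among many, not a tested recipe.

```python
import math

def normalize(rows):
    """Scale all values so the loudest one becomes 1.0 (volume normalization)."""
    peak = max((v for row in rows for v in row), default=0.0)
    if peak == 0:
        return [list(row) for row in rows]
    return [[v / peak for v in row] for row in rows]

def resample(rows, length):
    """Pick `length` rows evenly spaced in time so recordings of
    different duration can be compared row by row."""
    return [rows[min(len(rows) - 1, i * len(rows) // length)]
            for i in range(length)]

def distance(a, b, length=8):
    """Root-mean-square difference between two normalized recordings."""
    a = resample(normalize(a), length)
    b = resample(normalize(b), length)
    total = sum((x - y) ** 2
                for row_a, row_b in zip(a, b)
                for x, y in zip(row_a, row_b))
    return math.sqrt(total / length)

# Hypothetical data: two frequency ranges, four time slices.
word = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
quiet = [[v * 0.5 for v in row] for row in word]   # same word, spoken softly
other = [[1.0, 0.0]] * 4                           # a different sound

same_word = distance(word, quiet)
different_word = distance(word, other)
```

A small distance would then count as a match. Where to put the threshold is something only experiments with real recordings can tell.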
%1>>1+1*1/1-1!1|1&1<<$1=1

Post by merendo »

I also think that it would be best to reduce the recorded words to a minimum of data. This would usually reduce accuracy (e.g. someone else could make your computer think he is you) but make it much more reliable. Maybe it would help to decrease the audio resolution (e.g. 8 kHz for spoken language) and decrease the bitrate, too?

Post by Froggerprogger »

"Maybe it would help to decrease the audio resolution (e.g. 8 kHz for spoken language) and decrease the bitrate, too?"
No, I don't think so. You would lose useful information for the FFT.
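
A small Python sketch of why a low sample rate loses information: at 8 kHz nothing above 4 kHz (half the sample rate) can be represented. The helper `sample_tone` and the chosen frequencies are only for illustration. A real recorder would normally filter such content out before sampling rather than let it fold down, but either way it is gone.

```python
import math

def sample_tone(freq_hz, rate_hz, count):
    """Sample a sine wave of the given frequency at the given sample rate."""
    return [math.sin(2 * math.pi * freq_hz * n / rate_hz) for n in range(count)]

rate = 8000                           # the 8 kHz rate suggested above
high = sample_tone(5000, rate, 64)    # a 5 kHz component, above rate / 2
alias = sample_tone(3000, rate, 64)   # the 3 kHz tone it folds onto

# At 8 kHz the two sample sequences differ only in sign, so an FFT of
# either one shows the same peak at 3 kHz.
largest_gap = max(abs(h + a) for h, a in zip(high, alias))
```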
I would say something like this:

1. record with high quality
2. do the FFT
3. from the 'start' of the word until its 'end', save the averaged/smoothed volume envelopes of some selected frequency ranges over this time.
4. save this information in a data structure

But I have never tested anything like that! It's just an idea. But it should work this way, at least in a rudimentary form. :wink:
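
Steps 2 to 4 above could be sketched in Python roughly like this. It is an untested idea turned into a minimal example, not a finished recognizer: a synthetic 1 kHz tone stands in for the recording of step 1, and the frame size, the three frequency ranges and the names `fft` and `band_envelopes` are assumptions made up for the illustration.

```python
import cmath
import math

def fft(samples):
    """Recursive radix-2 FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def band_envelopes(samples, rate, frame_size=256,
                   bands=((0, 500), (500, 1500), (1500, 4000))):
    """Return one row per frame holding the mean spectral magnitude
    of each frequency range (in Hz)."""
    rows = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        spectrum = fft(samples[start:start + frame_size])
        row = []
        for low, high in bands:
            # Bin k corresponds to the frequency k * rate / frame_size.
            bins = [abs(spectrum[k]) for k in range(frame_size // 2)
                    if low <= k * rate / frame_size < high]
            row.append(sum(bins) / len(bins) if bins else 0.0)
        rows.append(row)
    return rows

rate = 16000
tone = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(1024)]
envelopes = band_envelopes(tone, rate)   # 4 frames x 3 frequency ranges
```

The resulting list of rows is the data structure of step 4. For the test tone, the middle range (500 to 1500 Hz) carries practically all the energy in every frame.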

Post by merendo »

What the heck is an FFT?
josku_x
Addict
Addict
Posts: 997
Joined: Sat Sep 24, 2005 2:08 pm

Post by josku_x »

"Fast Fourier Transform"

Better to look it up on Google. If you are searching for the definition of a word or anything else, type "define:" followed by your word into the Google search box, for example "define:FFT", and you will get definitions.

http://www.google.com/search?hl=fi&q=define:FFT&lr=

simple, clever, handy :lol:
dell_jockey
Enthusiast
Enthusiast
Posts: 767
Joined: Sat Jan 24, 2004 6:56 pm

Post by dell_jockey »

An FFT is a calculation method that assumes that a signal (that is, any time series) is composed of multiple sine waves, each with a different frequency, amplitude and phase shift. FFT is a rather fast method - hence the name - to find the sine waves (and their properties) that make up the time series you're studying.
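
As a small illustration in Python, the sketch below builds a signal from two sine waves and recovers their frequencies and amplitudes. It uses a plain, slow discrete Fourier transform for brevity; an FFT computes the same numbers, only faster. The sample rate and the two frequencies are arbitrary values chosen for the example.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitude of each frequency bin below half the sample rate,
    computed with the plain O(n^2) discrete Fourier transform."""
    n = len(samples)
    return [abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2)]

rate = 64   # samples per second (illustrative)
n = 64      # one second of signal
signal = [1.0 * math.sin(2 * math.pi * 5 * t / rate)
          + 0.5 * math.sin(2 * math.pi * 12 * t / rate)
          for t in range(n)]

mags = dft_magnitudes(signal)
# The two strongest bins correspond to the two sine waves in the signal.
peaks = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2]
frequencies = sorted(k * rate / n for k in peaks)      # in Hz
amplitudes = [2 * mags[k] / n for k in sorted(peaks)]  # per sine wave
```

Here `frequencies` comes out as 5 Hz and 12 Hz, with amplitudes of about 1.0 and 0.5, matching the two components that were mixed in.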

FFT has some limits. For instance, you need a rather large sampling window, and it assumes that the component signals do not change over time, i.e. that the content of the current sampling window is more or less similar to the content of a sampling window some time further down the time series...

Other methods were derived for applications that need to analyze shorter time samples with changing base signal content, the most notable being MESA (PhD thesis by John Parker Burg, Stanford University, 1975). MESA is short for Maximum Entropy Spectrum Analysis.
cheers,
dell_jockey
________
http://blog.forex-trading-ideas.com
kawasaki
Enthusiast
Enthusiast
Posts: 182
Joined: Thu Oct 16, 2003 8:09 pm

Post by kawasaki »

Just interface with Microsoft SAPI. It has speech recognition and text-to-speech built in.
Hydrate
Enthusiast
Enthusiast
Posts: 436
Joined: Mon May 16, 2005 9:37 pm
Contact:

Post by Hydrate »

Would it not be possible in theory to get the computer to learn to recognise not particular pitches, but changes in pitch? For example, when someone speaks a word there will likely be a volume change, a pitch change and a frequency change over the course of that word. The absolute values will never be the same twice, but the pattern of change will be there. You could get the program to learn to recognise certain patterns, such as:

Freq: Higher Lower Higher Lower Lower Lower Higher
Pitch: Higher Lower Higher Higher Lower Lower Higher

Something like that would surely work? Then if the program detects a certain sequence (give or take one error or so), it could decode that as a specific word. It's worth a try, no?
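
A rough Python sketch of this idea, assuming some earlier step has already produced one pitch value per time slice. The function names, the stored words and all the numbers are invented for the example, and whether such coarse patterns are distinctive enough for real speech is an open question.

```python
def directions(values):
    """Turn a series of measurements into 'Higher'/'Lower' steps.
    An unchanged value is counted as 'Lower' to keep the sketch simple."""
    return ["Higher" if b > a else "Lower"
            for a, b in zip(values, values[1:])]

def mismatches(a, b):
    """Number of positions where two patterns disagree; a difference
    in length counts as additional errors."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def recognise(pattern, templates, tolerance=1):
    """Return the stored word whose pattern is closest, or None if even
    the best candidate has more than `tolerance` errors."""
    best = min(templates, key=lambda word: mismatches(pattern, templates[word]))
    return best if mismatches(pattern, templates[best]) <= tolerance else None

# Hypothetical pitch tracks (in Hz) learned earlier for two commands.
templates = {
    "open": directions([100, 140, 120, 150, 130, 110, 160]),
    "close": directions([100, 90, 80, 120, 130, 140, 100]),
}

# A new utterance at a different overall pitch, with one step off.
spoken = directions([210, 260, 240, 250, 255, 230, 280])
result = recognise(spoken, templates)

# A pattern that fits neither stored word.
unknown = recognise(directions([100, 90, 95, 80, 85, 70, 75]), templates)
```

Because only the direction of each change is stored, the absolute pitch of the speaker does not matter, which is the point of the suggestion.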
dige
Addict
Addict
Posts: 1417
Joined: Wed Apr 30, 2003 8:15 am
Location: Germany
Contact:

Post by dige »

@kawasaki: sounds interesting. Do you know more about M$'s speech recognition? URLs, samples, etc.?
KarLKoX
Enthusiast
Enthusiast
Posts: 681
Joined: Mon Oct 06, 2003 7:13 pm
Location: France
Contact:

Post by KarLKoX »

"Qui baise trop bouffe un poil." P. Desproges

http://karlkox.blogspot.com/