Speech processing - we use a modified version of the free, open-source package
'Speex', which has been specifically designed to handle speech files. We only
use it here to clean up our Bluetooth recordings, though. The star of the show
is LPC10, an aggressive speech compression tool (and over 16 years old!).
A run through of how it works
Here is a run-through of what's going on when the program is working. I won't
go into too much detail or get too technical.
A top-level algorithm would be:
1) Open a connection to headset 1, play a welcome message and record a spoken
phrase
2) Do some initial cleaning up of the recording to remove noise, hiss etc.
3) Break up the recording into individual words, then for each word:
3a) Compress the recording so that it is still understandable, but so we have
less data to look at
3b) Compare each frame of the compressed data with known frame sounds from a
codebook to get a list of the 'phoneme' sounds that make up the word
3c) Find the best-matching word based on a 'likelihood' rulebook for our
particular language
3d) Add this word, in the translated language, to the output phrase
4) Switch to the other headset, play back the whole translated phrase and
record the reply
5) Loop back to step 2
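To make the turn-taking concrete, here is a minimal Python sketch of the loop
above. It is only an illustration, not the project's actual code:
run_conversation and the record, play and translate_phrase callbacks are
hypothetical names, translate_phrase stands in for steps 2 to 3d, and the
welcome message from step 1 is left out.

```python
from typing import Any, Callable, List


def run_conversation(
    record: Callable[[int], Any],
    play: Callable[[int, List[str]], None],
    translate_phrase: Callable[[Any, int], List[str]],
    turns: int,
) -> None:
    """Half-duplex loop: only one headset records or plays at any time.

    All three callbacks are hypothetical stand-ins:
      record(headset)              -> raw recording from that headset
      play(headset, phrase)        -> play the translated phrase on that headset
      translate_phrase(rec, spkr)  -> steps 2 to 3d, returns translated words
    """
    speaker = 0                       # "headset 1" in the text
    recording = record(speaker)       # step 1 (welcome message omitted)
    for _ in range(turns):
        phrase = translate_phrase(recording, speaker)  # steps 2 to 3d
        listener = 1 - speaker        # step 4: switch to the other headset
        play(listener, phrase)        # play back the whole translated phrase
        recording = record(listener)  # ...and record the reply
        speaker = listener            # step 5: the reply is processed next
```

Passing the hardware access in as callbacks keeps the sketch independent of
any particular Bluetooth or audio library.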
It would be nice to be able to record, translate and play back on both
headsets at the same time (i.e. 'full duplex'), but because of limitations in
the current Bluetooth drivers, we only allow recording/playback on one headset
at a time ('half duplex'), so the two people have to take turns to speak. Full
duplex might be doable on a more powerful setup, though (and once the Bluetooth
stack supports eSCO properly).
I'll go into detail on some of these items:
2 Clean up recording - You will notice there is a rather nasty buzz present on
recordings made with this setup. I suspect it's an issue with the Bluetooth
dongle I'm using, but I don't know for sure whether it's a headset problem, a
Bluetooth adapter problem, a driver problem or something else. It's a pest,
though. If anybody is getting better quality recordings than the examples on
this site, please let me know how. It's possible to clean up the recording
before processing it, which is a good idea anyway, whether or not a buzz is
present.
The way to do this is to use the '--denoise' flag in
speexenc - the Speex compression tool - to do some Speex/FFT based super
filtering on our recording. It gives really good results.
An alternative, faster approach is to use quick-and-dirty time-domain modeling
and filtering of the buzz with a short C program (filter.c). It doesn't sound
quite as good, but it runs fast as hell.
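As a rough idea of what time-domain modelling and filtering of a buzz can look
like, here is a short Python sketch. It is not the contents of filter.c. It
assumes the buzz repeats exactly every 'period' samples (a hypothetical
parameter you would have to measure from a recording) and that the speech
roughly averages out over many buzz cycles.

```python
from typing import List


def remove_periodic_buzz(samples: List[float], period: int) -> List[float]:
    """Subtract the average waveform of one buzz cycle from the signal.

    Assumption: the buzz is strictly periodic with a known period in samples.
    """
    if period <= 0:
        raise ValueError("period must be a positive number of samples")

    # Model: average all samples that share the same phase within the cycle.
    sums = [0.0] * period
    counts = [0] * period
    for i, sample in enumerate(samples):
        sums[i % period] += sample
        counts[i % period] += 1
    template = [
        sums[p] / counts[p] if counts[p] else 0.0 for p in range(period)
    ]

    # Filter: subtract the modelled buzz, sample by sample.
    return [sample - template[i % period] for i, sample in enumerate(samples)]
```

This needs one pass to build the model and one to subtract it, with no FFT,
which is why this kind of approach is cheap. The catch is that anything in the
speech that happens to repeat at the buzz period gets removed too.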
3a, 3b, 3c - the speech recognition bit gets its own page.
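Just to give a flavour of step 3b, here is a Python sketch of matching frames
against a codebook. Everything in it is an assumption for illustration: the
codebook is taken to be a dictionary from a phoneme label to one reference
frame, frames are plain lists of numbers, the match is by squared Euclidean
distance, and runs of the same label are collapsed into one. The real codebook
format and distance measure may well differ.

```python
from typing import Dict, List, Sequence


def nearest_phoneme(
    frame: Sequence[float], codebook: Dict[str, Sequence[float]]
) -> str:
    """Return the label of the codebook entry closest to this frame."""
    if not codebook:
        raise ValueError("codebook is empty")
    best_label = ""
    best_distance = float("inf")
    for label, reference in codebook.items():
        # Squared Euclidean distance between the frame and the reference.
        distance = sum((a - b) ** 2 for a, b in zip(frame, reference))
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label


def frames_to_phonemes(
    frames: List[Sequence[float]], codebook: Dict[str, Sequence[float]]
) -> List[str]:
    """Label every frame, then collapse consecutive repeats of a label."""
    phonemes: List[str] = []
    for frame in frames:
        label = nearest_phoneme(frame, codebook)
        if not phonemes or phonemes[-1] != label:
            phonemes.append(label)
    return phonemes
```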
3d Swap recognized word for translated word - This is the simplest way to do
speech translation, on a word-for-word basis. It works on a basic level, but
leads to crappy translation for longer phrases, for example the classic 'The
vodka is good but the meat is rotten' translation of 'the spirit is strong but
the flesh is weak'. Phrase-level and sentence-level translation, over and above
simple word swapping, would help here and wouldn't be too difficult to
implement.
The speech synthesis side is just stringing the
required word samples together and playing the result back.
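The word swap and the synthesis step can be sketched in a few lines of Python.
This is only an illustration under assumptions: the EN_TO_FR table, the
function names, passing unknown words through unchanged, and storing each word
sample as a list of integers are all made up for the example.

```python
from typing import Dict, List

# English -> French digits, as a small example vocabulary.
EN_TO_FR: Dict[str, str] = {
    "zero": "zéro", "one": "un", "two": "deux", "three": "trois",
    "four": "quatre", "five": "cinq", "six": "six", "seven": "sept",
    "eight": "huit", "nine": "neuf",
}


def translate_words(words: List[str], table: Dict[str, str]) -> List[str]:
    """Word-for-word swap; unknown words are passed through unchanged."""
    return [table.get(word, word) for word in words]


def synthesise(words: List[str], samples: Dict[str, List[int]]) -> List[int]:
    """String the stored samples for each word together into one buffer."""
    output: List[int] = []
    for word in words:
        output.extend(samples[word])  # raises KeyError if a sample is missing
    return output
```

For example, translate_words(["two", "five"], EN_TO_FR) gives
["deux", "cinq"], and synthesise then joins the stored recordings of those two
words end to end, ready for playback.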
The setup in practice
To keep things simple, I'm only doing translation of numbers between English and
French for now. Having a small pool of words to work with makes the recognition
and translation a lot easier and more reliable.
It should be easy to add other European languages and increase the pool of
words later on.
Installing the Hardware/Software
Detailed instructions on doing the install on a Gumstix are on this page.
(On a Linux PC with working Bluetooth, you just need to download and unzip the
package into your home directory and run it from there.)
Current/Future Work on the project
Currently I'm working on improving the recognition engine (I have only spent a
couple of hours knocking up the rough versions of the codebook and rulebook
files used just now). I will also look at using pocketsphinx as a recognition
engine instead; it seems to be coming on in leaps and bounds, although it seems
to have a pretty steep learning curve.
Feedback
If you have any feedback or comments on the project, feel free to email me:
brian@shapeseeker.com. This address
is also my PayPal tip jar :o)
(I'm also in the job market soon... )