You will need to develop a VAD (voice activity detection) !
I have developed some with satisfactory results, the methods I know and have tested are:
-
Zero crossing Rate - It consists of detecting how many times the voice signal has crossed the X axis if it has low occurrence of crosses the speech is present, with high occurrence without speech found.
- Energy - Consists of detecting decibels / rms, it is one of the simplest but serious false-positive ways.
- band Pitch Filter- Apply filters to the signal to capture only the voice range of the human being, the human voice is capable of reproducing sounds between 80 and 1100Hz, ie it is a wide spectrum of frequencies which makes things more complicated .
- In addition to applying filters, it is important to capture the frequencies of each Pitch Track, this will help you and in many decisions you can refine your results when faced with the results of other techniques.
Many algorithms use only Zero crossing rate information, see a plot of this technique:
Itisvisiblethecomparisonbetweentheamplitudeofthesignalwiththeoutlineoftheaxiscrossing,intheimagerealizetheZCR(Zerocrossingrate)peaksareexactlywherethespeechisnotpresentthisistotallyreciprocalwiththeamplitudethatisintimatelyconnectedtothesignalenergy.
Ifyoucombinethetechniquesdescribedherewillachievegoodresults,youwillneedtosetthresholdsfornoises,frequencies,axiscrossingsandtimeinsecondsormillisecondsofconsiderablesilence(thepersonmaybespeakingaphrasewithpausesbetweeneachword).
Ofcoursewearetalkingaboutreal-timeprocessing,foreachframeprocesseditisnecessarytoapplythreeormoretechniques,thegreatadvantageisthattheyarenotcomplex,theyarecomputationallyefficientwhichwillallowyoutoknowwheretocutthebeginningandendofeveryphraseorword.
Justsoyouknowgooglecanunderstand"OK google" by having a speech recognition algorithm or everything that is spoken is transcribed into text, this is another story much more complex ....