How to detect when the person starts speaking using SpeechRecognition() in JavaScript

Question


8

I'm using SpeechRecognition (built into the browser) to run voice searches on a website, and I noticed that Google can tell when the person starts talking (both with "Ok Google" and when the person clicks the button to talk). I tried to read their code, but it is heavily minified and obfuscated and I can't make sense of it. Does anyone know how to detect the person's voice while they are speaking into the microphone?

The idea is to detect the moment speech begins, after recognition has already been started.

javascript algorithm

asked by anonymous 23.07.2014 / 00:06

2 answers

3

I think you can use this plugin for what you want.

<script src="//cdnjs.cloudflare.com/ajax/libs/annyang/1.1.0/annyang.min.js"></script>
<script>
    if (annyang) {
      // define the first command, which in your case would be "start"
      // (the handler below uses jQuery's $, so jQuery must be loaded on the page)
      var commands = {
        'start': function() {
          $('#algo').animate({bottom: '-100px'});
        }
      };

      // add the commands to annyang
      annyang.addCommands(commands);

      // start listening and wait for the commands
      annyang.start();
    }
</script>
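
For comparison, the native SpeechRecognition object fires a speechstart event once the browser decides that speech has begun, which may be enough if you only need a "started talking" signal. A minimal sketch, assuming the browser exposes SpeechRecognition or the prefixed webkitSpeechRecognition; watchSpeechStart is a hypothetical helper name, not part of annyang or the Web Speech API:

```javascript
// Hypothetical helper: calls onSpeechStart(timestamp) when the recognizer
// reports that speech has started, then starts listening.
function watchSpeechStart(recognition, onSpeechStart) {
  recognition.onspeechstart = function () {
    onSpeechStart(Date.now());
  };
  recognition.start();
  return recognition;
}

// Browser-only usage, guarded so the snippet does nothing elsewhere.
if (typeof window !== 'undefined') {
  var Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;
  if (Recognition) {
    watchSpeechStart(new Recognition(), function (timestamp) {
      console.log('speech started at', timestamp);
    });
  }
}
```

How quickly the event fires depends on the browser's own detector, so it cannot be tuned the way a hand-written detector can.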


23.07.2014 / 00:17


score 8 · Accepted Answer

You will need to develop a VAD (voice activity detection) algorithm!

I have developed a few with satisfactory results. The methods I know and have tested are:

Zero-crossing rate - Count how many times the voice signal crosses the X axis (zero amplitude). A low number of crossings suggests speech is present; a high number suggests no speech.
Energy - Measure the level of the signal in decibels / RMS. It is one of the simplest methods, but it is prone to serious false positives.
Band-pass (pitch) filter - Apply filters to the signal to keep only the range of the human voice. The human voice can produce sounds between 80 and 1100 Hz, which is a wide band of frequencies and makes things more complicated.
In addition to applying filters, it is important to capture the frequencies of each pitch track. This helps in many decisions: you can refine your results by checking them against the output of the other techniques.
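
The three measurements above can be sketched per frame roughly as follows. This is only an illustration: the function names are hypothetical, a frame is assumed to be an array of samples in the range -1..1, and the pitch estimate uses a plain autocorrelation search restricted to 80-1100 Hz instead of an actual band-pass filter.

```javascript
// Fraction of adjacent sample pairs whose sign differs (0..1).
function zeroCrossingRate(frame) {
  var crossings = 0;
  for (var i = 1; i < frame.length; i++) {
    if ((frame[i - 1] >= 0) !== (frame[i] >= 0)) crossings++;
  }
  return frame.length > 1 ? crossings / (frame.length - 1) : 0;
}

// Root mean square of the frame, a simple measure of its energy.
function rms(frame) {
  var sum = 0;
  for (var i = 0; i < frame.length; i++) sum += frame[i] * frame[i];
  return frame.length ? Math.sqrt(sum / frame.length) : 0;
}

// RMS expressed in decibels relative to full scale (1.0).
function toDecibels(rmsValue) {
  return 20 * Math.log10(Math.max(rmsValue, 1e-12));
}

// Crude pitch estimate: the lag with the strongest autocorrelation,
// searched only over lags that correspond to 80-1100 Hz.
// Returns a frequency in Hz, or null if no positive correlation is found.
function estimatePitch(frame, sampleRate) {
  var minLag = Math.max(1, Math.floor(sampleRate / 1100));
  var maxLag = Math.min(frame.length - 1, Math.floor(sampleRate / 80));
  var bestLag = -1;
  var bestValue = 0;
  for (var lag = minLag; lag <= maxLag; lag++) {
    var sum = 0;
    for (var i = 0; i + lag < frame.length; i++) sum += frame[i] * frame[i + lag];
    if (sum > bestValue) {
      bestValue = sum;
      bestLag = lag;
    }
  }
  return bestLag > 0 ? sampleRate / bestLag : null;
}
```

Each function looks at a single frame only; deciding whether a frame counts as speech still requires thresholds chosen for your microphone and environment.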

Many algorithms rely on the zero-crossing rate alone. See a plot of this technique:

The plot compares the amplitude of the signal with the zero-crossing contour. Notice that the ZCR (zero-crossing rate) peaks fall exactly where speech is not present. This is the inverse of the amplitude, which is closely tied to the energy of the signal.

If you combine the techniques described here you will achieve good results. You will need to set thresholds for noise, frequencies, axis crossings, and the length of silence, in seconds or milliseconds, that counts as significant (the person may be speaking a phrase with pauses between words).
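
One way to combine the thresholds with a silence timeout is a small state machine fed one frame at a time. A sketch under assumptions: createVoiceDetector and its option names are hypothetical, the features (rms, zcr) are assumed to be computed elsewhere for each frame, and the threshold values in the usage note are placeholders rather than tuned constants.

```javascript
// Returns an update(features) function that reports 'start' when speech
// begins, 'end' once silence has lasted maxSilenceMs, and null otherwise.
function createVoiceDetector(opts) {
  var speaking = false;
  var silenceMs = 0;

  return function update(features) {
    // A frame counts as voiced when it is loud enough and crosses zero rarely enough.
    var voiced = features.rms >= opts.minRms && features.zcr <= opts.maxZcr;

    if (voiced) {
      silenceMs = 0; // a short pause inside a phrase is forgiven
      if (!speaking) {
        speaking = true;
        return 'start';
      }
      return null;
    }

    if (speaking) {
      silenceMs += opts.frameMs;
      if (silenceMs >= opts.maxSilenceMs) {
        speaking = false;
        silenceMs = 0;
        return 'end';
      }
    }
    return null;
  };
}
```

For example, createVoiceDetector({ minRms: 0.02, maxZcr: 0.3, frameMs: 20, maxSilenceMs: 600 }) would tolerate pauses of up to 600 ms between words before reporting the end of a phrase.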

Of course, we are talking about real-time processing: for each frame processed you need to apply three or more techniques. The great advantage is that they are not complex and are computationally efficient, which lets you work out where to cut the beginning and end of every phrase or word.
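
Finding the cut points can be sketched as a single pass over fixed-size frames. This assumes the samples are already in a Float32Array; findSpeechSegments is a hypothetical name, and isVoiced stands for whatever per-frame decision you build from the techniques above.

```javascript
// Splits samples into frames of frameSize and returns the speech segments
// as { start, end } sample indices (end is exclusive). A segment is closed
// once more than maxSilentFrames consecutive unvoiced frames are seen.
function findSpeechSegments(samples, frameSize, isVoiced, maxSilentFrames) {
  var segments = [];
  var start = -1; // sample index where the open segment began, -1 if none
  var lastVoicedEnd = 0;
  var silentFrames = 0;

  for (var offset = 0; offset + frameSize <= samples.length; offset += frameSize) {
    var frame = samples.subarray(offset, offset + frameSize);
    if (isVoiced(frame)) {
      if (start < 0) start = offset;
      lastVoicedEnd = offset + frameSize;
      silentFrames = 0;
    } else if (start >= 0) {
      silentFrames++;
      if (silentFrames > maxSilentFrames) {
        segments.push({ start: start, end: lastVoicedEnd });
        start = -1;
        silentFrames = 0;
      }
    }
  }

  // Close a segment that is still open when the audio runs out.
  if (start >= 0) segments.push({ start: start, end: lastVoicedEnd });
  return segments;
}
```

In a browser the samples would typically come from the Web Audio API, for example by reading time-domain data from an AnalyserNode on each tick; that wiring is left out here.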

Just so you know, Google can understand "OK Google" because it runs a speech recognition algorithm, in which everything that is spoken is transcribed into text. That is another, much more complex story.