Build a voice recognition app with Javascript!


If you're like me, the coolest thing an app can do is talk back to you, and that's exactly the functionality we'll be building today!

This is a very entertaining project, especially if you're getting into JavaScript and want to see how fun it can be. I have a live version that's a bit more complex but built on the same basics we'll see in this tutorial. You can check out the live version and repository here.

Let's dive right into it!!


Let's set up our HTML document and our CSS!

HTML document

We need to be able to display whether the browser supports speech synthesis, a button we can click so the page starts to 'listen', and a div to register our commands.

<header>
    <h1 id="supportMsg"></h1>
    <button class="converse"> Converse </button>

</header>
    <div class="words" contenteditable=""></div>

⚠️ Don't forget to link your CSS file and your script.js

CSS file

Let's do some basic styling. Feel free to change this however you'd like!

body {
    background-color: rgba(193, 159, 238, 0.973);
    font-family: 'M PLUS 2', 'Helvetica Neue', sans-serif;
    font-weight: 400;
    font-size: 20px;
}

header{
    display: flex;
    align-items: center;
    justify-content: center;
    flex-direction: column;
    margin-top: -10px;
}
.words {
    max-width: 400px;
    margin: 50px auto;
    background: white;
    border-radius: 20px;
    box-shadow: 10px 10px 0 rgba(0,0,0,0.1);
    padding: 1rem 2rem;
    background-size: 90% 2rem;
    position: relative;
    line-height: 2rem;
    pointer-events: none;
    font-size: 17px;
}

p {
    margin: 0 0 3rem;
}

.converse {
    width: 80px;
    height: 30px;
    margin-top: 5px;
    border-radius: 20px;
    border-color: transparent;
    box-shadow: 10px 10px 0 rgba(0,0,0,0.1);
    background-color: #fff;
    font-family: 'M PLUS 2', sans-serif;
    font-weight: 800;
}

.converse:hover {
    box-shadow: 5px 5px 0 rgba(0, 0, 0, 0.575);
}
h3{
    font-size: 15px;
    padding: 10px;
    text-align: center;
}

But why are we styling p when we don't have it in our HTML document? We're going to be adding a 'p' element using JavaScript later; that's where our commands will be displayed.

It should look something like this:

Time for the fun part!! Let's start working with JavaScript.

Script.js

As usual, we need to 'select' the elements from our HTML document that we'll be adding functionality to: in this case, our converse button, our words div, and the supportMsg:

const converse = document.querySelector('.converse')
const supportMsg = document.getElementById('supportMsg')
const saidWords = document.querySelector('.words')

Voice recognition functionality

We'll be using window.SpeechRecognition, which lives in the browser. Let's create a new SpeechRecognition object, assign it to our recognition variable, and set the language it should listen for.

window.SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition
const recognition = new SpeechRecognition()
recognition.lang = 'en-US'
//recognition.lang = 'es-MX' <- for Spanish

Now we need to create our p element and append the words we're saying to our newly created p.

let p = document.createElement('p')
saidWords.appendChild(p)

For the actual listening part, we need a couple of things:

  • We need to listen for the 'result' event (a SpeechRecognitionEvent, which you can inspect in the console).
  • We need to turn the data we're getting into an array, assign it to our transcript variable, map over it, and join the pieces of data together. I talk more about MAP here.
  • Lastly, we need to set the text content of our p element to be our transcript. Keep in mind that for it not to keep overwriting itself, we need to check if the phrase we said is final, and if so, append a fresh p child for the next phrase.
  • Also, start the speech recognition by adding an event listener to our button.

Let's see the code:

recognition.addEventListener('result', e => {
    const transcript = Array.from(e.results)
        .map(result => result[0])
        .map(result => result.transcript)
        .join('')

    p.textContent = transcript
    // check if the result is final so it doesn't rewrite itself
    if (e.results[0].isFinal) {
        p = document.createElement('p')
        saidWords.appendChild(p)
    }
    console.log(transcript)
})
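If the mapping and joining above feels abstract, here's the same transcript logic as a standalone function you can run outside the browser. The nested-array shape of the mock below just mimics what e.results gives you (each result is an array-like whose first entry carries a transcript property); the mock data and the joinTranscript name are mine, for illustration only.

```javascript
// Pure version of the transcript logic, testable without a browser.
function joinTranscript(results) {
  return Array.from(results)
    .map(result => result[0])
    .map(result => result.transcript)
    .join('')
}

// Mocked event.results, for illustration only:
const fakeResults = [
  [{ transcript: 'hello ' }],
  [{ transcript: 'world' }]
]

console.log(joinTranscript(fakeResults)) // "hello world"
```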

  converse.addEventListener('click', () => {
    recognition.start();
  })

But Yuri!!! It's not talking to me yet ! :((((

Do not fret, friend! Let's start working on the functionality for it to talk to us.

Talking functionality

It's not much different from what we did for our speech recognition, but first we need to check whether it's supported by the browser and display a message accordingly. Let's do it with a simple if statement.

if('speechSynthesis' in window){
    supportMsg.innerHTML = "Your browser supports speech synthesis! Rejoice! You can converse with me. Say hi :)";
  } else {
    supportMsg.innerHTML = "Sorry, your browser does NOT support speech synthesis. You can't converse with me :(";
  }

We need to go over what was just said, same as our transcript logic, with one important distinction: we call a readOutLoud function and pass it the text variable. This function, which we haven't written yet, will take care of checking the speech (what was said) and matching it against any commands we program it to respond to later.

  recognition.onresult = function(event){
    const current = event.resultIndex;
    const text = event.results[current][0].transcript 

    readOutLoud(text)
  }

We're getting there! Now a couple of last things:

  • For an actual response, we need our readOutLoud function to take the message we have already established to be our text variable.
  • We will also assign to our speech variable a new SpeechSynthesisUtterance(), which is a speech request: it holds the content it should read and how (volume, rate, pitch, and lang are just some of the properties we'll be working with).
  • For it to respond, we use a simple if condition where speech.text (what it will respond with) is what we set the page to answer with.
  • Don't forget to call window.speechSynthesis.speak(speech), passing it our speech, for it to actually talk to us.

function readOutLoud(message){
    const speech = new SpeechSynthesisUtterance();

    if (message.includes('hi') || message.includes('hey') || message.includes('hello')) {
        const finalText = "Hello"
        speech.text = finalText // I was being extra; you can totally do this in one line: speech.text = "Hello"
    }
    speech.volume = 1;
    speech.rate = 1;
    speech.pitch = 1;
    speech.lang = 'en-US';
    window.speechSynthesis.speak(speech);
}
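The if chain above works, but it gets unwieldy as you add commands. One way to keep it tidy is a small list of trigger words mapped to responses. This is just a sketch; the commands array and the matchCommand name are my own invention, not part of the Web Speech API:

```javascript
// Hypothetical helper: maps trigger words to spoken responses.
// Extend the array to teach the page new commands.
const commands = [
  { triggers: ['hi', 'hey', 'hello'], response: 'Hello' },
  { triggers: ['how are you'],        response: 'I am doing great, thanks!' }
]

function matchCommand(message) {
  const found = commands.find(c =>
    c.triggers.some(trigger => message.includes(trigger))
  )
  return found ? found.response : null
}

matchCommand('hey there') // → "Hello"
```

Inside readOutLoud you would then write something like: const reply = matchCommand(message); if (reply) speech.text = reply;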

This is our final code:

window.SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition 
const converse = document.querySelector('.converse')
const supportMsg = document.getElementById('supportMsg')
const saidWords = document.querySelector('.words')

const recognition = new SpeechRecognition()
recognition.lang = 'en-US'
let p = document.createElement('p')
saidWords.appendChild(p)

recognition.addEventListener('result', e=> {
    const transcript = Array.from(e.results)
    .map(result => result[0])
    .map(result => result.transcript)
    .join('')

    p.textContent = transcript
    //check if the result is final for it not to rewrite itself
    if(e.results[0].isFinal){
        p = document.createElement('p')
        saidWords.appendChild(p)
    }
    console.log(transcript)    
})

// Converse with page option
// Check if its supported
if('speechSynthesis' in window){
    supportMsg.innerHTML = "Your browser supports speech synthesis! Rejoice! You can converse with me. Say hi :)";
  } else {
    supportMsg.innerHTML = "Sorry, your browser does NOT support speech synthesis. You can't converse with me :(";
  }

    recognition.onresult = function(event){
    const current = event.resultIndex;
    const text = event.results[current][0].transcript 

    readOutLoud(text)
  }

  //Actual response
  function readOutLoud(message){
    const speech = new SpeechSynthesisUtterance();

    if(message.includes('hi') || message.includes('hey') || message.includes('hello')){
      const finalText = "Hello"
      speech.text= finalText
    }
    speech.volume = 1;
    speech.rate = 1;
    speech.pitch = 1;
    speech.lang = 'en-US';
    window.speechSynthesis.speak(speech);   
  }

  converse.addEventListener('click', () => {
    recognition.start();
  })

You can make it answer to different commands, trigger animations with it or even hook up an API and make it answer more complex questions! The possibilities are endless, my friend!


I really hope you enjoyed it and learned something new today!!

Don't hesitate to contact me and let me know if you'd like to add something else in the comments.

Thank you for reading! :)

☕If you enjoy my content Buy me a coffee It'll help me continue making quality blogs💕

Connect with me:

💙Follow me on Twitter to know more about my self-taught journey!

💜Make sure to check out more articles on my JavaScript For Newbies Series

❤️ Also subscribe to my Youtube Channel !

🖤And for more content and tips on TikTok !