

Speech Recognition in JavaScript Apps

Written by: Hasan Ahmad

Voice User Interfaces (VUIs) have become increasingly popular. There are a number of reasons why:

  • Recent advances in Machine Learning (ML) have made Natural Language Processing (NLP) a lot better
  • The increasing ubiquity of smart personal computing devices (phones, watches, headphones, VR headsets, IoT home automation, etc.)
  • An aging population with decades of daily internet use, which means a higher percentage of end users have reduced or limited vision, along with the government standards mandating accessibility that come with this

The Web Speech API is actually a two-part API, and you can think of the parts as Voice Input and Voice Output. By Voice Input and Output, I’m talking about the difference between your application parsing voice recordings as user input, and your application actually “speaking” strings of text back to the user as output. The full spec can be found here: https://w3c.github.io/speech-api/speechapi.html

The Voice Input side of things is encapsulated by the Speech Recognition API, and the Voice Output side is abstracted by the Speech Synthesis API. In this post, I’m just going to focus on the Speech Recognition API.
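Browser support for speech recognition still varies, and where the API exists it is often exposed behind a “webkit” prefix, so it’s worth feature-detecting before relying on it. Here’s a minimal sketch:

// Feature detection: the API may be exposed unprefixed, behind the
// "webkit" prefix, or not at all, depending on the browser.
var SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;

if (SpeechRecognition) {
    console.log('Speech recognition is available in this browser');
} else {
    console.log('Speech recognition is not supported here');
}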

Let’s break down the basics of this API:

SpeechRecognition – The entry point into the API

SpeechGrammar – Words/patterns we are looking for in input

SpeechRecognitionResult – Results of matching one or more words

SpeechRecognitionAlternative – Single matching word, with a level of confidence

The SpeechRecognition object is actually ready to start interpreting speech out of the box. Let’s take a look at this JavaScript code snippet:

var SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
var recognizer = new SpeechRecognition();
 
function listen() {
    recognizer.onresult = function (event) {
        // Show the top transcript along with the recognizer's
        // confidence in it, formatted as a percentage.
        var transcript = event.results[0][0].transcript +
            ' (' +
            (event.results[0][0].confidence * 100).toFixed(2) +
            '%)';
        document.querySelector('#record-transcript').innerHTML = transcript;
    };
    recognizer.start();
}

And the accompanying HTML:

<!DOCTYPE html>
<html>
<head><title>Speech Recognition Example</title></head>
<body>
    <button onclick="listen()">
        Listen To Me
    </button>
    <div>
        <h3>What I heard:</h3>
        <h2 id="record-transcript"></h2>
    </div>
    <!-- Load the script from the snippet above (the filename is up to you) -->
    <script src="listen.js"></script>
</body>
</html>

It is an asynchronous API that can start recording the user’s voice through the microphone (if the user grants your page permission to record audio). It takes a callback that streams the results back to you through SpeechRecognitionResult and SpeechRecognitionAlternative objects. If the speech recognizer hears several candidate words or phrases, you’ll get multiple alternatives back, each with a level of confidence. In the above example, as long as the microphone is recording, we capture the results and write the words interpreted by the API into the DOM, along with the confidence the speech recognition algorithm has that it heard those words.
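The example above only reads the first alternative of the first result, but the event carries everything the recognizer produced. Here’s a minimal sketch of iterating over all results and alternatives; the continuous, interimResults, and maxAlternatives properties are part of the same API, though their exact behavior varies by browser:

// Optional tuning before calling recognizer.start():
recognizer.continuous = true;      // keep listening across pauses in speech
recognizer.interimResults = true;  // surface partial hypotheses while the user speaks
recognizer.maxAlternatives = 3;    // request up to 3 alternatives per result

recognizer.onresult = function (event) {
    // event.results holds SpeechRecognitionResult objects, each of which
    // holds one or more SpeechRecognitionAlternative objects.
    for (var i = event.resultIndex; i < event.results.length; i++) {
        var result = event.results[i];
        for (var j = 0; j < result.length; j++) {
            var alternative = result[j];
            console.log(
                (result.isFinal ? 'final: ' : 'interim: ') +
                alternative.transcript +
                ' (' + (alternative.confidence * 100).toFixed(2) + '%)'
            );
        }
    }
};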

You can customize the recognizer with a grammar to focus on certain words over others, and assign weights to the possible phrase interpretations. This can help make your speech recognition work better for a specific domain, locale, or language. The SpeechGrammar object represents a standard JSGF (JSpeech Grammar Format) string that the speech recognizer takes as an input. (These days, grammars are often generated by machine learning algorithms trained on large datasets.) The recognizer will use this grammar as a guide when deciding how to interpret the speech it recognizes. A very simple speech grammar can, for example, associate certain words with the concept of drinks in English. Here’s a simple example:

#JSGF V1.0;
 
grammar drinks;
 
public <drinks> = coffee | tea | milk;
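To hand a grammar like this to the recognizer, the spec defines a SpeechGrammarList (exposed as webkitSpeechGrammarList where the API is prefixed). Here’s a minimal sketch; note that some browsers accept grammars without actually using them to bias recognition:

var SpeechGrammarList = window.SpeechGrammarList || window.webkitSpeechGrammarList;
var drinksGrammar = '#JSGF V1.0; grammar drinks; public <drinks> = coffee | tea | milk;';

var grammars = new SpeechGrammarList();
grammars.addFromString(drinksGrammar, 1); // second argument is the grammar's weight, from 0.0 to 1.0
recognizer.grammars = grammars;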

You can build increasingly sophisticated grammars through recursive nesting, to understand more complex types of phrases. For example:

public <request> = want | need | request;
public <order> = I <request> a <drinks>;

For the purpose of keeping this post focused, I won’t go into further detail about constructing language grammars here (it’s a very big topic). For more information, take a look at the JSGF format docs here: https://www.w3.org/TR/jsgf/. Also, here’s a link to the Wikipedia article on formal grammars: https://en.wikipedia.org/wiki/Formal_grammar

It’s an exciting time to be a software developer. When the personal computing industry went through a revolution driven by mouse-and-keyboard graphical user interfaces (GUIs), we saw the emergence of home desktop operating systems, personal productivity apps, and much of the technology the world depends on today. Another dimension of possibilities was enabled by the combination of touch screens and mobile phones, leading to today’s smartphone platforms like iOS and Android. The emergence of VUIs represents an opportunity to add yet another dimension of capabilities and enhancements to the ongoing evolution of personal computers. Hardware manufacturers are racing to capture market share in the smart speaker space, offering voice interfaces that extend your daily apps with hands-free, location-aware capabilities.

That’s really all you need to get started with speech recognition! It’s not as hard as it sounds. There’s no reason you can’t experiment with your web apps and try out some voice-powered features right now.