Aquent | DEV6

Let’s Talk About Speech

Written by: Alain Thibodeau

SpeechSynthesis is a component of the Web Speech API that lets you add text-to-speech to your web app. It comes in handy when you need to enhance an app with voice, and it has come a long way. For example, I have used it to read strings served by socket events, which allowed the user to get information from the application hands-free, without looking at the screen.

SpeechSynthesis is still experimental, and not all browsers support it or all of its features. However, for the browsers that do support it, it’s really easy to get your web app talking. So let’s take a look at a few examples and discuss some caveats.

Browser, speak!

In its simplest form, we can get the SpeechSynthesis to say something in one line!

window.speechSynthesis.speak(new SpeechSynthesisUtterance('Hello world'));

Pretty easy huh? But what is going on here?

The browser has access to the device’s speech synthesiser and exposes it via the window’s speechSynthesis interface. The speechSynthesis ‘speak’ method expects an instance of SpeechSynthesisUtterance. We can call speak multiple times with new utterance instances; each one is added to a queue and spoken in turn. If you need it to stop speaking and clear the queue, call the ‘cancel’ method of speechSynthesis.
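
For instance, a couple of queued utterances and a stop button might look like the sketch below. The #stop button id is made up for the example.

const synth = window.speechSynthesis;

// Each call to speak() is queued and spoken in order
synth.speak(new SpeechSynthesisUtterance('First sentence.'));
synth.speak(new SpeechSynthesisUtterance('Second sentence.'));

// Stop speaking immediately and empty the queue
// (wired to a hypothetical #stop button)
document.querySelector('#stop').addEventListener('click', function () {
  synth.cancel();
});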

You might want to take it a step further and change the voice that you want to use, or maybe even speak in another language. So let’s take a look at that.

Getting Voices

Every device has a list of pre-installed voices, and we can access them using SpeechSynthesis.getVoices(). This method returns an array of voices, which are instances of SpeechSynthesisVoice that look like this:

{voiceURI: "Alex", name: "Alex", lang: "en-US", localService: true, default: true}

Heads up: different browsers return different lists of voices, even on the same computer or device. In my experience, Chrome usually gives access to more voices, but keep in mind that some of these extra voices may not be local. To find out, inspect the voice’s ‘localService’ property. If it is false, the voice is remote, which may impact performance and use bandwidth.
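
If you want to stick to local voices, a quick filter on that property does the trick. A minimal sketch:

const voices = window.speechSynthesis.getVoices();

// Keep only the voices that are synthesized on the device itself
const localVoices = voices.filter(function (voice) {
  return voice.localService;
});

console.log(localVoices.map(function (voice) { return voice.name; }));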

Where are my voices?

You may have tried getVoices() already and noticed you are getting an empty array. This is because the browser loads the voice list asynchronously; until the voices have been loaded from the system, getVoices() has nothing to return.

One strategy is to use the “onvoiceschanged” event. To handle this, we can use what the documentation recommends:

function populateVoiceList() {
  var voices = window.speechSynthesis.getVoices();
  if (voices.length !== 0) {
    console.log(voices);
    // do something with the voices.
  }
}

populateVoiceList();

if (speechSynthesis.onvoiceschanged !== undefined) {
  speechSynthesis.onvoiceschanged = populateVoiceList;
}

Unfortunately, Firefox and Safari don’t support this event. Usually the voices are ready almost instantaneously, but you can use an interval timer instead to be certain:

var timer = setInterval(function () {
  var voices = speechSynthesis.getVoices();
  if (voices.length !== 0) {
    console.log(voices);
    // do something with the voices.
    clearInterval(timer);
  }
}, 200);
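
If you prefer not to sprinkle callbacks around, the two strategies can be wrapped in a single Promise-based helper. This is just a sketch; the helper name and the 2-second timeout are arbitrary choices:

function loadVoices(timeoutMs) {
  return new Promise(function (resolve) {
    const initial = window.speechSynthesis.getVoices();
    if (initial.length !== 0) {
      return resolve(initial);
    }

    function check() {
      const voices = window.speechSynthesis.getVoices();
      if (voices.length !== 0) {
        clearInterval(timer);
        clearTimeout(deadline);
        resolve(voices);
      }
    }

    // Poll as a fallback for browsers without onvoiceschanged,
    // and give up after the timeout with whatever we have.
    var timer = setInterval(check, 200);
    var deadline = setTimeout(function () {
      clearInterval(timer);
      resolve(window.speechSynthesis.getVoices());
    }, timeoutMs || 2000);

    window.speechSynthesis.onvoiceschanged = check;
  });
}

loadVoices().then(function (voices) {
  console.log(voices);
});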

By now I am sure you have noticed the “lang” property of the voice object. Let’s see next what this is all about.

Look ma, je parle français

SpeechSynthesisUtterance can also speak languages other than English, which is nice for international apps. Out of the box, this is pretty straightforward: you tell the utterance instance what language to speak, using a BCP 47 language tag on its ‘lang’ property, and give it the text in the target language. The synthesiser then picks a voice that supports that language and reads the text.

It is as easy as this:

const utterance = new SpeechSynthesisUtterance();
utterance.lang = 'fr-CA';
utterance.text = 'Bonjour';
window.speechSynthesis.speak(utterance);

Do I have that voice?

If the system has no voice that supports the language you need, the utterance falls back to the system’s default voice. This might end up sounding like a drunken sailor uttering words in the corner of a bar.

So, you might want to check whether the system supports the language you want it to speak. The language is specified via the ‘lang’ property of the voice object, which holds a BCP 47 language tag such as en-US.

Building upon our previous examples, let’s get the voices, then iterate over the list to grab the first one that supports the locale we need. The next step is to create an utterance with that voice and the localized text. Finally, we hand the utterance to speechSynthesis to speak.

Note that some devices, Android for example, report the language tag with an underscore, en_US rather than en-US, so the code below normalizes the separator before comparing. With that in mind, this all looks like this:

var voices;

function getLanguageVoice(locale) {
  // Some platforms report the tag with an underscore (fr_CA), so normalize it before comparing
  for (var i = 0, len = voices.length; i < len; i++) {
    var voice = voices[i];
    if (voice.lang.replace('_', '-') === locale) {
      return voice;
    }
  }
  return null;
}

function speak() {
  const voice = getLanguageVoice('fr-CA');
  if (voice) {
    const utterance = new SpeechSynthesisUtterance();
    utterance.voice = voice;
    utterance.text = 'Bonjour';
    window.speechSynthesis.speak(utterance);
  } else {
    console.log('language not supported');
  }
}

function getVoiceList() {
  voices = window.speechSynthesis.getVoices();
  if (voices.length !== 0) {
    speak();
  }
}

getVoiceList();

if (speechSynthesis.onvoiceschanged !== undefined) {
  speechSynthesis.onvoiceschanged = getVoiceList;
}

I’ve created a Plunk that demonstrates the above code working, check it out here: https://plnkr.co/edit/6k3mVG

My iPad won’t speak?!

Safari on iOS (and possibly other browsers) doesn’t allow speechSynthesis to be used purely programmatically: speech must be triggered by a user-initiated action. If your app only needs to speak when a button is pressed, you will not have any issues. If you need to speak text coming from a web socket, as I did, then this is a problem.

Not all hope is lost. The fine print of this rule is that speechSynthesis only needs to be user-initiated the first time. So we can set up a checkbox that toggles the voice, and when the checkbox turns the voice on for the first time we tell speechSynthesis to speak an empty string. The toggle function could look something like this:

let isVoiceEnabled;

function toggleVoice() {
  // The very first utterance must come from a user gesture, so "unlock"
  // speechSynthesis by speaking an empty string the first time the voice is enabled.
  if (isVoiceEnabled === undefined) {
    window.speechSynthesis.speak(new SpeechSynthesisUtterance(''));
  }
  // From this point on we can call speechSynthesis from socket events
  isVoiceEnabled = !isVoiceEnabled;
}

Phew, this satisfies the requirement and allows the speechSynthesis to speak more text coming from the web socket.
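
To give an idea of how this fits together, here is a rough sketch that wires the toggle to a checkbox and speaks messages arriving over a web socket. The element id, socket URL, and message format are made up for the example:

const socket = new WebSocket('wss://example.com/updates'); // placeholder URL

// The checkbox change is a user gesture, so the first toggle "unlocks" speech
document.querySelector('#voiceToggle').addEventListener('change', toggleVoice);

socket.onmessage = function (event) {
  if (isVoiceEnabled) {
    window.speechSynthesis.speak(new SpeechSynthesisUtterance(event.data));
  }
};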

Anything else?

What has been mentioned only scratches the surface. I encourage you to read the documentation and explore the other features that SpeechSynthesis offers.