Building a Text to Speech App Using AVSpeechSynthesizer

Last updated on June 28th, 2015

iOS is an operating system with many possibilities, allowing you to create anything from really simple to super-advanced applications. There are times when applications have to be multi-featured, providing elegant solutions that go beyond the commonplace and lead to a superb user experience. There are also numerous technologies one can exploit, and in this tutorial we are going to focus on one of them: text to speech.

Text-to-speech (TTS) is not new in iOS 8. Since iOS 7, dealing with TTS has been really easy, as the code required to make an app speak is straightforward and easy to handle. To be more precise, iOS 7 introduced a new class named AVSpeechSynthesizer, and as its prefix suggests, it’s part of the powerful AVFoundation framework. The AVSpeechSynthesizer class, along with some other classes, can produce speech from a given text (or multiple pieces of text), and lets you configure various properties of that speech.

AVSpeechSynthesizer is the class responsible for the heavy work of converting text to speech. It’s capable of starting, pausing, stopping and continuing a speech process. However, it doesn’t interact directly with the text. An intermediate class called AVSpeechUtterance does that job. An object of this class represents a piece of text that should be spoken; to put it simply, an utterance is the text that’s about to be spoken, enriched with some properties that shape the final output. The most important properties that the AVSpeechUtterance class handles (besides the text) are the speech rate, pitch and volume. There are a few more, but we’ll see them in a while. An utterance object also defines the voice that will be used for speaking. A voice is an object of the AVSpeechSynthesisVoice class. Each voice corresponds to a specific language, and at the time of writing Apple supports 37 different voices, meaning voices for 37 different locales (we’ll talk about that later).
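
To make the relationship between these three classes more concrete, here is a minimal sketch of configuring an utterance and handing it to a synthesizer. It is written with current Swift naming (`speak(_:)`), which may differ from the Swift version used in the full tutorial. The helper name `makeUtterance` and the "en-US" language code are assumptions made just for this illustration.

```swift
import AVFoundation

/// Hypothetical helper: builds an utterance with an explicit rate, pitch, volume and voice.
func makeUtterance(text: String, languageCode: String = "en-US") -> AVSpeechUtterance {
    let utterance = AVSpeechUtterance(string: text)

    // The rate must lie between AVSpeechUtteranceMinimumSpeechRate and
    // AVSpeechUtteranceMaximumSpeechRate; here we simply use the system default.
    utterance.rate = AVSpeechUtteranceDefaultSpeechRate

    // Pitch multiplier accepts values from 0.5 to 2.0; 1.0 is the default pitch.
    utterance.pitchMultiplier = 1.0

    // Volume accepts values from 0.0 (silent) to 1.0 (loudest).
    utterance.volume = 1.0

    // The initializer returns nil when no voice matches the language code,
    // in which case the system default voice is used.
    utterance.voice = AVSpeechSynthesisVoice(language: languageCode)

    return utterance
}

// Keep a strong reference to the synthesizer (for example as a property of a
// view controller); speech stops if the synthesizer is deallocated.
let synthesizer = AVSpeechSynthesizer()
synthesizer.speak(makeUtterance(text: "Hello from AVSpeechSynthesizer."))
```

If you want to see which voices are installed on a device, `AVSpeechSynthesisVoice.speechVoices()` returns them as an array.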

Once an utterance object has been properly configured, it’s passed to a speech synthesizer object so the system starts producing the speech. Speaking many pieces of text, meaning many utterances, doesn’t require any extra effort; all it takes is to pass the utterances to the synthesizer in the order they should be spoken, and the synthesizer queues them automatically.
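
The queueing behavior can be sketched as follows. This is only an illustration under some assumptions: the function names `makeUtterances(from:)` and `speakAll(_:using:)`, the sample sentences, and the quarter-second pause between utterances are all made up for the example.

```swift
import AVFoundation

/// Hypothetical helper: turns each piece of text into its own utterance.
func makeUtterances(from pieces: [String]) -> [AVSpeechUtterance] {
    return pieces.map { piece in
        let utterance = AVSpeechUtterance(string: piece)
        // Optional short pause after each utterance (value chosen arbitrarily).
        utterance.postUtteranceDelay = 0.25
        return utterance
    }
}

/// Hypothetical helper: hands the utterances to the synthesizer in order.
/// The synthesizer queues them and speaks one after the other.
func speakAll(_ pieces: [String], using synthesizer: AVSpeechSynthesizer) {
    for utterance in makeUtterances(from: pieces) {
        synthesizer.speak(utterance)
    }
}

let queueSynthesizer = AVSpeechSynthesizer()
speakAll(["First paragraph.", "Second paragraph.", "Third paragraph."],
         using: queueSynthesizer)

// The same synthesizer object controls the whole queue, for example in
// response to button taps:
//   queueSynthesizer.pauseSpeaking(at: .word)      // pause after the current word
//   queueSynthesizer.continueSpeaking()            // resume from where it paused
//   queueSynthesizer.stopSpeaking(at: .immediate)  // stop and clear the queue
```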

Along with the AVSpeechSynthesizer class comes the AVSpeechSynthesizerDelegate protocol. It contains useful delegate methods that, when used properly, let you keep track of the progress of the speech and of the currently spoken text. Tracking the progress might be something you won’t need in your apps, but if you do, here you’ll see how to achieve it. It’s a bit of a tricky process, but once you understand how everything works, it will all become pretty clear.
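
As a rough idea of what progress tracking can look like, the sketch below implements two of the delegate methods. The class name `SpeechProgressTracker` and its `currentWord` and `progress` properties are hypothetical, and the progress value covers a single utterance only; the full tutorial may calculate progress differently.

```swift
import AVFoundation

/// Hypothetical delegate that remembers the word being spoken and how far
/// through the current utterance the synthesizer has got.
final class SpeechProgressTracker: NSObject, AVSpeechSynthesizerDelegate {
    private(set) var currentWord = ""
    private(set) var progress: Double = 0   // 0.0 ... 1.0 for the current utterance

    // Called just before the synthesizer speaks a range of the utterance text.
    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer,
                           willSpeakRangeOfSpeechString characterRange: NSRange,
                           utterance: AVSpeechUtterance) {
        // NSRange counts UTF-16 units, so work with NSString to stay consistent.
        let text = utterance.speechString as NSString
        currentWord = text.substring(with: characterRange)
        progress = text.length == 0
            ? 0
            : Double(NSMaxRange(characterRange)) / Double(text.length)
    }

    // Called when the synthesizer finishes speaking an utterance.
    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer,
                           didFinish utterance: AVSpeechUtterance) {
        progress = 1
    }
}

// The synthesizer holds its delegate weakly, so keep the tracker alive yourself.
let tracker = SpeechProgressTracker()
let trackedSynthesizer = AVSpeechSynthesizer()
trackedSynthesizer.delegate = tracker
```

The protocol also declares callbacks for when an utterance starts, pauses, continues or is cancelled, which can be handled the same way.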

If you want, feel free to take a look at the official documentation for all those classes. For my part, I’ll stop this introductory discussion here, as there are a lot of things to do, but trust me, all of them are really interesting. As a final note, I need to say that all testing of the demo app must be done on a real device (iPhone). Unfortunately, text-to-speech doesn’t work in the Simulator, so plug in your phones and let’s go.

Read the full tutorial on Appcoda
