Speech Synthesis Markup Language (SSML)

What is SSML?

You can always use plain text, but SSML allows for more customization. SSML is a markup language that lets you fine-tune how text is converted to speech, including pronunciation, pitch, speed, and volume.

Basic Structure

The <speak> tag is the root element and must wrap around all SSML content.

<speak>
    <!-- Your text and SSML tags go here -->
</speak>
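
When SSML is assembled in code, the text placed inside <speak> has to be escaped so that characters such as & and < do not break the markup. The following is a minimal Python sketch using only the standard library; the helper name to_ssml is illustrative and not part of any Text-to-Speech client library.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def to_ssml(text: str) -> str:
    """Wrap plain text in a <speak> root, escaping XML-reserved characters."""
    return f"<speak>{escape(text)}</speak>"


ssml = to_ssml("Tom & Jerry cost < 5 dollars")
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
print(ssml)  # <speak>Tom &amp; Jerry cost &lt; 5 dollars</speak>
```

Parsing the result only confirms that it is well-formed XML; it does not check that a given speech service supports every tag used.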

The following example shows SSML markup that Text-to-Speech can synthesize. Most of these tags are covered in more detail later in this document.

<speak>
  Here are <say-as interpret-as="characters">SSML</say-as> samples.
  I can pause <break time="3s"/>.
  I can speak in cardinals. Your number is <say-as interpret-as="cardinal">10</say-as>.
  Or I can speak in ordinals. You are <say-as interpret-as="ordinal">10</say-as> in line.
  Or I can even speak in digits. The digits for ten are <say-as interpret-as="characters">10</say-as>.
  I can also substitute phrases, like the <sub alias="World Wide Web Consortium">W3C</sub>.
  Finally, I can speak a paragraph with two sentences.
  <p><s>This is sentence one.</s><s>This is sentence two.</s></p>
</speak>
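
Before sending markup like this to a speech service, it can help to confirm locally that it parses and to see which tags it uses. The sketch below does this with Python's standard XML parser on a shortened copy of the example; tags_used is a hypothetical helper name, and the check covers well-formedness only, not service-specific support.

```python
import xml.etree.ElementTree as ET

SSML = """<speak>
  Here are <say-as interpret-as="characters">SSML</say-as> samples.
  I can pause <break time="3s"/>.
  <p><s>This is sentence one.</s><s>This is sentence two.</s></p>
</speak>"""


def tags_used(ssml: str) -> list:
    """Parse the SSML and return the sorted, de-duplicated tag names."""
    root = ET.fromstring(ssml)  # raises ParseError on malformed markup
    return sorted({element.tag for element in root.iter()})


print(tags_used(SSML))  # ['break', 'p', 's', 'say-as', 'speak']
```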

Common SSML Tags

Add Pauses

I can pause <break time="5s"/>
Another pause <break strength="medium"/>

The time attribute takes a duration in seconds or milliseconds (e.g. “3s” or “250ms”). The strength attribute accepts x-weak, weak, medium, strong, or x-strong.
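
The attribute rules above can be checked in code before a request is built. The following Python sketch validates both attributes; break_tag is a hypothetical helper, and the duration pattern (digits followed by s or ms) is an assumption based on the examples in this section rather than a full statement of the SSML grammar.

```python
import re
from typing import Optional

# Strength values as listed in this section.
STRENGTHS = {"x-weak", "weak", "medium", "strong", "x-strong"}
# Assumed duration syntax: a number followed by "s" or "ms", e.g. "3s", "250ms".
TIME_PATTERN = re.compile(r"\d+(\.\d+)?(s|ms)")


def break_tag(time: Optional[str] = None, strength: Optional[str] = None) -> str:
    """Build a <break/> tag, rejecting values outside the documented forms."""
    attrs = []
    if time is not None:
        if not TIME_PATTERN.fullmatch(time):
            raise ValueError(f"invalid time: {time!r}")
        attrs.append(f'time="{time}"')
    if strength is not None:
        if strength not in STRENGTHS:
            raise ValueError(f"invalid strength: {strength!r}")
        attrs.append(f'strength="{strength}"')
    return "<break " + " ".join(attrs) + "/>" if attrs else "<break/>"


print(break_tag(time="250ms"))       # <break time="250ms"/>
print(break_tag(strength="medium"))  # <break strength="medium"/>
```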

Sentence and Paragraph Elements

<p><s>This is sentence one.</s><s>This is sentence two.</s></p>
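
If the sentences are already available as a list, the paragraph markup can be generated rather than typed by hand. This is a small Python sketch under that assumption; paragraph is an illustrative helper name.

```python
from xml.sax.saxutils import escape


def paragraph(sentences) -> str:
    """Wrap each sentence in <s> and the whole group in a single <p>."""
    body = "".join(f"<s>{escape(sentence)}</s>" for sentence in sentences)
    return f"<p>{body}</p>"


print(paragraph(["This is sentence one.", "This is sentence two."]))
# <p><s>This is sentence one.</s><s>This is sentence two.</s></p>
```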

Adjust Speed, Pitch, and Volume

<prosody rate="slow" pitch="high" volume="loud">This is slow speech.</prosody>

Rate controls the speed of speech. Values can be “x-slow”, “slow”, “medium”, “fast”, “x-fast”, or a percentage.

Pitch adjusts the pitch of the voice. Values can be “x-low”, “low”, “medium”, “high”, “x-high”, or a percentage.

Volume sets the loudness of speech. Values can be “silent”, “x-soft”, “soft”, “medium”, “loud”, “x-loud”, or a decibel value.
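
The allowed values for the three attributes can be captured in a small lookup table. The Python sketch below builds a <prosody> element and rejects values outside the forms described above; prosody is a hypothetical helper, and the numeric patterns (an optional sign, digits, then % or dB) are an assumption about the exact syntax.

```python
import re
from xml.sax.saxutils import escape

# Named values as listed in this section.
NAMED = {
    "rate": {"x-slow", "slow", "medium", "fast", "x-fast"},
    "pitch": {"x-low", "low", "medium", "high", "x-high"},
    "volume": {"silent", "x-soft", "soft", "medium", "loud", "x-loud"},
}
# Assumed numeric syntax: percentage for rate and pitch, decibels for volume.
NUMERIC = {
    "rate": re.compile(r"[+-]?\d+(\.\d+)?%"),
    "pitch": re.compile(r"[+-]?\d+(\.\d+)?%"),
    "volume": re.compile(r"[+-]?\d+(\.\d+)?dB"),
}


def prosody(text: str, **attrs: str) -> str:
    """Wrap text in <prosody>, validating each rate, pitch, or volume value."""
    parts = []
    for name, value in attrs.items():
        if name not in NAMED:
            raise ValueError(f"unknown prosody attribute: {name}")
        if value not in NAMED[name] and not NUMERIC[name].fullmatch(value):
            raise ValueError(f"invalid {name} value: {value!r}")
        parts.append(f' {name}="{value}"')
    return f"<prosody{''.join(parts)}>{escape(text)}</prosody>"


print(prosody("This is slow speech.", rate="slow", pitch="high", volume="loud"))
# <prosody rate="slow" pitch="high" volume="loud">This is slow speech.</prosody>
```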

Change Voice

The <voice> tag allows you to use more than one voice in a single SSML request.

And then she asked, <voice language="fr-FR" gender="female">qu'est-ce qui
t'amène ici</voice>

Say As

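The <say-as> tag tells the synthesizer what kind of content the enclosed text is, through the interpret-as attribute, so that it is read out appropriately.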

<say-as interpret-as="characters">SSML</say-as>
Your number is <say-as interpret-as="cardinal">10</say-as>.
You are <say-as interpret-as="ordinal">10</say-as> in line.
The digits for ten are <say-as interpret-as="characters">10</say-as>.
<say-as interpret-as="telephone" google:style='zero-as-zero'>1234 0000</say-as>
<say-as interpret-as="telephone">1234 0000</say-as>
<say-as interpret-as="date" format="yyyymmdd" detail="1">2024-08-10</say-as>
<say-as interpret-as="time" format="hms24" detail="1">20:05</say-as>
<say-as interpret-as="time" format="hms12" detail="2">8:05pm</say-as>
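
Because interpret-as is the attribute that selects the content type, a helper that makes it a required argument can prevent it from being left out. The following is a Python sketch with a hypothetical say_as function; it does not check whether a given interpret-as or format value is supported by the speech service.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def say_as(text: str, interpret_as: str,
           fmt: Optional[str] = None, detail: Optional[str] = None) -> str:
    """Build a <say-as> element; interpret_as is required, the rest optional."""
    attrs = f" interpret-as={quoteattr(interpret_as)}"
    if fmt is not None:
        attrs += f" format={quoteattr(fmt)}"
    if detail is not None:
        attrs += f" detail={quoteattr(detail)}"
    return f"<say-as{attrs}>{escape(text)}</say-as>"


print(say_as("10", "ordinal"))
# <say-as interpret-as="ordinal">10</say-as>
print(say_as("2024-08-10", "date", fmt="yyyymmdd", detail="1"))
# <say-as interpret-as="date" format="yyyymmdd" detail="1">2024-08-10</say-as>
```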

Emphasize Words

<emphasis> is used to add or remove emphasis from text contained by the element.

<emphasis level="moderate">This is an important announcement</emphasis>

Values for level can be “strong”, “moderate”, “reduced”, or “none”.

Additional Resources

Google Cloud SSML Documentation