Speech Synthesis Markup Language (SSML)
What is SSML?
Plain text always works as input, but SSML allows for more customization. SSML is a markup language for controlling text-to-speech output. It lets you fine-tune how text is converted to speech, including pronunciation, pitch, speed, volume, and more.
Basic Structure
The <speak> tag is the root element and must wrap all SSML content.
<speak>
<!-- Your text and SSML tags go here -->
</speak>
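When SSML is generated from a program, two things tend to go wrong: special characters such as & are left unescaped, and the <speak> root is forgotten. The Python sketch below shows one way to guard against both. The helper names wrap_in_speak and is_well_formed_ssml are hypothetical and not part of any Text-to-Speech API; the check only covers XML well-formedness and the root element, not whether a given engine supports every tag.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def wrap_in_speak(text):
    """Escape plain text and wrap it in the <speak> root element."""
    return f"<speak>{escape(text)}</speak>"


def is_well_formed_ssml(ssml):
    """Return True if the string parses as XML and its root is <speak>."""
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError:
        return False
    return root.tag == "speak"


ssml = wrap_in_speak("Tom & Jerry say hello.")
print(ssml)                       # <speak>Tom &amp; Jerry say hello.</speak>
print(is_well_formed_ssml(ssml))  # True
```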
The following example shows SSML markup that Text-to-Speech synthesizes into audio. Each tag is discussed in detail later in this document.
<speak>
Here are <say-as interpret-as="characters">SSML</say-as> samples.
I can pause <break time="3s"/>.
I can speak in cardinals. Your number is <say-as interpret-as="cardinal">10</say-as>.
Or I can speak in ordinals. You are <say-as interpret-as="ordinal">10</say-as> in line.
Or I can even speak in digits. The digits for ten are <say-as interpret-as="characters">10</say-as>.
I can also substitute phrases, like the <sub alias="World Wide Web Consortium">W3C</sub>.
Finally, I can speak a paragraph with two sentences.
<p><s>This is sentence one.</s><s>This is sentence two.</s></p>
</speak>
Common SSML Tags
Add Pauses
I can pause <break time="5s"/>
Another pause <break strength="medium"/>
The time attribute is specified in seconds or milliseconds (e.g. “3s” or “250ms”). The strength attribute accepts the values x-weak, weak, medium, strong, and x-strong.
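As a rough illustration of the two attributes, the following Python sketch builds a <break> tag and rejects values that do not match the forms listed above. The function name break_tag and the exact pattern used for time values are assumptions made for this example; an engine may accept more forms than this check allows.

```python
import re

# Strength values listed in this document.
STRENGTHS = {"x-weak", "weak", "medium", "strong", "x-strong"}
# Assumed shape of a time value: a number followed by "s" or "ms".
TIME_PATTERN = re.compile(r"^\d+(\.\d+)?(s|ms)$")


def break_tag(time=None, strength=None):
    """Build a <break/> tag from an optional time and/or strength."""
    attrs = []
    if time is not None:
        if not TIME_PATTERN.match(time):
            raise ValueError(f"invalid time: {time!r}")
        attrs.append(f'time="{time}"')
    if strength is not None:
        if strength not in STRENGTHS:
            raise ValueError(f"invalid strength: {strength!r}")
        attrs.append(f'strength="{strength}"')
    if not attrs:
        return "<break/>"
    return "<break " + " ".join(attrs) + "/>"


print("I can pause " + break_tag(time="250ms"))
print("Another pause " + break_tag(strength="medium"))
```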
Sentence and Paragraph Elements
<p><s>This is sentence one.</s><s>This is sentence two.</s></p>
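If the sentences are already available as a list, the markup can be generated rather than typed by hand. This is a minimal Python sketch; the function name paragraph is hypothetical, and splitting raw text into sentences is deliberately left out.

```python
from xml.sax.saxutils import escape


def paragraph(sentences):
    """Wrap each sentence in <s> and the whole group in <p>."""
    body = "".join(f"<s>{escape(sentence)}</s>" for sentence in sentences)
    return f"<p>{body}</p>"


print(paragraph(["This is sentence one.", "This is sentence two."]))
# <p><s>This is sentence one.</s><s>This is sentence two.</s></p>
```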
Adjust Speed, Pitch, and Volume
<prosody rate="slow" pitch="high" volume="loud">This is slow speech.</prosody>
Rate controls the speed of speech. Values can be “x-slow”, “slow”, “medium”, “fast”, “x-fast”, or a percentage.
Pitch adjusts the pitch of the voice. Values can be “x-low”, “low”, “medium”, “high”, “x-high”, or a percentage.
Volume controls the loudness of speech. Values can be “silent”, “x-soft”, “soft”, “medium”, “loud”, “x-loud”, or a decibel value.
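The three attributes can be checked against the value lists above before the markup is sent to a synthesizer. The Python sketch below is one possible approach. The function name prosody is hypothetical, and the patterns for percentages (such as “80%” or “+10%”) and decibel values (such as “+6dB”) are assumptions about the accepted syntax that may not match every engine.

```python
import re
from xml.sax.saxutils import escape

RATES = {"x-slow", "slow", "medium", "fast", "x-fast"}
PITCHES = {"x-low", "low", "medium", "high", "x-high"}
VOLUMES = {"silent", "x-soft", "soft", "medium", "loud", "x-loud"}
PERCENT = re.compile(r"^[+-]?\d+(\.\d+)?%$")    # assumed percentage syntax
DECIBEL = re.compile(r"^[+-]\d+(\.\d+)?dB$")    # assumed decibel syntax


def prosody(text, rate=None, pitch=None, volume=None):
    """Wrap text in <prosody>, validating each attribute that is given."""
    checks = [
        ("rate", rate, RATES, PERCENT),
        ("pitch", pitch, PITCHES, PERCENT),
        ("volume", volume, VOLUMES, DECIBEL),
    ]
    attrs = []
    for name, value, named_values, pattern in checks:
        if value is None:
            continue
        if value not in named_values and not pattern.match(value):
            raise ValueError(f"invalid {name}: {value!r}")
        attrs.append(f' {name}="{value}"')
    return f"<prosody{''.join(attrs)}>{escape(text)}</prosody>"


print(prosody("This is slow speech.", rate="slow", pitch="high", volume="loud"))
```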
Change Voice
The <voice> tag allows you to use more than one voice in a single SSML request.
And then she asked, <voice language="fr-FR" gender="female">qu'est-ce qui t'amène ici</voice>
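A request that mixes voices can be assembled from small pieces. The Python sketch below reuses the language and gender attributes from the example above; the function name voice is hypothetical, and which attributes and values a particular engine honors is not confirmed here.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def voice(text, language, gender):
    """Wrap text in a <voice> tag with the given language and gender."""
    return (
        f"<voice language={quoteattr(language)} gender={quoteattr(gender)}>"
        f"{escape(text)}</voice>"
    )


french = voice("qu'est-ce qui t'amène ici", "fr-FR", "female")
ssml = f"<speak>And then she asked, {french}</speak>"

# Parsing confirms the combined markup is well-formed XML.
root = ET.fromstring(ssml)
print(ssml)
print(root.find("voice").get("language"))  # fr-FR
```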
Say As
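The <say-as> tag tells the synthesizer how to interpret the enclosed text, for example as characters, a cardinal, an ordinal, a telephone number, a date, or a time.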
<say-as interpret-as="characters">SSML</say-as>
Your number is <say-as interpret-as="cardinal">10</say-as>.
You are <say-as interpret-as="ordinal">10</say-as> in line.
The digits for ten are <say-as interpret-as="characters">10</say-as>.
<say-as interpret-as="telephone" google:style='zero-as-zero'>1234 0000</say-as>
<say-as interpret-as="telephone">1234 0000</say-as>
<say-as interpret-as="date" format="yyyymmdd" detail="1">2024-08-10</say-as>
<say-as interpret-as="time" format="hms24" detail="1">20:05</say-as>
<say-as interpret-as="time" format="hms12" detail="2">8:05pm</say-as>
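Because every <say-as> tag has the same shape (an interpret-as value plus optional extra attributes such as format or detail), a single helper can produce all of the variants. This Python sketch uses a hypothetical function name, say_as, and does not check which interpret-as values or formats an engine actually supports.

```python
from xml.sax.saxutils import escape, quoteattr


def say_as(text, interpret_as, **extra):
    """Build a <say-as> tag; extra keyword arguments become attributes."""
    attrs = {"interpret-as": interpret_as, **extra}
    attr_text = "".join(
        f" {name}={quoteattr(str(value))}" for name, value in attrs.items()
    )
    return f"<say-as{attr_text}>{escape(text)}</say-as>"


print(say_as("SSML", "characters"))
print("You are " + say_as("10", "ordinal") + " in line.")
print(say_as("2024-08-10", "date", format="yyyymmdd", detail="1"))
```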
Emphasize Words
The <emphasis> tag adds or removes emphasis from the text it contains.
<emphasis level="moderate">This is an important announcement</emphasis>
Values for level can be “strong”, “moderate”, “reduced”, or “none”.
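A small helper can restrict level to the four values listed above. This is a Python sketch with a hypothetical function name, emphasis; defaulting to “moderate” is a choice made for this example, not something the text specifies.

```python
from xml.sax.saxutils import escape

# Level values listed in this document.
LEVELS = ("strong", "moderate", "reduced", "none")


def emphasis(text, level="moderate"):
    """Wrap text in <emphasis>, rejecting levels outside the known set."""
    if level not in LEVELS:
        raise ValueError(f"invalid level: {level!r}")
    return f'<emphasis level="{level}">{escape(text)}</emphasis>'


print(emphasis("This is an important announcement"))
# <emphasis level="moderate">This is an important announcement</emphasis>
```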
Additional Resources