Text-to-Speech - RemoteSpeech interface
Overview
SmartBody is responsible for acquiring the audio that is associated with a character utterance. This audio is then preferably played back by the game engine, but it can also be played back by SmartBody itself. To acquire the audio, SmartBody can either:
- read in an existing prerecorded speech audio file
- send a request to a text-to-speech engine
This document focuses on the latter.
SmartBody sends a RemoteSpeechCmd message to the TtsRelay module, requesting that a line of text be converted into audio. The message specifies which voice to use and where to write the generated file. TtsRelay replies with a RemoteSpeechReply message containing the exact file location, a viseme schedule with detailed timing information for lip-syncing, and word boundary timing information for synchronizing nonverbal behavior as specified through BML.
TTS Engines
Rhetorical (RVoiceRelay)
Voice Codes:
set character doctor voice remote M021 <- Saso Doctor's voice
set character elder voice remote M009 <- Saso Elder's voice
Cerevoice (CerevoiceRelay)
Voice Codes:
set character doctor voice remote star
set character doctor voice remote katherine
set character doctor voice remote starconv
Cepstral (CepstralRelay)
MSSpeech (MSSpeechRelay)
Voice Codes:
set character doctor voice remote BradVoice
Festival (FestivalRelay)
Voice Codes:
set character doctor voice remote BradVoice
RemoteSpeech Interface
To trigger a TTS call:
sbm bml char doctor speech "Hello world. Testing Text to Speech"
Sent by SmartBody to TTS Engine:
RemoteSpeechCmd speak doctor 1 M021 ../../data/cache/audio/utt_20110528_175743_doctor_1.aiff <?xml version="1.0" encoding="UTF-8"?> <speech type="text/plain"> Hello world. Testing Text to Speech </speech>
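The layout of the message above can be sketched in Python. This is only an illustration of the string format seen in the example: the function name and the parameter names (character, request_id, voice, audio_path) are assumptions inferred from the sample, not names taken from SmartBody.

```python
from xml.sax.saxutils import escape


def build_remote_speech_cmd(character, request_id, voice, audio_path, text):
    """Assemble a RemoteSpeechCmd string in the layout of the sample message.

    Hypothetical helper: the field meanings are inferred from the example.
    """
    # XML payload: declaration followed by a plain-text <speech> element.
    body = (
        '<?xml version="1.0" encoding="UTF-8"?> '
        '<speech type="text/plain"> ' + escape(text) + " </speech>"
    )
    # Space-separated header fields, then the XML payload.
    fields = ["RemoteSpeechCmd", "speak", character, str(request_id),
              voice, audio_path, body]
    return " ".join(fields)


msg = build_remote_speech_cmd(
    "doctor", 1, "M021",
    "../../data/cache/audio/utt_20110528_175743_doctor_1.aiff",
    "Hello world. Testing Text to Speech",
)
print(msg)
```

Escaping the text keeps characters such as & from producing malformed XML; whether SmartBody itself escapes the utterance is not stated in this document.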
RVoiceRelay Example:
Actual message sent to Rhetorical:
<?xml version="1.0" encoding="UTF-8"?> <speech type="text/plain">Hello world. Testing Text to Speech</speech>
Sent by TTS Engine:
RemoteSpeechReply doctor 2 OK: <?xml version="1.0" encoding="UTF-8"?> <speak> <soundFile name="d:\edwork\saso\core\beavin\..\..\data\cache\audio\utt_20110528_180148_doctor_2.aiff"/> <viseme start="0.0" type="_"/> <word end="0.4049886621315193" start="0.049977324263038546"> <viseme start="0.049977324263038546" type="Ih"/> <viseme start="0.14498866213151929" type="Ih"/> <viseme start="0.2" type="D"/> <viseme start="0.2549659863945578" type="OW"/> </word> <word end="0.8099773242630386" start="0.4049886621315193"> <viseme start="0.4049886621315193" type="OO"/> <viseme start="0.5199546485260771" type="Er"/> <viseme start="0.5849886621315192" type="R"/> <viseme start="0.6649886621315193" type="D"/> <viseme start="0.7699773242630386" type="D"/> </word> <viseme start="0.8099773242630386" type="_"/> <viseme start="0.860498866213152" type="_"/> <viseme start="1.060498866213152" type="_"/> <word end="1.5854875283446712" start="1.1104761904761904"> <viseme start="1.1104761904761904" type="D"/> <viseme start="1.1574603174603175" type="Ih"/> <viseme start="1.2354648526077097" type="Z"/> <viseme start="1.3304761904761904" type="D"/> <viseme start="1.3824943310657596" type="Ih"/> <viseme start="1.4374603174603175" type="NG"/> </word> <word end="1.8724716553287981" start="1.5854875283446712"> <viseme start="1.5854875283446712" type="D"/> <viseme start="1.6424943310657596" type="Ih"/> <viseme start="1.7174603174603174" type="KG"/> <viseme start="1.7674829931972789" type="Z"/> <viseme start="1.8374603174603175" type="D"/> </word> <word end="1.927482993197279" start="1.8724716553287981"> <viseme start="1.8724716553287981" type="D"/> <viseme start="1.9024943310657596" type="Ih"/> </word> <word end="2.408480725623583" start="1.927482993197279"> <viseme start="1.927482993197279" type="Z"/> <viseme start="2.0224943310657597" type="BMP"/> <viseme start="2.1174603174603175" type="EE"/> <viseme start="2.207482993197279" type="j"/> </word> <viseme start="2.408480725623583" type="_"/> <viseme 
start="2.4584580498866213" type="_"/> </speak>
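A receiver could unpack a reply like the one above roughly as follows. This is a minimal sketch using Python's standard library; the function name and the returned dictionary keys are hypothetical, and the sample string is a shortened, hand-made reply with rounded times and a placeholder file name rather than real engine output.

```python
import xml.etree.ElementTree as ET


def parse_remote_speech_reply(message):
    """Split a RemoteSpeechReply into header fields and timing data."""
    xml_start = message.index("<speak")
    # Header tokens before the XML: command, character, request id, status.
    command, character, request_id, status = message[:xml_start].split()[:4]
    root = ET.fromstring(message[xml_start:])
    return {
        "command": command,
        "character": character,
        "request_id": int(request_id),
        "status": status.rstrip(":"),
        "sound_file": root.find("soundFile").get("name"),
        # All visemes in document order, including those nested in <word>.
        "visemes": [(float(v.get("start")), v.get("type"))
                    for v in root.iter("viseme")],
        # Word boundaries as (start, end) pairs.
        "words": [(float(w.get("start")), float(w.get("end")))
                  for w in root.iter("word")],
    }


# Shortened illustrative reply (assumed values, not engine output).
sample = (
    'RemoteSpeechReply doctor 2 OK: <?xml version="1.0" encoding="UTF-8"?> '
    '<speak> <soundFile name="utt_doctor_2.aiff"/> '
    '<viseme start="0.0" type="_"/> '
    '<word end="0.4" start="0.05"> <viseme start="0.05" type="Ih"/> '
    '<viseme start="0.2" type="D"/> </word> '
    '<viseme start="0.4" type="_"/> </speak>'
)
reply = parse_remote_speech_reply(sample)
print(reply["character"], reply["sound_file"], reply["words"])
```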
MSSpeechRelay Example:
Actual message sent to MSSpeech:
<speak version="1.0" xml:lang="en-US">Hello world. Testing Text to Speech .</speak>
(note the added period at the end)
Sent by TTS Engine:
RemoteSpeechReply doctor 1 OK: <?xml version="1.0" encoding="UTF-8"?> <speak> <soundFile name="d:\edwork\vhtoolkit\data\cache\audio\utt_20110528_180527_doctor_1.wav"/> <viseme start="0" type="_"/> <viseme start="0.003" type="Oh"/> <viseme start="0.047" type="Ih"/> <viseme start="0.098" type="D"/> <viseme start="0.258" type="Oh"/> <viseme start="0.418" type="Oh"/> <viseme start="0.479" type="Er"/> <viseme start="0.54" type="R"/> <viseme start="0.601" type="D"/> <viseme start="0.695" type="D"/> <viseme start="0.745" type="_"/> <viseme start="1.367" type="_"/> <viseme start="1.37" type="D"/> <viseme start="1.461" type="Ih"/> <viseme start="1.546" type="Z"/> <viseme start="1.6" type="D"/> <viseme start="1.654" type="Ih"/> <viseme start="1.729" type="KG"/> <viseme start="1.804" type="D"/> <viseme start="1.9" type="Ih"/> <viseme start="2.022" type="KG"/> <viseme start="2.087" type="Z"/> <viseme start="2.16" type="D"/> <viseme start="2.233" type="D"/> <viseme start="2.297" type="Oh"/> <viseme start="2.341" type="Z"/> <viseme start="2.425" type="BMP"/> <viseme start="2.509" type="Ih"/> <viseme start="2.606" type="j"/> <viseme start="2.73" type="_"/> </speak>
CerevoiceRelay Example:
Actual text sent to Cerevoice engine:
<?xml version="1.0" encoding="UTF-8"?> <speech type="text/plain">Hello world. Testing Text to Speech </speech>
(note the trailing space; also note that CerevoiceRelay removes punctuation because of an apparent bug in Cerevoice)
Sent by TTS Engine (CerevoiceRelay Example) (hand-formatted):
RemoteSpeechReply doctor 1 OK: <?xml version="1.0" encoding="UTF-8"?> <speak> <soundFile name="d:\edwork\saso\data\cache\audio\utt_20110621_192933_doctor_1.wav"/> <viseme start="0.000000" type="_"/> <mark name="sp1:T0" time="0.010975"/> <mark name="sp1:T1" time="0.010975"/> <word end="2.468209" start="0.010975"> <viseme start="0.010975" type="Ih"/> <viseme start="0.090975" type="Ih"/> <viseme start="0.120952" type="D"/> <viseme start="0.231157" type="Oh"/> <viseme start="0.430088" type="OO"/> <viseme start="0.527008" type="Er"/> <viseme start="0.663673" type="D"/> <viseme start="0.723719" type="D"/> <viseme start="0.768662" type="D"/> <viseme start="0.848662" type="Ih"/> <viseme start="0.948662" type="Z"/> <viseme start="1.113696" type="D"/> <viseme start="1.173651" type="Ih"/> <viseme start="1.223510" type="NG"/> <viseme start="1.357624" type="D"/> <viseme start="1.431655" type="Ih"/> <viseme start="1.511610" type="KG"/> <viseme start="1.566621" type="Z"/> <viseme start="1.636644" type="D"/> <viseme start="1.696644" type="Oh"/> <viseme start="1.833379" type="Z"/> <viseme start="1.958231" type="BMP"/> <viseme start="2.028209" type="EE"/> <viseme start="2.188209" type="j"/> </word> <mark name="sp1:T2" time="2.468209"/> <mark name="sp1:T3" time="2.468209"/> <viseme start="2.468209" type="_"/> </speak>
new output 11/7/11:
RemoteSpeechReply doctor 1 OK: <?xml version="1.0" encoding="UTF-8"?> <speak> <soundFile name="d:\edwork\saso\core\TtsSpeechRelay\bin\data\cache\audio\utt_20110528_175743_doctor_1.wav.wav"/> <viseme start="0.000000" type="_"/> <mark name="sp1:T0" time="0.010975"/> <mark name="sp1:T1" time="0.010975"/> <word end="0.353100" start="0.010975"> <viseme start="0.010975" type="Ih"/> <viseme start="0.099709" type="Ih"/> <viseme start="0.126943" type="D"/> <viseme start="0.252789" type="Oh"/> </word> <mark name="sp1:T2" time="0.353100"/> <mark name="sp1:T3" time="0.353100"/> <word end="0.762222" start="0.353100"> <viseme start="0.353100" type="OO"/> <viseme start="0.446472" type="Er"/> <viseme start="0.532245" type="D"/> <viseme start="0.602222" type="D"/> </word> <mark name="sp1:T4" time="0.762222"/> <mark name="sp1:T5" time="0.762222"/> <viseme start="0.762222" type="_"/> <mark name="sp1:T6" time="0.000000"/> <mark name="sp1:T7" time="0.962222"/> <viseme start="0.000000" type="_"/> <mark name="sp1:T8" time="1.162222"/> <mark name="sp1:T9" time="1.162222"/> <word end="1.617595" start="1.162222"> <viseme start="1.162222" type="D"/> <viseme start="1.254784" type="Ih"/> <viseme start="1.340280" type="Z"/> <viseme start="1.419229" type="D"/> <viseme start="1.479229" type="Ih"/> <viseme start="1.509215" type="NG"/> </word> <mark name="sp1:T10" time="1.617595"/> <mark name="sp1:T11" time="1.617595"/> <word end="2.077460" start="1.617595"> <viseme start="1.617595" type="D"/> <viseme start="1.747483" type="Ih"/> <viseme start="1.827483" type="KG"/> <viseme start="1.927438" type="Z"/> <viseme start="2.037460" type="D"/> </word> <mark name="sp1:T12" time="2.077460"/> <mark name="sp1:T13" time="2.077460"/> <word end="2.227483" start="2.077460"> <viseme start="2.077460" type="D"/> <viseme start="2.197460" type="Ih"/> </word> <mark name="sp1:T14" time="2.227483"/> <mark name="sp1:T15" time="2.227483"/> <word end="2.847438" start="2.227483"> <viseme start="2.227483" type="Z"/> <viseme 
start="2.347483" type="BMP"/> <viseme start="2.427438" type="EE"/> <viseme start="2.587438" type="j"/> </word> <mark name="sp1:T16" time="2.847438"/> <mark name="sp1:T17" time="2.847438"/> <viseme start="2.847438" type="_"/> </speak>
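The mark elements in these replies carry the timing needed to align nonverbal behavior with speech. Some relays return names with an "sp1:" prefix and others without, so a consumer may want to normalize them. The sketch below assumes that stripping everything up to the last colon is acceptable; the function name is hypothetical, and the sample is a shortened fragment in which the last mark was deliberately left unprefixed to show both forms.

```python
import xml.etree.ElementTree as ET


def mark_times(speak_xml):
    """Map mark names to times, dropping any 'sp1:'-style prefix."""
    root = ET.fromstring(speak_xml)
    times = {}
    for mark in root.iter("mark"):
        # rpartition returns the whole name when no colon is present.
        name = mark.get("name").rpartition(":")[2]
        times[name] = float(mark.get("time"))
    return times


# Shortened illustrative fragment (the unprefixed T3 is an assumption).
sample = (
    '<speak> <mark name="sp1:T0" time="0.010975"/> '
    '<mark name="sp1:T1" time="0.010975"/> '
    '<word end="0.353100" start="0.010975"> '
    '<viseme start="0.010975" type="Ih"/> </word> '
    '<mark name="sp1:T2" time="0.353100"/> '
    '<mark name="T3" time="0.353100"/> </speak>'
)
times = mark_times(sample)
print(times)
```

If two marks normalize to the same name, the later one overwrites the earlier one in this sketch.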
FestivalRelay example:
Actual text sent to Festival:
<?xml version="1.0" encoding="UTF-8"?> <speech type="text/plain">Hello world. Testing Text to Speech </speech>
(note that this gets edited by FestivalRelay and eventually gets sent out as 'Helloworld.TestingTexttoSpeech')
Sent by TTS Engine (FestivalRelay Example) (hand-formatted):
RemoteSpeechReply doctor 7 OK: <?xml version="1.0" encoding="UTF-8"?> <speak> <soundFile name="d:\edwork\vhtoolkit\bin\FestivalRelay\data\cache\festival\utt_20110722_185051_doctor_7.wav"/> <viseme start="0.000000" type="_" /> <mark name="T0" time="0.080000"/> <word end="0.640000" start="0.080000" > <viseme start="0.080000" type="Ih" /> <viseme start="0.160000" type="Ih" /> <viseme start="0.240000" type="D" /> <viseme start="0.320000" type="Oh" /> <viseme start="0.400000" type="Er" /> <viseme start="0.440000" type="R" /> <mark name="T1" time="0.480000"/> </word> <mark name="T2" time="0.080000"/> <word end="0.640000" start="0.080000" > <viseme start="0.480000" type="D" /> <viseme start="0.560000" type="D" /> <mark name="T3" time="0.640000"/> </word> <mark name="T4" time="0.640000"/> <word end="0.880000" start="0.640000" > <viseme start="0.640000" type="D" /> <viseme start="0.720000" type="Ao" /> <viseme start="0.800000" type="D" /> <mark name="T5" time="0.880000"/> </word> <mark name="T6" time="0.880000"/> <word end="2.160000" start="0.880000" > <viseme start="0.880000" type="D" /> <viseme start="0.960000" type="Ih" /> <viseme start="1.040000" type="Z" /> <viseme start="1.120000" type="D" /> <viseme start="1.200000" type="Ih" /> <viseme start="1.280000" type="NG" /> <viseme start="1.360000" type="D" /> <viseme start="1.440000" type="Ih" /> <viseme start="1.520000" type="KG" /> <viseme start="1.600000" type="Z" /> <viseme start="1.680000" type="D" /> <viseme start="1.760000" type="Ao" /> <viseme start="1.840000" type="Z" /> <viseme start="1.920000" type="BMP" /> <viseme start="2.000000" type="EE" /> <viseme start="2.080000" type="j" /> <mark name="T7" time="2.160000"/> </word> <viseme start="2.160000" type="_" /> </speak>
new output 11/7/11:
RemoteSpeechReply doctor 1 OK: <?xml version="1.0" encoding="UTF-8"?> <speak> <soundFile name="d:\edwork\saso\core\TtsSpeechRelay\bin\data\cache\audio\utt_20110528_175743_doctor_1.wav"/> <mark name="T0" time="0.210000"/> <word end="0.795159" start="0.210000" > <viseme start="0.367043" type="D" /> <viseme start="0.704177" type="D" /> <viseme start="0.756153" type="D" /> <mark name="T1" time="0.795159"/> </word> <mark name="T2" time="0.795159"/> <word end="1.013328" start="0.795159" > <viseme start="0.795159" type="D" /> <viseme start="0.953081" type="D" /> <mark name="T3" time="1.013328"/> </word> <mark name="T4" time="1.013328"/> <word end="2.455301" start="1.013328" > <viseme start="1.013328" type="D" /> <viseme start="1.210314" type="Z" /> <viseme start="1.282180" type="D" /> <viseme start="1.358164" type="Ih" /> <viseme start="1.394886" type="NG" /> <viseme start="1.452691" type="D" /> <viseme start="1.608044" type="KG" /> <viseme start="1.690684" type="Z" /> <viseme start="1.788436" type="D" /> <viseme start="1.962315" type="Z" /> <viseme start="2.065681" type="BMP" /> <viseme start="2.312202" type="j" /> <mark name="T5" time="2.455301"/> </word> </speak>
NPCEditor/NVBG Example
Utterance #20 in Toolkit
RemoteSpeechCmd sent by SBM:
RemoteSpeechCmd speak brad 1 BradVoiceFestival ../../data/cache/audio/utt_20110809_151922_brad_1.aiff <?xml version="1.0" encoding="utf-16"?> <speech id="sp1" ref="tech_sapiTTS" type="application/ssml+xml"> <mark name="T0" />SAPI <mark name="T1" /><mark name="T2" />is <mark name="T3" /><mark name="T4" />a <mark name="T5" /><mark name="T6" />speech <mark name="T7" /><mark name="T8" />and <mark name="T9" /><mark name="T10" />text <mark name="T11" /><mark name="T12" />to <mark name="T13" /><mark name="T14" />speech <mark name="T15" /><mark name="T16" />interface <mark name="T17" /><mark name="T18" />by <mark name="T19" /><mark name="T20" />Microsoft. <mark name="T21" /><mark name="T22" />I <mark name="T23" /><mark name="T24" />use <mark name="T25" /><mark name="T26" />it <mark name="T27" /><mark name="T28" />to <mark name="T29" /><mark name="T30" />be <mark name="T31" /><mark name="T32" />able <mark name="T33" /><mark name="T34" />to <mark name="T35" /><mark name="T36" />talk <mark name="T37" /><mark name="T38" />to <mark name="T39" /><mark name="T40" />you. <mark name="T41" /> </speech>
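In the message above, every word is bracketed by a pair of numbered marks (T0/T1 around the first word, T2/T3 around the second, and so on). The pattern can be reproduced with a short sketch; this only mimics the markup seen in the sample and is not the code NPCEditor or NVBG actually uses. The function name is hypothetical.

```python
from xml.sax.saxutils import escape


def add_word_marks(text):
    """Wrap each whitespace-separated word in a pair of numbered marks."""
    parts = []
    for i, word in enumerate(text.split()):
        # Word i gets mark T(2i) before it and mark T(2i+1) after it.
        parts.append(
            '<mark name="T%d" />%s <mark name="T%d" />'
            % (2 * i, escape(word), 2 * i + 1)
        )
    return "".join(parts)


marked = add_word_marks("SAPI is a speech")
print(marked)
```

Punctuation stays attached to its word, as with "Microsoft." and "you." in the sample.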
Festival example
RemoteSpeechReply brad 2 OK: <?xml version="1.0" encoding="UTF-8"?> <speak> <soundFile name="d:\edwork\vhtoolkit\bin\FestivalRelay\data\cache\festival\utt_20110809_152521_brad_2.wav"/> <viseme start="0.000000" type="_" /> <mark name="T0" time="0.080000"/> <word end="0.400000" start="0.080000" > <viseme start="0.080000" type="Z" /> <viseme start="0.160000" type="Ao" /> <viseme start="0.240000" type="BMP" /> <viseme start="0.320000" type="EE" /> <mark name="T1" time="0.400000"/> </word> <mark name="T2" time="0.400000"/> <word end="0.560000" start="0.400000" > <viseme start="0.400000" type="Ih" /> <viseme start="0.480000" type="Z" /> <mark name="T3" time="0.560000"/> </word> <mark name="T4" time="0.560000"/> <word end="0.640000" start="0.560000" > <viseme start="0.560000" type="Ih" /> <mark name="T5" time="0.640000"/> </word> <mark name="T6" time="0.640000"/> <word end="0.960000" start="0.640000" > <viseme start="0.640000" type="Z" /> <viseme start="0.720000" type="BMP" /> <viseme start="0.800000" type="EE" /> <viseme start="0.880000" type="j" /> <viseme start="0.960000" type="_" /> <mark name="T7" time="1.040000"/> </word> <mark name="T8" time="1.040000"/> <word end="1.280000" start="1.040000" > <viseme start="1.040000" type="Ih" /> <viseme start="1.120000" type="NG" /> <viseme start="1.200000" type="D" /> <mark name="T9" time="1.280000"/> </word> <mark name="T10" time="1.280000"/> <word end="1.680000" start="1.280000" > <viseme start="1.280000" type="D" /> <viseme start="1.360000" type="Ih" /> <viseme start="1.440000" type="KG" /> <viseme start="1.520000" type="Z" /> <viseme start="1.600000" type="D" /> <mark name="T11" time="1.680000"/> </word> <mark name="T12" time="1.680000"/> <word end="1.840000" start="1.680000" > <viseme start="1.680000" type="D" /> <viseme start="1.760000" type="Ih" /> <mark name="T13" time="1.840000"/> </word> <mark name="T14" time="1.840000"/> <word end="2.160000" start="1.840000" > <viseme start="1.840000" type="Z" /> <viseme 
start="1.920000" type="BMP" /> <viseme start="2.000000" type="EE" /> <viseme start="2.080000" type="j" /> <mark name="T15" time="2.160000"/> </word> <mark name="T16" time="2.160000"/> <word end="2.720000" start="2.160000" > <viseme start="2.160000" type="Ih" /> <viseme start="2.240000" type="NG" /> <viseme start="2.320000" type="D" /> <viseme start="2.400000" type="Er" /> <viseme start="2.440000" type="R" /> <mark name="T17" time="2.480000"/> </word> <mark name="T18" time="2.160000"/> <word end="2.720000" start="2.160000" > <viseme start="2.480000" type="F" /> <viseme start="2.560000" type="Ih" /> <viseme start="2.640000" type="Z" /> <mark name="T19" time="2.720000"/> </word> <mark name="T20" time="2.720000"/> <word end="2.880000" start="2.720000" > <viseme start="2.720000" type="BMP" /> <viseme start="2.800000" type="Ih" /> <mark name="T21" time="2.880000"/> </word> <mark name="T22" time="2.880000"/> <word end="3.599999" start="2.880000" > <viseme start="2.880000" type="BMP" /> <viseme start="2.960000" type="Ih" /> <viseme start="3.039999" type="KG" /> <viseme start="3.119999" type="R" /> <viseme start="3.199999" type="Oh" /> <viseme start="3.279999" type="Z" /> <viseme start="3.359999" type="Ao" /> <viseme start="3.439999" type="F" /> <viseme start="3.519999" type="D" /> <viseme start="3.599999" type="_" /> <mark name="T23" time="3.679999"/> </word> <mark name="T24" time="3.679999"/> <word end="3.759999" start="3.679999" > <viseme start="3.679999" type="Ih" /> <mark name="T25" time="3.759999"/> </word> <mark name="T26" time="3.759999"/> <word end="3.999999" start="3.759999" > <viseme start="3.759999" type="OO" /> <viseme start="3.839999" type="Oh" /> <viseme start="3.919999" type="Z" /> <mark name="T27" time="3.999999"/> </word> <mark name="T28" time="3.999999"/> <word end="4.159998" start="3.999999" > <viseme start="3.999999" type="Ih" /> <viseme start="4.079998" type="D" /> <mark name="T29" time="4.159998"/> </word> <mark name="T30" time="4.159998"/> <word 
end="4.319998" start="4.159998" > <viseme start="4.159998" type="D" /> <viseme start="4.239998" type="Ih" /> <mark name="T31" time="4.319998"/> </word> <mark name="T32" time="4.319998"/> <word end="4.479998" start="4.319998" > <viseme start="4.319998" type="BMP" /> <viseme start="4.399998" type="EE" /> <mark name="T33" time="4.479998"/> </word> <mark name="T34" time="4.479998"/> <word end="4.799998" start="4.479998" > <viseme start="4.479998" type="Ih" /> <viseme start="4.559998" type="BMP" /> <viseme start="4.639998" type="Ih" /> <viseme start="4.719998" type="D" /> <viseme start="4.799998" type="_" /> <mark name="T35" time="4.879998"/> </word> <mark name="T36" time="4.879998"/> <word end="5.039998" start="4.879998" > <viseme start="4.879998" type="D" /> <viseme start="4.959998" type="Ih" /> <mark name="T37" time="5.039998"/> </word> <mark name="T38" time="5.039998"/> <word end="5.279997" start="5.039998" > <viseme start="5.039998" type="D" /> <viseme start="5.119998" type="Ao" /> <viseme start="5.199997" type="KG" /> <mark name="T39" time="5.279997"/> </word> <mark name="T40" time="5.279997"/> <word end="5.439997" start="5.279997" > <viseme start="5.279997" type="D" /> <viseme start="5.359997" type="Ih" /> <mark name="T41" time="5.439997"/> </word> <mark name="T42" time="5.439997"/> <word end="5.599997" start="5.439997" > <viseme start="5.439997" type="OO" /> <viseme start="5.519997" type="Oh" /> <mark name="T43" time="5.599997"/> </word> <viseme start="5.599997" type="_" /> </speak>
MSSpeechRelay Example
Text sent to MSSpeech:
<speak version="1.0" xml:lang="en-US"> <mark name="sp1:T0" />SAPI <mark name="sp1:T1" /> <mark name="sp1:T2" />is <mark name="sp1:T3" /> <mark name="sp1:T4" />a <mark name="sp1:T5" /> <mark name="sp1:T6" />speak <mark name="sp1:T7" /> <mark name="sp1:T8" />and <mark name="sp1:T9" /> <mark name="sp1:T10" />text <mark name="sp1:T11" /> <mark name="sp1:T12" />to <mark name="sp1:T13" /> <mark name="sp1:T14" />speak <mark name="sp1:T15" /> <mark name="sp1:T16" />interface <mark name="sp1:T17" /> <mark name="sp1:T18" />by <mark name="sp1:T19" /> <mark name="sp1:T20" />Microsoft. <mark name="sp1:T21" /> <mark name="sp1:T22" />I <mark name="sp1:T23" /> <mark name="sp1:T24" />use <mark name="sp1:T25" /> <mark name="sp1:T26" />it <mark name="sp1:T27" /> <mark name="sp1:T28" />to <mark name="sp1:T29" /> <mark name="sp1:T30" />be <mark name="sp1:T31" /> <mark name="sp1:T32" />able <mark name="sp1:T33" /> <mark name="sp1:T34" />to <mark name="sp1:T35" /> <mark name="sp1:T36" />talk <mark name="sp1:T37" /> <mark name="sp1:T38" />to <mark name="sp1:T39" /> <mark name="sp1:T40" />you. <mark name="sp1:T41" />. </speak>
Reply:
RemoteSpeechReply brad 4 OK: <?xml version="1.0" encoding="UTF-8"?> <speak> <soundFile name="d:\edwork\vhtoolkit\data\cache\audio\utt_20110809_154741_brad_4.wav"/> <viseme start="0" type="_"/> <mark name="T0" time="0.003"/> <word end="0.347" start="0.003"> <viseme start="0.003" type="Z"/> <viseme start="0.099" type="Ih"/> <viseme start="0.196" type="BMP"/> <viseme start="0.259" type="Ih"/> <mark name="T1" time="0.347"/> </word> <mark name="T2" time="0.347"/> <word end="0.465" start="0.347"> <viseme start="0.347" type="Ih"/> <viseme start="0.416" type="Z"/> <mark name="T3" time="0.465"/> </word> <mark name="T4" time="0.465"/> <word end="0.527" start="0.465"> <viseme start="0.465" type="Ih"/> <mark name="T5" time="0.527"/> </word> <mark name="T6" time="0.527"/> <word end="0.874" start="0.527"> <viseme start="0.527" type="Z"/> <viseme start="0.605" type="BMP"/> <viseme start="0.683" type="Ih"/> <viseme start="0.795" type="KG"/> <mark name="T7" time="0.874"/> </word> <mark name="T8" time="0.874"/> <word end="1.053" start="0.874"> <viseme start="0.874" type="Ih"/> <viseme start="0.957" type="D"/> <viseme start="1.04" type="D"/> <mark name="T9" time="1.053"/> </word> <mark name="T10" time="1.053"/> <word end="1.401" start="1.053"> <viseme start="1.053" type="D"/> <viseme start="1.119" type="Ih"/> <viseme start="1.238" type="KG"/> <viseme start="1.295" type="Z"/> <viseme start="1.348" type="D"/> <mark name="T11" time="1.401"/> </word> <mark name="T12" time="1.401"/> <word end="1.47" start="1.401"> <viseme start="1.401" type="D"/> <viseme start="1.442" type="Oh"/> <mark name="T13" time="1.47"/> </word> <mark name="T14" time="1.47"/> <word end="1.878" start="1.47"> <viseme start="1.47" type="Z"/> <viseme start="1.547" type="BMP"/> <viseme start="1.624" type="Ih"/> <viseme start="1.736" type="KG"/> <mark name="T15" time="1.878"/> </word> <mark name="T16" time="1.878"/> <word end="2.523" start="1.878"> <viseme start="1.878" type="Ih"/> <viseme start="1.955" type="D"/> <viseme 
start="2.032" type="D"/> <viseme start="2.075" type="Ih"/> <viseme start="2.11" type="R"/> <viseme start="2.145" type="F"/> <viseme start="2.257" type="Ih"/> <viseme start="2.399" type="Z"/> <mark name="T17" time="2.523"/> </word> <mark name="T18" time="2.523"/> <word end="2.665" start="2.523"> <viseme start="2.523" type="D"/> <viseme start="2.554" type="Ih"/> <mark name="T19" time="2.665"/> </word> <mark name="T20" time="2.665"/> <word end="3.931" start="2.665"> <viseme start="2.665" type="BMP"/> <viseme start="2.753" type="Ih"/> <viseme start="2.841" type="KG"/> <viseme start="2.913" type="R"/> <viseme start="2.943" type="Ih"/> <viseme start="2.973" type="Z"/> <viseme start="3.067" type="Ao"/> <viseme start="3.202" type="F"/> <viseme start="3.255" type="D"/> <viseme start="3.308" type="_"/> <viseme start="3.928" type="_"/> <mark name="T21" time="3.931"/> </word> <mark name="T22" time="3.931"/> <word end="4.067" start="3.931"> <viseme start="3.931" type="Ih"/> <mark name="T23" time="4.067"/> </word> <mark name="T24" time="4.067"/> <word end="4.336" start="4.067"> <viseme start="4.067" type="Ih"/> <viseme start="4.17" type="Oh"/> <viseme start="4.273" type="Z"/> <mark name="T25" time="4.336"/> </word> <mark name="T26" time="4.336"/> <word end="4.474" start="4.336"> <viseme start="4.336" type="Ih"/> <viseme start="4.403" type="D"/> <mark name="T27" time="4.474"/> </word> <mark name="T28" time="4.474"/> <word end="4.54" start="4.474"> <viseme start="4.474" type="D"/> <viseme start="4.515" type="Oh"/> <mark name="T29" time="4.54"/> </word> <mark name="T30" time="4.54"/> <word end="4.691" start="4.54"> <viseme start="4.54" type="D"/> <viseme start="4.588" type="Ih"/> <mark name="T31" time="4.691"/> </word> <mark name="T32" time="4.691"/> <word end="5.051" start="4.691"> <viseme start="4.691" type="Ih"/> <viseme start="4.847" type="D"/> <viseme start="4.901" type="Ih"/> <viseme start="4.976" type="D"/> <mark name="T33" time="5.051"/> </word> <mark name="T34" 
time="5.051"/> <word end="5.15" start="5.051"> <viseme start="5.051" type="D"/> <viseme start="5.095" type="Oh"/> <mark name="T35" time="5.15"/> </word> <mark name="T36" time="5.15"/> <word end="5.469" start="5.15"> <viseme start="5.15" type="D"/> <viseme start="5.244" type="Ao"/> <viseme start="5.404" type="KG"/> <mark name="T37" time="5.469"/> </word> <mark name="T38" time="5.469"/> <word end="5.64" start="5.469"> <viseme start="5.469" type="D"/> <viseme start="5.584" type="Oh"/> <mark name="T39" time="5.64"/> </word> <mark name="T40" time="5.64"/> <word end="6.558" start="5.64"> <viseme start="5.64" type="Ih"/> <viseme start="5.784" type="Oh"/> <viseme start="5.928" type="_"/> <viseme start="6.553" type="_"/> <mark name="T41" time="6.558"/> </word> <viseme start="6.558" type="_"/> </speak>
Saso Agent Example
Start Saso - sbm, nvbg, nlu, Fake Recognizer, Agent 1. Click "hello gentlemen".
RemoteSpeechCmd speak doctor-perez 1 M021 ../../data/cache/audio/utt_20110809_193606_doctor-perez_1.aiff <?xml version="1.0" encoding="UTF-8"?> <speech id="sp1" ref="" type="application/ssml+xml"> <mark name="T0" />hello <mark name="T1" /> <mark name="T2" />captain <mark name="T3" /> </speech>
RVoiceRelay Example
Text sent to Rvoice:
<?xml version="1.0" encoding="UTF-8"?> <speech id="sp1" ref="" type="application/ssml+xml"> <mark name="T0" />hello <mark name="T1" /> <mark name="T2" />captain <mark name="T3" /> </speech>
Reply:
RemoteSpeechReply doctor-perez 1 OK: <?xml version="1.0" encoding="UTF-8"?> <speak> <soundFile name="d:\edwork\saso\core\beavin\..\..\data\cache\audio\utt_20110809_193606_doctor-perez_1.aiff"/> <viseme start="0.0" type="_"/> <viseme start="0.0" type="_"/> <mark name="T0" time="0.049977324263038546"/> <word end="0.33696145124716553" start="0.049977324263038546"> <viseme start="0.049977324263038546" type="Ih"/> <viseme start="0.14498866213151929" type="Ih"/> <viseme start="0.2" type="D"/> <viseme start="0.24997732426303854" type="OW"/> </word> <mark name="T2" time="0.33696145124716553"/> <mark name="T1" time="0.33696145124716553"/> <word end="0.8029931972789116" start="0.33696145124716553"> <viseme start="0.33696145124716553" type="KG"/> <viseme start="0.39696145124716553" type="Ih"/> <viseme start="0.4819954648526077" type="BMP"/> <viseme start="0.5419954648526077" type="D"/> <viseme start="0.6399546485260771" type="Ih"/> <viseme start="0.7029931972789115" type="NG"/> </word> <mark name="T3" time="0.8029931972789116"/> <viseme start="0.8029931972789116" type="_"/> <viseme start="0.8529705215419501" type="_"/> </speak>