Sunday, 7 November 2010

How simon learned to talk

I have finally found the time for a long overdue blog update :). I promised this back in September when I blogged about the dialog system: a few words about simon's text to speech infrastructure.

Because the next version of simon will be able to interact with the user through dialogs, we wanted to enable simon to actually "talk" to the user by means of a text to speech system.

Of course we didn't reinvent the wheel but rather looked around at the available open source solutions. We needed something cross-platform that works at least with English, German and Italian.

Naturally, Jovie (formerly KTTSD, KDE's text to speech system) is the obvious choice, but it is not yet cross-platform because it uses speech dispatcher, which only works on Linux. It also wasn't very stable when I tried it, and it had quite a few rough edges and missing features.

Furthermore, the best (open) German voices I could find were HTS voices developed with and for the OpenMARY framework. In theory they should also work with festival, so they could be used with Jovie as well if someone wrote a festival configuration set for them. OpenMARY is cross-platform and provides very high quality synthesis, but it is a very big and heavy Java dependency that needs a lot of resources and is quite slow, even on current hardware (synthesizing a paragraph of text takes around 10 seconds on a nettop).

So we decided to do what we always do and leave the final choice to the end user:

simon's TTS framework now allows you to use Jovie (the default), a generic web service (like OpenMARY), or sound snippets you record yourself.

The last option is especially helpful if you are dealing with a language for which no good open voices exist yet, or if your users have trouble understanding synthesized voices.


Simply create a new TTS set for your speaker (the person recording the sound bites) and record the needed texts with him or her. When recording texts, simon shows you a list of recently synthesized texts, so you can record whole dialogs quite quickly. Instead of using Jovie or OpenMARY to synthesize the text, simon will then play back these recordings.

These TTS sets can be exported and imported, so you can share your sound snippets with others, for example alongside the scenario containing the dialog that uses them.

Multiple TTS backends can be used simultaneously, which means that you can rely primarily on pre-recorded sound bites but fall back to a TTS system for dialog paths you have not (yet) recorded.
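To make the fallback idea a bit more concrete, here is a minimal sketch in Python. It is not simon's actual code (simon is not written in Python); the names `Backend`, `make_recorded_backend`, `fake_synthesizer` and `speak` are made up for illustration, and the "audio" is just placeholder bytes. The only assumption is that each backend either returns audio for a text or signals that it cannot handle it.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical backend type: takes the text to say and returns audio data,
# or None if this backend cannot handle that text.
Backend = Callable[[str], Optional[bytes]]


def normalize(text: str) -> str:
    """Make lookups tolerant of case and surrounding whitespace."""
    return text.strip().lower()


def make_recorded_backend(recordings: Dict[str, bytes]) -> Backend:
    """Backend that only knows texts that were recorded beforehand."""
    table = {normalize(text): audio for text, audio in recordings.items()}

    def say(text: str) -> Optional[bytes]:
        return table.get(normalize(text))

    return say


def fake_synthesizer(text: str) -> Optional[bytes]:
    """Stand-in for a real TTS engine: it can 'synthesize' any text."""
    return b"SYNTH:" + text.encode("utf-8")


def speak(text: str, backends: List[Backend]) -> Optional[bytes]:
    """Try the backends in order and return the first result that is not None."""
    for backend in backends:
        audio = backend(text)
        if audio is not None:
            return audio
    return None


# Recordings come first, the synthesizer is the fallback.
recorded = make_recorded_backend({"Hello, how can I help?": b"RECORDED-HELLO"})
chain = [recorded, fake_synthesizer]

greeting = speak("Hello, how can I help?", chain)  # served from the recording
farewell = speak("Goodbye", chain)                 # falls back to the synthesizer
```

The order of the list is the whole policy here: put the recordings first and anything not recorded yet ends up at the synthesizer.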

You can find an online demonstration of the OpenMARY voices on their homepage and a demonstration of simon's dialog system using Jovie on YouTube.
