Monday, April 4, 2011

GSoC idea: Voice Control for the Linux Desktop

As this worked so well last time, I want to use this blog post to present another idea for the Google Summer of Code 2011 that has not yet found an interested student.

Motivation
The simon system currently has plugins to trigger shortcuts, simulate clicks and interact directly with applications through IPC technologies like DBus and JSON. This makes simon well suited for interacting with a wide variety of applications, as long as it is configured for each application beforehand.

To reduce this configuration effort, we have the scenario system that allows users to exchange such configurations online. This repository already covers many of the "standard" applications.
Still, the user has to actively pick which applications to control. If there is no scenario available for an application, things get a bit more complicated.

So how could we create dynamic scenarios that allow the user to control new applications without configuring anything?

Well, let's look at what's needed to voice-control an application.

Commands
First of all, we need to know what options are currently available.

Let's look at KWrite as an example application:

Just looking at the screenshot, a human can quickly tell that there are at least the following commands: "New", "Open", "Save", "Save As", "File", "Edit", etc.

Well, if screen readers can read those options to the user, why shouldn't simon parse them automatically as well?

With the upcoming AT-SPI-2 and the Qt accessibility bridge, the user interface (including buttons, menu items, etc.) is exported over DBus.

As elements can also be triggered (clicked / selected) over this interface, simon can easily "read" running applications and create appropriate commands.
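
To make the idea concrete, here is a minimal sketch of how commands could be harvested from such a tree. It does not talk to AT-SPI-2 at all: the `Widget` class, the role names in `ACTIONABLE_ROLES` and the small KWrite-like tree are hypothetical stand-ins for whatever the accessibility interface actually exports.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List

# Roles we treat as "triggerable by voice" (an illustrative assumption,
# not a list taken from the AT-SPI-2 specification).
ACTIONABLE_ROLES = {"push button", "menu", "menu item"}


@dataclass
class Widget:
    """Hypothetical stand-in for one accessible object exported over DBus."""
    name: str
    role: str
    children: List["Widget"] = field(default_factory=list)


def walk(widget: Widget) -> Iterator[Widget]:
    """Yield the widget and all of its descendants, depth first."""
    yield widget
    for child in widget.children:
        yield from walk(child)


def collect_commands(root: Widget) -> Dict[str, Widget]:
    """Map each spoken command (the widget's name) to the widget to trigger."""
    commands: Dict[str, Widget] = {}
    for widget in walk(root):
        if widget.role in ACTIONABLE_ROLES and widget.name:
            # Keep the first widget found if a name occurs more than once.
            commands.setdefault(widget.name, widget)
    return commands


# A hand-written tree that loosely mimics the KWrite example.
kwrite = Widget("KWrite", "frame", [
    Widget("File", "menu", [
        Widget("New", "menu item"),
        Widget("Open", "menu item"),
        Widget("Save", "menu item"),
        Widget("Save As", "menu item"),
    ]),
    Widget("Edit", "menu"),
    Widget("", "separator"),
])

commands = collect_commands(kwrite)
```

In a real plugin the tree would come from the accessibility bus and triggering a command would invoke the element's action there; the traversal and filtering logic would stay roughly the same.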

Best of all: Because screenreaders are well established, many applications already make sure that this will work properly.

Vocabulary and Grammar
Now that we have our commands in place, simon still needs to recognize all those words ("New", "Save", etc.) that are probably not in the user's active vocabulary.

As speech recognition systems need a phonetic description of each word, that is not trivial.

...if it weren't for Sequitur. Sequitur is a grapheme-to-phoneme converter that translates any given text to a phonetic description.

The system can be compared to a native speaker: Even if you have never heard a word spoken out loud you still have at least a rough idea about how to pronounce it. That's because there are certain rules in any language that you know even if you aren't aware of them.
Sequitur works in much the same way: it learns those rules by reading large dictionaries. With the generated model it can transcribe even words that were not in the input dictionary.
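
The "learn rules from a dictionary, then apply them to unseen words" principle can be illustrated with a deliberately tiny toy. This is not Sequitur's algorithm, which is far more capable: the sketch simply assumes one phoneme per letter and picks the most frequent one. The function names, the phoneme symbols and the four-word dictionary are made up for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def train(dictionary: Dict[str, List[str]]) -> Dict[str, str]:
    """Learn the most frequent phoneme for each letter (toy model)."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for word, phonemes in dictionary.items():
        if len(word) != len(phonemes):
            continue  # this toy only handles one phoneme per letter
        for letter, phoneme in zip(word.lower(), phonemes):
            counts[letter][phoneme] += 1
    return {letter: c.most_common(1)[0][0] for letter, c in counts.items()}


def transcribe(word: str, model: Dict[str, str]) -> List[str]:
    """Transcribe a word letter by letter; letters never seen are skipped."""
    return [model[letter] for letter in word.lower() if letter in model]


# Made-up miniature pronunciation dictionary (illustrative data only).
toy_dictionary = {
    "ten": ["t", "eh", "n"],
    "net": ["n", "eh", "t"],
    "sat": ["s", "ae", "t"],
    "nap": ["n", "ae", "p"],
}

model = train(toy_dictionary)
pronunciation = transcribe("pen", model)  # "pen" is not in the dictionary
```

The word "pen" never appears in the training data, yet the model still produces a plausible transcription from the letter rules it picked up. A real converter additionally has to handle letters that map to several phonemes, to none, or differently depending on context.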

In our tests, Sequitur proved to be very reliable, accurate and quite fast.

simon already allows the user to specify a dictionary large enough to act as the information source for Sequitur: the shadow dictionary. Because there are already import mechanisms for most major pronunciation dictionary formats, there is already more than enough raw material available to "feed" to Sequitur.

Now that we have the vocabulary, setting up an appropriate grammar is very easy. Just make sure that all the sentences of the created commands are allowed.
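
As a rough sketch of that step, the grammar can be thought of as the set of word sequences that correspond to a command. The representation below (tuples of lowercase words) and the helper names are assumptions chosen for brevity and do not reflect simon's actual grammar format.

```python
from typing import Iterable, List, Set, Tuple

Sentence = Tuple[str, ...]


def build_grammar(command_names: Iterable[str]) -> Tuple[List[str], Set[Sentence]]:
    """Derive the vocabulary and the allowed sentences from command names."""
    sentences = {tuple(name.lower().split()) for name in command_names}
    vocabulary = sorted({word for sentence in sentences for word in sentence})
    return vocabulary, sentences


def is_allowed(utterance: str, sentences: Set[Sentence]) -> bool:
    """Check whether a recognized utterance matches one of the commands."""
    return tuple(utterance.lower().split()) in sentences


vocabulary, sentences = build_grammar(["New", "Save", "Save As"])
```

Every word in `vocabulary` would need a phonetic transcription, and only utterances in `sentences` would be accepted, so "save as" is valid while "as save" is not.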

For static models, no training data is required, so that is all that would be needed.

Summary
With a combination of AT-SPI-2 and Sequitur one could quite easily extend the current simon version to automatically create working voice commands for all standard widgets of running applications.

This would allow the user of a static model to comfortably use any application without any application-specific configuration at all.

Because AT-SPI-2 is a freedesktop.org standard, the resulting system would automatically work with Qt and KDE applications as well as GNOME applications.

If you are interested in working on this idea, please send me an email.

1 comment:

Anonymous said…

I personally think that this idea is great and it would be great to find someone willing to work on it. Something like this is currently missing on Linux desktops, and it is sad to see that it is missing while the basic architecture for it already exists.