Friday, March 2, 2012

Hello! Hello! Can you hear me?

I believe that telephone-based applications offer a huge benefit in terms of ease of use compared with web-based applications. However, many people don't even try to create applications with a telephone interface because they mistakenly believe that it is very hard to do. Now that we are on the eve of Dublin Science Hackday, I decided it would be a good idea to show people how easy it is to develop applications with a phone-based interface by describing a simple application I developed myself.
First, let me explain some of the background for why I developed the application. I used to work on the development of a computer telephony system. We were strong believers in the theory that developers need a good understanding of the end users' perspective on the system, and so we encouraged all of the development team to use early builds of the system as much as possible.
Some of the Headphones I Use.
To be totally honest, the experience was painful in the early days. Each time I made a call I knew there was a significant chance that it would not be successful. There might be a bug in the latest daily build of the client installed on my machine, or in the server code, which was also updated regularly. There was also a very significant chance of some problem with the volume settings on my headset. Most headsets have hardware volume controls and/or mute options, and these controls might not be set properly to interact with the volume settings in my laptop's operating system. And because I carried my headsets around in a bag with other hardware, they frequently suffered physical damage.
Because of all of these potential problems I often spent the first few minutes of a telephone meeting shouting "Hello! Hello! Can you hear me?". If I was speaking to another team member I could expect them to be understanding of this wasted time and/or poor audio quality while I tested several headsets to find which was working best. However, when I was making an important call to someone I wanted to impress, I needed some way to be totally confident that all aspects of my telephony setup were working correctly.
Anyone who uses Skype is probably familiar with the "echo123" virtual user. This is a virtual Skype account that anyone can call and be answered by a pleasant-sounding lady who will listen to what you say and then repeat it back to you as it sounds to her. I decided to hack together something similar that could be used with any telephony system. After a bit of searching on the internet I found the Voxeo developer site, which offers excellent free resources to anyone wanting to develop voice-based applications. Voxeo make their money from providing commercial-grade voice response systems for mission-critical applications, but in order to convince people how easy it is to develop a user-friendly voice interface to their system, they give developers free access to their powerful web-based development environment, and they will even host your application on their test servers so that you can try it out in action.
Voxeo support a number of programming languages, including the industry-standard VoiceXML. Developing a VoiceXML server is very complex, but the good news is that since Voxeo have done that, you don't have to. Developing a VoiceXML application is very easy (there are excellent tutorials on the Voxeo site to get you started). I was able to develop my application in under 30 lines of XML that is easy to write and understand. You can get the full source code here.
The way VoiceXML works is that you specify prompts for the system to play and then you listen for the user to say something (or press a key on their keypad to send a DTMF tone). You specify in XML what should be done with the response. You can see I have only one statement and I use the text to speech function to generate the prompt (it is also possible to record the prompts for a more natural sounding interface).
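To make the two kinds of prompt concrete, here is a minimal sketch using standard VoiceXML 2.x elements. It is an illustration rather than an extract from the application: the form id, the wording and the file name welcome.wav are all made up for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="greeting">
    <block>
      <!-- Text to speech: the platform synthesizes this sentence. -->
      <prompt>Hello. This is a test call.</prompt>

      <!-- Recorded prompt: welcome.wav is a hypothetical file name.
           The text inside the audio element is spoken instead if the
           file cannot be fetched or played. -->
      <prompt>
        <audio src="welcome.wav">Hello. This is a test call.</audio>
      </prompt>
    </block>
  </form>
</vxml>
```

Keeping the fallback text inside the audio element means the call still works while the recordings are being made.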

The only complex line in my code is the one that reads
<record name="R_1" beep="true" dtmfterm="true" maxtime="10s" finalsilence="1s" silence="3s">
Translated into English this tag means:
  • Record what you hear in a file named R_1.wav
  • If you hear a DTMF tone, stop recording
  • Listen for a maximum of 10 seconds
  • If you hear nothing give up after 3 seconds
  • If you hear something then terminate when the speaker leaves a gap of 1 second or more
The rest of the application is just instructions to play back the recording to the user (or play an error message if we didn't hear anything). It then loops back to the start so that it gives me time to adjust my headset and see if it sounds any clearer.
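The record, play back and loop flow described above might be sketched roughly as follows. This is not the application's actual source, just an outline built from standard VoiceXML 2.x elements; the form id and prompt wording are invented, and the three second wait for input is expressed with the standard timeout attribute on the prompt rather than the silence attribute in the line quoted above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="echo">
    <!-- Record up to 10 seconds; stop on a DTMF key or after
         1 second of silence once the caller has started speaking. -->
    <record name="R_1" beep="true" dtmfterm="true"
            maxtime="10s" finalsilence="1s">
      <!-- timeout: how long to wait for the caller to start speaking
           before a noinput event is thrown. -->
      <prompt timeout="3s">Please say something after the beep.</prompt>

      <!-- Nothing was heard: play an error message and ask again. -->
      <noinput>
        <prompt>Sorry, I did not hear anything.</prompt>
        <reprompt/>
      </noinput>

      <!-- Something was recorded: play it back, then clear the
           variable so the form asks for a new recording. -->
      <filled>
        <prompt>This is what I heard. <audio expr="R_1"/></prompt>
        <clear namelist="R_1"/>
      </filled>
    </record>
  </form>
</vxml>
```

Clearing R_1 in the filled block is what produces the loop: the form sees that the recording is empty again and goes back to the prompt, which continues until the caller hangs up.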
Obviously real-world applications can get more complex, and if you try to recognize what the user is saying, the recognizer can get things hilariously wrong when the caller is not a native speaker. But the general idea is not too hard to master. In any case, here we only need to distinguish between hearing something, in which case the "filled" tag applies, and hearing nothing, in which case the "noinput" tag applies.
If you want to try out the application you can call +1(617)963-0648 to get the version with this source code, or if you prefer the sound of my voice you can call +1(617)500-5332 to hear a slightly modified version where the prompts use a recording of my voice.


  1. Interesting. I used Twilio at the IDEO Makeathon recently - sounds like it has a similar API and approach.

  2. I never heard of Twilio before, but it does indeed seem to provide similar functionality to Voxeo. I also like the way you can browse apps on their web site by different dimensions.