Critique 1: Visual Interaction – Speak Up Display

I recently saw a talk on campus by Dr. Xuedong Huang, founder of Microsoft’s speech technology group. He gave a demo of the latest speech to text on Azure, combined with HoloLens, and I have to say I was impressed.

They went from this failure several years ago:

To this more recently (Speech to text -> translation -> text to speech… in your own voice… and a hologram for good measure):


This got me thinking that a more earthbound and practical application of this could be prototyped today, so I decided to make a heads-up display for speech to text that functions external to a computer or smartphone.

If you are unable to hear, whether due to a medical condition or just because you have music playing in your headphones, you are likely to miss things going on around you. I personally share an office with four other people, and I’m often found tucked away in the back corner with my earbuds in, completely unaware that the other four are trying to talk to me.

Similarly, my previous research found that a common issue for people who are deaf is being startled by someone coming up behind them, since they cannot hear their name being called.

With this use case in mind, I created an appliance that sits on a desktop within sight, but the majority of the time it does its best not to attract attention.

I realize it would be easy enough to pop open another window and display something on a computer screen, but that would either have to be a window always on top, or a bunch of notifications, so it seemed appropriate to take the display off screen to what would normally be the periphery.

The other advantage is a social one. If I look at my laptop screen while I’m supposed to be listening to you, you might think you’re being ignored. With a big microphone between us, on a dedicated box with a simple text display, I’m able to glance over at it as I face you in conversation or in a lecture.

When it hears speech, it displays the text on the LCD screen for a moment, and then the text scrolls off, leaving the screen blank when the room is quiet. This allows the user to glance over if they’re curious about what is being said around them:

Things get more interesting when the system recognizes keywords such as the user’s name. It can be triggered to flash a colored light, in this case green, to draw attention and let the user know that someone is calling for them.

Finally, other events can be detected and trigger messages on the screen and LED flashes.

The wiring is fairly simple. The board uses its onboard Neopixel RGB LED to trigger the color coded alerts, and the LCD screen just takes a (one way) serial connection.

Initially the project began with a more elaborate code base, but it has been scaled down to a more elegant system with a simple API for triggering text and LED displays.

A serial connection is established to the computer, and the processor listens for strings. If a string is shorter than 16 characters, the processor pads it for clean display; if it has a 17th character, that character is checked for a color code:

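// Set the onboard NeoPixel from the color code in the 17th character (index 16).
void setled(String textandcolor){
    if (textandcolor.length() < 17) return;  // no color code in this message
    switch(textandcolor[16]) {
      case 'R':
        strip.setPixelColor(0, strip.Color(255,0,0));  // red
        break;
      case 'G':
        strip.setPixelColor(0, strip.Color(0,255,0));  // green
        break;
      case 'B':
        strip.setPixelColor(0, strip.Color(0,0,255));  // blue
        break;
      case 'X':
        strip.setPixelColor(0, strip.Color(0,0,0));    // off
        break;
    }
    strip.show();  // push the new color out to the LED
}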
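
The padding and color-code rule described above can also be sketched in plain, desktop C++ so it can be tried without the board. This is only an illustration, not the project's Arduino code: `Frame`, `parseFrame`, and `kLineWidth` are hypothetical names, and truncating anything past the 17th character is an assumption.

```cpp
#include <cstddef>
#include <string>

// Width of one LCD line, taken from the 16-character limit in the text.
constexpr std::size_t kLineWidth = 16;

// Hypothetical container for one parsed message.
struct Frame {
    std::string text;  // always exactly kLineWidth characters
    char color;        // 'R', 'G', 'B', 'X', or '\0' when no code was sent
};

// Pad short strings with spaces so leftover characters on the LCD are
// overwritten, and read the optional color code from index 16.
Frame parseFrame(const std::string& message) {
    Frame frame;
    frame.text = message.substr(0, kLineWidth);  // keep at most 16 characters
    frame.text.resize(kLineWidth, ' ');          // pad with spaces if shorter
    frame.color = message.size() > kLineWidth ? message[kLineWidth] : '\0';
    return frame;
}
```

A message such as "hello" comes back as 16 characters with no color code, while a 17-character message ending in 'G' yields the same padded text plus a green alert.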

A computer that uses the appliance’s microphone to listen to nearby speech can send the audio off to be transcribed, then feed the text to the screen 16 characters at a time while watching for keywords or phrases. (This is still in progress, but the communication bus from the computer to the board is fully functional for text and LED triggers.)
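
Since the computer side is still in progress, here is only a rough sketch of how a transcript might be cut into 16-character lines and tagged with a color code when a keyword shows up. Everything here is an assumption for illustration: the function names, the word-wrapping rule, the simple substring match (so "matt" would also match "matter"), and the choice of 'G' as the alert color.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

constexpr std::size_t kLineWidth = 16;  // one LCD line

// Lowercase copy, used for case-insensitive keyword matching.
std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Greedy word wrap: fill each line with whole words up to kLineWidth.
std::vector<std::string> wrapToLines(const std::string& transcript) {
    std::vector<std::string> lines;
    std::istringstream words(transcript);
    std::string word, line;
    while (words >> word) {
        // Hard-split any word that is longer than a whole line.
        while (word.size() > kLineWidth) {
            if (!line.empty()) { lines.push_back(line); line.clear(); }
            lines.push_back(word.substr(0, kLineWidth));
            word.erase(0, kLineWidth);
        }
        // Start a new line if this word would not fit on the current one.
        if (!line.empty() && line.size() + 1 + word.size() > kLineWidth) {
            lines.push_back(line);
            line.clear();
        }
        if (!line.empty()) line += ' ';
        line += word;
    }
    if (!line.empty()) lines.push_back(line);
    return lines;
}

// Pad each line to 16 characters and append the alert color as the 17th
// character when the line contains any keyword.
std::vector<std::string> buildFrames(const std::string& transcript,
                                     const std::vector<std::string>& keywords,
                                     char alertColor = 'G') {
    std::vector<std::string> frames;
    for (std::string line : wrapToLines(transcript)) {
        const std::string lowered = toLower(line);
        bool hit = false;
        for (const std::string& keyword : keywords) {
            if (keyword.empty()) continue;  // an empty keyword would match everything
            if (lowered.find(toLower(keyword)) != std::string::npos) hit = true;
        }
        line.resize(kLineWidth, ' ');
        if (hit) line += alertColor;
        frames.push_back(line);
    }
    return frames;
}
```

Each returned string could then be written to the serial port as one message. A keyword phrase that happens to be split across two lines would be missed by this per-line check, so a real version would probably match against the full transcript first.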

After some experimenting, it seems that the best way to display the text is to start at the bottom line, and have it scroll upwards (a bit like a teleprompter) one line at a time every half a second. Faster became hard to keep up with, and slower felt like a delayed reaction. (Arduino code + Fritzing diagram)
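
The teleprompter-style scrolling can be modeled with a small row buffer. This is a desktop C++ sketch rather than the linked Arduino code: the class name is made up, and the number of rows is left as a parameter because the text does not state the LCD's size. Only the half-second interval comes from the description above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Scroll interval from the text: one line every half a second.
constexpr int kScrollIntervalMs = 500;

// Hypothetical model of the LCD contents, top row first.
class ScrollingDisplay {
public:
    ScrollingDisplay(std::size_t rows, std::size_t cols)
        : cols_(cols), rows_(rows, std::string(cols, ' ')) {}

    // Shift every row up by one and place the new text on the bottom row.
    void pushLine(std::string text) {
        if (rows_.empty()) return;     // nothing to draw on
        text.resize(cols_, ' ');       // pad or truncate to the row width
        rows_.erase(rows_.begin());    // the top row scrolls off
        rows_.push_back(text);         // new text enters at the bottom
    }

    // When the room is quiet, push blank lines so old text scrolls away.
    void pushBlank() { pushLine(""); }

    const std::vector<std::string>& rows() const { return rows_; }

private:
    std::size_t cols_;
    std::vector<std::string> rows_;
};
```

On the board, something like `pushLine` would run once every `kScrollIntervalMs` milliseconds, with `pushBlank` standing in whenever no new text has arrived.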

I’d love to expand this to do translation (these services have come a long way as well), and perhaps migrate to a Raspberry Pi to do the web API portion so that the computer can be closed and put away.


I made the system more interactive by making the microphone (big black circle in the images above) into a button. While you hold the button, it listens to learn new keywords, and then alerts when it hears those words. Over time, keywords decay.

The idea of the decay is that you would trigger the system when you hear something it should tell you about, and if you don’t trigger it the next time it hears it, it becomes slightly less likely to trigger again. This also begins to filter out common words from more important keywords.

This weighting system is merely a placeholder for a more sophisticated system.
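
One way such a placeholder weighting could work is sketched below in plain C++. None of the numbers or names come from the project: the boost, decay factor, alert threshold, and forget cutoff are all assumptions picked to make the behavior easy to follow.

```cpp
#include <algorithm>
#include <map>
#include <string>

// Assumed tuning values, chosen only for illustration.
constexpr double kBoost = 0.5;      // added each time the user flags a word
constexpr double kDecay = 0.8;      // multiplier each time a word is heard
constexpr double kThreshold = 0.3;  // minimum weight that raises an alert
constexpr double kForget = 0.05;    // below this the word is dropped

class KeywordWeights {
public:
    // Called for words heard while the microphone button is held.
    void reinforce(const std::string& word) {
        double& weight = weights_[word];  // new words start at 0.0
        weight = std::min(1.0, weight + kBoost);
    }

    // Called for every word heard. Returns true if it should raise an alert.
    // Hearing a word always decays it, so only words the user keeps
    // reinforcing stay above the threshold.
    bool heard(const std::string& word) {
        auto it = weights_.find(word);
        if (it == weights_.end()) return false;  // never learned
        const bool alert = it->second >= kThreshold;
        it->second *= kDecay;
        if (it->second < kForget) weights_.erase(it);
        return alert;
    }

private:
    std::map<std::string, double> weights_;
};
```

With these values, a word flagged once alerts the next three times it is heard and then goes quiet unless it is flagged again. Common words are heard often, so they decay quickly, which matches the filtering effect described above.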

STT Update

Author: Matt Franklin

I'm a recovering engineer + sales guy... BSEE from UMD 2004, and then 15 years of working with signal processing, AV, control systems, networking, and other gadgetry (mostly B2B). Now I'm in the Master of Human-Computer Interaction program, graduating in August 2020.

I have pretty solid experience with:
- Linux
- audio
- video
- rs232/422/485/midi/dmx protocols
- sketchup and other cad tools
- soldering
- music (mostly guitar, but others too)
- general troubleshooting
- networking (wired + wireless)
- signal processing
- streaming video/audio
- python

I have some experience with or am mediocre at:
- woodworking
- welding
- laser cutting
- sewing
- reverse engineering
- ML (none really but I'm currently in a class)
- some javascript
- rusty with C++ and Java, but used to be decent
- tube amplifiers
- RF
