Your Voice is Puppeteering an Avatar in my Brain

November 23, 2014

I have been having a lot of video chat conversations recently with a colleague who is on the opposite side of the continent.

Now that we have been “seeing” each other on a weekly basis, we have become very familiar with each other’s voices, facial expressions, gesticulations, and so on.

But, as is common with any video conferencing system: the audio and video signal is unpredictable. Often the video signal totally freezes up, or it lags behind the voice. It can be really distracting when the facial expressions and mouth movements do not match the sound I’m hearing.

Sometimes we prefer to just turn off the video and stick with voice.

One day after turning off the video, I came to the realization that I have become so familiar with his body language that I can pretty much guess what I would be seeing as he spoke. Basically, I realized that…


Since the voice of my colleague is normally synchronized with his physical gesticulations, facial expressions, and body motions, I can easily imagine the visual counterpart to his voice.

This is not new to video chat. It has been happening for a long time with telephone, when we speak with someone we know intimately.


In fact, it may have even happened at the dawn of our species.

According to gestural theory, physical, visible gesture was once the primary communication modality in our ape ancestors. Then, our ancestors began using their hands increasingly for tool manipulation—and this created evolutionary pressure for vocal sounds to take over as the primary language delivery method. The result is that we humans can walk, use tools, and talk, all at the same time.

As gestures gave way to audible language, our ancestors could keep looking for nuts and berries while their companions were yacking on.

Here’s the point: The entire progression from gesture to voice remains as a vestigial pathway in our brains. And this is why I so easily imagine my friend gesturing at me as I listen to his voice.

Homunculi and Mirror Neurons

There are many complex structures in my brain, including several body maps that represent the positions, movements and sensations within my physical body. There are also mirror neurons – which help me to relate to and sympathize with other people. There are neural structures that cause me to recognize faces, walking gaits, and voices of people I know.

Evolutionary biology and neuroscience research points to the possibility that language may have evolved out of, and in tandem with gestural communication in homo sapiens. Even as audible language was freed from the physicality of gesture, the sound of one’s voice remains naturally associated with the visual, physical energy of the source of that voice (for more on this line of reasoning, check out Terrance Deacon).

puppeteer2Puppeteering is the art of making something come to life, whether with strings (as in a marionette), or with your hand (as in a muppet). The greatest puppeteers know how to make the most expressive movement with the fewest strings.

The same principle applies when I am having a Skype call with my wife. I am so intimately familiar with her voice and the associated visual counterpart, that all it takes is a few puppet strings for her to appear and begin animating in my mind – often triggered by a tiny, scratchy voice in a cell phone.

Enough pattern-recognition material has accumulated in my brain to do most of the work.

I am fascinated with the processes that go on in our brains that allow us to build such useful and reliable inner-representations of each other. And I have wondered if we could use more biomimicry – to apply more of these natural processes towards the goal of transmitting body language and voice across the internet.

These ideas are explored in depth in Voice as Puppeteer.

On Phone Menus and the Blowing of Gaskets

January 2, 2013

(This blog post is re-published from an earlier blog of mine called “avatar puppetry” – the nonverbal internet. I’ll be phasing out that earlier blog, so I’m migrating a few of those earlier posts here before I trash it).

This blog post is only tangentially related to avatars and body language. But it does relate to the larger subject of communication technology that fails to accommodate normal human behavior and the rules of natural language.

But first, an appetizer. Check out this video for a phone menu for callers to the Tennessee State Mental Hospital:

A Typical Scenario

You’ve probably had this experience. You call a company or service to ask about your bill, or to make a general inquiry. You are dumped into a sea of countless menu options given by a recorded message (I say countless, because you usually don’t know how many options you have to listen to – will it stop at 5? Or will I have to listen to 10?). None of the options apply to you. Or maybe some do. You’re not really sure. You hope – you pray, that you will be given the option to speak to a representative, a living, breathing, thinking, soft and cuddly human. After several agonizing minutes (by now you’ve forgotten most of the long-winded options) you realize that there is no option to speak to a human. Or at least youthink there is no option. You’re not really sure.

Your blood pressure has now reached levels that warrant medical attention. If you still have rational neurons firing, you get the notion to press “0″. And the voice says, “Please wait to speak to a phone representative”. You collapse in relief. The voice continues: “this call may be recorded for quality assurance” Yea, right. (I think I remember once actually hearing the message say, “this call may be recorded……because…we care”. Okay now that is gasket-blowing material).

Why Conversation Matters

I don’t think I need to go into this any further. Just do a search on “phone menu” (or “phone tree”) and “frustration”, or something like that, and follow the scent and you’ll find plenty of blog posts on the subject.

How would I best characterize this problem? I could talk about it from an economic point of view. For instance it costs a company a lot more to hire real people than to hook up an automated answering service or an interactive voice response (IVR) system. But companies have to also weigh the negative impact of a large percentage of irate customers. But too few companies look at this as a Design problem. Ah, there it is again: that ever-present normalizer and humanizer of technology: DesignIt’s invisible when it works well, and that’s why it is such an unsung hero.

The Hyper-Linearity of Non-Interactive Verbal Messages

The nature of this design problem, I believe, is that these phone menus give a large amount of verbal information (words, sentences, options, numbers, etc.) which take time to explain. They are laid out in a sequential order.

There is no way to jump ahead, to interrupt the monolog, or to ask it for clarification, as you would in a normal conversation. You are stuck in time – rigid, linear time, with no escape. (At least that’s what it feels like: there are usually options to hit special keys to go to the previous menu or pop out entirely, etc. But who knows what those keys are? And the dreaded fear of getting disconnected is enough to keep people like me staying within the lines, gritting  teeth, and being obedient (although that means I have the potential to become the McDonald’s gunman who makes the headlines the next morning.)

Compare this with a conversation with a phone representative: normal human dialog involves interruptions, clarifications, repetitions, mirroring (the “mm’s”, “hmm’s”, “ah’s”, “ok’s”, “uh-huh’s”, and such – the audible equivalent of eye-contact and head-nods), and all the affordances that you get from the prosody of speech. Natural conversations continually adapt to the situation. These adaptive, conversational dynamics are absent from the braindead phone robots. And their soft, soothing voices don’t help – in fact they only make me want to kill them that much harder.

There are two solutions:

1. Full-blown Artificial Intelligence, allowing the robot voice to “hear” your concerns, questions, and drill down, with your help, to the crux of the problem. But I’m afraid that AI  has a way to go before this is possible. And even if it is almost possible, the good psychologists, interaction designers, and human-user interface experts don’t seem to be running the show. They are outnumbered by the techno-geeks with low EQ, and little understanding of human psychology. Left-brainers gather the power and influence, and run the machines – computer-wise and business-wise – because they are good with the numbers, and rarely blow a gasket. The right-brained skill set ends up stuck on the periphery, almost by its very nature. I’m waiting for this revolution I keep hearing about – the Revenge of the Right Brain. So far, I still keep hitting brick walls built with left-brained mortar. But I digress.

2. Visual interfaces. By having all the options laid out in a visual space, the user’s eyes can jump around (much more quickly than a robot can utter the options). Thus, if the layout is designed well (a rarity in the internet junkyard) the user can quickly see, “ah, I have five options. Maybe I want to choose option 4 – I will select, “more information about option 4 to make sure”. All of this can happen within a matter of seconds. You could almost say that the interface affords a kind of body language that the user reads and acts upon immediately.

Consider the illustration below for a company’s phone tree which I found on the internet (I blacked-out the company name and phone number). Wouldn’t it be nice if you could just take a glance at this visual diagram and jump to the choice you want? If you’re like me, your eyes will jump straight to the bottom where the choice to speak to a representative is. (Of course it’s at the bottom).

This picture says it all. But of course. We each have two eyes, each with millions of photoreceptors: simultaneity, parallelism, instant grok. But since I’m talking about telephones, the solution has to be found within the modality of audio alone, trapped in time. And in that case, there is no other solution than an advanced AI program that can understand your question, read your prosodic body language, and respond to the flow of the conversation, thus collapsing time.

…and since that’s not coming for a while, there’s another choice: a meat puppet – one of those very expensive communication units that burn calories, and require a salary. What a nuisance.

Screensharing: Don’t Look at Me

January 11, 2012

Imagine discussing a project you are doing with a small group: a web site, a drawing, a contraption you are building; whatever. You would not expect the people to be looking at your face the whole time. Much of the time you will all be gazing around at different parts of the project. You may be pointing your fingers around, using terms like “this”, “that”, “here” and “there”.

When people have their focus on something separate from their own bodies, that thing becomes an extension of their bodies. Bodymind is not bound by skin. And collaborating, communicating bodyminds meld on an object of common interest.


The internet is dispersing our workspaces globally, and the same is happening to our bodies.

The anthropologist, Ray Birdwhistell coined the term “kinesics“, referring to the interpretation, science, or study of body language.

I invented a word: “telekinesics”. I define it as, “the science of body language as conducted over remote distances via some medium, including the internet” (ref)

My primary interest is the creation of body langage using remote manifestations of ourselves, such as with avatars and other visual-interactive forms. I don’t consider video conferencing as a form of virtual body language, because it is essentially a re-creation of one’s literal appearances and sounds. It is an extension of telephony.

But it is virtual in one sense: it is remote from your real body.

Video conferencing, and applications like Skype are extremely useful. I use Skype all the time to chat with friends or colleagues. Seeing my collaborator’s face helps tremendously to fill-in the missing nonverbal signals in telephony. But if the subject of conversation is a project we are working on, then “face-time”, is not helpful. We need to enter into, and embody, the space of our collaboration.

Screen Sharing

This is why screen sharing is so useful. Screen sharing happens when you flip a switch on your Skype (or whatever) application that changes the output signal from your camera to your computer screen. Your mouse cursor becomes a tiny Vanna White – annotating, referencing, directing people’s gazes.

Michael Braun, in the blog post: Screen Sharing for Face Time, says that seeing your chat partner is not always helpful, while screen sharing “has been shown to increase productivity. When remote participants had access to a shared workspace (for example, seeing the same spreadsheet or computer program), then their productivity improved. This is not especially surprising to anyone who has tried to give someone computer help over the phone. Not being able to see that person’s screen can be maddening, because the person needing help has to describe everything and the person giving help has to reconstruct the problem in her mind.”

Many software applications include cute features like collaborative drawing spaces, intended for co-collaborators to co-create, co-communicate, and to to co-mess up each other’s co-work. The interaction design (from what I’ve seen) is generally awkward. But more to the point: we don’t yet have a good sense of how people can and should interact in such collaborative virtual spaces. The technology is still frothing like tadpole eggs.

Some proponents of gestural theory believe that one reason speech emerged out of gestural communication was because it freed up the “talking hands” so that they could do physical work – so our mouths started to do the talking. Result: we can put our hands to work, look at our work, and talk about it, all at the same time.

Screen sharing may be a natural evolutionary trend – a continuing thread to this ancient  activity – as manifested in the virtual world of internet communications.



Watson’s Avatar: Just Abstract Art?

February 17, 2011

This video describes the visual design of the “avatar” for Watson – the Jeopardy-playing AI that recently debuted on the Jeopardy show.

This is a lovely example of generative art. Fun to watch as it swirls and swarms and shimmers. But I do not think it is a masterpiece of avatar design – or even information design in general. The most successful information design, in my opinion, employs natural affordances – the property of expressing the function or true state of an animal or thing. Natural affordances are the product of millions of years of evolution. Graphical user interfaces, no matter how clever and pretty, rarely come close to offering the multimodal stimuli that allow a farmer to read the light of the sky to predict rain, or for a spouse to sense the sincerity of her partner’s words by watching his head motions and changes in gaze.

Watson’s avatar, like many other attempts at visualizing emotion, intent, or states of human communication, uses arbitrary visual effects. They may look cool, but they do not express anything very deep.

…although Ebroodle thinks there is something pretty deep going on with Watson, as in… world domination.

Despite my criticism, I do commend Joshua Davis, the artist who developed the avatar. It is difficult to design non-human visual representations of human expression and communication. But it is a worthy effort, considering the rampant uncanny valley effect that has infected so much virtual human craft for so long, caused by artists (or non-artists) trying to put a literal human face on something artificial.

What Was Watson Thinking?

Watson’s avatar takes the form of a sphere with a swarm of particles that swirl around it. The particles migrate up to the top when Watson is confident about its answer, and to the bottom when it is unsure. Four different colors are used to indicate levels of confidence. Green means very confident. Sounds pretty arbitrary. I’ve never been a fan of color for indicating emotion or states of mind – it is overused, and ultimately arbitrary. Too many other visual affordances are underutilized (such as styles of motion).

Contradiction Between Visual and Audible Realism

Here’s something to ponder: Watson’s avatar is very abstract and arty. But Watsons voice is realistic … and kinda droll. I think Watson’s less-than-perfect speech creates a sonic uncanny valley effect. Does the abstraction of Watsons visual face help this problem, or make it more noticeable?

Is the uncanny valley effect aggravated when there is a discrepancy between visual and audible realism? I can say with more confidence that the same is true when visual realism is not met with complimentary behavioral realism (as I discuss in my book).

Am I saying that Watson should have a realistic human face – to match its voice? Not at all! That would be a disaster. But this doesn’t mean that its maker can craft abstract shapes and motions with reckless abandon. Indeed, the perception of shapes and colors changing over time – accompanied by sound – is the basis for all body language interpretation – it penetrates deep into the ancient communicative energy of planet Earth. Body language is the primary communication channel of humans as well as all animals. Understanding these ancient affordances is a good way to become good at information design.

Hmm – I just got an image of Max Headroom in my mind. Max Headroom had an electronic stutter in his voice as well as in his visual manifestation. Audio and video were complimentary. It would be kinda fun to see Max as Watson’s avatar.

What do you think, my dear Watson?


February 3, 2011

When my wife lay in a hospital bed for several weeks with a burst appendix, I spent a lot of time by her side. I was horrified by the cacophony in the recovery ward. How can anyone expect to heal when they are surrounded by a jungle of machines beeping incessantly? And what about the hospital staff? I decided that the nurses could just as easily be responding to the sound of bird calls. These machines could be playing the sounds of songbirds instead of emitting the sonic equivalent of a finger repeatedly poking you in the eye. Not only would bird calls make for a more pleasant soundscape in the hospital, but different bird calls could be used for different meanings.

Below is a poem I wrote many years ago which expresses my feelings about the way machines talk to us.


I came to the conclusion the other day
That our machines have something to say
Our cars, our phones, our computer screens,
Our ovens and our bank machines

They’re learning how to speak to us
Yes, times are changing and so they must
How could we ever get along
If they didn’t tell us when something’s wrong?

“Your seatbelt’s off”
“Don’t forget your card”
“Don’t click me there”
“Don’t press too hard”

But the builders of these technologies
Have yet to give them personalities
I think our machines might benefit, you see,
By having a bigger vocabulary!

There’s another gripe I need to share
A concern for which you may not care
There’s pollution in my neighborhood
I’d love to end it if I could

You hear that incessant beep beep beep?
It woke me from my morning sleep
Six blocks east on Main and First
A van just went into reverse

The device which generates this din
was built to reach the ears within
a ten block radius of this city
that all may know the van’s velocity

Folks living in a future world
May wince and twitch at what they once heard
Recalling the voice of our technology
This low-resolution cacaphony


Beep beep beep beep!

This is the third post of my new blog. I expect to have more things to say about body language in user interface design – not just regarding avatars in virtual worlds. I think the subject of virtual body language spans across all kinds of technology.  I’d love to hear your thoughts!