Your Voice is Puppeteering an Avatar in my Brain

November 23, 2014

I have been having a lot of video chat conversations recently with a colleague who is on the opposite side of the continent.

Now that we have been “seeing” each other on a weekly basis, we have become very familiar with each other’s voices, facial expressions, gesticulations, and so on.

But, as is common with any video conferencing system: the audio and video signal is unpredictable. Often the video signal totally freezes up, or it lags behind the voice. It can be really distracting when the facial expressions and mouth movements do not match the sound I’m hearing.

Sometimes we prefer to just turn off the video and stick with voice.

One day after turning off the video, I came to the realization that I have become so familiar with his body language that I can pretty much guess what I would be seeing as he spoke. Basically, I realized that…


Since the voice of my colleague is normally synchronized with his physical gesticulations, facial expressions, and body motions, I can easily imagine the visual counterpart to his voice.

This is not new to video chat. It has been happening for a long time with telephone, when we speak with someone we know intimately.


In fact, it may have even happened at the dawn of our species.

According to gestural theory, physical, visible gesture was once the primary communication modality in our ape ancestors. Then, our ancestors began using their hands increasingly for tool manipulation—and this created evolutionary pressure for vocal sounds to take over as the primary language delivery method. The result is that we humans can walk, use tools, and talk, all at the same time.

As gestures gave way to audible language, our ancestors could keep looking for nuts and berries while their companions were yacking on.

Here’s the point: The entire progression from gesture to voice remains as a vestigial pathway in our brains. And this is why I so easily imagine my friend gesturing at me as I listen to his voice.

Homunculi and Mirror Neurons

There are many complex structures in my brain, including several body maps that represent the positions, movements and sensations within my physical body. There are also mirror neurons – which help me to relate to and sympathize with other people. There are neural structures that cause me to recognize faces, walking gaits, and voices of people I know.

Evolutionary biology and neuroscience research points to the possibility that language may have evolved out of, and in tandem with gestural communication in homo sapiens. Even as audible language was freed from the physicality of gesture, the sound of one’s voice remains naturally associated with the visual, physical energy of the source of that voice (for more on this line of reasoning, check out Terrance Deacon).

puppeteer2Puppeteering is the art of making something come to life, whether with strings (as in a marionette), or with your hand (as in a muppet). The greatest puppeteers know how to make the most expressive movement with the fewest strings.

The same principle applies when I am having a Skype call with my wife. I am so intimately familiar with her voice and the associated visual counterpart, that all it takes is a few puppet strings for her to appear and begin animating in my mind – often triggered by a tiny, scratchy voice in a cell phone.

Enough pattern-recognition material has accumulated in my brain to do most of the work.

I am fascinated with the processes that go on in our brains that allow us to build such useful and reliable inner-representations of each other. And I have wondered if we could use more biomimicry – to apply more of these natural processes towards the goal of transmitting body language and voice across the internet.

These ideas are explored in depth in Voice as Puppeteer.

High Fidelity: Body Language through “Telekinesics”

June 2, 2013

Human communication demonstrates the usual punctuated equilibria of any natural evolutionary system. From hand gestures to grunts to telephones to email and beyond, human communication has not only evolved, but splintered off into many modalities and degrees of asynchrony.

hifi-logoI recently had the great fortune to join a company that is working on the next great surge in human communication: High Fidelity, Inc. This company is bringing together several new technologies to make this happen.

So, what is the newest evolutionary surge in human communication? I would describe it using a term from Virtual Body Language (page 22):

Telekinesics is a word invented to denote…”the study of all emerging nonverbal practices across the internet, by adding the prefix, tele to Birdwhistell’s, term kinesics. It could easily be confused with “telekinesis”: the ability to cause movement at a distance through the mind alone (the words differ by only one letter). But hey, these two phenomena are not so different anyway, so a slip of the tongue wouldn’t be such a bad thing. Telekinesics may be defined as “the science of body language as conducted over remote distances via some medium, including the internet”. 

And now it’s not just science, but practice: body language is ready to go online…in realtime.

And when I say “realtime” – I mean, pretty damn fast, compared to most things that zip (or try to zip) across the internet. And when we’re talking about subtle head nods, changes in eye contact, fluctuations in your voice, and shoulder shrugs, fast is not just a nicety, it is a necessity – for clear communication using a body.

Here’s Ryan Downe showing an early stage of avatar head movement using Google Glass.

Philip Rosedale, the founder of High Fidelity, often talks about how cool it would be for my avatar to walk up to your avatar and give it a little shoulder-shove, or a fist-bump, or an elbow-nudge, or a hug…and for your avatar to respond with a slight – but noticeable – movement.

It would appear that human touch (or at least the visual/audible representation of human touch) is on the verge of becoming a reality – through telekinesics. Of all the modalities and senses that we use to communicate, touch is the most primal: we share it with the oldest microorganisms.

touch_avatarWhen touch is manifest on the internet, along with highly-crafted virtual environments, maybe, just maybe, we will have reached that stage in human evolution when we can have a meaningful, intimate exchange – even if one person is in Shanghai and the other is in Chicago.

small_earthAnd that means people can stop having to fly around the world and burning fossil fuels in order to have 2-hour-long business meetings. And that means reducing our carbon footprint. And that means we might have a better chance of not pissing-off Mother Earth to the degree that she has a spontaneous fever and shrugs us off like pesky fleas.

Which would really suck.

So…keep an eye on what we’re doing at High Fidelity, and get ready for the next evolutionary step in human communication. It just might be necessary for our survival.

On Phone Menus and the Blowing of Gaskets

January 2, 2013

(This blog post is re-published from an earlier blog of mine called “avatar puppetry” – the nonverbal internet. I’ll be phasing out that earlier blog, so I’m migrating a few of those earlier posts here before I trash it).

This blog post is only tangentially related to avatars and body language. But it does relate to the larger subject of communication technology that fails to accommodate normal human behavior and the rules of natural language.

But first, an appetizer. Check out this video for a phone menu for callers to the Tennessee State Mental Hospital:

A Typical Scenario

You’ve probably had this experience. You call a company or service to ask about your bill, or to make a general inquiry. You are dumped into a sea of countless menu options given by a recorded message (I say countless, because you usually don’t know how many options you have to listen to – will it stop at 5? Or will I have to listen to 10?). None of the options apply to you. Or maybe some do. You’re not really sure. You hope – you pray, that you will be given the option to speak to a representative, a living, breathing, thinking, soft and cuddly human. After several agonizing minutes (by now you’ve forgotten most of the long-winded options) you realize that there is no option to speak to a human. Or at least youthink there is no option. You’re not really sure.

Your blood pressure has now reached levels that warrant medical attention. If you still have rational neurons firing, you get the notion to press “0″. And the voice says, “Please wait to speak to a phone representative”. You collapse in relief. The voice continues: “this call may be recorded for quality assurance” Yea, right. (I think I remember once actually hearing the message say, “this call may be recorded……because…we care”. Okay now that is gasket-blowing material).

Why Conversation Matters

I don’t think I need to go into this any further. Just do a search on “phone menu” (or “phone tree”) and “frustration”, or something like that, and follow the scent and you’ll find plenty of blog posts on the subject.

How would I best characterize this problem? I could talk about it from an economic point of view. For instance it costs a company a lot more to hire real people than to hook up an automated answering service or an interactive voice response (IVR) system. But companies have to also weigh the negative impact of a large percentage of irate customers. But too few companies look at this as a Design problem. Ah, there it is again: that ever-present normalizer and humanizer of technology: DesignIt’s invisible when it works well, and that’s why it is such an unsung hero.

The Hyper-Linearity of Non-Interactive Verbal Messages

The nature of this design problem, I believe, is that these phone menus give a large amount of verbal information (words, sentences, options, numbers, etc.) which take time to explain. They are laid out in a sequential order.

There is no way to jump ahead, to interrupt the monolog, or to ask it for clarification, as you would in a normal conversation. You are stuck in time – rigid, linear time, with no escape. (At least that’s what it feels like: there are usually options to hit special keys to go to the previous menu or pop out entirely, etc. But who knows what those keys are? And the dreaded fear of getting disconnected is enough to keep people like me staying within the lines, gritting  teeth, and being obedient (although that means I have the potential to become the McDonald’s gunman who makes the headlines the next morning.)

Compare this with a conversation with a phone representative: normal human dialog involves interruptions, clarifications, repetitions, mirroring (the “mm’s”, “hmm’s”, “ah’s”, “ok’s”, “uh-huh’s”, and such – the audible equivalent of eye-contact and head-nods), and all the affordances that you get from the prosody of speech. Natural conversations continually adapt to the situation. These adaptive, conversational dynamics are absent from the braindead phone robots. And their soft, soothing voices don’t help – in fact they only make me want to kill them that much harder.

There are two solutions:

1. Full-blown Artificial Intelligence, allowing the robot voice to “hear” your concerns, questions, and drill down, with your help, to the crux of the problem. But I’m afraid that AI  has a way to go before this is possible. And even if it is almost possible, the good psychologists, interaction designers, and human-user interface experts don’t seem to be running the show. They are outnumbered by the techno-geeks with low EQ, and little understanding of human psychology. Left-brainers gather the power and influence, and run the machines – computer-wise and business-wise – because they are good with the numbers, and rarely blow a gasket. The right-brained skill set ends up stuck on the periphery, almost by its very nature. I’m waiting for this revolution I keep hearing about – the Revenge of the Right Brain. So far, I still keep hitting brick walls built with left-brained mortar. But I digress.

2. Visual interfaces. By having all the options laid out in a visual space, the user’s eyes can jump around (much more quickly than a robot can utter the options). Thus, if the layout is designed well (a rarity in the internet junkyard) the user can quickly see, “ah, I have five options. Maybe I want to choose option 4 – I will select, “more information about option 4 to make sure”. All of this can happen within a matter of seconds. You could almost say that the interface affords a kind of body language that the user reads and acts upon immediately.

Consider the illustration below for a company’s phone tree which I found on the internet (I blacked-out the company name and phone number). Wouldn’t it be nice if you could just take a glance at this visual diagram and jump to the choice you want? If you’re like me, your eyes will jump straight to the bottom where the choice to speak to a representative is. (Of course it’s at the bottom).

This picture says it all. But of course. We each have two eyes, each with millions of photoreceptors: simultaneity, parallelism, instant grok. But since I’m talking about telephones, the solution has to be found within the modality of audio alone, trapped in time. And in that case, there is no other solution than an advanced AI program that can understand your question, read your prosodic body language, and respond to the flow of the conversation, thus collapsing time.

…and since that’s not coming for a while, there’s another choice: a meat puppet – one of those very expensive communication units that burn calories, and require a salary. What a nuisance.

Nano Avatars

June 8, 2012

(This blog post is re-published from an earlier blog of mine called “avatar puppetry” – the nonverbal internet. I’ll be phasing out that earlier blog, so I’m migrating a few of those earlier posts here before I trash it).


The other day, Jeremy Owen Turner told me about NanoArt. Here’s a cool nano art piece by Yong Qing Fu, described in Chemistry World.


We started imagining a nano virtual world. Jeremy pontificates on avatars as works of art, avatars that can take on alternate forms, including nano art. I started thinking about what an avatar that consisted of a molecule might be like.

Some illustrations of the hemoglobin molecule look a bit like the flying spaghetti monster. Which reminds me, Cory Linden’s avatar in Second Life is based on the flying spaghetti monster.


We’ve seen avatars hanging out among virtual molecules


but what about avatars that ARE molecules? Stephanie H. Chanteau and James M. Tour of Rice University created anthropomorphic molecules.


But I’m not so interested in how people make anthropomorphic molecules. I’m interested in avatars that live a molecule’s life. Check this out…

scanning tunneling microscope (STM) is set up in a magnificent auditorium.


The microscope’s subject matter is projected onto a giant video screen. An audience of thousands watch as a team of five molecule-avatar controllers sit with computer mice and keyboards and mingle in a virtual world that is actually not virtual. In the middle of all the flamboyant machinery is a tiny nano-stage, a performance dance floor where five molecules show something rather strange and new

Since the STM can be used for atom manipulation as well as visioning (a consequence of the Observer Effect), the very technology for seeing the avatars is used to control them.

The audience collectively winces as the avatars try to, um, walk. Okay, maybe walking isn’t the right word. What exactly do these avatars do? They combine to form supermolecules. They jump and twitch. They split and reform. They blink and chirp. They fall off the edge of the stage and accidentally get stuck on carbon atoms. It may not be elegant. But hey it would be so cool to watch.

When the performance is done, the avatars take a bow…or something. The audience applauds with a standing ovation. A new genre is born. Constraints define creative boundaries and therefore creativity. And the limited repertoire of molecular interactions define the social vocabulary of these agents. Kind of reminds me of Flatland.

Avatars are embodiments of humans (or human intention) in virtual worlds.

“Seeing” a molecule is a problematic term, in the same sense that “seeing” a planet in a distant star system is a problematic term. It’s not “seeing” on a human scale. It’sprosthetic seeing. And so, just like a software-based virtual world, there must be arenderer.


Our most distant ancestor is a molecule that accidentally replicated and thus started the upward avalanche that is called Evolution. Dennett’s intentional stance can be applied on all levels of the biosphere. Molecular avatars represent the most basic and primitive expression of agentry. And unlike the constraints of C++, Havok, and OpenGL, in virtual world software programs, the constraints in this molecular world are real.

It may yield some insights about the fundamentals of interaction.

Just Because It’s Visual Doesn’t Mean It’s Better

May 24, 2012

I’ve been renting a lot of cars lately because my own car died. And so I get to see a lot of the interiors of American cars. Car design is generally more user-friendly than computer interfaces – for the simple reason that when you make a mistake on a computer interface and the computer crashes, you will not die.

As cars become increasingly computerized, the “body language” starts to get wonky, even in aspects that are purely mechanical.

In a car I recently rented, I was looking for the emergency brake. The body language of most of the cars I’ve used offers an emergency brake just to the right of my seat in the form of a lever that I pull up. Body language between human bodies is mostly unconscious. If a human-manufactured tool is designed well, its body langage is also mostly-unconscious: it is natural. Anyway…I could not find an emergency brake in the usual place in this particular car. So I looked in the next logical place: near the floor to the left of the foot pedals. There I saw the following THING:

I wanted to check to make sure it was the brake, so that I wouldn’t inadvertently pop open the hood or the cap of the gas tank. So I peered more closely at the symbol on this particular THING, and I asked myself the following question:

What the F?

Once I realized that this was indeed the emergency brake, I decided that a simple word would have sufficed.

In some cars, the “required action” is written on the brake:

Illiterate Icon Artists

I was reminded of an episode in one of the companies I was working for, where an “icon artist” was hired to build the visual symbols for several buttons on a computer interface. He had devised a series of icons that were meant to provide visual language counterparts to basic actions that we typically do on computer interfaces. He came up with novel and aesthetic symbols. But….UN-READABLE.

I suggested he just put the words on the icons, because the majority of computer users know English, and if they don’t know English, they could always open up a dictionary. Basically, this guy’s clever icons had no counterpart to the rest of the world. They were his own invention – they were UNDISCOVERABLE.

Moral of the story:

Designed body language should corresponds to “natural affordances”;  the expectations and readability of the natural world. If that is not possible, use historical conventions (by now there is plenty of reference material on visual symbols, and I would suspect that by now there are ways to check for the relative “universality” of certain symbols).

In both cases, whether using words or visuals, literacy is needed.

Put in another way:

It is impossible to invent a visual langage from scratch. Because the only one who can visually “read” it is the creator. If it does not commute, it is not language. This applies to visual icons as much as it does to words.

As technology becomes more and more computerized (like cars) we have less and less opportunity to take advantage of natural affordances. Eventually, it will be possible to set the emergency brake by touching a tiny red button, or by uttering a message into a microphone. Thankfully, emergency brakes are still very physical, and I get to FEEL the pressure of that brake as I push it in, or pop it off….

that is…if I can ever find the damn thing.

Voice as Puppeteer

May 5, 2012

(This blog post is re-published from an earlier blog of mine called “avatar puppetry” – the nonverbal internet. I’ll be phasing out that earlier blog, so I’m migrating a few of those earlier posts here before I trash it).


According to Gestural Theory, verbal language emerged from the primal energy of the body, from physical and vocal gestures.


The human mind is at home in a world of abstract symbols – a virtual world separated from the gestural origins of those symbols. An evolution from the analog to the digital continues today with the flood of the internet over earth’s geocortex. Our thoughts are awash in the alphabet: a digital artifact that arose from a gestural past. It’s hard to imagine that the mind could have created the concepts of Self, God, Logic, and Math: belief structures so deep in our wiring – generated over millions of years of genetic, cultural, and neural evolution. I’m not even sure if I fully believe that these structures are non-eternal and human-fabricated. Since the Copernican Revolution yanked humans out from the center of the universe, it continues to progressively kick down the pedestals of hubris. But, being humans, we cannot stop this trajectory of virtuality, even as we become more aware of it as such.

I’ve observed something about the birth of online virtual worlds, and the foundational technologies involved. One of the earliest online virtual worlds was Onlive Traveler, which used realtime voice.


My colleague, Steve DiPaola invented some techniques for Traveler which cause the voice to animate the floating faces that served as avatars.

But as online virtual worlds started to proliferate, they incorporated the technology of chat rooms – textual conversations. One quirky side-effect of this was the collision of computergraphical humanoid 3D models with text-chat. These are strange bedfellows indeed – occupying vastly different cognitive dimensions.


Many of us worked our craft to make these bedfellows not so strange, such as the techniques that I invented with Chuck Clanton at, called Avatar Centric Communication.

Later, voice was introduced to I invented a technique for voice chat, and later re-implemented a variation for Second Life, for voice-triggered gesticulation.

Imagine the uncanny valley of hearing real voices coming from avatars with no associated animation. When I first witnessed this in a demo, the avatars came across as propped-up corpses with telephone speakers attached to their heads. Being so tuned-in to body language as I am, I got up on the gesticulation soap box and started a campaign to add voice-triggered animation. As an added visual aid, I created the sound wave animation that appears above avatar heads for both There and SL…


Gesticulation is the physical-visual counterpart to vocal energy – we gesticulate when we speak – moving our eyebrows, head, hands, etc. – and it’s almost entirely unconscious. Since humans are so verbally-oriented, and since we expect our bodies to produce natural body language to correspond to our spoken communications, we should expect the same of our avatars. This is the rationale for avatar gesticulation.

I think that a new form of puppeteering is on the horizon. It will use the voice. And it won’t just take sound signal amplitudes as input, as I did with voice-triggered gesticulation. It will parse the actual words and generate gestural emblems as well as gesticulations. And just as we will be able to layer filters onto our voices to mask our identities or role-play as certain characters, we will also be able to filter our body language to mimic the physical idiolects of Egyptians, Native Americans, Sicilians, four-year-old Chinese girls, and 90-year old Ethiopian men.

Digital-alphabetic-technological humanity reaches down to the gestural underbelly and invokes the primal energy of communication. It’s a reversal of the gesture-to-words vector of Gestural Theory.

And it’s the only choice we have for transmitting natural language over the geocortex, because we are sitting on top of a thousands-year-old heap of alphabetic evolution.

Seven Hundred Puppet Strings

March 31, 2012

(This blog post is re-published from an earlier blog of mine called “avatar puppetry” – the nonverbal internet. I’ll be phasing out that earlier blog, so I’m migrating a few of those earlier posts here before I trash it).

The human body has about seven hundred muscles. Some of them are in the digestive tract, and make their living by pushing food along from sphincter to sphincter. Yum! These muscles are part of the autonomic nervous system.

Other muscles are in charge of holding the head upright while walking. Others are in charge of furrowing the brow when a situation calls for worry. The majority of these muscles are controlled without conscious effort. Even when we do make a conscious movement (like waving a hand at Bonnie), the many arm muscles involved just do the right thing without our having to think about what each muscle is doing. The command region of the brain says, “wave at Bonnie”, and everything just happens like magic. Unless Bonnie scowls and looks the other way, in which case, the brow furrows, and is sometimes accompanied by grumbling vocalizations.

The avatar equivalent of unconscious muscle control is a pile of procedural software and animation scripts that are designed to “do the right thing” when the human avatar controller makes a high-level command, like <walk>, or <do_the_coy_shoulder_move>, or <wave_at, “Bonnie”>. Sometimes, an avatar controller might want to get a little more nuanced: <walk_like, “Alfred Hitchcock”>; <wave_wildly_at, “Bonnie”>. I have pontificated about the art of puppeteering avatars in the following two web sites:

Also this interview with me by Andrea Romeo discusses some of the ideas about avatar puppetry that he and I have been bantering around for about a year now.

The question of how much control to apply on your virtual self has been rolling around in my head ever since I started writing avatar code for and Second Life. Avatar control code is like a complex marionette system, where every “muscle” of the avatar has a string attached to it. But instead of all strings having equal importance, these strings are arranged in a hierarchical structure.

The avatar controller may not necessarily want or need to have access to every muscle’s puppet string. The question is: which puppet strings do the avatar controller want to control at any given time, and…how?

I’ve been thinking about how to make a system that allows a user to shift up and down the hierarchy, in the same way that our brains shift focus among different motion regimes


The movements – communicative and otherwise – that our future avatars make in virtual spaces may be partially generated through live motion-capture, but in most cases, there will be substitutions, modifications, and deconstructions of direct motion capture. Brian Rotman sez:

“Motion capture technology, then, allows the communicational, instrumental, and affective traffic of the body in all its movements, openings, tensings, foldings, and rhythms into the orbit of “writing”.

Becoming Beside Ourselves, page 47

Thus, body language will be alphabetized and textified for efficient traversal across the geocortex. This will give us the semantic knobs needed to puppeteer our virtual selves – at a distance. And to engage the semiotic process.

If I need my avatar to run up a hill to watch out for a hovercraft, or to walk into the next room to attend another business meeting, I don’t want to have to literally ambulate here in my tiny apartment to generate this movement in my avatar. I would be slamming myself against the walls and waking up the neighbors. The answer to generating the full repertoire of avatar behavior is hierarchical puppeteering. And on many levels. I may want my facial expressions, head movements, and hand movements to be captured while explaining something to my colleagues in remote places, but when I have to take a bio-break, or cough, or sneeze, I’ll not want that to be broadcast over the geocortex

And I expect the avatar code to do my virtual breathing for me.

And when my avatar eats ravioli, I will want its virtual digestive tract to just do its thing, and make a little avatar poop when it’s done digesting. These autonomic inner workings are best left to code. Everything else should have a string, and these strings should be clustered in many combinations for me to tug at many different semantic levels. I call this Hierarchical Puppetry.

Here’s a journal article I wrote called Hierarchical Puppetry.