Seeing is believing.

Introduction

This project explores the discrepancy between human auditory and visual perception through a collection of contrasting experiences of perceiving the world with and without visual information.

Seeing is believing.

As the saying goes, vision dominates how many of us perceive the world. When the visual channel is unavailable, we rely on other senses such as hearing, often attempting to visualize the scene in our minds from those signals.

But how accurately can we reconstruct the world without vision? How reliable is the auditory channel as a medium to perceive the world? How dominant is vision really in how we experience the world?

Typology Machine

To answer these questions, I built a typology machine (Figure 3) that captures these contrasting experiences in three steps:

  1. Real-world scene construction with auditory and visual out-of-place cues.
  2. 360° spatial capture.
  3. Interactive Virtual Reality (VR) software.

1. Real-world scene construction with auditory and visual out-of-place cues

I set up real-world scenes in which an out-of-place cue exists in both the visual and the auditory channel. For example, in a grass yard there is an angry stuffed seal and the sound of a toilet flushing. I placed the visual out-of-place cue in the scene and played the auditory out-of-place sound from a smartphone placed at the same location as the visual cue. I constructed four scenes (see Figures 1 and 3). In all scenes, the visual out-of-place cue remained the same (the angry stuffed seal), while the auditory cue varied.

A. Parking lot: There were ambient sounds of bugs and cars. As an auditory out-of-place cue, the sound of seagulls and waves was used.

B. Grass yard: There were ambient sounds of birds and people. As an auditory out-of-place cue, the sound of a toilet flushing was used.

C. Office with no windows: A quiet space with no ambient sounds and no natural lighting. As an auditory out-of-place cue, the sound of birds singing was used.

D. Patio with pathways: There were ambient sounds of birds and people. As an auditory out-of-place cue, the sound of angry dogs barking was used.

Figure 1. Spatial capture setup using dual-mounted 360° camera and ambisonic microphone. First row left: (A) parking lot, right: (B) grass yard. Second row left: (C) office with no windows, right: (D) patio with pathways.

2. 360° spatial capture

In the center of each scene, I placed a dual-mounted 360° camera (Mi Sphere Camera) and an ambisonic microphone (Zoom H3-VR) to record the 3D spatial scape of the world, as shown in Figures 1 and 2. (An ambisonic microphone captures sound from the full 360° around the microphone, representing the surround soundscape at a single point.)

Figure 2. Close-up of dual-mounted 360° camera and ambisonic microphone setup.
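To make the ambisonic representation concrete, here is a minimal sketch of how a first-order ambisonic recording can be "steered" toward any direction after the fact. It assumes a 4-channel AmbiX file (ACN channel order W, Y, Z, X with SN3D normalization), one of the formats the H3-VR supports; the file name and the tooling (NumPy and the soundfile library) are my choices for illustration, not part of the project.

```python
# Minimal sketch: steer a virtual microphone through a first-order
# ambisonic (AmbiX) recording after the fact. Assumes a 4-channel WAV
# in ACN order (W, Y, Z, X) with SN3D normalization; the file name is
# hypothetical.
import numpy as np
import soundfile as sf

audio, sr = sf.read("scene_ambisonic.wav")   # shape: (samples, 4)
w, y, z, x = audio.T

def virtual_mic(azimuth_deg, elevation_deg=0.0):
    """Render a virtual cardioid microphone pointing at the given
    direction (azimuth 0 = front, 90 = left, in degrees)."""
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    # Unit vector of the look direction in the ambisonic frame.
    dx, dy, dz = np.cos(az) * np.cos(el), np.sin(az) * np.cos(el), np.sin(el)
    # Cardioid pattern: half omni (W) + half figure-of-eight (X, Y, Z).
    return 0.5 * w + 0.5 * (x * dx + y * dy + z * dz)

# Crude stereo preview: virtual mics pointing left and right.
stereo = np.stack([virtual_mic(+90), virtual_mic(-90)], axis=1)
sf.write("stereo_preview.wav", stereo, sr)
```

The same idea, with a rotation applied to the X, Y, and Z components, is what lets a soundfield stay world-locked while a listener turns their head in VR.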

3. Interactive Virtual Reality (VR) software

I built Virtual Reality (VR) software that reconstructs the scene captured in Step 2 and presents the contrasting experiences. I developed the software for the Oculus Quest 2 headset. In the virtual scene, the player first enters a black spherical space with only sound. By listening to the sound, the player imagines where they are. Then, by clicking parts of the surrounding black sphere, they can remove the occluding parts and unveil the visual world. The video in Figure 3 demonstrates the VR experiences in the four scenes with out-of-place visual and auditory cues.
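The write-up does not name the engine, so the following is a platform-agnostic sketch (in Python) of the core unveiling interaction under one plausible design: the occluding black sphere is treated as a grid of latitude/longitude tiles, and a click along the player's gaze direction hides the tile it hits, exposing the 360° video behind it. All names and the tile resolution are illustrative, not taken from the actual implementation.

```python
# Minimal sketch of the sphere-unveiling interaction, independent of
# any particular engine. The occluding sphere is modeled as a grid of
# latitude/longitude tiles; a click along the player's gaze direction
# reveals the tile it hits.
import math

N_LON, N_LAT = 32, 16                      # tiles around / top-to-bottom
revealed = [[False] * N_LON for _ in range(N_LAT)]

def tile_at(direction):
    """Map a unit gaze vector (x, y, z) to the (lat, lon) tile it hits."""
    x, y, z = direction
    lon = (math.atan2(y, x) + math.pi) / (2 * math.pi)    # 0..1 around
    lat = (math.asin(max(-1.0, min(1.0, z))) + math.pi / 2) / math.pi
    return min(int(lat * N_LAT), N_LAT - 1), min(int(lon * N_LON), N_LON - 1)

def on_click(direction):
    """Reveal the patch of 360-degree video behind the clicked tile."""
    i, j = tile_at(direction)
    revealed[i][j] = True

# Example: the player clicks straight ahead, then slightly upward.
on_click((1.0, 0.0, 0.0))
on_click((0.9, 0.0, math.sqrt(1 - 0.81)))
print(sum(map(sum, revealed)), "tiles revealed")
```

In an actual engine, each tile would be a mesh patch on the inside of a sphere surrounding the camera, and revealing a tile would simply disable that patch's renderer so the 360° video shows through.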

Figure 3. Demonstration video of the VR experiences in the four scenes with out-of-place visual and auditory cues.

Findings

The main finding from the collected demonstrations of contrasting experiences is that an auditory out-of-place cue (e.g., the sound of seagulls in a parking lot, a toilet flushing in a grass yard, or birds singing in an office with no windows) can completely deceive users about where they are when no visual information is present. In the first scene of the video (Figure 3), the parking lot could be mistaken for a beach with seagulls. In the second scene, an outdoor grass yard could be mistaken for a bathroom with a flushing toilet. In the third scene, an indoor office could be mistaken for a sunny forest. In the fourth scene, the incongruence is weaker, but a peaceful patio could be mistaken for a place under a more imminent threat from angry dogs barking.

On the other hand, a visual out-of-place cue (e.g., the stuffed seal) does not change the user's perception of where they are. It makes the user think it is odd that a stuffed seal is there, not that the place itself must be somewhere else.

This highlights the difference in the richness of ambient or peripheral information between the visual and auditory channels. As shown in the video (Figure 3), ambient audio is preserved, such as the sounds of bugs, birds, people, and passing cars. However, a single out-of-place cue with strong characteristics is dominant enough to overshadow the other ambient cues, at least until the visual channel becomes available. In the visual channel, the full context of the scene contributes more to the perception of the location than a single out-of-place cue such as the stuffed seal.

The finding was reinforced by people who tried the VR software. For example, in the reaction video below (Figure 4), the person first thinks he is on a beach because of the sound of seagulls (00:13). Later, as he reveals the visual information, he not only realizes he is in a parking lot under a bridge but also decides that the sound of waves was actually the sound of cars passing by (01:20). He revises his earlier interpretation of the auditory cue (the sound of waves) to fit the newly obtained visual information. This exemplifies the dominance of visual information in human perception of the world.

Figure 4. Example reaction of a person trying the VR software.

Furthermore, although this was unfortunately not captured in the video, while debriefing after the VR experience the person explained that he had first thought it was a seal but hesitated to say so (03:40) because it sounded “too stupid to say there’s a seal,” even though he did mention my coat under the seal. This hints at possible differences in the mechanisms of human auditory and visual perception: while one strong cue can dominate auditory perception, the combination of ambient information may carry more weight in visual perception.

In summary, this collection of interactive sensory experiences reveals the contrasts between human auditory and visual perception.

Reflections

Inspirations and Development of the Project

I started the project with a broad interest in fully capturing one’s experience through technology. That interest triggered my curiosity about how much information spatial sound can carry about one’s experience and memory.

Before designing the specific details of the apparatus, I conducted some initial explorations. I first wanted to feel what it is like to make a digital copy of a real-life experience. I brought the Zoom H3-VR ambisonic microphone to different places such as a public park with people and wild animals, a harbor, an airport, and a kitchen. Tinkering with the ambisonic audio, I realized that, contrary to my expectations, the ambient sound rarely contains auditory cues that hint at the space. And because such cues are so rare, one distinctive cue can easily deceive a listener. Inspired by this, I started designing an interactive VR medium in which participants could quite literally unveil the discrepancy between our dominant visual channel and our supportive auditory channel, which developed into the final design described above.

Self-Evaluation: Challenges, Opportunities, and Learnings

Throughout the process, I encountered several challenges because I underestimated the work involved. Technically, I thought the implementation would take only a short time since I have some experience building AR software. However, working with new hardware (Oculus Quest 2), new software platforms, and new types of data (ambisonic audio and 360° video) was more of a struggle than I expected, and building the apparatus itself consumed much of my time. In particular, developing on a fast-moving platform like Oculus meant that much of the documentation online was deprecated, and issues caused by recent updates had to be resolved through trial and error.

Had building the apparatus taken less time, I would have explored more diverse scene settings and visual out-of-place cues. More user studies and another collection of people’s reactions to facing the discrepancy would have been insightful as well. Through iterations, the design of the experience itself could also have been improved to exhibit the contrasting perceptions in a more straightforward way.

I personally learned a lot through the execution of the project. I learned the differences in auditory and visual perceptions through the collection of immersive contrasting experiences. Technically, I learned how ambisonics work, how to use 360° visual and ambisonic captures in VR, and VR development in general.

Lastly, I learned that for successful 360° video filming, I always need to find somewhere to hide myself in advance. 🙂