Zoom Morsels
A series of small interactions using facetracking to control your presence on Zoom.
60-461/761: Experimental Capture
CMU School of Art / IDeATe, Spring 2020 • Profs. Golan Levin & Nica Ross
I created a system that captures people’s faces from Zoom, tracks them over time, uses them to control a new environment, then sends video back to Zoom. I was most interested in finding a way to capture people’s faces and gaze from their Zoom feeds without them having to make any sort of effort; just by being in the call, they get to see a new view. The system runs on one computer, uses virtual cameras to extract sections of the screen for each video, then uses the facetracking library ofxFaceTracker2 and OpenCV in openFrameworks to re-render the video with new rules for how to interact. There are lots of potential uses and fun games you could play with this system; once the pose data has been extracted from the video, you can do nearly anything with it in openFrameworks.
The System
Creating this system was mainly a process of finding a way to grab Zoom video and efficiently run the facetracking model on multiple video streams. For each person, I created a virtual camera using the virtual camera plugin for OBS Studio and masked out their video by hand. Then, in openFrameworks, I grabbed each of the video streams and ran a facetracker on each of them on a separate thread. This method gets quite demanding quickly, but enables very accurate landmark detection in real time.
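As a rough sketch of that per-stream setup, following ofxFaceTracker2’s example usage (setup / update / drawDebug): the device IDs, capture size, and layout below are placeholders for the OBS virtual cameras on my machine, not the exact code I ran.

    // Sketch: one grabber and one tracker per OBS virtual camera.
    // Assumes openFrameworks + ofxFaceTracker2, which runs its detector on its
    // own background thread per instance (this is what gets demanding quickly).
    #pragma once
    #include "ofMain.h"
    #include "ofxFaceTracker2.h"
    #include <memory>
    #include <vector>

    class MultiTrackApp : public ofBaseApp {
    public:
        void setup() override {
            std::vector<int> deviceIds = {1, 2, 3};           // placeholder virtual camera IDs
            for (int id : deviceIds) {
                auto grabber = std::make_shared<ofVideoGrabber>();
                grabber->setDeviceID(id);
                grabber->setup(640, 360);                     // roughly the size of one Zoom tile
                grabbers.push_back(grabber);

                auto tracker = std::make_shared<ofxFaceTracker2>();
                tracker->setup();                             // loads the landmark model from bin/data
                trackers.push_back(tracker);
            }
        }
        void update() override {
            for (size_t i = 0; i < grabbers.size(); i++) {
                grabbers[i]->update();
                if (grabbers[i]->isFrameNew()) {
                    trackers[i]->update(*grabbers[i]);        // landmark detection for this stream
                }
            }
        }
        void draw() override {
            for (size_t i = 0; i < grabbers.size(); i++) {
                ofPushMatrix();
                ofTranslate(i * 640, 0);                      // lay the streams out side by side
                grabbers[i]->draw(0, 0);
                trackers[i]->drawDebug();                     // overlay the tracked landmarks
                ofPopMatrix();
            }
        }
    private:
        std::vector<std::shared_ptr<ofVideoGrabber>>  grabbers;
        std::vector<std::shared_ptr<ofxFaceTracker2>> trackers;
    };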
Here’s what happens on the computer running the capture:
The benefit of this method is that it enables very accurate facetracking, as each video stream gets its own facetracker instance. However, it gets quite demanding quickly, and with more than three video streams it requires much more processing power than I had available. Another limitation is that each video feed must be painstakingly cut out and turned into a virtual camera in OBS; it would be preferable if a single virtual camera instance could grab the entire grid view of participants. My update post explains more about this. I’m still working on a way to make facetracking on the grid view possible.
The Interactions
Having developed the system, I wanted to create a few small interactions that explore some of the ways you might be able to control Zoom, if Zoom were cooler. I took inspiration from the simplicity of some of the work Zach Lieberman has done with his face studies.
I was also interested in how you might interact with standard Zoom controls (like toggling video on and off) using just your face in a very basic way. For example, having the video fade away unless you move your head constantly:
And for people who like their cameras off, the opposite:
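Under the hood, both of these modes boil down to a rule along these lines. This is only a sketch: it assumes a single tracked face, uses the tracker’s face bounding box as a proxy for head position, and the gain and decay constants are arbitrary.

    // Sketch of the "fade away unless you move" interaction. Assumes the default
    // camera (or one of the virtual cameras, as above) and a single tracked face.
    #include "ofMain.h"
    #include "ofxFaceTracker2.h"

    class FadeApp : public ofBaseApp {
    public:
        void setup() override {
            grabber.setup(640, 360);
            tracker.setup();
        }
        void update() override {
            grabber.update();
            if (grabber.isFrameNew()) tracker.update(grabber);

            if (tracker.size() > 0) {
                // Use the tracked face's bounding-box center as a cheap proxy for head position.
                ofRectangle box = tracker.getInstances()[0].getBoundingBox();
                glm::vec2 center(box.getCenter().x, box.getCenter().y);
                float movement = glm::distance(center, lastCenter);   // pixels moved since last frame
                lastCenter = center;
                alpha += movement * 2.0f;   // moving the head restores opacity (gain is arbitrary)
            }
            alpha -= 1.5f;                  // stillness drains it (decay is arbitrary)
            alpha  = ofClamp(alpha, 0, 255);
        }
        void draw() override {
            // For the opposite mode (video only appears when you hold still), use 255 - alpha.
            ofSetColor(255, 255, 255, alpha);
            grabber.draw(0, 0);
            ofSetColor(255);
        }
    private:
        ofVideoGrabber  grabber;
        ofxFaceTracker2 tracker;
        glm::vec2       lastCenter{0, 0};
        float           alpha = 255;
    };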
I created four different interaction modes and a couple of extra tidbits, including a tag game. The code that runs the facetracking and different interaction modes can be found here.
I got face tracking working with Zoom!
It requires at least two computers: one with a camera that streams video through Zoom to the second, which processes the video and runs the face tracking software, then sends the video back to the first.
Here’s what happens on the computer running the capture:
This is what results for the person on the laptop:
There’s definitely some latency in the system, but it appears that most of it comes from Zoom itself and is unavoidable. The virtual cameras and the openFrameworks program add nearly no lag.
For multiple inputs, it becomes a bit more tricky. I tried running the face detection program on Zoom’s grid view of participants but found that it struggled to find more than one face at a time. The issue didn’t appear to be related to the size of the videos, as enlarging the capture area had no effect. I think it has something to do with the multiple “windows” with black bars between them; the classifier likely wasn’t trained on video input in this format.
The workaround I found was to create multiple OBS instances and virtual cameras, so that each records just the section of the screen containing one participant’s video. Then, in openFrameworks, I run the face tracker on each of these video streams individually, creating multiple outputs. The limitation of this method is the number of virtual cameras that can be created; the OBS plugin currently only supports four, which means the game will be four players max.
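To find those virtual cameras from openFrameworks, it’s enough to list the capture devices and match them by name. A rough sketch; the “OBS” substring is an assumption, since the actual device names vary by platform and plugin version:

    // Sketch: list the capture devices and pick out the OBS virtual cameras by name.
    // Check the logged names on your machine before relying on the "OBS" match.
    #include "ofMain.h"

    std::vector<int> findObsVirtualCameras() {
        ofVideoGrabber probe;
        std::vector<int> ids;
        for (auto & dev : probe.listDevices()) {
            ofLogNotice("cams") << dev.id << ": " << dev.deviceName;
            if (ofIsStringInString(dev.deviceName, "OBS")) {
                ids.push_back(dev.id);
            }
        }
        return ids;
    }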
I will develop some kind of game that uses people’s positions in their webcams, captured through Zoom, to control a collective environment: as they move their bodies, they control something about the environment. This will involve finding a way to capture the video feeds from Zoom and send them to openFrameworks, then running some sort of face/pose detection on many different videos at once. The participants can then control some aspect of the game environment on my screen, which I share with them.
I went on a walk up a nearby hill and took a timelapse. It’s about 90 minutes reduced to about 3.5 minutes.
I recorded the timelapse with an iPhone in my shirt pocket, which contributed to the jitteriness. It definitely gets very hard to tell what’s going on during most of the uphill portions of the walk, where the horizon disappears and all that’s visible is the ground. The clearer portions of the video are the flatter sections of trail, which on this walk were rather limited.
As it got darker in the trees on the north side of the hill, the camera automatically reduced the shutter speed, resulting in a very streaky and somewhat incomprehensible image. I think it would be cool to exploit this effect more by trying one of these timelapses in an even darker environment, maybe at night.
As for the most interesting thing found on this walk, I think it would have to be this rubber chicken screwed onto a tree by some mountain bikers.
A typology of video narratives driven by people’s gaze across clouds in the sky.
How do people create narratives through where they look and what they choose to fixate on? As people look at an image, there are brief moments where they fixate on certain areas of it.
Putting these fixation points in the sequence in which they’re created begins to reveal a kind of narrative being constructed by the viewer. It can turn static images into a kind of story:
The story is different for each person:
I wanted to see how far I could push this quality. It’s easy to see a narrative when there’s a clear relationship between the parts of a scene, but what happens when there are no clear elements with which to create a story?
I asked people to create a narrative from some clouds. I told them to imagine that they were a director creating a scene, where their eye position dictates where the camera moves. More time spent looking at an area results in the camera zooming in, and less time results in a wider view. Here are five different interpretations of one of the scenes:
I used a Tobii Gaming eye tracker and wrote some programs to record the gaze and output images. The process works like this:
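At its core, the camera logic is a mapping from dwell time to crop size. Here’s a minimal sketch of that mapping; the class name, thresholds, and image path are placeholders, and it assumes the gaze points have already been mapped into the image’s pixel coordinates.

    // Sketch of the dwell-to-zoom mapping: the longer the gaze stays within a small
    // radius, the tighter the crop around that fixation point.
    #include "ofMain.h"

    class NarrativeApp : public ofBaseApp {
    public:
        void setup() override {
            cloud.load("clouds.jpg");                        // placeholder image
        }
        void onGaze(const glm::vec2 & gaze) {                // feed one gaze sample per frame
            if (glm::distance(gaze, fixation) < 60) {        // still fixating (60 px: arbitrary)
                dwell += ofGetLastFrameTime();
            } else {                                         // gaze jumped: new fixation, reset clock
                fixation = gaze;
                dwell = 0;
            }
        }
        void draw() override {
            // Map dwell time (0-4 s) onto a zoom factor (1x = whole image, 4x = tight crop).
            // In practice the zoom would be eased over time so the "camera" moves smoothly.
            float zoom = ofMap(dwell, 0, 4, 1, 4, true);
            float w = cloud.getWidth()  / zoom;
            float h = cloud.getHeight() / zoom;
            float sx = ofClamp(fixation.x - w / 2, 0, cloud.getWidth()  - w);
            float sy = ofClamp(fixation.y - h / 2, 0, cloud.getHeight() - h);
            cloud.drawSubsection(0, 0, ofGetWidth(), ofGetHeight(), sx, sy, w, h);
        }
    private:
        ofImage   cloud;
        glm::vec2 fixation{0, 0};
        float     dwell = 0;
    };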
There were a couple of key limitations with this system. First, the eye tracker only works in conjunction with a monitor. There’s no way to have people look at something other than a monitor (or a flat object the exact same size as the monitor) and accurately track where they’re looking. Second, the viewer’s range of movement is limited: they must sit relatively still to keep the calibration valid. Finally, and perhaps most importantly, there’s the lack of precision. The tracker I was working with was not meant for analytical use, and therefore produces very noisy data that can only give a general sense of where someone was looking. It’s common to see differences of up to 50 pixels between data points while staring at one point and not moving at all.
A couple of early experiments showed just how bad the precision is. Here, I asked people to find constellations in the sky:
Even after significant post-processing on the data points, it’s hard to see what exactly is being traced. For this reason, the process that I developed uses the general area that someone fixates on to create a frame that’s significantly larger than the point reported by the eye tracker.
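The smoothing can be as simple as an exponential moving average over the incoming samples; a minimal sketch, with an illustrative smoothing factor:

    // Sketch: exponential moving average over raw gaze samples to tame the jitter.
    // A factor near 1.0 means heavier smoothing (a steadier but laggier point).
    #include "ofMain.h"

    glm::vec2 smoothGaze(const glm::vec2 & raw, glm::vec2 & smoothed, float factor = 0.9f) {
        smoothed = factor * smoothed + (1.0f - factor) * raw;
        return smoothed;
    }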
Though the process of developing the first program to record the gaze positions wasn’t particularly difficult, the main challenge came from accessing the stream of data from the Tobii eye tracker. The tracker is specifically designed not to give access to its raw X and Y values; instead it exposes the gaze position through a C# SDK meant for developing games in Unity, where the actual position is hidden. Luckily, someone has written an addon for openFrameworks that allows access to the raw gaze position stream. Version compatibility issues aside, it was easy to work with.
The idea of creating a narrative from a gaze moving around an image came up while exploring the eye tracker. In some sense the presentation is not at all specific to the subject of the image; I wanted to create a process that was generalizable to any image. That said, I think the weakest part of this project is the content itself, the images. I wanted to push away from “easy” narratives produced from representational images with clear compositions and elements. I think I may have gone a bit far here, as the narrative starts to get lost in the emptiness of the images. It’s sometimes hard to tell what people were thinking as they looked around; the story they wanted to tell is a bit unclear. I think the most successful part of this project, the system, has the potential to be used with more compelling content. There are also plenty of other possible ways of representing the gaze, and the camera-narrative style might be augmented in the future to better reflect the content of the image.
I’d like to use the eye tracker to reveal the sequence in which people look at images. When looking at something, your eyes move around and quickly fixate on certain areas, allowing you to form an impression of the overall image in your head. The image that you “see” is really made up of the small areas you’ve focused on; the remainder is filled in by your brain. I’m interested in the order in which people move their eyes over areas of an image, as well as the direction their gaze flows as they look.
Eye tracking has been used primarily to determine the areas that people focus on in images, and studies have been conducted to better understand how aspects of images contribute towards where people look. In this example, cropping the composition changed how people’s eyes moved around the image.
There is also a famous study by Yarbus showing that there’s a top-down effect on vision; people will look in different places when they have different goals in mind.
I haven’t been able to find any studies that examine the sequence in which people look around an image. I’d like to see if there are differences between individuals, and perhaps how the image itself plays a role.
I’m going to show people a series of images (maybe 5) and ask them to look at each one for some relatively short amount of time (maybe 30 seconds). I’ll record the positions of their eyes over that time in openFrameworks. After doing this, I’ll have eye position data for each of the images from however many people participate.
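The recording program itself should be straightforward, something like this sketch (the class and file names are placeholders, and the gaze samples would come from whatever the eye tracker’s addon provides):

    // Sketch of the recording step: append timestamped gaze samples to a CSV,
    // one file per image and viewer.
    #include "ofMain.h"

    class GazeRecorder {
    public:
        void start(const std::string & path) {
            csv.open(path, ofFile::WriteOnly);
            csv << "millis,x,y\n";
        }
        void addSample(const glm::vec2 & gaze) {     // call once per incoming gaze sample
            csv << ofGetElapsedTimeMillis() << "," << gaze.x << "," << gaze.y << "\n";
        }
    private:
        ofFile csv;
    };

    // Usage (placeholder file name):
    //   recorder.start("gaze_image01_viewer01.csv");
    //   recorder.addSample(currentGaze);   // inside the app's update() while recording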
Next, it’s a question of presenting the fixation paths. I’m still not totally sure about this. I think I might reveal the area of the image that they fixated on and manipulate the “unseen” area to make it less prominent. There are lots of interesting possibilities for how this might be done. I read a paper about a model of the representation of peripheral vision that creates distorted waves of information.
It could be nice to apply this kind of transformation to the areas of the image that aren’t focused on. The final product might be a video that follows the fixation points around the image in the same timescale as they’re generated, with the fixation in focus and area around distorted. A bit like creating a long exposure of someone’s eye movements around an image.
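A cheap stand-in for that effect, until I work out the distortion, could just dim everything outside the fixation. A sketch, with a placeholder image path and an arbitrary window radius:

    // Sketch of a simple version of the presentation: the whole image drawn dimmed,
    // with a sharp full-brightness window re-drawn around the current fixation point.
    // (A stand-in for the peripheral-vision distortion described above.)
    #include "ofMain.h"

    class RevealApp : public ofBaseApp {
    public:
        void setup() override {
            img.load("image01.jpg");                 // placeholder image
        }
        void drawFixation(const glm::vec2 & fix, float radius = 120) {
            ofSetColor(70);                          // dimmed "unseen" layer
            img.draw(0, 0);
            ofSetColor(255);                         // sharp window around the fixation
            img.drawSubsection(fix.x - radius, fix.y - radius, radius * 2, radius * 2,
                               fix.x - radius, fix.y - radius, radius * 2, radius * 2);
        }
    private:
        ofImage img;
    };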
While not exactly an example of non-human photography as described in the chapter, I think these satellite photos of planes are particularly interesting. They reveal the way that the image is created from red, green, and blue bands that are layered together. Each band is captured by the satellite sequentially, so there’s a slight delay between them. For things on the ground that are relatively still, this is fine, but for fast-moving objects like planes the bands don’t quite line up. This creates a kind of rainbow shadow. It reveals something about the process that can’t be found in other forms of visible-light photography. It’s only possible because of the computational process behind how the images are captured, and is in a sense a kind of glitch, but a very nice one. Clearly, the author’s observation is correct in that the algorithmic approach intensifies the “nonhuman entanglement,” in addition to revealing something new.
I brought the top of an acorn.
It was made up of these little squiggly, spaghetti-like things.
It was a bit tough capturing the stereo pair because the acorn kept moving around. Donna said this was because it was a bit large, and the electrons hitting it were causing it to shift. In the end the stereo pair really just captured the acorn’s movement around the surface, but I think it creates an interesting effect. It was pretty unexpected to see that the acorn top is made up of this kind of structure, and it was cool to be able to zoom way into it.
On the inside cross-section of the acorn there were these dead cells.