So, what’s my project? Alright, well, the first part of it was spending a lot of time nitpicking over ideas.
I knew I wanted to do something with sound, and then I knew I wanted to do something on technophony. If a soundscape is composed of biophony, geophony, and anthrophony, then technophony is the electro-mechanical noise: a subcategory of the human-made stuff.
To me, technophony seems to fall into two buckets: background-noise-as-noise-pollution, and sounds meant to communicate with humans. Think ventilation drone versus a synthetic voice or a beep.
The first is interesting to me because of how subtle and unnoticed yet constant and invasive it is. The second is interesting because giving machines sensors and reactive cues gives them a sense of agency where they otherwise shouldn't have any (language is typically considered a trait of sentient things only). If these two ideas are combined, you're presented with a world where you're constantly engulfed in sentient actors that are completely invisible.
There's an argument that centralized a/c isn't particularly human-reactive or communicative: it only senses room temperature, and it doesn't have a voice. However, a lot of a/c units cycle on and off in a pattern, which creates the image of being inside a giant thing that is breathing very, very slowly. There are other things like this: streetlights that turn on at dusk and off at dawn have nocturnal sleep-wake cycles.
What I’ve ended up doing — I’ve been trying to get speech-to-text transcription to work on technophonic noises.
Examples of code outputs: https://github.com/Fangknife/technophony/tree/main
The catches:
- An extremely subtle sound is indistinguishable from room tone, and it feels like I am not recording any one specific technophonee.
- Okay, so I can abandon interviewing a/c units and try particularly clangy radiators or faulty elevators instead. Great, yup.
- Whisper (Python library) transcriptions give wildly different results on the same file. If the input and the process are both random, the methodology is bunk (see the sketch after this list).
- Vosk (Python library) transcriptions vary little enough to count as an actual methodology, but that's because Vosk is too good at filtering out anything that isn't human speech.
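
For what it's worth, I think the Whisper randomness comes from its temperature-fallback decoding: on audio that isn't really speech, the low-temperature decode fails Whisper's internal quality checks and it re-samples at higher temperatures. Here's a minimal sketch of the repeatability check (file name is made up, not from the repo):

```python
import whisper

# Hypothetical file name; any short technophonic recording would do.
CLIP = "hallway_vent.wav"

model = whisper.load_model("base")

# Run the same clip a few times and compare the text. With non-speech audio,
# Whisper's temperature fallback tends to kick in, which is where the
# run-to-run variation shows up.
for i in range(3):
    print(i, repr(model.transcribe(CLIP)["text"]))

# Pinning temperature=0 makes the output (mostly) repeatable, at the cost of
# more empty or repetitive transcripts.
print(repr(model.transcribe(CLIP, temperature=0)["text"]))
```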
Where I’m actually at:
- I can generate spectrograms with librosa (sketch below).
- I can transcribe an audio file to a .txt file via Whisper or Vosk (sketch below).
- I can take an audio file and output an .mp4 with captions time-stamped at the word level (via Vosk inside videogrep, plus moviepy; sketch below).
- Getting word-level timestamps out of a Whisper .json fucking sucks, dude (a possible workaround is sketched below).
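
The spectrogram step looks roughly like this (a sketch, not the exact script in the repo; the file names are made up):

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Hypothetical input file.
y, sr = librosa.load("radiator.wav", sr=None)

# Log-frequency spectrogram in dB.
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)

fig, ax = plt.subplots(figsize=(10, 4))
img = librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="log", ax=ax)
fig.colorbar(img, ax=ax, format="%+2.0f dB")
fig.savefig("radiator_spectrogram.png", dpi=150)
```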
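The Vosk-to-.txt step, roughly (again a sketch: it assumes a downloaded Vosk model directory and a 16-bit mono WAV, and the paths are placeholders):

```python
import json
import wave
from vosk import Model, KaldiRecognizer

# Vosk wants 16-bit mono PCM WAV; file and model paths are placeholders.
wf = wave.open("radiator_16k_mono.wav", "rb")
model = Model("vosk-model-small-en-us-0.15")
rec = KaldiRecognizer(model, wf.getframerate())

pieces = []
while True:
    data = wf.readframes(4000)
    if not data:
        break
    if rec.AcceptWaveform(data):
        pieces.append(json.loads(rec.Result()).get("text", ""))
pieces.append(json.loads(rec.FinalResult()).get("text", ""))

with open("radiator_vosk.txt", "w") as f:
    f.write(" ".join(p for p in pieces if p))
```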
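The captioned .mp4 goes through videogrep's Vosk transcript in my actual script; the moviepy half of it looks something like the sketch below (moviepy 1.x API, TextClip needs ImageMagick installed; the word list and file names are placeholders):

```python
from moviepy.editor import AudioFileClip, ColorClip, CompositeVideoClip, TextClip

# Placeholder word-level transcript: [{"word": ..., "start": ..., "end": ...}, ...]
words = [
    {"word": "the", "start": 1.2, "end": 1.5},
    {"word": "hum", "start": 1.5, "end": 2.1},
]

audio = AudioFileClip("radiator.wav")
bg = ColorClip(size=(1280, 720), color=(0, 0, 0), duration=audio.duration)

# One short text clip per word, shown only during that word's time span.
captions = [
    TextClip(w["word"], fontsize=72, color="white")
    .set_start(w["start"])
    .set_end(w["end"])
    .set_position("center")
    for w in words
]

video = CompositeVideoClip([bg, *captions]).set_audio(audio)
video.write_videofile("radiator_captions.mp4", fps=24)
```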
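One possible way around the .json pain, if you're on a recent openai-whisper release (I haven't settled on this, so treat it as a sketch): ask transcribe() for word timestamps directly instead of parsing the CLI's .json output.

```python
import json
import whisper

model = whisper.load_model("base")

# word_timestamps=True exists in newer openai-whisper releases (circa 2023+).
result = model.transcribe("radiator.wav", word_timestamps=True)

rows = [
    {"word": w["word"].strip(), "start": w["start"], "end": w["end"]}
    for seg in result["segments"]
    for w in seg.get("words", [])
]

with open("radiator_words.json", "w") as f:
    json.dump(rows, f, indent=2)
```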