Speaking tools for audio journaling

Today is March 12, 2023, and it’s a magical evening. I’m using my Bluetooth headphones to record this text as a voice memo while I’m sorting my laundry.

There are two fundamental mantras I like to remind myself of:

Movement is life.
Speech is life.

That means: as we’re getting older and changing our environments to be more comfortable, I want to be careful. Especially,

if more comfort means less opportunity to physically move around: less walking, running, climbing, cycling, and so on
if more comfort means speaking less

Movement is a topic of its own. But why care about speaking?

I believe that speaking is healthy. Verbalizing, articulating, expressing: all of these help us think. Haven’t we all said something like, “Now that I’m talking about it, I just realized that ...”? Some thoughts only seem to form when we start speaking. Heinrich von Kleist talks about a similar phenomenon in his piece on the gradual completion of thoughts during speech.

Of course, silent reflection is possible as well. You can just take a quiet walk and think through something. But in a world with an abundance of hyper-engaging video content, distractions are everywhere. It’s getting harder to focus the mind on a single topic for a longer time. Writing is another way to think through something, but writing requires a deliberate effort to sit down, use a keyboard or pen & paper, and actually write. So I believe that speaking is an easier way to help us find our way back to focus. But with opportunities to work from home, it has become easier to slide into speaking less with like-minded people, and easier than ever before to isolate ourselves comfortably. So if we’re too easily distracted to think, and too lazy to write, then why don’t we give speaking a try?

It’s easy to speak when there’s someone who wants to listen. But even if we’re in good company, there are topics even our significant other won’t be interested in hearing about (all the time).

Also, if I remember correctly, part of Noam Chomsky’s linguistic theory was that – counter-intuitively – language didn’t evolve as a means of communicating something to someone else, but primarily to serve as a tool for thought. I haven’t checked supporting evidence in evolutionary biology, but the idea resonated with me.

So two years ago, during Covid, I had a phase when I would record short voice memos to help me practice clarifying my thoughts through speaking. I’d slowly speak out loud what came to mind, reflecting on the day or on some recent event, and I’d take care to speak in full sentences, without sounds of hesitation and without repetitions. But it got challenging very quickly and I stopped the habit.

What I was missing back then was an easy way:

to transcribe my voice memos into text
to summarize and restructure the content of those voice memos
to analyze the content of those voice memos for topics, sentiment, emotions, and other metrics or characteristics

Overall, I was missing something like an audio journaling app. Or call it a thought-structuring toolkit. Or call it speaking therapy to tackle the clouded mind.

Well, welcome to 2023. Welcome to the era of speech-to-text, welcome to large language models (LLMs). The app MacWhisper uses OpenAI’s Whisper models to transcribe audio files into text. It works fully locally, so data doesn’t leave our device. The small (500 MB) model was already sufficient for me to transcribe my (German) voice memos to text in close-to-perfect quality. In addition, LLMs are coming up that may run on our local devices as well (check out Dalai LLaMA). Using these tools locally preserves data privacy.

Imagine the following workflow:

record a voice memo using Bluetooth headphones
convert it to text using Whisper
clean up the text using a large language model
analyze the text (e.g. using LLMs again / sentiment analysis / static text analysis)
visualize results in an app / on a dashboard
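
To make the workflow a bit more tangible, here is a minimal Python sketch of the first four steps. It is only an illustration under a few assumptions: transcription uses the open-source `openai-whisper` package (which needs ffmpeg installed), the LLM clean-up is replaced by a crude rule-based stand-in, and the names `FILLERS`, `clean_up`, `analyze`, `Entry` and `process_memo` are made up for this example.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Callable

# Hypothetical filler list; extend it for your own language and habits.
FILLERS = {"um", "uh", "erm", "hmm"}
PUNCTUATION = ".,!?;:"


def transcribe_with_whisper(audio_path: str, model_name: str = "small") -> str:
    """Speech to text, assuming the `openai-whisper` package is installed."""
    import whisper  # imported lazily so the rest of the sketch runs without it

    model = whisper.load_model(model_name)
    return model.transcribe(audio_path)["text"].strip()


def clean_up(text: str) -> str:
    """Rule-based stand-in for the LLM step: drop fillers and stutters."""
    words = []
    for word in text.split():
        bare = word.strip(PUNCTUATION).lower()
        if bare in FILLERS:
            continue
        if words and bare == words[-1].strip(PUNCTUATION).lower():
            continue  # immediate repetition such as "I I think"
        words.append(word)
    return " ".join(words)


@dataclass
class Entry:
    transcript: str
    word_count: int
    top_words: list[tuple[str, int]]


def analyze(text: str, top_n: int = 3) -> Entry:
    """Very simple static text analysis: word count and most frequent words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # letters only, Unicode-aware
    long_words = [t for t in tokens if len(t) > 3]   # crude stop-word filter
    return Entry(text, len(tokens), Counter(long_words).most_common(top_n))


def process_memo(
    audio_path: str,
    transcribe: Callable[[str], str] = transcribe_with_whisper,
) -> Entry:
    """Record -> transcribe -> clean up -> analyze, as one function call."""
    return analyze(clean_up(transcribe(audio_path)))
```

Because the transcription step is passed in as a function, the rest of the pipeline can be tried without any audio file, for example with `process_memo("memo.m4a", transcribe=lambda path: "Um, I I think speaking helps.")`. Swapping `clean_up` for a real LLM call would be the natural next step.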

Say we’ve got the transcript of an audio recording. We could then easily ask an LLM:

what are the main (1-3) topics?
what’s the key quote from the text?
which main (1-3) emotions characterize this text?
which 3 adjectives describe this text?
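
The questions above could be bundled into a single prompt. The following is a rough sketch, not tied to any particular model or API: it only builds the prompt text and parses a JSON reply, and the actual LLM call is left out on purpose. The key names (`topics`, `key_quote`, `emotions`, `adjectives`) and the functions `build_prompt` and `parse_reply` are assumptions made for this example.

```python
import json

# The four questions from the text, keyed by hypothetical field names.
QUESTIONS = {
    "topics": "What are the main (1-3) topics?",
    "key_quote": "What is the key quote from the text?",
    "emotions": "Which main (1-3) emotions characterize this text?",
    "adjectives": "Which 3 adjectives describe this text?",
}


def build_prompt(transcript: str) -> str:
    """Combine the questions and the transcript into one prompt string."""
    lines = [
        "Answer the questions below about the transcript.",
        "Reply with a single JSON object using exactly these keys:",
    ]
    lines += [f'- "{key}": {question}' for key, question in QUESTIONS.items()]
    lines += ["", "Transcript:", transcript.strip()]
    return "\n".join(lines)


def parse_reply(reply: str) -> dict:
    """Extract the JSON object from a model reply and keep only known keys."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start:end + 1])
    missing = [key for key in QUESTIONS if key not in data]
    if missing:
        raise ValueError(f"reply is missing keys: {missing}")
    return {key: data[key] for key in QUESTIONS}
```

Asking for JSON keeps the answers machine-readable, so they can be stored next to the transcript. Models sometimes wrap the JSON in extra chatter, which is why `parse_reply` looks for the outermost braces instead of parsing the whole reply.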

Then we could visualize the results in an app, make the transcripts easily searchable and shareable, show trends over time, and so much more.
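
As a small illustration of "searchable" and "trends over time", here is a sketch that keeps journal entries in memory, finds entries by keyword, and counts emotions per month. Everything here is hypothetical: the `JournalEntry` fields and the functions `search` and `emotions_by_month` are invented for this example, and a real app would more likely use a database and proper full-text search.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from datetime import date


@dataclass
class JournalEntry:
    day: date
    transcript: str
    emotions: list[str] = field(default_factory=list)


def search(entries: list[JournalEntry], query: str) -> list[JournalEntry]:
    """Case-insensitive substring search over all transcripts."""
    needle = query.casefold()
    return [e for e in entries if needle in e.transcript.casefold()]


def emotions_by_month(entries: list[JournalEntry]) -> dict[str, Counter]:
    """Count how often each emotion shows up per calendar month (YYYY-MM)."""
    trend: defaultdict[str, Counter] = defaultdict(Counter)
    for entry in entries:
        trend[entry.day.strftime("%Y-%m")].update(entry.emotions)
    return dict(trend)
```

The per-month counters are already in a shape that a dashboard could plot as one line per emotion.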

At the moment of writing, I’m wondering whether I should start building this app. In any case, all of those tools are already available. They’re going to become so powerful that I can already see a new generation of speaking tools coming up that will help us reflect, learn, and maybe even stay sane and healthy.

I’m clearly on the – it’s going to be a great time to be alive – side of things.

Note: The raw form of this blog post was recorded, transcribed and corrected with the help of language models. Without those tools, I would have had a hard time sitting down and writing everything from scratch. ✌️

Title Photo by Jason Rosewell on Unsplash