Inside Out

Notes on seeking wisdom and crafting software

Transcribe podcasts for notes

A mini note on a setup that automatically downloads podcast episodes from a feed and transcribes them using a Whisper speech-to-text model.

Whisper on local box

For this experiment, I’m using an old desktop with 4GB VRAM. Whisper.cpp fits the bill: it runs Whisper on the CPU and can also be built with CUDA support to use the GPU.

# Clone and build the whisper.cpp repo for GPU
> git clone
> cd whisper.cpp
> WHISPER_CUDA=1 make -j

# Get the base model and quantize it to fit our GPU
> make base.en
> ./quantize models/ggml-base.en.bin models/ggml-base.en.q5_0.bin q5_0

# Optional: Run transcription for any wav file like this
# -p 6: using 6 physical cores for the transcription process
> ./main -m models/ggml-base.en.q5_0.bin -p 6 -f /mnt/media/podcasts/vedantany/tmp/out.wav -otxt

Getting podcast episodes

We’re using a tiny program called Poddl for listing and fetching the podcasts from a given RSS feed.

# List episodes for a podcast.
# Please use a feed url of your choice
> poddl -l | less

# Download an episode.
# -o .: use current folder as the output directory
# -i: add episode index number to downloaded episode
# -n 1: download the episode number 1
# -z 4: use a 4 digit numbering scheme for the name
> poddl -o . -i -n 1 -z 4

This will get the podcast episode in mp3 format.
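
Since `-z 4` asks for a 4 digit numbering scheme in the file name, a script that later looks up the downloaded file by episode number may want to pad the number the same way. Here is a minimal sketch, assuming the index is a plain decimal number. `pad_episode` is a hypothetical helper of mine, not part of Poddl.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of poddl): zero-pad an episode number
# to a fixed width, mirroring the 4 digit scheme requested with -z 4.
pad_episode() {
    local width="${2:-4}"
    local number
    # Strip leading zeros so inputs like "08" are not read as octal
    number="$(printf '%s' "$1" | sed 's/^0*//')"
    printf "%0${width}d\n" "${number:-0}"
}
```

For example, `pad_episode 1` prints `0001`, which can then be used as the prefix when searching the output directory.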


Here’s the entire script I’m using with notes on customizations.


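#!/usr/bin/env bash
set -e

# Change these settings per your podcast feed.
# NOTE: the values below are placeholders, adjust them for your setup.
FEED_URL="${FEED_URL:?Set FEED_URL to the RSS feed of your podcast}"
OUT_DIR="${OUT_DIR:-/mnt/media/podcasts/vedantany/tmp}"

# Arguments: transcribe (list|get) [EPISODE_NUMBER]
COMMAND="${1:-}"
EPISODE_NUMBER="${2:-}"
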
if [ "$COMMAND" = "list" ]; then
    poddl "$FEED_URL" -l | less
    exit 0
elif [ "$COMMAND" != "get" ]; then
    # Anything other than list or get is a usage error
    echo "Usage: transcribe (list|get) [EPISODE_NUMBER]"
    exit 1
fi

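echo "Checking if temp and out directories exist..."
mkdir -p "$OUT_DIR"

echo "Getting episode name..."
EPISODE_NAME=$(poddl "$FEED_URL" -l -n "$EPISODE_NUMBER" | tail -n 1)
echo "Episode: $EPISODE_NAME"

# Assumption: transcripts are named after the episode number inside OUT_DIR.
# Override TRANSCRIPT_FILE if you use a different naming scheme.
TRANSCRIPT_FILE="${TRANSCRIPT_FILE:-$OUT_DIR/transcript-$EPISODE_NUMBER}"
if [[ -f "$TRANSCRIPT_FILE.txt" ]]; then
    echo "Transcript exists: $TRANSCRIPT_FILE.txt"
    exit 0
fi
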
# FIXME Assumes podcast episode number is unique
AUDIO_FILE="$(ls "$OUT_DIR/$EPISODE_NUMBER"* 2>/dev/null | tail -n 1)"
if [[ -n "$AUDIO_FILE" && -f "$AUDIO_FILE" ]]; then
    echo "Podcast audio exists: $AUDIO_FILE"
else
    echo "Downloading episode $EPISODE_NUMBER..."
    poddl "$FEED_URL" -o "$OUT_DIR" -s -n "$EPISODE_NUMBER"
fi

AUDIO_FILE="$(ls "$OUT_DIR/$EPISODE_NUMBER"* | tail -n 1)"
echo "Convert to wav..."
if [[ -f "$OUT_DIR/out.wav" ]]; then
  echo "Remove older wav file"
  rm "$OUT_DIR/out.wav"
fi
# Whisper expects 16 kHz audio
ffmpeg -i "$AUDIO_FILE" -ar 16000 "$OUT_DIR/out.wav"

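echo "Transcribe..."
# Run from the whisper.cpp directory; same binary and model as in the example above
./main -m models/ggml-base.en.q5_0.bin \
    -p 6 -pp \
    -f "$OUT_DIR/out.wav" -otxt -of "$TRANSCRIPT_FILE"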
echo "Done!"
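
The `ls ... | tail -n 1` lookup in the script can also be written with a plain glob, which avoids parsing `ls` output and copes with spaces in file names. The sketch below is one way to do that. `find_episode_audio` is a hypothetical helper name, and it keeps the same prefix-match assumption flagged in the FIXME, so episode `1` would still match a file starting with `10`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the last file (in glob order) in a directory
# whose name starts with the given episode number. Returns 1 if none exists.
find_episode_audio() {
    local dir="$1"
    local episode="$2"
    local candidate
    local match=""
    for candidate in "$dir/$episode"*; do
        # An unmatched glob is left as a literal string, so test for a real file
        if [ -f "$candidate" ]; then
            match="$candidate"
        fi
    done
    if [ -n "$match" ]; then
        printf '%s\n' "$match"
    else
        return 1
    fi
}
```

In the script this could replace both lookups, for example `AUDIO_FILE="$(find_episode_audio "$OUT_DIR" "$EPISODE_NUMBER" || true)"` before the existence check.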

Base Whisper models are decent enough for English speech to text. With 6 processors and the base English model, it takes ~3 minutes to transcribe an hour's worth of audio. Try a few models to see which one suits your scenario best!
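
The ~3 minutes per hour figure works out to roughly 20x faster than real time. As a rough sketch of that arithmetic, the hypothetical helper below estimates the processing time for an episode of a given length. The default rate is only the number observed on my box, so pass your own measurement as the second argument.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate transcription time in whole minutes.
#   $1: audio length in minutes
#   $2: processing minutes per hour of audio (default 3, as measured above)
estimate_transcribe_minutes() {
    local audio_minutes="$1"
    local per_hour="${2:-3}"
    # Integer ceiling of audio_minutes * per_hour / 60
    echo $(( (audio_minutes * per_hour + 59) / 60 ))
}
```

With the defaults, `estimate_transcribe_minutes 90` prints `5`, since 4.5 minutes is rounded up.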