Inside Out

Notes on seeking wisdom and crafting software

Transcribe podcasts for notes

A mini note on a setup that automatically downloads podcast episodes from a feed and transcribes them with a Whisper speech-to-text model.

Whisper on local box

For this experiment, I’m using an old desktop with 4GB VRAM. [Whisper.cpp] fits the bill: it runs Whisper on the CPU and can offload work to the GPU when built with CUDA support.

# Clone and build the whisper.cpp repo for GPU
> git clone https://github.com/ggerganov/whisper.cpp.git
> cd whisper.cpp
> WHISPER_CUDA=1 make -j

# Get the base model and quantize it to fit our GPU
> make base.en
> ./quantize models/ggml-base.en.bin models/ggml-base.en.q5_0.bin q5_0

# Optional: Run transcription for any wav file like this
# -p 6: using 6 physical cores for the transcription process
> ./main -m models/ggml-base.en.q5_0.bin -p 6 -f /mnt/media/podcasts/vedantany/tmp/out.wav -otxt
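Whisper.cpp expects 16 kHz WAV input, so any other audio format needs a conversion first. Here is a minimal sketch of that step, assuming ffmpeg is installed. The helper names `wav_name_for` and `to_wav` are made up for this note and are not part of whisper.cpp.

```shell
#!/bin/bash
# Hypothetical helpers for preparing audio for whisper.cpp.

# Derive the WAV path by swapping the input's extension for ".wav".
# Assumes the file name actually has an extension.
wav_name_for() {
    printf '%s.wav\n' "${1%.*}"
}

# Convert any audio file ffmpeg can read into 16 kHz, mono, 16-bit PCM WAV.
to_wav() {
    ffmpeg -i "$1" -ar 16000 -ac 1 -c:a pcm_s16le "$(wav_name_for "$1")"
}
```

With these sourced, `to_wav episode.mp3` should leave an `episode.wav` next to the original, ready to pass to `./main` with `-f`.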

Getting podcast episodes

We’re using a tiny program called Poddl for listing and fetching the podcasts from a given RSS feed.

# List episodes for a podcast.
# Please use a feed url of your choice
> poddl https://feeds.soundcloud.com/users/soundcloud:users:311396902/sounds.rss -l | less

# Download an episode.
# -o .: use current folder as the output directory
# -i: add episode index number to downloaded episode
# -n 1: download the episode number 1
# -z 4: use a 4 digit numbering scheme for the name
> poddl https://feeds.soundcloud.com/users/soundcloud:users:311396902/sounds.rss -o . -i -n 1 -z 4

This will get the podcast episode in mp3 format.
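To find the downloaded file later, it helps to know the index prefix in its name. The sketch below assumes the prefix is simply the episode number zero-padded to the width given to `-z`; I have not checked this against every feed, so treat it as a guess. `pad_index` is a hypothetical helper, not a Poddl feature.

```shell
#!/bin/bash
# Hypothetical helper: zero-pad an episode number to a fixed width.
# Usage: pad_index NUMBER WIDTH
pad_index() {
    local number=$1 width=$2
    # 10# forces base 10 so inputs like "08" are not parsed as octal.
    printf '%0*d\n' "$width" "$((10#$number))"
}
```

For example, `ls "$(pad_index 1 4)"*` would list files starting with the padded index of episode 1, if the assumption about the naming holds.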

Automation!

Here’s the entire script I’m using with notes on customizations.

#!/bin/bash

set -e

# Change the feed URL and other settings as per your setup
FEED_URL=https://feeds.soundcloud.com/users/soundcloud:users:311396902/sounds.rss
OUT_DIR=tmp
TRANSCRIPT_DIR=transcripts
WHISPER_PATH=~/src/extern/whisper.cpp
CPU_CORES=6 # Number of physical CPU cores used for parallelization
WHISPER_BIN="$WHISPER_PATH/main"
WHISPER_MODEL="$WHISPER_PATH/models/ggml-base.en.q5_0.bin"

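COMMAND=${1:-}
if [ "$COMMAND" = "list" ]; then
    poddl "$FEED_URL" -l | less
    exit 0
elif [ "$COMMAND" = "get" ] && [ -n "${2:-}" ]; then
    EPISODE_NUMBER=$2
else
    # Unknown command or missing episode number: report failure to the caller
    echo "Usage: transcribe (list|get) [EPISODE_NUMBER]"
    exit 1
fi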

echo "Checking if temp and out directories exist..."
mkdir -p "$OUT_DIR"
mkdir -p "$TRANSCRIPT_DIR"

echo "Getting episode name..."
EPISODE_NAME=$(poddl "$FEED_URL" -l -n "$EPISODE_NUMBER" | tail -n 1)
echo "Episode: $EPISODE_NAME"

TRANSCRIPT_FILE="$TRANSCRIPT_DIR/$EPISODE_NAME.txt"
if [ -f "$TRANSCRIPT_FILE" ]; then
    echo "Transcript exists: $TRANSCRIPT_FILE"
    exit 0
fi

# FIXME Assumes podcast episode number is unique
# Quote the path so that an empty match is not mistaken for an existing file
AUDIO_FILE="$(ls "$OUT_DIR/$EPISODE_NUMBER"* 2>/dev/null | tail -n 1)"
if [ -f "$AUDIO_FILE" ]; then
    echo "Podcast audio exists: $AUDIO_FILE"
else
    echo "Downloading episode $EPISODE_NUMBER..."
    poddl "$FEED_URL" -o "$OUT_DIR" -s -n "$EPISODE_NUMBER"
fi

AUDIO_FILE="$(ls "$OUT_DIR/$EPISODE_NUMBER"* | tail -n 1)"
echo "Convert to wav..."
if [ -f "$OUT_DIR/out.wav" ]; then
    echo "Remove older wav file"
    rm "$OUT_DIR/out.wav"
fi
ffmpeg -i "$AUDIO_FILE" -ar 16000 "$OUT_DIR/out.wav"

echo "Transcribe..."
# -otxt appends ".txt" to the -of path, so pass the path without the extension
"$WHISPER_BIN" -m "$WHISPER_MODEL" \
    -p "$CPU_CORES" -pp \
    -f "$OUT_DIR/out.wav" -otxt -of "${TRANSCRIPT_FILE%.txt}"
echo "Done!"
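One thing the script does not handle: episode titles can contain characters such as `/` or `:` that make awkward or invalid file names for the transcript. A minimal sketch of a cleanup step follows. `safe_name` is a hypothetical helper, and the choice of characters to replace is my assumption, not something Poddl or whisper.cpp requires.

```shell
#!/bin/bash
# Hypothetical helper: turn an episode title into a safer file name.
# Replaces "/" and ":" with "-", squeezes runs of whitespace into one
# space, and trims a leading or trailing space.
safe_name() {
    printf '%s' "$1" | sed \
        -e 's|[/:]|-|g' \
        -e 's/[[:space:]][[:space:]]*/ /g' \
        -e 's/^ //' \
        -e 's/ $//'
}
```

In the script, this could wrap the title before building the path, for example `TRANSCRIPT_FILE="$TRANSCRIPT_DIR/$(safe_name "$EPISODE_NAME").txt"`.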

Base Whisper models are decent enough for English speech-to-text. With 6 processors and the base English model, it takes about 3 minutes to transcribe an hour of audio. Try a few models to see which one suits your scenario best!
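For planning a batch, the figure above gives a rough rule of thumb: about 3 minutes of processing per 60 minutes of audio on this particular box. The sketch below just applies that ratio and rounds up. `estimate_minutes` is a hypothetical helper, and the ratio will differ with other hardware or models.

```shell
#!/bin/bash
# Hypothetical helper: estimate transcription time in whole minutes,
# assuming roughly 3 minutes of work per 60 minutes of audio.
# Usage: estimate_minutes AUDIO_MINUTES
estimate_minutes() {
    local audio_minutes=$1
    # Integer ceiling of audio_minutes * 3 / 60
    echo $(( (audio_minutes * 3 + 59) / 60 ))
}
```

So a 90 minute episode would be expected to take around 5 minutes here, under the stated assumption.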