A mini note on a setup that automatically downloads podcast episodes from a feed and transcribes them with a Whisper speech-to-text model.
Whisper on local box
For this experiment, I’m using an old desktop with 4GB VRAM. [Whisper.cpp] fits the bill: it runs Whisper on the CPU and can offload work to the GPU when built with CUDA.
# Clone and build the whisper.cpp repo for GPU
> git clone https://github.com/ggerganov/whisper.cpp.git
> cd whisper.cpp
> WHISPER_CUDA=1 make -j
# Get the base model and quantize it to fit our GPU
> make base.en
> ./quantize models/ggml-base.en.bin models/ggml-base.en.q5_0.bin q5_0
# Optional: Run transcription for any wav file like this
# -p 6: using 6 physical cores for the transcription process
> ./main -m models/ggml-base.en.q5_0.bin -p 6 -f /mnt/media/podcasts/vedantany/tmp/out.wav -otxt
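Since the script later in this note resamples audio to 16 kHz before handing it to whisper.cpp, it can be handy to check a WAV file's sample rate before starting a long run. The following is a minimal sketch, assuming a canonical WAV header (the `fmt ` chunk comes first, so the sample rate sits at byte offset 24) and a little-endian machine. `wav_sample_rate` and `is_whisper_ready` are hypothetical helper names, not part of whisper.cpp.

```shell
# wav_sample_rate FILE
# Prints the sample rate stored in a canonical WAV header.
# Assumption: the rate is the 4-byte little-endian integer at byte offset 24,
# and od runs on a little-endian host (x86, most ARM).
wav_sample_rate() {
    od -An -t u4 -j 24 -N 4 "$1" | tr -d ' \n'
}

# is_whisper_ready FILE
# Succeeds only if FILE reports a 16000 Hz sample rate.
is_whisper_ready() {
    [ "$(wav_sample_rate "$1")" = "16000" ]
}
```

For example, `is_whisper_ready out.wav || echo "resample with ffmpeg first"` would flag a file that still needs conversion.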
Getting podcast episodes
We’re using a tiny program called Poddl to list and fetch episodes from a given RSS feed.
# List episodes for a podcast.
# Please use a feed url of your choice
> poddl https://feeds.soundcloud.com/users/soundcloud:users:311396902/sounds.rss -l | less
# Download an episode.
# -o .: use the current folder as the output directory
# -i: add the episode index number to the downloaded file name
# -n 1: download episode number 1
# -z 4: zero-pad the index to 4 digits
> poddl https://feeds.soundcloud.com/users/soundcloud:users:311396902/sounds.rss -o . -i -n 1 -z 4
This will get the podcast episode in mp3 format.
Automation!
Here’s the entire script I’m using with notes on customizations.
#!/bin/bash
set -e
# Change these settings per your podcast feed
FEED_URL=https://feeds.soundcloud.com/users/soundcloud:users:311396902/sounds.rss
OUT_DIR=tmp
TRANSCRIPT_DIR=transcripts
WHISPER_PATH=~/src/extern/whisper.cpp
WHISPER_BIN="$WHISPER_PATH/main"
WHISPER_MODEL="$WHISPER_PATH/models/ggml-base.en.q5_0.bin"
COMMAND=$1
if [ "$COMMAND" = "list" ]; then
    poddl "$FEED_URL" -l | less
    exit 0
elif [ "$COMMAND" = "get" ] && [ -n "$2" ]; then
    EPISODE_NUMBER=$2
else
    # Unknown command or missing episode number: print usage and fail
    echo "Usage: transcribe (list|get) [EPISODE_NUMBER]"
    exit 1
fi
echo "Ensuring temp and transcript directories exist..."
mkdir -p "$OUT_DIR"
mkdir -p "$TRANSCRIPT_DIR"
echo "Getting episode name..."
EPISODE_NAME=$(poddl "$FEED_URL" -l -n "$EPISODE_NUMBER" | tail -n 1)
echo "Episode: $EPISODE_NAME"
TRANSCRIPT_FILE="$TRANSCRIPT_DIR/$EPISODE_NAME"
if [[ -f "$TRANSCRIPT_FILE.txt" ]]; then
    echo "Transcript exists: $TRANSCRIPT_FILE.txt"
    exit 0
fi
# FIXME Assumes podcast episode number is unique
# 2>/dev/null hides the ls error when nothing has been downloaded yet
AUDIO_FILE="$(ls "$OUT_DIR/$EPISODE_NUMBER"* 2>/dev/null | tail -n 1)"
if [[ -n "$AUDIO_FILE" && -f "$AUDIO_FILE" ]]; then
    echo "Podcast audio exists: $AUDIO_FILE"
else
    echo "Downloading episode $EPISODE_NUMBER..."
    poddl "$FEED_URL" -o "$OUT_DIR" -s -n "$EPISODE_NUMBER"
fi
AUDIO_FILE="$(ls "$OUT_DIR/$EPISODE_NUMBER"* | tail -n 1)"
echo "Convert to wav..."
if [[ -f "$OUT_DIR/out.wav" ]]; then
    echo "Remove older wav file"
    rm "$OUT_DIR/out.wav"
fi
# Resample to 16 kHz; -ac 1 and pcm_s16le produce mono 16-bit PCM
ffmpeg -i "$AUDIO_FILE" -ar 16000 -ac 1 -c:a pcm_s16le "$OUT_DIR/out.wav"
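echo "Transcribe..."
# -p 6: 6 processors, -pp: print progress, -otxt: write a .txt transcript,
# -of: output path without the extension
"$WHISPER_BIN" -m "$WHISPER_MODEL" \
    -p 6 -pp \
    -f "$OUT_DIR/out.wav" -otxt -of "$TRANSCRIPT_FILE"
echo "Done!"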
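The FIXME in the script is there because a prefix glob such as `1*` also matches files for episodes 10, 11 and so on. One possible way around that is sketched below: compare the leading digits of each file name numerically instead of globbing on a prefix. `find_episode_audio` is a hypothetical helper, and it assumes downloaded file names start with the (optionally zero-padded) episode number.

```shell
# find_episode_audio DIR EPISODE_NUMBER
# Prints the first file in DIR whose leading digits equal EPISODE_NUMBER,
# ignoring zero padding, so "1" matches 0001.mp3 but not 0010.mp3.
# The body runs in a subshell, so its variables stay local.
find_episode_audio() (
    dir=$1
    want=$(printf '%s' "$2" | sed 's/^0*//')
    for file in "$dir"/*; do
        [ -f "$file" ] || continue
        name=${file##*/}
        digits=${name%%[!0-9]*}     # leading run of digits, may be empty
        [ -n "$digits" ] || continue
        have=$(printf '%s' "$digits" | sed 's/^0*//')
        if [ "$have" = "$want" ]; then
            printf '%s\n' "$file"
            exit 0
        fi
    done
    exit 1                          # no matching file found
)
```

In the script, this could stand in for the two `ls ... | tail -n 1` lookups, for example `AUDIO_FILE=$(find_episode_audio "$OUT_DIR" "$EPISODE_NUMBER" || true)`.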
Base Whisper models are decent enough for English speech-to-text. With 6 processors and the base English model, it takes ~3 minutes to transcribe an hour's worth of audio. Try a few models to see which one best suits your scenario!
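To compare models on your own hardware, it helps to reduce each run to a single number: seconds of audio transcribed per second of wall-clock time. Here is a rough sketch of that arithmetic. `realtime_factor` and `timed_run` are hypothetical helpers, and the sample figures are the ones quoted above (one hour of audio in about three minutes).

```shell
# realtime_factor AUDIO_SECONDS ELAPSED_SECONDS
# Prints how many seconds of audio were processed per second of wall time.
realtime_factor() {
    awk -v audio="$1" -v elapsed="$2" 'BEGIN {
        if (elapsed <= 0) { print "n/a"; exit }
        printf "%.1f\n", audio / elapsed
    }'
}

# timed_run COMMAND [ARGS...]
# Runs the command quietly and prints the elapsed whole seconds.
# The body runs in a subshell, so its variables stay local.
timed_run() (
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
)

realtime_factor 3600 180   # one hour of audio in ~3 minutes
```

The last line prints 20.0, meaning roughly 20 times faster than real time. Wrapping the whisper.cpp command for each model in `timed_run` gives the elapsed value to plug in.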