Recently I started creating videos for an AI course, and I wanted subtitles/captions (.srt and .vtt files) for them. I explored various online tools, but they impose a lot of limitations. Previously I used YouTube to create subtitles, but this time I had already uploaded similar videos (with slightly different durations) to my channel, and uploading them again would create a duplicate-content issue. So I decided to run Whisper on my local computer.
First I installed Whisper using the pip command below. Note that ffmpeg itself is a system tool, not a Python package, so it has to be installed separately (for example, sudo apt install ffmpeg on Ubuntu or brew install ffmpeg on macOS).
pip install openai-whisper
Then I ran the command below to create a .wav audio file using ffmpeg (-ac 2 sets two audio channels, -ar 16000 sets a 16 kHz sample rate, and -vn drops the video stream):
ffmpeg -i "what is AI.mp4" -ac 2 -ar 16000 -vn "what_is_AI.wav"
And I created the Python code below (transcribe.py) to create the .srt file.
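A minimal transcribe.py along these lines, using the openai-whisper Python API (the "base" model size and the file names are placeholders; larger models such as "small" or "medium" are more accurate but slower):

```python
def format_timestamp(seconds):
    """Convert seconds to an SRT timestamp like 00:01:02,345."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def transcribe_to_srt(audio_path, srt_path, model_name="base"):
    """Transcribe an audio file and write the segments as an .srt file."""
    import whisper  # assumes the openai-whisper package is installed

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    with open(srt_path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], start=1):
            f.write(f"{i}\n")
            f.write(f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}\n")
            f.write(seg["text"].strip() + "\n\n")

# transcribe_to_srt("what_is_AI.wav", "what_is_AI.srt")
```

Uncomment the last line (or add your own call) before running the script.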
Running this code with "python3 transcribe.py" created the .srt file.
I was surprised by the quality of the .srt file; it is even better than the subtitles generated by YouTube. It is really amazing to create good subtitle files without using the internet.
A few platforms won't accept this .srt file. For example, Udemy accepts only the .vtt format.
So, in this case, we can use the Python code below to convert the .srt file into a .vtt file.
Whisper handled my videos well out of the box, but I also looked into adapting it to my own voice. Since OpenAI doesn't offer an official fine-tuning service for Whisper, the best way to adapt it to your voice is through alternative methods such as custom post-processing, speaker adaptation, and dataset-based training. Here's how you can proceed:
1. Improve Whisper’s Accuracy for Your Voice
Even though Whisper can't be fine-tuned directly, you can adapt it using the following methods:
A. Use a Custom Vocabulary & Prompting
Whisper supports custom prompts to bias its transcription. You can pass commonly misrecognized words as a prompt:
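For example, using the initial_prompt parameter of the openai-whisper API (the helper names and the example terms are assumptions):

```python
def build_prompt(terms):
    """Join often-misrecognized words into a single prompt string."""
    return ", ".join(terms)

def transcribe_with_vocab(audio_path, terms, model_name="base"):
    import whisper  # assumes the openai-whisper package is installed

    model = whisper.load_model(model_name)
    # initial_prompt biases Whisper's decoder toward the supplied vocabulary
    return model.transcribe(audio_path, initial_prompt=build_prompt(terms))

# transcribe_with_vocab("what_is_AI.wav", ["Whisper", "ffmpeg", "SRT", "VTT"])
```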
This helps Whisper recognize your name, technical terms, or unique words you use often.
B. Use Custom Word Replacement (Post-processing)
If Whisper frequently misinterprets specific words, use text correction with Python:
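A minimal sketch of such post-processing (the correction entries are hypothetical examples; build your own dictionary from errors you actually see):

```python
import re

# Hypothetical examples of words Whisper might get wrong for a given speaker
CORRECTIONS = {
    "wisper": "Whisper",
    "sub title": "subtitle",
}

def correct_transcript(text, corrections=CORRECTIONS):
    """Replace known misrecognitions with the intended words."""
    for wrong, right in corrections.items():
        # \b keeps replacements anchored to whole words
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text, flags=re.IGNORECASE)
    return text
```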
This method helps fix errors specific to your voice.
2. Train a Custom ASR Model with Your Voice
If you need real fine-tuning, you can train a smaller ASR (Automatic Speech Recognition) model:
A. Collect Audio & Transcripts
- Record at least 5-10 hours of your voice.
- Create text transcripts for each recording.
B. Train a Model Using Open-Source ASR Frameworks
If Whisper fine-tuning isn't an option, you can train an alternative ASR model:
- ESPnet – Open-source ASR framework supporting speaker adaptation.
- Kaldi – Traditional ASR system that can adapt to your voice.
- NVIDIA NeMo – Train ASR models with custom datasets.
Example: Fine-tuning a model with NVIDIA NeMo:
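A rough, untested sketch based on NeMo's ASR API (the pretrained model name, manifest path, and hyperparameters are all assumptions; a real run needs a GPU and carefully prepared data):

```python
def make_train_config(manifest_path, labels, batch_size=8, sample_rate=16000):
    # NeMo training manifests are JSON lines, one utterance per line, like:
    # {"audio_filepath": "clip1.wav", "duration": 4.2, "text": "hello world"}
    return {
        "manifest_filepath": manifest_path,
        "sample_rate": sample_rate,
        "labels": labels,
        "batch_size": batch_size,
    }

def finetune(manifest_path, model_name="QuartzNet15x5Base-En", epochs=10):
    # Heavy imports kept local; requires NeMo and PyTorch Lightning installed
    import pytorch_lightning as pl
    import nemo.collections.asr as nemo_asr

    model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name)
    model.setup_training_data(
        train_data_config=make_train_config(manifest_path, list(model.decoder.vocabulary))
    )
    trainer = pl.Trainer(max_epochs=epochs, accelerator="gpu", devices=1)
    trainer.fit(model)
    model.save_to("my_voice_asr.nemo")
```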
This method requires GPU power, but it allows full customization.
3. Use Speaker Adaptation (Voice ID)
Another workaround is to train a speaker recognition model alongside Whisper:
- Use a model like Wav2Vec 2.0 or DeepSpeaker to recognize your voice.
- Apply speaker-specific corrections.
This method helps Whisper adapt to your unique pronunciation.
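One way to sketch the speaker-check step, here using SpeechBrain's pretrained ECAPA verification model as a stand-in for the models mentioned above (SpeechBrain, the model source string, and the decision threshold are assumptions):

```python
def same_speaker(score, threshold=0.25):
    """Decide whether a verification score means 'same speaker' (threshold is an assumption)."""
    return score >= threshold

def is_my_voice(reference_wav, segment_wav):
    # SpeechBrain is an assumption; the article suggests Wav2Vec 2.0 or DeepSpeaker
    from speechbrain.pretrained import SpeakerRecognition

    verifier = SpeakerRecognition.from_hparams(
        source="speechbrain/spkrec-ecapa-voxceleb", savedir="spkrec"
    )
    score, _ = verifier.verify_files(reference_wav, segment_wav)
    return same_speaker(float(score))
```

Segments identified as your voice can then be passed through the speaker-specific corrections from section 1.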
Final Thoughts
- If Whisper misrecognizes words, try custom prompts and text correction.
- If Whisper struggles with your accent, train a smaller ASR model on your voice.
- If you need true fine-tuning, consider using ESPnet or NeMo.