Generate VTT captions from video or audio files with word-level timestamps. Optionally translate captions to multiple languages.
Upload your video or audio file directly, or select from your Sirv library. Supports MP4, MOV, WebM, MP3, WAV, and more.
Optionally select languages to translate your captions into. We support 50+ languages including all major world languages.
Get captions in VTT format with word-level timestamps. Preview captions synced to your video, then download for use anywhere.
Powered by OpenAI's Whisper model, one of the most accurate speech recognition systems available.
Precise timestamps for every word, enabling accurate highlighting and karaoke-style captions.
Translate captions to 50+ languages including Spanish, French, German, Japanese, Chinese, Arabic, Hindi, and many more.
Industry-leading transcription accuracy for clear audio in supported languages.
Industry-standard WebVTT format, supported by all major video players and platforms.
Preview captions synced to your video directly in the browser before downloading.
Select videos directly from your Sirv library for seamless workflow integration.
Works with audio-only files like podcasts, interviews, and voice recordings.
VTT format works with all major video platforms.
Make your video content accessible to deaf and hard-of-hearing viewers with accurate captions.
Add captions for viewers watching without sound on Facebook, Instagram, LinkedIn, and TikTok. By some industry estimates, as much as 85% of social video is watched muted.
Reach global audiences by translating video captions into multiple languages automatically.
Create transcripts for podcasts and audio content to improve SEO and provide text alternatives.
Add captions to educational content for better comprehension and accessibility compliance.
Caption internal videos, training materials, and company announcements for global teams.
Transcription costs 1 credit per minute of audio/video (rounded up to the next full minute). Each translation language adds a flat 1 credit. For example, a 3-minute video with 2 translations would cost 5 credits (3 + 2).
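As a rough sketch of the pricing arithmetic above, the following Python function estimates the credit cost of a job. The function name and parameters are hypothetical and are not part of any Sirv API; it only encodes the two stated rules (1 credit per started minute, plus 1 credit per translation language).

```python
import math


def estimate_credits(duration_seconds: float, translation_languages: int = 0) -> int:
    """Estimate credits for one file (hypothetical helper, not a Sirv API)."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    if translation_languages < 0:
        raise ValueError("translation_languages cannot be negative")
    # Transcription: 1 credit per minute, rounded up to the next full minute.
    minutes = math.ceil(duration_seconds / 60)
    # Translation: 1 credit per target language.
    return minutes + translation_languages


if __name__ == "__main__":
    # The 3-minute video with 2 translations from the example above.
    print(estimate_credits(180, 2))
```

Under these rules a file that runs even one second past a full minute is billed for the next minute, so a 3:01 video with no translations would be estimated at 4 credits.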
We support 50+ languages including: Spanish, French, German, Portuguese, Italian, Dutch, Polish, Russian, Japanese, Korean, Chinese, Arabic, Hindi, Turkish, Vietnamese, Thai, Indonesian, Ukrainian, Swedish, and many more. The original language is auto-detected.
Video: MP4, MOV, WebM, AVI, MKV. Audio: MP3, WAV, M4A, FLAC, OGG. Files up to several GB are supported; how large an upload is practical depends on your connection.
We use OpenAI's Whisper model, one of the most accurate speech recognition systems available. Accuracy is typically 95%+ for clear audio in supported languages. Background noise and multiple speakers may reduce accuracy.
Yes! Download the VTT file and edit it in any text editor or use specialized caption editing software. VTT is a simple, human-readable format.
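Because VTT is plain text, simple edits can also be scripted. The sketch below shifts every timestamp in a caption file by a fixed offset, which can help when captions run slightly early or late. It is a minimal illustration and not part of the tool: the function name is hypothetical, and it assumes timestamps are written in the full `hh:mm:ss.mmm` form (WebVTT also allows the shorter `mm:ss.mmm`, which this sketch does not handle).

```python
import re

# Matches hh:mm:ss.mmm timestamps (assumption: the hours field is always present).
TIMESTAMP = re.compile(r"(\d{2,}):(\d{2}):(\d{2})\.(\d{3})")


def shift_vtt(text: str, offset_seconds: float) -> str:
    """Return the VTT text with every timestamp moved by offset_seconds."""
    offset_ms = round(offset_seconds * 1000)

    def shift(match: re.Match) -> str:
        hours, minutes, seconds, millis = (int(g) for g in match.groups())
        total_ms = ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis
        # Clamp at zero so a negative offset never yields a negative time.
        total_ms = max(total_ms + offset_ms, 0)
        hours, rem = divmod(total_ms, 3_600_000)
        minutes, rem = divmod(rem, 60_000)
        seconds, millis = divmod(rem, 1000)
        return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"

    return TIMESTAMP.sub(shift, text)


if __name__ == "__main__":
    sample = "WEBVTT\n\n00:00:01.000 --> 00:00:02.500\nHello\n"
    print(shift_vtt(sample, 1.5))
```

Note that the pattern rewrites anything that looks like a timestamp, including text inside a caption, so it is worth reviewing the result before publishing.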
Every word in the caption has its own timestamp, not just each line. This enables precise highlighting, karaoke-style effects, and more accurate subtitle timing.
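To illustrate what per-word timing can look like in a caption file, here is a minimal sketch that builds one WebVTT cue from a list of timed words. It assumes word timings are encoded as inline cue timestamp tags (for example `<00:00:00.500>`), which the WebVTT format defines for karaoke-style highlighting; the exact layout of the files this tool produces is not specified here, and the function names and input shape are hypothetical.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT hh:mm:ss.mmm timestamp."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def words_to_cue(words: list[tuple[str, float, float]]) -> str:
    """Build one cue from (text, start, end) tuples (assumed input shape)."""
    if not words:
        raise ValueError("a cue needs at least one word")
    cue_start = format_timestamp(words[0][1])
    cue_end = format_timestamp(words[-1][2])
    # The first word appears at the cue start; each later word is prefixed
    # with an inline timestamp tag marking when it becomes active.
    parts = [words[0][0]]
    for text, word_start, _word_end in words[1:]:
        parts.append(f"<{format_timestamp(word_start)}>{text}")
    return f"{cue_start} --> {cue_end}\n" + " ".join(parts)


if __name__ == "__main__":
    timed_words = [("Hello", 0.0, 0.5), ("world", 0.5, 1.25)]
    print("WEBVTT\n")
    print(words_to_cue(timed_words))
```

Players that do not understand inline timestamp tags generally show the whole cue at once, so a file in this style should still work as ordinary line-level captions.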
Yes! Search engines can index caption text, making your video content more discoverable. Captions can also improve watch time and engagement, which may in turn help your rankings.