====== How to rip DVD subtitles with vobsub2srt ======

The **vobsub2srt** program reads a pair of **subtitles.sub** and **subtitles.idx** files, OCRs the images contained in the //sub// file and creates a **subtitles.srt** file with the subtitle text and the appropriate timing information obtained from the **idx** file.

The **vobsub2srt** program does not exist in Debian 12 Bookworm, but it should be possible to compile it from source (see the **[[https://github.com/ruediger/VobSub2SRT|VobSub2SRT GitHub]]** repository). Alternatively you can get the binary package from the **[[https://deb-multimedia.org/dists/testing/main/binary-i386/package/vobsub2srt|Deb Multimedia repository]]**.

The required Debian packages are:

  * **lsdvd** - From the official Debian repository.
  * **vobcopy** - From the official Debian repository.
  * **mediainfo** - From the official Debian repository.
  * **mkvtoolnix** - From the official Debian repository.
  * **vobsub2srt** - From the Deb Multimedia repository.

===== Ripping the .vob from the DVD =====

A DVD can contain several **titles** and you should identify which one you want to rip; generally it is the longest one or the one with the most chapters. We check the DVD content using the **lsdvd** tool:

  lsdvd /dev/sr0
  Disc Title: DVD_TITLE
  Title: 01, Length: 01:02:36.480 Chapters: 03, Cells: 03, Audio streams: 02, Subpictures: 04
  Title: 02, Length: 00:00:12.800 Chapters: 01, Cells: 01, Audio streams: 02, Subpictures: 04
  Title: 03, Length: 00:21:01.760 Chapters: 01, Cells: 01, Audio streams: 02, Subpictures: 04
  Title: 04, Length: 00:00:00.480 Chapters: 01, Cells: 01, Audio streams: 02, Subpictures: 04
  Title: 05, Length: 00:21:10.000 Chapters: 01, Cells: 01, Audio streams: 02, Subpictures: 04
  Title: 06, Length: 00:20:24.720 Chapters: 01, Cells: 01, Audio streams: 02, Subpictures: 04
  Longest track: 01

The longest title is **#1**, so we extract it using **vobcopy**:

  vobcopy -n '1' -i /dev/sr0 --large-file -o .

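
The title selection and the copy can also be scripted. The following is only a minimal sketch: it assumes that **lsdvd** prints a ''Longest track:'' line as in the listing above, and the ''longest_title'' function name is made up for this example. Keep in mind that the longest title is not always the one you want.

```shell
#!/bin/sh
# Print the number of the longest title, reading lsdvd output from stdin.
# Adding 0 strips the leading zero, so "01" is printed as "1".
# (longest_title is a hypothetical helper, not part of lsdvd.)
longest_title() {
    awk '/^Longest track:/ { print $3 + 0 }'
}

# Possible usage, with a DVD in /dev/sr0:
#   title=$(lsdvd /dev/sr0 | longest_title)
#   vobcopy -n "$title" -i /dev/sr0 --large-file -o .
```

The **vobcopy** call in the usage comment takes the same options as the manual command, with the title number coming from the helper.
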
The resulting file will be saved into the working directory (as specified by the **%%-o%%** option) and it will be named after the DVD title, something like **DVD_TITLE.vob**. You can inspect the content of the file using the **mediainfo** tool; in our case the file contains one video stream, two audio streams and three subtitle streams. The subtitles are in the standard DVD format, VobSub, which is an image (bitmap) format, not text.

===== Converting the .vob into .mkv format =====

As far as I know, there is no tool capable of extracting the VobSub subtitles directly from the //vob// file; we might hope that **ffmpeg** could do this, but it seems not. Fortunately **mkvextract** (from the mkvtoolnix Debian package) can extract the VobSub stream from a //mkv// file, so we first use ffmpeg to convert the //vob// into //mkv//. In the following example all the streams are copied, without re-encoding. At this step you may want to re-encode the video instead, to squeeze the MPEG2 stream into the more efficient H264 format.

  ffmpeg -probesize 500M -analyzeduration 500M \
      -i 'DVD_TITLE.vob' \
      -map 0:v:0 -map 0:a:0 -map 0:a:1 -map 0:s:0 -map 0:s:1 -map 0:s:2 \
      -vcodec 'copy' \
      -acodec 'copy' \
      -scodec 'copy' \
      'DVD_TITLE.mkv'

Notice the several **%%-map%%** options required to embed all the source streams into the destination file; in our example we have **one video** stream, **two audio** streams and **three subtitle** streams. The **%%-probesize%%** and **%%-analyzeduration%%** options are required because the subtitle streams do not start at the very beginning of the file, so they may be missed if ffmpeg analyzes only the first part of the input.

===== Extracting .sub and .idx files from the .mkv =====

From the //mkv// file it is now possible to create **two files** (.sub and .idx) for each subtitle stream.

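
To find out which track IDs hold subtitles, the track list printed by **mkvmerge** (also part of the mkvtoolnix package) can be filtered. This is only a sketch: it assumes that ''mkvmerge %%--%%identify'' prints one line per track in the form ''Track ID 3: subtitles (VobSub)'', and the ''subtitle_track_ids'' function name is made up for this example.

```shell
#!/bin/sh
# Print the IDs of the subtitle tracks, one per line, reading the output
# of "mkvmerge --identify" from stdin.
# Assumption: track lines look like "Track ID 3: subtitles (VobSub)".
# (subtitle_track_ids is a hypothetical helper, not part of mkvtoolnix.)
subtitle_track_ids() {
    sed -n 's/^Track ID \([0-9][0-9]*\): subtitles.*/\1/p'
}

# Possible usage:
#   mkvmerge --identify 'DVD_TITLE.mkv' | subtitle_track_ids
```

If your version of mkvmerge prints the track list differently, the pattern has to be adapted.
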
The stream numbering expected by ''mkvextract'' in our example is as follows: **#0** is the video stream, **#1** and **#2** are the two audio streams, so the first subtitle stream is **#3**:

  mkvextract 'DVD_TITLE.mkv' tracks -c 'S_VOBSUB' '3:subtitles-3'

The result will be two files: **subtitles-3.sub** and **subtitles-3.idx**. The command can be repeated to extract the other subtitle streams (**#4** and **#5** in our example).

===== OCR the images from the .sub file =====

  vobsub2srt --ifo './VTS_01_0.IFO' --dump-images --tesseract-lang ita 'subtitles-3'

The .IFO file is used to get the correct palette, width and height, but it is not mandatory.
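
With several subtitle streams, the extraction and OCR steps can be wrapped in a loop. The following is only a sketch under some assumptions: the ''run'' and ''rip_subtitles'' function names and the ''DRY_RUN'' variable are made up for this example, the optional ''%%--%%ifo'' and ''%%--%%dump-images'' options are left out, and so is the ''-c'' option of mkvextract, on the assumption that it is not needed for VobSub tracks.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# rip_subtitles MKV_FILE TESSERACT_LANG TRACK_ID...
# For each track ID, extract subtitles-ID.sub and subtitles-ID.idx from
# the mkv file, then OCR them with vobsub2srt.
rip_subtitles() {
    mkv=$1
    lang=$2
    shift 2
    for id in "$@"; do
        run mkvextract "$mkv" tracks "${id}:subtitles-${id}" || return 1
        run vobsub2srt --tesseract-lang "$lang" "subtitles-${id}" || return 1
    done
}

# Possible usage for the three streams of our example:
#   rip_subtitles 'DVD_TITLE.mkv' ita 3 4 5
```

Running it first with ''DRY_RUN=1'' prints the commands without executing them, which helps to check the track IDs before starting the OCR.
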