--- doc:appunti:linux:video:ripping_dvds_with_mencoder [2017/10/12 10:20] – [Extracting the subtitles] niccolo
+++ doc:appunti:linux:video:ripping_dvds_with_mencoder [2020/04/21 17:05] (current) – [OCRing] niccolo
@@ Line 1: / Line 1: @@
 ====== Ripping DVDs with Mencoder ======
+:!: For a simple recipe to rip (extract) the content of a DVD using Debian 10, see **[[vobcopy]]**.
 ===== Install the necessary programs =====
@@ Line 200: / Line 201: @@
 ===== Extract Subtitles with transcode =====
+FIXME The following programs are **missing in Debian 10 Buster**: **tcextract**, **subtitle2vobsub** and **subtitle2pgm**. We are searching for some alternatives.
 DVDs have subtitles stored as images. There are some options for dealing with them:
@@ Line 264: / Line 267: @@
 <code>
-subtitle2pgm -o english -c 255,0,0,255 < subs-en
+cat subtitles_stream.ps1 | subtitle2pgm
 </code>
-Each subtitle should now be one pgm file, and a srtx file will be created to index them and their times on-screen.
+If you want to control how grey levels are converted, use the **%%-c%%** option of subtitle2pgm, for example: **%%-c 255,0,0,255%%**.
-Now to ocr all that with gocr (using a nice wrapper for the job):
+Each subtitle should now be one file named like **movie_subtitle0003.pgm**, and a **movie_subtitle.srtx** file will be created to index them and their times on-screen.
+=== With Tesseract OCR ===
+<code bash>
+#!/bin/sh
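+# OCR every .pgm image in order; Tesseract appends ".txt" to the output base,
+# so each result lands in a file like movie_subtitle0003.pgm.txt.
+find . -type f -name '*.pgm' | sort | while IFS= read -r file; do
+    # printf is portable under /bin/sh, unlike "echo -n".
+    printf '%s ' "$(basename "$file")"
+    tesseract -l eng --psm 4 "$file" "$file"
+done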
+</code>
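After the loop above, some subtitles may come out empty when the OCR fails on an image. The following is a minimal sketch, not part of the original recipe, for listing those files so they can be reviewed by hand. The function name ''list_blank_ocr'' is hypothetical, and it assumes the OCR output is named ''*.pgm.txt'', as produced by the loop above.

```shell
#!/bin/sh
# Hypothetical helper: list OCR text files that contain no visible characters.
# Assumes the OCR output files are named *.pgm.txt.
list_blank_ocr() {
    dir=${1:-.}
    find "$dir" -type f -name '*.pgm.txt' | sort | while IFS= read -r txt; do
        # grep succeeds only if the file has at least one non-whitespace
        # character; form feeds and blank lines count as whitespace.
        if ! grep -q '[^[:space:]]' "$txt"; then
            printf '%s\n' "$txt"
        fi
    done
}
```

Calling ''list_blank_ocr .'' in the directory holding the images prints one file name per line; an empty result means every subtitle produced some text.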
+=== With Gocr ===
+**NOTICE**: Don't use the following method, because Gocr is not the best tool for OCR. Use **Tesseract OCR** instead.
+To OCR all the .pgm images with **gocr** (using a nice wrapper for the job):
 <code>
-pgm2txt english
+pgm2txt -d -f en -v -s 10 movie_subtitle
 </code>
 It will prompt you for tons of characters that it doesn't understand, and often totally bugger them up even when you give it the correct ones (it reads part of what it showed you again as another character...)
-We will re-merge all these text files produced into a big subtitle file:
+==== Make a single .srt file ====
+Now we will merge all the text files produced into a single subtitle file:
 <code>
-srttool -s -w < english.srtx > english.srt
+srttool -s -w < movie_subtitle.srtx > movie_subtitle.srt
 </code>
@@ Line 302: / Line 323: @@
 You can now add english.srt onto the end of your ''ogmmerge'' command. Oh, and stick a ''-c LANGUAGE=English'' before it ;-)
+==== Fixing time, etc  ====
+Finally, you can proofread the final .srt file with **Gaupol**, a full-featured graphical subtitle editor. It handles some of the most commonly required operations:
+  * **Shift times**, from //Tools//, //Shift Positions...//
+  * **Renumber subtitles**: this is done automatically when you save the project.
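If no graphical session is available, the **Shift times** operation can be approximated from the command line. This is only a sketch under stated assumptions, not a replacement for Gaupol: the function name ''shift_srt'' is hypothetical, it expects standard ''HH:MM:SS,mmm %%-->%% HH:MM:SS,mmm'' timing lines, and it clamps negative results to zero.

```shell
#!/bin/sh
# Hypothetical helper: shift every timing line of an .srt file.
# Usage: shift_srt MILLISECONDS < in.srt > out.srt
shift_srt() {
    awk -v shift_ms="$1" '
    # Convert "HH:MM:SS,mmm" to milliseconds.
    function to_ms(t,    p, s) {
        split(t, p, ":")
        split(p[3], s, ",")
        return ((p[1] * 60 + p[2]) * 60 + s[1]) * 1000 + s[2]
    }
    # Convert milliseconds back to "HH:MM:SS,mmm", clamping at zero.
    function to_ts(ms,    h, m, s) {
        if (ms < 0) ms = 0
        h = int(ms / 3600000); ms -= h * 3600000
        m = int(ms / 60000);   ms -= m * 60000
        s = int(ms / 1000);    ms -= s * 1000
        return sprintf("%02d:%02d:%02d,%03d", h, m, s, ms)
    }
    # Timing lines look like: 00:00:01,000 --> 00:00:02,500
    $2 == "-->" && $1 ~ /^[0-9]+:[0-9]+:[0-9]+,[0-9]+$/ {
        sub(/\r$/, "")   # tolerate DOS line endings
        print to_ts(to_ms($1) + shift_ms) " --> " to_ts(to_ms($3) + shift_ms)
        next
    }
    # Everything else (index numbers, text, blank lines) passes through.
    { print }
    '
}
```

For example, ''shift_srt 1500 < movie_subtitle.srt > shifted.srt'' delays every subtitle by one and a half seconds; a negative value moves them earlier.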
 ===== Links =====