I recently had the desire to rip some DVDs so I could watch them on my computer without swapping discs. I figured I could just pull everything from the DVD into Matroska files, since Matroska supports everything that DVDs do. When I went looking on the Internet, I found few resources for moving from DVD to MKV, and everything that did talk about it actually reencoded the DVD video to get it into its final destination. Since Matroska can contain all of the codecs native to DVDs, I wanted to transfer everything losslessly. This is how I did it. (Note that I’m using the Linux command line; I prefer Linux to Windows, and the command line to X.)
The programs I used are as follows:
- dvdbackup (optional)
- xine (optional; only for cracking CSS)
- lsdvd
- transcode, mostly tcextract
- subtitleripper, for subtitle2vobsub
- ogmtools, for dvdxchap
- mkvtoolnix, for mkvmerge
- mplayer, for most of the stream dumping
§ Some Background
I don’t know all the details of how data is stored on DVDs, but here’s a rough overview. The video on DVDs is encoded in either MPEG-2 or MPEG-1 with a variable bit rate. The audio can be in raw PCM, DTS, MP2, or AC3. Most DVDs use AC3. Not all DVD players support DTS. Subtitles are stored as bitmaps with associated timecodes governing when to show them on screen.
In a DVD, the basic unit of video is a title. Each title consists of one or more video streams, zero or more audio streams, zero or more subtitle streams, and a list of timepoints to mark chapter boundaries. The titles on a DVD are grouped together into titlesets. The grouping may be arbitrarily chosen by the DVD manufacturer, but all titles in a given titleset must have the same video encoding parameters (codec, dimensions, framerate, etc.). All of the data for each titleset is concatenated together into VOB format and then split into 1GB chunks. The net result is that there’s no one-to-one correspondence between files on the DVD and individual titles. Worse, titles are actually implemented as start and end indices into the VOB stream, so it’s entirely possible for titles to overlap each other. This often shows up in TV show DVDs with a “play all” option: all of the episodes are in a single titleset and the “play all” menu option goes to a title that spans the entire titleset, while individual episodes are titles that only span the relevant part of the titleset.
If a title has more than one video stream, one will be the primary stream while the others represent alternate angles. Few DVDs have multiple angles, so I’m not sure how the data for those works; all of the DVDs I’ve seen just have a single video stream for each title.
Also note that not all of the titles on a DVD are the feature content. Almost every bit of video, including DVD extras like bloopers and “making of” videos, is stored as a title. The one exception is the DVD menus. Those are also stored on the DVD as VOBs, but they’re indexed differently, so they don’t show up as titles. Be aware that DVD easter eggs, including some apparently-longer videos, are often implemented as menus, so they won’t show up as titles.
Any of a DVD’s titles may be encrypted with CSS. Either the DVD player or the DVD drive must have a licensed CSS decryption key in order to read the encrypted data. Fortunately, CSS is somewhat weak, and most Linux programs for accessing DVDs use libdvdcss to bypass the encryption.
§ Ripping the DVD
Ripping the DVD isn’t strictly necessary, but it helps to have all of the data on your hard drive for processing. Even if you don’t copy the videos to your hard drive, you’ll have to mount the DVD and use its IFO files; I’ll get to that later.
The easiest way to rip the DVD is with dvdbackup. It creates a directory for the DVD and then puts a VIDEO_TS subdirectory in the DVD directory. The VIDEO_TS directory contains all of the files in the DVD’s VIDEO_TS directory. (Or, at least, it will if you use the -M option; other options give more restricted results.)
dest_dir=<destination directory>
dvd_name=<DVD name>
dvd_device=<DVD device, e.g. /dev/dvd>
dvdbackup -M -i $dvd_device -o $dest_dir -n $dvd_name
In theory, you could also mount the DVD and just copy all of the files over, but that has not worked well for me in the past, partly because of CSS problems, but also partly because my drive is a little wonky.
You can also just take an image of the DVD with dd. You’ll need to disable the CSS beforehand. I’ve found that just running xine on the DVD is sufficient.
dest_dir=<destination directory>
dvd_name=<DVD name>
dvd_device=<DVD device, e.g. /dev/dvd>
xine dvd://
dd if=${dvd_device} bs=2048 conv=sync,noerror of=${dest_dir}/${dvd_name}.iso
If you have pv installed, you can get a fancy progress bar.
dest_dir=<destination directory>
dvd_name=<DVD name>
dvd_device=<DVD device, e.g. /dev/dvd>
xine dvd://
dd if=${dvd_device} bs=2048 conv=sync,noerror |
pv -s $(blockdev --getsize64 ${dvd_device}) \
>${dest_dir}/${dvd_name}.iso
§ Get Disc Info
Whether you’ve ripped the DVD to disk or not, you need to see what’s on it. Change into your working directory and run lsdvd. (NB: From here on out, unless otherwise noted, all commands that reference a DVD will work equally well with a device (e.g. /dev/dvd), a disc image (like the one created with dd), or a directory containing a VIDEO_TS directory structure.)
dvd=<DVD device, image, or directory>
lsdvd -a -n -c -s -v $dvd > contents
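Later steps need to know which titleset each title lives in. Here is a minimal sketch for pulling that out of the contents file. It assumes that lsdvd's verbose output starts each title with a line beginning "Title: NN" and follows it with a line containing "VTS: NN"; check your own contents file first, since the exact layout may differ between lsdvd versions. The function name title_titlesets is made up for this example.

```shell
# Print "title titleset" pairs from lsdvd -v output read on stdin.
# ASSUMPTION: each title starts with a line beginning "Title: NN" and is
# followed by a line containing "VTS: NN". Verify against your own output.
title_titlesets() {
    awk '
        /^Title: / {
            t = $0
            sub(/^Title: */, "", t)    # drop the "Title: " label
            sub(/[^0-9].*/, "", t)     # keep only the leading digits
        }
        /VTS: / {
            v = $0
            sub(/.*VTS: */, "", v)     # drop everything up to the VTS number
            sub(/[^0-9].*/, "", v)     # keep only the leading digits
            if (t != "") print t, v
        }
    '
}

# Usage (with the file created above):
#   title_titlesets < contents
```

Each output line is a title number followed by its titleset number.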
§ Rip Each Title
The first order of business is to get the title data off of the DVD. tccat will pull just the given title’s stream out of the DVD. (Note that the resulting file has the possibility of exceeding 7GB in size; make sure your filesystem can handle files that large.)
title=<title number, e.g. 01>
dvd=<DVD device, image, or directory>
tccat -i $dvd -t dvd -T ${title},-1 >${title}.vob
The information about the title’s chapters isn’t in the VOB, so you’ll have to extract that separately with dvdxchap. In my experience, dvdxchap never gets useful information for the chapter names (perhaps the DVD only contains the timepoints with no names associated), so you may want to edit the resulting file to put in more meaningful names. (Note that mplayer will output chapter information if you use its -identify option, but dvdxchap is more precise in its timing and also generates the data in the format that mkvmerge wants.)
dvdxchap -t $title $dvd > ${title}.chapters
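If you'd rather not edit the chapter names by hand, something like the following sketch can fill them in from a plain list. It assumes the chapters file uses the simple OGM-style layout of CHAPTERnn=timestamp lines paired with CHAPTERnnNAME=name lines, and that you've written one name per line, in chapter order, to a separate file. The function name name_chapters and the file names in the usage line are invented for this example.

```shell
# Replace chapter names in an OGM-style chapter file read on stdin.
# $1: file with one chapter name per line, in chapter order.
# ASSUMPTION: names appear on lines of the form CHAPTERnnNAME=...
# Chapters beyond the end of the names file keep their original names.
name_chapters() {
    awk -v names="$1" '
        /^CHAPTER[0-9]+NAME=/ {
            if ((getline name < names) > 0) {
                # Keep everything up to and including "=", then the new name.
                print substr($0, 1, index($0, "=")) name
                next
            }
        }
        { print }
    '
}

# Usage (names.txt is a hypothetical file you create yourself):
#   name_chapters names.txt < ${title}.chapters > ${title}.chapters.named
```

Timestamp lines pass through untouched, so the timing that dvdxchap extracted is preserved.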
I’ve seen DVDs where the TOC info as reported by lsdvd doesn’t match the actual streams in the titles, so it’s good to check the track directly. Ideally, tcprobe would give all the information about the streams, but while it gives good information about audio and video streams, it doesn’t give all the details we’ll need about subtitle streams. Thus, we need to use mplayer. mplayer gives audio stream ids in decimal, not hex, so the first audio stream will show as 128, not 0x80. It numbers the subtitle streams from zero, though, so you have to add 0x20 to the numbers it gives to get the actual subtitle stream ids.
mplayer -dvd-device $dvd -vo null -ao null -frames 0 -v dvd://${title} 2>&1 | grep -E '[as]id' > ${title}.streams
In an ideal world, mkvmerge would be able to operate directly on the VOB, but when I tried that, it had problems demuxing the data and it died halfway through. So I’ll use tcextract to pull out the individual components. Video first.
tcextract -i ${title}.vob -t vob -x mpeg2 >${title}.video.m2v
Next up are the audio tracks. The VOB may contain more than one audio track. They should be labeled as to their language, but check mplayer’s info, not lsdvd’s. mplayer’s info will also tell what format the audio is in. tcextract wants the audio tracks numbered from zero, but mplayer reports their actual track ids, which usually start at 128 and go up from there. The lowest-numbered track is track 0 to tcextract, and so on.
lang=<language code>
track=<source audio track: 0, 1, 2, etc.>
format=<extension for audio format; e.g. ac3, mp2>
tcextract -i ${title}.vob -t vob -x $format -a $track >${title}.audio-${lang}.${format}
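The id-to-track mapping described above can be sketched as a tiny helper. It applies the rule literally: sort the ids that mplayer reported and count up from zero. That is an assumption about how tcextract behaves when the ids have gaps, which this sketch doesn't verify. The function name audio_track_indexes is made up for this example.

```shell
# Print "mplayer_aid tcextract_track" for each audio id given as an argument.
# ASSUMPTION: tcextract numbers tracks by rank, lowest id first, as described
# in the text. Behaviour with gaps in the ids is not verified here.
audio_track_indexes() {
    printf '%s\n' "$@" | sort -n | awk '{ print $1, NR - 1 }'
}

# Usage: pass the aid values from ${title}.streams in any order, e.g.
#   audio_track_indexes 129 128
```

The second column is the value to pass to tcextract's -a option for that stream.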
The VOB also contains subtitles, although most programs that query it won’t see them. Unlike when extracting audio, tcextract requires that you use the absolute track number, but mplayer reports a relative number. You will need to add 0x20, or 32, to the value that mplayer reports for the subtitle tracks. Some of the information for subtitles is stored in .IFO files on the DVD. Each titleset has its own .IFO file; check the contents file to see what titleset contains the track and use that titleset’s .IFO file. It will be in the VIDEO_TS directory, named VTS_<titleset number>_0.IFO.
Matroska supports several subtitle formats, but VobSub is probably the easiest to use, because it’s a series of bitmaps, just like the DVD subtitles. If you’re not happy with VobSub, you’ll need to OCR each image to get its text; there are instructions for doing so elsewhere on the Internet.
lang=<language code>
stream_id=<id of the subtitle stream: 0x20, 0x21, 32, 33, etc.>
ifo=<IFO file; e.g. /path/to/VIDEO_TS/VTS_nn_0.IFO>
tcextract -i ${title}.vob -t vob -x ps1 -a $stream_id >${title}.subs-${lang}.raw
subtitle2vobsub -p ${title}.subs-${lang}.raw -i $ifo -o ${title}.subs-${lang}
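The two bits of arithmetic behind stream_id and ifo can be sketched as small helpers: one adds 0x20 to the subtitle number mplayer reports, the other builds the IFO path from a titleset number. Both function names are invented for this example, and the second assumes the IFO files sit in a VIDEO_TS directory directly under the path you give it.

```shell
# Convert mplayer's zero-based subtitle number to the absolute stream id.
subtitle_stream_id() {
    printf '0x%x\n' $(( $1 + 0x20 ))
}

# Build the path to a titleset's IFO file.
# $1: directory that contains VIDEO_TS; $2: titleset number (e.g. 1 or 01).
# ASSUMPTION: titleset numbers have at most two digits. One leading zero is
# stripped so that values like "08" are not misread as octal by printf.
ifo_path() {
    printf '%s/VIDEO_TS/VTS_%02d_0.IFO\n' "$1" "${2#0}"
}

# Usage (the mount point is only an example):
#   stream_id=$(subtitle_stream_id 1)
#   ifo=$(ifo_path /mnt/dvd 02)
```

tcextract accepts the id in either hex or decimal, so printing hex here is only a matter of matching the numbering used in the text.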
Finally, it’s time to bring everything together with mkvmerge. When I use <title>, I mean the actual textual title for the video, like “Bob’s House of Horror 2” or whatever. ${title} still refers to the title number on the DVD.
mkvmerge -o <final filename> \
--title <title> \
--chapters ${title}.chapters \
${title}.video.m2v \
<audio clauses> \
<subtitle clauses>
For each audio file, you’ll need a clause giving the file and its language. The first file you list on the command line will be the default audio, unless you use mkvmerge’s --default-track option to change it.
--language 0:${lang} ${title}.audio-${lang}.${format}
Likewise, you’ll need a clause for each subtitle file. Since I generally don’t want any subtitles displayed by default, I set things so that there isn’t a default subtitle track.
--language 0:${lang} --default-track 0:0 ${title}.subs-${lang}.idx
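With several audio and subtitle tracks, typing out all of the clauses gets tedious. Here is a rough sketch that assembles the whole command from lists of language codes and prints it one argument per line so you can check it before running anything. It assumes every audio track is AC3 and that the intermediate files follow the naming used in the commands above. The function name mkvmerge_command is made up for this example.

```shell
# Print (not run) the mkvmerge arguments for one title, one per line.
# Usage: mkvmerge_command OUTPUT NAME TITLE "AUDIO LANGS" "SUBTITLE LANGS"
# ASSUMPTIONS: all audio is AC3; files are named as in the steps above;
# language codes contain no spaces.
mkvmerge_command() {
    mc_out=$1 mc_name=$2 mc_title=$3 mc_audio=$4 mc_subs=$5
    set -- mkvmerge -o "$mc_out" --title "$mc_name" \
        --chapters "${mc_title}.chapters" "${mc_title}.video.m2v"
    # One clause per audio track; the first one listed becomes the default.
    for mc_lang in $mc_audio; do
        set -- "$@" --language "0:${mc_lang}" \
            "${mc_title}.audio-${mc_lang}.ac3"
    done
    # One clause per subtitle track, none of them marked as default.
    for mc_lang in $mc_subs; do
        set -- "$@" --language "0:${mc_lang}" --default-track 0:0 \
            "${mc_title}.subs-${mc_lang}.idx"
    done
    printf '%s\n' "$@"
}

# Usage:
#   mkvmerge_command movie.mkv "Bob's House of Horror 2" 01 "eng fre" "eng"
```

Once the printed arguments look right, replacing the final printf line with a plain "$@" makes the function run the command instead of printing it.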
And that should do it. After a fair bit of disk-churning, you should have a Matroska file containing all of the elements from the original DVD title. You can now delete all of your intermediate files and just keep the MKV on your computer and the DVD in its box.