Fri, 02 Jan 2009
I previously considered different backup schemes. Writing that entry crystallized my basic ideas about what I was going to do about the backups. I then proceeded to implement them. This entry is a detailed description of what I did.
Backup Overview
I ended up buying two 80 GB hard drives and a ThermalTake BlacX external enclosure. The overall plan is to do backups to one of the drives on a daily, automated basis, and the other on a periodic, maybe monthly basis. Most of the time, the periodic backup drive will live at my office and will serve as my offsite backup.
I want to have a backup history in the same way that a tape rotation
scheme would give me. That way, if I don't notice there's something wrong
with a file for a time, there's still a good chance I can retrieve it. I
also want things stored space-efficiently, so backing up unchanged files
doesn't take up additional space. This is accomplished pretty easily with
rsync; I do one full backup, and then subsequent backups use rsync's
--link-dest option pointing to the most recent complete backup; any
files that haven't changed are simply hardlinked together, so the two
directory entries point to the same physical location on the disk.
For the backup history, I decided to go with a variant of the Towers of Hanoi backup rotation. Instead of using a preset number of pegs, as I would have to do if I were using physical tapes, I can just calculate the numbers on the fly, effectively behaving as if I had an infinite number of tapes. This rotation gives me roughly exponential backoff for my history; I can look at backups from yesterday, two days ago, four days ago, eight days ago, and so on.
Finally, I decided to encrypt the drives. This lets me cart them around with confidence that if they get lost or stolen, anyone who ends up with them won't have my personal files. I used LUKS to encrypt the disks, and used both a file on my computer and a password as keys. The file makes it easier to mount the drives without manual intervention (important if my desktop reboots), while the password lets me get at the data if the key file isn't available (important if the main disk dies and I need my backups).
Backup Procedure
Set up fstab.
The first thing I did was to set up /etc/fstab for the disks. Since I only have one enclosure, I have to swap disks, so only one will ever be connected at the same time. Thus, I mount whichever's in at the moment on /backups. Likewise, I will associate each one with the dm-crypt name of "backups", so the device will be /dev/mapper/backups. Thus, I added the following line to /etc/fstab:
/dev/mapper/backups /backups auto defaults 0 0
Set up rsync filters.
I'm using rsync to do the backups, but in some cases I don't want
everything in the source directories to be backed up. Thus, I tell
rsync to look at a filter file for each directory so it knows what to
skip and what to keep. rsync will go through the filters for each file
or directory it considers, and will take the first action that matches.
If nothing matched, the file is copied. If a directory is ignored, none
of the files in that directory are considered at all, so I had to include
a few explicit directory chains.
In /var, I only want to back up a few things, so the final pattern ignores anything that isn't explicitly included.
+ /backups
+ /backups/**
+ /lib
+ /lib/bzr
+ /lib/bzr/**
+ /lib/svn
+ /lib/svn/**
- *
For my home directory, I include everything, with a few exceptions. For instance, most of my music directory can be reripped from CD if needed, so I don't need to take up space backing up those files. On the other hand, I have some files that I either purchased online or downloaded and wouldn't be able to easily replace if they were lost, so I do back them up. Here's an excerpt from my home filter file:
+ /movies/Star_Wars_Holiday_Special
+ /movies/Star_Wars_Holiday_Special/**
- /movies
+ /music
+ /music/Bonerama
+ /music/Bonerama/Bringing_It_Home
+ /music/Bonerama/Bringing_It_Home/**
+ /music/Jonathan_Coulton
+ /music/Jonathan_Coulton/Thing_a_Week_Three
+ /music/Jonathan_Coulton/Thing_a_Week_Three/03-Code_Monkey*
+ /music/Nine_Inch_Nails
+ /music/Nine_Inch_Nails/Ghosts_I-IV
+ /music/Nine_Inch_Nails/Ghosts_I-IV/**
+ /music/Nine_Inch_Nails/The_Slip
+ /music/Nine_Inch_Nails/The_Slip/**
+ /music/Obelix
+ /music/Obelix/**
+ /music/Solo_String_Project
+ /music/Solo_String_Project/**
- /music/**
- /tmp
Initialize disks.
I wrote a script to initialize the disks for me: init-backup-disk. It takes two parameters: the name of the device for the backup disk, and the file to use as a key for the partition. If the key file doesn't exist, it will be created.
After a few sanity checks, the script starts doing things. It starts by
checking that the disk is good with badblocks. If it encounters any
errors, it stops there and the drive needs to be sent in for warranty
replacement. Following that, it goes into the most time-consuming part of
the initialization: writing random data to the disk. (badblocks already
wrote random data, but its PRNG is somewhat simplistic; /dev/urandom is a
much better source of pseudo-random numbers.) Without this step, it would
be obvious which bits of the disk had encrypted data on them. I use
pv to give a progress meter and time estimate. On my computer,
badblocks took a little over two hours and /dev/urandom took about eight
hours for each 80GB disk.
# Check and randomize disk.
badblocks -b 512 -s -w -t random -v $disk || exit 2
</dev/urandom pv -s $(fdisk -l $disk |
perl -nle 'm{^Disk '${disk}': [0-9.]+ [KMGT]B, (\d+) bytes$} and print $1') |
dd bs=512 conv=sync,noerror of=$disk
The next step is to format the encrypted partition. I use sfdisk to
create a single partition that spans the entire drive, followed by
cryptsetup to do the format. I explicitly specify the cipher in order
to use ESSIV, which makes certain attacks more difficult. The
--batch-mode option keeps it from asking for confirmation before
writing. The second call to sfdisk just tells the kernel to reread the
disk's partitions so it will pick up the UUID that cryptsetup created.
# Add and format the LUKS partition.
echo , | sfdisk --Linux $disk
cryptsetup luksFormat --cipher aes-cbc-essiv:sha256 --batch-mode ${disk}1 $keyfile
sfdisk -R $disk; sleep 5
Next, I open, format, and mount the partition. JFS is the filesystem that's been nicest to me, of all the journaling filesystems I've tried. (In the future, it might be nice to use ZFS for the backup disks--I'd get better detection of disk errors, at least--but I don't think it would play entirely well with running over an encrypted volume, and they haven't integrated encryption into ZFS yet, as far as I can tell.)
# Open LUKS partition, format and mount the encrypted volume.
cryptsetup --key-file $keyfile luksOpen ${disk}1 backups
mkfs -t jfs -q /dev/mapper/backups
mount /backups
Now I run the initial backup. Each backup version is just a separate numbered directory in the partition, so the first one is '1'. I'm backing up /etc, some of /var, my and my wife's home directories, and any databases I have. My wife's stuff isn't directly backed up here because she's on a different computer; I have to initiate her backup from there. The script, in its first requirement for user interaction, will wait until I tell it that's done.
# Do the initial backup.
mkdir -vp /backups/1/{etc,var,phil,postgresql,mysql,rivana}
chmod a-r /backups/1
chown rivana /backups/1/rivana
chown postgres /backups/1/postgresql
rsync -avP --filter 'merge /etc/backups/etc-filter' /etc/ /backups/1/etc/
rsync -avP --filter 'merge /etc/backups/var-filter' /var/ /backups/1/var/
rsync -avP --filter 'merge /etc/backups/phil-filter' /home/phil/ /backups/1/phil/
su -c 'pg_dumpall -v >/backups/1/postgresql/dump' postgres
mysqldump -v --all-databases >/backups/1/mysql/dump
echo -n "Hit enter when rivana is backed up. "
read foo
Now that the backup is done, the script unmounts and deactivates the partition.
# Deactivate the encrypted volume.
umount /backups
cryptsetup luksClose backups
And I get prompted for the password that will unlock the partition if the key file isn't available.
# Add user password.
cryptsetup --key-file $keyfile --verify-passphrase luksAddKey ${disk}1
Finally, the script displays the UUID for the partition, which is needed for later use.
# Display the partition's UUID.
echo -n 'UUID: '
cryptsetup luksUUID ${disk}1
Set up crypttab.
Debian, at least, has an /etc/crypttab file that lists encrypted partitions to be enabled at boot time. I put the onsite backup disk in there so it'll be automatically mounted if the computer reboots. This plus a backup cronjob make the backup process completely automated.
backups /dev/disk/by-uuid/<onsite UUID> <key file> luks
Do local backups.
I have a simple script to do the daily backups: perform-backup. It's
basically the same as the initial backup, but with the --link-dest
option as I mentioned previously.
last_num=$(ls -t /backups | head -1)
((num=$last_num+1))
mkdir -p /backups/$num/{etc,var,phil,postgresql,mysql,rivana}
chown rivana /backups/$num/rivana
chmod a-r /backups/$num
rsync -a --filter 'merge /etc/backups/etc-filter' --link-dest=/backups/$last_num/etc /etc/ /backups/$num/etc/
rsync -a --filter 'merge /etc/backups/var-filter' --link-dest=/backups/$last_num/var /var/ /backups/$num/var/
rsync -a --filter 'merge /etc/backups/phil-filter' --link-dest=/backups/$last_num/phil /home/phil/ /backups/$num/phil/
chown postgres /backups/$num/postgresql
su -c "pg_dumpall >/backups/$num/postgresql/dump" postgres
mysqldump --all-databases >/backups/$num/mysql/dump
Do Becca's backup
My wife has her own computer but, fortunately, simpler backup requirements. I have ssh public key authentication set up so she can ssh to my computer without a password, which makes the backups work properly in an automated fashion.
The initial backup is a simple rsync one-liner.
rsync -avP ~/ mithrandir:/backups/1/rivana/
Subsequent backups are a simple script:
#!/bin/sh
num=$(ssh mithrandir ls -t /backups | head -1)
last_num=$(ssh mithrandir ls -t /backups | head -2 | tail -1)
rsync -a --link-dest=/backups/${last_num}/rivana ${HOME}/ mithrandir:/backups/${num}/rivana/
Backup rotation.
The Towers of Hanoi rotation is effected by a script that goes through and deletes any directories that don't need to be there: purge-backups. I won't quote it here because I don't think it's all that interesting. It just finds the largest power of two less than or equal to the current number and then works its way down from that to enumerate all of the directories, deleting everything else.
Offsite backups.
Every so often I'll bring the offsite disk home and back up to it. The script for that is insert-offsite-backup. It unmounts the onsite disk, waits for me to insert the offsite disk, runs a backup, unmounts the offsite disk, waits for me to reinsert the onsite disk, then remounts that disk. It needs to be told what my offsite UUID is, but it picks up all the other settings from /etc/crypttab.
The backup takes about half an hour, so I have ample time to manually run the backup script on Becca's computer.
Mon, 22 Dec 2008
I had a dream last night that the apartment beneath ours caught on fire, we had to rush out of the building, and my computer and all of its data was destroyed.
I've been pondering a formal backup system for a while now. (My current system involves making sure important files are in a version control system and exist on at least my laptop and desktop. This is pretty ad-hoc, inconsistently updated, and not entirely comprehensive.) I'm taking my dream as impetus to actually set something up. This post is to help me organize my thoughts and see if anyone has any comments or suggestions.
My Requirements
I want to have a full rotating versioned backup system, where I have complete daily backups for a recent time span (say a week or so) and more sporadic backups back to as much as a year in the past. Ideally, the backups should be stored in a space-efficient manner so unchanged files don't take up more space than a single copy would require. The backups should have off-site redundancy. They should be relatively easy to use; they should be fully automated on a day-to-day basis, with notification when things go wrong. Ease of setup would be nice but not necessary.
My Data
I currently have about 720 GB of data in my home directory, plus a few hundred MB elsewhere on the computer that I'd want to back up. I also have about 11GB in a bzr repository, but all of that should remain duplicated in my home directory. Most of the data in my home directory is in media files that I can either replace (rerip CDs, etc.) or live without; only 25 GB of it is stuff that I must back up. (A further 130 GB is stuff that would be nice to back up, but I can just burn it to DVD and consider those my backups; the data is essentially static.)
JWZ Backups
The easiest approach is the JWZ backup solution. For all of my data, that would be two 1 TB external hard drives, for about $220. If I restrict myself to the "must backup" data, I could make do with two 60 GB external hard drives for about $80. In either case, I'd keep one drive at the office and swap them periodically.
The advantage of this approach is that I control everything. I can put encrypted volumes on the drives, so if they get lost or stolen, my data isn't usable to other people. I can use rsync with hardlinks between datestamped directories to get versioned backups with efficient disk usage. The drawbacks are a modest initial monetary outlay and the need to coordinate shuttling drives back and forth.
Amazon S3
Another approach is to use Amazon S3 to store my data. It's offsite by definition (and stored among multiple data centers; if I write data to it, I can reasonably trust that I'll get that data back). It's not too expensive: at $0.17/GB-month, my minimal backup will cost about $3.85/month. Throw in transfer costs and churn, and I doubt I'd exceed $6/month. (The initial upload would be $2.56. A full restore would cost me $4.36.) With S3, I would only back up the minimal data; the 130 GB of optional backups would cost an additional $20/month, which would exceed the cost of the full do-it-myself hard drive backups in one year.
The complication to S3 is that it's just a web-based data storage service; you need additional software to make a reasonable backup solution.
Jungle Disk
From everything I've read, Jungle Disk is currently the best software for storing filesystem data on S3. It runs on Windows, Mac OSX, and Linux, and exports your S3 buckets as a WebDAV disk, which you can then mount and treat like an ordinary (unlimited capacity) disk drive. All data is encrypted before it's sent to S3.
I like this approach. Since it looks like a disk, I can use the same rsync setup I would with my own disks, and since the data is encrypted, I don't need to worry too much about it being transported over the Internet and stored on someone else's servers. The main drawback is that it's proprietary software. In addition to my principled preference of open source software to proprietary, there's also the issue that, especially because the data's encrypted, this software would be my only access to my backups. If something went wrong and I couldn't get support from the company (e.g. they went out of business), I'd be out of luck.
The software costs $20. Assuming $5/month on S3, it would take one year for this approach to cost more than the minimal get-my-own-disks approach.
Other S3 software
I haven't seen anything else that will let me back up to S3 and keep versioned backups in a space-efficient manner. Most of the S3 backup software I've seen doesn't do versions, and the few that do don't appear to do it space-efficiently. As always, I have the option of writing my own, but that would take a fair amount of time and effort, and I'd be likely to give up partway through, continuing to leave myself without good backups.
Conclusion
Barring any better suggestions from others, I'm leaning towards the two smallish hard drives. They'd pay for themselves after a year of use, and I get complete control of my data (for better or worse). I like the idea of using S3, but it's more expensive in the long run, and I'm not completely happy with any of the software I've found to use with it.
Wed, 19 Nov 2008
I have simple needs. I have a base class with some generic behavior and subclasses with specific information for that generic behavior. More concretely, the subclasses need to provide the generic behavior with an ordered list of things that designate key fields on database tables. The best representation of those "things" in Delphi seems to be members of an enumeration:
type
TKeyField = (kfFoo, kfBar, kfBaz, kfQuux);
Since I need the list of fields to be ordered, I need them in an array:
type
TKeyFieldArray = array of TKeyField;
The declaration of the base class is pretty simple:
type
TBaseClass = class
protected
function GetKeyFieldList : TKeyFieldArray; virtual; abstract;
public
procedure DoSomethingWithKeyFields;
end;
As is the declaration of the subclass:
type
TSubClass = class(TBaseClass)
protected
function GetKeyFieldList : TKeyFieldArray; override;
end;
So where's the problem? Where's the hate? The hate is in the implementation. If Delphi had array literals, this would be easy. Something like:
function TSubClass.GetKeyFieldList : TKeyFieldArray;
begin
Result := [kfBar, kfFoo, kfQuux];
end;
But it doesn't. It has some special magic for array literals if they're the parameter to a function, but not anywhere else. It does, however, have a syntax for array constants. Perhaps this will work:
function TSubClass.GetKeyFieldList : TKeyFieldArray;
const
keyFieldList : TKeyFieldArray = (kfBar, kfFoo, kfQuux);
begin
Result := keyFieldList;
end;
But no. That TKeyFieldArray is a dynamic array; Delphi doesn't
allocate any space for it, so it can't be a constant value. You have to
tell Delphi how big each constant array is, even though you're already
telling it how many elements are in the array. So perhaps this is the
solution:
function TSubClass.GetKeyFieldList : TKeyFieldArray;
const
keyFieldList : array[0..2] of TKeyField = (kfBar, kfFoo, kfQuux);
begin
Result := keyFieldList;
end;
But no. Because of Delphi's approach to static typing, those are actually different types, and are therefore not assignment-compatible. (See previous hates on this subject.) No, here is the code that Delphi makes me type for what should be a one-line function implementation:
function TSubClass.GetKeyFieldList : TKeyFieldArray;
begin
SetLength(Result, 3);
Result[0] := kfBar;
Result[1] := kfFoo;
Result[2] := kfQuux;
end;
And just earlier this morning I was pleased because I read that Delphi
2007 (to which I'll soon be upgrading from Delphi 5) has for...in loops,
so I can finally have foreach. (Can't get the generics and anonymous
functions in Delphi 2009, because we need .NET and that's not yet
available for Delphi 2009.) Oh, Delphi. The one hand giveth, and the
entire rest of the stupid, anemic, pox-ridden language taketh away.
Fri, 11 Jul 2008
I recently had the desire to rip some DVDs so I could watch them on my computer without swapping discs. I figured I could just pull everything from the DVD into Matroska files, since Matroska supports everything that DVDs do. When I went looking on the Internet, I found few resources for moving from DVD to MKV, and everything that did talk about it actually reencoded the DVD video to get it into its final destination. Since Matroska can contain all of the codecs native to DVDs, I wanted to transfer everything losslessly. This is how I did it. (Note that I'm using the Linux command line; I prefer Linux to Windows, and the command line to X.)
The programs I used are as follows:
- dvdbackup (optional)
- xine (optional; only for cracking CSS)
- lsdvd
- transcode, mostly
tcextract - subtitleripper, for
subtitle2vobsub - ogmtools, for
dvdxchap - mkvtoolnix, for
mkvmerge - mplayer, for most of the stream dumping
Some Background
I don't know all the details of how data is stored on DVDs, but here's a rough overview. The video on DVDs is encoded in either MPEG-2 or MPEG-1 with a variable bit rate. The audio can be in raw PCM, DTS, MP2, or AC3. Most DVDs use AC3. Not all DVD players support DTS. Subtitles are stored as bitmaps with associated timecodes governing when to show them on screen.
In a DVD, the basic unit of video is a title. Each title consists of one or more video streams, zero or more audio streams, zero or more subtitle streams, and a list of timepoints to mark chapter boundaries. The titles on a DVD are grouped together into titlesets. All of the data for each titleset is concatenated together into VOB format and then split into 1GB chunks. The net result is that there's no one-to-one correspondence between files on the DVD and individual titles.
If a title has more than one video stream, one will be the primary stream while the others represent alternate angles. Few DVDs have multiple angles, so I'm not sure how the data for those works; all of the DVDs I've seen just have a single video stream for each title.
Also note that not all of the titles on a DVD are the feature content. Practically everything displayed by a DVD, including the splash image, copyright warning, ads and trailers, and even the menus is a DVD title.
Any of a DVD's titles may be encrypted with CSS. Either the DVD player or the DVD drive must have a licensed CSS decryption key in order to read the encrypted data. Fortunately, CSS is somewhat weak, and most Linux programs for accessing DVDs use libdvdcss to bypass the encryption.
Ripping the DVD
Ripping the DVD isn't strictly necessary, but it helps to have all of the data on your hard drive for processing. Even if you don't copy the videos to your hard drive, you'll have to mount the DVD and use its IFO files; I'll get to that later.
The easiest way to rip the DVD is with dvdbackup. It creates a
directory for the DVD and then puts a VIDEO_TS subdirectory in the DVD
directory. The VIDEO_TS directory contains all of the files in the DVD's
VIDEO_TS directory. (Or, at least, it will if you use the -M option;
other options give more restricted results.)
dest_dir=<destination directory>
dvd_name=<DVD name>
dvd_device=<DVD device, e.g. /dev/dvd>
dvdbackup -M -i $dvd_device -o $dest_dir -n $dvd_name
In theory, you could also mount the DVD and just copy all of the files over, but that has not worked well for me in the past, partly because of CSS problems, but also partly because my drive is a little wonky.
You can also just take an image of the DVD with dd. You'll need to
disable the CSS beforehand. I've found that just running xine on the
DVD is sufficient.
dest_dir=<destination directory>
dvd_name=<DVD name>
dvd_device=<DVD device, e.g. /dev/dvd>
xine dvd://
dd if=${dvd_device} bs=2048 conv=sync,noerror of=${dest_dir}/${dvd_name}.iso
If you have pv installed, you can get a fancy progress bar.
dest_dir=<destination directory>
dvd_name=<DVD name>
dvd_device=<DVD device, e.g. /dev/dvd>
xine dvd://
dd if=${dvd_device} bs=2048 conv=sync,noerror |
pv -s $(fdisk -l $dvd_device |
perl -nle 'm{^Disk '${dvd_device}': \d+ MB, (\d+) bytes$} and print $1') \
>${dest_dir}/${dvd_name}.iso
Get Disc Info
Whether you've ripped the DVD to disk or not, you need to see what's on
it. Change into your working directory and run lsdvd. (NB: From here
on out, unless otherwise noted, all commands that reference a DVD will
work equally well with a device (e.g. /dev/dvd), a disc image (like the
one created with dd), or a directory containing a VIDEO_TS directory
structure.)
dvd=<DVD device, image, or directory>
lsdvd -a -n -c -s -v $dvd > contents
Rip Each Title
The first order of business is to get the title data off of the DVD.
tccat will pull just the given title's stream out of the DVD. (Note
that the resulting file has the possibility of exceeding 7GB in size; make
sure your filesystem can handle files that large.)
title=<title number, e.g. 01>
dvd=<DVD device, image, or directory>
tccat -i $dvd -t dvd -T ${title},-1 >${title}.vob
The information about the title's chapters isn't in the VOB, so you'll
have to extract that separately with dvdxchap. In my experience,
dvdxchap never gets useful information for the chapter names (perhaps
the DVD only contains the timepoints with no names associated), so you may
want to edit the resulting file to put in more meaningful names. (Note
that mplayer will output chapter information if you use its -identify
option, but dvdxchap is more precise in its timing and also generates
the data in the format that mkvmerge wants.)
dvdxchap -t $title $dvd > ${title}.chapters
I've seen DVDs where the TOC info as reported by lsdvd doesn't match the
actual streams in the titles, so it's good to check the track directly.
Ideally, tcprobe would give all the information about the streams, but
while it gives good information about audio and video streams, it doesn't
give all the details we'll need about subtitle streams. Thus, we need to
use mplayer. mplayer gives audio stream ids in decimal, not hex, so
the first audio stream will show as 128, not 0x80. It numbers the
subtitle streams from zero, though, so you have to add 0x20 to the numbers
it gives to get the actual subtitle stream ids.
mplayer -dvd-device $dvd -vo null -ao null -frames 0 -v dvd://${title} 2>&1 | egrep '[as]id' > ${title}.streams
In an ideal world, mkvmerge would be able to operate directly on the
VOB, but when I tried that, it had problems demuxing the data and it died
halfway through. So I'll use tcextract to pull out the individual
components. Video first.
tcextract -i ${title}.vob -t vob -x mpeg2 >${title}.video.m2v
Next up are the audio tracks. The VOB may contain more than one audio
track. They should be labeled as to to their language, but check
mplayer's info, not lsdvd's. mplayer's info will also tell what
format the audio is in. tcextract wants the audio tracks numbered from
zero, but mplayer reports their actual track ids, which usually start at
128 and go up from there. The lowest-numbered track is track 0 to
tcextract, and so on.
lang=<language code>
track=<source audio track: 0, 1, 2, etc.>
format=<extension for audio format; e.g. ac3, mp2>
tcextract -i ${title}.vob -t vob -x $format -a $track >${title}.audio-${lang}.${format}
The VOB also contains subtitles, although most programs that query it
won't see them. Unlike when extracting audio, tcextract requires that
you use the absolute track number, but mplayer reports a relative
number. You will need to add 0x20, or 32 to the value that mplayer
reports for the subtitle tracks. Some of the information for subtitles is
stored in .IFO files on the DVD. Each titleset has its own .IFO file;
check the contents file to see what titleset contains the track and use
that titleset's .IFO file. It will be in the VIDEO_TS directory, named
VTS_<titleset number>_0.IFO.
Matroska supports several subtitle formats, but VobSub is probably the easiest to use, because it's a series of bitmaps, just like the DVD subtitles. If you're not happy with VobSub, you'll need to OCR each image to get its text; there are instructions for doing so elsewhere on the Internet.
lang=<language code>
stream_id=<id of the subtitle stream: 0x20, 0x21, 32, 33, etc.>
ifo=<IFO file; e.g. /path/to/VIDEO_TS/VTS_nn_0.IFO>
tcextract -i ${title}.vob -t vob -x ps1 -a $stream_id >${title}.subs-${lang}.raw
subtitle2vobsub -p ${title}.subs-${lang}.raw -i $ifo -o ${title}.subs-${lang}
Finally, it's time to bring everything together with mkvmerge. When I
use <title>, I mean the actual textual title for the video, like "Bob's
House of Horror 2" or whatever. ${title} still refers to the title
number on the DVD.
mkvmerge -o <final filename> \
--title <title> \
--chapters ${title}.chapters \
${title}.video.m2v \
<audio clauses> \
<subtitle clauses>
For each audio file, you'll need a clause giving the file and its
language. The first file you list on the command line will be the default
audio, unless you use mkvmerge's --default-track option to change it.
--language 0:${lang} ${title}.audio-${lang}.ac3
Likewise, you'll need a clause for each subtitle file. Since I generally don't want any subtitles displayed by default, I set things so that there isn't a default subtitle track.
--language 0:${lang} --default-track 0:0 ${title}.subs-${lang}.idx
And that should do it. After a fair bit of disk-churning, you should have a Matroska file containing all of the elements from the original DVD title. You can now delete all of your intermediate files and just keep the MKV on your computer and the DVD in its box.
Thu, 29 May 2008
The other day, while I was wating for several GB to transfer over the network at work, I finally got around to setting something that's been dancing at the back of my mind for a while: computer-based proximity detection using Bluetooth.
I have a Treo 650. It has Bluetooth. I also have a USB Bluetooth Adapter. I originally planned to carry the bluetooth adapter around and hook it up to different computers whenever I wanted to talk to the Treo, but I've only been using it at work, so I've been leaving the adapter connected to my Linux computer at work. The thought occurred to me that I could use the Bluetooth adapter to see whether my phone was nearby and do things based on that information. At least to start, I decided to have the computer lock itself when I wasn't around.
I have the BlueZ Bluetooth stack installed. (On Debian, that's
the bluez-utils package.) They include a l2ping program, but that
establishes a full Bluetooth connection with the device, which makes my
Treo turn on the screen, play a little sound, and show a pop-up dialog.
That's a little intrusive for something that I want checked several times
a minute. Some people use hcitool rssi to find out the strength of the
phone's (or other device's) Bluetooth signal. That also requires a full
Bluetooth connection. I ended up using hcitool name, which returns the
name of the device if it's found and nothing if it's not. More
importantly, it doesn't cause the Treo to do anything but silently send
its response, and it works even if the Treo screen is off.
So I now have a stupid little shell script that looks like this:
#!/bin/sh
PHONE_ADDR=01:23:45:67:89:AB
PHONE_NAME="glamdring"
WAIT_TIME=15
while true; do
if [ "$(hcitool name $PHONE_ADDR)" \!= "$PHONE_NAME" ]; then
xscreensaver-command -lock
fi
sleep $WAIT_TIME
done
There are programs for Windows that do similar things. Possibly one of the simplest is Blue Lock, which is also open-source (and written in Delphi). I'm probably just going to write a simple Windows program to listen on the network for a message from my Linux computer to tell it to lock the screen.
Tue, 19 Feb 2008
I'm only a few weeks into my Java class and I'm already annoyed at the language. I'm completely willing to ascribe this to newbieness, where I'm just not working with what the language gives me, but the metaobject stuff in Java seems a bit painful.
I'm working on a project for the class where I have to accept input in several different units of heat (BTUs, calories, and joules) and output the measurement in joules. I've made an abstract base class for the various units and created concrete classes for each unit the program has to read. I'd like to just have a list of available classes and have my program enumerate them automatically (rather than hardcoding the behavior for each), but the way I would normally think about doing this is painful in Java.
In Delphi, I'd do something like this:
type
THeatUnits = class
public
constructor Create(Value : Real); virtual;
class function GetUnitsName : String; virtual; abstract;
function ConvertToJoules : Real; virtual; abstract;
end;
THeatUnitsClass = class of THeatUnits;
TBTUs = class(THeatUnits);
TCalories = class(THeatUnits);
TJoules = class(THeatUnits);
const
availableUnits : array [1..3] of THeatUnitsClass
= (TBTUs, TCalories, TJoules);
...
procedure DoStuff(Index : Integer; Value : Real);
var
Units : THeatUnits;
begin
Units := availableUnits[Index].Create(Value);
// Now do things polymorphically with Units.
end;
Delphi's class types often seem like a quick hack to me, but they beat
what Java does in the same situation. For one thing, there doesn't seem
to really be a class type in the same way that Delphi does it. There are
instead objects of type Class. As far as I can tell, the best way to get
one of those is to call Class.forName("ClassName"). But the painful
part is that there's no specialization of class types at compile time, so
the Java code equivalent to my Units := availableUnits[Index].Create(Value);
above would be something like this:
static final AVAILABLE_UNITS = new String[] ("BTUs", "Calories", "Joules");
public void doStuff(int index, double value) {
Class unitClass = Class.forName(AVAILABLE_UNITS[index]);
Constructor unitConstructor = unitClass.getConstructor(new Class[] (double.class));
HeatUnits units = (HeatUnits)unitConstructor.newInstance(new Object[] (new Double(value)));
// now do things polymorphically with units.
}
(Common Lisp, of course, would be more succinct than Delphi, because everything is first-class; I would probably do something like this:
(defclass heat-units () ())
(defgeneric get-units-name (unit-class))
(defgeneric convert-to-joules (unit-class))
(defclass btus (heat-units) ())
(defclass calories (heat-units) ())
(defclass joules (heat-units) ())
(defparameter +available-units+ #(btus calories joules))
(defun do-stuff (index value)
(let ((units (make-instance (svref +available-units+ index)
:value value)))
;; Now do things polymorphically with units.
))
)
As I said, though, I think this is just an artifact of not thinking in Java to the appropriate degree. Most of Java's reflection stuff seems set up to be useful at run-time while Delphi's run-time reflection is much uglier than Java's. And I think I'm going to approach my Java problem from a different direction, with an enum and a factory method. I was still struck by how annoyingly wordy (and not entirely typesafe) my first approach turned out to be in Java.
Fri, 31 Aug 2007
I came across an interesting programming puzzle today, and I'd like to share a couple of variants on it.
To start with, let's say we have an array A that contains some numbers. For simplicity, let's say that the array is 1-based. How can we efficiently find out if any of those numbers are duplicates?
The easiest approach is to make a hash table to keep track of the numbers we've seen so far. Something like this:
Given an array A of length n,
let h ← (new hash table)
for 1 <= i <= n:
if A[i] is present in h: return A[i]
set h[A[i]] ← True
return <no duplicates>
Now, suppose that the array is of length n and only contains positive integers less than n. We can be sure (by the pigeonhole principle) that there is at least one duplicate. The bounds on the values mean that we can be a little more efficient and store the already-seen numbers in an array rather than a hash table:
let s ← array [1..n]
initialize s to all zeroes
for 1 <= i <= n:
if s[A[i]] > 0: return A[i]
set s[A[i] ← 1
Now, suppose that we have to do this in constant space; no more creating hash tables and arrays on the fly. Is if still possible? If you add the constraint that there be exactly one duplicate value, this is easy.
Since there are n values from 1 to n-1 with one duplicate, we can use the fact that the sum of the integers from 1 to m is (m*(m+1))/2, take the sum from 1 to n-1, subtract that from the actual sum of numbers in the array, and the result will be the duplicated number:
let s ← 0
for 1 <= i <= n:
s ← s + A[i]
return s - n*(n-1)/2
But suppose there are no guarantees about the number of duplicates, and the only assurance you have is that the values are between 0 and n, exclusive. Is it still possible to find a duplicated value in O(n) time and O(1) space? Feel free to stop here and try to work things out yourself, if you want.
As you might have guessed from the length of this entry, there is an approach that will work. It's something of a radical departure from the previous approaches, and it relies rather heavily on the particular bounds we have on the values in the array.
First, there's the solution-by-cheating. If you're working in a language with bounded integers, you can exploit that knowledge to meet the letter of the law in the "constant space" requirement.
Let's say we're using a language and platform where integers are 32 bits, signed. The maximum value for an array element would therefore be 231-1, while the minimum value would still be 1. We only need one bit of information per value (seen/not seen), so, at 8 bits per byte, an array 223 bytes (8MB) long would be enough to track any set of integers the environment is capable of passing. This algorithm is therefore O(n) time and O(1) space on a system with 32-bit signed integers:
let s ← array [1..2^31-1] of bit
for 1 <= i <= n:
if s[A[i]] = 1: return A[i]
set s[A[i]] ← 1
Of course, it relies on implementation details that might change (a similar array on a 64-bit system would require 32 petabytes of memory) and it might be nice to have an algorithm that works regardless of any artifical bounds on your integers. Fortunately, there are a couple more approaches we can take.
Recall that the array is 1-based, so its elements are numbered from 1 to n. The values are positive integers less than n, so they can range anywhere from 1 to n-1. Because the values are a subset of the array bounds, we can treat those values as indices into the array itself.
If we're allowed to modify the array, we can just go through it reordering the elements, trying to put each value into the corresponding element (so the value 1 goes into the first element, 2 goes into the second element, and so on). When we find a collision, that's the duplicate.
for 1 <= i <= n:
while A[i] ≠ i:
if A[A[i]] = A[i]: return A[i]
swap(A[A[i]], A[i])
If the array is immutable, though, the solution is a little more involved. Given the view of each value as an index into the array, we can treat the array as a directed graph, where each element is a node and the value of that element is an edge pointing at the next node. Since every node points to another node, there is at least one cycle in the graph. We can also conclude that every node leads to a cycle, even if it is not directly part of one.
(Side note on terminology, for those not immediately familiar: A graph is simply a collection of nodes and edges. Each edge connects exactly two nodes. In a directed graph, each edge also has a direction; it goes from one node to another. If you start at a particular node, follow (one of) its edges to the next node, continue doing that for a certain amount of time, and eventually come back to the starting node, that's a cycle. If you start at a particular node, follow its edges out in the same way, and end up in a cycle that does not contain the starting node, all of the nodes and edges you crossed before getting to the cycle are called the tail. The first node that you reach that is in the cycle is defined to be the beginning of the cycle. The symbols λ (lambda) and μ (mu) represent the lengths of the tail and cycle, respectively. Technically, the directed graph we're dealing with here is also a sequence, since every node has at most one edge leading away from it (though it might have more than one leading into it).)
Since the values in the array are all less than n, no element in the array can point to the nth element. Therefore, the last element cannot be part of a cycle. Since every element must lead to a cycle, it must be part of a tail. If we follow the sequence from this point, we will eventually run into a cycle, and the last element in the tail will be a duplicate value.
So, how do we find the beginning of the cycle? The easiest approach is to use Floyd's cycle-finding algorithm. It works roughly like this: Start at the beginning of the sequence. Keep track of two values (call them ai and aj). At each step of the algorithm, move ai one step along the sequence, but move aj two steps. Stop when ai = aj.
At this point, the difference between i and j must be a multiple of the cycle length. If the algorithm took m steps to get to this point, then i = m and j = 2m, so m is a multiple of the cycle length. i has also traveled λ steps along the tail and m-λ steps along the cycle.
If we now start another value (call it ak) at the beginning of the sequence and move both it and ai simultaneously and at the same rate, after λ steps, ak will have traveled the entire tail and will have arrived at the beginning of the cycle. At the same time, ai will have moved a total of m steps along the cycle and, since m is a multiple of μ, will also have arrived at the beginning of the cycle. Thus, if you stop as soon as ai = ak, you will have found the beginning of the cycle and the end of the tail.
So. Applying all of that to our problem means that we start at the last element of the array, have two variables following the pointers in the array, one going twice as fast as the other, until they are equal to each other. At that point, start one of them back at the beginning and run them at the same speed until they again equal each other. Their value will then be the duplicate value.
let i ← n, j ← n
do: i ← A[i], j ← A[A[j]]; until i = j
set j ← n
do: i ← A[i], j ← A[j]; until i = j
return i
This algorithm also lends itself nicely to functional programming, since it doesn't actually require any side effects, so here's an implementation in Haskell:
import Data.Array
findDup a = findDup' n $ findCycle (a!n) (a!(a!n))
where n = snd $ bounds a
findCycle i j | i == j = i
| otherwise = findCycle (a!i) (a!(a!j))
findDup' i j | i == j = i
| otherwise = findDup' (a!i) (a!j)