Fri, 02 May 2008

New Site Hosting

In the interests of better site availability and less Comcast AUP-breaking, I've finally gotten around to outsourcing my website hosting. I'm currently at NearlyFreeSpeech.net, a webhost committed to the twin goals of free speech and affordable web hosting.

How free is their speech? Read their Abuse page:

"A NearlyFreeSpeech.NET member site is defaming me or otherwise injuring me civilly."

Please forward a copy of your legal finding from a court of competent jurisdiction to our contact address. If you have not yet obtained such a finding, a preliminary injunction or court order is also sufficient.

If you are not able to obtain the above, you will need to work directly with the site operator to resolve your differences. We will have to fall back on our members' contractual assertion that the content they upload is legitimate and therefore we will not be able to get involved

How affordable is their hosting? You pay only for the bandwidth and storage that you actually use: $1 per gigabyte of bandwidth and $0.01 per megabyte-month of storage. (Plus the bandwidth cost goes down the more you use.)

They support a variety of CGI scripting languages, including C, PHP, Perl, Python, and Ruby. Oh, but also Fortran, Tcl, Lisp, Scheme, OCaml, and Haskell.

We'll see how it goes, but I think I'll like it here.

Tue, 19 Feb 2008

Java Reflection

I'm only a few weeks into my Java class and I'm already annoyed at the language. I'm completely willing to ascribe this to newbieness, where I'm just not working with what the language gives me, but the metaobject stuff in Java seems a bit painful.

I'm working on a project for the class where I have to accept input in several different units of heat (BTUs, calories, and joules) and output the measurement in joules. I've made an abstract base class for the various units and created concrete classes for each unit the program has to read. I'd like to just have a list of available classes and have my program enumerate them automatically (rather than hardcoding the behavior for each), but the way I would normally think about doing this is painful in Java.

In Delphi, I'd do something like this:

type
  THeatUnits = class
   public
    constructor Create(Value : Real); virtual;
    class function GetUnitsName : String; virtual; abstract;
    function ConvertToJoules : Real; virtual; abstract;
  end;

  THeatUnitsClass = class of THeatUnits;

  TBTUs = class(THeatUnits);
  TCalories = class(THeatUnits);
  TJoules = class(THeatUnits);

const
  availableUnits : array [1..3] of THeatUnitsClass
                 = (TBTUs, TCalories, TJoules);

...

procedure DoStuff(Index : Integer; Value : Real);
  var
    Units : THeatUnits;
begin
  Units := availableUnits[Index].Create(Value);
  // Now do things polymorphically with Units.
end;

Delphi's class types often seem like a quick hack to me, but they beat what Java does in the same situation. For one thing, there doesn't seem to really be a class type in the same way that Delphi does it. There are instead objects of type Class. As far as I can tell, the best way to get one of those is to call Class.forName("ClassName"). But the painful part is that there's no specialization of class types at compile time, so the Java code equivalent to my Units := availableUnits[Index].Create(Value); above would be something like this:

static final AVAILABLE_UNITS = new String[] ("BTUs", "Calories", "Joules");

public void doStuff(int index, double value) {
  Class unitClass = Class.forName(AVAILABLE_UNITS[index]);
  Constructor unitConstructor = unitClass.getConstructor(new Class[] (double.class));
  HeatUnits units = (HeatUnits)unitConstructor.newInstance(new Object[] (new Double(value)));
  // now do things polymorphically with units.
}

(Common Lisp, of course, would be more succinct than Delphi, because everything is first-class; I would probably do something like this:

(defclass heat-units () ())

(defgeneric get-units-name (unit-class))
(defgeneric convert-to-joules (unit-class))

(defclass btus (heat-units) ())
(defclass calories (heat-units) ())
(defclass joules (heat-units) ())

(defparameter +available-units+ #(btus calories joules))

(defun do-stuff (index value)
  (let ((units (make-instance (svref +available-units+ index)
                              :value value)))
    ;; Now do things polymorphically with units.
    ))

)

As I said, though, I think this is just an artifact of not thinking in Java to the appropriate degree. Most of Java's reflection stuff seems set up to be useful at run-time while Delphi's run-time reflection is much uglier than Java's. And I think I'm going to approach my Java problem from a different direction, with an enum and a factory method. I was still struck by how annoyingly wordy (and not entirely typesafe) my first approach turned out to be in Java.

Fri, 30 Nov 2007

Pumpkin Pie

This is from smakelijke gerechten, a cookbook assembled by the First Reformed Church in Oak Harbor, Washington in 1971. My grandmother's sister belonged to the church, and the cookbook was a gift for my grandmother's birthday.

Preheat oven to 425°F. Mix all ingredients. Pour into pie crust. Bake at 425°F for 15 minutes, then reduce temperature to 350°F and bake for 45 more minutes or until a toothpick inserted in center comes out clean.

Tue, 27 Nov 2007

Grandma's Potato Salad

While I was down in Texas for Thanksgiving, I got my grandmother to give me a couple of her recipes. Unfortunately, she was so familiar with the ingredients that she didn't have any measurements more specific than "'Til it tastes right," so that's all I can provide at the moment. I plan to make the recipe a few more times myself and figure out what measurements work well.

Scrub and boil the potatoes. Hard boil the eggs. When the potatoes are done, let them cool then peel them--the skins should rub right off. Cut the potatoes into roughly 1" pieces. Dice the eggs. Mix together potatoes, eggs, and onion.

In a separate bowl, mix the mayonnaise and pickle juice until the mixture is the right consistency. Add the curry powder, salt, pepper, and mustard to taste.

Mix the dressing with the potatoes, then sprinkle paprika over the salad.

Sun, 25 Nov 2007

Grandma's Parker House Rolls

While I was down in Texas for Thanksgiving, I got my grandmother to give me a couple of her recipes. She portions many of the ingredients by feel or taste, though, so this recipe assumes some familiarity with yeast breads. I plan to make them a couple of times and get some more exact measurements.

In a small saucepan, mix the milk with the sugar, salt, and 6 tablespoons of butter. Heat to just below a boil, then allow to cool to room temperature.

Proof the yeast in some water with a little sugar. Mix the yeast into the milk, then add enough flour to get the right consistency.

Knead the bread, then allow it to rise until doubled in size, about an hour.

Preheat the oven to 375°F.

Melt the remaining 2 tablespoons of butter. Flatten the dough, then roll it out to about 1/4" thick. Cut circles in the dough with a biscuit cutter or the opening of a glass. For each circle, dip half of one side in the butter, then fold it in half with the butter on the inside. Place the rolls on baking sheets and allow to raise until increased in size by roughly 50%.

Bake at 375°F until done, roughly 8-10 minutes.

Fri, 31 Aug 2007

Finding Duplicate Elements in an Array

I came across an interesting programming puzzle today, and I'd like to share a couple of variants on it.

To start with, let's say we have an array A that contains some numbers. For simplicity, let's say that the array is 1-based. How can we efficiently find out if any of those numbers are duplicates?

The easiest approach is to make a hash table to keep track of the numbers we've seen so far. Something like this:

Given an array A of length n,

let h ← (new hash table)
for 1 <= i <= n:
  if A[i] is present in h: return A[i]
  set h[A[i]] ← True
return <no duplicates>

Now, suppose that the array is of length n and only contains positive integers less than n. We can be sure (by the pigeonhole principle) that there is at least one duplicate. The bounds on the values mean that we can be a little more efficient and store the already-seen numbers in an array rather than a hash table:

let s ← array [1..n]
initialize s to all zeroes
for 1 <= i <= n:
  if s[A[i]] > 0: return A[i]
  set s[A[i] ← 1

Now, suppose that we have to do this in constant space; no more creating hash tables and arrays on the fly. Is if still possible? If you add the constraint that there be exactly one duplicate value, this is easy.

Since there are n values from 1 to n-1 with one duplicate, we can use the fact that the sum of the integers from 1 to m is (m*(m+1))/2, take the sum from 1 to n-1, subtract that from the actual sum of numbers in the array, and the result will be the duplicated number:

let s ← 0
for 1 <= i <= n:
  s ← s + A[i]
return s - n*(n-1)/2

But suppose there are no guarantees about the number of duplicates, and the only assurance you have is that the values are between 0 and n, exclusive. Is it still possible to find a duplicated value in O(n) time and O(1) space? Feel free to stop here and try to work things out yourself, if you want.

As you might have guessed from the length of this entry, there is an approach that will work. It's something of a radical departure from the previous approaches, and it relies rather heavily on the particular bounds we have on the values in the array.

First, there's the solution-by-cheating. If you're working in a language with bounded integers, you can exploit that knowledge to meet the letter of the law in the "constant space" requirement.

Let's say we're using a language and platform where integers are 32 bits, signed. The maximum value for an array element would therefore be 231-1, while the minimum value would still be 1. We only need one bit of information per value (seen/not seen), so, at 8 bits per byte, an array 223 bytes (8MB) long would be enough to track any set of integers the environment is capable of passing. This algorithm is therefore O(n) time and O(1) space on a system with 32-bit signed integers:

let s ← array [1..2^31-1] of bit
for 1 <= i <= n:
  if s[A[i]] = 1: return A[i]
  set s[A[i]] ← 1

Of course, it relies on implementation details that might change (a similar array on a 64-bit system would require 32 petabytes of memory) and it might be nice to have an algorithm that works regardless of any artifical bounds on your integers. Fortunately, there are a couple more approaches we can take.

Recall that the array is 1-based, so its elements are numbered from 1 to n. The values are positive integers less than n, so they can range anywhere from 1 to n-1. Because the values are a subset of the array bounds, we can treat those values as indices into the array itself.

If we're allowed to modify the array, we can just go through it reordering the elements, trying to put each value into the corresponding element (so the value 1 goes into the first element, 2 goes into the second element, and so on). When we find a collision, that's the duplicate.

for 1 <= i <= n:
  while A[i] ≠ i:
    if A[A[i]] = A[i]: return A[i]
    swap(A[A[i]], A[i])

If the array is immutable, though, the solution is a little more involved. Given the view of each value as an index into the array, we can treat the array as a directed graph, where each element is a node and the value of that element is an edge pointing at the next node. Since every node points to another node, there is at least one cycle in the graph. We can also conclude that every node leads to a cycle, even if it is not directly part of one.

(Side note on terminology, for those not immediately familiar: A graph is simply a collection of nodes and edges. Each edge connects exactly two nodes. In a directed graph, each edge also has a direction; it goes from one node to another. If you start at a particular node, follow (one of) its edges to the next node, continue doing that for a certain amount of time, and eventually come back to the starting node, that's a cycle. If you start at a particular node, follow its edges out in the same way, and end up in a cycle that does not contain the starting node, all of the nodes and edges you crossed before getting to the cycle are called the tail. The first node that you reach that is in the cycle is defined to be the beginning of the cycle. The symbols λ (lambda) and μ (mu) represent the lengths of the tail and cycle, respectively. Technically, the directed graph we're dealing with here is also a sequence, since every node has at most one edge leading away from it (though it might have more than one leading into it).)

Since the values in the array are all less than n, no element in the array can point to the nth element. Therefore, the last element cannot be part of a cycle. Since every element must lead to a cycle, it must be part of a tail. If we follow the sequence from this point, we will eventually run into a cycle, and the last element in the tail will be a duplicate value.

So, how do we find the beginning of the cycle? The easiest approach is to use Floyd's cycle-finding algorithm. It works roughly like this: Start at the beginning of the sequence. Keep track of two values (call them ai and aj). At each step of the algorithm, move ai one step along the sequence, but move aj two steps. Stop when ai = aj.

At this point, the difference between i and j must be a multiple of the cycle length. If the algorithm took m steps to get to this point, then i = m and j = 2m, so m is a multiple of the cycle length. i has also traveled λ steps along the tail and m-λ steps along the cycle.

If we now start another value (call it ak) at the beginning of the sequence and move both it and ai simultaneously and at the same rate, after λ steps, ak will have traveled the entire tail and will have arrived at the beginning of the cycle. At the same time, ai will have moved a total of m steps along the cycle and, since m is a multiple of μ, will also have arrived at the beginning of the cycle. Thus, if you stop as soon as ai = ak, you will have found the beginning of the cycle and the end of the tail.

So. Applying all of that to our problem means that we start at the last element of the array, have two variables following the pointers in the array, one going twice as fast as the other, until they are equal to each other. At that point, start one of them back at the beginning and run them at the same speed until they again equal each other. Their value will then be the duplicate value.

let i ← n, j ← n
do: i ← A[i], j ← A[A[j]]; until i = j
set j ← n
do: i ← A[i], j ← A[j]; until i = j
return i

This algorithm also lends itself nicely to functional programming, since it doesn't actually require any side effects, so here's an implementation in Haskell:

import Data.Array

findDup a = findDup' n $ findCycle (a!n) (a!(a!n))
    where n = snd $ bounds a
          findCycle i j | i == j    = i
                        | otherwise = findCycle (a!i) (a!(a!j))
          findDup' i j | i == j    = i
                       | otherwise = findDup' (a!i) (a!j)

Sun, 01 Jul 2007

Reply-To: for Mailing Lists

Many mailing lists I'm on (particularly ones inhabited by sizable portions of geeks) have periodic "discussions" about the behavior of the Reply-To header on list email. The discussions usually follow fairly predictable paths. "Reply-To" Munging Considered Harmful is quoted. Proponents respond with Reply-To Munging Considered Useful.

My opinion is threefold:

First, not touching the header is the purest solution. For all the reasons in "Reply-To Munging Considered Harmful", mailing lists shouldn't touch it, in an ideal world, at least.

Second, "Reply-To Munging Considered Useful" gets one things right: it adds functionality. Even today, most mail clients can either "reply" or "reply-to-all". If "reply" goes to the message originator, then they're left with "reply-to-all" as the easiest way to send a message to the list. (And sending mail to the list should, in most cases, be easy.) Some MUAs, such at mutt and KMail [please email me if you know of others] add a "list-reply" function. If most common MUAs had such a feature, I would find it much easier to advocate against munging. But they don't, and it's rather elitist to insist that people switch MUAs just for your mailing list. So, in today's world and for an average list, I tend to vote for munging the header. (Though I really like the (void) mailing list, which provides twin lists, one with munging, the other without. All posts do, of course, appear on both lists.)

(Note that "reply-to-all" is not the same as "list-reply". "list-reply" sends one message--it goes to the list (unless the original email has a Mail-Followup-To: header, but that's another post). "reply-to-all" sends one message to the list, plus one message for each sender and recipient of the original email. That adds up quickly for involved discussion threads, and all of those extra messages are wasts of both time (human and computer) and bandwidth.)

Third, I don't have to care either way, because I use mutt and procmail. (And, actually, the procmail part is optional.) In my .procmailrc is the following recipe:

:0 fhw
| sed -e "s/^\\(Subject:.*\\)\\[$MATCH\\]\\( \\)*/\\1/I" \
      -e "/^Reply-To:.*$MATCH@/ d" \
      -e "s/^Old-Reply-To:/Reply-To:/"

$MATCH is already set to the name of the mailing list, which is usually also the local portion of its email address. (Note that this loses information if the list doesn't munge the header and someone has their own Reply-To header that happens to have the same local part as the mailing list. In my experience, this is highly unlikely.)

My mutt config contains the following directives:

set ignore_list_reply_to=yes
set reply_to=yes
subscribe <list addresses>

so if a Reply-To header is set to an email address that mutt knows to be a mailing list, it ignores that header. (And since those are ignored, I set "reply_to" to yes, so it always uses acceptable headers; by default it asks whether I want to use them.) Thus, the "r" key always replies to the person who sent the message, the "g" key always replies to the sender and all recipients of the message, and the "L" key sends the reply to the mailing list the email was from (and any other addresses in the Mail-Followup-To header).


Phil! Gregory