General musings on programming languages, and Java.

Tuesday, August 21, 2007

Objects To Functions 1 - Stateless Classes

Before I had really heard of functional programming, I had started to notice that it made a lot of sense to make certain objects 'value objects', i.e., create them with all the data they need and then never change them, which lets you pass them around without worrying about who changes them. This sounds like low-hanging fruit, I mean, of course you can easily keep track of changes, just like C++ programmers always manage to track who frees an object.. Ah. You probably see where I'm going now. So, if I don't mutate an object, I can pass it around without thinking about the consequences. That's why it turns out to be better to share Strings rather than StringBuffers in Java. Of course, you can treat a StringBuffer as an immutable value, just by never mutating it, but as it is intended to be mutated, you generally don't share it, at least not without thinking very carefully. Let's look at a very simple stateful class:


public final class Point
{
    public double x,y;
}
I'm sure some readers will be complaining about the lack of get/set methods now - I'll happily explain why I don't use them if you ask in a comment. Others may complain about final, I'll happily explain that too. If you instantiate a Point and pass it around, some programmers will see that it's intended to be mutable, and store a copy of it (or take pains never to store it, but just use its values once), and other programmers will take the attitude of, well, if you're giving out a mutable object you'd damn well better expect it to be mutated*. The latter sounds like an unreasonable attitude, but it's why IDEA places warnings on methods that return the values of fields that hold arrays and collections. IDEA puts the onus on the class giving out the mutable object, rather than on the user of that object - it assumes that the latter programmer is the most likely one. * this latter programmer often also has the attitude that while it's ok for him to mutate the object, nobody else should! It's quite easy to make Point immutable, you can provide a constructor that sets x and y, and then make x and y final. You don't absolutely have to do this - all it really takes to stop a data structure from being modified is to stop modifying it - you don't have to put up barriers to modification to do that. Still, in Java modifying things is the norm, so it might be a good idea for now. If you get to a stage where mutation is only done with great care then you might not need the barrier.

public final class Point
{
    public final double x,y;

    public Point(double x,double y)
    {
       this.x=x;
       this.y=y;
    }
}
So, how does this relate to functions? A Point is now a function, as in mathematics. If you create a Point, it will always give you the same value for x and the same value for y. You can think about it as a function that takes the symbol x or the symbol y and always gives the same result. That is, objects can be or represent functions just as well as methods can. Math.abs represents a function, because it always gives the same output for the same input, and so does point.x. Now I can treat a Point as a long-lived value and never worry whether it will be mutated, just as I don't worry about it being garbage collected. In effect, I've taken time out of its semantics. You might be thinking that Point is crying out to be an interface, because programming to the implementation is inflexible. However, I'd like all clients of Point to be able to rely on .x and .y returning the same thing, forever, and not have to worry about different implementations doing strange things. Instead I would suggest that if you wanted a different implementation, you don't call it Point, and don't try to make it extend or implement Point. Clearly it would be of a different type, e.g., MutablePoint, or DebuggablePoint, Quaternion, etc. - and you wouldn't want to use it in places that expect a Point. There are cases that make this way of programming tricky, and solutions to each: 1. Collections. It would be pretty expensive to copy a collection instead of changing it, so in many cases you may as well change it in place, which loses you the time-independence, and stops the collection from behaving as a function. There are some concessions you could make - e.g., only allowing adds, not removes (then any successful accesses will always succeed - you won't break assumptions). For the more general case, there is one kind of collection in particular that works well as a function - a linked list. Java's LinkedList class doesn't give you access to the individual nodes of the list, which is what we need for this to work, so you'd need to write your own.

class Link<T>
{
    public final T item;
    public final Link<T> next;

    public Link(T item,Link<T> next)
    {
        this.item=item;
        this.next=next;
    }
}

Link<String> list=new Link<String>("hello",null);
If you feel the need to add to the list, you don't need to change anything, just call new Link<String>("world",list) and use that value instead from now on. Then each list retains its properties as a function - it will always return the same value if you ask it the same question. If you want to iterate over such a linked list and do something, say, finding the maximum of a Link<Integer>, you could do it like so:

public int max(Link<Integer> list)
{
    int max=Integer.MIN_VALUE;

    while (list!=null)
    {
        max=Math.max(max,list.item);
        list=list.next;
    }

    return list;
}
This works, and from the outside you can't see any problem, but on the inside you can see that time is important. That is, the list objects always work as functions, but because the variables list and max change values, they don't work as functions - you have to think about time whenever you think of them. It would be easier if you only had to look at the method parameters to see the values, instead of having to trace changes in your head. This is a small-scale version of the bigger problems caused by things that change. We can probably deal with this version with no real problems, but let's follow through and see what happens when we apply the large-scale thinking to the small scale. Because a list is a recursive data structure, it turns out to be quite easy to rewrite max as a recursive method:

public int max(Link<Integer> list)
{
    return new Object()
    {
        public int max(Link<Integer> remaining,int current)
        {
            return remaining==null ? current : max(remaining,Math.max(current,remaining.item));
        }
    }.max(list,Integer.MIN_VALUE);
}
Woah! I'm abusing inner classes or something! What's going on? Java doesn't let you define a method inside a method, but it lets you define a method inside a class, and a class inside a method, so that's what I've done. If I was going to write it in more idiomatic Java, I'd do this:

public int max(Link<Integer> list)
{
    return maxImpl(list,Integer.MIN_VALUE);
}

private int maxImpl(Link<Integer> remaining,int current)
{
    return remaining==null ? current : maxImpl(remaining.next,Math.max(current,remaining.item));
}
Now that you've understood that these two idioms are actually equivalent to each other, you can read the first one without any confusion. Just understand that if I could declare a method inside a method directly, I would have done. There is a problem with this code, in that it throws a StackOverflowError for medium-sized collections, on the Sun JVM, but not IBM's, so I am only using it as an intermediate case for this blog post, before reaching a final conclusion. Anyway, let's move on and create a similar method that gives the minimum for a collection.

public int min(Link<Integer> list)
{
    return new Object()
    {
        public int min(Link<Integer> remaining,int current)
        {
            return remaining==null ? current : min(remaining,Math.min(current,remaining.item));
        }
    }.min(list,Integer.MAX_VALUE);
}
I copied and pasted the code, then changed it, which is a good sign that there's some thinking left to be done. I can make more of it the same by renaming the method inside the anonymous class to 'invoke' or something equally generic. I can move the starting value to the method header - it might not make sense to everyone that the minimum of an empty list is Integer.MAX_VALUE, let's allow the user to decide what value that should be:

public int min(Link<Integer> list,int minimum)
{
    return new Object()
    {
        public int invoke(Link<Integer> remaining,int current)
        {
            return remaining==null ? current : invoke(remaining.next,Math.min(current,remaining.item));
        }
    }.invoke(list,current);
}
There's no reason now not to use the outer method as the target for recursion instead of creating an inner one, so if the anonymous class annoyed you, it goes away now:

public int min(Link<Integer> list,int minimum)
{
    return list==null ? minimum : min(list.next,Math.min(current,remaining.item));
}
Wow, that's short. It's also fairly simple. It says, for an empty list, return the given minimum. For a non-empty list, return ourselves invoked with a smaller list, giving the new invocation a different minimum. Notice that nowhere here have we changed any variables. Still, this throws a StackOverflowError too. Two more steps and we'll get rid of it. The only parts that are specific to minima now are the name of the method, and Math.min. A more general version of what we're doing here is reducing a list to a single value. There are lots and lots of reasons we might do that. Asking whether a list contains a certain item is a reduction.

interface Reducer<T,R>
{
    R reduce(T left,R right);
}

public <T,R> R reduce(Link<T> list,R accumulator,Reducer<T,R> reducer)
{
    return list==null ? accumulator : reduce(list.next,reducer.reduce(accumulator,list.item),reducer);
}
A Reducer is something that knows how to reduce 2 values to one. It doesn't necessarily have to mean two items from the list, just an accumulator and one item from the list. Let's have some imaginary syntax meaning that if I type Math.min with no parentheses, I get a Reducer<Integer,Integer>: int min=reduce(list,0,Math.min); That says find me the minimum value in that list, using Math.min to do so. Unfortunately, Java doesn't support such syntax, so this is what you really end up with:

int min=reduce(list,0,new Reducer<Integer,Integer>()
{
    public Integer reduce(Integer left,Integer right)
    {
        return Math.min(left,right);
    }
});
Unfortunately, this is pretty annoying - so annoying that if you use it more than once, you'll want to wrap the reducer up in a static field or method somewhere, so you end up with: int min=reduce(list,0,Maths.minRef); I'm English, we say Maths instead of Math, which turns out to be a handy conflict resolution here. We still have the StackOverflowError problem, but we can resolve it by rewriting reduce in the 'old' way, with changes to variables. Note that StackOverflowError is really a fault with Java, because at the point where reduce calls itself, there is no need to retain any of reduce's data on the stack. Java could make it work so that reduce overwrites its own parameters and then does a goto to get to the top. Unfortunately, it doesn't, so we have to do that ourselves (except the goto is some loop or other):

public <T,R> R reduce(Link<T> list,R accumulator,Reducer<T,R> reducer)
{
    while (list!=null)
    {
        accumulator=reducer.reduce(accumulator,list.item);
        list=list.next;
    }

    return accumulator;
}
So even though we've ultimately had to make a concession for the technical problems of being in a JVM, we've got something pretty reusable out of the thought experiment. Of course this is nothing new. I actually wouldn't at this time recommend using linked lists in this way in Java, but closures in Java 7 and a solution to the problem of tail-call elimination might make this much more attractive. 2. Cyclic data. If class A needs to hold a B, and B needs to hold an A, there's no way to instantiate them both at the same time. There are clever tricks that make them nearly instantiated at the same time, but you can't make the respective fields final in each. A common solution is to get rid of the cycle by creating a third object, C, that holds A and B, supplying the instances of each other as necessary. Unfortunately, the name for C is often hard to think of, which is a reasonable sign (I'm contradicting another of my blog posts here) that C is a bad thing to have at all. I'd suggest looking at the individual case and deciding what to do. Sometimes it's just that A and B should be merged, or that some functionality should be moved from one class to another. In other cases, it might turn out that some methods can be made static, in which case there's no need to hold an instance. 3. Data that will change over time. Essentially a philosophical question, it is possible to argue that something that is different before a certain time than after it hasn't changed, that there are actually two objects. Monads are a way of treating immutable objects as if they can change, by making the new value replace the old value seamlessly, which effectively solves the philosophical question by letting you write code nearly the same way regardless of the answer. Using Java, though, I'd probably suggest that if data changes over time, you may as well use mutable data. Perhaps Java 7 will make monads more attractive.

3 comments:

David R. MacIver said...

I might comment on some of the rest later, but I thought I'd share a cute trick now:

"If class A needs to hold a B, and B needs to hold an A, there's no way to instantiate them both at the same time. There are clever tricks that make them nearly instantiated at the same time, but you can't make the respective fields final in each."

This is wrong! You can! It only works because Java's behaviour is incorrect in certain ways, but I sometimes feel that 50% of Java programming is just a way of benefiting from the ways in which Java is incorrect.

There are essentially two ways to do it that I can think of right now (I thought I had a third, but I can't remember how it works. :-( ). The first is evil - use reflection. Instantiate one of the fields as null then poke inside it with reflection to change the value (you can change final fields reflectively, even though it's gross and wrong to do so).

The second slightly less evil one is to make one class an inner class of the other. You can then either shadow the outer class with a final field or just use the implicit reference to the outer class.

I seem to recall a third way which involved cunningness with subclass both A and B in such a way as to make one an inner class of the other, but I can't seem to make it work now (it runs into a problem where you can't instantiate an inner class in a super constructor call because it involves a reference to 'this'). I'll keep thinking and see what I can come up with.

Unknown said...

I think it's a missspell:

public int min(Link<Integer> list,int minimum)
{
return new Object()
{
public int invoke(Link<Integer> remaining,int current)
{
return remaining==null ? current : invoke(remaining.next,Math.min(current,remaining.item));
}
}.invoke(remaining,current);

should become

public int min(Link<Integer> list,int minimum)
{
return new Object()
{
public int invoke(Link<Integer> remaining,int current)
{
return remaining==null ? current : invoke(remaining.next,Math.min(current,remaining.item));
}
}.invoke(list,minimum);

(not sure the code is properly formatted)

Ricky Clarkson said...

Giovanni,

You're right, I'll run off and fix that. Thanks.

Blog Archive

About Me

A salsa dancing, DJing programmer from Manchester, England.