General musings on programming languages, and Java.

Saturday, August 25, 2007

Stop Sitting On The Type Fence

Static typing is not just about preventing simple bugs, and dynamic typing is not just about writing millions of tests that you wouldn't have to write with static typing.

Programmers of languages like Java, C#, C, C++, etc., middle-of-the-road statically typed languages with some dynamic features, have some intuitions, built up from years of experience at getting the best out of their language, but don't often stop to see what contortions their language makes them go through, or that they are limited in what their compiler tells them about their code.

They often think that rigorous static typing is about preventing bugs that a few unit tests would solve, or that dynamic typing is about writing tests that static typing would solve.

Static Typing is Not About Low-Hanging Bugs

Many years back, clever mathematicians proved that functions express types. If you have an expressive enough type system, you only need to give the input and output types for a function, and you have its implementation. Consider the identity function, written in Scheme like this:

    (lambda (x) x)
Don't worry if lambdas are unfamiliar to you. No, actually, do worry. Lambdas underpin most of computer science and certainly all of programming, even if that's not obvious. If, as I did, you gained a Computer Science degree without ever hearing about lambdas, ask for your money back, and spend it on a copy of The Structure and Interpretation of Computer Programs (SICP) for you, all your friends and family. It's also available in PDF form for no money, in case you're unsuccessful in getting your money back from your University.

Anyway, a lambda expression is analogous to an anonymous function in many languages, and even an anonymous class (with one method) in Java. The particular lambda expression above is a function that takes a value, and returns that same value.

In Haskell, the syntax isn't much different:

    \x -> x
The \ is Haskell's approximation to the Greek letter lambda, the bit before -> is the parameter list, and the bit after it is the result. Scheme doesn't care about telling you the types of things, but Haskell does. The type of \x -> x is t -> t, which means it's a function that takes a t (which can be anything) and returns a t (which is the same as the first t). All fairly obvious.

Given the above, there's only one logical implementation of t -> t, and that's \x -> x. You could make silly ones, where you store x in a local variable, and then return it later, but they're all equivalent. Or perhaps the implementation could look up the type of x, and grab something similarly-typed from a cache. All a bit silly though. Perhaps if all types in a language had a clone function, then you could provide a different implementation, but without that, or something similar,

What you can see from this is that, at least for simple functions, you only need the types to derive the implementation. Does this scale up to things like \x y -> x/y, etc.? Yes, as it happens, though you then need the concept of dependent types, which is types that depend on runtime values. They can still be checked at compile time. In other words, types and values are isomorphic to each other (isomorphic is a mathematical way of saying 'equivalent').

What does the Java version of the identity function look like? Oddly, it depends on whether you write it how Scheme works, or how Haskell works. In Scheme there are no compile time types (or you can say there is exactly one), and the closest that Java comes to that is Object:

    public static Object identity(Object x)
    {
            return x;
    }
If you write it like Haskell, then you need to make sure that the return type is the same as the input type, with no silent upcasting (to Object):
    public static <T> T identity(T x)
    {
            return x;
    }
There is quite a difference between the two pieces of code, which is particularly disturbing, because the Haskell and Scheme versions look so similar to each other. In fact, there's a project called Liskell, which gives Haskell lispy syntax. There, the two would probably be the same. This is a first hint that Java gets in the way of thinking about static typing, by making it very verbose, in particular by making untyped code look different to typed code.

Haskell is also guilty of this, partly by design, and partly because of some deficiencies in its type system. That's the reason that the side of the fence I sit on is the dynamically typed one, but I keep my eye on the static typing people, because they come up with great ideas, and secretly I'd like to be able to use their type systems on some of my code.

So static typing isn't just about making sure that you don't add two credit card numbers together, it's about expressing code as close as possibly to mathematics. Mathematicians tend to be very good at reducing problems to their smallest representations, and coming up with very general solutions, so there can be no doubt that this approach will (and already has) produced amazing results.

Haskell's type system is certainly imperfect, and so are its programmers - this is clear because of the presence of its unit testing system, QuickCheck. If the type system was perfect, and the programmers using it never took shortcuts, there would be no need to run unit tests. It would be evident by the compiler succeeding, that the code could not crash, or produce any results outside the expected range.

If you look for more and more sophisticated type systems, you will probably come across Coq at some point, which is a proof engine that can generate executable code. There you deal with logic, with set theory, all things that are the pinnacle of static typing, and which my Computer Science degree, and probably yours, did not cover.

Go there. Have a look. Even if you decide that static typing is not for you, or that you'll get by with a lesser static typing system such as Haskell's or Java's, you can learn a very general way of thinking that will apply across a lot of programming, in the same way that understanding lambdas will help you when you see C# delegates, JavaScript's anonymous functions, etc.

Dynamic Typing is Not Just About Writing Tests

Many static typing proponents, even Haskellers, have the opinion that while Lisp, Python, Ruby, JavaScript, et al, might have some nice syntax, the fact that they are devoid of static types makes them ultimately useless. You can't easily get the compiler to tell you if you try to do something stupid. For example, 3/"hello world" will be accepted by the language despite it being obvious (read: provable) that a runtime problem will occur.

Many Java programmers assume that you have to guess at the types being used in a dynamic program, or provide excessive documentation. The greater point is that what will work will work. Dynamic languages get out of your way so that you can explore a problem. Whereas in a statically typed language you have to make the solution consistent before you can try it out, you can try out an incomplete function very easily in dynamically-typed languages.

    (lambda (x)
      (if (< x 0)
          (do-this x)
          (do-that x)))
I can call the above lambda on negative numbers, even if do-that hasn't been written yet (assuming do-this has). The runtime doesn't generally complain unless it has to.

Once you have explored a problem, and you understand it, you probably have a nearly-working program, so you're pretty close to having a working one. You get no guarantees that the program will work for all possible inputs, and there isn't even a general way of restricting inputs (other than adding checks at runtime).

Dynamic programmers tend to think in the language they are using - that is, they write code while they're thinking, and test functions out to see whether they work for the cases they're interested in. If I write a function that adds two numbers together, I'm not even remotely interested in what happens if someone passes strings to it, because I'm not writing it for that use case.

This way of thinking keeps the code focused on the task in hand. If the code starts to get hard to read, or repetitive, the programmer will come up with an abstraction. E.g., beginner Java programmers often try this:

    if (x==3 || 5 || 10)
which doesn't compile, because the compiler sees 3 expressions that should all be booleans, and the 2nd and 3rd are ints instead. The programmer gets told they're wrong, and they should fix it to be:
    if (x==3 || x==5 || x==10)
That's clearly garbage, because x== is repeated. The programmer has just adopted a poorer way of thinking to suit a language, instead of adapting the language.
    if (x in (3,5,10))
would be great, as would:
    if ((3,5,10) contains x)
, or even, as valid Java:
    if (asList(3,5,10).contains(x))
Of course, this latter example is a little ugly. Translated directly to Scheme:
    (if (contains (list 3 5 10) x)
If this was used a lot, a Scheme programmer might write:
    (if (in? x 3 5 10)
A Java programmer could do that too, but they would need to know more of the language to be able to do it. Therefore, the Java programmer is 'corrected', instead of shown how to write their idiom in Java.

Because the language gets out of the way, this kind of code is easy to write. That brings about the worry that because you can write code to be as good as you like, you can also write incredibly bad code, and this is completely true.

The remarkable thing is that the language doesn't try to judge whether your code is good or bad - it just runs the code. That's great, because it lets you learn from your own mistakes, and even better, it lets you come up with things that the language designers didn't think of.

It's become quite clear in recent years that Object-Orientated Programming as Java and C++ do it is not the pinnacle of software engineering - but you can't escape Java's OO system. You can't really implement multiple dispatch for two separate codebases in one central place unless you go outside the language (e.g., reflection, bytecode weaving). If it turns out that Haskell's typeclasses aren't the best way of expressing things, you can't remove them from the language, because existing code depends on them - the compiler always needs to understand them - and probably you will still have to use them because the language's central concepts depend on them.

By the language not telling you that you're wrong, it lets you be right in ways that the language designer might not have envisaged. For example, the code snippet (+ "hello" "world") looks wrong, because + is not defined on strings. But in a language like Scheme, + is only a variable, so I can write:

    (let ((+ string-append))
      (+ "hello" "world"))
and it's no longer wrong (arguably a bit stupid though).

So how about those tests? Well, if you look at Java code:

    public Integer add(Integer a,Integer b)
    {
        return a+b;
    }
Both a and b can possibly be null, so what happens if null is passed? Er, you get a NullPointerException. Some programmers will specify that in documentation for the method, or just write that null is not allowed. The same method in Scheme would be callable with any values at all, though it would fail. That changes the attitude of calling programmers. Instead of looking at the half-hearted type signature, and seeing that null is a possible value, the programmer is more likely to look at the implementation.

In the first half of this entry, I said that types imply implementations - well, the reverse is true! Implementations imply types. You don't need a type signature to see what values can be passed to (lambda (a b) (+ a b)), you can infer the type signature from the implementation. In this case, you can pass anything to that lambda that you can pass to the primitive + procedure, whose type signature actually depends on the implementation of Scheme you're using, but it accepts at least those types in the specification.

So Scheme programmers aren't likely to write tests to see what happens if you pass credit cards to a function expecting addresses, because that's never going to be a use case. They informally restrict the inputs. They're emulating what a static type checker does, but mentally. That lets them organically approach a solution to a problem, instead of having to design it beforehand.

It's certainly true that this mental processing is error-prone, and bugs get into source code because of this. Then, many programmers are tempted to write swathes of tests, in an informal attempt to prove that their code is correct.

The bad part of this is that they aren't benefitting from the static typing camp's results now. They have a solution, but they can't use the machine to prove it. They'll end up writing the same kinds of tests over and over.

All this is why I think that, despite him talking about middle-of-the-road type checkers like Java's, Gilad Bracha was on the ball with his presentation about adding pluggable type checkers to dynamic languages. I thoroughly expect that it will be some time before a great one exists, and largely because too many people are sitting on the fence, instead of knocking it down.

If you ignore mediocre languages, there isn't a great deal of difference between how statically typed programmers think and how dynamically typed programmers think. We all like lambdas, we all like side-effect free code, we all dislike crashing programs. So the next time you see a proposal to add static typing to your favourite language, or to add some new dynamic feature to your static language (including 'get-out' clauses like Haskell's unsafePerformIO), bear in mind that you're just witnessing a step on the road to convergence.

If the next big language has a mandatory static type system, then it will either be a crap one, or Computer Science degrees will suddenly contain a lot more mathematics. That won't happen. The next big language will be crap, or dynamic.

8 comments:

Anonymous coward said...

Great post. Great point of view.

IMHO...

Large systems written in statically typed languages are more easily maintained.

and

Large and poorly designed and or poorly tested systems written in statically typed languages are more easily maintained.

Imagine 1000000 lines of code.

Why is it easier?

1. Navigating code. Who calls this method?
2. Navigating code. Who calls this method?
3. Navigating code. Who calls this method?
(Repeated in three times because this is at least three times as important as the next points)
4. Refactoring: I want to rename this method.
5. Refactoring: I want to move this method to the class of the first parameter.
6. Refactoring: I want to add another mandatory parameter.

You can do all of the above, in eclipse (netbeans?, idea?, visual studio?) in a couple of keystrokes over an arbitrarily large codebase.
While valiant attempts can be made at these tasks in dynamically typed languages, they can technically never be perfect as far as I understand(*). That does not matter too much in a well designed and well tested code base.

But we do not live in a world of perfectly designed and perfectly tested programs.

For large code bases (developed by sometimes average or bored developers), static typing allows for read write coding. For large code bases, dynamic typing is write only. Okay, I'm exaggerating a little but I hope you get my point.

Type inferencing statically typed languages are the way ahead. C# 3.0 and Scala are both showing the way here. I expect that there are others. (I don't know Haskell but I think it has type inferencing?)


(*) Okay, so java, c# et al. can suffer if reflection is used but that's much less common.

Antoine said...

Nice post, but most of the things you say you don't like about statically typed languages really only apply to Java and C#.

Haskell is statically typed, but it has nice syntax for testing list memebership (1 `elem` [1,2,3,4] == True) and doesn't have nulls (unless you ask for them).

I think the real fun parts of dynamic languages are the graceful handling of injecting code into a running system, ala Slime.

And as you said, it's nice to rapidly test out ideas without worrying about the typechecker barfing all over. But the better I get at Haskell, the less that becomes a problem.

Paddy3118 said...

To A.Coward: I get angry at the amount of large systems that are cancelled in the UK before they have even become active. A lot are government agencies watching tens or hundreds of millions of pounds of my tax payers money go down the drain.

It is time for a change. If your development process/language leads to bored programmers then you can't expect more.

I'd suggest a way ahead is to hire socially responsible programmers and use non 'belt-n-braces' languages that constrain you or your IDE to write excessive boilerplate code.
Another is greater emphasis on shared success with the customer.

I'm tired of the 'big systems need static languages' mantra, when yet another Government IT project shells out money for nothing.

- Paddy.

Ricky Clarkson said...

Anonymous coward - consider the largest computing system that we have - the Internet. It's been running for decades, and no single part of it that exists now was there at the start. There was no downtime for redeployment, and it can even (somewhat!) deal with malicious components.

It's actually quite hard to find a million-line codebase in a dynamic language, which might seem to prove you correct - it seems that people can't cope with a million lines of dynamically typed code.

Looked at from another angle, programmers in dynamically-typed languages are more likely to make libraries that their projects can use, they're not likely to reach 1,000,000 lines of code within one of those projects because the dependencies have been extracted, and also because they tend to be able to express more in fewer lines (excepting terse static languages like Haskell).

I'm not interested in writing poorly designed systems, or maintaining them - I'd rather rewrite.

It's possible to restrict yourself in a dynamic language to writing code that lets you easily find usages, rename and move procedures.

It's possible to unrestrict yourself in a static language to writing code where find usages doesn't work. All you need to do is pass a function reference around.

Ricky Clarkson said...

antoine,

Most of what I said I don't like was aimed at Java - quite explicitly, I thought. What I don't like about Haskell is, in short, that its type system is mandatory, but incomplete (no dependent types, no infinite types) and not first-class, in that I cannot use types as values.

When you get better and better at having the typechecker accept your code, I wager that you are constraining your thinking more and more successfully. I don't want to constrain my thoughts - if I do that I will find myself assuming the type system is always right, which it isn't.

Some simple code that works fine in Scheme but not in Haskell:

\f x -> f f x
(lambda (f x) (f f x))

There is a workaround (called fix, but for a different reason) and you can train yourself not to even come up with \f x -> f f x, but I don't think that is useful.

paddy3118,

I think the failure of said government projects is a disconnect between developers and end users. My housemate regularly tells me about the stupid screens he has to go through in his work at a bank, inputting the same data twice, in different formats, not being able to go back, having the wrong date format (MM/DD/YYYY instead of DD/MM/YYYY as is the UK's format).

He's never met the programmers, and I doubt he knows anyone who has. I'd say these issues are orthogonal to typing.

Samuel A. Falvo II said...

"If the type system was perfect, and the programmers using it never took shortcuts, there would be no need to run unit tests."

Wasn't there a philosopher who proved that you can never describe a system in the system itself? Therefore, you can never make a type system that is so thorough that it covers itself. From this, we can easily see that, even with the most sophisticated type system, you are still going to need unit tests or something like them.

@Patti: Name me ONE mission-critical project (military, aerospace, medical, etc.) written with a dynamically typed language. I'm talking about applications where human lives are at stake. I don't think you will find any.

Samuel A. Falvo II said...

Ricky,

I'm not sure what your comment aims to prove; the majority of the Internet software (if you include operating systems and TCP/IP stacks for them) is written in statically typed languages.

Antoine said...

While it's true that a type system constrains the software I write to software that typechecks, I feel like there are things I can do with a static typesystem that would be harder to do without (but maybe I've just drank the kool-aid).

An example would be writing a function polymorphic on it's return value, whose behavior is determined by how the return value is used by the caller.

But yeah, I don't think

(lambda (f x) (f f x))

has an equivalent in the Haskell type system, even though it looks similar to the fixed-point combinator. Is it used in a similar way?

Blog Archive

About Me

A salsa dancing, DJing programmer from Manchester, England.