Structural insight: Kernel

Showing posts with label Kernel. Show all posts

Thursday, July 18, 2013

Bypassing no-go theorems

This is not at all what I had in mind.
— Albert Einstein, in response to David Bohm's hidden variable theory.

A no-go theorem is a formal proof that a certain kind of theory cannot work. (The term no-go theorem seems to be used mainly in physics; I find it useful in a more general context.)

A valid no-go theorem identifies a hopeless avenue of research; but in some cases, it also identifies a potentially valuable avenue for research. This is because in some cases, the no-go theorem is commonly understood more broadly than its actual technical result. Hence the no-go theorem is actually showing that some specific tactic doesn't work, but is interpreted to mean that some broad strategy doesn't work. So when you see a no-go theorem that's being given a very broad interpretation, you may do well to ask whether there is, after all, a way to get around the theorem, by achieving what the theorem is informally understood to preclude without doing what the theorem formally precludes.

In this post, I'm going to look at four no-go theorems with broad informal interpretations. Two of them are in physics; I touch on them briefly here, as examples of the pattern (having explored them in more detail in an earlier post). A third is in programming-language semantics, where I found myself with a result that bypassed a no-go theorem of Mitchell Wand. And the fourth is a no-go theorem in logic that I don't actually know quite how to bypass... or even whether it can be bypassed... yet... but I've got some ideas where to look, and it's good fun to have a go at it: Gödel's Theorem.

John von Neuman's no-go theorem

In 1932, John von Neumann proved that no hidden variable theory can make all exactly the same predictions as quantum mechanics (QM): all hidden variable theories are experimentally distinguishable from QM. In 1952, David Bohm published a hidden variable theory experimentally indistinguishable from QM.

How did Bohm bypass von Neumann's no-go theorem? Simple, really. (If bypassing a no-go theorem is possible at all, it's likely to be very simple once you see how). The no-go theorem assumed that the hidden variable theory would be local; that is, that under the theory, the effect of an event in spacetime cannot propagate across spacetime faster than the speed of light. This was, indeed, a property Einstein wanted out of a hidden variable theory: no "spooky action at a distance". But Bohm's hidden variable theory involved a quantum potential field that obeys Schrödinger's Equation — trivially adopting the mathematical infrastructure of quantum mechanics, spooky-action-at-a-distance and all, yet doing it in a way that gave each particle its own unobservable precise position and momentum. Einstein remarked, "This is not at all what I had in mind."

John Stewart Bell's no-go theorem

In 1964, John Stewart Bell published a proof that all local hidden variable theories are experimentally distinguishable from QM. For, of course, suitable definition of "local hidden variable theory". Bell's result can be bypassed by formulating a hidden variable theory in which signals can propagate backwards in time — an approach advocated by the so-called transactional interpretation of QM, and which, as noted in my earlier post on metaclassical physics, admits the possibility of a theory that is still "local" with respect to a fifth dimension of meta-time.

Mitchell Wand's no-go theorem

In 1998, Mitchell Wand published a paper The Theory of Fexprs is Trivial.

The obvious interpretation of the title of the paper is that if you include fexprs in your programming language, the theory of the language will be trivial. When the paper first came out, I had recently hit on my key insight about how to handle fexprs, around which the Kernel programming language would grow, so naturally I scrutinized Wand's paper very closely to be sure it didn't represent a fundamental threat to what I was doing. It didn't. I might put Wand's central result this way: If a programming language has reflection that makes all computational states observable in the syntactic theory of the language, and if computational states are in one-to-one correspondence with syntactic forms, then the syntactic theory of the language is trivial. This isn't a problem for Kernel because neither of these conditions holds: not all computational states are observable, and computational states are not in one-to-one correspondence with syntactic forms. I could make a case, in fact, that in S-expression Lisp, input syntax represents only data: computational states cannot be represented using input syntax at all, which means both that the syntactic theory of the language is already trivial on conceptual grounds, and also that the theory of fexprs is not syntactic.

At the time I started writing my dissertation, the best explanation I'd devised for why my theory was nontrivial despite Wand was that Wand did not distinguish between Lisp evaluation and calculus term rewriting, whereas for me Lisp evaluation was only one of several kinds of term rewriting. Quotation, or fexprs, can prevent an operand from being evaluated; but trivialization results from a context preventing its subterm from being rewritten. It's quite possible to prevent operand evaluation without trivializing the theory, provided evaluation is a specific kind of rewriting (requiring, in technical parlance, a redex that includes some context surrounding the evaluated term).

Despite myself, though, I was heavily influenced by Wand's result and by the long tradition in which it followed. Fexprs had been rejected circa 1980 as a Lisp paradigm, in favor of macros. A rejected paradigm is usually ridiculed in order to rally followers more strongly behind the new paradigm (see here). My pursuit of $vau as a dissertation topic involved a years-long process of gradually ratcheting up expectations. At first, I didn't think it would be worth formulating a vau-calculus at all, because of course it wouldn't be well-enough behaved to justify the formulation. Then I thought, well, an operational semantics for an elementary subset of Kernel would be worth writing. Then I studied Plotkin's and Felleisen's work, and realized I could provide a semantics for Kernel that would meet Plotkin's well-behavedness criteria, rather than the slightly weakened criteria Felleisen had used for his side-effectful calculus. And then came the shock. When I realized that the vau-calculus I'd come up with, besides being essentially as well-behaved as Plotkin's call-by-value lambda-calculus, was actually (up to isomorphism) a conservative extension of call-by-value lambda-calculus. In other words, my theory of fexprs consisted of the entire theory of call-by-value lambda-calculus plus additional theorems.

I was boggled. And I was naively excited. I figured, I'd better get this result published, quick, before somebody else notices it and publishes it first — because it's so incredibly obvious, it can't be long before someone else does find it. Did I say "naively"? That's an understatement. There's some advice for prospective graduate students floating around, which for some reason I associate with Richard Feynman (though I could easily be wrong about that [note: a reader points out this]), to the effect that you shouldn't be afraid people will steal your ideas when you share them, because if your ideas are any good you'll have trouble getting anyone to listen to them at all. In studying this stuff for years on end I had gone so far down untrodden paths that I was seeing things from a drastically unconventional angle, and if even so I had only just come around a corner to where I could see this thing, others were nowhere close to any vantage from which they could see it.

[note: I've since written a post elaborating on this, Explicit evaluation.]

Kurt Gödel's no-go theorem

Likely the single most famous no-go theorem around is Gödel's Theorem. (Actually, it's Gödel's Theorems, plural, but the common informal understanding of the result doesn't require this distinction — and Gödel's result lends itself spectacularly to informal generalization.) This is what I'm going to spend most of this post on, because, well, it's jolly good fun (recalling the remark attributed to Abraham Lincoln: People who like this sort of thing will find it just the sort of thing they like).

The backstory to Gödel was that in the early-to-mid nineteenth century, mathematics had gotten itself a shiny new foundation in the form of a formal axiomatic approach. And through the second half of the nineteenth century mathematicians expanded on this idea. Until, as the nineteenth century gave way to the twentieth, they started to uncover paradoxes implied by their sets of axioms.

A perennial favorite (perhaps because it's easily explained) is Russell's Paradox. Let A be the set of all sets that do not contain themselves. Does A contain itself? Intuitively, one can see at once that if A contains itself, then by its definition it does not contain itself; and if it does not contain itself, then by its definition it does contain itself. The paradox mattered for mathematicians, though, for how it arose from their logical axioms, so we'll be a bit more precise here. The two key axioms involved are reductio ad absurdum and the Law of the Excluded Middle.

Reductio ad absurdum says that if you suppose a proposition P, and under this supposition you are able to derive a contradiction, then not-P. Supposing A contains itself, we derive a contradiction, therefore A does not contain itself. Supposing A does not contain itself, we derive a contradiction, therefore —careful!— A does not not contain itself. This is where the Law of the Excluded Middle comes in: A either does or does not contain itself, therefore since it does not not contain itself, it does contain itself. We have therefore an antinomy, that is, we've proved both a proposition P and its negation not-P (A both does and does not contain itself). And antinomies are really bad news, because according to these axioms we've already named, if there is some proposition P for which you can prove both P and not-P, then you can prove every proposition, no matter what it is. Like this: Take any proposition Q. Suppose not-Q; then P and not-P, which is a contradiction, therefore by reductio ad absurdum, not-not-Q, and by the Law of the Excluded Middle, Q.

When Russell's Paradox was published, the shiny new axiomatic foundations of mathematics were still less than a human lifetime old. Mathematicians started trying to figure out where things had gone wrong. The axioms of classical logic were evidently inconsistent, leading to antinomies, and the Law of the Excluded Middle was identified as a problem.

One approach to the problem, proposed by David Hilbert, was to back off to a smaller set of axioms that were manifestly consistent, then use that smaller set of axioms to prove that a somewhat larger set of axioms was consistent. Although clearly the whole of classical logic was inconsistent, Hilbert hoped to salvage as much of it as he could. This plan to use a smaller set of axioms to bootstrap consistency of a larger set of axioms was called Hilbert's program, and I'm remarking it because we'll have occasion to come back to it later.

Unfortunately, in 1931 Kurt Gödel proved a no-go theorem for Hilbert's program: that for any reasonably powerful system of formal logic, if the logic is consistent, then it cannot prove the consistency of its own axioms, let alone its own axioms plus some more on the side. The proof ran something like this: For any sufficiently powerful formal logic M, one can construct a proposition A of M that amounts to "this proposition is unprovable". If A were provable, that would prove that A is false, an antinomy; if not-A were provable, that would prove that A is true, again an antinomy; so M can only be consistent if both A and not-A are unprovable. But if M were able to prove its own consistency, that would prove that A is unprovable (because A must be unprovable in order for M to be consistent), which would prove that A is true, producing an antinomy, and M would in fact be inconsistent. Run by that again: If M can prove its own consistency, then M is in fact inconsistent.

Typically, on completion of a scientific paradigm shift, the questions that caused the shift cease to be treated as viable questions by new researchers; research on those questions tapers off rapidly, pushed forward only by people who were already engaged by those questions at the time of the shift. So it was with Gödel's results. Later generations mostly treated them as the final word on the foundations of mathematics: don't even bother, we know it's impossible. That was pretty much the consensus view when I began studying this stuff in the 1980s, and it's still pretty much the consensus view today.

Going there

Having been trained to think of Gödel's Theorem as a force of nature, I nevertheless found myself studying it more seriously when writing the theoretical background material for my dissertation. I found myself discoursing at length on the relationship between mathematics, logic, and computation, and a curious discrepancy caught my eye. Consider the following Lisp predicate.

($define! A ($lambda (P) (not? (P P))))

Predicate A takes one argument, P, which is expected to be a predicate of one argument, and returns the negation of what P would return when passed to itself. This is a direct Lisp translation of Russell's Paradox. What happens when A is passed itself?

The answer is, when A is passed itself, (A A), nothing interesting happens — which is really very interesting. The predicate attempts to recurse forever, never terminating, and in theory it will eventually fill up all available memory with a stack of pending continuations, and terminate with an error. What it won't do is cause mathematicians to despair of finding a solid foundation for their subject. If asking whether set A contains itself is so troublesome, why is applying predicate A to itself just a practical limit on how predicate A should be used?

That question came from my dissertation. Meanwhile, another question came from the other major document I was developing, the R-1RK. I wanted to devise a uniquely Lisp-ish variant of the concept of eager type-checking. It seemed obvious to me that there should not be a fixed set of rules of type inference built into the language; that lacks generality, and is not extensible. So my idea was this: In keeping with the philosophy that everything should be first-class, let theorems about the program be an encapsulated type of first-class objects. And carefully design the constructors for this theorem type so that you can't construct the object unless it's provable. In effect, the constructors are the axioms of the logic. Modus ponens, say, is a constructor: given a theorem P and a theorem P-implies-Q, the modus-ponens constructor allows you to construct a theorem Q. As desired, there is no built-in inference engine: the programmer takes ultimate responsibility for figuring out how to prove things.

Of the many questions in how to design such a first-class theorem type, one of the early ones has to be, what system of axioms should we use? Clearly not classical logic, because we know that would give us an inconsistent system. This though was pretty discouraging, because it seemed I would find myself directly confronting in my design the very sort of problems that Gödel's Theorem says are ultimately unsolvable.

But then I had a whimsical thought; the sort of thing that seems at once not-impossible and yet such a long shot that one can just relax and enjoy exploring it without feeling under pressure to produce a result in any particular timeframe (and yet, I have moved my thinking forward on this over the years, which keeps it interesting). What if we could find a way to take advantage of the fact that our logic is embedded in a computational system, by somehow bleeding off the paradoxes into mere nontermination? So that they produce the anticlimax of functions that don't terminate instead of the existential angst of inconsistent mathematical foundations?

Fragments

At this point, my coherent vision dissolves into fragments of tentative insight, stray puzzle pieces I'm still pushing around hoping to fit together.

One fragment: Alfred Tarski —who fits the aforementioned profile of someone already engaged by the questions when Gödel's results came out— suggested post-Gödel that the notion of consistency should be derived from common sense. Hilbert's program had actually pursued a formal definition of consistency, as the property that not all propositions are provable; this does have a certain practicality to it, in that the practical difficulty with the classical antinomies was that they allowed all propositions Q to be proven, so that "Q can be proven" ceased to be a informative statement about Q. Tarski, though, remarked that when a non-mathematician is told that both P and not-P are true, they can see that something is wrong without having to first receive a lecture on the formal consequences of antinomies in interaction with reductio ad absurdum.

So, how about we resort to some common sense, here? A common-sensical description of Russell's Paradox might go like this:

A is the set of all sets that do not contain themselves. If A contains itself, then it does not contain itself. But if it does not contain itself, then it does contain itself. But if it does contain itself, then it does not contain itself. But if it does not contain itself...

And that is just what we want to happen: instead of deriving an antinomy, the reasoning just regresses infinitely. A human being can see very quickly that this is going nowhere, and doesn't bother to iterate beyond the first four sentences at most (and once they've learned the pattern, next time they'll probably stop after even fewer sentences), but they don't come out of the experience believing that A both does and does not belong to itself; they come out believing that there's no way of resolving the question.

So perhaps we should be asking how to make the conflict here do an infinite regress instead of producing a (common-sensically wrong) answer after a finite number of steps. This seems to be some sort of deep structural change to how logical reasoning would work, possibly not even a modification of the axioms at all but rather of how they are used. If it does involve tampering with an axiom, the axiom tampered with might well be reductio ad absurdum rather than the Law of the Excluded Middle.

This idea — tampering with reductio ad absurdum rather than the Law of the Excluded Middle — strikes a rather intriguing historical chord. Because, you see, one of the mathematicians notably pursuing Hilbert's program pre-Gödel did try to eliminate the classical antinomies by leaving intact the Law of the Excluded Middle and instead modifying reductio ad aburdum. His name was Alonzo Church —you may have heard of him— and the logic he produced had, in retrospect, a curiously computational flavor to it. While he was at it, he took the opportunity to simplify the treatment of variables in his logic, by having only a single structure that binds variables, which (for reasons lost in history) he chose to call λ. Universal and existential quantifiers in his logic were higher-order functions, which didn't themselves bind variables but instead operated on functions that did the binding for them. Quite a clever device, this λ.

Unfortunately, it didn't take many years after Church's publication to show that antinomies arose in his system after all. Following the natural reflex of Hilbert's program, Church tried to find a subset of his logical axioms that could be proven consistent — and succeeded. It turned out that if you left out all the operators except λ you could prove that each proposition P was only equivalent to at most one irreducible form. This result was published in 1936 by Church and one of his students, J. Barkley Rosser, and today is known as the Church–Rosser Theorem (you may have heard of that, too). In the long run, Church's logic is an obscure historical footnote, while its λ-only subset turned out to be of great interest for computation, and is well-known under the name "λ-calculus".

So evidently this idea of tampering with reductio ad absurdum and bringing computation into the mix is not exactly unprecedented. Is it possible that there is something there that Alonzo Church didn't notice? I'd have to say, "yes". Alonzo Church is one of those people who (like Albert Einstein, you'll recall he came up in relation to the first no-go theorem I discussed) in retrospect appears to have set a standard of intelligence none of us can possibly aspire to — but all such people are limited by the time they live in. Einstein died years before Bell's Theorem was published. Heck, Aristotle was clearly a smart guy too, and just think of everything he missed through the inconvenience of being born about two millennia before the scientific revolution. And Alonzo Church couldn't, by the nature of the beast, have created his logic based on a modern perspective on computation and logic since it was in part the further development of his own work over many decades that has produced that modern perspective.

I've got one more puzzle piece I'm pushing around, that seems like it ought to fit in somewhere. Remember I said Church's logic was shown to have antinomies? Well, at the time the antinomy derivation was rather baroque. It involved a form of the Richard Paradox, which concerns the use of an expression in some class to designate an object that by definition cannot be designated by expressions of that class. (A version due to G.G. Berry concerns the twenty-one syllable English expression "the least natural number not nameable in fewer than twenty-two syllables".) The Richard Paradox is naturally facilitated by granting first-class status to functions, as λ-calculus and Lisp do. But, it turns out, there is a much simpler paradox contained in Church's logic, involving less logical machinery and therefore better suited for understanding what goes wrong when λ-calculus is embedded in a logic. This is Curry's Paradox.

I'll assume, for this last bit, that you're at least hazily familiar with λ-calculus, so it'll come back to you when you see it.

For Curry's Paradox, we need one logical operator, three logical axioms, and the machinery of λ-calculus itself. Our one logical operator is the binary implication operator, ⇒. The syntax of the augmented λ-calculus is then

T ::= x | c | (λx.T) | (TT) | (T⇒T)

That is, a term is either a variable, or a constant, or a lambda-expression, or an application, or an implication. We don't need a negation operator, because we're sticking with the generalized notion of inconsistency as the property that all propositions are provable. Our axioms assert that certain terms are provable:

For all terms P and Q, if provably P and provably (P⇒Q), then provably Q. (modus ponens)
For all terms P, provably P⇒P.
For all terms P and Q, provably ((P⇒(P⇒Q))⇒(P⇒Q)).

The sole rewriting axiom of λ-calculus, lest we forget, is the β-rule:

(λx.F)G → F[x ← G]

That is, to apply function (λx.F) to argument G, substitute G for all free occurrences of x in F.

To prove inconsistency, first we need a simple result that comes entirely from λ-calculus itself, called the Fixpoint Theorem. This result says that for every term F, there exists a term G such that FG = G (that is, every term F has a fixpoint). The proof works like this:

Suppose F is any term, and let G = (λx.(F(xx)))(λx.(F(xx))), where x doesn't occur in F. Then G = (λx.(F(xx)))(λx.(F(xx))) → (F(xx))[x ← (λx.(F(xx)))] = F((λx.(F(xx)))(λx.(F(xx)))) = FG.

Notice that although the Fixpoint Theorem apparently says that every F has a fixpoint G, it does not actually require F to be a function at all: instead of providing a G to which F can be applied, it provides a G from which FG can be derived. And —moreover— for most F, derivation from G is a divergent computation (G → FG → F(FG) → F(F(FG)) → ...).

Now we're ready for our proof of inconsistency: that for every term P, provably P.

Suppose P is any term. Let Q = λx.(x⇒P). By the Fixpoint Theorem, let R be a term such that QR = R. By writing out the definition of Q and then applying the β-rule, we have QR = (λx.(x⇒P))R → (R⇒P), therefore R = (R⇒P).

By the second axiom, provably (R⇒R); but R = R⇒P, so, by replacing the right hand R in (R⇒R) with (R⇒P), provably (R⇒(R⇒P)).

By the third axiom, provably ((R⇒(R⇒P))⇒(R⇒P)); and we already have provably (R⇒(R⇒P)), so, by modus ponens, provably (R⇒P). But R = (R⇒P), so provably R.

Since provably R and provably (R⇒P), by modus ponens, provably P.

Note: I've had to fix errors in this proof twice since publication; there's some sort of lesson there about either formal proofs or paradoxes.

So, why did I go through all this in detail? Besides, of course, enjoying a good paradox. Well, mostly, this: The entire derivation turns on the essential premise that derivation in the calculus, as occurs (oddly backwards) in the proof of the Fixpoint Theorem, is a relation between logical terms — which is to say that all terms in the calculus have logical meaning.

And we've seen something like that before, in my early explanation of Mitchell Wand's no-go theorem: trivialization of theory resulted from assuming that all calculus derivation was evaluation. So, if we got around Wand's no-go theorem by recognizing that some derivation is not evaluation, what can we do by recognizing that some derivation is not deduction?

Friday, January 27, 2012

Guarded continuations

Go to, let us go down, and there confound their language, that they may not understand one another's speech.

— Genesis 11:7 (King James Version)

I've been asked about Kernel's guarded continuations. In writing that section of the Kernel Report, I knew it wasn't as lucid as one might like, but getting it all down while I had it clearly in mind was something of an achievement in itself. I'm not sure I've ever seen a really lucid explanation of continuations; explaining them should probably be considered a major challenge in its own right, above and beyond figuring out how the mechanism should work. It's high time I tried to do better, though; however many tries it takes to get the explanation to come clear, I won't get there if my number of tries never increases.

On my Kernel web page I describe guarded continuations as 'Exception handling "done right"'. For an ultra-short tag line, that may be about as good as any; but guarded continuations are more a simultaneous blend-and-improvement on Scheme continuations and Java exception handling, rather than a supposedly-optimal exception facility. We may still be several generations of exception facilities away from getting exceptions right; but as may be, I think guarded continuations have something to add to the discussion.

It's also worth saying up front that guarded continuations are lower-level than a good exception facility would be. Kernel tries for very simple low-level facilities plus the ability to build higher-level abstractions from them. It's an open, and fascinating, question how well Kernel's abstraction facilities will actually apply to exception handling — and I hope to learn a lot from that question. (I expect shortcomings of Kernel's abstractive facilities to be not about the facilities' power (which I maintain is extraordinary) but about the directions in which that power can be applied. When I figure out what the missing directions are, I mean to go that way.)

Stacks and trees

Classically, programming languages deal with function calls using a stack. There are two kinds of stuff on the stack: local variables (data), and return addresses (control). The local data for a function call produces a single set of local variables; but the algorithm for that same call may produce many different return addresses, one for each point in the algorithm where it, in turn, calls a function. In principle, one might have two stacks, one for data and one for control; but classically, all local variables and all return addresses are use-once, to be discarded in the same order they were introduced, so it may seem natural to store both on a single stack (with resultant security problems, malware accessing data as control and vice versa).

First-class (lexically scoped) functions require local data to be stored, in general, as a tree rather than a stack. When a first-class function is returned as the result of a function call (in vintage parlance, as an upward funarg), lexical scoping expects that function to have access to —in general— the entire column of variables that were on the stack when the function was created, even though some of the function calls that created those variables have since ended. So data that used to be popped off the stack must now, potentially, persist. A set of local variables, which used to be the data part of a stack frame, is now an environment, with a fixed set of ancestors (variables further up the column) and potentially any number of children (one for each call to any function defined in that environment). (In Kernel, btw, environments are first-class objects; there was a lively LtU discussion on first-class environments, here.)

First-class continuations do to the control stack what first-class functions did to the data stack: they turn it into a tree. (First-class continuations do this to the data stack too, but I'm not aware of any language that has first-class continuations without first-class functions.) Just as first-class functions required entire columns of local variables to persist after the ends of function calls that created them, first-class continuations require entire columns of "return addresses" to persist after the ends of function calls that created them.

In visualizing the tree of continuations, the actual algorithms of the program hover in the areas beneath the nodes of the tree (possibly, multiple algorithms under a single node). Each child continuation is a socket waiting for a value to be plugged into it, to be chewed on by the algorithm above that child — which may, eventually, terminate and plug a value into its parent continuation. Primitive operations hover beneath the leaves of the tree. Kernel pretty much only needs one non-primitive operation: combination evaluation, which spins off a subtree to evaluate the operator, spins off subtrees as appropriate to evaluate operands, and finally hands the situation over to another algorithm to make the operative call (using for the operative call its parent continuation, in properly tail-recursive fashion).

Guards

When a function f calls another function g, f gives up some of its ability to regulate the future evolution of the computation. The programmer tends to assume that not all such power has been lost — that (1) the call to g won't terminate without control passing back through the point at which f called it, and (2) that particular call to g won't return through that point more than once. In the presence of first-class continuations, though, even these weak assumptions needn't hold. Once f calls g, the dynamic extent of that call might be exited, and/or later reentered any number of times, in ways that bypass f entirely. Perhaps f opens a file, calls g, and then closes the file before relinquishing control to whoever called f; if g terminates bypassing f, the file might not be closed outside the call to f, and if g is reentered bypassing f, the file might not be open inside the call to g.

Scheme addresses this sort of problem by means of dynamic-wind, a device that causes specified actions to be performed whenever a continuation-call tries to enter or leave the dynamic extent of the call to g. In R5RS Scheme (the version Kernel primarily diverged from), there were a couple of problems with dynamic-wind.

Although dynamic-wind allowed something to be done every time the dynamic extent of the call to g was entered, and every time it was exited, these before and after actions were unable to observe (let alone affect) what value was being passed into or out of the dynamic extent, so failing to support context-sensitive regulation of access to/from the dynamic extent. (In Javascript terms <shudder>, before and after are akin to finally rather than catch.)
The R5RS was explicitly baffled by what would happen if continuations were used to pass into or out of the dynamic extents of the calls to before and after — which suggested to me that the whole thing was something of a kludge. The correct behavior in such cases ought to follow naturally from the conceptual framework, and if it didn't follow naturally one had to wonder if the conceptual framework was altogether coherent.

Java exception handling differs from this arrangement in several ways.

Mostly obviously and trivially, it only allows exiting a dynamic extent, not reentering one.
An intercepting catch has access to the value being thrown (unlike the after of dynamic-wind).
A catch selectively intercepts throws, based on the "target" (one might even say, "destination") of the throw. Whereas Lisp cleanly separates the target of an abnormal transfer of control from the value being transferred —the continuation called, versus the value passed to it— Java entangles them in a single thrown object whose type is the "target".
Once a Java exception is caught, that's the end of the exception unless it is explicitly rethrown — in contrast to the after of dynamic-wind, which ordinarily does not disrupt the abnormal transfer of control during which it is called.

Interestingly, one fairly intricate characteristic shared by both Scheme's dynamic-wind and Java's catch is that the target is a node of a tree. This is, perhaps, all the more interesting since the trees are seemingly very different in nature. The Java target is a type in a tree of exception types; the Scheme target is a continuation in a tree of continuations.

Kernel's guarded continuations drew on both these models for inspiration. An applicative guard-continuation allows the construction of a continuation with entry- and exit-guard clauses. Exit-guard clauses selectively intercept continuation-calls from inside to outside the guarded dynamic extent (the branch of the continuation tree descending from the guarded continuation), and are therefore analogous to Java catch clauses, with some adjustments. Entry-guard clauses are symmetrically designed to selectively intercept continuation-calls from outside to inside the guarded extent.

Each guard clause has a selector continuation. Keep in mind that a continuation-call has two ends: it goes from one point in the continuation-tree to another, and in this case we know that one end of the call is inside the guarded extent, while the other end is outside. A continuation-call matches the selector of a clause just if the outside end of the call lies within the tree branch descending from the selector. For an exit-guard clause this means that the called continuation must be in the selector's branch of the tree. For an entry-guard, the starting point of the continuation-call must be within the selector's branch of the tree.

Because entry guards are selected based on the source of the continuation-call, a list of interceptors selected when the continuation-call is initiated will be different, in general, from what one would get by stopping at some intermediate interceptor and then "rethrowing" the call (in Java catch style). This suggests that normal termination of an intercepting guard should continue along the path of the original continuation-call, reserving explicit diversion for when the interceptor doesn't want to follow the original path. Secondarily, although we do want it to be convenient (as well as possible) for an interceptor to continue along the path of the original continuation-call, we don't really want to give the interceptor first-class continuations for either end of the call (source or destination). We may have built some abstract device on top of the continuation primitives, and as a result, the source/destination continuations may be internal details that really oughtn't be casually spread about.

Therefore, normal termination of an intercepting guard continues along the path of the continuation-call, and the interceptor when called is provided with an explicit continuation for diverting control to just outside the guarded extent (Java-style).

For versatile guard selection, each guarded gateway (entry or exit for a dynamic extent) has a list of guard clauses. When a continuation-call goes through the gate (entering or exiting), the selectors of these clauses are checked, one at a time, until one is found that matches the call, or until there are no more clauses so that no interception occurs. No more than one clause is selected. A root continuation is provided, which can be used as a selector that matches all possible calls. Using the root continuation, plus the ability for an interceptor to continue along the continuation-call path, one can (for example) arrange that a clause will be selected except when the call matches some narrower selector (for example, the error continuation, which is the ancestor of all evaluation-aborting error handlers). One provides a clause that catches errors and causes them to continue along their path, followed by a clause that catches everything and does whatever it is one wants done only in non-error situations.

When a continuation-call is made, a list of intercepting guard clauses can be immediately constructed, since continuation guards are fixed at the time the guarded continuation is constructed. The list doesn't have to be constructed then, of course, exactly because all the relevant parts of the continuation tree are immutable; but if it isn't constructed immediately, one will have to store the information needed to select the clauses later — which includes the source of the call as well as its destination (destination to select exit guards, source to select entry guards).

The Kernel Report carefully works out subtleties of the device, such that when a guard intercepts an entry/exit to dynamic extent d, and then itself makes a continuation-call, that call does not select any guard on d.

The traditional Scheme device for acquiring a first-class continuation object is call/cc, which calls a procedure and passes to that procedure the continuation to which that call would normally return. Frankly, this was always a very clumsy way to work with continuations; one might almost suspect it was devised as an "esoteric programming language" feature, akin to INTERCAL's COME FROM statement. With guarded continuations, though, dependence on call/cc finally becomes untenable. If I'm working in Java, and I want to create a new class of exception, I do that by adding a subtype to the tree of exception types; to do the same thing in Kernel, it seems I want to add a new continuation to the tree branch of continuations descending from the error continuation. If I had to use call/cc to acquire continuations, I could only add a new descendant of the error continuation if I were actually in the dynamic extent of the error continuation — so, to define a new "type" of error during normal computation, I'd have to first "throw" an error exception, passing through all the exit- and entry-guards to get to the point in the tree where I want to add the new continuation. What's wanted is ability to add a child continuation to an existing continuation without having to be inside the dynamic extent of the parent. Kernel provides just that ability, with extend-continuation (to add some algorithm between child and parent) and guard-continuation (to add guards between child and parent). (Interestingly, call/cc still has to be provided as a primitive, since it can't be derived from the other primitives.)

An illustrative example

Alas, no. Source code for continuation guarding is generally bulky and hard to read, and attempting to illuminate the basic concepts of continuation guarding that way would more likely just create an unhelpful mess. For me, at least, it's all best understood by visualizing the tree of nested dynamic scopes, whereas the source code only directly reveals a tree of nested static syntax; the dynamic tree then has to be deduced, and the two trees don't correspond as closely as one might think.

How best to arrange for a clear illustration, I'll have to give some more thought. Meanwhile, imagine there was an illustration here that made it all crystal clear to you.

Delimited continuations

I've also been asked to discuss guarded continuations versus delimited continuations. <cough> That's a rather large topic. I'd probably have to start by explaining why I think "delimited continuations" is a misnomer — by my reckoning, delimited continuations aren't continuations at all. They're functions, and functions are a completely different sort of creature than continuations. Continuations only have one side (input), which is why it's simply meaningless to talk about composing them. (Can you see I'm working up to a major rant, here?)

So I'm going to end this post here —it's already the longest post I've yet written on this blog— and hope to put together something on delimited continuations later. Perhaps that post will also help to further clarify guarded continuations, but we'll see. It's yet another item for my collection of in-development blog posts (along with guarded environments, another interesting topic this has reminded me of).

Tuesday, May 17, 2011

Dangerous things should be difficult to do by accident

Although this is a key principle for design in general —scarcely behind supporting what the system is being designed to do— I'm mainly interested in it here as a principle for designing programming languages.

Bicycles

That said, the metaphor I use to ground the principle is from mechanical engineering.

A once-popular bicycle design was the "penny-farthing", with a great big front wheel that the rider essentially sat on top of, and a small rear wheel. (Why "penny farthing"? The British penny was a large coin, and the farthing a small one.) The pedals were directly on the front wheel, and the handlebars were directly over it, turning on a vertical axis. What's wrong with that? Obvious problems are that the rider is too far up for their feet to reach the ground, so it's not easy to stop safely; there's a long way to fall; the rider is so far forward that it's easy to fall forward over the front; and it's easy to get one's feet, or clothing, caught in the spokes of the front wheel, especially when turning (as this causes the spokes to move in relation to the rider). All of which obvious problems are eliminated by the later "safety bicycle" design, which has two smaller wheels with the rider sitting between them, feet well away from the parts that turn when steering, and low enough that the rider can simply plant their feet on the ground when stopped.

The safety bike design also uses a chain to multiply the turning of the pedals into a higher speed than can be achieved with the penny-farthing (and multiplying speed was the reason the front wheel of the penny-farthing was made so big in the first place).

But another thing I find especially interesting about the safety bicycle design is another innovation: its steering axis —the axis along which the handle bars rotate the front wheel— is well off the vertical. This angle off the vertical (called the caster) means that the force of gravity, pulling the rider down toward the ground, tends to pull the front wheel toward the straight-forward position. In fact, the further the handle bars are turned to either side, the more gravity pushes them back out of the turn. That's inherently stable. (The fact that, riding at speed, both wheels act as gyroscopes doesn't hurt either.)

So the significant caster of the safety bicycle actively helps the rider to limit turns to intentional turns. That's the kind of designed-in safety factor a programming language should aspire to (and it's a tough standard to live up to).

Programming languages

Douglas McIlroy, at the 1969 Extensible Languages Symposium, described opposing philosophies of programming language design as "anarchist" and "fascist". The principle of accident avoidance illuminates both philosophies, and the relation between them. (McIlroy was not, BTW, playing favorites to either philosophy, so if you think this terminology is harder on one side or the other that may tell you something about your politics. :-).

C makes it easy to shoot yourself in the foot; C++ makes it harder, but when you do it blows your whole leg off.

— Bjarne Stroustrup (attributed)

The fascist approach to accident prevention is to simply prohibit dangerous behaviors — which prevents accidents caused by the prohibited behaviors, at the cost of, at least, (a) forcing the programmer to work around the prohibition and (b) prohibiting whatever gainful employment, if any, might otherwise be derived by exploiting the prohibited behavior. (There's a case claimed for the fascist approach based on ability to prove things about programs, but that's a subject for a different post; here I'm considering what may be done deliberately versus what may be done by accident, while that other post would concern what can be done versus what can be proven about it.) Drawbacks to this arrangement accrue from both (a) and (b), as workarounds in (a) are something else to get wrong and, moreover, when the programmer wants to overcome limitations of (b), this tends to involve subverting the fascist restrictions, rather than working with them, producing an inherently unstable situation (in contrast to the ideal of stability we saw with the significant caster of the safety bicycle).

The anarchist philosophy makes reliance on this design principle more obvious. You've got more opportunities to do dangerous things, so if there's something in the language design that causes you to do those things when you don't mean to —or that pushes you to mean to do them routinely, multiplying opportunities to do them wrong— that's going to be a highly visible problem with the language.

Most Lisps are pretty anarchic, and certainly my Kernel programming language is, supporting fexprs as it does. Dangerous things are allowed on general principle; and the whole language is, among other things, an exercise in strongly principled language design, so some explicit principle was clearly needed to keep the anarchy sane. Dangerous things should be difficult to do by accident was crafted for that sanity.

Which brings me back to something I said I'd post separately (in my earlier post about fexprs).

Hygiene in Kernel

Kernel is statically scoped, which is key to making fexprs manageable.

The static environment of a combiner is the environment of symbol-value bindings in effect at the point (in source code, and in time) where the combiner is defined. The dynamic environment of a call to a combiner is the environment of symbol-value bindings in effect at the point (in source code and in time) from which the combiner is called. The local environment of a combiner call is the environment of symbol-value bindings that are used, for that call, to interpret the body of the combiner. The local environment has some call-specific bindings, of the combiner's formal parameters to arguments (or operands) passed to the call; but then, for other symbols, the local environment refers to a parent environment. In a statically scoped combiner, the parent is the static environment; in a dynamically scoped combiner, the parent is the dynamic environment.

When a programmer writes a symbol into source code, their understanding of what that symbol refers to tends to be based on what else is declared in the same region of source code. For the actual meaning of the symbol to coincide with the programmer's expectation, most programming languages use statically scoped applicatives (applicatives meaning, as explained in my earlier post, that the operands are evaluated in the dynamic environment, and the resulting arguments are passed to the underlying combiner). This behavior —operands interpreted in the dynamic environment, combiner body interpreted in a local child of the static environment— is commonly called good hygiene.

Early Lisp was dynamically scoped. (How and why that happened, and then for a couple of decades got worse instead of better, are explored in Section 3.3 of my dissertation.) Under dynamic scope, when one writes a combiner, one doesn't generally know anything about what it will do when called: none of the non-local symbols have any fixed meaning that can be determined when-and-where the combiner is defined. Sometimes one actually wants this sort of dynamic behavior; but good hygiene is the "straight forward" behavior that anarchic Kernel gravitates toward as a stable state.

So, what does Kernel do to allow hygiene violations while maintaining a steady stabilizing gravity toward good hygiene?

The primitive constructor of compound operatives, $vau, is statically scoped; in addition to a formal parameter tree that matches the operands of the call, there is an extra environment parameter, a symbol that is bound in the local environment to the dynamic environment of the call. That makes it possible to violate hygiene — although it is actually commonly used to maintain hygiene, as will be shown. If you don't want to use the dynamic environment, you can use a special Kernel value, #ignore, in place of the environment parameter to prevent any local binding to the dynamic environment, so it's then impossible to accidentally invoke the dynamic environment.

The usual constructor for compound applicatives, $lambda, can be defined as follows; it simply transforms a combination ($lambda formals . body) into (wrap ($vau formals #ignore . body)).

($define! $lambda ($vau (formals . body) env (wrap (eval (list* $vau formals #ignore body) env))))

The very existence of $lambda is the first and simplest way Kernel pushes the programmer toward good hygiene: because $lambda is easier to use than $vau —and because they can't readily be mistaken for each other— the programmer will naturally use $lambda for most purposes, turning out hygienic combiners as a matter of course. Constructing an operative takes more work. Constructing an applicative that doesn't ignore its dynamic environment is even more laborious, requiring a composition of $vau with wrap, as in the standard derivation of get-current-environment:

($define! get-current-environment (wrap ($vau () e e)))

So good hygiene just takes less work than bad hygiene.

Once the programmer has decided to use $vau, gravitating toward good hygiene gets subtler. The above derivation of $lambda exemplifies the main tactics. In order to evaluate the operands in any environment at all, you typically use eval — and eval requires an explicit second argument specifying the environment in which to do the evaluation. And the most immediately available environment that can be specified is the one for which a local binding has been explicitly provided: the dynamic environment of the call.

A nuance here is that the expression to be evaluated will typically be cobbled together within the combiner, using some parts of the operand tree together with other elements introduced from elsewhere. This is the case for our derivation of $lambda, where the target expression has four elements — two parts of the operand tree, formals and body; literal constant #ignore; and $vau. For these cases, a key stabilizing factor is that the construction is conducted using applicative combinations, in which the parts are specified using symbols. Since the constructing combinations are applicative, those symbols are evaluated in the local environment during construction — so by the time the constructed expression is evaluated, all dependencies on local or static bindings have already been disposed of. The additional elements typically introduced this way are, in fact, combiners (and occasional constants), such as in this case $vau; and these elements are atomic, evaluating to themselves, so they behave hygienically when the constructed expression is later evaluated (as I noted in my earlier post under the subheading "fluently doing nothing").