Wednesday, July 31, 2013

Explicit evaluation

...if you cannot say what you mean, you can never mean what you say.
— Centauri Minister of Intelligence, Babylon 5.

Ever since I started this blog, I've had in mind to devote a post to the relationship between the strong theory of vau-calculus and the no-go theorem of Mitchell Wand's 1998 paper The Theory of Fexprs is Trivial.  Some of my past attempts to explain this relationship have failed miserably, though, so before trying again I wanted some new insight into how to approach the explanation.  Following my mention of this issue in my recent post on bypassing no-go theorems, and some off-blog discussion it provoked, I'm going to take a stab at it.

The trouble has always been that while my result is very simple, and Wand's result is very simple, and both treatments use quite conventional mathematical techniques (playing the game strictly according to Hoyle), they're based on bewilderingly different conceptual frameworks.  There's not just one difference, but several, and if you try to explain any one of them, your audience is likely to get hung up on one or more of the others.

Here are some highlights of how these conceptual differences stack up; I'll look at things in more detail hereafter.

  • Wand's paper assumes that the only equivalences of interest are those between source expressions; he acknowledges this explicitly in the paper.
  • Wand's paper isn't really about fexprs — it's about computational reflection, as can be seen in the Observations and Conclusions section at the end of the paper.
  • Wand's operational semantics has no terms that aren't source expressions.  This isn't a realistic depiction of fexprs as they occur in Lisp, but it is consistent with both his exclusive interest in the theory of source expressions and his interest in computational reflection.
  • Wand's operational semantics uses implicit evaluation rather than explicit evaluation:  that is, to distinguish between an expression to be evaluated and an expression not to be evaluated, he uses special contexts marking expressions that aren't to be evaluated, rather than special contexts marking expressions that are to be evaluated.  From a purely mathematical standpoint, a rewriting system using implicit evaluation always has a trivial theory, while a rewriting system using explicit evaluation can have quite a strong theory.  However, Wand's conceptual commitments prevent him both from using explicit evaluation and from caring about its consequences.  He cannot use explicit evaluation because the explicit evaluation contexts would not be representable in source expressions, violating his prohibition against non-source terms.  And, even if such expressions were allowed, all of the nontrivial equivalences in the theory would involve non-source terms, so that all the equivalences Wand is interested in would still be trivial (though for a different technical reason).

The deceptively simple math

Suppose we want a term-rewriting system in which any source expression S may sometimes appear as data and other times appear as program code to be executed.  We can do this in either of two ways:  either use a special context to mark a subterm that isn't to be evaluated, or use a special context to mark a subterm that is to be evaluated.

What Wand calls contextual equivalence, and I call operational equivalence, is a binary relation on terms defined thus (up to unimportant differences between treatments):

T1 ≅ T2  iff for every context C and observable V,  C[T1] ↦* V  iff   C[T2] ↦* V
But this formal definition interacts differently with the two different approaches to marking whether S is to be evaluated.

Suppose we use a special context to mark a subterm that isn't to be evaluated.  I call this strategy implicit evaluation, because a subterm is implicitly to be evaluated unless we go out of our way to say otherwise.  This strategy would naturally occur to a traditional Lisp programmer, who is used to quotation.  If this special context is identified by an operator, likely we'd call it "quotation"; but whatever we call it, suppose Q is a context that marks its subterm as data.  This means, formally, that Q[S] is itself an observable, and that Q[S1] and Q[S2] are different observables unless S1 and S2 are syntactically identical.  It immediately follows, from our definition of operational equivalence, that no two source terms S1 and S2 can ever be operationally equivalent unless they are syntactically identical.  When you add all the trappings in Wand's paper, this trivialization of theory might look more complicated, but it still comes down to this:  if there's a context that converts arbitrary terms into irreducible observables, then the operational equivalence relation is trivial.
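
To see the trivializing context concretely, here is a small sketch in ordinary Scheme (Scheme's quote standing in for Q; not part of Wand's formalism).  The expressions (+ 1 2) and 3 agree whenever they're evaluated, but a quotation context observes their syntax and tells them apart:

(equal? (+ 1 2) 3)        ; => #t   evaluated, the two expressions agree
(equal? '(+ 1 2) '3)      ; => #f   under quote, their syntax is the observable

Any two syntactically distinct source expressions can be distinguished this way, which is all the trivialization argument needs.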

But suppose we use the second strategy, a special context to mark a subterm that is to be evaluated.  I call this strategy explicit evaluation because a subterm is only evaluated if it's immediately surrounded by a context that explicitly says to evaluate it.  Then the cause of Wand's trivialization of theory "simply" vanishes.  Except that the consequences aren't at all simple.  The operational equivalence relation is really measuring something different under explicit evaluation than it did under implicit evaluation (under explicit evaluation, source terms have trivial theory because they're irreducible, whereas under implicit evaluation they have trivial theory even though they aren't irreducible).  So now I'll go back to square one and build the mathematical machinery with more care.

Dramatis personae

In classical small-step operational semantics, we have a term set, and a number of binary relations on terms.

The term set includes both source expressions and the observable results of computations, and possibly also some terms that represent states of computation that are neither source expressions nor observables.

Six key properties that a binary relation on terms might have: 

  • Reflexive:  T1 > T1
  • Transitive:  if  T1 > T2  and  T2 > T3  then  T1 > T3
    The reflexive transitive closure ("zero or more steps") is written by suffixing "*" to the relation, as  T1 >* T2
  • Commutative:  if  T1 > T2  then  T2 > T1
    A reflexive transitive commutative relation is called an equivalence.
  • Compatible:  if  T1 > T2  then for all contexts C,  C[T1] > C[T2]
    A compatible equivalence is called a congruence.
  • Church-Rosser:  If  T1 >* T2  and  T1 >* T3  then there exists T4 such that  T2 >* T4  and  T3 >* T4
  • Deterministic:  For each T1, there is at most one T2 such that T1 > T2
    This is not usually a mathematically desirable property, because it interferes with other properties like compatibility.  (For example, if a term contains two distinct redexes, compatibility lets either one be reduced first, giving that term more than one successor.)  However, determinism may be semantically desirable (i.e., we want a program to behave the same way each time it's run).
Four key binary relations in the small-step semantics of a programming language:
  • Operational step, written  ↦
    A directed relation meant to "obviously" describe what programs in the language are supposed to do.  Because it's meant to be obviously right, it doesn't try to have nice mathematical properties like Church-Rosser-ness.  Expected to be deterministic and non-compatible.
  • Calculus step, written  →
    A relation meant to have lots of nice mathematical properties, like the reduction step relation of lambda-calculus.  The calculus step is generally defined to be compatible, and if it isn't Church-Rosser, something's wrong.
  • Calculus equality, written  =
    The reflexive transitive commutative closure of the calculus step (thus, the smallest equivalence relation containing the calculus step).
  • Operational equivalence, written  ≅
    T1 ≅ T2  iff for every context C and observable V,  C[T1] ↦* V  iff   C[T2] ↦* V
The master theorems one wants to prove about these relations are
  • completeness:
    ↦*  implies  →*
  • Church-Rosser-ness:
    →  is Church-Rosser
  • soundness:
    =  implies  ≅
(There's another "master theorem" we could include on this list, called standardization, but that really is way more infrastructure than we need here.)

Wand's operational semantics

The syntax of terms in Wand's semantics is:

T   ::=   x | (λx.T) | (TT) | (fexpr T) | (eval T)
That is, a term is either a variable, or a lambda-expression, or a combination, or a fexpr, or a call to eval.  Wand explained he didn't want to have two different binding constructs, one for ordinary procedures and one for fexprs.  Well, I didn't either; it's characteristic of our different approaches, though, that where Wand made his binding construct a constructor of applicatives (that evaluate their argument), and put a wrapper around it to cause the operand to be quoted, I made my binding construct a constructor of operatives (i.e., fexprs), and put a wrapper around it to cause the operand to be evaluated.

Wand has three rewriting rules: one for applying a λ-expression, one for applying a fexpr, and one for applying eval.  If we were being naive, we might try to define a calculus step like this:

((λx.T)V)   →   T[x ← V]
((fexpr V)T)   →   (V encode(T))
(eval T)   →   decode(T)
Here, T is a term, and V is a value, that is, a term that's been reduced as much as it can be.  (Obscure point:  a value is a term that's irreducible by the operational step.  Don't worry about it.)

The encode/decode functions require some explanation.  The point of not evaluating certain subterms is to be able to use those unevaluated subterms as data.  So somehow you have to represent them in an accessible data form.  One way to do this would be to use a special quotation operator, and then have accessors that act on quoted expressions, like  (car (quote (T1 . T2))) → (quote T1).  However, when you're proving a no-go theorem, as Wand was doing, you want to minimize the assumptions you have to make to demonstrate the no-go result.  So Wand would naturally want to avoid introducing those extra operators, quote, car, and so on.  Instead, Wand used a Mogensen-Scott encoding, which, given an arbitrary term T, produces a term built up out of lambda-expressions that is not itself reducible at all but which can be queried to extract information about T.
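
As a rough illustration of the flavor of such an encoding, here is a sketch in Scheme (not Wand's actual construction): each syntactic case of a term is represented as a procedure that hands its components to a per-case handler, so the encoded term is inert but can still be queried.

(define (encode-var x)      (lambda (on-var on-app on-lam) (on-var x)))
(define (encode-app m n)    (lambda (on-var on-app on-lam) (on-app m n)))
(define (encode-lam x body) (lambda (on-var on-app on-lam) (on-lam x body)))

(define (application? t)    ; query an encoded term without reducing it
  (t (lambda (x) #f) (lambda (m n) #t) (lambda (x b) #f)))

Here m, n, and body would themselves be encoded terms; in Wand's setting the same trick is done with pure lambda-expressions rather than Scheme procedures.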

Unfortunately this definition of a calculus step doesn't work, exactly because these rules assume → is compatible, and the whole point of Wand's exercise is that subterm rewriting isn't allowed in all contexts.  However, we can define the operational semantics step, which isn't compatible anyway.  For the deterministic operational step, we define an "evaluation context", which is a context that determines where the next redex (reducible expression) can occur.  We have

E   ::=   ⎕ | (ET) | ((λx.T)E) | (fexpr E) | (eval E)
That is, starting from the top of the syntax tree of a term, we can reduce a redex at the top of the tree, or we can descend into the operator of a combination, or into the operand of a λ-expression, or into the body of a fexpr, or into the operand of eval.  Using this, we can define the operational step.
E[((λx.T)V)]   ↦   E[T[x ← V]]
E[((fexpr V)T)]   ↦   E[(V encode(T))]
E[(eval T)]   ↦   E[decode(T)]

Preprocessing as a cure for trivialization

Wand credited Albert Meyer for noting the trivialization effect.  Meyer, though, noted that quote-eval (as opposed to fexprs) need not cause trivialization.  (Meyer made this supplementary observation at least as early as 1986; see here, puzzle 3.)

The key to this non-trivialization by quotation is that quotation in traditional Lisp is a "special form", which is to say that the quotation operator, which identifies its argument as not to be evaluated, is fixed at compile-time (that is, before program evaluation begins).  So, given a Lisp source program, we can run a preprocessor over it and rewrite each quoted expression to avoid quoting more than a single symbol at a time.  (For example, one might rewrite  ($quote (+ 1 2))  as  (cons ($symbol-quote +) (cons 1 (cons 2 ()))), a style of rewriting that requires only quotation of individual symbols.)  And then the term-rewriting system doesn't have to have quotation in it at all (only, in the example, symbol-quotation, a much more restricted facility that would only trivialize the formal theory of individual symbols, not of arbitrary source expressions.)
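
A minimal sketch of such a preprocessor in Scheme, using the hypothetical $symbol-quote from the example above:

(define (expand-quotation datum)     ; rewrite the body of a ($quote ...) form
  (cond ((symbol? datum) (list '$symbol-quote datum))
        ((pair? datum)   (list 'cons
                               (expand-quotation (car datum))
                               (expand-quotation (cdr datum))))
        (else            datum)))    ; numbers, (), etc. pass through unchanged

(expand-quotation '(+ 1 2))   ; => (cons ($symbol-quote +) (cons 1 (cons 2 ())))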

Wand's Mogensen-Scott encoding is a similar transformation to this preprocessing of Lisp expressions using cons etc. (setting aside how the encoding treats variables, which is a whole other can of worms because lambda-calculus variables behave very differently from Lisp symbols).

However, the encoding in Wand's paper doesn't prevent trivialization, because it's used during term reduction.  Our Lisp preprocessor eliminated general quotation before term reduction ever started, so that general quotation didn't even have to be represented in the term syntax for the operational semantics.  If you wait until reduction has already started, it's too late for the encoding to matter to trivialization; you already have a trivializing context, and the encoding merely facilitates data access.

This, however, raises a curious possibility:  what if we preprocessed Wand's language by encoding the arguments to all the operators?  Then, instead of encoding an operand when applying a fexpr to it, we decode an operand when applying a non-fexpr to it (or when evaluating it, a case that Wand already provides for — except that when evaluating a term, we would only decode the operators, not their operands, since we don't decode operands until we know they're to be evaluated).

One thing about this:  if we're going to decode an operand of a non-fexpr, we need to keep track of that by somehow changing the operator to indicate we've already done the decoding.  The simplest way to do this is to wrap non-fexprs (as my treatment does) instead of wrapping fexprs (as Wand's does).  There are other, clumsier ways to do it without altering Wand's syntax, but I'll go ahead and change the syntax.

T   ::=   x | (λx.T) | (TT) | (wrap T) | (eval T)
Notice the wrapper is now called wrap instead of fexpr, but the basic binding constructor of combiners is still called λ.  Actually, in my dissertation it's called vau rather than lambda (hence "vau-calculus"), but there's also an interesting point to be made by leaving it as λ (setting aside that unicode doesn't have a character for the way I write a lower-case vau).  It's just arbitrarily chosen syntax, after all... right?

Our evaluation contexts for the operational step change slightly because, beyond the switch from fexpr to wrap, we're now willing for the operational step to reduce an operand whenever the operator is a value, regardless of whether the operator is a fexpr.

E   ::=   ⎕ | (ET) | (VE) | (wrap E) | (eval E)
The operational step rules differ mainly in their treatment of encoding and decoding.  There is no encoding during reduction.  As for decoding, it is only partial.  We decode everything except the operands of combinations.  (Specifying this precisely would be tedious and contribute nothing to the discussion, but since that was already true of the encoding/decoding, we haven't specified it in the first place.)
E[((λx.T)V)]   ↦   E[T[x ← V]]
E[((wrap V)T)]   ↦   E[(V partial-decode(T))]
E[(eval T)]   ↦   E[partial-decode(T)]
Note that the first rule, for applying a λ-expression to an operand, hasn't changed at all, even though a λ-expression is now an operative where in Wand's treatment it was an applicative.

Given a source program, before we set the operational step loose on the term, we preprocess the program by fully encoding it, and then partial-decoding it.  Hence the complete absence of encoding in the rules of the operational semantics.  This has the expected, but to me still rather stunning, effect that there is no longer any context in the rewriting system that has to be off-limits to rewriting.  We can therefore now define a valid compatible calculus step relation by simply removing the evaluation contexts from the operational step rules:

((λx.T)V)   →   T[x ← V]
((wrap V)T)   →   (V partial-decode(T))
(eval T)   →   partial-decode(T)
I'll go only slightly out on a limb and say that all three master theorems probably hold for this arrangement — completeness, Church-Rosser-ness, and soundness.  (Completeness is obvious by construction.)  But what's really interesting here is that this arrangement both violates Wand's restriction on the term set and produces an equivalence relation that doesn't correspond to the one in his paper.

The term-set restriction is violated because source expressions, which Wand was studying exclusively, are now the things that are input to the preprocessor, and the preprocessor maps those source expressions into a proper subset of the terms in our semantics.  And the contextual equivalence on source expressions that Wand studied is concerned with these source expressions that don't even effectively belong to our term set at all, since they're mapped through the refractive lens of this encode/partial-decode preprocessor.  Yes, our ≅ is no longer a trivial equivalence — but Wand's notion of contextual equivalence of source expressions would require those source expressions to have interchangeable full encodings (in case they both appear as an operand to a fexpr), and that still won't happen unless the source expressions before preprocessing were syntactically identical.  The source-expression equivalence Wand was interested in is still trivial, and our non-trivial ≅ is simply not relevant under the treatment used in his paper.

One last thought about this preprocessed, pseudo-explicit-evaluation variant of Wand's semantics.  The first rule of the calculus step is the call-by-value β-rule.  So this calculus is (if I've not made a goof somewhere) a conservative extension of call-by-value lambda-calculus, and its equational theory includes all the equations in the theory of call-by-value lambda-calculus plus others.

Vau-calculus

Although my pseudo-explicit-evaluation variant of Wand's system does demonstrate some of the principles involved, the encoding/decoding and the preprocessor are obstacles to extracting clear insights from it.  Unlike Wand, I'm not trying to prove a no-go theorem; I'm showing that something can be done, so my interest lies in making it clear how to do it, and I'll happily introduce more syntax in order to avoid a complication such as Mogensen-Scott encoding, or preprocessing.

Instead of distinguishing data from to-be-evaluated expressions by means of a deep encoding (in which the entire term is transformed, from its root to its leaves), I simply introduce syntax for fully representing Lisp source expressions as data structures, entirely separate from the syntax for things like combinations and procedures (noting that procedures cannot be expressed by Lisp source expressions:  you write a source expression that would evaluate to a procedure, but until you commit to evaluating it, it's not a procedure but a structure built out of atomic data, symbols, pairs, and the empty list).

S   ::=   d | ()
T   ::=   S | s | (T . T)
Here, d is any atomic datum (archetypically, a numeric constant), and s is any symbol (which is another kind of datum, and not to be confused with a variable which is not a source-code element at all).  I've separated out the atomic data and the empty list into a separate nonterminal S, mnemonic for "Self-evaluating", because these kinds of terms will be treated separately for purposes of evaluation.

All of the terms constructed by the above rules are data, by which I mean, they're irreducible observables; none of those terms can ever occur on the left side of a calculus step.  (Therefore, given the master theorems, the operational theory of those terms is trivial.)  There are just two syntactic contexts that can ever form a redex; I'll use nonterminal symbol A for these, mnemonic for "active":

S   ::=   d | ()
A   ::=   [eval T T] | [combine T T T]
T   ::=   S | s | (T . T) | A
The intent here (to be realized by calculus step rules) is that  [eval T1 T2]  represents scheduled evaluation of term T1 in environment T2, while  [combine T1 T2 T3]  represents scheduled calling of combiner T1 with parameters T2 in environment T3.

There's also one other class of syntax:  computational results that can't be represented by source code.  There are three kinds of these:  environments (which we will clearly need to support our evaluation rules), operatives, and applicatives.  We'll use "e" for environments and "O" for operatives.

O   ::=   [vau x.T] | ...
S   ::=   d | () | e | O
A   ::=   [eval T T] | [combine T T T]
T   ::=   x | S | s | (T . T) | [wrap T] | A
The syntactic form of operative expressions, O, is interesting in itself, and in fact even the inclusion of traditional λ-expressions (which is what you see there, although the operator is called "vau" instead of "λ") is worthy of discussion.  I figure those issues would distract from the focus of this post, though, so I'm deferring them for some future post.  [Those issues arise in a later post, here.]

Here, then, are the rules of the calculus step (omitting rules for forms of operatives that I haven't enumerated above).

[eval S e]   →   S
[eval s e]   →   lookup(s,e)     if lookup(s,e) is defined
[eval (T1 . T2) e]   →   [combine [eval T1 e] T2 e]
[eval [wrap T] e]   →   [wrap [eval T e]]
[combine [vau x.T] V e]   →   T[x ← V]
[combine [wrap T0] (T1 ... Tn) e]   →   [combine T0 ([eval T1 e] ... [eval Tn e]) e]
In the fully expounded vau-calculus, the additional forms of operatives do things like parsing the operand list, and invoking primitives (such as car, and of course $vau).

Under these rules, every redex —that is, every term that can occur on the left side of a calculus step rule— is either an eval or a combine:  reduction occurs only at points where it is explicitly scheduled via eval or combine; and if it is explicitly scheduled to occur at a point in the syntax tree, that reduction cannot be prevented by any surrounding context.

The calculus step rules are, in essence, the core logic of a Lisp interpreter.
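
For comparison, here is a minimal sketch of that core in Scheme; the helper names lookup, operative?, unwrap, and operate are assumptions for the sketch, not Kernel's actual interface.  Each clause mirrors one of the rules above.

(define (k-eval expr env)
  (cond ((symbol? expr) (lookup expr env))              ; [eval s e] -> lookup(s,e)
        ((pair? expr)                                   ; [eval (T1 . T2) e]
         (k-combine (k-eval (car expr) env) (cdr expr) env))
        (else expr)))                                   ; [eval S e] -> S

(define (k-combine combiner operands env)
  (if (operative? combiner)
      (operate combiner operands env)                   ; operands passed unevaluated
      (k-combine (unwrap combiner)                      ; applicative: evaluate operands,
                 (map (lambda (t) (k-eval t env)) operands)
                 env)))                                 ; then combine the unwrapped combiner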

This calculus allows us to make much more fluent statements about the evaluation-behavior of terms than mere operational equivalence.  We can say, for example, that two terms T1 and T2 would be indistinguishable if evaluated in any environment, no matter what that environment is (and even though T1 and T2 may not themselves be operationally equivalent).

For all environments e,   [eval T1 e] ≅ [eval T2 e]
Thanks to the definition of ≅, when we say  [eval T1 e] ≅ [eval T2 e], we mean
For all contexts C and observables V,   C[eval T1 e] ↦* V  iff  C[eval T2 e] ↦* V
This is evidently not the question Wand's treatment asks with its contextual equivalence.  However, we can also define Wand's relation in this framework.  Let e0 be a standard environment.  Then,
T1∼T2   iff   for all contexts C,  [eval C[T1] e0] ≅ [eval C[T2] e0]
Relation ∼ is Wand's contextual equivalence.  And, indeed, for all source expressions S1 and S2,  S1∼S2  iff  S1 and S2 are syntactically identical.

Finally, note what happens if we omit from this calculus all of the syntax for symbols and lists, along with its associated machinery (notably, eval and environments; the only change this makes to the remaining elements is that we drop the third operand to [combine ...]).

O   ::=   [vau x.T]
S   ::=   d | O
A   ::=   [combine T T]
T   ::=   x | S | A
[combine [vau x.T] V]   →   T[x ← V]
Do you recognize it?  You should.  It's the call-by-value lambda-calculus.

[edit: besides some minor typos here and there, this post had originally omitted the vau-calculus rule for combining an applicative, [combine [wrap T0] ...].]

Thursday, July 18, 2013

Bypassing no-go theorems

This is not at all what I had in mind.
— Albert Einstein, in response to David Bohm's hidden variable theory.

A no-go theorem is a formal proof that a certain kind of theory cannot work.  (The term no-go theorem seems to be used mainly in physics; I find it useful in a more general context.)

A valid no-go theorem identifies a hopeless avenue of research; but in some cases, it also identifies a potentially valuable avenue for research.  This is because in some cases, the no-go theorem is commonly understood more broadly than its actual technical result.  Hence the no-go theorem is actually showing that some specific tactic doesn't work, but is interpreted to mean that some broad strategy doesn't work.  So when you see a no-go theorem that's being given a very broad interpretation, you may do well to ask whether there is, after all, a way to get around the theorem, by achieving what the theorem is informally understood to preclude without doing what the theorem formally precludes.

In this post, I'm going to look at four no-go theorems with broad informal interpretations.  Two of them are in physics; I touch on them briefly here, as examples of the pattern (having explored them in more detail in an earlier post).  A third is in programming-language semantics, where I found myself with a result that bypassed a no-go theorem of Mitchell Wand.  And the fourth is a no-go theorem in logic that I don't actually know quite how to bypass... or even whether it can be bypassed... yet... but I've got some ideas where to look, and it's good fun to have a go at it:  Gödel's Theorem.

John von Neumann's no-go theorem

In 1932, John von Neumann proved that no hidden variable theory can make exactly the same predictions as quantum mechanics (QM):  all hidden variable theories are experimentally distinguishable from QM.  In 1952, David Bohm published a hidden variable theory experimentally indistinguishable from QM.

How did Bohm bypass von Neumann's no-go theorem?  Simple, really.  (If bypassing a no-go theorem is possible at all, it's likely to be very simple once you see how).  The no-go theorem assumed that the hidden variable theory would be local; that is, that under the theory, the effect of an event in spacetime cannot propagate across spacetime faster than the speed of light.  This was, indeed, a property Einstein wanted out of a hidden variable theory:  no "spooky action at a distance".  But Bohm's hidden variable theory involved a quantum potential field that obeys Schrödinger's Equation — trivially adopting the mathematical infrastructure of quantum mechanics, spooky-action-at-a-distance and all, yet doing it in a way that gave each particle its own unobservable precise position and momentum.  Einstein remarked, "This is not at all what I had in mind."

John Stewart Bell's no-go theorem

In 1964, John Stewart Bell published a proof that all local hidden variable theories are experimentally distinguishable from QM.  For, of course, suitable definition of "local hidden variable theory".  Bell's result can be bypassed by formulating a hidden variable theory in which signals can propagate backwards in time — an approach advocated by the so-called transactional interpretation of QM, and which, as noted in my earlier post on metaclassical physics, admits the possibility of a theory that is still "local" with respect to a fifth dimension of meta-time.

Mitchell Wand's no-go theorem

In 1998, Mitchell Wand published a paper The Theory of Fexprs is Trivial.

The obvious interpretation of the title of the paper is that if you include fexprs in your programming language, the theory of the language will be trivial.  When the paper first came out, I had recently hit on my key insight about how to handle fexprs, around which the Kernel programming language would grow, so naturally I scrutinized Wand's paper very closely to be sure it didn't represent a fundamental threat to what I was doing.  It didn't.  I might put Wand's central result this way:  If a programming language has reflection that makes all computational states observable in the syntactic theory of the language, and if computational states are in one-to-one correspondence with syntactic forms, then the syntactic theory of the language is trivial.  This isn't a problem for Kernel because neither of these conditions holds:  not all computational states are observable, and computational states are not in one-to-one correspondence with syntactic forms.  I could make a case, in fact, that in S-expression Lisp, input syntax represents only data:  computational states cannot be represented using input syntax at all, which means both that the syntactic theory of the language is already trivial on conceptual grounds, and also that the theory of fexprs is not syntactic.

At the time I started writing my dissertation, the best explanation I'd devised for why my theory was nontrivial despite Wand was that Wand did not distinguish between Lisp evaluation and calculus term rewriting, whereas for me Lisp evaluation was only one of several kinds of term rewriting.  Quotation, or fexprs, can prevent an operand from being evaluated; but trivialization results from a context preventing its subterm from being rewritten.  It's quite possible to prevent operand evaluation without trivializing the theory, provided evaluation is a specific kind of rewriting (requiring, in technical parlance, a redex that includes some context surrounding the evaluated term).

Despite myself, though, I was heavily influenced by Wand's result and by the long tradition it followed.  Fexprs had been rejected circa 1980 as a Lisp paradigm, in favor of macros.  A rejected paradigm is usually ridiculed in order to rally followers more strongly behind the new paradigm (see here).  My pursuit of $vau as a dissertation topic involved a years-long process of gradually ratcheting up expectations.  At first, I didn't think it would be worth formulating a vau-calculus at all, because of course it wouldn't be well-enough behaved to justify the formulation.  Then I thought, well, an operational semantics for an elementary subset of Kernel would be worth writing.  Then I studied Plotkin's and Felleisen's work, and realized I could provide a semantics for Kernel that would meet Plotkin's well-behavedness criteria, rather than the slightly weakened criteria Felleisen had used for his side-effectful calculus.  And then came the shock, when I realized that the vau-calculus I'd come up with, besides being essentially as well-behaved as Plotkin's call-by-value lambda-calculus, was actually (up to isomorphism) a conservative extension of call-by-value lambda-calculus.  In other words, my theory of fexprs consisted of the entire theory of call-by-value lambda-calculus plus additional theorems.

I was boggled.  And I was naively excited.  I figured, I'd better get this result published, quick, before somebody else notices it and publishes it first — because it's so incredibly obvious, it can't be long before someone else does find it.  Did I say "naively"?  That's an understatement.  There's some advice for prospective graduate students floating around, which for some reason I associate with Richard Feynman (though I could easily be wrong about that [note: a reader points out this]), to the effect that you shouldn't be afraid people will steal your ideas when you share them, because if your ideas are any good you'll have trouble getting anyone to listen to them at all.  In studying this stuff for years on end I had gone so far down untrodden paths that I was seeing things from a drastically unconventional angle, and if even so I had only just come around a corner to where I could see this thing, others were nowhere close to any vantage from which they could see it.

[note: I've since written a post elaborating on this, Explicit evaluation.]
Kurt Gödel's no-go theorem

Likely the single most famous no-go theorem around is Gödel's Theorem.  (Actually, it's Gödel's Theorems, plural, but the common informal understanding of the result doesn't require this distinction — and Gödel's result lends itself spectacularly to informal generalization.)  This is what I'm going to spend most of this post on, because, well, it's jolly good fun (recalling the remark attributed to Abraham Lincoln: People who like this sort of thing will find it just the sort of thing they like).

The backstory to Gödel was that in the early-to-mid nineteenth century, mathematics had gotten itself a shiny new foundation in the form of a formal axiomatic approach.  And through the second half of the nineteenth century mathematicians expanded on this idea.  Until, as the nineteenth century gave way to the twentieth, they started to uncover paradoxes implied by their sets of axioms.

A perennial favorite (perhaps because it's easily explained) is Russell's Paradox.  Let A be the set of all sets that do not contain themselves.  Does A contain itself?  Intuitively, one can see at once that if A contains itself, then by its definition it does not contain itself; and if it does not contain itself, then by its definition it does contain itself.  The paradox mattered for mathematicians, though, for how it arose from their logical axioms, so we'll be a bit more precise here.  The two key axioms involved are reductio ad absurdum and the Law of the Excluded Middle.

Reductio ad absurdum says that if you suppose a proposition P, and under this supposition you are able to derive a contradiction, then not-P.  Supposing A contains itself, we derive a contradiction, therefore A does not contain itself. Supposing A does not contain itself, we derive a contradiction, therefore —careful!— A does not not contain itself. This is where the Law of the Excluded Middle comes in:  A either does or does not contain itself, therefore since it does not not contain itself, it does contain itself.  We have therefore an antinomy, that is, we've proved both a proposition P and its negation not-P (A both does and does not contain itself).  And antinomies are really bad news, because according to these axioms we've already named, if there is some proposition P for which you can prove both P and not-P, then you can prove every proposition, no matter what it is.  Like this:  Take any proposition Q.  Suppose not-Q; then P and not-P, which is a contradiction, therefore by reductio ad absurdum, not-not-Q, and by the Law of the Excluded Middle, Q.

When Russell's Paradox was published, the shiny new axiomatic foundations of mathematics were still less than a human lifetime old.  Mathematicians started trying to figure out where things had gone wrong.  The axioms of classical logic were evidently inconsistent, leading to antinomies, and the Law of the Excluded Middle was identified as a problem.

One approach to the problem, proposed by David Hilbert, was to back off to a smaller set of axioms that were manifestly consistent, then use that smaller set of axioms to prove that a somewhat larger set of axioms was consistent.  Although clearly the whole of classical logic was inconsistent, Hilbert hoped to salvage as much of it as he could.  This plan to use a smaller set of axioms to bootstrap consistency of a larger set of axioms was called Hilbert's program, and I'm remarking it because we'll have occasion to come back to it later.

Unfortunately, in 1931 Kurt Gödel proved a no-go theorem for Hilbert's program:  that for any reasonably powerful system of formal logic, if the logic is consistent, then it cannot prove the consistency of its own axioms, let alone its own axioms plus some more on the side.  The proof ran something like this:  For any sufficiently powerful formal logic M, one can construct a proposition A of M that amounts to "this proposition is unprovable".  If A were provable, that would prove that A is false, an antinomy; if not-A were provable, that would prove that A is true, again an antinomy; so M can only be consistent if both A and not-A are unprovable.  But if M were able to prove its own consistency, that would prove that A is unprovable (because A must be unprovable in order for M to be consistent), which would prove that A is true, producing an antinomy, and M would in fact be inconsistent.  Run by that again:  If M can prove its own consistency, then M is in fact inconsistent.

Typically, on completion of a scientific paradigm shift, the questions that caused the shift cease to be treated as viable questions by new researchers; research on those questions tapers off rapidly, pushed forward only by people who were already engaged by those questions at the time of the shift.  So it was with Gödel's results.  Later generations mostly treated them as the final word on the foundations of mathematics:  don't even bother, we know it's impossible.  That was pretty much the consensus view when I began studying this stuff in the 1980s, and it's still pretty much the consensus view today.

Going there

Having been trained to think of Gödel's Theorem as a force of nature, I nevertheless found myself studying it more seriously when writing the theoretical background material for my dissertation.  I found myself discoursing at length on the relationship between mathematics, logic, and computation, and a curious discrepancy caught my eye.  Consider the following Lisp predicate.

($define! A ($lambda (P) (not? (P P))))
Predicate A takes one argument, P, which is expected to be a predicate of one argument, and returns the negation of what P would return when passed to itself.  This is a direct Lisp translation of Russell's Paradox.  What happens when A is passed itself?

The answer is, when A is passed itself, (A A), nothing interesting happens — which is really very interesting.  The predicate attempts to recurse forever, never terminating; in practice it will eventually fill up all available memory with a stack of pending continuations and terminate with an error.  What it won't do is cause mathematicians to despair of finding a solid foundation for their subject.  If asking whether set A contains itself is so troublesome, why is applying predicate A to itself just a practical limit on how predicate A should be used?
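
A Scheme transliteration makes the behavior easy to watch (a sketch; not? is the Kernel name, Scheme's is not):

(define (A P) (not (P P)))
; (A A)  ->  (not (A A))  ->  (not (not (A A)))  ->  ...
; the negations pile up as pending continuations, and no answer ever emerges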

That question came from my dissertation.  Meanwhile, another question came from the other major document I was developing, the R-1RK.  I wanted to devise a uniquely Lisp-ish variant of the concept of eager type-checking.  It seemed obvious to me that there should not be a fixed set of rules of type inference built into the language; that lacks generality, and is not extensible.  So my idea was this:  In keeping with the philosophy that everything should be first-class, let theorems about the program be an encapsulated type of first-class objects.  And carefully design the constructors for this theorem type so that you can't construct the object unless it's provable.  In effect, the constructors are the axioms of the logic.  Modus ponens, say, is a constructor:  given a theorem P and a theorem P-implies-Q, the modus-ponens constructor allows you to construct a theorem Q.  As desired, there is no built-in inference engine:  the programmer takes ultimate responsibility for figuring out how to prove things.
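
Here is a rough sketch of the idea in Scheme, with hypothetical names (this is not the R-1RK design): the theorem type is encapsulated, and the only way to obtain a theorem is through constructors that embody axioms, such as modus ponens.

(define-record-type theorem              ; make-theorem would be kept private,
  (make-theorem prop)                    ; e.g. behind a module boundary
  theorem?
  (prop theorem-prop))

(define (modus-ponens p p-implies-q)     ; from P and (implies P Q), construct Q
  (let ((impl (theorem-prop p-implies-q)))
    (if (and (pair? impl) (eq? (car impl) 'implies)
             (equal? (cadr impl) (theorem-prop p)))
        (make-theorem (caddr impl))
        (error "modus-ponens: premises do not fit"))))

Since every constructor checks its premises, any theorem value a program manages to hold is provable by construction; the programmer supplies the proof by composing constructors.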

Of the many questions in how to design such a first-class theorem type, one of the early ones has to be, what system of axioms should we use?  Clearly not classical logic, because we know that would give us an inconsistent system.  This though was pretty discouraging, because it seemed I would find myself directly confronting in my design the very sort of problems that Gödel's Theorem says are ultimately unsolvable.

But then I had a whimsical thought; the sort of thing that seems at once not-impossible and yet such a long shot that one can just relax and enjoy exploring it without feeling under pressure to produce a result in any particular timeframe (and yet, I have moved my thinking forward on this over the years, which keeps it interesting).  What if we could find a way to take advantage of the fact that our logic is embedded in a computational system, by somehow bleeding off the paradoxes into mere nontermination?  So that they produce the anticlimax of functions that don't terminate instead of the existential angst of inconsistent mathematical foundations?

Fragments

At this point, my coherent vision dissolves into fragments of tentative insight, stray puzzle pieces I'm still pushing around hoping to fit together.

One fragment:  Alfred Tarski —who fits the aforementioned profile of someone already engaged by the questions when Gödel's results came out— suggested post-Gödel that the notion of consistency should be derived from common sense.  Hilbert's program had actually pursued a formal definition of consistency, as the property that not all propositions are provable; this does have a certain practicality to it, in that the practical difficulty with the classical antinomies was that they allowed all propositions Q to be proven, so that "Q can be proven" ceased to be an informative statement about Q.  Tarski, though, remarked that when a non-mathematician is told that both P and not-P are true, they can see that something is wrong without having to first receive a lecture on the formal consequences of antinomies in interaction with reductio ad absurdum.

So, how about we resort to some common sense, here?  A common-sensical description of Russell's Paradox might go like this:

A is the set of all sets that do not contain themselves.  If A contains itself, then it does not contain itself.  But if it does not contain itself, then it does contain itself.  But if it does contain itself, then it does not contain itself.  But if it does not contain itself...
And that is just what we want to happen:  instead of deriving an antinomy, the reasoning just regresses infinitely.  A human being can see very quickly that this is going nowhere, and doesn't bother to iterate beyond the first four sentences at most (and once they've learned the pattern, next time they'll probably stop after even fewer sentences), but they don't come out of the experience believing that A both does and does not belong to itself; they come out believing that there's no way of resolving the question.

So perhaps we should be asking how to make the conflict here do an infinite regress instead of producing a (common-sensically wrong) answer after a finite number of steps.  This seems to be some sort of deep structural change to how logical reasoning would work, possibly not even a modification of the axioms at all but rather of how they are used.  If it does involve tampering with an axiom, the axiom tampered with might well be reductio ad absurdum rather than the Law of the Excluded Middle.

This idea — tampering with reductio ad absurdum rather than the Law of the Excluded Middle — strikes a rather intriguing historical chord.  Because, you see, one of the mathematicians notably pursuing Hilbert's program pre-Gödel did try to eliminate the classical antinomies by leaving intact the Law of the Excluded Middle and instead modifying reductio ad absurdum.  His name was Alonzo Church —you may have heard of him— and the logic he produced had, in retrospect, a curiously computational flavor to it.  While he was at it, he took the opportunity to simplify the treatment of variables in his logic, by having only a single structure that binds variables, which (for reasons lost in history) he chose to call λ.  Universal and existential quantifiers in his logic were higher-order functions, which didn't themselves bind variables but instead operated on functions that did the binding for them.  Quite a clever device, this λ.

Unfortunately, it didn't take many years after Church's publication to show that antinomies arose in his system after all.  Following the natural reflex of Hilbert's program, Church tried to find a subset of his logical axioms that could be proven consistent — and succeeded.  It turned out that if you left out all the operators except λ, you could prove that each proposition P was equivalent to at most one irreducible form.  This result was published in 1936 by Church and one of his students, J. Barkley Rosser, and today is known as the Church–Rosser Theorem (you may have heard of that, too).  In the long run, Church's logic is an obscure historical footnote, while its λ-only subset turned out to be of great interest for computation, and is well-known under the name "λ-calculus".

So evidently this idea of tampering with reductio ad absurdum and bringing computation into the mix is not exactly unprecedented.  Is it possible that there is something there that Alonzo Church didn't notice?  I'd have to say, "yes".  Alonzo Church is one of those people who (like Albert Einstein, you'll recall he came up in relation to the first no-go theorem I discussed) in retrospect appears to have set a standard of intelligence none of us can possibly aspire to — but all such people are limited by the time they live in.  Einstein died years before Bell's Theorem was published.  Heck, Aristotle was clearly a smart guy too, and just think of everything he missed through the inconvenience of being born about two millennia before the scientific revolution.  And Alonzo Church couldn't, by the nature of the beast, have created his logic based on a modern perspective on computation and logic since it was in part the further development of his own work over many decades that has produced that modern perspective.

I've got one more puzzle piece I'm pushing around, that seems like it ought to fit in somewhere.  Remember I said Church's logic was shown to have antinomies?  Well, at the time the antinomy derivation was rather baroque.  It involved a form of the Richard Paradox, which concerns the use of an expression in some class to designate an object that by definition cannot be designated by expressions of that class.  (A version due to G.G. Berry concerns the twenty-one syllable English expression "the least natural number not nameable in fewer than twenty-two syllables".)  The Richard Paradox is naturally facilitated by granting first-class status to functions, as λ-calculus and Lisp do.  But, it turns out, there is a much simpler paradox contained in Church's logic, involving less logical machinery and therefore better suited for understanding what goes wrong when λ-calculus is embedded in a logic.  This is Curry's Paradox.

I'll assume, for this last bit, that you're at least hazily familiar with λ-calculus, so it'll come back to you when you see it.

For Curry's Paradox, we need one logical operator, three logical axioms, and the machinery of λ-calculus itself.  Our one logical operator is the binary implication operator, ⇒.  The syntax of the augmented λ-calculus is then

T   ::=   x | c | (λx.T) | (TT) | (T⇒T)
That is, a term is either a variable, or a constant, or a lambda-expression, or an application, or an implication.  We don't need a negation operator, because we're sticking with the generalized notion of inconsistency as the property that all propositions are provable.  Our axioms assert that certain terms are provable:
  1. For all terms P and Q, if provably P and provably (P⇒Q), then provably Q.    (modus ponens)
  2. For all terms P, provably P⇒P.
  3. For all terms P and Q, provably ((P⇒(P⇒Q))⇒(P⇒Q)).
The sole rewriting axiom of λ-calculus, lest we forget, is the β-rule:
(λx.F)G → F[x ← G]
That is, to apply function (λx.F) to argument G, substitute G for all free occurrences of x in F.

To prove inconsistency, first we need a simple result that comes entirely from λ-calculus itself, called the Fixpoint Theorem.  This result says that for every term F, there exists a term G such that FG = G (that is, every term F has a fixpoint).  The proof works like this:

Suppose F is any term, and let G = (λx.(F(xx)))(λx.(F(xx))), where x doesn't occur in F.  Then G = (λx.(F(xx)))(λx.(F(xx))) → (F(xx))[x ← (λx.(F(xx)))] = F((λx.(F(xx)))(λx.(F(xx)))) = FG.
Notice that although the Fixpoint Theorem apparently says that every F has a fixpoint G, it does not actually require F to be a function at all:  instead of providing a G to which F can be applied, it provides a G from which FG can be derived.  And —moreover— for most F, derivation from G is a divergent computation (G → FG → F(FG) → F(F(FG)) → ...).
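
For what it's worth, the construction transliterates directly into Scheme (a sketch): under applicative order, evaluating (x x) keeps restarting itself before F is ever applied, so the program diverges for every F, a computational echo of the divergence just noted.

(define (fixpoint-term F)
  ((lambda (x) (F (x x))) (lambda (x) (F (x x)))))
; (fixpoint-term f) never returns, for any procedure f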

Now we're ready for our proof of inconsistency:  that for every term P, provably P.

Suppose P is any term.  Let Q = λx.(x⇒P).  By the Fixpoint Theorem, let R be a term such that QR = R.  By writing out the definition of Q and then applying the β-rule, we have QR = (λx.(x⇒P))R → (R⇒P), therefore R = (R⇒P).

By the second axiom, provably (R⇒R); but R = R⇒P, so, by replacing the right hand R in (R⇒R) with (R⇒P), provably (R⇒(R⇒P)).

By the third axiom, provably ((R⇒(R⇒P))⇒(R⇒P)); and we already have provably (R⇒(R⇒P)), so, by modus ponens, provably (R⇒P). But R = (R⇒P), so provably R.

Since provably R and provably (R⇒P), by modus ponens, provably P.

Note: I've had to fix errors in this proof twice since publication; there's some sort of lesson there about either formal proofs or paradoxes.
So, why did I go through all this in detail?  Besides, of course, enjoying a good paradox.  Well, mostly, this:  The entire derivation turns on the essential premise that derivation in the calculus, as occurs (oddly backwards) in the proof of the Fixpoint Theorem, is a relation between logical terms — which is to say that all terms in the calculus have logical meaning.

And we've seen something like that before, in my early explanation of Mitchell Wand's no-go theorem:  trivialization of theory resulted from assuming that all calculus derivation was evaluation.  So, if we got around Wand's no-go theorem by recognizing that some derivation is not evaluation, what can we do by recognizing that some derivation is not deduction?

Wednesday, December 26, 2012

Metaclassical physics

Nature shows us only the tail of the lion. But there is no doubt in my mind that the lion belongs with it even if he cannot reveal himself to the eye all at once because of his huge dimension.
— Albert Einstein

What theoretical physics needs, I've long believed, is to violate some assumption that is shared in common by both classical physics and quantum mechanics.  Everyone nowadays seems to understand that something different is needed, but I suspect the "radical" new theories I've heard of aren't radical enough.

So in this post I'm going to suggest a class of physical theory that seems to me to be radical enough (or, if you prefer, weird enough) to shake things up.

Indirect evidence of a wrong assumption

Suppose you're trying to find the best way to structure your description of something.  (Examples:  choosing the structure for a computer program to perform some task; or choosing the structure for a theory of physics.)

What you hope to find is the natural structure of what you're describing — a structure that affords a really beautiful, simple description.  When you strike the natural structure, a sort of resonance occurs, in which various subsidiary problems you may have had with your description just melt away, and the description practically writes itself.

But here's a problem that often occurs:  You've got a structure that affords a pleasing approximate description.  But as you try to tweak the description for greater precision, instead of the description getting simpler, as it should if you were really fine-tuning near the natural structure, instead the description gets more and more complicated.  What has happened, I suggest, is that you've got a local optimum in solution space, instead of the global optimum of the natural structure:  small changes in the structure won't work as well as the earlier approximation, and may not work at all, so fundamental improvement would require a large change to the structure.

I suggest that physics over the past century-and-change has experienced just such a phenomenon.  Classical physics was pleasing, but an approximation.  Our attempts to come closer to reality gave us quantum mechanics, which has hideously ugly mathematics.  And our attempts to improve QM... sigh.

You'll find advocates of these modern theories, and before them advocates of QM, gushing about how beautiful the math is, but frankly I find this wishful thinking.  They focus on part of the math that's beautiful, and try to pretend the ugliness out of existence.  Ignoring the elephant in the room.  In the case of QM, the elephant is commonly called "observation", and in more formal social situations, "wave function collapse".

But it would be a mistake to focus too much on the messiness of QM math.  If physics is stuck in a local optimum, we need to look foremost at big things that classical and quantum have in common, rather than getting too focused on details by which they contrast.

The saga of determinism and locality

Two really big ideas that have been much considered, in the contrast between classical and quantum physics, are determinism and locality.

In the classical view of reality, there are three dimensions of space and one dimension of time.  In space, there are point particles, and space-filling fields.  The particles move continuously through space over time, each tracing out a one-dimensional curve in four-space.  The fields propagate in waves over time.  The model is deterministic because, in principle, the state of everything at one moment in time completely determines the state of everything at all later moments in time.  The model is local because neither particles nor waves travel faster than a certain maximum speed (the speed of light).

QM depicts nature as being fundamentally nondeterministic.  Einstein didn't approve of that (it apparently offended his sense of the rhyming scheme of nature, as I've remarked before).  God does not play dice, he said.

It's important to realize that Einstein was personally responsible, through his theory of special relativity, for making classical physics more local.  Prior to relativity, classical physics did not prohibit things from moving arbitrarily fast; consequently, in considering what would happen to a given volume of space in a given interval of time, there was always the chance that by the end of the interval, some really fast particle might come zooming through the volume that had been on the other side of the universe at the beginning of the interval.

This relation between Einstein and locality helps us appreciate why Einstein, in attempting to demonstrate that quantum mechanics is flawed, constructed with two of his colleagues the EPR paradox showing that QM requires information to propagate across space faster than light.  That is, in an attempt to discredit nondeterminism he reasoned that quantum nondeterminism implies non-locality, and since non-locality is obviously absurd, quantum nondeterminism must be wrong.

Perhaps you can see where this is going.  Instead of discrediting nondeterminism, he ultimately contributed to discrediting locality.

Okay, let's back up a few years, to 1932.  As an alternative to quantum nondeterminism, Einstein was interested in hidden variable theories.  A hidden variable theory says that the state of reality is described by some sort of variables that evolve deterministically over time, but these underlying variables are fundamentally unobservable, so that the nondeterministic quantum world is merely our statistical knowledge about the hidden deterministic reality.  In 1932, John von Neumann proved, formally, that no hidden variable theory can produce exactly the same predictions as QM.  (All hidden variable theories are experimentally distinguishable from QM.)

This is an example of a no-go theorem, a formal proof that a certain kind of theory cannot work.  Often, the most interesting thing about a (correct) no-go theorem is its premise — precisely what it shows to be impossible.  Because twenty years later, in 1952, David Bohm published a hidden variable theory experimentally indistinguishable from QM.  The hidden variable theory was correct.  The no-go theorem was correct.  We can therefore deduce that what Bohm did must be different from what von Neumann proved impossible.

And so it was.  On careful examination, von Neumann's proof assumes the hidden variable theory is local.  Bohm's hidden variable theory has a quantum potential field that can propagate arbitrarily fast, so Bohm's theory is non-local.  Einstein remarked, "This is not at all what I had in mind."

Now we come to Bell's Theorem.  Published in 1964 (nine years after Einstein's death), this was another no-go theorem, based on a refinement of the EPR experiment.  Bell showed that the particular probability distribution predicted by QM, in that experiment, could not possibly be produced by a deterministic local hidden variable theory.

Okay.  Here again, maybe you can see already where I'm going with this.  I'm preparing to propose a particular way of violating an assumption in the premise of Bell's Theorem.  This particular violation may allow construction of a theory that isn't, to my knowledge, what any of the above cast of characters had in mind, but that might nevertheless be plausibly called a deterministic local hidden variable theory.

Changing history

The probability distribution Bell was considering, the one that couldn't be produced by a deterministic local hidden variable theory, has to do with the correlation between observations at two distant detectors, where both observations are based on a generating event that occurred earlier in time, at a point in space in between the detectors.
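For the record, here's the standard quantitative statement of the conflict (ordinary textbook material, nothing specific to this post).  For a spin-singlet pair measured along unit vectors a and b, QM predicts the correlation

    E(a,b)  =  −a·b  =  −cos θ

where θ is the angle between the detector settings; and Bell's inequality (in its later CHSH refinement) says that for any deterministic local hidden variable theory the combination

    S  =  E(a,b) − E(a,b′) + E(a′,b) + E(a′,b′)

must satisfy |S| ≤ 2, whereas the QM correlation above pushes |S| up to 2√2 for suitable choices of settings.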

And one day some years ago, reading about all this, it occurred to me that if you think of these three events —two observations and one generation— as just three points between which signals can be sent back and forth, it's really easy to set up a simple mathematical model in which if you start with a uniform probability distribution, set the system going, and let the signals bounce back and forth until they reach a steady state, the probability distribution of the final state of the system will be exactly what QM predicts.  This idea is somewhat reminiscent of a modern development in QM called the transactional interpretation (different, but reminiscent).

The math of this is really easy; it doesn't involve anything more complicated than a dot product of vectors.  But wait, signals propagating back and forth in time?  What does that even mean?

There are a lot of really badly thought-out depictions of time travel in modern science fiction.  For which I'm sort-of grateful, because over the years I've been annoyed by them, and therefore thought about what was wrong with them, and thereby honed my thinking about time travel.

It seems to me the big problem with the idea of changing history is, what does "change history" mean?  In order for something to change, it has to change relative to some other dimension.  If a board is badly milled, its thickness may vary (change) along the length of the board, meaning its thickness depends on how far along its length you measure.  The apparent magnitude of a star may vary with distance from the star.  The position of a moving train varies over time.  But if history changes, relative to what other dimension does it change?  It isn't changing relative to any of the four dimensions of spacetime.

Let's suppose there is a fifth dimension, relative to which the entire four-dimensional spacetime continuum can change.  As a simple name, let's call it "meta-time".  This would, of course, raise lots of metaphysical questions; a favorite of mine is, if there's a fifth dimension of meta-time, why not a sixth of meta-meta-time, seventh of meta-meta-meta-time, and so proceed ad infinitum?  Though fun to muse on, those sorts of questions aren't needed right now; just suppose for a moment there's meta-time, and let's see where it leads.

While we're at it, let's suppose this five-dimensional model is deterministic in the sense that, in principle, the state of spacetime at one moment in meta-time completely determines the state of spacetime at all later moments in meta-time.  And let's also suppose the five-dimensional model is local in the sense that changes to spacetime (whatever they are) propagate, over meta-time, at some maximum rate. (So if you hop in your TARDIS and make a change to history on a particular day in 1963 London, that change to history propagates outward in space at, say, no more than 300,000km per meta-second, and propagates forward and backward in time at no more than one second per meta-second.)

That bit of math I mentioned earlier, in which the QM probability distribution of Bell's Theorem is reproduced?  That can be made to use just this kind of model — a five-dimensional system, with three dimensions of space, one of time, and one of meta-time, with determinism and locality relative to meta-time.  Granted, it's only a toy:  it's nothing like a serious attempt to model reality with any generality at all, just a one-off model describing the particular experimental set-up of Bell's Theorem.

I've got one more suggestion to make, though.  And I still won't have a full-blown theory, such as Bohm had (there's stuff Bohm's theory didn't include, but it did have some generality to it), but imho this last point is worth the price of admission.  I wouldn't call it "compelling", because atm this is all too outré to be compelling, but I for one found it... remarkable.  When I first saw it, it actually made me laugh out loud.

Metaclassical physics

Wondering what a full-blown theory of physics of this sort might look like, I tried to envision what sorts of things would inhabit this five-dimensional model.

In classical physics, as remarked, space contains point particles interacting with fields.  And when you add in time, those things that used to look like point particles appear instead as one-dimensional curves, tracing the motion of the particle through spacetime.  I was momentarily perplexed when I tried to add in meta-time.  Would the three events in Bell's experiment, two observations and one generation, interact through vibrations in these one-dimensional curves traced through spacetime?  Modern string theory does make a big thing out of stuff vibrating.  Also, though, a one-dimensional curve vibrating, or otherwise evolving, over meta-time traces out a two-dimensional surface in the five-dimensional space-time-metatime continuum; and two-dimensional surfaces hardly sound like a simplification.  We set out on this journey hoping to simplify things, hoping ideally to strike on the natural structure of physical reality and achieve resonance (the ring of truth?).

But wait.  Why have point particles in space?  Point particles in classical physics are nice because of the shape of the theory they produce – but points in space don't produce that shape of theory when they're moving through both time and meta-time.  And those one-dimensional curves in spacetime don't play nearly such a central role in QM, notwithstanding they make a sort of cameo appearance in Feynman's path integral formulation.

What is really fundamental to QM is the elephant in the room, the thing that makes such a hideous mess out of QM mathematics:  observation, known at up-scale parties as wave function collapse.  QM views spacetime as a continuum punctuated by zero-dimensional spacetime events — essentially, observations.

And as spacetime evolves over meta-time, a zero-dimensional spacetime event traces out a one-dimensional curve.

So now, apparently, we have a theory in which a continuum is populated by zero-dimensional points and fields, evolving deterministically over a separate dimension with a maximum rate of propagation.  Which is so much like classical physics that (as mentioned) when I saw it I laughed out loud.

Friday, January 27, 2012

Guarded continuations

Go to, let us go down, and there confound their language, that they may not understand one another's speech.
— Genesis 11:7 (King James Version)
I've been asked about Kernel's guarded continuations.  In writing that section of the Kernel Report, I knew it wasn't as lucid as one might like, but getting it all down while I had it clearly in mind was something of an achievement in itself.  I'm not sure I've ever seen a really lucid explanation of continuations; explaining them should probably be considered a major challenge in its own right, above and beyond figuring out how the mechanism should work.  It's high time I tried to do better, though; however many tries it takes to get the explanation to come clear, I won't get there if my number of tries never increases.

On my Kernel web page I describe guarded continuations as 'Exception handling "done right"'.  For an ultra-short tag line, that may be about as good as any; but guarded continuations are more a simultaneous blend-and-improvement on Scheme continuations and Java exception handling than a supposedly-optimal exception facility.  We may still be several generations of exception facilities away from getting exceptions right; but be that as it may, I think guarded continuations have something to add to the discussion.

It's also worth saying up front that guarded continuations are lower-level than a good exception facility would be.  Kernel tries for very simple low-level facilities plus the ability to build higher-level abstractions from them.  It's an open, and fascinating, question how well Kernel's abstraction facilities will actually apply to exception handling — and I hope to learn a lot from that question.  (I expect shortcomings of Kernel's abstractive facilities to be not about the facilities' power (which I maintain is extraordinary) but about the directions in which that power can be applied.  When I figure out what the missing directions are, I mean to go that way.)

Stacks and trees

Classically, programming languages deal with function calls using a stack.  There are two kinds of stuff on the stack:  local variables (data), and return addresses (control).  A function call has a single set of local variables; but the algorithm for that call may give rise to many different return addresses, one for each point in the algorithm where it, in turn, calls a function.  In principle, one might have two stacks, one for data and one for control; but classically, all local variables and all return addresses are use-once, to be discarded in the same order they were introduced, so it may seem natural to store both on a single stack (with resultant security problems, malware accessing data as control and vice versa).

First-class (lexically scoped) functions require local data to be stored, in general, as a tree rather than a stack.  When a first-class function is returned as the result of a function call (in vintage parlance, as an upward funarg), lexical scoping expects that function to have access to —in general— the entire column of variables that were on the stack when the function was created, even though some of the function calls that created those variables have since ended.  So data that used to be popped off the stack must now, potentially, persist.  A set of local variables, which used to be the data part of a stack frame, is now an environment, with a fixed set of ancestors (variables further up the column) and potentially any number of children (one for each call to any function defined in that environment).  (In Kernel, btw, environments are first-class objects; there was a lively LtU discussion on first-class environments, here.)
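Here's a minimal Scheme sketch of the upward-funarg situation (plain Scheme; nothing Kernel-specific yet):

    (define (make-adder n)          ; n is local data of this call
      (lambda (x) (+ x n)))         ; the returned function still refers to n

    (define add5 (make-adder 5))    ; the call to make-adder has ended,
    (add5 3)                        ; => 8   ...but its n must still persist

The set of variables containing n has become an environment, a child of whatever environment make-adder was defined in.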

First-class continuations do to the control stack what first-class functions did to the data stack:  they turn it into a tree.  (First-class continuations do this to the data stack too, but I'm not aware of any language that has first-class continuations without first-class functions.)  Just as first-class functions required entire columns of local variables to persist after the ends of function calls that created them, first-class continuations require entire columns of "return addresses" to persist after the ends of function calls that created them.

In visualizing the tree of continuations, the actual algorithms of the program hover in the areas beneath the nodes of the tree (possibly, multiple algorithms under a single node).  Each child continuation is a socket waiting for a value to be plugged into it, to be chewed on by the algorithm above that child — which may, eventually, terminate and plug a value into its parent continuation.  Primitive operations hover beneath the leaves of the tree.  Kernel pretty much only needs one non-primitive operation: combination evaluation, which spins off a subtree to evaluate the operator, spins off subtrees as appropriate to evaluate operands, and finally hands the situation over to another algorithm to make the operative call (using for the operative call its parent continuation, in properly tail-recursive fashion).

Guards

When a function f calls another function g, f gives up some of its ability to regulate the future evolution of the computation.  The programmer tends to assume that not all such power has been lost — that  (1) the call to g won't terminate without control passing back through the point at which f called it, and  (2) that particular call to g won't return through that point more than once.  In the presence of first-class continuations, though, even these weak assumptions needn't hold.  Once f calls g, the dynamic extent of that call might be exited, and/or later reentered any number of times, in ways that bypass f entirely.  Perhaps f opens a file, calls g, and then closes the file before relinquishing control to whoever called f; if g terminates bypassing f, the file might not be closed outside the call to f, and if g is reentered bypassing f, the file might not be open inside the call to g.
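Here's a minimal Scheme sketch of the exit half of that hazard (open-the-file and close-the-file are hypothetical placeholders for whatever resource handling is at stake):

    (define (f escape)
      (open-the-file)               ; hypothetical
      (g escape)
      (close-the-file))             ; never reached if g escapes

    (define (g escape)
      (escape 'bailing-out))        ; jumps straight out, bypassing the rest of f

    (call-with-current-continuation
      (lambda (k) (f k)))           ; => bailing-out, with the file left open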

Scheme addresses this sort of problem by means of dynamic-wind, a device that causes specified actions to be performed whenever a continuation-call tries to enter or leave the dynamic extent of the call to g.  In R5RS Scheme (the version Kernel primarily diverged from), there were a couple of problems with dynamic-wind (a minimal sketch of the mechanism follows the list below).
  • Although dynamic-wind allowed something to be done every time the dynamic extent of the call to g was entered, and every time it was exited, these before and after actions were unable to observe (let alone affect) what value was being passed into or out of the dynamic extent, and so failed to support context-sensitive regulation of access to/from the dynamic extent.  (In JavaScript terms <shudder>, before and after are akin to finally rather than catch.)
  • The R5RS was explicitly baffled by what would happen if continuations were used to pass into or out of the dynamic extents of the calls to before and after — which suggested to me that the whole thing was something of a kludge.  The correct behavior in such cases ought to follow naturally from the conceptual framework, and if it didn't follow naturally one had to wonder if the conceptual framework was altogether coherent.
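For reference, here is roughly what the dynamic-wind arrangement looks like in the file scenario above (again with hypothetical file operations):

    (define (f)
      (dynamic-wind
        (lambda () (open-the-file))     ; before: run on every entry to the extent
        (lambda () (g))                 ; the guarded body
        (lambda () (close-the-file))))  ; after: run on every exit from the extent

Note that before and after are thunks; they take no arguments, which is exactly why they can't observe the value passing into or out of the extent.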
Java exception handling differs from this arrangement in several ways.
  • Most obviously and trivially, it only allows exiting a dynamic extent, not reentering one.
  • An intercepting catch has access to the value being thrown (unlike the after of dynamic-wind).
  • A catch selectively intercepts throws, based on the "target" (one might even say, "destination") of the throw.  Whereas Lisp cleanly separates the target of an abnormal transfer of control from the value being transferred —the continuation called, versus the value passed to it— Java entangles them in a single thrown object whose type is the "target".
  • Once a Java exception is caught, that's the end of the exception unless it is explicitly rethrown — in contrast to the after of dynamic-wind, which ordinarily does not disrupt the abnormal transfer of control during which it is called.
Interestingly, one fairly intricate characteristic shared by both Scheme's dynamic-wind and Java's catch is that the target is a node of a tree.  This is, perhaps, all the more interesting since the trees are seemingly very different in nature.  The Java target is a type in a tree of exception types; the Scheme target is a continuation in a tree of continuations.

Kernel's guarded continuations drew on both these models for inspiration.  An applicative, guard-continuation, allows the construction of a continuation with entry- and exit-guard clauses.  Exit-guard clauses selectively intercept continuation-calls from inside to outside the guarded dynamic extent (the branch of the continuation tree descending from the guarded continuation), and are therefore analogous to Java catch clauses, with some adjustments.  Entry-guard clauses are symmetrically designed to selectively intercept continuation-calls from outside to inside the guarded extent.

Each guard clause has a selector continuation.  Keep in mind that a continuation-call has two ends:  it goes from one point in the continuation-tree to another, and in this case we know that one end of the call is inside the guarded extent, while the other end is outside.  A continuation-call matches the selector of a clause just if the outside end of the call lies within the tree branch descending from the selector.  For an exit-guard clause this means that the called continuation must be in the selector's branch of the tree.  For an entry-guard, the starting point of the continuation-call must be within the selector's branch of the tree.

Because entry guards are selected based on the source of the continuation-call, a list of interceptors selected when the continuation-call is initiated will be different, in general, from what one would get by stopping at some intermediate interceptor and then "rethrowing" the call (in Java catch style).  This suggests that normal termination of an intercepting guard should continue along the path of the original continuation-call, reserving explicit diversion for when the interceptor doesn't want to follow the original path.  Secondarily, although we do want it to be convenient (as well as possible) for an interceptor to continue along the path of the original continuation-call, we don't really want to give the interceptor first-class continuations for either end of the call (source or destination).  We may have built some abstract device on top of the continuation primitives, and as a result, the source/destination continuations may be internal details that really oughtn't be casually spread about.

Therefore, normal termination of an intercepting guard continues along the path of the continuation-call, and the interceptor when called is provided with an explicit continuation for diverting control to just outside the guarded extent (Java-style).
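To make the shape of all this concrete, here's a rough sketch using Kernel's guard-dynamic-extent, which calls a combiner inside a freshly guarded extent.  The clause format here is as I understand it from the Report (each guard list is a list of clauses, each clause a selector continuation paired with an interceptor applicative); treat the details as illustrative rather than authoritative, and do-something is just a placeholder:

    (guard-dynamic-extent
      ()                                   ; no entry guards
      ($lambda () (do-something))          ; body of the guarded extent
      (list (list error-continuation       ; selector: exits bound for the error branch
                  ($lambda (value divert)  ; interceptor: the value being passed, plus
                                           ;   an applicative for diverting control to
                                           ;   just outside the guarded extent
                    value))))              ; normal return: continue along the original
                                           ;   continuation-call path, passing value on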

For versatile guard selection, each guarded gateway (entry or exit for a dynamic extent) has a list of guard clauses.  When a continuation-call goes through the gate (entering or exiting), the selectors of these clauses are checked, one at a time, until one is found that matches the call, or until there are no more clauses so that no interception occurs.  No more than one clause is selected.  A root continuation is provided, which can be used as a selector that matches all possible calls.  Using the root continuation, plus the ability for an interceptor to continue along the continuation-call path, one can (for example) arrange that a clause will be selected except when the call matches some narrower selector (for example, the error continuation, which is the ancestor of all evaluation-aborting error handlers).  One provides a clause that catches errors and causes them to continue along their path, followed by a clause that catches everything and does whatever it is one wants done only in non-error situations.
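Spelled out in the same sketchy notation (with cleanup a placeholder for whatever one wants done only in non-error situations), the exit-guard list for that arrangement might look like this; the error clause has to come first, since at most one clause is selected and an error exit would also match the root selector:

    (list (list error-continuation               ; first clause: error exits
                ($lambda (value #ignore) value)) ;   just pass straight through
          (list root-continuation                ; second clause: everything else
                ($lambda (value #ignore)
                  (cleanup)                      ;   the non-error-only action
                  value)))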

When a continuation-call is made, a list of intercepting guard clauses can be immediately constructed, since continuation guards are fixed at the time the guarded continuation is constructed.  The list doesn't have to be constructed then, of course, exactly because all the relevant parts of the continuation tree are immutable; but if it isn't constructed immediately, one will have to store the information needed to select the clauses later — which includes the source of the call as well as its destination (destination to select exit guards, source to select entry guards).

The Kernel Report carefully works out subtleties of the device, such that when a guard intercepts an entry to or exit from dynamic extent d, and then itself makes a continuation-call, that call does not select any guard on d.

The traditional Scheme device for acquiring a first-class continuation object is call/cc, which calls a procedure and passes to that procedure the continuation to which that call would normally return.  Frankly, this was always a very clumsy way to work with continuations; one might almost suspect it was devised as an "esoteric programming language" feature, akin to INTERCAL's COME FROM statement.  With guarded continuations, though, dependence on call/cc finally becomes untenable.  If I'm working in Java, and I want to create a new class of exception, I do that by adding a subtype to the tree of exception types; to do the same thing in Kernel, it seems I want to add a new continuation to the tree branch of continuations descending from the error continuation.  If I had to use call/cc to acquire continuations, I could only add a new descendant of the error continuation if I were actually in the dynamic extent of the error continuation — so, to define a new "type" of error during normal computation, I'd have to first "throw" an error exception, passing through all the exit- and entry-guards to get to the point in the tree where I want to add the new continuation.  What's wanted is ability to add a child continuation to an existing continuation without having to be inside the dynamic extent of the parent.  Kernel provides just that ability, with extend-continuation (to add some algorithm between child and parent) and guard-continuation (to add guards between child and parent).  (Interestingly, call/cc still has to be provided as a primitive, since it can't be derived from the other primitives.)
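For instance (sketchily, and with the same caveat as above that the exact treatment of the applicative argument should be checked against the Report), one might set up a new error "type" from ordinary code along these lines:

    ;; add a child of error-continuation without being inside its dynamic extent;
    ;; this applicative ignores whatever value is passed and substitutes a
    ;; description of its own before passing control on to error-continuation
    ($define! my-error-continuation
      (extend-continuation error-continuation
                           ($lambda #ignore "my new kind of error")))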

An illustrative example

Alas, no.  Source code for continuation guarding is generally bulky and hard to read, and attempting to illuminate the basic concepts of continuation guarding that way would more likely just create an unhelpful mess.  For me, at least, it's all best understood by visualizing the tree of nested dynamic scopes, whereas the source code only directly reveals a tree of nested static syntax; the dynamic tree then has to be deduced, and the two trees don't correspond as closely as one might think.

How best to arrange for a clear illustration, I'll have to give some more thought.  Meanwhile, imagine there was an illustration here that made it all crystal clear to you.

Delimited continuations

I've also been asked to discuss guarded continuations versus delimited continuations.  <cough>  That's a rather large topic.  I'd probably have to start by explaining why I think "delimited continuations" is a misnomer — by my reckoning, delimited continuations aren't continuations at all.  They're functions, and functions are a completely different sort of creature than continuations.  Continuations only have one side (input), which is why it's simply meaningless to talk about composing them.  (Can you see I'm working up to a major rant, here?)

So I'm going to end this post here —it's already the longest post I've yet written on this blog— and hope to put together something on delimited continuations later.  Perhaps that post will also help to further clarify guarded continuations, but we'll see.  It's yet another item for my collection of in-development blog posts (along with guarded environments, another interesting topic this has reminded me of).

Monday, December 26, 2011

Preface to Homer

But of the tree of the knowledge of good and euill, thou ſhalt not eate of it: for in the day that thou eateſt thereof, thou ſhalt ſurely die.
Genesis 2:17 (King James Version)

I'm going to suggest a hypothesis on the evolution of societies, natural languages, and memetic organisms.  Relating human language and human thought to programming is a straightforward exercise left to the student.

Eric Havelock in his magnum opus Preface to Plato (fascinating stuff) describes two profoundly different stages in the evolution of human societies, which he calls oral society and literate society.  Greek society, reckons Havelock, was transforming from oral to literate around the time of Plato; the epics of Homer are characteristic of oral culture.  I suggest that there is a stage of societal evolution before that of oral poetry such as Homer's — and that just as Afghan culture is a surviving example of oral society, the Pirahã culture recently studied in the Amazon is a surviving example of... verbal society.  (I can't bring myself to call it "pre-oral", since language is still spoken; but it does have words, and is missing oration, so "verbal" will do till something better comes along.)

Scientific organisms need a literate environment to survive; religious organisms don't need literacy, and were, I reckon, the giants that roamed the ideosphere in the age of orality.  But in the age before Homer's, religions could not have survived either.  If ever there were an actual event that fits beautifully with the myth of Adam and Eve eating of the fruit of the Tree of Knowledge, the transition from verbal to oral society would be it.

Oral thought

In an oral society (in Havelock's vision — I've also read parts of Walter Ong's Orality and Literacy, which follows Havelock's paradigm), knowledge is preserved through an oral tradition.  The form of the account matters; facts are mutated as convenient to put them in a form that will be successfully replicated over many retellings.  Standard patterns are used.  Repetition is used.  A good story also always has an actor:  things don't just happen, they are done, by somebody, which may favor a polytheistic society, with deities personifying what we (in a literate society) would call "abstract" forces.  One might well end up with some such pattern as

And God said, let there be x.  And there was x.  And God saw the x, and saw that it was good.  And the morning and the evening were the nth day.

And God said, let there be y.  And there was y.  And God saw the y, and saw that it was good.  And the morning and the evening were the n+1st day.

(Note the repetition of each item, repetition of the pattern, and the actor.)  For a more concrete example, here is the start of Thomas Hobbes's 1676 translation of Homer's Iliad:
O goddess sing what woe the discontent
Of Thetis’ son brought to the Greeks; what souls
Of heroes down to Erebus it sent,
Leaving their bodies unto dogs and fowls;
Whilst the two princes of the army strove,
King Agamemnon and Achilles stout.
Notice that things here don't happen, somebody/something does them.  The goddess sings.  The discontent (of Thetis' son) brings, sends, leaves.  The two princes strive.

The things that are acted on are concrete as well; nothing is abstract in our literate sense.

Such oral tradition can be written down, and was written down, without disrupting the orality of the society.  Literate society is what happens when the culture itself embraces writing as a means of preserving knowledge instead of an oral tradition.  Once literacy is assimilated, set patterns are no longer needed, repetition is no longer needed, pervasive actors are no longer needed, and details become reliably stable in a way that simply doesn't happen in oral society — the keepers of an oral tradition are apt to believe they tell a story exactly the same way each time, but only because they and their telling change as one.  When the actors go away, it becomes possible to conceive of abstract entities.  Plato, with his descriptions of shadows on a cave wall, and Ideal Forms, and such, was (Havelock reckoned) trying to explain literate abstraction in a way that might be understood by someone with an oral worldview.

Note that science can't possibly survive in an oral environment.  It relies on an objectively fixed record of phenomena, against which to judge theories; and it centrally studies abstract forces.  Religion, on the other hand, is ideally suited to an oral environment.  I suggest that religious organisms were the dominant taxon of memetic organisms in oral society, and the taxon of scientific organisms evolved once literate society made it possible.  Leading to a classic Darwinian struggle for survival between the two taxa.

Verbal thought

Those who study natural languages have a luxury not afforded to artlangers, those who create languages artistically:  When a natural language crops up that violates the "natural" rules that linguists thought they understood, new rules can be invented to fit.  But, if an artlanger creates a language that violates the rules linguists think they understand, their creation is likely to be ridiculed.  This was the observation of David J. Peterson in an essay in 2007 (he gives a "Smiley award" to a conlang each year, and does so in essays containing some brilliant deep insights into conlanging).

What is it to a linguist if Pirahã exists? "That sounds totally fake," says the skeptic. Says the linguist, "Yeah, doesn't it?" But in a world where Pirahã doesn't exist, imagine the conlanger who created it. "I just made up a language with no temporal vocabulary or tense whatsoever, no number system, and a culture of people who have no oral history, no art, and no appreciation for storytelling. Oh, yeah, and the language can just as easily be whistled, hummed or drummed as spoken. Oh, and the men and women have different phonologies. Oh yeah, and it's spoken in an area with a dominant language, but nobody speaks it, because they think their language is the best. Oh yeah, and it's supposed to be a naturalistic language." Suddenly when someone counters and says, "That sounds totally fake," the conlanger is put on the defensive, because they do have to account for it—in other words, "Yeah, doesn't it?" isn't going to fly.
— David J. Peterson, The 2007 Smiley Award Winner: Teonaht
Which is interesting for a conlanger, but fascinating in light of Havelock's notion of oral society.  That list of features is pretty much explicitly saying the language doesn't and can't support an oral society:  "no oral history, no art, and no appreciation for storytelling" (doesn't), "no temporal vocabulary or tense whatsoever, no number system" (can't).  And for a verbal society to survive in a world where the main Darwinian memetic struggle is between literacy and orality, of course it would have to be an extraordinarily compelling instance of verbality — "nobody speaks" the dominant language of the area, "because they think their language is the best."

The fruit of the Tree of Knowledge

The Pirahã were studied by Daniel Everett, who originally approached them with an evangelical mission — the point of such efforts is to learn an isolated people's language, then translate the teachings of Christianity into that language.  Of course this failed miserably with the Pirahã, with their compellingly verbal language and culture (J.R.R. Tolkien criticized the international auxiliary language Esperanto as sterile because it had no mythology behind it (one could argue Esperanto now does have a mythology of a sort (but I digress))).  Within a few years after completing his dissertation, Everett became an atheist.  (Everett's 2008 book on his experiences with the Pirahã is Don't Sleep, There Are Snakes: Life and Language in the Amazonian Jungle.)

All of which goes to show that the myth of the fruit of the Tree of Knowledge can be handily applied to the transition from verbal to oral society.  However, as I pointed out in my earlier post on memetic organisms, religious teachings are naturally selected for their ambiguity, their ability to be given a wide variety of different interpretations.  The plausibility of an interpretation of the myth is, therefore, pretty much meaningless — the myth is a contextual chameleon, expected to blend plausibly into different belief systems.  But it is interesting to consider how the myth might have evolved.  The early Judaic written tradition is evidently a written record of an originally oral tradition, and an oral tradition mutates events into a good story (i.e., a highly replicable one).  If the verbality conjecture is somewhere in the general neighborhood of the truth, there may once have been a vast cultural upheaval as orality supplanted verbality, perhaps (or perhaps not) on a par with the modern struggle between scientific and religious thinking (a major theme of current geopolitics).  Such an upheaval might be expected to make a lasting impression on the oral societies that emerged; the lasting impression would be a myth; and the myth would be shaped into the forms of orality, with concrete actors and objects.  What sort of myth do you think might result?

[Update:  I further discuss the timing of the verbal/oral transition in a 2015 post Sapience and language.]