...if you cannot say what you mean, you can never mean what you say.
— Centauri Minister of Intelligence, Babylon 5.
Ever since I started this blog, I've had in mind to devote a post to the relationship between the strong theory of vau-calculus and the no-go theorem of Mitchell Wand's 1998 paper The Theory of Fexprs is Trivial. Some of my past attempts to explain this relationship have failed miserably, though, so before trying again I wanted some new insight into how to approach the explanation. Following my mention of this issue in my recent post on bypassing no-go theorems, and some off-blog discussion it provoked, I'm going to take a stab at it.
The trouble has always been that while my result is very simple, and Wand's result is very simple, and both treatments use quite conventional mathematical techniques (playing the game strictly according to Hoyle), they're based on bewilderingly different conceptual frameworks. There's not just one difference, but several, and if you try to explain any one of them, your audience is likely to get hung up on one or more of the others.
Here are some highlights of how these conceptual differences stack up; I'll look at things in more detail hereafter.
- Wand's paper assumes that the only equivalences of interest are those between source expressions; he acknowledges this explicitly in the paper.
- Wand's paper isn't really about fexprs — it's about computational reflection, as can be seen in the Observations and Conclusions section at the end of the paper.
- Wand's operational semantics has no terms that aren't source expressions. This isn't a realistic depiction of fexprs as they occur in Lisp, but it is consistent with both his exclusive interest in the theory of source expressions and his interest in computational reflection.
- Wand's operational semantics uses implicit evaluation rather than explicit evaluation: that is, to distinguish between an expression to be evaluated and an expression not to be evaluated, he uses special contexts marking expressions that aren't to be evaluated, rather than special contexts marking expressions that are to be evaluated. From a purely mathematical standpoint, a rewriting system using implicit evaluation always has a trivial theory, while a rewriting system using explicit evaluation can have quite a strong theory. However, Wand's conceptual commitments prevent him both from using explicit evaluation and from caring about its consequences. He cannot use explicit evaluation because the explicit evaluation contexts would not be representable in source expressions, violating his prohibition against non-source terms. And, even if such expressions were allowed, all of the nontrivial equivalences in the theory would involve non-source terms, so that all the equivalences Wand is interested in would still be trivial (though for a different technical reason).
Suppose we want a term-rewriting system in which any source expression S may sometimes appear as data and other times appear as program code to be executed. We can do this in either of two ways: either use a special context to mark a subterm that isn't to be evaluated, or use a special context to mark a subterm that is to be evaluated.
What Wand calls contextual equivalence, and I call operational equivalence, is a binary relation on terms defined thus (up to unimportant differences between treatments):
T1 ≅ T2 iff for every context C and observable V, C[T1] ↦* V iff C[T2] ↦* V

But this formal definition interacts differently with the two different approaches to marking whether S is to be evaluated.
Suppose we use a special context to mark a subterm that isn't to be evaluated. I call this strategy implicit evaluation, because a subterm is implicitly to be evaluated unless we go out of our way to say otherwise. This strategy would naturally occur to a traditional Lisp programmer, who is used to quotation. If this special context is identified by an operator, likely we'd call it "quotation"; but whatever we call it, suppose Q is a context that marks its subterm as data. This means, formally, that Q[S] is itself an observable, and that Q[S1] and Q[S2] are different observables unless S1 and S2 are syntactically identical. It immediately follows, from our definition of operational equivalence, that no two source terms S1 and S2 can ever be operationally equivalent unless they are syntactically identical. When you add all the trappings in Wand's paper, this trivialization of theory might look more complicated, but it still comes down to this: if there's a context that converts arbitrary terms into irreducible observables, then the operational equivalence relation is trivial.
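To spell the trivialization argument out in the notation of the definition: take C = Q and V = Q[S1]. Then C[S1] ↦* V in zero steps; but Q[S2] is itself an irreducible observable, so C[S2] ↦* V can hold only if Q[S2] is the very same observable as Q[S1], that is, only if S1 and S2 are syntactically identical. Hence S1 ≅ S2 implies that S1 and S2 are syntactically identical.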
But suppose we use the second strategy, a special context to mark a subterm that is to be evaluated. I call this strategy explicit evaluation because a subterm is only evaluated if it's immediately surrounded by a context that explicitly says to evaluate it. Then the cause of Wand's trivialization of theory "simply" vanishes. Except that the consequences aren't at all simple. The operational equivalence relation is really measuring something different under explicit evaluation than it did under implicit evaluation (under explicit evaluation, source terms have trivial theory because they're irreducible, whereas under implicit evaluation they have trivial theory even though they aren't irreducible). So now I'll go back to square one and build the mathematical machinery with more care.
Dramatis personae

In classical small-step operational semantics, we have a term set, and a number of binary relations on terms.
The term set includes both source expressions and the observable results of computations, and possibly also some terms that represent states of computation that are neither source expressions nor observables.
Six key properties that a binary relation > on terms might have:
- Reflexive: T1 > T1.
- Transitive: if T1 > T2 and T2 > T3 then T1 > T3. The reflexive transitive closure ("zero or more steps") is written by suffixing "*" to the relation, as T1 >* T2.
- Commutative: if T1 > T2 then T2 > T1. A reflexive transitive commutative relation is called an equivalence.
- Compatible: if T1 > T2 then for all contexts C, C[T1] > C[T2]. A compatible equivalence is called a congruence.
- Church-Rosser: if T1 >* T2 and T1 >* T3 then there exists T4 such that T2 >* T4 and T3 >* T4.
- Deterministic: for each T1, there is at most one T2 such that T1 > T2. This is not usually a mathematically desirable property, because it interferes with other properties like compatibility. However, determinism may be semantically desirable (i.e., we want a program to behave the same way each time it's run).
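These properties are abstract, but for a finite relation on a finite carrier set they are directly checkable. Here is a minimal Haskell sketch (my illustration, not from any treatment discussed here); compatibility is omitted, since it quantifies over contexts, and the Church-Rosser check assumes the carrier is closed under the relation:

import Data.List (nub)

type Rel a = [(a, a)]

-- The properties as decidable checks: ts is the carrier, r the relation.
reflexive, commutative, deterministic :: Eq a => [a] -> Rel a -> Bool
reflexive ts r = all (\t -> (t, t) `elem` r) ts
commutative _ r = all (\(t1, t2) -> (t2, t1) `elem` r) r
deterministic ts r =
  all (\t1 -> length (nub [t2 | (t1', t2) <- r, t1' == t1]) <= 1) ts

transitive :: Eq a => Rel a -> Bool
transitive r = and [(t1, t3) `elem` r | (t1, t2) <- r, (t2', t3) <- r, t2 == t2']

-- Reflexive transitive closure ("zero or more steps") over carrier ts.
star :: Eq a => [a] -> Rel a -> Rel a
star ts r = go [(t, t) | t <- ts]
  where
    go acc
      | length acc' == length acc = acc
      | otherwise                 = go acc'
      where acc' = nub (acc ++ [(a, c) | (a, b) <- acc, (b', c) <- r, b == b'])

-- Church-Rosser: any two reduction sequences from T1 can be rejoined at some T4.
churchRosser :: Eq a => [a] -> Rel a -> Bool
churchRosser ts r =
  and [ any (\t4 -> (t2, t4) `elem` rs && (t3, t4) `elem` rs) ts
      | (t1, t2) <- rs, (t1', t3) <- rs, t1 == t1' ]
  where rs = star ts r

With ts the whole term set this is only feasible for toy systems, of course, but it makes the definitions concrete.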
Four key binary relations on terms:
- Operational step, written ↦: a directed relation meant to "obviously" describe what programs in the language are supposed to do. Because it's meant to be obviously right, it doesn't try to have nice mathematical properties like Church-Rosser-ness; it's expected to be deterministic and non-compatible.
- Calculus step, written →: a relation meant to have lots of nice mathematical properties, like the reduction step relation of lambda-calculus. The calculus step is generally defined to be compatible, and if it isn't Church-Rosser, something's wrong.
- Calculus equality, written =: the reflexive transitive commutative closure of the calculus step (thus, the smallest equivalence relation containing the calculus step).
- Operational equivalence, written ≅: T1 ≅ T2 iff for every context C and observable V, C[T1] ↦* V iff C[T2] ↦* V.
And three master theorems one hopes will hold, relating these relations:
- completeness: ↦* implies →*
- Church-Rosser-ness: → is Church-Rosser
- soundness: = implies ≅
The syntax of terms in Wand's semantics is:
T ::= x | (λx.T) | (TT) | (fexpr T) | (eval T)

That is, a term is either a variable, or a lambda-expression, or a combination, or a fexpr, or a call to eval. Wand explained he didn't want to have two different binding constructs, one for ordinary procedures and one for fexprs. Well, I didn't either; it's characteristic of our different approaches, though, that where Wand made his binding construct a constructor of applicatives (that evaluate their argument), and put a wrapper around it to cause the operand to be quoted, I made my binding construct a constructor of operatives (i.e., fexprs), and put a wrapper around it to cause the operand to be evaluated.
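For concreteness, this syntax transcribes directly into, say, a Haskell data type (a sketch; the constructor names are mine, not Wand's):

-- Wand's term syntax as a Haskell data type (constructor names are mine).
data Term
  = Var String        -- x
  | Lam String Term   -- (λx.T)
  | App Term Term     -- (TT)
  | Fexpr Term        -- (fexpr T)
  | Eval Term         -- (eval T)
  deriving (Eq, Show)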
Wand has three rewriting rules: one for applying a λ-expression, one for applying a fexpr, and one for applying eval. If we were being naive, we might try to define a calculus step like this:
((λx.T)V) → T[x ← V]
(fexpr V)T → (V encode(T))
(eval T) → decode(T)

Here, T is a term, and V is a value, that is, a term that's been reduced as much as it can be. (Obscure point: a value is a term that's irreducible by the operational step. Don't worry about it.)
The encode/decode functions require some explanation. The point of not evaluating certain subterms is to be able to use those unevaluated subterms as data. So somehow you have to represent them in an accessible data form. One way to do this would be to use a special quotation operator, and then have accessors that act on quoted expressions, like (car (quote (T1 . T2))) → (quote T1). However, when you're proving a no-go theorem, as Wand was doing, you want to minimize the assumptions you have to make to demonstrate the no-go result. So Wand would naturally want to avoid introducing those extra operators, quote, car, and so on. Instead, Wand used a Mogensen-Scott encoding, which, given an arbitrary term T, produces a term built up out of lambda-expressions that is not itself reducible at all but which can be queried to extract information about T.
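For the pure λ-fragment, the Mogensen-Scott encoding can be sketched as follows, reusing the Term type above (a standard presentation; Wand's own encoding differs in details, and the visitor variables "a", "b", "c" are assumed not to occur in the input):

-- Mogensen-Scott encoding of the pure λ-fragment: each syntactic form
-- becomes a λ-term that hands its components to one of three "visitor"
-- arguments.  Assumes "a", "b", "c" do not occur in the term being encoded.
encode :: Term -> Term
encode (Var x)   = lam3 (App (Var "a") (Var x))
encode (Lam x t) = lam3 (App (Var "b") (Lam x (encode t)))
encode (App t u) = lam3 (App (App (Var "c") (encode t)) (encode u))
encode t         = t   -- fexpr/eval cases left unspecified in this sketch

lam3 :: Term -> Term
lam3 body = Lam "a" (Lam "b" (Lam "c" body))

An encoded term is a value (a nest of λ-expressions); the only way to use it is to apply it to three "visitor" arguments, which is exactly what makes it queryable as data.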
Unfortunately this definition of a calculus step doesn't work, exactly because these rules assume → is compatible, and the whole point of Wand's exercise is that subterm rewriting isn't allowed in all contexts. However, we can define the operational semantics step, which isn't compatible anyway. For the deterministic operational step, we define an "evaluation context", which is a context that determines where the next redex (reducible expression) can occur. We have
E ::= ⎕ | (ET) | ((λx.T)E) | (fexpr E) | (eval E)

That is, starting from the top of the syntax tree of a term, we can reduce a redex at the top of the tree, or we can descend into the operator of a combination, or into the operand of a combination whose operator is a λ-expression, or into the body of a fexpr, or into the operand of eval. Using this, we can define the operational step.
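It may help to see this step as code. In the following minimal Haskell sketch (my illustration, not Wand's formalism, reusing Term and encode from above), the first three clauses correspond to the three ↦-rules displayed just below it, and the remaining clauses implement the descent permitted by the grammar of E; decode is a stub, and isValue is a guess at the value predicate:

-- One operational step: reduce at the top if a rule applies, otherwise
-- descend exactly where the evaluation contexts E allow.
step :: Term -> Maybe Term
step (App (Lam x t) v) | isValue v = Just (subst x v t)       -- call-by-value beta
step (App (Fexpr v) t) | isValue v = Just (App v (encode t))  -- operand passed unevaluated
step (Eval t)                      = Just (decode t)          -- fires for any operand, so the
                                                              -- (eval E) descent never triggers
step (App t u) | not (isValue t)   = (`App` u) <$> step t     -- descend: (ET)
step (App f u) | isLamValue f      = App f <$> step u         -- descend: ((λx.T)E)
step (Fexpr t)                     = Fexpr <$> step t         -- descend: (fexpr E)
step _                             = Nothing                  -- a value, or stuck

isValue, isLamValue :: Term -> Bool
isValue (Lam _ _) = True
isValue (Fexpr v) = isValue v
isValue _         = False
isLamValue (Lam _ _) = True
isLamValue _         = False

-- Naive substitution; a real treatment would avoid variable capture.
subst :: String -> Term -> Term -> Term
subst x v (Var y) | y == x   = v
subst x v (Lam y t) | y /= x = Lam y (subst x v t)
subst x v (App t u)          = App (subst x v t) (subst x v u)
subst x v (Fexpr t)          = Fexpr (subst x v t)
subst x v (Eval t)           = Eval (subst x v t)
subst _ _ t                  = t

decode :: Term -> Term
decode = error "inverse of the encoding; omitted in this sketch"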
E[((λx.T)V)] ↦ E[T[x ← V]]
E[(fexpr V)T] ↦ E[(V encode(T))]
E[(eval T)] ↦ E[decode(T)]

Preprocessing as a cure for trivialization
Wand credited Albert Meyer for noting the trivialization effect. Meyer, though, noted that quote-eval (as opposed to fexprs) need not cause trivialization. (Meyer made this supplementary observation at least as early as 1986; see here, puzzle 3.)
The key to this non-trivialization by quotation is that quotation in traditional Lisp is a "special form", which is to say that the quotation operator, which identifies its argument as not to be evaluated, is fixed at compile-time (that is, before program evaluation begins). So, given a Lisp source program, we can run a preprocessor over it and rewrite each quoted expression to avoid quoting more than a single symbol at a time. (For example, one might rewrite ($quote (+ 1 2)) as (cons ($symbol-quote +) (cons 1 (cons 2 ()))), a style of rewriting that requires only quotation of individual symbols.) And then the term-rewriting system doesn't have to have quotation in it at all (only, in the example, symbol-quotation, a much more restricted facility that would only trivialize the formal theory of individual symbols, not of arbitrary source expressions.)
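Such a preprocessor is easy to sketch over bare S-expressions (my illustration; following the example above, it assumes numbers and () may appear literally in code):

-- Quotation-eliminating preprocessor: each quoted expression is rewritten to
-- cons-building code, so that after preprocessing only single-symbol
-- quotation ($symbol-quote s) remains.
data Sexp = Sym String | Num Integer | Nil | Pair Sexp Sexp
  deriving (Eq, Show)

expandQuote :: Sexp -> Sexp
expandQuote (Pair (Sym "$quote") (Pair q Nil)) = build q
  where
    build (Pair a d) = Pair (Sym "cons") (Pair (build a) (Pair (build d) Nil))
    build (Sym s)    = Pair (Sym "$symbol-quote") (Pair (Sym s) Nil)
    build datum      = datum   -- numbers and (): self-evaluating, left literal
expandQuote (Pair a d) = Pair (expandQuote a) (expandQuote d)
expandQuote t          = t

On the example above, expandQuote rewrites ($quote (+ 1 2)) to (cons ($symbol-quote +) (cons 1 (cons 2 ()))).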
Wand's Mogensen-Scott encoding is a similar transformation to this preprocessing of Lisp expressions using cons etc. (setting aside how the encoding treats variables, which is a whole other can of worms, because lambda-calculus variables behave very differently from Lisp symbols).
However, the encoding in Wand's paper doesn't prevent trivialization, because it's used during term reduction. Our Lisp preprocessor eliminated general quotation before term reduction ever started, so that general quotation didn't even have to be represented in the term syntax for the operational semantics. If you wait until reduction has already started, it's too late for the encoding to matter to trivialization; you already have a trivializing context, and the encoding merely facilitates data access.
This, however, raises a curious possibility: what if we preprocessed Wand's language by encoding the arguments to all the operators? Then, instead of encoding an operand when applying a fexpr to it, we decode an operand when applying a non-fexpr to it (or when evaluating it, a case that Wand already provides for — except that when evaluating a term, we would only decode the operators, not their operands, since we don't decode operands until we know they're to be evaluated).
One thing about this: if we're going to decode an operand of a non-fexpr, we need to keep track of that by somehow changing the operator to indicate we've already done the decoding. The simplest way to do this is to wrap non-fexprs (as my treatment does) instead of wrapping fexprs (as Wand's does). There are other, clumsier ways to do it without altering Wand's syntax, but I'll go ahead and change the syntax.
T ::= x | (λx.T) | (TT) | (wrap T) | (eval T)

Notice the wrapper is now called wrap instead of fexpr, but the basic binding constructor of combiners is still called λ. Actually, in my dissertation it's called vau rather than lambda (hence "vau-calculus"), but there's also an interesting point to be made by leaving it as λ (setting aside that Unicode doesn't have a character for the way I write a lower-case vau). It's just arbitrarily chosen syntax, after all... right?
Our evaluation contexts for the operational step change slightly because, beyond the switch from fexpr to wrap, we're now willing for the operational step to reduce an operand whenever the operator is a value, regardless of whether the operator is a fexpr.
E ::= ⎕ | (ET) | (VE) | (wrap E) | (eval E)

The operational step rules differ mainly in their treatment of encoding and decoding. There is no encoding during reduction. As for decoding, it is only partial: we decode everything except the operands of combinations. (Specifying this precisely would be tedious and contribute nothing to the discussion; but then, the same was already true of the encoding/decoding, which is why we never specified them precisely in the first place.)
E[((λx.T)V)] ↦ E[T[x ← V]]
E[(wrap V)T] ↦ E[(V partial-decode(T))]
E[(eval T)] ↦ E[partial-decode(T)]

Note that the first rule, for applying a λ-expression to an operand, hasn't changed at all, even though a λ-expression is now an operative where in Wand's treatment it was an applicative.
Given a source program, before we set the operational step loose on the term, we preprocess the program by fully encoding it, and then partial-decoding it. Hence the complete absence of encoding in the rules of the operational semantics. This has the expected, but to me still rather stunning, effect that there is no longer any context in the rewriting system that has to be off-limits to rewriting. We can therefore now define a valid compatible calculus step relation by simply removing the evaluation contexts from the operational step rules:
((λx.T)V) → T[x ← V]
(wrap V)T → (V partial-decode(T))
(eval T) → partial-decode(T)

I'll go only slightly out on a limb and say that all three master theorems probably hold for this arrangement — completeness, Church-Rosser-ness, and soundness. (Completeness is obvious by construction.) But what's really interesting here is that this arrangement both violates Wand's restriction on the term set and produces an equivalence relation that doesn't correspond to the one in his paper.
The term-set restriction is violated because source expressions, which Wand was studying exclusively, are now the things that are input to the preprocessor, and the preprocessor maps those source expressions into a proper subset of the terms in our semantics. And the contextual equivalence on source expressions that Wand studied is concerned with source expressions that don't even effectively belong to our term set at all, since they're mapped through the refractive lens of this encode/partial-decode preprocessor. Yes, our ≅ is no longer a trivial equivalence — but Wand's notion of contextual equivalence of source expressions would require those source expressions to have interchangeable full encodings (in case they both appear as an operand to a fexpr), and that still won't happen unless the source expressions were syntactically identical before preprocessing. The source-expression equivalence Wand was interested in is still trivial, and our non-trivial ≅ is simply not relevant under the treatment used in his paper.
One last thought about this preprocessed, pseudo-explicit-evaluation variant of Wand's semantics. The first rule of the calculus step is the call-by-value β-rule. So this calculus is (if I've not made a goof somewhere) a conservative extension of call-by-value lambda-calculus, and its equational theory includes all the equations in the theory of call-by-value lambda-calculus plus others.
Vau-calculus

Although my pseudo-explicit-evaluation variant of Wand's system does demonstrate some of the principles involved, the encoding/decoding and the preprocessor are obstacles to extracting clear insights from it. Unlike Wand, I'm not trying to prove a no-go theorem; I'm showing that something can be done, so my interest lies in making it clear how to do it, and I'll happily introduce more syntax in order to avoid a complication such as Mogensen-Scott encoding, or preprocessing.
Instead of distinguishing data from to-be-evaluated expressions by means of a deep encoding (in which the entire term is transformed, from its root to its leaves), I simply introduce syntax for fully representing Lisp source expressions as data structures, entirely separate from the syntax for things like combinations and procedures (noting that procedures cannot be expressed by Lisp source expressions: you write a source expression that would evaluate to a procedure, but until you commit to evaluating it, it's not a procedure but a structure built out of atomic data, symbols, pairs, and the empty list).
S ::= d | ()

Here, d is any atomic datum (archetypically, a numeric constant), and s is any symbol (which is another kind of datum, and not to be confused with a variable, which is not a source-code element at all). I've separated out the atomic data and the empty list into a separate nonterminal S, mnemonic for "Self-evaluating", because these kinds of terms will be treated separately for purposes of evaluation.
T ::= S | s | (T . T)
All of the terms constructed by the above rules are data, by which I mean, they're irreducible observables; none of those terms can ever occur on the left side of a calculus step. (Therefore, given the master theorems, the operational theory of those terms is trivial.) There are just two syntactic contexts that can ever form a redex; I'll use nonterminal symbol A for these, mnemonic for "active":
S ::= d | ()
A ::= [eval T T] | [combine T T T]
T ::= S | s | (T . T) | A

The intent here (to be realized by calculus step rules) is that [eval T1 T2] represents scheduled evaluation of term T1 in environment T2, while [combine T1 T2 T3] represents scheduled calling of combiner T1 with parameters T2 in environment T3.
There's also one other class of syntax: computational results that can't be represented by source code. There are three kinds of these: environments (which we will clearly need to support our evaluation rules), operatives, and applicatives. We'll use "e" for environments and "O" for operatives.
O ::= [vau x.T] | ...
S ::= d | () | e | O
A ::= [eval T T] | [combine T T T]
T ::= x | S | s | (T . T) | [wrap T] | A

The syntactic form of operative expressions, O, is interesting in itself, and in fact even the inclusion of traditional λ-expressions (which is what you see there, although the operator is called "vau" instead of "λ") is worthy of discussion. I figure those issues would distract from the focus of this post, though, so I'm deferring them for some future post. [Those issues arise in a later post, here.]
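For reference, here is this term set transcribed into Haskell data types (a fresh sketch, separate from the earlier transcription of Wand's syntax; environments are simplified to association lists, and the elided operative forms are omitted):

type Datum = Integer            -- d: archetypically, a numeric constant
type Env   = [(String, Term)]   -- e: what lookup(s,e) consults

data Oper = Vau String Term     -- O ::= [vau x.T] | ...  (other forms elided)

data Self                       -- S: the self-evaluating terms
  = SDatum Datum                --   d
  | SNil                        --   ()
  | SEnv Env                    --   e
  | SOper Oper                  --   O

data Active                     -- A: the only two redex-forming contexts
  = Eval Term Term              --   [eval T T]
  | Combine Term Term Term      --   [combine T T T]

data Term
  = TVar String                 -- x: variables (not Lisp symbols)
  | TSelf Self                  -- S
  | TSym String                 -- s: symbols, a kind of datum
  | TPair Term Term             -- (T . T)
  | TWrap Term                  -- [wrap T]: applicatives
  | TActive Active              -- A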
Here, then, are the rules of the calculus step (omitting rules for forms of operatives that I haven't enumerated above).
[eval S e] → S
[eval s e] → lookup(s,e)   if lookup(s,e) is defined
[eval (T1 . T2) e] → [combine [eval T1 e] T2 e]
[eval [wrap T] e] → [wrap [eval T e]]
[combine [vau x.T] V e] → T[x ← V]
[combine [wrap T0] (T1 ... Tn) e] → [combine T0 ([eval T1 e] ... [eval Tn e]) e]

In the fully expounded vau-calculus, the additional forms of operatives do things like parsing the operand list, and invoking primitives (such as car, and of course $vau).
Under these rules, every redex —that is, every term that can occur on the left side of a calculus step rule— is either an eval or a combine: reduction occurs only at points where it is explicitly scheduled via eval or combine; and if it is explicitly scheduled to occur at a point in the syntax tree, that reduction cannot be prevented by any surrounding context.
The calculus step rules are, in essence, the core logic of a Lisp interpreter.
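Here is that core logic as a minimal Haskell sketch (my illustration, using the data types above). The calculus itself is a nondeterministic small-step system; this function commits to one recursive strategy, and stuck cases (an unbound symbol, [eval x e] for a variable x, and so on) are left unhandled or returned as-is:

-- The eval rules, read as interpreter logic.
evalT :: Term -> Env -> Term
evalT (TSelf s)     _ = TSelf s                     -- [eval S e] → S
evalT (TSym s)      e = maybe (error "unbound symbol") id (lookup s e)
evalT (TPair t1 t2) e = combineT (evalT t1 e) t2 e  -- [eval (T1 . T2) e]
evalT (TWrap t)     e = TWrap (evalT t e)           -- [eval [wrap T] e]
evalT t             _ = t                           -- stuck cases, left alone

-- The combine rules: an operative substitutes its whole operand list
-- unevaluated; an applicative evaluates the operands first, then unwraps.
combineT :: Term -> Term -> Env -> Term
combineT (TSelf (SOper (Vau x body))) v _ = subst x v body
combineT (TWrap t0) operands e =
  combineT t0 (mapList (\t -> evalT t e) operands) e
combineT t _ _ = t                                  -- stuck cases, left alone

-- Map over a Lisp-style operand list (T1 ... Tn).
mapList :: (Term -> Term) -> Term -> Term
mapList f (TPair a d) = TPair (f a) (mapList f d)
mapList _ t           = t

-- Naive substitution; a real treatment would avoid variable capture.
subst :: String -> Term -> Term -> Term
subst x v (TVar y) | y == x = v
subst x v (TPair a d)       = TPair (subst x v a) (subst x v d)
subst x v (TWrap t)         = TWrap (subst x v t)
subst x v (TSelf (SOper (Vau y b))) | y /= x = TSelf (SOper (Vau y (subst x v b)))
subst x v (TActive (Eval t1 t2))       = TActive (Eval (subst x v t1) (subst x v t2))
subst x v (TActive (Combine t1 t2 t3)) = TActive (Combine (subst x v t1) (subst x v t2) (subst x v t3))
subst _ _ t                 = t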
This calculus allows us to make much more fluent statements about the evaluation-behavior of terms than mere operational equivalence. We can say, for example, that two terms T1 and T2 would be indistinguishable if evaluated in any environment, no matter what that environment is (and even though T1 and T2 may not themselves be operationally equivalent).
For all environments e, [eval T1 e] ≅ [eval T2 e]

Thanks to the definition of ≅, when we say [eval T1 e] ≅ [eval T2 e], we mean
For all contexts C and observables V, C[eval T1 e] →* V iff C[eval T2 e] →* V

This is evidently not the question Wand's treatment asks with its contextual equivalence. However, we can also define Wand's relation in this framework. Let e0 be a standard environment. Then,
T1 ∼ T2 iff for all contexts C, [eval C[T1] e0] ≅ [eval C[T2] e0]

Relation ∼ is Wand's contextual equivalence. And, indeed, for all source expressions S1 and S2, S1 ∼ S2 iff S1 and S2 are syntactically identical.
Finally, note what happens if we omit from this calculus all of the syntax for symbols and lists, along with its associated machinery (notably, eval and environments; the only change this makes to the remaining elements is that we drop the third operand to [combine ...]).
O ::= [vau x.T]
S ::= d | O
A ::= [combine T T]
T ::= x | S | A
[combine [vau x.T] V] → T[x ← V]

Do you recognize it? You should. It's the call-by-value lambda-calculus.
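Making the renaming explicit: write [vau x.T] as (λx.T) and [combine T1 T2] as (T1 T2), and the remaining grammar collapses to

T ::= x | d | (λx.T) | (TT)

with the single rule ((λx.T)V) → T[x ← V], which is the call-by-value β-rule over a λ-calculus with inert constants d.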
[edit: besides some minor typos here and there, this post had originally omitted the vau-calculus rule for combining an applicative, [combine [wrap T0] ...].]