Showing posts with label fexprs. Show all posts
Showing posts with label fexprs. Show all posts

Monday, August 29, 2016

Interpreted programming languages

Last night I drifted off while reading a Lisp book.
xkcd 224.

It's finally registered on me that much of the past half century of misunderstandings and confusions about the semantics of Lisp, of quotation, and, yes, of fexprs, can be accounted for by failure to recognize there is a theoretical difference between an interpreted programming language and a compiled programming language.  If we fail to take this difference into account, our mathematical technology for studying compiled programming languages will fail when applied to interpreted languages, leading to loss of coherence in language designs and a tendency to blame the language rather than the theory.

Technically, a compiler translates the program into an executable form to be run thereafter, while an interpreter figures out what the program says to do and just does it immediately.  Compilation allows higher-performance execution, because the compiler takes care of reasoning about source-code before execution, usually including how to optimize the translation for whatever resources are prioritized (time, space, peripherals).  It's easy to suppose this is all there is to it; what the computer does is an obvious focus for attention.  One might then suppose that interpretation is a sort of cut-rate alternative to compilation, a quick-and-dirty way to implement a language if you don't care about performance.  I think this misses some crucial point about interpretation, some insight to be found not in the computer, but in the mind of the programmer.  I don't understand that crucial insight clearly enough — yet — to focus a whole blog post on it; but meanwhile, there's this theoretical distinction between the two strategies which also isn't to be found in the computer's-eye view, and which I do understand enough about to focus this blog post on.

It's not safe to say the language is the same regardless of which way it's processed, because the language design and the processing strategy aren't really independent.  In principle a given language might be processed either way; but the two strategies provide different conceptual frameworks for thinking about the language, lending themselves to different language design choices, with different purposes — for which different mathematical properties are of interest and different mathematical techniques are effective.  This is a situation where formal treatment has to be considered in the context of intent.  (I previously blogged about another case where this happens, formal grammars and term-rewriting calculi, which are both term-rewriting systems but have nearly diametrically opposite purposes; over t‍ha‍r.)

I was set onto the topic of this post recently by some questions I was asked about Robert Muller's M-Lisp.  My dissertation mentions Muller's work only lightly, because Muller's work and mine are so far apart.  However, because they seem to start from the same place yet lead in such different directions, one naturally wants to know why, and I've struck on a way to make sense of it:  starting from the ink-blot of Lisp, Muller and I both looked to find a nearby design with greater elegance — and we ended up with vastly different languages because Muller's search space was shaped by a conceptual framework of compilation while mine was shaped by a conceptual framework of interpretation.  I will point out, below, where his path and mine part company, and remark briefly on how this divergence affected his destination.

The mathematical technology involved here, I looked at from a lower-level perspective in an earlier post.  It turns out, from a higher-level perspective, that the technology itself can be used for both kinds of languages, but certain details in the way it is usually applied only work with compiled languages and, when applied to interpreted languages, result in the trivialization of theory noted by Wand's classic paper, "The Theory of Fexprs is Trivial".

Contents
M-language
Compilation
Universal program
S-language
Half a century's worth of misunderstandings and confusions
M-language

I'll explore this through a toy programming language which I'll then modify, starting with something moderately similar to what McCarthy originally described for Lisp (before it got unexpectedly implemented).

This is a compiled programming language, without imperative features, similar to λ-calculus, for manipulating data structures that are nested trees of atomic symbols.  The syntax of this language has two kinds of expressions:  S-expressions (S for Symbolic), which don't specify any computation but merely specify constant data structures — trees of atomic symbols — and M-expressions (M for Meta), which specify computations that manipulate these data structures.

S-expressions, the constants of the language, take five forms:

S   ::=   s | () | #t | #f | (S . S)   .
That is, an S-expression is either a symbol, an empty list, true, false, or a pair (whose elements are called, following Lisp tradition, its car and cdr).  A symbol name is a sequence of one or more upper-case letters.  (There should be no need, given the light usage in this post, for any elaborate convention to clarify the difference between a symbol name and a nonterminal such as S.)

I'll assume the usual shorthand for lists, (S ...) ≡ (S . (...)), so for example (FOO BAR QUUX) ≡ (FOO . (BAR . (QUUX . ()))); but I won't complicate the formal grammar with this shorthand since it doesn't impact the abstract syntax.

The forms of M-expressions start out looking exactly like λ-calculus, then add on several other compound forms and, of course, S-expressions which are constants:

M   ::=   x | [λx.M] | [M M] | [if M M M] | [car M] | [cdr M] | [cons M M] | [eq? M M] | S   .
The first form is a variable, represented here by nonterminal x.  A variable name will be a sequence of one or more lower-case letters.  Upper- versus lower-case letters is how McCarthy distinguished between symbols and variables in his original description of Lisp.

Variables, unlike symbols, are not constants; rather, variables are part of the computational infrastructure of the language, and any variable might stand for an arbitrary computation M.

The second form constructs a unary function, via λ; the third applies a unary function to a single argument.  if expects its first argument to be boolean, and returns its second if the first is true, third if the first is false.  car and cdr expect their argument to be a pair and extract its first and second element respectively.  cons constructs a pair.  eq? expects its two arguments to be S-expressions, and returns #t if they're identical, #f if they're different.

Compound unary functions, constructed by λ, are almost first-class.  They can be returned as the values of expressions, and they can be assigned to variables; but as the grammar is set out, they cannot appear as elements of data structures.  A pair expression is built up from two S-expressions, and a compound unary function is not an S-expression.  McCarthy's original description of Lisp defines S-expressions this way; his stated purpose was only to manipulate trees of symbols.  Trained as I am in the much later Lisp culture of first-class functions and minimal constraints, it felt unnatural to follow this narrower definition of S-expressions; but in the end I had to define it so.  I'm trying to reproduce the essential factors that caused Lisp to come out the way it did, and, strange to tell, everything might have come out differently if not for the curtailed definition of S-expressions.

(One might ask, couldn't we indirectly construct a pair with a function in it using cons?  A pair with a function in it would thus be a term that can't be represented directly by a source expression.  This point likely doesn't matter directly to how things turned out; but fwiw, I suspect McCarthy didn't have in mind to allow that, no.  It's entirely possible he also hadn't really thought about it yet at the preliminary stage of design where this detail had its impact on the future of Lisp.  It's the sort of design issue one often discovers by playing around with a prototype — and in this case, playing around with a prototype is how things got out of hand; more on that later.)

Compilation

Somehow we want to specify the meanings of programs in our programming language.  Over the decades, a number of techniques for formal PL semantics have been entertained.  One of the first things tried was to set up a term-rewriting machine modeled roughly on λ-calculus, that would perform small finite steps until it reached a final state; that was Peter Landin's notion, when Christopher Strachey set him onto the problem in the 1960s.  It took some years to get the kinks out of that approach, and meanwhile other techniques were tried — such as denotational semantics, meta-circular evaluation — but setting up a term-rewriting calculus has been quite a popular technique since the major technical obstacles to it were overcome.  Of the three major computational models tied together in the 1930s by the Church-Turing thesis, two of them were based on term-rewriting:  Turing machines, which were the convincing model, the one that lent additional credibility to the others when they were all proven equivalent; and λ-calculus, which had mathematical elegance that the nuts-and-bolts Turing-machine model lacked.  The modern "small-step" rewriting approach to semantics (as opposed to "big-step", where one deduces how to go in a single step from start to finish) does a credible job of combining the strengths of Turing-machines and λ-calculus; I've a preference for it myself, and it's the strategy I'll use here.  I described the technique in somewhat more depth in my previous post on this material.

Small-step semantics applies easily to this toy language because every intermediate computational state of the system is naturally represented by a source-code expression.  That is, there is no obvious need to go beyond the source-code grammar we've already written.  Some features that have been omitted from the toy language would make it more difficult to limit all computational states to source expressions; generally these would be "stateful" features, such as input/output or a mutable store.  Landin used a rewriting system whose terms (computational states) were not source-code expressions.  One might ask whether there are any language features that would make it impossible to limit computational states to source expressions, and the answer is essentially yes, there are — features related not to statefulness, but to interpretation.  For now, though, we'll assume that all terms are source expressions.

We can define the semantics of the language in reasonably few rewriting schemata.

[[λx.M1] M2]   →   M1[x ← M2]
[if #t M1 M2]   →   M1
[if #f M1 M2]   →   M2
[car (S1 . S2)]   →   S1
[cdr (S1 . S2)]   →   S2
[cons S1 S2]   →   (S1 . S2)
[eq? S S]   →   #t
[eq? S1 S2]   →   #f     if the Sk are different.

Rewriting relation → is the compatible closure of these schemata; that is, for any context C and terms Mk, if M1 → M2 then C[M1] → C[M2].  Relation → is also Church-Rosser:  although a given term M may be rewritable in more than one way, any resulting difference can always be eliminated by later rewriting.  That is, the reflexive transitive closure →* has the diamond property:  if M1 →* M2 and M1 →* M3, then there exists M4 such that M2 →* M4 and M3 →* M4.

Formal equality of terms, =, is the symmetric closure of →* (thus, the reflexive symmetric transitive compatible closure of the schemata, which is to say, the least congruence containing the schemata).

Another important relation is operational equivalence, ≅.  Two terms are operationally equivalent just if replacing either by the other in any possible context preserves the observable result of the computation.  M1 ≅ M2 iff for every context C and S-expression S,  C[M1] ↦* S  iff  C[M2] ↦* S.  (Fwiw, relation ↦ is what the computation actually does, versus → which is anything the rewriting calculus could do; → is compatible Church-Rosser and therefore nice to work with mathematically, but ↦ is deterministic and therefore we can be sure it does what we meant it to.  Another way of putting it is that → has the mathematical character of λ-calculus while ↦ has the practical character of Turing-machines.)

Operational equivalence is exactly what must be guaranteed in order for an optimizing compiler to safely perform a local source-to-source transformation:  as long as the two terms are operationally equivalent, the compiler can replace one with the other in any context.  The rewriting calculus is operationally sound if formal equality implies operational equivalence; then the rewriting calculus can supply proofs of operational equivalence for use in optimizing compilation.

Before moving on, two other points of interest about operational equivalence:

  • Operational equivalence of S-expressions is trivial; that is, S1 and S2 are operationally equivalent only if they are identical.  This follows immediately by plugging the trivial context into the definition of operational equivalence, C[Sk] ≡ Sk.  Thus, in every non-trivial operational equivalence M1 ≅ M2, at least one of the Mk is not an S-expression.

  • All terms in the calculus — all M-expressions — are source expressions; but if there were any terms in the calculus that were not source expressions, they would be irrelevant to a source-to-source optimizer; however, if these non-source terms could be usefully understood as terms in an intermediate language used by the compiler, an optimizer might still be able to make use of them and their formal equalities.

Universal program

McCarthy's Lisp language was still in its infancy when the project took an uncontrollable turn in a radically different direction than McCarthy had envisioned going with it.  Here's what happened.

A standard exercise in theory of computation is to construct a universal Turing machine, which can take as input an encoding of an arbitrary Turing machine T and an input w to T, and simulate what T would do given input w.  This is an extremely tedious exercise; the input to a Turing machine looks nothing like the control device of a Turing machine, so the encoding is highly intrusive, and the control device of the universal machine is something of an unholy mess.  McCarthy set out to lend his new programming language mathematical street cred by showing that not only could it simulate itself with a universal program, but the encoding would be much more lucid and the logic simpler in contrast to the unholy mess of the universal Turing machine.

The first step of this plan was to describe an encoding of programs in the form of data structures that could be used as input to a program.  That is to say, an encoding of M-expressions as S-expressions.  Much of this is a very straightforward homomorphism, recursively mapping the non-data M-expression structure onto corresponding S-expressions; for our toy language, encoding φ would have

  • φ(x) = symbol s formed by changing the letters of its name from lower-case to upper-case. Thus, φ(foo) = FOO.
  • φ([λx.M]) = (LAMBDA φ(x) φ(M)).
  • φ([M1 M2]) = (φ(M1) φ(M2)).
  • φ([if M1 M2 M3]) = (IF φ(M1) φ(M2) φ(M3)).
  • φ([car M]) = (CAR φ(M)).
  • φ([cdr M]) = (CDR φ(M)).
  • φ([cons M1 M2]) = (CONS φ(M1) φ(M2)).
  • φ([eq? M1 M2]) = (EQ φ(M1) φ(M2)).

(This encoding ignores the small detail that certain symbol names used in the encoding — LAMBDA, IF, CAR, CDR, CONS, EQ — must not also be used as variable names, if the encoding is to behave correctly.  McCarthy seems not to have been fussed about this detail, and nor should we be.)

For a proper encoding, though, S-expressions have to be encoded in a way that unambiguously distinguishes them from the encodings of other M-expressions.  McCarthy's solution was

  • φ(S) = (QUOTE S).

Now, in some ways this is quite a good solution.  It has the virtue of simplicity, cutting the Gordian knot.  It preserves the readability of the encoded S-expression, which supports McCarthy's desire for a lucid encoding.  The main objection one could raise is that it isn't homomorphic; that is, φ((S1 . S2)) is not built up from φ(S1) and φ(S2).

As McCarthy later recounted, they had expected to have plenty of time to refine the language design before it could be implemented.  (The FORTRAN compiler, after all, had been a massive undertaking.)  Meanwhile, to experiment with the language they began hand-implementing particular functions.  The flaw in this plan was that, because McCarthy had been so successful in demonstrating a universal Lisp function eval with simple logic, it wasn't difficult to hand-implement eval; and, because he had been so successful in making the encoding lucid, this instantly produced a highly usable Lisp interpreter.  The sudden implementation precipitated a user community and substantial commitment to specifics of what had been a preliminary language design.

All this might have turned out differently if the preliminary design had allowed first-class functions to be elements in pairs.  A function has to be encoded, homomorphically, which would require a homomorphic encoding of pairs, perhaps φ((M1 . M2)) = (CONS φ(M1) φ(M2)); once we allow arbitrary M-expressions within the pair syntax, (M1 . M2), that syntax itself becomes a pair constructor and there's really no need for a separate cons operator in the M-language; then CONS can encode the one constructor.  One might then reasonably restrict QUOTE to base cases; more, self-encode () #t and #f, leaving only the case of encoding symbols, and rename QUOTE to SYMBOL.  The encoding would then be fully homomorphic — but the encodings of large constant data structures would become unreadable.  For example, fairly readable constant structure

((LAMBDA X (X Y)) FOO)
would encode through φ as
(CONS (CONS (SYMBOL LAMBDA) (CONS (SYMBOL X) (CONS (CONS (SYMBOL X) (CONS (SYMBOL Y) ())) ()))) (CONS (SYMBOL FOO) ()))   .
That didn't happen, though.

The homomorphic, non-QUOTE encoding would naturally tend to produce a universal function with no practical potential for meta-programming.  In theory, one could use the non-homomorphic QUOTE encoding and still not offer any native meta-programming power.  However, the QUOTE-based encoding means there are data structures lying around that look exactly like working executable code except that they happen to be QUOTE‍d.  In practice, the psychology of the notation makes it rather inevitable that various facilities in the language would allow a blurring of the line between data and code.  Lisp, I was told when first taught the language in the 1980s, treats code as data.  Sic:  I was not told Lisp treats data as code, but that it treats code as data.

In other words, Lisp had accidentally become an interpreted language; a profoundly different beast than the compiled language McCarthy had set out to create, and one whose character naturally suggests a whole different set of features that would not have occurred to someone designing a compiled language in 1960.  There were, of course, some blunders along the way (dynamic scope is an especially famous one, and I would rate the abandonment of fexprs in favor of macros as another of similar magnitude); but in retrospect I see all that as part of exploring a whole new design space of interpreted features.  Except that over the past three decades or so the Lisp community seems to have somewhat lost track of its interpreted roots; but, I'll get back to that.

Of interest:

  • In S-expression Lisp, all source expressions are S-expressions.  It is no less true now than before that an operational equivalence M1 ≅ M2 can only be nontrivial if at least one of the Mk is not an S-expression; but now, if the Mk are source expressions, we can be absolutely certain that they are both S-expressions.  So if the operational equivalence relation is restricted to source expressions, it's trivial.  This isn't disastrous; it just means that, in order to have nontrivial theory, we are going to have to have some terms that are not source expressions (as Landin did, though for a different reason); and if we choose to compile the language, we won't be allowed to call our optimizations "local source-to-source" (any given optimization could be one or the other, but not both at once).

  • This is the fork in the road where Muller and I went our separate ways.  Muller's M-Lisp, taking the compiled-language view, supposes that S-expressions are encoded homomorphically, resulting in a baseline language with no native meta-programming power.  He then considers how to add some meta-programming power to the resulting language.  However, practical meta-programming requires the programmer be able to write lucid code that can also be manipulated as data; and the homomorphic encoding isn't lucid.  So instead, meta-programming in Muller's extended M-Lisp uses general M-expressions directly (rather than their encodings).  If an M-expression turns out to be wanted as data, it then gets belatedly encoded — with the drawback that the M-expression can't be rewritten until the rewriting schemata can tell it won't be needed as data.  This causes difficulties with operational equivalence of general M-expressions; in effect, as the burden of meta-programming is shifted from S-expressions to general M-expressions, it carries along with it the operational-equivalence difficulties that had been limited to S-expressions.

S-language

McCarthy hadn't finished the details of M-expressions, so S-expression Lisp wasn't altogether an encoding of anything; it was itself, leaving its implementors rather free to invent it as they went along.  Blurring the boundary between quoted data and unquoted code provided meta-programming facilities that hadn't been available in compiled languages (essentially, I suggest, a sort of flexibility we enjoy in natural languages).  In addition to QUOTE itself (which has a rather fraught history entangled with first-class functions and dynamic scope, cf. §3.3.2 of my dissertation), from very early on the language had fexprs, which are like LAMBDA-constructed functions except that they treat their operands as data — as if the operands had been QUOTE‍d — rather than evaluating them as code (which may later be done, if desired, explicitly by the fexpr using EVAL).  In 1963, macros were added — not the mere template-substitution macros found in various other languages, but macros that treat their operands as data and perform an arbitrary computation to generate a data structure as output, which is then interpreted as code at the point of call.

But how exactly do we specify the meanings of programs in this interpreted S-expression language?  We could resort to the meta-circular evaluator technique; this is a pretty natural strategy since an evaluator is exactly what we have as our primary definition of the language.  That approach, though, is difficult to work with mathematically, and in particular doesn't lend itself to proofs of operational equivalence.  If we try to construct a rewriting system the way we did before, we immediately run into the glaring practical problem that the same representation is used for executable code, which we want to have nontrivial theory, and passive data which necessarily has perfectly trivial theory.  That is, as noted earlier, all source expressions are S-expressions and operational equivalence of S-expressions is trivial.  It's possible to elaborate in vivid detail the theoretical train wreck that results from naive application of the usual rewriting semantics strategy to Lisp with quotation (or, worse, fexprs); but this seems to be mainly of interest if one is trying to prove that something can't be done.  I'm interested in what can be done.

If what you want is nontrivial theory, that could in principle be used to guide optimizations, this is not difficult to arrange (once you know how; cf. my past discussion of profundity index).  As mentioned, all nontrivial operational equivalences must have at least one of the two terms not a source expression (S-expression), therefore we need some terms that aren't source expressions; and our particular difficulty here is having no way to mark a source expression unmistakably as code; so, introduce a primitive context that says "evaluate this source expression".  The new context only helps with operational equivalence if it's immune to QUOTE, and no source expression is immune to QUOTE, so that's yet another way to see that the new context must form a term that isn't a source expression.

Whereas the syntax of the compiled M-language had two kinds of terms — constant S-expressions and computational M-expressions — the syntax of the interpreted S-language will have three kinds of terms.  There are, again, the "constant" terms, the S-expressions, which are now exactly the source expressions.  There are the "computational" terms, which are needed for the actual work of computation; these are collectively shaped something like λ-calculus.  We expect a big part of the equational strength of the rewriting calculus to reside in these computational terms, roughly the same equational strength as λ-calculus itself, and therefore of course those terms have to be entirely separate from the source expressions which can't have nontrivial equational theory.  And then there are the "interpretation" terms, the ones that orchestrate the gradual conversion of source expressions into computational expressions.  The code-marking terms are of this sort.  The rewriting rules involving these "interpretation" terms will amount to an algorithm for interpreting source code.

This neat division of terms into three groups won't really be as crisp as I've just made it sound.  Interpretation is by nature a gradual process whose coordination seeps into other parts of the grammar.  Some non-interpretation terms will carry along environment information, in order to make it available for later use.  This blurring of boundaries is perhaps another part of the flexibility that (I'll again suggest) makes interpreted languages more similar to natural languages.

I'll use nonterminal T for arbitrary terms.  Here are the interpretation forms.

T   ::=   [eval T T] | ⟨wrap T⟩ | e   .
Form [eval T1 T2] is a term that stands for evaluating a term T1 in an environment T2.  This immediately allows us to distinguish between statements such as
S1 ≅ S2
[eval S1 e0] ≅ [eval S2 e0]
∀e, [eval S1 e] ≅ [eval S2 e]   .
The first proposition is the excessively strong statement that S-expressions Sk are operationally equivalent — interchangeable in any context — which can only be true if the Sk are identically the same S-expression.  The second proposition says that evaluating S1 in environment e0 is operationally equivalent to evaluating S2 in environment e0 — that is, for all contexts C and all S-expressions S, C[[eval S1 e0]] ↦* S  iff  C[[eval S2 e0]] ↦* S.  The third proposition says that evaluating S1 in any environment e is operationally equivalent to evaluating S2 in the same e — which is what we would ordinarily have meant, in a compiled language, if we said that two executable code (as opposed to data) expressions Sk were operationally equivalent.

The second form, ⟨wrap T⟩, is a wrapper placed around a function T, that induces evaluation of the operand passed to the function.  If T is used without such a wrapper (and presuming T isn't already a wrapped function), it acts directly on its unevaluated operand — that is, T is a fexpr.

The third form, e, is simply an environment.  An environment is a series of symbol-value bindings, ⟪sk←Tk⟫; there's no need to go into gory detail here (though I did say more in a previous post).

The computational forms are, as mentioned, similar to λ-calculus with some environments carried along.

T   ::=   x | [combine T T T] | ⟨λx.T⟩ | ⟨εx.T⟩   .
Here we have a variable, a combination, and two kinds of function.  Form ⟨λx.T⟩ is a function that substitutes its operand for x in its body T.  Variant ⟨εx.T⟩ substitutes its dynamic environment for x in its body T.

Form [combine T1 T2 T3] is a term that stands for combining a function T1 with an operand T2 in a dynamic environment T3.  The dynamic environment is the set of bindings in force at the point where the function is called; as opposed to the static environment, the set of bindings in force at the point where the function is constructed.  Static environments are built into the bodies of functions by the function constructor, so they don't show up in the grammar.  For example, [eval (LAMBDA X FOO) e0] would evaluate to a function with static environment e0, of the form ⟨wrap ⟨λx.[eval FOO ⟪...⟫]⟩⟩ with the contents of e0 embedded somewhere in the ⟪...⟫.

Putting it all together,

T   ::=   x | [combine T T T] | ⟨λx.T⟩ | ⟨εx.T⟩ | [eval T T] | ⟨wrap T⟩ | e | S   .

The rewriting schemata naturally fall into two groups, those for internal computation and those for source-code interpretation.  (There are of course no schemata associated with the third group of syntactic forms, the syntactic forms for passive data, because passive.)  The computation schemata closely resemble λ-calculus, except with the second form of function used to capture the dynamic environment (which fexprs sometimes need).

[combine ⟨λx.T1⟩ T2 T3]   →   T1[x ← T2]
[combine ⟨εx.T1⟩ T2 T3]   →   [combine T1[x ← T3] T2 T3]   .
The interpretation schemata look very much like the dispatching logic of a Lisp interpreter.
[eval d T]   →   d
        if d is an empty list, boolean, λ-function, ε-function, or environment
[eval s e]   →   lookup(s,e)     if symbol s is bound in environment e
[eval (T1 T2) T3]   →   [combine T1 T2 T3]
[combine ⟨wrap T1⟩ T2 T3]   →   [combine T1 [eval T2 T3] T3]   .
(There would also be some atomic constants representing primitive first-class functions and reserved operators such as if, and schemata specifying what they do.)

Half a century's worth of misunderstandings and confusions

As I remarked earlier, Lisp as we know it might not have happened — at least, not when and where it did — if McCarthy had thought to allow first-class functions to occur in pairs.  The thing is, though, I don't think it's all that much of an "accident".  He didn't think to allow first-class functions to occur in pairs, and perhaps the reason we're likely to think to allow them today is that our thinking has been shaped by decades of the free-wheeling attitude fostered by the language that Lisp became because he didn't think to then.  The actual sequence of events seems less unlikely than one might first suppose.

Researchers trying to set up semantics for Lisp have been led astray, persistently over the decades, by the fact that the primary Lisp constructor of first-class functions is called LAMBDA.  Its behavior is not that of calculus λ, exactly because it's entangled with the process of interpreting Lisp source code.  This becomes apparent when contemplating rewriting calculi for Lisp of the sort I've constructed above (and have discussed before on this blog):  When you evaluate a LAMBDA expression you get a wrapped function, one that explicitly evaluates its operand and then passes the result to a computational function; that is, passes the result to a fexpr.  Scan that:  ordinary Lisp functions do not correspond directly to calculus λ-abstractions, but fexprs do correspond directly to calculus λ-abstractions.  In its relation to Lisp, λ-calculus is a formal calculus of fexprs.

Much consternation has also been devoted to the perceived theoretical difficulty presented by Lisp's quotation operator (and presented in more extreme form by fexprs), because it presents a particular context that can distinguish any two S-expressions placed into it:  (QUOTE S1) and (QUOTE S2) are observably distinct whenever the Sk are distinct from each other.  Yet, this observation only makes sense in a compiled programming language.  Back in the day, it would have been an unremarkable observation that Lisp only has syntax for data structures, no syntax at all for control.  Two syntactically distinct Lisp source expressions are operationally non-equivalent even without any quotation or fexpr context, because they don't represent programs at all; they're just passive data structures.  The context that makes a source expression code rather than data is patently not in the source; it's in what program you send the source expression to.  Conventional small-step operational semantics presumes the decision to compile, along with a trivial interpretive mapping between source expressions and internal computational terms (so the interpretive mapping doesn't have to appear explicitly in the rewriting schemata).  It is true that, without any such constructs as quotation or fexprs, there would be no reason to treat the language as interpreted rather than compiled; but once you've crossed that Rubicon, the particular constructs like quotation or fexprs are mere fragments of, and can be distractions from, the main theoretical challenge of defining the semantics of an interpreted language.

The evolution of Lisp features has itself been a long process of learning how best to realize the flexibility offered by interpreted language.  Fexprs were envisioned just about from the very beginning — 1960 — but were sabotaged by dynamic scope, a misfeature that resulted from early confusion over how to handle symbol bindings in an interpreter.  Macros were introduced in 1963, and unlike fexprs they lend themselves to preprocessing at compile time if one chooses to use a compiler; but macros really ought to be much less mathematically elegant... except that in the presence of dynamic scope, fexprs are virulently unstable.  Then there was the matter of first-class functions; that's an area where Lisp ought to have had a huge advantage; but first-class functions don't really come into their own without static scope (The Art of the Interpreter noted this) — and first-class functions also present a difficulty for compilers (which is why procedures in ALGOL were second-class).  The upshot was that after Lisp 1.5, when Lisp splintered into multiple dialects, first-class functions went into eclipse until they reemerged in the mid-1970s when Scheme introduced static scope into the language.  Fexprs held on longer but, ironically, were finally rejected by the Lisp community in 1980 — just a little ahead of the mainstream adoption of Scheme's static-scope innovation.  So for the next twenty years and more, Lisp had static scope and first-class functions, but macros and no fexprs.  Meanwhile, EVAL — key workhorse of meta-linguistic flexibility — was expunged from the new generation of mainstream Lisps and has had great difficulty getting back in.

The latter half of Lisp's history has been colored by a long-term trend in programming language design as a whole.  I've alluded to this several times above.  I have no specific sources to suggest for this; it's visible in the broad sweep of what languages were created, what research was done, and I've sensed it though my interactions with the Lisp community over the past thirty years.  When I learned Lisp in the mid-1980s, it was from the Revised Maclisp Manual, Saturday Evening Edition (which I can see a few feet away on my bookshelf as I write this, proof that manuals can be well-written).  Maclisp was a product of the mentality of the 1970s.  Scheme too was a product of that mentality.  And what comes through to me now, looking back, isn't the differences between those languages (different though they are), but that those people knew, gut deep, that Lisp is an interpreted language — philosophically, regardless of the technical details of the language processing software.  The classic paper I cited above for the relationship between first-class functions and static scope was one of the "lambda" papers associated with the development of Scheme: "The Art of the Interpreter".  The classic textbook — the Wizard Book — that emerged from the Scheme design is Structure and Interpretation of Computer Programs.

But then things changed.  Compilation had sometimes intruded into Lisp design, yes (with unfortunate results, as I've mentioned), but the intrusion became more systematic.  Amongst Scheme's other achievements it had provided improved compilation techniques, a positive development but which also encouraged greater focus on the challenges of compilation.  We refined our mathematical technology for language semantics of compiled languages, we devised complex type systems for use with compiled languages, more and more we designed our languages to fit these technologies — and as Lisp didn't fit, more and more we tweaked our Lisp dialects to try to make them fit.  Of course some of the indigenous features of Lisp couldn't fit, because the mathematical tools were fundamentally incompatible with them (no pun intended).  And somewhere along the line, somehow, we forgot, perhaps not entirely but enough, that Lisp is interpreted.  Second-class syntax has lately been treated more and more as if it were a primary part of the language, rather than a distraction from the core design.  Whatever merits such languages have, wholeheartedly embracing the interpretation design stance is not among them.

I'm a believer in trying more rather than less.  I don't begrudge anyone their opportunity to follow the design path that speaks to them; but not all those paths speak to me.  Second-class syntax doesn't speak to me, nor recasting Lisp into a compiled language.  I'm interested in compiling Lisp, but want the language design to direct those efforts rather than the other way around.  To me, the potential of interpretation beckons; the exciting things we've already found on that path suggest to me there's more to find further along, and the only way to know is to follow the path and see.  To do that, it seems to me we have to recognize that it is a distinct path, the distinct philosophy of interpretation; and, in company with that, we need to hone our mathematical technology for interpreted languages.

These are your father's parentheses
Elegant weapons
for a more... civilized age.
xkcd 297.

Wednesday, July 31, 2013

Explicit evaluation

...if you cannot say what you mean, you can never mean what you say.
— Centauri Minister of Intelligence, Babylon 5.

Ever since I started this blog, I've had in mind to devote a post to the relationship between the strong theory of vau-calculus and the no-go theorem of Mitchell Wand's 1998 paper The Theory of Fexprs is Trivial.  Some of my past attempts to explain this relationship have failed miserably, though, so before trying again I wanted some new insight into how to approach the explanation.  Following my mention of this issue in my recent post on bypassing no-go theorems, and some off-blog discussion it provoked, I'm going to take a stab at it.

The trouble has always been that while my result is very simple, and Wand's result is very simple, and both treatments use quite conventional mathematical techniques (playing the game strictly according to Hoyle), they're based on bewilderingly different conceptual frameworks.  There's not just one difference, but several, and if you try to explain any one of them, your audience is likely to get hung up on one or more of the others.

Here are some highlights of how these conceptual differences stack up; I'll look at things in more detail hereafter.

  • Wand's paper assumes that the only equivalences of interest are those between source expressions; he acknowledges this explicitly in the paper.
  • Wand's paper isn't really about fexprs — it's about computational reflection, as can be seen in the Observations and Conclusions section at the end of the paper.
  • Wand's operational semantics has no terms that aren't source expressions.  This isn't a realistic depiction of fexprs as they occur in Lisp, but it is consistent with both his exclusive interest in the theory of source expressions and his interest in computational reflection.
  • Wand's operational semantics uses implicit evaluation rather than explicit evaluation:  that is, to distinguish between an expression to be evaluated and an expression not to be evaluated, he uses special contexts marking expressions that aren't to be evaluated, rather than special contexts marking expressions that are to be evaluated.  From a purely mathematical standpoint, a rewriting system using implicit evaluation always has a trivial theory, while a rewriting system using explicit evaluation can have quite a strong theory.  However, Wand's conceptual commitments prevent him both from using explicit evaluation and from caring about its consequences.  He cannot use explicit evaluation because the explicit evaluation contexts would not be representable in source expressions, violating his prohibition against non-source terms.  And, even if such expressions were allowed, all of the nontrivial equivalences in the theory would involve non-source terms, so that all the equivalences Wand is interested in would still be trivial (though for a different technical reason).

The deceptively simple math

Suppose we want a term-rewriting system in which any source expression S may sometimes appear as data and other times appear as program code to be executed.  We can do this in either of two ways:  either use a special context to mark a subterm that isn't to be evaluated, or use a special context to mark a subterm that is to be evaluated.

What Wand calls contextual equivalence, and I call operational equivalence, is a binary relation on terms defined thus (up to unimportant differences between treatments):

T1 ≅ T2  iff for every context C and observable V,  C[T1] ↦* V  iff   C[T2] ↦* V
But this formal definition interacts differently with the two different approaches to marking whether S is to be evaluated.

Suppose we use a special context to mark a subterm that isn't to be evaluated.  I call this strategy implicit evaluation, because a subterm is implicitly to be evaluated unless we go out of our way to say otherwise.  This strategy would naturally occur to a traditional Lisp programmer, who is used to quotation.  If this special context is identified by an operator, likely we'd call it "quotation"; but whatever we call it, suppose Q is a context that marks its subterm as data.  This means, formally, that Q[S] is itself an observable, and that Q[S1] and Q[S2] are different observables unless S1 and S2 are syntactically identical.  It immediately follows, from our definition of operational equivalence, that no two source terms S1 and S2 can ever be operationally equivalent unless they are syntactically identical.  When you add all the trappings in Wand's paper, this trivialization of theory might look more complicated, but it still comes down to this:  if there's a context that converts arbitrary terms into irreducible observables, then the operational equivalence relation is trivial.

But suppose we use the second strategy, a special context to mark a subterm that is to be evaluated.  I call this strategy explicit evaluation because a subterm is only evaluated if it's immediately surrounded by a context that explicitly says to evaluate it.  Then the cause of Wand's trivialization of theory "simply" vanishes.  Except that the consequences aren't at all simple.  The operational equivalence relation is really measuring something different under explicit evaluation than it did under implicit evaluation (under explicit evaluation, source terms have trivial theory because they're irreducible, whereas under implicit evaluation they have trivial theory even though they aren't irreducible).  So now I'll go back to square one and build the mathematical machinery with more care.

Dramatis personae

In classical small-step operational semantics, we have a term set, and a number of binary relations on terms.

The term set includes both source expressions and the observable results of computations, and possibly also some terms that represent states of computation that are neither source expressions nor observables.

Six key properties that a binary relation on terms might have: 

  • Reflexive:  T1 > T1
  • Transitive:  if  T1 > T2  and  T2 > T3  then  T1 > T3
    The reflexive transitive closure ("zero or more steps") is written by suffixing "*" to the relation, as  T1 >* T2
  • Commutative:  if  T1 > T2  then  T2 > T1
    A reflexive transitive commutative relation is called an equivalence.
  • Compatible:  if  T1 > T2  then for all contexts C,  C[T1] > C[T2]
    A compatible equivalence is called a congruence.
  • Church-Rosser:  If  T1 >* T2  and  T1 >* T3  then there exists T4 such that  T2 >* T4  and  T3 >* T4
  • Deterministic:  For each T1, there is at most one T2 such that T1 > T2
    This is not usually a mathematically desirable property, because it interferes with other properties like compatibility.  However, determinism may be semantically desirable (i.e., we want a program to behave the same way each time it's run).
Four key binary relations in the small-step semantics of a programming language:
  • Operational step, written  ↦
    A directed relation meant to "obviously" describe what programs in the language are supposed to do.  Because it's meant to be obviously right, it doesn't try to have nice mathematical properties like Church-Rosser-ness.  Expected to be deterministic and non-compatible.
  • Calculus step, written  →
    A relation meant to have lots of nice mathematical properties, like the reduction step relation of lambda-calculus.  The calculus step is generally defined to be compatible, and if it isn't Church-Rosser, something's wrong.
  • Calculus equality, written  =
    The reflexive transitive commutative closure of the calculus step (thus, the smallest equivalence relation containing the calculus step).
  • Operational equivalence, written  ≅
    T1 ≅ T2  iff for every context C and observable V,  C[T1] ↦* V  iff   C[T2] ↦* V
The master theorems one wants to prove about these relations are
  • completeness:
    ↦*  implies  →*
  • Church-Rosser-ness:
    →  is Church-Rosser
  • soundness:
    =  implies  ≅
(There's another "master theorem" we could include on this list, called standardization, but that really is way more infrastructure than we need here.)

Wand's operational semantics

The syntax of terms in Wand's semantics is:

T   ::=   x | (λx.T) | (TT) | (fexpr T) | (eval T)
That is, a term is either a variable, or a lambda-expression, or a combination, or a fexpr, or a call to eval.  Wand explained he didn't want to have two different binding constructs, one for ordinary procedures and one for fexprs.  Well, I didn't either; it's characteristic of our different approaches, though, that where Wand made his binding construct a constructor of applicatives (that evaluate their argument), and put a wrapper around it to cause the operand to be quoted, I made my binding construct a constructor of operatives (i.e., fexprs), and put a wrapper around it to cause the operand to be evaluated.

Wand has three rewriting rules: one for applying a λ-expression, one for applying a fexpr, and one for applying eval.  If we were being naive, we might try to define a calculus step like this:

((λx.T)V)   →   T[x ← V]
(fexpr V)T   →   (V encode(T))
(eval T)   →   decode(T)
Here, T is a term, and V is a value, that is, a term that's been reduced as much as it can be.  (Obscure point:  a value is a term that's irreducible by the operational step.  Don't worry about it.)

The encode/decode functions require some explanation.  The point of not evaluating certain subterms is to be able to use those unevaluated subterms as data.  So somehow you have to represent them in an accessible data form.  One way to do this would be to use a special quotation operator, and then have accessors that act on quoted expresions, like  (car (quote (T1 . T2))) → (quote T1).  However, when you're proving a no-go theorem, as Wand was doing, you want to minimize the assumptions you have to make to demonstrate the no-go result.  So Wand would naturally want to avoid introducing those extra operators, quote car and so on.  Instead, Wand used a Mogensen-Scott encoding, which, given an arbitrary term T, produces a term built up out of lambda-expressions that is not itself reducible at all but which can be queried to extract information about T.

Unfortunately this definition of a calculus step doesn't work, exactly because these rules assume → is compatible, and the whole point of Wand's exercise is that subterm rewriting isn't allowed in all contexts.  However, we can define the operational semantics step, which isn't compatible anyway.  For the deterministic operational step, we define an "evaluation context", which is a context that determines where the next redex (reducible expression) can occur.  We have

E   ::=   ⎕ | (ET) | ((λx.T)E) | (fexpr E) | (eval E)
That is, starting from the top of the syntax tree of a term, we can reduce a redex at the top of the tree, or we can descend into the operator of a combination, or into the operand of a λ-expression, or into the body of a fexpr, or into the operand of eval.  Using this, we can define the operational step.
E[((λx.T)V)]   ↦   E[T[x ← V]]
E[(fexpr V)T]   ↦   E[(V encode(T))]
E[(eval T)]   ↦   E[decode(T)]

Preprocessing as a cure for trivialization

Wand credited Albert Meyer for noting the trivialization effect.  Meyer, though, noted that quote-eval (as opposed to fexprs) need not cause trivialization.  (Meyer made this supplementary observation at least as early as 1986; see here, puzzle 3.)

The key to this non-trivialization by quotation is that quotation in traditional Lisp is a "special form", which is to say that the quotation operator, which identifies its argument as not to be evaluated, is fixed at compile-time (that is, before program evaluation begins).  So, given a Lisp source program, we can run a preprocessor over it and rewrite each quoted expression to avoid quoting more than a single symbol at a time.  (For example, one might rewrite  ($quote (+ 1 2))  as  (cons ($symbol-quote +) (cons 1 (cons 2 ()))), a style of rewriting that requires only quotation of individual symbols.)  And then the term-rewriting system doesn't have to have quotation in it at all (only, in the example, symbol-quotation, a much more restricted facility that would only trivialize the formal theory of individual symbols, not of arbitrary source expressions.)

Wand's Mogensen-Scott encoding is a similar transformation to this preprocessing of Lisp expresions using cons etc. (setting aside how the encoding treats variables, which is a whole other can of worms because lambda-calculus variables behave very differently from Lisp symbols).

However, the encoding in Wand's paper doesn't prevent trivialization, because it's used during term reduction.  Our Lisp preprocessor eliminated general quotation before term reduction ever started, so that general quotation didn't even have to be represented in the term syntax for the operational semantics.  If you wait until reduction has already started, it's too late for the encoding to matter to trivialization; you already have a trivializing context, and the encoding merely facilitates data access.

This however raises a curious possiblity:  what if we preprocessed Wand's language by encoding the arguments to all the operators?  Then, instead of encoding an operand when applying a fexpr to it, we decode an operand when applying a non-fexpr to it (or when evaluating it, a case that Wand already provides for — except that when evaluating a term, we would only decode the operators, not their operands, since we don't decode operands until we know they're to be evaluated).

One thing about this:  if we're going to decode an operand of a non-fexpr, we need to keep track of that by somehow changing the operator to indicate we've already done the decoding.  The simplest way to do this is to wrap non-fexprs (as my treatment does) instead of wrapping fexprs (as Wand's does).  There are other, clumsier ways to do it without altering Wand's syntax, but I'll go ahead and change the syntax.

T   ::=   x | (λx.T) | (TT) | (wrap T) | (eval T)
Notice the wrapper is now called wrap instead of fexpr, but the basic binding constructor of combiners is still called λ.  Actually, in my dissertation it's called vau rather than lambda (hence "vau-calculus"), but there's also an interesting point to be made by leaving it as λ (setting aside that unicode doesn't have a character for the way I write a lower-case vau).  It's just arbitrarily chosen syntax, after all... right?

Our evaluation contexts for the operational step change slightly because, beyond the switch from fexpr to wrap, we're now willing for the operational step to reduce an operand whenever the operator is a value, regardless of whether the operator is a fexpr.

E   ::=   ⎕ | (ET) | (VE) | (wrap E) | (eval E)
The operational step rules differ mainly in their treatment of encoding and decoding.  There is no encoding during reduction.  As for decoding, it is only partial.  We decode everything except the operands of combinations.  (Specifying this precisely would be tedious and contribute nothing to the discussion, but since that was already true of the encoding/decoding, we haven't specified it in the first place.)
E[((λx.T)V)]   ↦   E[T[x ← V]]
E[(wrap V)T]   ↦   E[(V partial-decode(T))]
E[(eval T)]   ↦   E[partial-decode(T)]
Note that the first rule, for applying a λ-expression to an operand, hasn't changed at all, even though a λ-expression is now an operative where in Wand's treatment it was an applicative.

Given a source program, before we set the operational step loose on the term, we preprocess the program by fully encoding it, and then partial-decoding it.  Hence the complete absence of encoding in the rules of the operational semantics.  This has the expected, but to me still rather stunning, effect that there is no longer any context in the rewriting system that has to be off-limits to rewriting.  We can therefore now define a valid compatible calculus step relation by simply removing the evaluation contexts from the operational step rules:

((λx.T)V)   →   T[x ← V]
(wrap V)T   →   (V partial-decode(T))
(eval T)   →   partial-decode(T)
I'll go only slightly out on a limb and say that all three master theorems probably hold for this arrangement — completeness, Church-Rosser-ness, and soundness.  (Completeness is obvious by construction.)  But what's really interesting here is that this arrangement both violates Wand's restriction on the term set and produces an equivalence relation that doesn't correspond to the one in his paper.

The term-set restriction is violated because source expressions, which Wand was studying exclusively, are now the things that are input to the preprocessor, and the preprocessor maps those source expressions into a proper subset of the terms in our semantics.  And the contextual equivalence on source expressions that Wand studied is concerned with these source expressions that don't even effectively belong to our term set at all, since they're mapped through the refractive lens of this encode/partial-decode preprocessor.  Yes, our ≅ is no longer a trivial equivalence — but Wand's notion of contextual equivalence of source expressons would require those source expressions to have interchangeable full encodings (in case they both appear as an operand to a fexpr), and that still won't happen unless the source expressions before preprocessing were syntactically identical.  The source-expression equivalence Wand was interested in is still trivial, and our non-trivial ≅ is simply not relevant under the treatment used in his paper.

One last thought about this preprocessed, pseudo-explicit-evaluation variant of Wand's semantics.  The first rule of the calculus step is the call-by-value β-rule.  So this calculus is (if I've not made a goof somewhere) a conservative extension of call-by-value lambda-calculus, and its equational theory includes all the equations in the theory of call-by-value lambda-calculus plus others.

Vau-calculus

Although my pseudo-explicit-evaluation variant of Wand's system does demonstrate some of the principles involved, the encoding/decoding, and the preprocessor, are obstacles to extracting clear insights from it.  Unlike Wand, I'm not trying to prove a no-go theorem; I'm showing that something can be done, so my interest lies in making it clear how to do it, and I'll happly introduce more syntax in order to avoid a complication such as Mogensen-Scott encoding, or preprocessing.

Instead of distinguishing data from to-be-evaluated expressions by means of a deep encoding (in which the entire term is transformed, from its root to its leaves), I simply introduce syntax for fully representing Lisp source expressions as data structures, entirely separate from the syntax for things like combinations and procedures (noting that procedures cannot be expressed by Lisp source expressons:  you write a source expression that would evaluate to a procedure, but until you commit to evaluating it, it's not a procedure but a structure built out of atomic data, symbols, pairs, and the empty list.)

S   ::=   d | ()
T   ::=   S | s | (. T)
Here, d is any atomic datum (archetypically, a numeric constant), and s is any symbol (which is another kind of datum, and not to be confused with a variable which is not a source-code element at all).  I've separated out the atomic data and the empty list into a separate nonterminal S, mnemonic for "Self-evaluating", because these kinds of terms will be treated separately for purposes of evaluation.

All of the terms constructed by the above rules are data, by which I mean, they're irreducible observables; none of those terms can ever occur on the left side of a calculus step.  (Therefore, given the master theorems, the operational theory of those terms is trivial.)  There are just two syntactic contexts that can ever form a redex; I'll use nonterminal symbol A for these, mnemonic for "active":

S   ::=   d | ()
A   ::=   [eval T T] | [combine T T T]
T   ::=   S | s | (. T) | A
The intent here (to be realized by calculus step rules) is that  [eval T1 T2]  represents scheduled evaluation of term T1 in environment T2, while  [combine T1 T2 T3]  represents scheduled calling of combiner T1 with parameters T2 in environment T3.

There's also one other class of syntax:  computational results that can't be represented by source code.  There are three kinds of these:  environments (which we will clearly need to support our evaluation rules), operatives, and applicatives.  We'll use "e" for environments and "O" for operatives.

O   ::=   [vau x.T] | ...
S   ::=   d | () | e | O
A   ::=   [eval T T] | [combine T T T]
T   ::=   x | S | s | (. T) | [wrap T] | A
The syntactic form of operative expressions, O, is interesting in itself, and in fact even the inclusion of traditional λ-expressions (which is what you see there, although the operator is called "vau" instead of "λ") is worthy of discussion.  I figure those issues would distract from the focus of this post, though, so I'm deferring them for some future post.  [Those issues arise in a later post, here.]

Here, then, are the rules of the calculus step (omitting rules for forms of operatives that I haven't enumerated above).

[eval S e]   →   S
[eval s e]   →   lookup(s,e)     if lookup(s,e) is defined
[eval (T1 . T2) e]   →   [combine [eval T1 e] T2 e]
[eval [wrap T] e]   →   [wrap [eval T e]]
[combine [vau x.T] V e]   →   T[x ← V]
[combine [wrap T0] (T1 ... Tn) e]   →   [combine T0 ([eval T1 e] ... [eval Tn e]) e]
In the fully expounded vau-calculus, the additional forms of operatives do things like parsing the operand list, and invoking primitives (such as car, and of course $vau).

Under these rules, every redex —that is, every term that can occur on the left side of a calculus step rule— is either an eval or a combine:  reduction occurs only at points where it is explicitly scheduled via eval or combine; and if it is explicitly scheduled to occur at a point in the syntax tree, that reduction cannot be prevented by any surrounding context.

The calculus step rules are, in essence, the core logic of a Lisp interpreter.

This calculus allows us to make much more fluent statements about the evaluation-behavior of terms than mere operational equivalence.  We can say, for example, that two terms T1 and T2 would be indistingusihable if evaluated in any environment, no matter what that environment is (and even though T1 and T2 may not themselves be operationally equivalent).

For all environments e,   [eval T1 e] ≅ [eval T2 e]
Thanks to the definition of ≅, when we say  [eval T1 e] ≅ [eval T2 e], we mean
For all contexts C and observables V,   C[eval T1 e] →* V  iff  C[eval T2 e] →* V
This is evidently not the question Wand's treatment asks with its contextual equivalence.  However, we can also define Wand's relation in this framework.  Let e0 be a standard environment.  Then,
T1∼T2   iff   for all contexts C,  [eval C[T1] e0] ≅ [eval C[T2] e0]
Relation ∼ is Wand's contextual equivalence.  And, indeed, for all source expressions S1 and S2,  S1∼S2  iff  S1 and S2 are syntactically identical.

Finally, note what happens if we omit from this calculus all of the syntax for symbols and lists, along with its associated machinery (notably, eval and environments; the only change this makes to the remaining elements is that we drop the third operand to [combine ...]).

O   ::=   [vau x.T]
S   ::=   d | O
A   ::=   [combine T T]
T   ::=   x | S | A
[combine [vau x.T] V]   →   T[x ← V]
Do you recognize it?  You should.  It's the call-by-value lambda-calculus.

[edit: besides some minor typos here and there, this post had originally omitted the vau-calculus rule for combining an applicative, [combine [wrap T0] ...].

Saturday, June 4, 2011

Primacy of syntax

One of the most influential papers in the history of computer science is Christopher Strachey's 1967 "Fundamental Concepts in Programming" — lecture notes for the International Summer School on Computer Programming, Copenhagen.  Christopher Strachey basically created the academic subject of programming languages, and in those lecture notes you will find many core concepts laid out.  First-class objects.  Left-values and right-values.  Polymorphism.  And so on.

Although Strachey did write a finished version of the lecture notes for publication in the proceedings of the summer school, those proceedings never materialized — so copies of the lecture notes were passed from hand to hand.  (The modern name for such a phenomenon is "going viral", but it's surely been happening since before writing was invented.)  Finally, on the twenty-fifth anniversary of Strachey's untimely death, the finished paper was published.

By that time, though, I'd spent over a decade studying my cherished copy of the raw lecture notes.  The notes consist largely of abbreviated, even cryptic, non-sentences, but their pithy phrases are often more expressive than the polished prose of the published paper — and sometimes the ideas candidly expressed didn't even make it into the published paper.  A particular item from the lecture notes, that has stuck with me from the first moment I read it (about a quarter century ago, by now) is
Basic irrelevance of syntax and primacy of semantics.
This seems to me to neatly capture an attitude toward syntax that became firmly rooted in the late 1960s, and has scarcely been loosened since.  My own passion, though, within the academic subject Strachey created, is abstraction; and from my first explorations of abstraction in the late 1980s, it has seemed to me that abstraction defies the separation between syntax and semantics.  Sometime during the 1988/89 academic year, I opined to a professor that the artificial distinction between syntax and semantics had been retarding the development of abstraction technology for over twenty years — and the professor I said it to laughed, and after a moment distractedly remarked "that's funny" as they headed off to their next appointment.  Nonplussed by that response (I was still rather in awe of professors, in those days), to myself alone I thought, but I wasn't joking.

Abstraction

A source code element modifies the language in which it occurs.  Starting, say, with standard Java, one introduces a new class, and ends up with a different programming language — almost exactly like standard Java, but not quite because it now has this additional class in it.  That is abstraction, building a new language on top of an old one.
Abstraction:  Transformation of one programming language into another by means of facilities available in the former language.
What makes it abstraction, rather than just an arbitrary PL transformation, is that it uses the facilities of the pre-existing programming language.  (For those who prefer to trace their definitions back a few centuries, the key requirement that the new language be latent in the old —that it be drawn out of the old, Latin abstrahere— appears in John Locke's account of abstraction in his An Essay Concerning Human Understanding; the relevant passage is the epigraph at the top of Chapter 1 of the Wizard Book — and is also the epigraph at the top of Chapter 1 of my dissertation.)

An abstractively powerful programming language is, by my reckoning, a language from which one can abstract to a wide variety of other languages.  The precise formalization of that is a subtle thing; and worthy of a blog entry of its own, especially since my formal treatment of it, WPI-CS-TR-08-01, gets rather dense in some parts.  For the current entry, though, we don't need those details; the key point here is that the result of the abstraction is another language.  This is in contrast to the denotational approach to programming languages (which Strachey helped create, BTW, and which is visible in nascent form in the 1967 lecture notes):  denotationally, the programming language is essentially a function which, when applied to a source-code term, produces a semantic value of another sort entirely.
[Note:  I've since written an entry on abstractive power.]
The idea that semantic results are languages is a very powerful one.  The results of computations (and any other embellishment one wants) can be modeled by introducing additional sorts of terms that may occur in languages; the significance of each language is then fully represented by the possible sequences of terms that can follow from it.  But then, one doesn't really need the explicit semantics at all.  Only the sequences of terms matter, and one can define a programming language by a set of sequences of terms.  At that point, one could say that all semantics is being represented as syntax (which is a truism about semantics, anyway), or one could just as well say that semantics has vanished entirely to be replaced with pure syntax.

Lisp and syntax

Lisp has been described as a language with no syntax.  There is a sense in which that's true:  if by "syntax" one means "syntax for representing programs rather than data".  In primordial S-expression Lisp, the only syntax exclusively represents data.  (The way that happened —to remind— was that McCarthy had originally envisioned a second kind of syntax, M-expressions, representing programs, but he'd also described an algorithm for encoding M-expressions as S-expressions.  He expected to have years in which to polish details of the language since writing a compiler was then understood to be such a colossal undertaking, but in the meantime they were hand-coding specific Lisp functions — and then S.R. Russell hand-coded eval, and abruptly they had a working interpreter for S-expression Lisp.)

I believe, by the way, this is how Lisp should be taught to novices:  Teach them S-expression syntax first, set it firmly in their minds that such expressions are data, and only after that begin to teach them about evaluation.  Mike Gennert and I tried this approach with a class back in the spring semester of 1999/2000.  Over the duration of the course, we led the students through writing a Scheme interpreter in Java, starting with a level-0 "REPL" loop that was missing the "E" — it would read in an S-expression, and just write it out without evaluating it.  By the end of the term we'd added in proper tail recursion.  The experiment as a whole wasn't as successful as we'd hoped, because at the moment we tried it, the department's curriculum was in a state of flux, and many of the students didn't already know Java; but we didn't have the sorts of problems I've seen, or heard others describe, due to novice students failing to think in terms of evaluating S-expressions.

The connection to abstraction is immediate and compelling.  Abstraction is all about specifying how syntax will be used for subsequent program code, and the design of Lisp is focused on virtuoso manipulation of the very type of data structure (S-expressions) that is effectively the syntax of program code.  The more power Lisp gives the programmer to control how syntax will be interpreted, the more abstractive power accrues.  Since fexprs greatly expand the programmer's direct control over the interpretation-time behavior of the evaluator (short of the "throw everything out and start over" tactic of running a meta-circular evaluator — a tactic that lacks stability), fexprs should massively increase the (already formidable) abstractive power of Lisp.  That's why the subtitle of my dissertation is $vau: the ultimate abstraction.

Note: Polymorphism

Strachey's 1967 lecture notes are most known —and cited— for coining the term polymorphism.  Therein, he divided polymorphism into two forms:  parametric polymorphism, and ad hoc polymorphism.

The parametric/ad hoc distinction struck me as a logical bipartition of polymorphism; that is, every possible form of polymorphism would necessarily be either parametric or ad hoc.  Evidently Cardelli and Wegner did not interpret Strachey this way; their taxonomy placed "inclusion polymorphism" outside of both parametric and ad hoc.

It also strikes me that the name ad hoc reflects real disapproval.  This goes back to the "basic irrelevance of syntax" remark.  At heart, parametric polymorphism is semantic, ad hoc is syntactic; one might be tempted to call them "semantic polymorphism" and "syntactic polymorphism", or even "good polymorphism" and "bad polymorphism".  There is a close connection between my perception that parametric/ad hoc reflects semantic/syntactic, and my perception that parametric/ad hoc was meant to be exhaustive.

Although I expect parametric polymorphism should have greater abstractive power than ad hoc polymorphism, I don't like biasing terminology.  I'd like to see any formal results stand on their own merits.  So, years ago, I looked for an alternative to "ad hoc".  The best I came up with at the time:  selection polymorphism.

Tuesday, May 17, 2011

Dangerous things should be difficult to do by accident

Although this is a key principle for design in general —scarcely behind supporting what the system is being designed to do— I'm mainly interested in it here as a principle for designing programming languages.

Bicycles

That said, the metaphor I use to ground the principle is from mechanical engineering.

A once-popular bicycle design was the "penny-farthing", with a great big front wheel that the rider essentially sat on top of, and a small rear wheel.  (Why "penny farthing"?  The British penny was a large coin, and the farthing a small one.)  The pedals were directly on the front wheel, and the handlebars were directly over it, turning on a vertical axis.  What's wrong with that?  Obvious problems are that the rider is too far up for their feet to reach the ground, so it's not easy to stop safely; there's a long way to fall; the rider is so far forward that it's easy to fall forward over the front; and it's easy to get one's feet, or clothing, caught in the spokes of the front wheel, especially when turning (as this causes the spokes to move in relation to the rider).  All of which obvious problems are eliminated by the later "safety bicycle" design, which has two smaller wheels with the rider sitting between them, feet well away from the parts that turn when steering, and low enough that the rider can simply plant their feet on the ground when stopped.

The safety bike design also uses a chain to multiply the turning of the pedals into a higher speed than can be achieved with the penny-farthing (and multiplying speed was the reason the front wheel of the penny-farthing was made so big in the first place).

But another thing I find especially interesting about the safety bicycle design is another innovation:  its steering axis —the axis along which the handle bars rotate the front wheel— is well off the vertical.  This angle off the vertical (called the caster) means that the force of gravity, pulling the rider down toward the ground, tends to pull the front wheel toward the straight-forward position.  In fact, the further the handle bars are turned to either side, the more gravity pushes them back out of the turn.  That's inherently stable.  (The fact that, riding at speed, both wheels act as gyroscopes doesn't hurt either.)

So the significant caster of the safety bicycle actively helps the rider to limit turns to intentional turns.  That's the kind of designed-in safety factor a programming language should aspire to (and it's a tough standard to live up to).

Programming languages

Douglas McIlroy, at the 1969 Extensible Languages Symposium, described opposing philosophies of programming language design as "anarchist" and "fascist".  The principle of accident avoidance illuminates both philosophies, and the relation between them.  (McIlroy was not, BTW, playing favorites to either philosophy, so if you think this terminology is harder on one side or the other that may tell you something about your politics. :-).
C makes it easy to shoot yourself in the foot; C++ makes it harder, but when you do it blows your whole leg off.
— Bjarne Stroustrup (attributed)
The fascist approach to accident prevention is to simply prohibit dangerous behaviors — which prevents accidents caused by the prohibited behaviors, at the cost of, at least,  (a) forcing the programmer to work around the prohibition and  (b) prohibiting whatever gainful employment, if any, might otherwise be derived by exploiting the prohibited behavior.  (There's a case claimed for the fascist approach based on ability to prove things about programs, but that's a subject for a different post; here I'm considering what may be done deliberately versus what may be done by accident, while that other post would concern what can be done versus what can be proven about it.)  Drawbacks to this arrangement accrue from both (a) and (b), as workarounds in (a) are something else to get wrong and, moreover, when the programmer wants to overcome limitations of (b), this tends to involve subverting the fascist restrictions, rather than working with them, producing an inherently unstable situation (in contrast to the ideal of stability we saw with the significant caster of the safety bicycle).

The anarchist philosophy makes reliance on this design principle more obvious.  You've got more opportunities to do dangerous things, so if there's something in the language design that causes you to do those things when you don't mean to —or that pushes you to mean to do them routinely, multiplying opportunities to do them wrong— that's going to be a highly visible problem with the language.

Most Lisps are pretty anarchic, and certainly my Kernel programming language is, supporting fexprs as it does.  Dangerous things are allowed on general principle; and the whole language is, among other things, an exercise in strongly principled language design, so some explicit principle was clearly needed to keep the anarchy sane.  Dangerous things should be difficult to do by accident was crafted for that sanity.

Which brings me back to something I said I'd post separately (in my earlier post about fexprs).

Hygiene in Kernel

Kernel is statically scoped, which is key to making fexprs manageable.

The static environment of a combiner is the environment of symbol-value bindings in effect at the point (in source code, and in time) where the combiner is defined.  The dynamic environment of a call to a combiner is the environment of symbol-value bindings in effect at the point (in source code and in time) from which the combiner is called.  The local environment of a combiner call is the environment of symbol-value bindings that are used, for that call, to interpret the body of the combiner.  The local environment has some call-specific bindings, of the combiner's formal parameters to arguments (or operands) passed to the call; but then, for other symbols, the local environment refers to a parent environment.  In a statically scoped combiner, the parent is the static environment; in a dynamically scoped combiner, the parent is the dynamic environment.

When a programmer writes a symbol into source code, their understanding of what that symbol refers to tends to be based on what else is declared in the same region of source code.  For the actual meaning of the symbol to coincide with the programmer's expectation, most programming languages use statically scoped applicatives (applicatives meaning, as explained in my earlier post, that the operands are evaluated in the dynamic environment, and the resulting arguments are passed to the underlying combiner).  This behavior —operands interpreted in the dynamic environment, combiner body interpreted in a local child of the static environment— is commonly called good hygiene.

Early Lisp was dynamically scoped.  (How and why that happened, and then for a couple of decades got worse instead of better, are explored in Section 3.3 of my dissertation.)  Under dynamic scope, when one writes a combiner, one doesn't generally know anything about what it will do when called:  none of the non-local symbols have any fixed meaning that can be determined when-and-where the combiner is defined.  Sometimes one actually wants this sort of dynamic behavior; but good hygiene is the "straight forward" behavior that anarchic Kernel gravitates toward as a stable state.

So, what does Kernel do to allow hygiene violations while maintaining a steady stabilizing gravity toward good hygiene?

The primitive constructor of compound operatives, $vau, is statically scoped; in addition to a formal parameter tree that matches the operands of the call, there is an extra environment parameter, a symbol that is bound in the local environment to the dynamic environment of the call.  That makes it possible to violate hygiene — although it is actually commonly used to maintain hygiene, as will be shown.  If you don't want to use the dynamic environment, you can use a special Kernel value, #ignore, in place of the environment parameter to prevent any local binding to the dynamic environment, so it's then impossible to accidentally invoke the dynamic environment.

The usual constructor for compound applicatives, $lambda, can be defined as follows; it simply transforms a combination  ($lambda formals . body)  into  (wrap ($vau formals #ignore . body)).
($define! $lambda
   ($vau (formals . body) env
      (wrap (eval (list* $vau formals #ignore body)
                  env))))

The very existence of $lambda is the first and simplest way Kernel pushes the programmer toward good hygiene:  because $lambda is easier to use than $vau —and because they can't readily be mistaken for each other— the programmer will naturally use $lambda for most purposes, turning out hygienic combiners as a matter of course.  Constructing an operative takes more work.  Constructing an applicative that doesn't ignore its dynamic environment is even more laborious, requiring a composition of $vau with wrap, as in the standard derivation of get-current-environment:
($define! get-current-environment
   (wrap ($vau () e e)))
So good hygiene just takes less work than bad hygiene.

Once the programmer has decided to use $vau, gravitating toward good hygiene gets subtler.  The above derivation of $lambda exemplifies the main tactics.  In order to evaluate the operands in any environment at all, you typically use eval — and eval requires an explicit second argument specifying the environment in which to do the evaluation.  And the most immediately available environment that can be specified is the one for which a local binding has been explicitly provided:  the dynamic environment of the call.

A nuance here is that the expression to be evaluated will typically be cobbled together within the combiner, using some parts of the operand tree together with other elements introduced from elsewhere.  This is the case for our derivation of $lambda, where the target expression has four elements — two parts of the operand tree, formals and body; literal constant #ignore; and $vau.  For these cases, a key stabilizing factor is that the construction is conducted using applicative combinations, in which the parts are specified using symbols.  Since the constructing combinations are applicative, those symbols are evaluated in the local environment during construction — so by the time the constructed expression is evaluated, all dependencies on local or static bindings have already been disposed of.  The additional elements typically introduced this way are, in fact, combiners (and occasional constants), such as in this case $vau; and these elements are atomic, evaluating to themselves, so they behave hygienically when the constructed expression is later evaluated (as I noted in my earlier post under the subheading "fluently doing nothing").

Wednesday, April 6, 2011

Fexpr

Fexpr is a noun.  It's pronounced FEKSper.  A fexpr is a procedure that acts on the syntax of its operands, rather than on the values determined by that syntax.

This is only in the world of Lisp programming languages — not that only Lisps happen to have fexprs, but that having fexprs (and having the related characteristics that make them a really interesting feature) seemingly causes a programming language to be, at some deep level, a dialect of Lisp.  That's a clue to something:  although this acting-on-syntax business sounds superficial, it's a gateway to the deepest nature of the Lisp programming-language model.  Fexprs are at the heart of the rhyming scheme of Lisp.

Data as programs

When Lisp reads a syntax (i.e., source code) expression, it immediately represents the expression as a data structure; or at least, in theory it does.  For fexprs to even make sense, this would have to be true:  a procedure in a program acts on data, so if you're passing the operand syntax expressions to a procedure, those expressions have to be data.  The Lisp evaluator then interprets the syntax expression in data form, and that's the whole of Lisp:  read an expression and evaluate it, read another and evaluate it, and so on.

But Lisp was designed, from the start, specifically for manipulating an especially simple and general kind of data structures, essentially trees (though they can also be viewed as nested lists, hence the name of the language, short for LISt Processing).  And syntax expressions are represented as these same trees that are already Lisp's native data structure.  And, the Lisp evaluator algorithm isn't limited to data that started life as a representation of syntax:  any data value can, in principle, be evaluated.  Which means that a fexpr doesn't have to act on syntax.

The theory of fexprs

One reason it matters that fexprs can act on non-syntax, is because of a notorious theoretical result about fexprs.  Proving programs correct usually makes heavy use of determining whether any two source expressions are interchangeable.  When two source expressions may be operands to a fexpr, though, they won't be interchangeable in general unless they're syntactically identical.  So with fexprs in the language, no two distinct operands are ever universally interchangeable.  This was famously observed by Mitch Wand in a journal article back in 1998, The Theory of Fexprs is Trivial.

But while a fexpr can analyze any syntactic operand down to the operand's component atoms, computed operands are a different matter.  It's almost incidental that some computed data structures are encapsulated, so can't be fully analyzed by fexprs.  The more important point is, even if the structure resulting from computation can be fully analyzed, the process by which it was computed is not subject to analysis.  If a fexpr is given an operand 42, the fexpr can't tell how that operand was arrived at; it might have been specified in source code, or computed by multiplying 6 times 7, or computed in any of infinitely many other possible ways.

So, suppose one sets up a computational calculus, something like lambda-calculus, for describing computation in a Lisp with fexprs.  Source expressions are terms in the calculus, and no two of them are contextually equivalent (i.e., interchangeable as subterms of all larger terms).  But —unless the calculus is constructed pathologically— there are still very many terms in the calculus, representing intermediate states of subcomputations, that are contextually equivalent.

I've developed a calculus like that, by the way.  It's called vau-calculus.

Deep fexprs

We're about to need much better terminology.  The word fexpr is a legacy from the earliest days of Lisp, and procedure is used in the Lisp world with several different meanings.  Here's more systematic terminology, that I expanded from Scheme for use with the Kernel programming language.
A list to be evaluated is a combination; its first element is the operator, and the rest of its elements are operands.  The action designated by the operator is a combiner.  A combiner that acts directly on its operands is an operative.  (Legacy terms: an operative that is a data value is a fexpr, an operative that is not a data value is a special form.)  A combiner that isn't operative is applicative; in that case, the operands are all evaluated, the results of these evaluations are called arguments, and the action is performed on the arguments instead of on the operands.
It might seem that applicative combinations would be more common, and far more varied, than operative combinations.  Explicitly visible operatives in a Lisp program are largely limited to a small set, used to define symbols (in Kernel, mainly $define!  and $let), construct applicatives ($lambda), and do logical branching ($if  and $cond) — about half a dozen operatives, used over and over again.  The riotous variety of programmer-defined combiners are almost all applicative.

But looking closely at the above definition of applicative, it implies that every applicative has an operative hiding inside it.  Once an argument list has been computed, it's just another list of data values — and those values are then acted on directly with no further processing, which is what one does when calling an operative!  Applicative +, which evaluates its operands to arguments and then adds the arguments, has an underlying operative that just adds its operands; and so on.

Vau-calculus

In a computational calculus for fexprs, it's a big advantage to represent each applicative explicitly as a wrapper (to indicate the operands are to be evaluated) around another underlying combiner.  That way, the calculus can formally reason about argument evaluation separately from reasoning about the underlying actions.  Vau-calculus works that way.  The whole calculus turns out to have three parts.  There's one part that only represents tree/list structures, and no computation takes place purely within that part.  There's one part that only deals with computations via fexprs.  And then, linking those two, there's the machinery of evaluation, which is where the wrapper-to-induce-operand-evaluation comes in.

Fascinatingly, of these three parts of vau-calculus, the one that deals only with computations involving fexprs is (give or take) lambda-calculus.  One could reasonably claim —without contradicting Mitch Wand's perfectly valid result, but certainly contrasting with it— that the theory of fexprs is lambda-calculus.

(Vau-calculus seems a likely topic for a future blog entry here.  Meanwhile, if you're really feeling ambitious, the place to look is my dissertation.)
[Note:  I've since blogged on vau-calculus here.]
Kernel

What works for a computational calculus also works for a Lisp language:  represent each applicative as a wrapper around an underlying combiner.  The Kernel programming language does this.  An applicative unwrap  takes an applicative argument and returns the underlying combiner of that argument; and an applicative wrap  takes any combiner at all as an argument, and returns an applicative whose underlying combiner is that argument.

This makes Kernel a powerful tool for programmers to fluently manipulate the operand-evaluation process, just as the analogous device in vau-calculus allows reasoning about operand-evaluation separately from reasoning about the underlying lambda-calculus computations.

Kernel (evaluator)

Here's the central logic of the Kernel evaluator (coded in Kernel, then put in words).
($define! eval
   ($lambda (expr env)
      ($cond ((symbol? expr)  (lookup expr env))
             ((pair? expr)
                (combine (eval (car expr) env)
                         (cdr expr)
                         env))
             (#t  expr))))

($define! combine
   ($lambda (combiner operands env)
      ($if (operative? combiner)
           (operate combiner operands env)
           (combine (unwrap combiner)
                    (map-eval operands env)
                    env))))
To evaluate an expression in an environment:  If it's a symbol, look it up in the environment.  If it's a pair (which is the more general case of a list), evaluate the operator in the environment, and combine  the resulting combiner with the operands in the environment. If it's neither a symbol nor a pair, it evaluates to itself.

To combine a combiner with an operands-object:  If the combiner is operative, cause it to act on the operands-object (and give it the environment, too, since some operatives need that).  If the combiner is applicative, evaluate all the operands in the environment, and recursively call combine  with the underlying combiner of the applicative, and the list of arguments.

Kernel (fluently doing nothing)

When evaluating syntax read directly from a source file, the default case of evaluation —the one explained in boldface— is why a literal constant, such as an integer, evaluates to itself.  What makes it worth boldfacing, though, is that when evaluating computed expressions, that case helps keep environments from bleeding into each other (in Lisp terminology, it helps avoid accidental bad hygiene).  Here's a basic example.

Lisp apply  overrides the usual rule for calling an applicative, by allowing a single arbitrary computation-result to be used in place of the usual list of arguments.  The first argument to apply  is the applicative, and its second argument is the value to be used instead of a list of arguments.  In Kernel, and then in words:
($define! apply
   ($lambda (appv args)
      (eval (cons (unwrap appv) args)
            (make-environment))))
To apply an applicative to an args-object, construct a combination whose operator is the underlying combiner of the applicative, and whose operands-object is the args-object; and then evaluate the constructed combination in a freshly created empty environment.  When the constructed combination is evaluated, its operator evaluates to itself because it's a combiner.  This defaulting operator evaluation doesn't need anything from the environment where the arguments to apply  were evaluated, so the constructed combination can be evaluated in an empty environment — and the environment of the call to apply  doesn't bleed into the call to the constructed combination.

In a standard Kernel environment, (apply list 2) evaluates to 2.

A more impressive illustration is the way $lambda  can be defined hygienically in Kernel using more primitive elements of the language.  I should make that a separate post, though.  The earlier parts of this post deliberately didn't assume Lisp-specific knowledge at all, and in the later parts I've tried to ease into Lisp somewhat gently — but $lambda  gets into a whole nother level of Lisp sophistication (which is what makes it a worthwhile example), so it just feels logically separate.
[Note: I did later post on Kernel hygiene and $lambda, here.]