Monday, August 29, 2016

Interpreted programming languages

Last night I drifted off while reading a Lisp book.
xkcd 224.

It's finally registered on me that much of the past half century of misunderstandings and confusions about the semantics of Lisp, of quotation, and, yes, of fexprs, can be accounted for by failure to recognize there is a theoretical difference between an interpreted programming language and a compiled programming language.  If we fail to take this difference into account, our mathematical technology for studying compiled programming languages will fail when applied to interpreted languages, leading to loss of coherence in language designs and a tendency to blame the language rather than the theory.

Technically, a compiler translates the program into an executable form to be run thereafter, while an interpreter figures out what the program says to do and just does it immediately.  Compilation allows higher-performance execution, because the compiler takes care of reasoning about source-code before execution, usually including how to optimize the translation for whatever resources are prioritized (time, space, peripherals).  It's easy to suppose this is all there is to it; what the computer does is an obvious focus for attention.  One might then suppose that interpretation is a sort of cut-rate alternative to compilation, a quick-and-dirty way to implement a language if you don't care about performance.  I think this misses some crucial point about interpretation, some insight to be found not in the computer, but in the mind of the programmer.  I don't understand that crucial insight clearly enough — yet — to focus a whole blog post on it; but meanwhile, there's this theoretical distinction between the two strategies which also isn't to be found in the computer's-eye view, and which I do understand enough about to focus this blog post on.

It's not safe to say the language is the same regardless of which way it's processed, because the language design and the processing strategy aren't really independent.  In principle a given language might be processed either way; but the two strategies provide different conceptual frameworks for thinking about the language, lending themselves to different language design choices, with different purposes — for which different mathematical properties are of interest and different mathematical techniques are effective.  This is a situation where formal treatment has to be considered in the context of intent.  (I previously blogged about another case where this happens, formal grammars and term-rewriting calculi, which are both term-rewriting systems but have nearly diametrically opposite purposes; over thar.)

I was set onto the topic of this post recently by some questions I was asked about Robert Muller's M-Lisp.  My dissertation mentions Muller's work only lightly, because Muller's work and mine are so far apart.  However, because they seem to start from the same place yet lead in such different directions, one naturally wants to know why, and I've struck on a way to make sense of it:  starting from the ink-blot of Lisp, Muller and I both looked to find a nearby design with greater elegance — and we ended up with vastly different languages because Muller's search space was shaped by a conceptual framework of compilation while mine was shaped by a conceptual framework of interpretation.  I will point out, below, where his path and mine part company, and remark briefly on how this divergence affected his destination.

The mathematical technology involved here, I looked at from a lower-level perspective in an earlier post.  It turns out, from a higher-level perspective, that the technology itself can be used for both kinds of languages, but certain details in the way it is usually applied only work with compiled languages and, when applied to interpreted languages, result in the trivialization of theory noted by Wand's classic paper, "The Theory of Fexprs is Trivial".

Contents
M-language
Compilation
Universal program
S-language
Half a century's worth of misunderstandings and confusions
M-language

I'll explore this through a toy programming language which I'll then modify, starting with something moderately similar to what McCarthy originally described for Lisp (before it got unexpectedly implemented).

This is a compiled programming language, without imperative features, similar to λ-calculus, for manipulating data structures that are nested trees of atomic symbols.  The syntax of this language has two kinds of expressions:  S-expressions (S for Symbolic), which don't specify any computation but merely specify constant data structures — trees of atomic symbols — and M-expressions (M for Meta), which specify computations that manipulate these data structures.

S-expressions, the constants of the language, take five forms:

S   ::=   s | () | #t | #f | (S . S)   .
That is, an S-expression is either a symbol, an empty list, true, false, or a pair (whose elements are called, following Lisp tradition, its car and cdr).  A symbol name is a sequence of one or more upper-case letters.  (There should be no need, given the light usage in this post, for any elaborate convention to clarify the difference between a symbol name and a nonterminal such as S.)

I'll assume the usual shorthand for lists, (S ...) ≡ (S . (...)), so for example (FOO BAR QUUX) ≡ (FOO . (BAR . (QUUX . ()))); but I won't complicate the formal grammar with this shorthand since it doesn't impact the abstract syntax.

The forms of M-expressions start out looking exactly like λ-calculus, then add on several other compound forms and, of course, S-expressions which are constants:

M   ::=   x | [λx.M] | [M M] | [if M M M] | [car M] | [cdr M] | [cons M M] | [eq? M M] | S   .
The first form is a variable, represented here by nonterminal x.  A variable name will be a sequence of one or more lower-case letters.  Upper- versus lower-case letters is how McCarthy distinguished between symbols and variables in his original description of Lisp.

Variables, unlike symbols, are not constants; rather, variables are part of the computational infrastructure of the language, and any variable might stand for an arbitrary computation M.

The second form constructs a unary function, via λ; the third applies a unary function to a single argument.  if expects its first argument to be boolean, and returns its second if the first is true, third if the first is false.  car and cdr expect their argument to be a pair and extract its first and second element respectively.  cons constructs a pair.  eq? expects its two arguments to be S-expressions, and returns #t if they're identical, #f if they're different.

Compound unary functions, constructed by λ, are almost first-class.  They can be returned as the values of expressions, and they can be assigned to variables; but as the grammar is set out, they cannot appear as elements of data structures.  A pair expression is built up from two S-expressions, and a compound unary function is not an S-expression.  McCarthy's original description of Lisp defines S-expressions this way; his stated purpose was only to manipulate trees of symbols.  Trained as I am in the much later Lisp culture of first-class functions and minimal constraints, it felt unnatural to follow this narrower definition of S-expressions; but in the end I had to define it so.  I'm trying to reproduce the essential factors that caused Lisp to come out the way it did, and, strange to tell, everything might have come out differently if not for the curtailed definition of S-expressions.

(One might ask, couldn't we indirectly construct a pair with a function in it using cons?  A pair with a function in it would thus be a term that can't be represented directly by a source expression.  This point likely doesn't matter directly to how things turned out; but fwiw, I suspect McCarthy didn't have in mind to allow that, no.  It's entirely possible he also hadn't really thought about it yet at the preliminary stage of design where this detail had its impact on the future of Lisp.  It's the sort of design issue one often discovers by playing around with a prototype — and in this case, playing around with a prototype is how things got out of hand; more on that later.)

Compilation

Somehow we want to specify the meanings of programs in our programming language.  Over the decades, a number of techniques for formal PL semantics have been entertained.  One of the first things tried was to set up a term-rewriting machine modeled roughly on λ-calculus, that would perform small finite steps until it reached a final state; that was Peter Landin's notion, when Christopher Strachey set him onto the problem in the 1960s.  It took some years to get the kinks out of that approach, and meanwhile other techniques were tried — such as denotational semantics, meta-circular evaluation — but setting up a term-rewriting calculus has been quite a popular technique since the major technical obstacles to it were overcome.  Of the three major computational models tied together in the 1930s by the Church-Turing thesis, two of them were based on term-rewriting:  Turing machines, which were the convincing model, the one that lent additional credibility to the others when they were all proven equivalent; and λ-calculus, which had mathematical elegance that the nuts-and-bolts Turing-machine model lacked.  The modern "small-step" rewriting approach to semantics (as opposed to "big-step", where one deduces how to go in a single step from start to finish) does a credible job of combining the strengths of Turing-machines and λ-calculus; I've a preference for it myself, and it's the strategy I'll use here.  I described the technique in somewhat more depth in my previous post on this material.

Small-step semantics applies easily to this toy language because every intermediate computational state of the system is naturally represented by a source-code expression.  That is, there is no obvious need to go beyond the source-code grammar we've already written.  Some features that have been omitted from the toy language would make it more difficult to limit all computational states to source expressions; generally these would be "stateful" features, such as input/output or a mutable store.  Landin used a rewriting system whose terms (computational states) were not source-code expressions.  One might ask whether there are any language features that would make it impossible to limit computational states to source expressions, and the answer is essentially yes, there are — features related not to statefulness, but to interpretation.  For now, though, we'll assume that all terms are source expressions.

We can define the semantics of the language in reasonably few rewriting schemata.

[[λx.M1] M2]   →   M1[x ← M2]
[if #t M1 M2]   →   M1
[if #f M1 M2]   →   M2
[car (S1 . S2)]   →   S1
[cdr (S1 . S2)]   →   S2
[cons S1 S2]   →   (S1 . S2)
[eq? S S]   →   #t
[eq? S1 S2]   →   #f     if the Sk are different.

Rewriting relation → is the compatible closure of these schemata; that is, for any context C and terms Mk, if M1 → M2 then C[M1] → C[M2].  Relation → is also Church-Rosser:  although a given term M may be rewritable in more than one way, any resulting difference can always be eliminated by later rewriting.  That is, the reflexive transitive closure →* has the diamond property:  if M1 →* M2 and M1 →* M3, then there exists M4 such that M2 →* M4 and M3 →* M4.
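
For instance, the term [cons [car (A . B)] [cdr (C . D)]] contains two redexes; compatibility lets us rewrite either one first, and Church-Rosser-ness guarantees the two paths reconverge:

[cons [car (A . B)] [cdr (C . D)]]   →   [cons A [cdr (C . D)]]   →   [cons A D]   →   (A . D)
[cons [car (A . B)] [cdr (C . D)]]   →   [cons [car (A . B)] D]   →   [cons A D]   →   (A . D)   .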

Formal equality of terms, =, is the symmetric, transitive closure of →* (thus, the reflexive symmetric transitive compatible closure of the schemata, which is to say, the least congruence containing the schemata).

Another important relation is operational equivalence, ≅.  Two terms are operationally equivalent just if replacing either by the other in any possible context preserves the observable result of the computation.  M1 ≅ M2 iff for every context C and S-expression S,  C[M1] ↦* S  iff  C[M2] ↦* S.  (Fwiw, relation ↦ is what the computation actually does, versus → which is anything the rewriting calculus could do; → is compatible and Church-Rosser and therefore nice to work with mathematically, but ↦ is deterministic and therefore we can be sure it does what we meant it to.  Another way of putting it is that → has the mathematical character of λ-calculus while ↦ has the practical character of Turing-machines.)

Operational equivalence is exactly what must be guaranteed in order for an optimizing compiler to safely perform a local source-to-source transformation:  as long as the two terms are operationally equivalent, the compiler can replace one with the other in any context.  The rewriting calculus is operationally sound if formal equality implies operational equivalence; then the rewriting calculus can supply proofs of operational equivalence for use in optimizing compilation.
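
For instance, the if-schemata give the formal equality

[if #t M1 M2]   =   M1   ,

and if the calculus is operationally sound, that equality licenses a compiler to rewrite any subterm of that shape to M1, in whatever context it occurs.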

Before moving on, two other points of interest about operational equivalence:

  • Operational equivalence of S-expressions is trivial; that is, S1 and S2 are operationally equivalent only if they are identical.  This follows immediately by plugging the trivial context into the definition of operational equivalence, C[Sk] ≡ Sk.  Thus, in every non-trivial operational equivalence M1 ≅ M2, at least one of the Mk is not an S-expression.

  • All terms in the calculus — all M-expressions — are source expressions; but if there were any terms in the calculus that were not source expressions, they would be irrelevant to a source-to-source optimizer; however, if these non-source terms could be usefully understood as terms in an intermediate language used by the compiler, an optimizer might still be able to make use of them and their formal equalities.

Universal program

McCarthy's Lisp language was still in its infancy when the project took an uncontrollable turn in a radically different direction than McCarthy had envisioned going with it.  Here's what happened.

A standard exercise in theory of computation is to construct a universal Turing machine, which can take as input an encoding of an arbitrary Turing machine T and an input w to T, and simulate what T would do given input w.  This is an extremely tedious exercise; the input to a Turing machine looks nothing like the control device of a Turing machine, so the encoding is highly intrusive, and the control device of the universal machine is something of an unholy mess.  McCarthy set out to lend his new programming language mathematical street cred by showing that not only could it simulate itself with a universal program, but the encoding would be much more lucid and the logic simpler in contrast to the unholy mess of the universal Turing machine.

The first step of this plan was to describe an encoding of programs in the form of data structures that could be used as input to a program.  That is to say, an encoding of M-expressions as S-expressions.  Much of this is a very straightforward homomorphism, recursively mapping the non-data M-expression structure onto corresponding S-expressions; for our toy language, encoding φ would have

  • φ(x) = symbol s formed by changing the letters of its name from lower-case to upper-case. Thus, φ(foo) = FOO.
  • φ([λx.M]) = (LAMBDA φ(x) φ(M)).
  • φ([M1 M2]) = (φ(M1) φ(M2)).
  • φ([if M1 M2 M3]) = (IF φ(M1) φ(M2) φ(M3)).
  • φ([car M]) = (CAR φ(M)).
  • φ([cdr M]) = (CDR φ(M)).
  • φ([cons M1 M2]) = (CONS φ(M1) φ(M2)).
  • φ([eq? M1 M2]) = (EQ φ(M1) φ(M2)).

(This encoding ignores the small detail that certain symbol names used in the encoding — LAMBDA, IF, CAR, CDR, CONS, EQ — must not also be used as variable names, if the encoding is to behave correctly.  McCarthy seems not to have been fussed about this detail, and nor should we be.)

For a proper encoding, though, S-expressions have to be encoded in a way that unambiguously distinguishes them from the encodings of other M-expressions.  McCarthy's solution was

  • φ(S) = (QUOTE S).

Now, in some ways this is quite a good solution.  It has the virtue of simplicity, cutting the Gordian knot.  It preserves the readability of the encoded S-expression, which supports McCarthy's desire for a lucid encoding.  The main objection one could raise is that it isn't homomorphic; that is, φ((S1 . S2)) is not built up from φ(S1) and φ(S2).
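
For instance, the compound structure of an M-expression encodes homomorphically, its variables become upper-case symbols, and each embedded constant S-expression is wrapped in QUOTE:

φ([if [eq? x FOO] #t (A B)])   =   (IF (EQ X (QUOTE FOO)) (QUOTE #t) (QUOTE (A B)))   .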

As McCarthy later recounted, they had expected to have plenty of time to refine the language design before it could be implemented.  (The FORTRAN compiler, after all, had been a massive undertaking.)  Meanwhile, to experiment with the language they began hand-implementing particular functions.  The flaw in this plan was that, because McCarthy had been so successful in demonstrating a universal Lisp function eval with simple logic, it wasn't difficult to hand-implement eval; and, because he had been so successful in making the encoding lucid, this instantly produced a highly usable Lisp interpreter.  The sudden implementation precipitated a user community and substantial commitment to specifics of what had been a preliminary language design.

All this might have turned out differently if the preliminary design had allowed first-class functions to be elements in pairs.  A function has to be encoded, homomorphically, which would require a homomorphic encoding of pairs, perhaps φ((M1 . M2)) = (CONS φ(M1) φ(M2)); once we allow arbitrary M-expressions within the pair syntax, (M1 . M2), that syntax itself becomes a pair constructor and there's really no need for a separate cons operator in the M-language; then CONS can encode the one constructor.  One might then reasonably restrict QUOTE to base cases; further, self-encode (), #t, and #f, leaving only the case of encoding symbols, and rename QUOTE to SYMBOL.  The encoding would then be fully homomorphic — but the encodings of large constant data structures would become unreadable.  For example, fairly readable constant structure

((LAMBDA X (X Y)) FOO)
would encode through φ as
(CONS (CONS (SYMBOL LAMBDA) (CONS (SYMBOL X) (CONS (CONS (SYMBOL X) (CONS (SYMBOL Y) ())) ()))) (CONS (SYMBOL FOO) ()))   .
That didn't happen, though.

The homomorphic, non-QUOTE encoding would naturally tend to produce a universal function with no practical potential for meta-programming.  In theory, one could use the non-homomorphic QUOTE encoding and still not offer any native meta-programming power.  However, the QUOTE-based encoding means there are data structures lying around that look exactly like working executable code except that they happen to be QUOTE‍d.  In practice, the psychology of the notation makes it rather inevitable that various facilities in the language would allow a blurring of the line between data and code.  Lisp, I was told when first taught the language in the 1980s, treats code as data.  Sic:  I was not told Lisp treats data as code, but that it treats code as data.

In other words, Lisp had accidentally become an interpreted language; a profoundly different beast than the compiled language McCarthy had set out to create, and one whose character naturally suggests a whole different set of features that would not have occurred to someone designing a compiled language in 1960.  There were, of course, some blunders along the way (dynamic scope is an especially famous one, and I would rate the abandonment of fexprs in favor of macros as another of similar magnitude); but in retrospect I see all that as part of exploring a whole new design space of interpreted features.  Except that over the past three decades or so the Lisp community seems to have somewhat lost track of its interpreted roots; but, I'll get back to that.

Of interest:

  • In S-expression Lisp, all source expressions are S-expressions.  It is no less true now than before that an operational equivalence M1 ≅ M2 can only be nontrivial if at least one of the Mk is not an S-expression; but now, if the Mk are source expressions, we can be absolutely certain that they are both S-expressions.  So if the operational equivalence relation is restricted to source expressions, it's trivial.  This isn't disastrous; it just means that, in order to have nontrivial theory, we are going to have to have some terms that are not source expressions (as Landin did, though for a different reason); and if we choose to compile the language, we won't be allowed to call our optimizations "local source-to-source" (any given optimization could be one or the other, but not both at once).

  • This is the fork in the road where Muller and I went our separate ways.  Muller's M-Lisp, taking the compiled-language view, supposes that S-expressions are encoded homomorphically, resulting in a baseline language with no native meta-programming power.  He then considers how to add some meta-programming power to the resulting language.  However, practical meta-programming requires the programmer be able to write lucid code that can also be manipulated as data; and the homomorphic encoding isn't lucid.  So instead, meta-programming in Muller's extended M-Lisp uses general M-expressions directly (rather than their encodings).  If an M-expression turns out to be wanted as data, it then gets belatedly encoded — with the drawback that the M-expression can't be rewritten until the rewriting schemata can tell it won't be needed as data.  This causes difficulties with operational equivalence of general M-expressions; in effect, as the burden of meta-programming is shifted from S-expressions to general M-expressions, it carries along with it the operational-equivalence difficulties that had been limited to S-expressions.

S-language

McCarthy hadn't finished the details of M-expressions, so S-expression Lisp wasn't altogether an encoding of anything; it was itself, leaving its implementors rather free to invent it as they went along.  Blurring the boundary between quoted data and unquoted code provided meta-programming facilities that hadn't been available in compiled languages (essentially, I suggest, a sort of flexibility we enjoy in natural languages).  In addition to QUOTE itself (which has a rather fraught history entangled with first-class functions and dynamic scope, cf. §3.3.2 of my dissertation), from very early on the language had fexprs, which are like LAMBDA-constructed functions except that they treat their operands as data — as if the operands had been QUOTE‍d — rather than evaluating them as code (which may later be done, if desired, explicitly by the fexpr using EVAL).  In 1963, macros were added — not the mere template-substitution macros found in various other languages, but macros that treat their operands as data and perform an arbitrary computation to generate a data structure as output, which is then interpreted as code at the point of call.
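
To make the contrast concrete (F and G are hypothetical names here, F bound to a fexpr and G to an ordinary LAMBDA-constructed function):

(F (CDR X))   — F receives the list (CDR X) itself, as data, and may later EVAL it if it chooses
(G (CDR X))   — the interpreter first evaluates (CDR X), and G receives only the resulting value

A macro likewise receives (CDR X) as data, but the data structure it computes is then interpreted as code at the point of call.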

But how exactly do we specify the meanings of programs in this interpreted S-expression language?  We could resort to the meta-circular evaluator technique; this is a pretty natural strategy since an evaluator is exactly what we have as our primary definition of the language.  That approach, though, is difficult to work with mathematically, and in particular doesn't lend itself to proofs of operational equivalence.  If we try to construct a rewriting system the way we did before, we immediately run into the glaring practical problem that the same representation is used for executable code, which we want to have nontrivial theory, and passive data which necessarily has perfectly trivial theory.  That is, as noted earlier, all source expressions are S-expressions and operational equivalence of S-expressions is trivial.  It's possible to elaborate in vivid detail the theoretical train wreck that results from naive application of the usual rewriting semantics strategy to Lisp with quotation (or, worse, fexprs); but this seems to be mainly of interest if one is trying to prove that something can't be done.  I'm interested in what can be done.

If what you want is nontrivial theory that could in principle be used to guide optimizations, this is not difficult to arrange (once you know how; cf. my past discussion of profundity index).  As mentioned, all nontrivial operational equivalences must have at least one of the two terms not a source expression (S-expression); therefore we need some terms that aren't source expressions; and our particular difficulty here is having no way to mark a source expression unmistakably as code; so, introduce a primitive context that says "evaluate this source expression".  The new context only helps with operational equivalence if it's immune to QUOTE, and no source expression is immune to QUOTE, so that's yet another way to see that the new context must form a term that isn't a source expression.

Whereas the syntax of the compiled M-language had two kinds of terms — constant S-expressions and computational M-expressions — the syntax of the interpreted S-language will have three kinds of terms.  There are, again, the "constant" terms, the S-expressions, which are now exactly the source expressions.  There are the "computational" terms, which are needed for the actual work of computation; these are collectively shaped something like λ-calculus.  We expect a big part of the equational strength of the rewriting calculus to reside in these computational terms, roughly the same equational strength as λ-calculus itself, and therefore of course those terms have to be entirely separate from the source expressions which can't have nontrivial equational theory.  And then there are the "interpretation" terms, the ones that orchestrate the gradual conversion of source expressions into computational expressions.  The code-marking terms are of this sort.  The rewriting rules involving these "interpretation" terms will amount to an algorithm for interpreting source code.

This neat division of terms into three groups won't really be as crisp as I've just made it sound.  Interpretation is by nature a gradual process whose coordination seeps into other parts of the grammar.  Some non-interpretation terms will carry along environment information, in order to make it available for later use.  This blurring of boundaries is perhaps another part of the flexibility that (I'll again suggest) makes interpreted languages more similar to natural languages.

I'll use nonterminal T for arbitrary terms.  Here are the interpretation forms.

T   ::=   [eval T T] | ⟨wrap T⟩ | e   .
Form [eval T1 T2] is a term that stands for evaluating a term T1 in an environment T2.  This immediately allows us to distinguish between statements such as
S1 ≅ S2
[eval S1 e0] ≅ [eval S2 e0]
∀e, [eval S1 e] ≅ [eval S2 e]   .
The first proposition is the excessively strong statement that S-expressions Sk are operationally equivalent — interchangeable in any context — which can only be true if the Sk are identically the same S-expression.  The second proposition says that evaluating S1 in environment e0 is operationally equivalent to evaluating S2 in environment e0 — that is, for all contexts C and all S-expressions S, C[[eval S1 e0]] ↦* S  iff  C[[eval S2 e0]] ↦* S.  The third proposition says that evaluating S1 in any environment e is operationally equivalent to evaluating S2 in the same e — which is what we would ordinarily have meant, in a compiled language, if we said that two executable code (as opposed to data) expressions Sk were operationally equivalent.

The second form, ⟨wrap T⟩, is a wrapper placed around a function T, that induces evaluation of the operand passed to the function.  If T is used without such a wrapper (and presuming T isn't already a wrapped function), it acts directly on its unevaluated operand — that is, T is a fexpr.

The third form, e, is simply an environment.  An environment is a series of symbol-value bindings, ⟪sk←Tk⟫; there's no need to go into gory detail here (though I did say more in a previous post).

The computational forms are, as mentioned, similar to λ-calculus with some environments carried along.

T   ::=   x | [combine T T T] | ⟨λx.T⟩ | ⟨εx.T⟩   .
Here we have a variable, a combination, and two kinds of function.  Form ⟨λx.T⟩ is a function that substitutes its operand for x in its body T.  Variant ⟨εx.T⟩ substitutes its dynamic environment for x in its body T.

Form [combine T1 T2 T3] is a term that stands for combining a function T1 with an operand T2 in a dynamic environment T3.  The dynamic environment is the set of bindings in force at the point where the function is called; as opposed to the static environment, the set of bindings in force at the point where the function is constructed.  Static environments are built into the bodies of functions by the function constructor, so they don't show up in the grammar.  For example, [eval (LAMBDA X FOO) e0] would evaluate to a function with static environment e0, of the form ⟨wrap ⟨λx.[eval FOO ⟪...⟫]⟩⟩ with the contents of e0 embedded somewhere in the ⟪...⟫.

Putting it all together,

T   ::=   x | [combine T T T] | ⟨λx.T⟩ | ⟨εx.T⟩ | [eval T T] | ⟨wrap T⟩ | e | S   .

The rewriting schemata naturally fall into two groups, those for internal computation and those for source-code interpretation.  (There are of course no schemata associated with the third group of syntactic forms, the syntactic forms for passive data, because passive.)  The computation schemata closely resemble λ-calculus, except with the second form of function used to capture the dynamic environment (which fexprs sometimes need).

[combine ⟨λx.T1⟩ T2 T3]   →   T1[x ← T2]
[combine ⟨εx.T1⟩ T2 T3]   →   [combine T1[x ← T3] T2 T3]   .
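
For instance, an ε-function can deliver the dynamic environment it is combined in; assuming e0 is an environment and FOO an arbitrary operand,

[combine ⟨εx.⟨λy.x⟩⟩ FOO e0]   →   [combine ⟨λy.e0⟩ FOO e0]   →   e0   .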
The interpretation schemata look very much like the dispatching logic of a Lisp interpreter.
[eval d T]   →   d
        if d is an empty list, boolean, λ-function, ε-function, or environment
[eval s e]   →   lookup(s,e)     if symbol s is bound in environment e
[eval (T1 T2) T3]   →   [combine [eval T1 T3] T2 T3]
[combine ⟨wrap T1⟩ T2 T3]   →   [combine T1 [eval T2 T3] T3]   .
(There would also be some atomic constants representing primitive first-class functions and reserved operators such as if, and schemata specifying what they do.)
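
To make the dispatching concrete, here is a minimal sketch of these interpretation schemata as a Scheme program.  It is an illustration, not the calculus: the representations are my own assumptions — environments as association lists, operatives (fexprs) as Scheme procedures taking the operand and the dynamic environment, a wrapped combiner as a pair tagged wrap — and the names toy-eval, toy-combine, and lookup are hypothetical.

; [eval s e]  ->  lookup(s,e)
(define (lookup s e)
  (cond ((assq s e) => cdr)
        (else (error "unbound symbol" s))))

; [eval T1 T2] -- dispatch on the form of the term being evaluated
(define (toy-eval t e)
  (cond ((symbol? t) (lookup t e))                  ; symbols are looked up
        ((pair? t)                                  ; a combination (T1 T2):
         (toy-combine (toy-eval (car t) e)          ;   evaluate the combiner,
                      (cadr t)                      ;   leave the operand untouched,
                      e))                           ;   and combine in the current environment
        (else t)))                                  ; everything else evaluates to itself

; [combine T1 T2 T3]
(define (toy-combine f operand e)
  (if (and (pair? f) (eq? (car f) 'wrap))           ; a wrapped (applicative) combiner:
      (toy-combine (cdr f) (toy-eval operand e) e)  ;   evaluate the operand, then recombine
      (f operand e)))                               ; an operative (fexpr): operand arrives as data

; example bindings: CAR as a wrapped primitive, QUOTE as an operative
(define test-env
  (list (cons 'CAR   (cons 'wrap (lambda (operand e) (car operand))))
        (cons 'QUOTE (lambda (operand e) operand))))

(toy-eval '(CAR (QUOTE (A . B))) test-env)          ; => A

Reserved operators such as LAMBDA and IF, and the ε-functions that capture the dynamic environment, would be further cases; and note that this sketch plays the role of the deterministic ↦, not of → with its compatible closure.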

Half a century's worth of misunderstandings and confusions

As I remarked earlier, Lisp as we know it might not have happened — at least, not when and where it did — if McCarthy had thought to allow first-class functions to occur in pairs.  The thing is, though, I don't think it's all that much of an "accident".  He didn't think to allow first-class functions to occur in pairs, and perhaps the reason we're likely to think to allow them today is that our thinking has been shaped by decades of the free-wheeling attitude fostered by the language that Lisp became because he didn't think to then.  The actual sequence of events seems less unlikely than one might first suppose.

Researchers trying to set up semantics for Lisp have been led astray, persistently over the decades, by the fact that the primary Lisp constructor of first-class functions is called LAMBDA.  Its behavior is not that of calculus λ, exactly because it's entangled with the process of interpreting Lisp source code.  This becomes apparent when contemplating rewriting calculi for Lisp of the sort I've constructed above (and have discussed before on this blog):  When you evaluate a LAMBDA expression you get a wrapped function, one that explicitly evaluates its operand and then passes the result to a computational function; that is, passes the result to a fexpr.  Scan that:  ordinary Lisp functions do not correspond directly to calculus λ-abstractions, but fexprs do correspond directly to calculus λ-abstractions.  In its relation to Lisp, λ-calculus is a formal calculus of fexprs.

Much consternation has also been devoted to the perceived theoretical difficulty presented by Lisp's quotation operator (and presented in more extreme form by fexprs), because it presents a particular context that can distinguish any two S-expressions placed into it:  (QUOTE S1) and (QUOTE S2) are observably distinct whenever the Sk are distinct from each other.  Yet, this observation only makes sense in a compiled programming language.  Back in the day, it would have been an unremarkable observation that Lisp only has syntax for data structures, no syntax at all for control.  Two syntactically distinct Lisp source expressions are operationally non-equivalent even without any quotation or fexpr context, because they don't represent programs at all; they're just passive data structures.  The context that makes a source expression code rather than data is patently not in the source; it's in what program you send the source expression to.  Conventional small-step operational semantics presumes the decision to compile, along with a trivial interpretive mapping between source expressions and internal computational terms (so the interpretive mapping doesn't have to appear explicitly in the rewriting schemata).  It is true that, without any such constructs as quotation or fexprs, there would be no reason to treat the language as interpreted rather than compiled; but once you've crossed that Rubicon, the particular constructs like quotation or fexprs are mere fragments of, and can be distractions from, the main theoretical challenge of defining the semantics of an interpreted language.

The evolution of Lisp features has itself been a long process of learning how best to realize the flexibility offered by interpreted language.  Fexprs were envisioned just about from the very beginning — 1960 — but were sabotaged by dynamic scope, a misfeature that resulted from early confusion over how to handle symbol bindings in an interpreter.  Macros were introduced in 1963, and unlike fexprs they lend themselves to preprocessing at compile time if one chooses to use a compiler; but macros really ought to be much less mathematically elegant... except that in the presence of dynamic scope, fexprs are virulently unstable.  Then there was the matter of first-class functions; that's an area where Lisp ought to have had a huge advantage; but first-class functions don't really come into their own without static scope (The Art of the Interpreter noted this) — and first-class functions also present a difficulty for compilers (which is why procedures in ALGOL were second-class).  The upshot was that after Lisp 1.5, when Lisp splintered into multiple dialects, first-class functions went into eclipse until they reemerged in the mid-1970s when Scheme introduced static scope into the language.  Fexprs held on longer but, ironically, were finally rejected by the Lisp community in 1980 — just a little ahead of the mainstream adoption of Scheme's static-scope innovation.  So for the next twenty years and more, Lisp had static scope and first-class functions, but macros and no fexprs.  Meanwhile, EVAL — key workhorse of meta-linguistic flexibility — was expunged from the new generation of mainstream Lisps and has had great difficulty getting back in.

The latter half of Lisp's history has been colored by a long-term trend in programming language design as a whole.  I've alluded to this several times above.  I have no specific sources to suggest for this; it's visible in the broad sweep of what languages were created, what research was done, and I've sensed it through my interactions with the Lisp community over the past thirty years.  When I learned Lisp in the mid-1980s, it was from the Revised Maclisp Manual, Saturday Evening Edition (which I can see a few feet away on my bookshelf as I write this, proof that manuals can be well-written).  Maclisp was a product of the mentality of the 1970s.  Scheme too was a product of that mentality.  And what comes through to me now, looking back, isn't the differences between those languages (different though they are), but that those people knew, gut deep, that Lisp is an interpreted language — philosophically, regardless of the technical details of the language processing software.  The classic paper I cited above for the relationship between first-class functions and static scope was one of the "lambda" papers associated with the development of Scheme: "The Art of the Interpreter".  The classic textbook — the Wizard Book — that emerged from the Scheme design is Structure and Interpretation of Computer Programs.

But then things changed.  Compilation had sometimes intruded into Lisp design, yes (with unfortunate results, as I've mentioned), but the intrusion became more systematic.  Amongst Scheme's other achievements it had provided improved compilation techniques, a positive development, but one that also encouraged greater focus on the challenges of compilation.  We refined our mathematical technology for language semantics of compiled languages, we devised complex type systems for use with compiled languages, more and more we designed our languages to fit these technologies — and as Lisp didn't fit, more and more we tweaked our Lisp dialects to try to make them fit.  Of course some of the indigenous features of Lisp couldn't fit, because the mathematical tools were fundamentally incompatible with them (no pun intended).  And somewhere along the line, somehow, we forgot, perhaps not entirely but enough, that Lisp is interpreted.  Second-class syntax has lately been treated more and more as if it were a primary part of the language, rather than a distraction from the core design.  Whatever merits such languages have, wholeheartedly embracing the interpretation design stance is not among them.

I'm a believer in trying more rather than less.  I don't begrudge anyone their opportunity to follow the design path that speaks to them; but not all those paths speak to me.  Second-class syntax doesn't speak to me, nor recasting Lisp into a compiled language.  I'm interested in compiling Lisp, but want the language design to direct those efforts rather than the other way around.  To me, the potential of interpretation beckons; the exciting things we've already found on that path suggest to me there's more to find further along, and the only way to know is to follow the path and see.  To do that, it seems to me we have to recognize that it is a distinct path, the distinct philosophy of interpretation; and, in company with that, we need to hone our mathematical technology for interpreted languages.

These are your father's parentheses
Elegant weapons
for a more... civilized age.
xkcd 297.

Saturday, June 11, 2016

The co-hygiene principle

The mathematics is not there till we put it there.
Sir Arthur Eddington, The Philosophy of Physical Science, 1938.

Investigating possible connections between seemingly unrelated branches of science and mathematics can be very cool.  Independent of whether the connections actually pan out.  It can be mind-bending either way — I'm a big fan of mind-bending, as a practical cure for rigid thinking — and you can get all sorts of off-beat insights into odd corners that get illuminated along the way.  The more unlikely the connection, the more likely potential for mind bending; and also the more likely potential for pay-off if somehow it does pan out after all.

Two hazards you need to avoid, with this sort of thing:  don't overplay the chances it'll pan out — and don't underplay the chances it'll pan out.  Overplay and you'll sound like a crackpot and, worse, you might turn yourself into one.  Relish the mind bending, take advantage of it to keep your thinking limber, and don't get upset when you're not finding something that might not be there.  And at the same time, if you're after something really unlikely, say with only one chance in a million it'll pan out, and you don't leave yourself open to the possibility it will, you might just draw that one chance in a million and miss it, which would be just awful.  So treat the universe as if it has a sense of humor, and be prepared to roll with the punchlines.

Okay, the particular connection I'm chasing is an observed analogy between variable substitution in rewriting calculi and fundamental forces in physics.  If you know enough about those two subjects to say that makes no sense, that's what I thought too when I first noticed the analogy.  It kept bothering me, though, because it hooks into something on the physics side that's already notoriously anomalous — gravity.  The general thought here is that when two seemingly disparate systems share some observed common property, there may be some sort of mathematical structure that can be used to describe both of them and gives rise to the observed common property; and a mathematical modeling structure that explains why gravity is so peculiar in physics is an interesting prospect.  So I set out to understand the analogy better by testing its limits, elaborating it until it broke down.  Except, the analogy has yet to cooperate by breaking down, even though I've now featured it on this blog twice (1, 2).

So, building on the earlier explorations, in this post I tackle the problem from the other end, and try to devise a type of descriptive mathematical model that would give rise to the pattern observed in the analogy.

This sort of pursuit, as I go about it, is a game of endurance; again and again I'll lay out all the puzzle pieces I've got, look at them together, and try to accumulate a few more insights to add to the collection.  Then gather up the pieces and save them away for a while, and come back to the problem later when I'm fresh on it again.  Only this time I've kind-of succeeded in reaching my immediate goal.  The resulting post, laying out the pieces and accumulating insights, is therefore both an explanation of where the result comes from and a record of the process by which I got there.  There are lots of speculations within it shooting off in directions that aren't where I ended up.  I pointedly left the stray speculations in place.  Some of those tangents might turn out to be valuable after all; and taking them out would create a deceptive appearance of things flowing inevitably to a conclusion when, in the event, I couldn't tell whether I was going anywhere specific until I knew I'd arrived.

Naturally, for finding a particular answer — here, a mathematical structure that can give rise to the observed pattern — the reward is more questions.

Contents
Noether's Theorem
Calculi
Analogy
Metatime
Transformations
Determinism and rewriting
Nondeterminism and the cosmic footprint
Massive interconnection
Factorization
Side-effects
Co-hygiene
Epilog: hygiene
Noether's Theorem

Noether's theorem (pedantically, Noether's first theorem) says that each differentiable invariant in the action of a system gives rise to a conservation law.  This is a particularly celebrated result in mathematical physics; it's explicitly about how properties of a system are implied by the mathematical structure of its description; and invariants — the current fad name for them in physics is "symmetries" — are close kin to both hygiene and geometry, which relate to each other through the analogy I'm pursuing; so Noether's theorem has a powerful claim on my attention.

The action of a system always used to seem very mysterious to me, until I figured out it's one of those deep concepts that, despite its depth, is also quite shallow.  It comes from Lagrangian mechanics, a mathematical formulation of classical mechanics alternative to the Newtonian mechanics formulation.  This sort of thing is ubiquitous in mathematics, alternative formulations that are provably equivalent to each other but make various problems much easier or harder to solve.

Newtonian mechanics seeks to describe the trajectory of a thing in terms of its position, velocity, mass, and the forces acting on it.  This approach has some intuitive advantages but is sometimes beastly difficult to solve for practical problems.  The Lagrangian formulation is sometimes much easier to solve.  Broadly, the time evolution of the system follows a trajectory through abstract state-space, and a function called the Lagrangian of the system maps each state into a quantity that... er... well, its units are those of energy.  For each possible trajectory of the system through state-space, the path integral of the Lagrangian is the action.  The principle of least action says that starting from a given state, the system will evolve along the trajectory that minimizes the action.  Solving for the behavior of the system is then a matter of finding the trajectory whose action is smallest.  (How do you solve for the trajectory with least action?  Well, think of the trajectories as abstract values subject to variation, and imagine taking the "derivative" of the action over these variations.  The least action will be a local minimum, where this derivative is zero.  There's a whole mathematical technology for solving problems of just that form, called the "calculus of variations".)

The Lagrangian formulation tends to be good for systems with conserved quantities; one might prefer the Newtonian approach for, say, a block sliding on a surface with friction acting on it.  And this Lagrangian affinity for conservative systems is where Noether's theorem comes in:  if there's a differentiable symmetry of the action — no surprise it has to be differentiable, seeing how central integrals and derivatives are to all this — the symmetry manifests itself in the system behavior as a conservation law.
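
In its simplest point-mechanics form (assuming the Lagrangian itself, not merely the action, is unchanged by the transformation): if L is invariant under an infinitesimal change q → q + εK(q), then the quantity

\[
  Q \;=\; \frac{\partial L}{\partial \dot{q}}\, K(q)
\]

is constant along every trajectory satisfying the Euler–Lagrange equations.  (The general theorem also covers the case where L changes by a total time derivative, which contributes a correction term to Q.)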

And what, you may ask, is this magical Lagrangian function, whose properties studied through the calculus of variations reveal the underlying conservation laws of nature?  Some deeper layer of reality, the secret structure that underlies all?  Not exactly.  The Lagrangian function is whatever works:  some function that causes the principle of least action to correctly predict the behavior of the system.  In quantum field theory — so I've heard, having so far never actually grappled with QFT myself — the Lagrangian approach works for some fields but there is no Lagrangian for others.  (Yes, Lagrangians are one of those mathematical devices from classical physics that treat systems in such an abstract, holistic way that they're applicable to quantum mechanics.  As usual for such devices, its history involves Sir William Rowan Hamilton, who keeps turning up on this blog.)

This is an important point:  the Lagrangian is whatever function makes the least-action principle work right.  It's not "really there", except in exactly the sense that if you can devise a Lagrangian for a given system, you can then use it via the action integral and the calculus of variations to describe the behavior of the system.  Once you have a Lagrangian function that does in fact produce the system behavior you want it to, you can learn things about that behavior from mathematical exploration of the Lagrangian.  Such as Noether's theorem.  When you find there is, or isn't, a certain differentiable symmetry in the action, that tells you something about what is or isn't conserved in the behavior of the system, and that result really may be of great interest; just don't lose sight of the fact that you started with the behavior of the system and constructed a suitable Lagrangian from which you are now deducing things about what the behavior does and doesn't conserve.

In 1543, Copernicus's heliocentric magnum opus De revolutionibus orbium coelestium was published with an unsigned preface by Lutheran theologian Andreas Osiander saying, more or less, that of course it'd be absurd to suggest the Earth actually goes around the Sun but it's a very handy fiction for the mathematics.  Uhuh.  It's unnecessary to ask whether our mathematical models are "true"; we don't need them to be true, just useful.  When Francis Bacon remarked that what is most useful in practice is most correct in theory, he had a point — at least, for practical purposes.

Calculi

The rewriting-calculus side of the analogy has a structural backstory from at least the early 1960s (some of which I've described in an earlier post, though with a different emphasis).  Christopher Strachey hired Peter Landin as an assistant, and encouraged him to do side work exploring formal foundations for programming languages.  Landin focused on tying program semantics to λ-calculus; but this approach suffered from several mismatches between the behavioral properties of programming languages versus λ-calculus, and in 1975 Gordon Plotkin published a solution for one of these mismatches, in one of the all-time classic papers in computer science, "Call-by-name, call-by-value and the λ-calculus" (pdf).  Plotkin defined a slight variant of λ-calculus, by altering the conditions for the β-rule so that the calculus became call-by-value (the way most programming languages behaved while ordinary λ-calculus did not), and proved that the resulting λv-calculus was fully Church-Rosser ("just as well-behaved" as ordinary λ-calculus).  He further set up an operational semantics — a rewriting system that ignored mathematical well-behavedness in favor of obviously describing the correct behavior of the programming language — and proved a set of correspondence theorems between the operational semantics and λv-calculus.
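
The alteration can be stated in one line.  Where the β-rule of ordinary λ-calculus substitutes an arbitrary operand, Plotkin's βv-rule fires only when the operand is already a value — a variable, a constant, or a λ-abstraction:

β:    (λx.M) N   →   M[x ← N]        for arbitrary term N
βv:   (λx.M) V   →   M[x ← V]        only if V is a value   .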

[In the preceding paragraph I perhaps should have mentioned compatibility, the other crucial element of rewriting well-behavedness; which you might think I'd have thought to mention since it's a big deal in my own work, though less flashy and more taken-for-granted than Church-Rosser-ness.]

Then in the 1980s, Matthias Felleisen applied Plotkin's approach to some of the most notoriously "unmathematical" behaviors of programs:  side-effects in both data (mutable variables) and control (goto and its ilk).  Like Plotkin, he set up an operational semantics and a calculus, and proved correspondence theorems between them, and well-behavedness for the calculus.  He introduced the major structural innovation of treating a side-effect as an explicit syntactic construct that could move upward within its term.  This upward movement would be a fundamentally different kind of rewrite from the function-application — the β-rule — of λ-calculus; abstractly, a side-effect is represented by a context 𝓢, which moves upward past some particular context C and, in the process, modifies C to leave in its wake some other context C':  C[𝓢[T]] → 𝓢[C'[T]] .  A side-effect is thus viewed as something that starts in a subterm and expands outward to affect more and more of the term until, potentially, it affects the whole term — if it's allowed to expand that far.  Of course, a side-effect might never expand that far if it's trapped inside a context that it can't escape from; notably, no side-effect can escape from context λ.[ ] , which is to say, no side-effect can escape from inside the body of a function that hasn't been called.

This is where I started tracking the game, and developing my own odd notions.  There seemed to me to be two significant drawbacks to Felleisen's approach, in its original published form.  For one thing, the transformation of context C to C', as 𝓢 moved across it, could be quite extensive; Felleisen himself aptly called these transformations "bubbling up"; as an illustration of how messy things could get, here are the rules for a continuation-capture construct C expanding out of the operator or operand of a function call:

(CT1)T2   →   C(λx1.(T1(λx2.(A(x1(x2T2))))))    for unused xk.
V(CT)   →   C(λx1.(T(λx2.(A(x1(V‍x2))))))    for unused xk.
The other drawback to the approach was that as published at the time, it didn't actually provide the full measure of well-behavedness from Plotkin's treatment of call-by-value.  One way or another, a constraint had to be relaxed somewhere.  What does the side-effect construct 𝓢 do once it's finished moving upward?  The earliest published solution was to wait till 𝓢 reaches the top of the term, and then get rid of it by a whole-term rewriting rule; that works, but the whole-term rewriting rule is explicitly not well-behaved:  calculus well-behavedness requires that any rewriting on a whole term can also be done to a subterm, and here we've deliberately introduced a rewriting rule that can't be applied to subterms.  So we've weakened the calculus well-behavedness.  Another solution is to let 𝓢 reach the top of the term, then let it settle into some sort of normal form, and relax the semantics–calculus correspondence theorems to allow for equivalent normal forms.  So the correspondence is weaker or, at least, more complicated.  A third solution is to introduce an explicit context-marker — in both the calculus and the operational semantics — delimiting the possible extent of the side-effect.  So you've got full well-behavedness but for a different language than you started out with.  (Felleisen's exploration of this alternative is part of the prehistory of delimited continuations, but that's another story.)

[In a galling flub, I'd written in the preceding paragraph Church-Rosser-ness instead of well-behavedness; fixed now.]

It occurred to me that a single further innovation should be able to eliminate both of these drawbacks.  If each side-effect were delimited by a context-marker that can move upward in the term, just as the side-effect itself can, then the delimiter would restore full Church-Rosser-ness without altering the language behavior; but, in general, the meanings of the delimited side-effects depend on the placement of the delimiter, so to preserve the meaning of the term, moving the delimiter may require some systematic alteration to the matching side-effect markers.  To support this, let the delimiter be a variable-binding construct, with free occurrences of the variable in the side-effect markers.  The act of moving the delimiter would then involve a sort of substitution function that propagates needed information to matching side-effect markers.  What with one thing and another, my academic pursuits dragged me away from this line of thought for years, but then in the 2000s I found myself developing an operational semantics and calculus as part of my dissertation, in order to demonstrate that fexprs really are well-behaved (though I should have anticipated that some people, having been taught otherwise, would refuse to believe it even with proof).  So I seized the opportunity to also elaborate my binding-delimiters approach to things that — unlike fexprs — really are side-effects.

This second innovation rather flew in the face of a tradition going back about seven or eight decades, to the invention of λ-calculus.  Alonzo Church was evidently quite concerned about what variables mean; he maintained that a proposition with free variables in it doesn't have a clear meaning, and he wanted to have just one variable-binding construct, λ, whose β-rule defines the practical meanings of all variables.  This tradition of having just one kind of variable, one binding construct, and one kind of variable-substitution (β-substitution) has had a powerful grip on researchers' imaginations for generations, to the point where even when other binding constructs are introduced they likely still have most of the look-and-feel of λ.  My side-effect-ful variable binders are distinctly un-λ-like, with rewriting rules, and substitution functions, bearing no strong resemblance to the β-rule.  Freedom from the β mold had the gratifying effect of allowing much simpler rewriting rules for moving upward through a term, without the major perturbations suggested by the term "bubbling up"; but, unsurprisingly, the logistics of a wild profusion of new classes of variables were not easy to work out.  Much elegant mathematics surrounding λ-calculus rests squarely on the known simple properties of its particular take on variable substitution.  The chapter of my dissertation that grapples with the generalized notion of substitution (Chapter 13, "Substitutive Reduction Systems", for anyone keeping score) has imho appallingly complicated foundations, although the high-level theorems at least are satisfyingly powerful.  One thing that did work out neatly was enforcement of variable hygiene, which in ordinary λ-calculus is handled by α-renaming.  In order to apply any nontrivial term-rewriting rule without disaster, you have to first make sure there aren't some two variables using the same name whose distinction from each other would be lost during the rewrite.  It doesn't matter, really, what sort of variables are directly involved in the rewrite rule:  an unhygienic rewrite could mess up variables that aren't even mentioned by the rule.  Fortunately, it's possible to define a master α-renaming function that recurses through the term renaming variables to maintain hygiene, and whenever you add a new sort of variable to the system, just extend the master function with particular cases for that new sort of variable.  Each rewriting rule can then invoke the master function, and everything works smoothly.
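
The standard illustration, in ordinary λ-calculus (nothing specific to the calculi of the dissertation): substituting under a binder without first renaming can capture a free variable and silently change the meaning of the term.

(λx.λy.x) y   →   λy.y        naive β-substitution: the free y is captured
(λx.λz.x) y   →   λz.y        α-renaming the inner binder first preserves the meaning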

I ended up with four classes of variables.  "Ordinary" variables, of the sort supported by λ-calculus, I found were actually wanted only for a specific (and not even technically necessary) purpose:  to support partial evaluation.  You could build the whole calculus without them and everything would work right, but the equational theory would be very weak.  (I blogged on this point in detail here.)  A second class of variable supported continuations; in effect, the side-effect marker was a "throw" and the binding construct was a "catch".  Mutable state was more complicated, involving two classes of variables, one for assignments and one for lookups.  The variables for assignment were actually environment identities; each assignment side-effect would then specify a value, a symbol, and a free variable identifying the environment.  The variables for lookup stood for individual environment-symbol queries; looking up a symbol in an environment would generate queries for that environment and each of its ancestor environments.  The putative result of the lookup would be a leaf subterm with free variable occurrences for all the queries involved, waiting to assimilate the query results, while the queries themselves would rise through the term in search of matching assignments.  Whenever a query found a matching assignment, it would self-annihilate while using substitution to report the result to all waiting free variable occurrences.

Does all this detail matter to the analogy with physics?  Well, that's the question, isn't it.  There's a lot there, a great deal of fodder to chew on when considering how an analogy with something else might have a structural basis.

Analogy

Amongst the four classes of variables, partial-evaluation variables have a peculiar sort of... symmetry.  If you constructed a vau-calculus with, say, only continuation variables, you'd still have two different substitution functions — one to announce that a delimiting "catch" has been moved upward, and one for α-renaming.  If you constructed a vau-calculus with only mutable-state variables, you'd have, well, a bunch of substitution functions, but in particular all the substitutions used to enact rewriting operations would be separate from α-renaming.  β-substitution, though, is commensurate with α-renaming; once you've got β-substitution of partial-evaluation variables, you can use it to α-rename them as well, which is why ordinary λ-calculus has, apparently at least, only one substitution function.
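
To see concretely what "commensurate" buys: for ordinary variables, the very substitution that powers β can do the renaming, taking λx.T to λz.T[x ← z] with z fresh.  A minimal sketch, honest only because the thing substituted is a genuinely fresh variable:

    # Renaming done by the same substitution that powers beta:  for ordinary
    # (partial-evaluation) variables only, (lam x T) becomes (lam z T[x <- z]) with
    # z fresh.  The substitution below is naive, which is harmless here precisely
    # because the replacement is a lone fresh variable (nothing to capture).

    def subst(term, x, repl):
        """term[x <- repl], assuming repl cannot be captured."""
        tag = term[0]
        if tag == 'var':
            return repl if term[1] == x else term
        if tag == 'lam':
            if term[1] == x:              # x is shadowed; stop here
                return term
            return ('lam', term[1], subst(term[2], x, repl))
        if tag == 'app':
            return ('app', subst(term[1], x, repl), subst(term[2], x, repl))
        raise ValueError(tag)

    def alpha_rename(lam_term, fresh_name):
        """(lam x T)  -->  (lam z T[x <- z]):  beta-substitution doing alpha's job."""
        _, x, body = lam_term
        return ('lam', fresh_name, subst(body, x, ('var', fresh_name)))

    print(alpha_rename(('lam', 'x', ('app', ('var', 'x'), ('var', 'y'))), 'z'))
    # -> ('lam', 'z', ('app', ('var', 'z'), ('var', 'y')))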

Qualitatively, partial-evaluation variables seem more integrated into the fabric of the calculus, in contrast to the other classes of variables.

All of which put me powerfully in mind of physics because it's a familiar observation that gravity seems qualitatively more integrated into the fabric of spacetime, in contrast to the other fundamental forces (xkcd).  General relativity portrays gravity as the shape of spacetime, whereas the other forces merely propagate through spacetime, and a popular strategy for aspiring TOEs (Theories Of Everything) is to integrate the other fundamental forces into the geometry as well — although, looking at the analogy, perhaps that popular strategy isn't such a good idea after all.  Consider:  The analogy isn't just between partial-evaluation variables and gravity.  It's between the contrast of partial-evaluation variables against the other classes of variables, and the contrast of gravity against the other fundamental forces.  All the classes of variables, and all the fundamental forces, are to some extent involved.  I've already suggested that Felleisen's treatment of side-effects was both weakened and complicated by its too-close structural imitation of λ, whereas a less λ-like treatment of side-effects can be both stronger and simpler; so, depending on how much structure carries through the analogy, perhaps trying to treat the other fundamental forces too much like gravity should be expected to weaken and complicate a TOE.

Projecting through the analogy suggests alternative ways to structure theories of physics, which imho is worthwhile independent of whether the analogy is deep or shallow; as I've remarked before, I actively look for disparate ways of thinking as a broad base for basic research.  The machinery of calculus variable hygiene, with which partial-evaluation variables have a special affinity, is only one facet of term structure; and projecting this through to fundamental physics, where gravity has a special affinity with geometry, suggests that geometry itself might usefully be thought of, not as the venue where physics takes place, but merely as part of the rules by which the game is played.  Likewise, the different kinds of variables differ from each other by the kinds of structural transformations they involve; and projecting that through the analogy, one might try to think of the fundamental forces as differing from each other not (primarily) by some arbitrary rules of combination and propagation, but by being different kinds of structural manipulations of reality.  Then, if there is some depth to the analogy, one might wonder if some of the particular technical contrasts between different classes of variables might be related to particular technical contrasts between different fundamental forces — which, frankly, I can't imagine deciphering until and unless one first sets the analogy on a solid technical basis.

I've speculated several times on this blog on the role of non-locality in physics.  Bell's Theorem says that the statistical distribution of quantum predictions cannot be explained by any local, deterministic theory of physics if, by 'local and deterministic', you mean 'evolving forward in time in a local and deterministic way'; but it's quite possible to generate this same statistical distribution of spacetime predictions using a theory that evolves locally and deterministically in a fifth dimension orthogonal to spacetime.  Which strikes a familiar chord through the analogy with calculus variables, because non-locality is, qualitatively at least, the defining characteristic of what we mean by "side-effects", and the machinery of α-renaming maintains hygiene for these operations exactly by going off and doing some term rearranging on the side (as if in a separate dimension of rewriting that we usually don't bother to track).  Indeed, thought of this way, a "variable" seems to be an inherently distributed entity, spread over a region of the term — called its scope — rather than located at a specific point.  A variable instance might appear to have a specific location, but only because we look at a concrete syntactic term; naturally we have to have a particular concrete term in order to write it down, but somehow this doesn't seem to do justice to the reality of the hygiene machinery.  One could think of an equivalence class of terms under α-renaming, but imho even that is a bit passive.  The reality of a variable, I've lately come to think, is a dynamic distributed entity weaving through the term, made up of the binding construct (such as a λ), all the free instances within its scope, and the living connections that tie all those parts together; I imagine if you put your hand on any part of that structure you could feel it humming with vitality.

Metatime

To give a slightly less hand-wavy description of my earlier post on Bell's Theorem — since it is the most concrete example we have to inform our view of the analogy on the physics side:

Bell looked at a refinement of the experiment from the EPR paradox.  A device emits two particles with entangled spin, which shoot off in opposite directions, and their spins are measured by oriented detectors at some distance from the emitter.  The original objection of Einstein, Podolsky, and Rosen was that the two measurements are correlated with each other, but because of the distance between the two detectors, there's no way for information about either measurement to get to where the other measurement takes place without "spooky action at a distance".  Bell refined this objection by noting that the correlation of spin measurements depends on the angle θ between the detectors.  If you suppose that the orientations of the detectors at measurement are not known at the time and place where the particles are emitted, and that the outcomes of the measurements are determined by some sort of information — "hidden variable" — propagating from the emission event at no more than the speed of light, then there are limits (called Bell's Inequality) on how the correlation can be distributed as a function of θ, no matter what the probability distribution of the hidden variable.  The distribution predicted by quantum mechanics violates Bell's Inequality; so if the actual probability distribution of outcomes from the experiment matches the quantum mechanical prediction, we're living in a world that can't be explained by a local hidden-variable theory.

My point was that this whole line of reasoning supposes the state of the world evolves forward in time.  If it doesn't, then we have to rethink what we even mean by "locality", and I did so.  Suppose our entire four-dimensional reality is generated by evolving over a fifth dimension, which we might as well call "metatime".  "Locality" in this model means that information about the state of one part of spacetime takes a certain interval of metatime to propagate a certain distance to another part of spacetime.  Instead of trying to arrange the probability distribution of a hidden variable at the emission event so that it will propagate through time to produce the desired probability distribution of measurements — which doesn't work unless quantum mechanics is seriously wrong about this simple system — we can start with some simple, uniform probability distribution of possible versions of the entire history of the experiment, and by suitably arranging the rules by which spacetime evolves, we can arrange that eventually spacetime will settle into a stable state where the probability distribution is just what quantum mechanics predicts.  In essence, it works like this:  let the history of the experiment be random (we don't need nondeterminism here; this is just a statement of uniformly unknown initial conditions), and suppose that the apparent spacetime "causation" between the emission and the measurements causes the two measurements to be compared to each other.  Based on θ, let some hidden variable decide whether this version of history is stable; and if it isn't stable, just scramble up a new one (we can always do that by pulling it out of the uniform distribution of the hidden variable, without having to posit fundamental nondeterminism).  By choosing the rule for how the hidden variable interacts with θ, you can cause the eventual stable history of the experiment to exhibit any probability distribution you choose.
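
Here is a toy rendition of that proof-of-concept, with one assumption layered on by me rather than taken from the earlier post: I take the target statistics to be those of a spin singlet, for which the two outcomes agree with probability sin²(θ/2).  Candidate histories are scrambled up at random, and a uniform hidden variable decides stability:

    # A toy rendition of the metatime proof-of-concept.  The target statistics are
    # an assumption added here:  a spin-singlet pair, for which quantum mechanics
    # says the two outcomes agree with probability sin^2(theta/2), giving the
    # correlation -cos(theta).  Whole histories are scrambled up at random; a
    # uniform hidden variable then decides whether each history is stable.

    import random, math

    def stable(outcome_a, outcome_b, theta, hidden):
        """Keep this history with probability equal to its quantum weight."""
        p_same = math.sin(theta / 2) ** 2
        weight = p_same if outcome_a == outcome_b else 1 - p_same
        return hidden < weight

    def settle_history(theta):
        """Scramble up fresh histories until one is stable; return its outcomes."""
        while True:
            a_out = random.choice((+1, -1))
            b_out = random.choice((+1, -1))
            if stable(a_out, b_out, theta, random.random()):
                return a_out, b_out

    def correlation(theta, trials=100000):
        pairs = (settle_history(theta) for _ in range(trials))
        return sum(a * b for a, b in pairs) / trials

    for deg in (0, 45, 90, 135, 180):
        theta = math.radians(deg)
        print(deg, round(correlation(theta), 3), round(-math.cos(theta), 3))
    # The sampled correlation tracks -cos(theta); swapping in a different
    # acceptance rule would yield any other distribution just as readily.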

That immense power is something to keep a cautious eye on:  not only can this technique produce the probability distribution predicted by quantum mechanics, it can produce any other probability distribution as well.  So, if the general structure of this mathematical theory determines something about the structure of the physical reality it depicts, what it determines is apparently not, in any very straightforward fashion, that probability distribution.

Transformations

The side of the analogy we have prior detailed structural knowledge about is the vau-calculus side.  Whatever useful insights we may hope to extract from the metatime approach to Bell's Theorem, it's very sketchy compared to vau-calculus.  So if we want to work out a structural pattern that applies to both sides of the analogy, it's plausible to start building from the side we know about, questioning and generalizing as we go along.  To start with,

  • Suppose we have a complex system, made up of interconnected parts, evolving by some sort of transformative steps according to some simple rules.
Okay, freeze frame.  Why should the system be made up of parts?  Well, in physics it's (almost) always the parts we're interested in.  We ourselves are, apparently, parts of reality, and we interact with parts of reality.  Could we treat the whole as a unit and then somehow temporarily pull parts out of it when we need to talk about them?  Maybe, but the form with parts is still the one we're primarily interested in.  And what about "transformative steps"; do we want discrete steps rather than continuous equations?  Actually, yes, that is my reading of the situation; not only does fundamental physics appear to be shot through with discreteness (I expanded on this point a while back), but the particular treatment I used for my metatime proof-of-concept (above) used an open-ended sequence of discrete trials to generate the requisite probability distribution.  If a more thoroughly continuous treatment is really wanted, one might try to recover continuity by taking a limit a la calculus.
  • Suppose we separate the transformation rules into two groups, which we call bookkeeping rules and operational rules; and suppose we have a set of exclusive criteria on system configurations, call these hygiene conditions, which must be satisfied before any operational rule can be applied.

Freeze again.  At first glance, this looks pretty good.  From any unhygienic configuration, we can't move forward operationally until we've done bookkeeping to ensure hygiene.  Both calculus rewriting and the metatime proof-of-concept seemingly conform to this pattern; but the two cases differ profoundly in how their underlying hygiene (supposing that's what it is, in the physics case) affects the form of the modeled system, and we'll need to consider the difference carefully if we mean to build our speculations on a sound footing.

Determinism and rewriting

Hygiene in rewriting is all about preserving properties of a term (to wit, variable instance–binding correspondences), whereas our proof-of-concept metatime transformations don't appear to be about perfectly preserving something but rather about shaping probability distributions.  One might ask whether it's possible to set up the internals of our metatime model so that the probability distribution is a consequence, or symptom, of conserving something behind the scenes.  Is the seemingly nondeterministic outcome of our quantum observation in a supposedly small quantum system here actually dictated by the need to maintain some cosmic balance that can't be directly observed because it's distributed over a ridiculously large number of entities (such as the number of electrons in the universe)?  That could lead to some bracing questions about how to usefully incorporate such a notion into a mathematical theory.

As an alternative, one might decide that the probability distribution in the metatime model should not be a consequence of absolutely preserving a condition.  There are two philosophically disparate sorts of models involving probabilities:  either the probability comes from our lack of knowledge (the hidden-variable hypothesis), and in the underlying model the universe is computing an inevitable outcome; or the probability is in the foundations (God playing dice, in the Einsteinian phrase), and in the underlying model the universe is exploring the range of possible outcomes.  I discussed this same distinction, in another form, in an earlier post, where it emerged as the defining philosophical distinction between a calculus and a grammar (here).  In those terms, if our physics model is fundamentally deterministic then it's a calculus and by implication has that structural affinity with the vau-calculi on the other side of the analogy; but if our physics model is fundamentally nondeterministic then it's a grammar, and our analogy has to try to bridge that philosophical gap.  Based on past experience, though, I'm highly skeptical of bridging the gap; if the analogy can be set on a concrete technical basis, the TOE on the physics side seems to me likely to be foundationally deterministic.

The foundationally deterministic approach to probability is to start with a probabilistic distribution of deterministic initial states, and evolve them all forward to produce a probabilistic distribution of deterministic final states.  Does the traditional vau-calculus side of our analogy, where we have so much detail to start with, have anything to say about this?  In the most prosaic sense, one suspects not; probability distributions don't traditionally figure into deterministic computation semantics, where this approach would mean considering fuzzy sets of terms.  There may be some insight lurking, though, in the origins of calculus hygiene.

When Alonzo Church's 1932/3 formal logic turned out to be inconsistent, he tried to back off and find some subset of it that was provably consistent.  Here consistent meant that not all propositions are equivalent to each other, and the subset of the logic that he and his student J. Barkley Rosser proved consistent in this sense was what we now call λ-calculus.  The way they did it was to show that if any term T1 can be reduced in the calculus in two different ways, as T2 and T3, then there must be some T4 that both of them can be reduced to.  Since logical equivalence of terms is defined as the smallest congruence generated by the rewriting relation of the calculus, from the Church-Rosser property it follows that if two terms are equivalent, there must be some term that they both can be reduced to; and therefore, two different irreducible terms cannot possibly be logically equivalent to each other.
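
The property is easy to watch on a concrete term; here is a small sketch, using naive substitution only because no variable capture can arise in this particular example:

    # Church-Rosser on one concrete term:  (\x.x)((\y.y) z)  has two one-step
    # reducts, and both of them reduce to the common term z.  Naive substitution
    # is safe in this example because no variable capture can arise.

    def subst(t, x, r):
        tag = t[0]
        if tag == 'var':
            return r if t[1] == x else t
        if tag == 'lam':
            return t if t[1] == x else ('lam', t[1], subst(t[2], x, r))
        return ('app', subst(t[1], x, r), subst(t[2], x, r))

    def reducts(t):
        """All terms reachable from t by contracting exactly one beta-redex."""
        out = []
        if t[0] == 'app':
            f, a = t[1], t[2]
            if f[0] == 'lam':
                out.append(subst(f[2], f[1], a))          # contract this redex
            out += [('app', f2, a) for f2 in reducts(f)]  # or one inside f
            out += [('app', f, a2) for a2 in reducts(a)]  # or one inside a
        elif t[0] == 'lam':
            out += [('lam', t[1], b2) for b2 in reducts(t[2])]
        return out

    T1 = ('app', ('lam', 'x', ('var', 'x')),
                 ('app', ('lam', 'y', ('var', 'y')), ('var', 'z')))
    T2, T3 = reducts(T1)
    print(T2)              # ((\y.y) z) : the outer redex was contracted
    print(T3)              # ((\x.x) z) : the inner redex was contracted
    print(reducts(T2), reducts(T3))   # both rejoin at [('var', 'z')]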

Proving the Church-Rosser theorem for λ-calculus was not, originally, a simple matter.  It took three decades before a simple proof began to circulate, and the theorem for variant calculi continues to be a challenge.  And this is (in one view of the matter, at least) where hygiene comes into the picture.  Church had three major rewriting rules in his system, later called the α, β, and γ rules.  The α rule was the "bookkeeping" rule; it allowed renaming a bound variable as long as you don't lose its distinction from other variables in the process.  The β rule is now understood as the single operational rule of λ-calculus, how to apply a function to an argument.  The γ rule is mostly forgotten now; it was simply doing a β-step backward, and was later dropped in favor of starting with just α and β and then taking the congruent closure (reflexive, symmetric, transitive, and compatible).  Ultimately the Church-Rosser theorem allows terms to be sorted into β-equivalence classes; but the terms in each class aren't generally thought of as "the same term", just "equivalent terms".  α-equivalent terms, though, are much closer to each other, and for many purposes would actually be thought of as "the same term, just written differently".  Recall my earlier description of a variable as a distributed entity, weaving through the term, made up of binding construct, instances, and living connections between them.  If you have a big term, shot through with lots of those dynamic distributed entities, the interweaving could really be vastly complex.  So factoring out the α-renaming is itself a vast simplification, which for a large term may dwarf what's left after factoring to complete the Church-Rosser proof.  To see by just how much the bookkeeping might dwarf the remaining operational complexity, imagine scaling the term up to the sort of cosmological scope mentioned earlier — like the number of electrons in the universe.

It seems worth considering, that hygiene may be a natural consequence of a certain kind of factoring of a vastly interconnected system:  you sequester almost all of the complexity into bookkeeping with terrifically simple rules applied on an inhumanly staggering scale, and comparatively nontrivial operational rules that never have to deal directly with the sheer scale of the system because that part of the complexity was factored into the bookkeeping.  In that case, at some point we'll need to ask when and why that sort of factoring is possible.  Maybe it isn't really possible for the cosmos, and a flaw in our physics is that we've been trying so hard to factor things this way; when we really dive into that question we'll be in deep waters.

It's now no longer clear, btw, that geometry corresponds quite directly to α-renaming.  There was already some hint of that in the view of vau-calculus side-effects as "non-local", which tends to associate geometry with vau-calculus term structure rather than α-renaming as such.  Seemingly, hygiene is then a sort of adjunct to the geometry, something that allows the geometry to coexist with the massive interconnection of the system.

But now, with massive interconnection resonating between the two sides of the analogy, it's definitely time to ask some of those bracing questions about incorporating cosmic connectivity into a mathematical theory of physics.

Nondeterminism and the cosmic footprint

We want to interpret our probability distribution as a footprint, and reconstruct from it the massively connected cosmic order that walked there.  Moreover, we're conjecturing that the whole system is factorable into bookkeeping/hygiene on one hand(?), and operations that amount to what we'd ordinarily call "laws of physics" on the other; and we'd really like to deduce, from the way quantum mechanics works, something about the nature of the bookkeeping and the factorization.

Classically, if we have a small system that's acted on by a lot of stuff we don't know about specifically, we let all those influences sum to a potential field.  One might think of this classical approach as a particular kind of cosmological factorization in which the vast cosmic interconnectedness is reduced to a field, so one can then otherwise ignore almost everything to model the behavior of the small system of interest using a small operational set of physical laws.  We know the sort of cosmic order that reduces that way, it's the sort with classical locality (relative to time evolution); and the vaster part of the factorization — the rest of the cosmos, that reduced to a potential field — is of essentially the same kind as the small system.  The question we're asking at this point is, what sort of cosmic order reduces such that its operational part is quantum mechanics, and what does its bookkeeping part look like?  Looking at vau-calculus, with its α-equivalence and Church-Rosser β-equivalence, it seems fairly clear that hygiene is an asymmetric factorization:  if the cosmos factors this way, the bookkeeping part wouldn't have to look at all like quantum mechanics.  A further complication is that quantum mechanics may be an approximation only good when the system you're looking at is vastly smaller than the universe as a whole; indeed, this conjecture seems rather encouraged by what happens when we try to apply our modern physical theories to the cosmos as a whole: notably, dark matter.  (The broad notion of asymmetric factorizations surrounding quantum mechanics brings to mind both QM's notorious asymmetry between observer and observed, and Einstein's suggestion that QM is missing some essential piece of reality.)

For this factorization to work out — for the cosmic system as a whole to be broadly "metaclassical" while factoring through the bookkeeping to either quantum mechanics or a very good approximation thereof — the factorization has to have some rather interesting properties.  In a generic classical situation where one small thing is acted on by a truly vast population of other things, we tend to expect all those other things to average out (as typically happens with a classical potential field), so that the vastness of the population makes their combined influence more stable rather than less; and also, as our subsystem interacts with the outside influence and we thereby learn more about that influence, we become more able to allow for it and still further reduce any residual unpredictability of the system.

Considered more closely, though, the expectation that summing over a large population would tend to average out is based broadly on the paradigm of signed magnitudes on an unbounded scale that attenuate over distance.  If you don't have attenuation, and your magnitudes are on a closed loop, such as angles of rotation, increasing the population just makes things more unpredictable.  Interestingly, I'd already suggested in one of my earlier explorations of the hygiene analogy that the physics hygiene condition might be some sort of rotational constraint, for the — seemingly — unrelated reason that the primary geometry of physics has 3+1 dimension structure, which is apparently the structure of quaternions, and my current sense of quaternions is that they're the essence of rotation.  Though when things converge like this, it can be very hard to distinguish between an accidental convergence and one that simply reassembles fragments of a deeper whole.
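
A quick numerical toy of my own, not drawn from either side of the analogy, to illustrate the contrast: averaged bounded magnitudes settle down as the population grows, while sums of angles on the closed loop do not.

    # A toy contrast, mine rather than taken from either side of the analogy:
    # averaging many bounded signed contributions concentrates as the population
    # grows, while summing many angles on the closed loop [0, 2*pi) stays
    # uniformly unpredictable no matter how large the population gets.

    import random, math, statistics

    def averaged_magnitudes(n):
        """Mean of n independent contributions in [-1, 1]: concentrates near 0."""
        return sum(random.uniform(-1, 1) for _ in range(n)) / n

    def summed_angles(n):
        """Sum of n independent angles, wrapped back onto the loop: stays uniform."""
        return sum(random.uniform(0, 2 * math.pi) for _ in range(n)) % (2 * math.pi)

    for n in (10, 1000, 10000):
        mags = [averaged_magnitudes(n) for _ in range(100)]
        angs = [summed_angles(n) for _ in range(100)]
        print(n, round(statistics.pstdev(mags), 4), round(statistics.pstdev(angs), 4))
    # The first spread shrinks roughly as 1/sqrt(n); the second stays near the
    # spread of a single uniform angle (about 1.81 radians).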

I'll have a thought on the other classical problem — losing unpredictability as we learn more about outside influences over time — after collecting some further insights on the structural dynamics of bookkeeping.

Massive interconnection

Given a small piece of the universe, which other parts of the universe does it interact with, and how?

In the classical decomposition, all interactions with other parts of the universe are based on relative position in the geometry — that is, locality.  Interestingly, conventional quantum mechanics retains this manner of selecting interactions, embedding it deeply into the equational structure of its mathematics.  Recall the Schrödinger equation,

iℏ ∂Ψ/∂t   =   Ĥ Ψ .
The element that shapes the time evolution of the system — the Hamiltonian operator Ĥ — is essentially an embodiment of the classical expectation of the system behavior; which is to say, interaction according to the classical rules of locality.  (I discussed the structure of the Schrödinger equation at length in an earlier post, yonder.)  Viewing conventional QM this way, as starting with classical interactions and then tacking on quantum machinery to "fix" it, I'm put in mind of Ptolemaic epicycles, tacked on to the perfect-circles model of celestial motion to make it work.  (I don't mean the comparison mockingly, just critically; Ptolemy's system worked pretty well, and Copernicus used epicycles too.  Turns out there's a better way, though.)

How does interaction-between-parts play out in vau-calculus, our detailed example of hygiene at work?

The whole syntax of a calculus term is defined by two aspects:  the variables — by which I mean those "distributed entities" woven through the term, each made of a binding, its bound instances, and the connections that hygiene maintains between them — and, well, everything else.  Once you ignore the specific identities of all the variable instances, you've just got a syntax tree with each node labeled by a context-free syntax production; and the context-free syntax doesn't have very many rules.  In λ-calculus there are only four syntax rules: a term is either a combination, or a λ-abstraction, or a variable, or a constant.  Some treatments simplify this by using variables for the "constants", and it's down to only three syntax rules.  The lone operational rule β,

((λx.T1)T2)   →   T1[x ← T2]   ,
gives a purely local pattern in the syntax tree for determining when the operation can be applied:  any time you have a parent node that's a combination whose left child is a λ-abstraction.  Operational rules stay nearly this simple even in vau-calculi; the left-hand side of each operational rule specifies when it can be applied by a small pattern of adjacent nodes in the syntax tree, and mostly ignores variables (thanks to hygienic bookkeeping).  The right-hand side is where operational substitution may be introduced.  So evidently vau-calculus — like QM — is giving local proximity a central voice in determining when and how operations apply, with the distributed aspect (variables) coming into play in the operation's consequences (right-hand side).
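
For emphasis, here is the locality of that left-hand side rendered as a toy program: deciding where β could fire consults only a parent node and its left child, never the variables woven through the rest of the term.

    # The left-hand side of beta as a purely local test:  a node is a redex exactly
    # when it is a combination whose left child is a lambda-abstraction.  Locating
    # every place the operational rule could fire never consults the variables
    # woven through the rest of the term.

    def is_redex(node):
        return node[0] == 'app' and node[1][0] == 'lam'

    def redex_positions(term, path=()):
        """Paths (tuples of child indices) of every beta-redex in the term."""
        found = [path] if is_redex(term) else []
        if term[0] == 'lam':
            found += redex_positions(term[2], path + (2,))
        elif term[0] == 'app':
            found += redex_positions(term[1], path + (1,))
            found += redex_positions(term[2], path + (2,))
        return found

    example = ('lam', 'f',
               ('app', ('lam', 'x', ('var', 'x')),
                       ('app', ('var', 'f'),
                               ('app', ('lam', 'y', ('var', 'y')), ('var', 'z')))))
    print(redex_positions(example))   # [(2,), (2, 2, 2)]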

Turning it around, though, if you look at a small subterm — analogous, presumably, to a small system studied in physics — what rules govern its non-local connections to other parts of the larger term?  Let's suppose the term is larger than our subterm by a cosmically vast amount.  The free variables in the subterm are the entry points by which non-local influences from the (vast) context can affect the subterm (via substitution functions).  And there is no upper limit to how fraught those sorts of interconnections can get (which is, after all, what spurs advocates of side-effect-less programming).  That complexity belongs not to the "laws of physics" (neither the operational nor even the bookkeeping rules), but to the configuration of the system.  From classical physics, we're used to having very simple laws, with all the complexity of our problems coming from the particular configuration of the small system we're studying; and now we've had that turned on its head.  From the perspective of the rules of the calculus, yes, we still have very simple rules of manipulation, and all the complexity is in the arrangement of the particular term we're working with; but from the perspective of the subterm of interest, the interconnections imposed by free variables look a lot like "laws of physics" themselves.  If we hold our one subterm fixed and allow the larger term around it to vary probabilistically then, in general, we don't know what the rules are and have no upper bound on how complicated those rules might be.  All we have are subtle limits on the shape of the possible influences by those rules, imposed roundaboutly by the nature of the bookkeeping-and-operational transformations.

One thing about the shape of these nonlocal influences:  they don't work like the local part of operations.  The substitutive part of an operation typically involves some mixture of erasing bound variable instances and copying fragments from elsewhere.  The upshot is that it rearranges the nonlocal topology of the term, that is, rearranges the way the variables interweave — which is, of course, why the bookkeeping rules are needed, to maintain the integrity of the interweaving as it winds and twists.  And this is why a cosmic system of this sort doesn't suffer from a gradual loss of unpredictability as the subsystem interacts with its neighbors in the nonlocal network and "learns" about them:  each nonlocal interaction that affects it changes its nonlocal-network neighbors.  As long as the overall system is cosmically vast compared to the subsystem we're studying, in practice we won't run out of new network neighbors no matter how many nonlocal actions our subsystem undergoes.

This also gives us a specific reason to suspect that quantum mechanics, by relying on this assumption of an endless supply of fresh network neighbors, would break down when studying subsystems that aren't sufficiently small compared to the cosmos as a whole.  Making QM (as I've speculated before), like Newtonian physics, an approximation that works very well in certain cases.

Factorization

Here's what the reconstructed general mathematical model looks like so far:

  • The system as a whole is made up of interconnected parts, evolving by transformative steps according to simple rules.

  • The interconnections form two subsystems:  local geometry, and nonlocal network topology.

  • The transformation rules are of two kinds, bookkeeping and operational.

  • Operational rules can only be applied to a system configuration satisfying certain hygiene conditions on its nonlocal network topology; bookkeeping, which only acts on nonlocal network topology, makes it possible to achieve the hygiene conditions.

  • Operational rules are activated based on local criteria (given hygiene).  When applied, operations can have both local and nonlocal effects, while the integrity of the nonlocal network topology is maintained across both kinds of effect via hygiene, hence bookkeeping.

To complete this picture, it seems, we want to consider what a small local system consists of, and how it relates to the whole.  This is all the more critical since we're entertaining the possibility that quantum mechanics might be an approximation that only works for a small local system, so that understanding how a local system relates to the whole may be crucial to understanding how quantum mechanics can arise locally from a globally non-quantum TOE.

A local system consists of some "local state", stuff that isn't interconnection of either kind; and some interconnections of (potentially) both kinds that are entirely encompassed within the local system.  For example, in vau-calculi — our only prior source for complete examples — we might have a subterm (λx.(y‍x)).  Variables are nonlocal network topology, of course, but in this case variable x is entirely contained within the local system.  The choice of variable name "x" is arbitrary, as long as it remains different from "y" (hygiene).  But what about the choice of "y"?  In calculus reasoning we would usually say that because y is free in this subterm, we can't touch it; but that's only true if we're interested in comparing it to specific contexts, or to other specific subterms.  If we're only interested in how this subterm relates to the rest of the universe, and we have no idea what the rest of the universe is, then the free variable y really is just one end of a connection whose other end is completely unknown to us; and a different choice of free variable would tell us exactly as much, and as little, as this one.  We would be just as well off with (λx.(z‍x)), or (λz.(w‍z)), or even (λy.(x‍y)) — as long as we maintain the hygienic distinction between the two variables.  The local geometry that can occur just outside this subterm, in its surrounding context, is limited to certain specific forms (by the context-free grammar of the calculus); the nonlocal network topology is vastly less constrained.

The implication here is that all those terms just named are effectively equivalent to (λx.(y‍x)).  One might be tempted to think of this as simply different ways of "writing" the same local system, as in physics we might describe the same local system using different choices of coordinate axes; but the choice of coordinate axes is about local geometry, not nonlocal network topology.  Here we're starting with simple descriptions of local systems, and then taking the quotient under the equivalence induced by the bookkeeping operations.  It's tempting to think of the pre-quotient simple descriptions as "classical" and analogize the quotient to "quantum", but in fact there is a second quotient to be taken.  In the metatime proof-of-concept, the rewriting kept reshuffling the entire history of the experiment until it reached a steady state — the obvious analogy is to a calculus irreducible term, the final result of the operational rewrite relation of the calculus.  And this, at last, is where Church-Rosser-ness comes in.  Church-Rosser is what guarantees that the same irreducible form (if any) is reached no matter in what order operational rules are applied.  It's the enabler for each individual state of the system to evolve deterministically.  To emphasize this point:  Church-Rosser-ness applies to an individual possible system state, thus belongs to the deterministic-foundations approach to probabilities.  The probability distribution itself is made up of individual possibilities each one of which is subject to Church-Rosser-ness.  (Church-Rosser-ness is also, btw, a property of the mathematical model:  one doesn't ask "Why should these different paths of system state evolution all come back together to the same normal form?", because that's simply the kind of mathematical model one has chosen to explore.)
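
The claim that a different choice of names tells us exactly as much can be made concrete with a toy canonicalization (my own illustration, and assuming the subterm is already hygienic so that distinct variables bear distinct names): rename every variable, bound or free, in order of first occurrence, and all those renamings of the subterm collapse to the same description of the local system.

    # A toy canonicalization of a subterm viewed as a local system:  rename every
    # variable, bound or free, to a name determined by order of first occurrence.
    # This assumes the subterm is already hygienic (distinct variables bear
    # distinct names), which is exactly what the bookkeeping guarantees.

    def canonical(term, names=None):
        names = {} if names is None else names
        tag = term[0]
        if tag in ('var', 'lam'):
            name = names.setdefault(term[1], 'v%d' % len(names))
            if tag == 'var':
                return ('var', name)
            return ('lam', name, canonical(term[2], names))
        if tag == 'app':
            return ('app', canonical(term[1], names), canonical(term[2], names))
        raise ValueError(tag)

    a = ('lam', 'x', ('app', ('var', 'y'), ('var', 'x')))   # (lambda x. (y x))
    b = ('lam', 'z', ('app', ('var', 'w'), ('var', 'z')))   # (lambda z. (w z))
    c = ('lam', 'y', ('app', ('var', 'x'), ('var', 'y')))   # (lambda y. (x y))
    d = ('lam', 'x', ('app', ('var', 'x'), ('var', 'x')))   # loses the distinction
    print(canonical(a) == canonical(b) == canonical(c))     # True
    print(canonical(a) == canonical(d))                     # False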

The question we're trying to get a handle on is why the nonlocal effects of some operational rules would appear to be especially compatible with the bookkeeping quotient of the local geometry.

Side-effects

In vau-calculi, the nonlocal operational effects (i.e., operational substitution functions) that do not integrate smoothly with bookkeeping (i.e., with α-renaming) are the ones that support side-effects; and the one nonlocal operational effect that does integrate smoothly with bookkeeping — β-substitution — supports partial evaluation and turns out to be optional, in the sense that the operational semantics of the system could be described without that kind of nonlocal effect and the mathematics would still be correct with merely a weaker equational theory.

This suggests that in physics, gravity could be understood without bringing nonlocal effects into it at all, though there may be some sort of internal mathematical advantage to bringing them in anyway; while the other forces may be thought of as, in some abstract sense, side-effects.

So, what exactly makes a side-effect-ful substitution side-effect-ful?  And conversely:  β-substitution is also a form of substitution; it engages the nonlocal network topology, reshuffling it by distributing copies of subterms, the sort of thing I speculated above may be needed to maintain the unpredictability aspect of quantum mechanics.  So what makes β-substitution not side-effect-ful in character, beyond the very specific technicalities of β- and α-substitution?  And just how much, or how little, should we be abstracting away from those technicalities?  I'm supposing we have to abstract away at least a bit, on the principle that physics isn't likely to be technically close to vau-calculi in its mathematical details.

Here's a stab at a sufficient condition:

  • A nonlocal operational effect is side-effect-ful just if it perturbs the pre-existing local geometry.

The inverse property, called "purity" in a programming context (as in "pure function"), is that the nonlocal operational effect doesn't perturb the pre-existing local geometry.  β-substitution is pure in this sense, as it replaces a free variable-instance with a subterm but doesn't affect anything local other than the variable-instance itself.  Contrast this with the operational substitution for control variables; the key clauses (that is, nontrivial base cases) of the two substitutions are

x[x ← T]   →   T   .

(τx.T)[x ← C]   →   (τx.C[T[x ← C]])   .

The control substitution alters the local-geometric distance between pre-existing structure T and whatever pre-existing immediate context surrounds the subterm acted on.  Both substitutions have the — conjecturally important — property that they substantially rearrange the nonlocal network topology by injecting arbitrary new network connections (that is, new free variables).  The introduction of new free variables is a major reason why vau-calculi need bookkeeping to maintain hygiene; although, interestingly, it's taken all this careful reasoning about bookkeeping to conclude that bookkeeping isn't actually necessary to the notion of purity/impurity (or side-effect-ful/non-side-effect-ful); apparently, bookkeeping is about perturbations of the nonlocal network topology, whereas purity/impurity is about perturbations of the local geometry.  To emphasize the point, one might call this non-perturbation of local geometry co-hygiene — all the nonlocal operational effects must be hygienic, which might or might not require bookkeeping depending on internals of the mathematics, but only the β- and gravity nonlocal effects are co-hygienic.
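
One more toy, to pin the contrast down in executable form; the constructors are hypothetical and the recursive part of the control clause is elided, so this is a sketch of the shape of the two clauses, nothing more.  The β-style clause consumes just the variable leaf it acts on, while the control-style clause interposes the substituted frame between pre-existing structure and whatever used to surround it.

    # The contrast in miniature (hypothetical constructors; the recursion into T in
    # the control clause is elided):  the beta-style clause consumes only the
    # variable leaf it acts on, while the control-style clause interposes the
    # substituted frame between pre-existing structure and its old surroundings.

    def plug(frame, subterm):
        """Plug subterm into the hole of a one-node context frame."""
        return tuple(subterm if piece == 'HOLE' else piece for piece in frame)

    def beta_clause(var_leaf, replacement):
        """x[x <- T]  -->  T :  only the variable leaf is touched; no pre-existing
        node changes its distance to any other."""
        assert var_leaf[0] == 'var'
        return replacement

    def control_clause(catch_term, context_frame):
        """Shape of (tau x . T)[x <- C]  -->  (tau x . C[T[x <- C]]), with the
        recursion into T elided:  pre-existing T ends up one frame farther from
        whatever sits immediately outside the tau."""
        tag, x, body = catch_term
        assert tag == 'tau'
        return ('tau', x, plug(context_frame, body))

    print(beta_clause(('var', 'x'), ('lam', 'w', ('var', 'w'))))
    print(control_clause(('tau', 'x', ('app', ('var', 'y'), ('var', 'z'))),
                         ('app', 'HOLE', ('var', 'a'))))
    # -> ('tau', 'x', ('app', ('app', ('var', 'y'), ('var', 'z')), ('var', 'a')))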

Co-hygiene

Abstracting away from how we got to it, here's what we have:

  • A complex system of parts, evolving through a Church-Rosser transformation step relation.

  • Interconnections within a system state, partitioned into local (geometry) and nonlocal (network).

  • Each transformation step is selected locally.

  • The nonlocal effects of each transformation step rearrange — scramble — nonlocal connections at the locus where applied.

  • Certain transformative operations have nonlocal effects that do not disrupt pre-existing local structure — that are co-hygienic — and thereby afford particularly elegant description.

What sort of elegance is involved in the description of a co-hygienic operation depends on the technical character of the mathematical model; for β-reduction, what we've observed is functional compatibility between β- and α-substitution, while for gravity we've observed the general-relativity integration between gravity and the geometry of spacetime.

So my proposed answer to the conundrum I've been pondering is that the affinity between gravity and geometry suggests a modeling strategy with a nonlocal network pseudo-randomly scrambled by locally selected operational transformations evolving toward a stable state of spacetime, in which the gravity operations are co-hygienic.  A natural follow-on question is just what sort of mathematical machinery, if any, would cause this network-scrambling to approximate quantum mechanics.

On the side, I've got intimations here that quantum mechanics may be an approximation induced by the pseudo-random network scrambling when the system under study is practically infinitesimal compared to the cosmos as a whole, and perhaps that the network topology has a rotational aspect.

Meanwhile, an additional line of possible inquiry has opened up.  All along I've been trying to figure out what the analogy says about physics; but now it seems one might study the semantics of a possibly-side-effect-ful program fragment by some method structurally akin to quantum mechanics.  The sheer mathematical perversity of quantum mechanics makes me skeptical that this could be a practical approach to programming semantics; on the other hand, it might provide useful insights for the TOE mathematics.

Epilog: hygiene

So, what happened to hygiene?  It was a major focus of attention through nearly the whole investigation, and then dropped out of the plot near the end.

At its height of prestige, when directly analogized to spacetime geometry (before that fell through), hygiene motivated the suggestion that geometry might be thought of not as the venue where physics happens but merely as part of its rules.  That suggestion is still somewhat alive since the proposed solution treats geometry as abstractly just one of the two forms of interconnection in system state, though there's a likely asymmetry of representation between the two forms.  There was also some speculation that understanding hygiene on the physics side could be central to making sense of the model; that, I'd now ease off on, but do note that in seeking a possible model for physics one ought to keep an eye out for a possible bookkeeping mechanism, and certainly resolving the nonlocal topology of the model would seem inseparable from resolving its bookkeeping.  So hygiene isn't out of the picture, and may even play an important role; just not with top billing.

Would it be possible for the physics model to be unhygienic?  In the abstract sense I've ended up with, lack of hygiene would apparently mean an operation whose local effect causes nonlocal perturbation.  Whether or not dynamic scope qualifies would depend on whether dynamically scoped variables are considered nonlocal; but since we expect some nonlocal connectivity, and those variables couldn't be perturbed by local operations without losing most/all of their nonlocality, probably the physics model would have to be hygienic.  My guess is that if a TOE actually tracked the nonlocal network (as opposed to, conjecturally, introducing a quantum "blurring" as an approximation for the cumulative effect of the network), the tracking would need something enough like calculus variables that some sort of bookkeeping would be called for.