## Saturday, March 1, 2014

### Continuations and term-rewriting calculi

Thinking is most mysterious, and by far the greatest light upon it that we have is thrown by the study of language.
Benjamin Lee Whorf.

In this post, I hope to defuse the pervasive myth that continuations are intrinsically a whole-computation device.  They aren't.  I'd originally meant to write about the relationship between continuations and delimited continuations, but find that defusing the myth is prerequisite to the larger discussion, and will fill up a post by itself.

To defuse the myth, I'll look at how continuations are handled in the vau-control-calculus.  Explaining that calculus involves explaining the unconventional way vau-calculi handle variables.  So, tracing back through the tangle of ideas to find a starting point, I'll begin with some remarks on the use of variables in term-rewriting calculi.

While I'm extracting this high-level insight from the lower-level math of the situation, I'll also defuse a second common misapprehension about continuations, that they are essentially function-like.  This is a subtle point:  continuations are invoked as if they were functions, and traditionally appear in the form of first-class functions, but their control-flow behavior is orthogonal to function application.  This point is (as I've just demonstrated) difficult even to articulate without appealing to lower-level math; but it arises from the same lower-level math as the point about whole-computation, so I'll extract it here with essentially no additional effort.

Contents
Partial-evaluation variables
Continuations by global rewrite
Control variables
Insights
Partial-evaluation variables

From the lasting evidence, Alonzo Church had a thing about variables.  Not as much of a thing as Haskell Curry, who developed a combinatorial calculus with no variables at all; but Church did feel, apparently, that a meaningful logical proposition should not have unbound variables in it.  He had an elegant insight into how this could be accomplished:  have a single binding construct —which for some reason he called λ— for the variable parameter in a function definition, and then —I quite enjoyed this— you don't need additional binding constructs for the existential and universal quantifiers, because you can simply make them higher-order functions and leave the binding to their arguments.  For his quantifiers Π and Σ, Π(F,G) meant for all values v such that F(v) is true, G(v) is true; and Σ(F) meant there exists some value v such that F(v) is true.  The full elegance of this was lost because only the computational subset of his logic survived, under the name λ-calculus, so the quantifiers fell by the wayside; but the habit of a single binding construct has remained.

In computation, though, I suggest that the useful purpose of λ-bound variables is partial evaluation.  This notion dawned on me when working out the details of elementary vau-calculus.  Although I've blogged about elementary vau-calculus in an earlier post, there I was looking at a different issue (explicit evaluation), and took variables for granted.  Suppose, though, that one were centrally concerned only with capturing the operational semantics of Lisp (with fexprs) in a term-rewriting calculus at all, rather than capturing it in a calculus that looks as similar as possible to λ-calculus.  One might end up with something like this:

T   ::=   S | s | (. T) | [wrap T] | A
A   ::=   [eval T T] | [combine T T T]
S   ::=   d | () | e | [operative T T T T]
e   ::=   ⟪ B* ⟫
B   ::=   s ← T
Most of this is the same as in my earlier post (explicit evaluation), but there are three differences:  the internal structure of environments (e) is described; operatives have a different structure, which is fully described; and there are no variables.

Wait.  No variables?

Here, a term (T) is either a self-evaluating term (S), a symbol (s), a pair, an applicative ([wrap T], where T is the underlying combiner), or an active term (A).  An active term is the only kind of term that can be the left-hand side of a rewrite rule: it is either a plan to evaluate something in an environment, or a plan to invoke a combiner with some operands in an environment.  A self-evaluating term is either an atomic datum such as a number (d), or nil, or an environment (e), or an operative — where an operative consists of a parameter tree, an environment parameter, a body, and a static environment.  An environment is a delimited list of bindings (B), and a binding associates a symbol (s) with an assigned value (T).

The rewrite rules with eval on the left-hand side are essentially just the top-level logic of a Lisp evaluator:

[eval S e]   →   S
[eval s e]   →   lookup(s,e)     if lookup(s,e) is defined
[eval (T1 . T2) e]   →   [combine [eval T1 e] T2 e]
[eval [wrap T] e]   →   [wrap [eval T e]]
That leaves two rules with combine on the left-hand side:  one for combining an applicative, and one for combining an operative.  Combining applicatives is easy:
[combine [wrap T0(T1 ... Tn) e]   →   [combine T0 ([eval T1 e] ... [eval Tn e]) e]
Combining operatives is a bit more complicated.  (It will help if you're familiar with how Kernel's `\$vau`  works; see here.)  The combination is rewritten as an evaluation of the body of the operative (its third element) in a local environment.  The local environment starts as the static environment of the operative (its fourth element); then the ordinary parameters of the operative (its first element) are locally bound to the operands of the combination; and the environment parameter of the operative (its second element) is locally bound to the dynamic environment of the combination.
[combine [operative T1 T2 T3 T4] V e]   →   [eval  T3  match(T1,V) · match(T2,e) · T4]     if the latter is defined
where match(X,Y) constructs an environment binding the symbols in definiend X to the corresponding subterms in Y;  echild · eparent  concatenates two environments, producing an environment that tries to look up symbols in echild, and failing that looks for them in eparent;  and a value (V) is a term such that every active subterm is inside a self-evaluating subterm.

Sure enough, there are no variables here.  This calculus behaves correctly.  However, it has a weak equational theory.  Consider evaluating the following two expressions in a standard environment e0.

[eval  `(\$lambda (x) (+ 0 x))`  e0]
[eval  `(\$lambda (x) (* 1 x))`  e0]
Clearly, these two expressions are equivalent; we can see that they are interchangeable.  They both construct an applicative that takes one numerical argument and returns it unchanged.  However, the rewriting rules of the calculus can't tell us this.  These terms reduce to
[wrap [operative  `(x)`  `#ignore`  `(+ 0 x)`  e0]]
[wrap [operative  `(x)`  `#ignore`  `(* 1 x)`  e0]]
and both of these terms are irreducible!  Whenever we call either of these combiners, its body is evaluated in a local environment that's almost like e0; but within the calculus, we can't even talk about what will happen when the body is evaluated.  To do so we would have to construct an active evaluation term for the body; to build the active term we'd need to build a term for the local environment of the call; and to build a term for that local environment, we'd need to bind `x` to some sort of placeholder, meaning "some term, but we don't know what it is yet".

A variable is just the sort of placeholder we're looking for.  So let's add some syntax.  First, a primitive domain of variables.  We call this domain xp, where the "p" stands for "partial evaluation", since that's what we want these variables for (and because, it turns out, we're going to want other variables that are for other purposes).  We can't put this primitive domain under nonterminal S because, when we find out later what a variable stands for, what it stands for might not be self-evaluating; nor under nonterminal A because what it stands for might not be active.  So xp has to go directly under nonterminal T.

T   ::=   xp
We also need a binding construct for these variables.  It's best to use elementary devices in the calculus, to give lots of opportunities for provable equivalences, rather than big monolithic devices that we'd then be hard-put to analyze.  So we'll use a traditional one-variable construct, and expect to introduce other devices to parse the compound definiends that were handled, in the variable-less calculus, by function match.
S   ::=   ⟨λ xp.T⟩
governed by, essentially, the usual β-rule of λ-calculus:
[combine ⟨λ xp.T⟩ V e]   →   T[xp ← V]
That is, combine a λ-expression by substituting its operand (V) for its parameter (xp) in its body (T).  Having decided to bind our variables xp one at a time, we use three additional operative structures to deliver the various parts of the combination one at a time (a somewhat souped-up version of currying):  one structure for processing a null list of operands, one for splitting a dotted-pair operand into its two halves, and one for capturing the dynamic environment of the combination.
S   ::=   ⟨λ0.T⟩ | ⟨λ2.T⟩ | ⟨ε.T⟩
The corresponding rewrite rules are
[combine ⟨λ0.T⟩ () e]   →   T
[combine ⟨λ2.T0⟩ (T1 . T2) e]   →   [combine [combine T0 T1 e] T2 e]
[combine ⟨ε.T0⟩ T1 e]   →   [combine [combine T0 e ⟪⟫] T1 e]

Unlike the variable-less calculus, where the combine rewrite rule initiated evaluation of the body of an operative, here evaluation of the body must be built into the body when the operative is constructed.  This would be handled by the δ-rules (specialized operative-call rewrite rules) for evaluating function definitions.  For example, for variables x,y and standard environment e0,

[eval `(\$lambda (x) (+ 0 x))` e0]   →+   [wrap ⟨ε.⟨λy.⟨λ2.⟨λx.⟨λ0.[eval  `(+ 0 x)`  ⟪`x` ← x⟫ · e0]⟩⟩⟩⟩⟩]
Variable y is a dummy-variable used to discard the dynamic environment of the call, which is not used by ordinary functions.  Variable x is our placeholder, in the constructed term to evaluate the body, for the unknown operand to be provided later.

The innermost redex (reducible expression) here, [eval  `(+ 0 x)`  ⟪`x` ← x⟫ · e0], can be rewritten through a series of steps,

[eval  `(+ 0 x)`  ⟪`x` ← x⟫ · e0]
→   [combine  [eval  `+`  ⟪`x` ← x⟫ · e0]  `(0 x)`  ⟪`x` ← x⟫ · e0]
→   [combine  [wrap +]  `(0 x)`  ⟪`x` ← x⟫ · e0]
→+ [combine  +  `(`0 x`)`  ⟪`x` ← x⟫ · e0]
Where we can go from here depends on additional information of one or another kind.  We may have a rule that tells us the addition operative + doesn't use its dynamic environment, so that we can garbage-collect the environment,
→   [combine  +  `(`0 x`)`  ⟪⟫]
If we have some contextual information that the value of x will be numeric, and a rule that zero plus any number is that number back again, we'd have
→   x
At any rate, we only have the opportunity to even start the partial evaluation of the body, and contemplate these possible further steps, because the introduction of variables allowed us to write a term for the partial evaluation in the first place.

[edit:  I'd goofed, in this post, on the combination rule for λ0; it does not of course induce evaluation of T.  Fixed now.]
Continuations by global rewrite

The idea of using λ-calculus to model programming language semantics goes back at least to Peter Landin in the early 1960s, but there are a variety of programming language features that don't fit well with λ-calculus.  In 1975, Gordon Plotkin proposed a remedy for one of these features — eager argument evaluation, whereas ordinary λ-calculus allows lazy argument evaluation and thereby has different termination properties.  Plotkin designed a variant calculus, the λv-calculus, and proved that on one hand λv-calculus correctly models the semantics of a programming language with eager argument evaluation, while on the other hand it is comparably well-behaved to traditional λ-calculus.  Particularly, the calculus rewriting relation is compatible and Church-Rosser, and satisfies soundness and completeness theorems relative to the intended operational semantics.  (I covered those properties and theorems a bit more in an earlier post.)

In the late 1980s, Matthias Felleisen showed that a technique similar to Plotkin's could be applied to other, more unruly kinds of programming-language behavior traditionally described as "side-effects":  sequential control (continuations), and sequential state (mutable variables).  This bold plan didn't quite work, in that he had to slightly weaken the well-behavedness properties of the calculus.  In both cases (control and state), the problem is to distribute the consequences of a side-effect to everywhere it needs to be known; and Felleisen did this by having special constructs that would "bubble up" through the term, carrying the side-effect with them, until they encompassed the whole term, at which point there would be a whole-term rewriting rule to distribute the side-effect to everywhere it needed to go.  The whole-term rewriting rules were the measure by which the well-behavedness of the calculus would fail, as whole-term rewriting isn't compatible.

For sequential control (our central interest here), Felleisen added two operators, C and A, to λv-calculus.  The syntax of λv-calculus, before the addition, is just that of λ-calculus:

T   ::=   x | (λx.T) | (TT)
In place of the classic β-rule of λ-calculus, λv-calculus has βv, which differs in that the operand in the rule is a value (redexes have to be inside λ-terms):
((λx.T)V)   →   T[x ← V]
The operational semantics, which acts only on whole terms, uses (per Felleisen) an evaluation context E to uniquely determine which subterm is reduced:
E   ::=   ⎕ | (ET) | ((λx.T)E)
E[((λx.T)V)]   ↦   E[T[x ← V]]
For the control calculus, the term syntax adds A and C,
T   ::=   x | (λx.T) | (TT) | (AT) | (CT)
Neither of these operators has the semantics of call-with-current-continuation.  Instead, (AT) means "abort the surrounding computation and just do T", while (CT) means "abort the surrounding computation and apply T to the (aborted) continuation".  Although it's possible to build conventional call-with-current-continuation out of these primitive operators, the primitives themselves are obviously intrinsically whole-term operators.  Operationally, evaluation contexts don't change at all, and the operational semantics has additional rules
E[(AT)]   ↦   T
E[(CT)]   ↦   T(λx.(AE[x]))    for unused variable x
The compatible rewrite relation, →, has rules to move the new operators upward through a term until they reach its top level.  The compatible rules for A are dead easy:
(AT1)T2   →   AT1
V(AT)   →   AT
Evidently, though, once the A operator reaches the top of the term, the only way to get rid of it, so that computation can proceed, is a whole-term rewrite rule,
AT   ᐅ   T
The whole-term rule for C is easy too,
CT   ᐅ   T(λx.(Ax))
but the compatible rewrite rules for C are, truthfully, just a bit frightening:
(CT1)T2   →   C(λx1.(T1(λx2.(A(x1(x2T2))))))    for unused xk
V(CT)   →   C(λx1.(T(λx2.(A(x1(Vx2))))))    for unused xk
This does produce the right behavior (unless I've written it out wrong!), but it powerfully perturbs the term structure; Felleisen's description of this as "bubbling up" is apt.  Imho, it's quite a marvelous achievement, especially given the prior expectation that nothing of the kind would be possible — an achievement in no way lessened by what can now be done with a great deal of hindsight.

The perturbation effect appears to me, in retrospect, to be a consequence of describing the control flow of continuations using function-application structure.  My own approach makes no attempt to imitate function-application and, seemingly as a result, its constructs move upward without the dramatic perturbation of Felleisen's C.

Various constraints can be tampered with to produce more well-behaved results.  Felleisen later proposed to adjust the target behavior — the operational semantics — to facilitate well-behavedness, in work considered formative for the later notion of delimited continuations.  The constraint I've tampered with isn't a formal condition, but rather a self-imposed limitation on what sort of answers can be considered:  I introduce a new binding construct whose form doesn't resemble λ, and whose rewriting rules use a different substitution function than the β-rule.

Control variables

Consider the following Scheme expression:

(call/cc (lambda (c) (c 3)))
Presuming this is evaluated in an environment with the expected binding for call/cc, we can easily see it is operationally equivalent to 3.  Moreover, our reasoning to deduce this is evidently local to the expression; so why should our formalism have to rewrite the whole surrounding term (perturbing it in the process) in order to deduce this?

Suppose, instead of Felleisen's strategy of bubbling-up a side-effect to the top level of a term and then distributing it from there, we were to bubble-up (or, at least, migrate up) a side-effect to some sort of variable-binding construct, and then distribute it from there by some sort of substitution function to all free occurrences of the variable within the binding scope.  The only problem, then, would be what happens if the side-effect has to be distributed more widely than the given scope — such as if a first-class continuation gets carried out of the subterm in which it was originally bound — and that can be solved by allowing the binding construct itself to migrate upward in the term, expanding its scope as much as necessary to encompass all instances of the continuation.

I did this originally in vau-calculus, of course, but for comparison with Felleisen's A/C, let's use λv-calculus instead.  Introduce a second domain of control variables, xc, disjoint from xp, and "catch" and "throw" term structures (κx.T) and (τx.T).

T   ::=   xp | (TT) | (λxp.T) | (κxc.T) | (τxc.T)
Partial-evaluation variables are bound by λ, control variables are bound by κ (catch).  Control variables aren't terms; they can only occur free in τ-expressions, where they identify the destination continuation for the throw.  κ and τ are evaluation contexts; that is,
E   ::=   ... | (κx.E) | (τx.E)

The rewrite rules for τ are pretty much the same as for Felleisen's A, except that there is now a compatible rewrite rule for what to do once the throw reaches its matching catch, rather than a whole-term rewrite for eliminating the A once it reaches the top of the term.

(τx.T1)T2   →   τx.T1
V(τx.T)   →   τx.T
κx.(τx.T)   →   κx.T
What about rewrite rules for κ?  One simple rule we need, in order to relate expressions with κ to expression without, is "garbage collection":
κx.T   →   T    if x does not occur free in T
We also want rules for κ to migrate upward —non-destructively— when it occurs in an evaluation context; but κ may be the target of matching τ expressions, and if we move the κ without informing a matching τ, that τ will no longer do what it was meant to.  Consider a κ, poised to move upward, with a matching τ somewhere in its body (embedded in some context C that doesn't capture the control variable).
V(κx.C[τx.T])
If C happens to be an evaluation context, then it is possible for the τ to move upward to meet the κ and disappear; and, supposing x doesn't occur free in T, we'd have (VT).  Even if C isn't an evaluation context, (τx.T) thus represents the potential to form (VT).  If we move the κ over the V, then, in order for the τ to still represent the same potential it did before, we'd have to change it to (τx.(VT)).  And this has to happen for every matching τ.  So let's fashion a substitution function for control variables, T[x ← C] where C doesn't capture any variables:
y[x ← C]   →   y
(T1T2)[x ← C]   →   ((T1[x ← C])(T2[x ← C]))
(λy.T)[x ← C]   →   (λy.(T[x ← C]))    where y isn't free in C
(κy.T)[x ← C]   →   (κy.(T[x ← C]))    where y isn't x or free in C
(τy.T)[x ← C]   →   (τy.(T[x ← C]))    if y isn't x
(τx.T)[x ← C]   →   (τx.C[T[x ← C]])
The "where" conditions are met by α-renaming as needed.  Now we're ready to write our rewrite rules for moving κ upward:
(κx.T1)T2   →   κx.(T1[x ← ⎕T2] T2)    where x isn't free in T2
V(κx.T)   →   κx.(V(T[x ← V⎕]))    where x isn't free in V
κy.(κx.T)   →   κy.(T[x ← y])
As advertised, κ moves upward without perturbing the term structure (contrast with the bubbling-up rules for C).  If we need a first-class continuation, we can simply wrap τ in λ:  (λy.(τx.y)).  The δ-rule for call/cc would be
(call/cc T)   →   κx.(T(λy.(τx.y)))    for unused x
If this occurs in some larger context, and the first-class continuation escapes into that larger context, then the matching κ will have had to move outward before it over some evaluation context E, and substitutions will have transformed the continuation to (λy.(τx.E[y])).

Insights

The orthogonality of continuation control-flow to function application is, in the lower-level math, rather explicitly demonstrated by the way κ moves smoothly upward through the term, in contrast to the perturbations of the bubbling-up rules for C as it forcibly interacts with function-application structure.  The encapsulation of a τ within a λ to form a first-class continuation seals the deal.

The notion that continuations are a whole-term phenomenon —or, indeed, that any side-effect is a whole-term phenomenon— breaks down under the use of floating binding-constructs such as κ, which doesn't require the side-effect to remain encapsulated within a particular subterm, but does allow it to do so and thus allows local reasoning about it to whatever extent its actual behavior remains local.  Whether or not that makes traditional continuations "undelimited" is a question of word usage:  the κ binding-construct is a delimiter, but a movable one.

As a matter of tangential interest here, the vau-calculus handling of sequential state involves two new kinds of variables and four new syntactic constructs (two of which are binders, one for each of the new kinds of variables).  Here's a sketch:  Mutable data is contained in symbol-value assignments, which in turn are attached to environments; the identity of an environment is a variable, and its binding construct defines the region of the term over which that environment may be accessible.  Assignments are a separate, non-binding syntactic construct, which floats upward toward its matching environment-binding.  When a symbol is evaluated in an environment, a pair of syntax elements are created at the point of evaluation:  a query construct to seek an assignment for the given symbol in the given environment, which binds a query-variable, and within it a matching result construct with a free occurrence of the query-variable.  The result construct is an indivisible term.  The query is a compound, which migrates upward through the term looking for an assignment for the symbol in the environment.  When the query encounters a matching assignment, the query is annihilated (rather as a τ meeting its matching κ), while performing a substitution that replaces all matching result-constructs with the assigned value (there may by this time be any number of matching result-constructs, since the result of the original lookup may have been passed about in anticipation of eventually finding out what its value is).

As a final, bemusing note, there's a curious analogy (which I footnoted in my dissertation) between variables in the full side-effectful vau-calculus, and fundamental forces in physics.  The four forces of nature traditionally are gravity, electromagnetism, strong nuclear force, and weak nuclear force; one of these —gravity— is quite different from the others, with a peculiar sort of uniformity that the others lack (gravity is only attractive).  Whilst in vau-calculus we have four kinds of variables (partial-evaluation, control, environment, and query), of which one —partial-evaluation— is quite different from the others, with a peculiar sort of uniformity that the others lack (each of the other kinds of variable has an α-renaming substitution to maintain hygiene, and one or more separate kinds of substitution to aid its purpose in rewriting; but partial-evaluation variables have only one kind of substitution, of which α-renaming is a special case).

[Note: I later explored the physics analogy in a separate post, here.]

1. From an implementation point of view, I think there is a description of your construction that is quite clear. The original presentation of C captures the whole context of the computation (and pass it to its argument T under a functional disguise). You remark that in practice, the point of continuation capture and the point of continuation use will share a (possibly large) prefix of this global context, which needs not be explicitly part of the communicated context. Instead, kx. moves upward as necessary to delimit the boundary of this shared context, and tx. only reinstates the context "upto kx." (and then continue with the above prefix).

(This is reminiscent of calculi for call-by-need, which model thunk sharing by floating let-bindings that need not move all the way to the top, but may remain local if the sharing remained local to a part of the dynamic program.)

It is not clear, however, how to turn that implementation-level intuition into a local way to *think*, as a user, about control-using programs. It seems that even with that presentation, I personally tend to think of tx. as "capturing the whole context". Somehow it seems natural to *not* reason about the details of the value flow (how far the various tx. will move), and defensively assume they move up to the top. Is this abstraction (over-approximating upward-movement) useful for modular reasoning?

It would also be useful/interesting to make precise the changes needed to add delimited control to this framework. Intuitively it seems the delimiting prompt would block kx. moving upward.

1. Interesting that your intuition models tx as global replacement. My own intuition for continuation-calling is visible in my earlier blog post on guarded continuations, where each dynamic extent can have exit/entry guards on it, which are only triggered when control actually passes out of/into that dynamic extent. Guarded continuations seem to be even /more/ localizing than the kx/tx arrangement described here; I haven't tried, yet, to model guarded continuations using a term-rewriting calculus. As for intuitions, I suspect my intuition of continuation-calling was always local like that, and led to the idea of guarded continuations, rather than guarded continuations leading to a more localizing intuition.

2. On consideration, my intuition for continuations may have been localized by studying Scheme's dynamic-wind.

2. Thanks for writing this, it helped deepen my understanding. I'm very much looking forward to your post contrasting delimited continuations.

Unrelated: What's the best way to reach you if I've got a question about the design of Kernel? I'd just email the address on the top of the Kernel Report, but others may benefit from a public discussion.

1. Well, don't use the address on the top of the Kernel Report. That's my old wpi cs department account; it might be six months before I read it, as I /mean/ to read that account regularly but somehow times gets away from me. For private email to me, replace the "cs" in that with "alum". But, for public discussion. Hm. The only public venues I've been involved with lately (in my incarnation as a PL person) are, well, here of course, and LtU.

2. I'll just ask my question here, but color it in the light of this post to justify it.

You wrote "continuations are invoked as if they were functions, and traditionally appear in the form of first-class functions, but their control-flow behavior is orthogonal to function application", but then I thought to myself: In Kernel, operatives are invoked as if they were functions, but their control-flow behavior is orthogonal to function application. Just as moving binders is nicer than inside-out, bubbling lambda forms, I wonder if there is another way to look at operatives that would achieve similar improvements.

I've studied the Kernel report and done my best to follow the formalisms of the vau calculus. I'm totally sold on explicit evaluation and first-class environments, but I'm still not sure about first-class operatives. In particular, I can't think of any good uses for higher-order operatives. Recently, I was working on a small interpreter in the style of Cartwright and Felleisen's EDLS (well, more similar to Kiselyov's version). I realized that in the many cases where explicit evaluation (and related techniques, like expansion passing) work well, I often want a first-class construct that's a little different than operatives. It occurs to me that an operative is something like a partial function defining a method of an open eval procedure, dispatching on the head of a list. Additionally, when tinkering with rewrite systems, such as Mathematica, I've found lots of uses for first-class rules and rule sets, which also feel related somewhere deep with operatives.

Have you considered this perspective? If so, I'd be curious to hear your thoughts.

3. A preface to my reply on this: There's nothing quite so exciting as seeing people use your work in ways that you wouldn't have used it yourself. An idea that only gets used the way its proposer intended is ultimately an unsuccessful idea, sterile. Look at the myriad variations on Lisp; that's a successful idea. I'm a huge fan of finding new ways to look at things.

> You wrote "continuations [...] control-flow behavior is orthogonal to function application", but then I thought to myself: In Kernel, operatives are invoked as if they were functions, but their control-flow behavior is orthogonal to function application.

Fwiw, in saying that, my benchmark for function application was its form in lambda-calculus. I have a specific technical reason for associating operatives with λ. As I observed back when, an elementary vau-calculus has three parts. One part represents source data structures, and does no computation. One part does only computation. And one part is the machinery of evaluation, which links the other two together. The computation-only part is essentially lambda-calculus (a point elaborated later). What I seem never to have really hammered home here is, the λ-abstractions in the computation part of the calculus are operatives. In effect, lambda-calculus is a calculus entirely about operatives.

> In particular, I can't think of any good uses for higher-order operatives.

If you think you're on to an interesting appraoch, pursue it. That said, I'm not concerned with a dearth of higher-order operatives (though \$make-tag-predicate in my dissertation, a basic technique for avoiding implicit evaluation, comes close). I see the Smoothness Conjecture (dissertation section1.1.2) as predicting that one can't foresee specific particular benefits of smoothness.

> Recently, I was working on a small interpreter [...]. I realized that in the many cases where explicit evaluation (and related techniques, like expansion passing) work well, I often want a first-class construct that's a little different than operatives.

Writing interpreters can lead to all sorts of insights. I came up with Kernel-style operatives while writing a Scheme interpreter in C (or was it C++).

> It occurs to me that an operative is something like a partial function defining a method of an open eval procedure, dispatching on the head of a list. Additionally, when tinkering with rewrite systems, such as Mathematica, I've found lots of uses for first-class rules and rule sets, which also feel related somewhere deep with operatives.

I've considered a Lisp using decentralized dispatch, where you have a list of arguments (and maybe some selective keywords), and rather than calling the first element, you allow any of the elements in the list to provide a method for handling it. This tends to lead toward a complex type system, which I mistrust (I've blogged about that, of curse); and I, at least, didn't see how to do it without requiring that all the elements of the list be evaluated before dispatch.

I sense, though, you're glimpsing something else. Mathematica-like things are another world, clearly related, and of interest to me though I've yet to integrate them with the other things I've been looking at. It sounds to me like a potentially interesting path to explore.

3. I wish I understood vau-calulus better; unfortunately I still don't really have a handle on continuations. They're filed mentally under 'programming techniques I really don't want to have to use'.

I'm intrigued though by your mention of Church's pi and sigma forms. Are these functions? If so, from what to what? How do they express logical assertions? Do they return something like a function that expresses a set? Or am I thinking along the wrong lines?

I've been thinking for a while about the idea of doing to Prolog what Joy does to Lisp (and SKI calculus does to lambda calculus): creating a logic programming language without a binding construct at the core of it. It seems to me that Joy and Factor show that this can be done, to some extent - obviously binding names to commonly used functions and sharing these name/value bindings over the Internet as packages of data still is a very useful thing for humans to do - but I'm wondering now how Church approached this. I guess he wasn't thinking at all in terms of programming languages, but trying to resolve paradoxes in first-order logic. So does that mean that some of the other parts of his calculus might have useful applications today?

I'm also wondering to what extent you've explored SKI calculus and modern spinoffs from it (such as Joy and Factor as I mentioned, but also Mencius Moldbug's Urbit, or rather its core calculus Nock). It's probably a really heavy hammer to use, but wouldn't macro programming in a language without any name bindings at all completely remove any problems to do with unwanted name capture?

Also something about the idea of environments as a first-class value makes me think is really important and should be explored further. In object-oriented programing environments pretty much have become _the_ only type of value, and this seems in some ways a natural way to think about Internet-scale deeply nested data structures: if everything in our entire universe of data is really just a name/value binding, then our fundamental calculus ought to be really good at operating on name/value bindings. Yet the lambda calculus seems to treat name/value bindings very coyly, hiding them at a different layer from the 'real' computation of functions, and only allowing them to be aggregated and shadowed, not (for instance) allowing them to be taken apart or defining set operations like intersection and union on them. This seems like a fairly fundamental part of our computatinonal thinking to be missing, to me. (Object oriented programming seems also to be missing large chunks, but from the other end; it doesn't even have a fundamental calculus of any kind, just a lot of incompatible vague principles which every implementation understands differently).

1. About the Sigma and Pi operators.  They weren't functions.  That rewriting system wasn't originally meant as a computational device;  it was a set of rules for writing propositions and reasoning about them.  I mentioned this some in my post last year on "Bypassing no-go theorems".  At about the turn of the century they'd discovered there was some major problem with the foundations of mathematics, that appeared in the form of Russell's Paradox and other antinomies. The proof of Russell's Paradox (let A be the set of sets that do not contain themselves; then provably A both does and does not contain itself) uses two logical axioms: the Law of the Excluded Middle (if a proposition isn't false, then it's true), and reductio ad absurdum (if a proposition implies a contradiction, then the proposition is false).  Most people figured the Law of the Excluded Middle was wrong after all, and tried to do without it, scrambling to rescue cherished theorems whose proofs used it.  Church had the idea to instead weaken reductio ad absurdem.  While he was building a whole propositional calculus from the ground up anyway, he also took the opportunity to try out lambda and Sigma and Pi.

It's worth keeping in mind that his logic did have paradoxes in it, which were found within a few years (by this time they'd gotten pretty good at finding paradoxes).  In fact, maybe a decade or so later, Haskell Curry came up with a much simpler paradox that uses mostly just the computational part of the calculus, with only a little bit of logic mixed in, not requiring Sigma, or Pi, or almost any of the high-powered logical gadgetry Church had had in his logic.  I explain Curry's Paradox in detail in the earlier post.

I've been working with systems that involve variables, rather than the ones (like SKI) that don't.  My main focus recently, after all, has been fexprs — and the idea of variables, whose bindings (meanings) are determined by the environment they're evaluated in, is pretty strongly entangled with the idea of fexprs, which give the programmer control over when evaluation takes place (and, therefore, when and how variables get their meanings).  I haven't really thought about how, and whether, the fexpr concept might usefully apply to a system without variables.  Maybe there's a way, maybe there isn't, and either result would be interesting (but one suspects the first would be much more interesting :-).

First-class environments are, of course, a separate area of investigation from SKI and such, because environments are the very essence of variables.  Environments aren't the usual technique used in Scheme-like languages to do OOP, partly because most such languages don't have first-class environments, but also because environments are less encapsulated than combiners.  You pass your message to the combiner, and it dispatches based on the message received, using its own local environment for its persistent object-state. Using a first-class environment directly as an object (rather than encapsulated within a dispatch combiner) would produce a less encapsulated object, in which any part of the object could be freely rewritten by anyone — even rewriting its methods.  It's a matter of what sort of system you're trying for.  Myself, I'm leery of radically unencapsulated systems like that; I already work with fexprs, after all, and you need to be careful when playing with fire like that.  Reminds me of a quote from Edward Tenner: "If you have a radical idea [...] for God's sake don't be a radical in how you carry it out. [...] Become a right-wing conservative in carrying out a left-wing idea."

2. Right, I guess I'm just taking the opposite side of the problem from the mathematical history, and wondering if there's complexity here that is more a matter of legacy representations and not something inherent to the subject itself.

It seems to me that 'symbols' and 'functions' (with 'variable' being a particular type of symbol as opposed to 'constant'), emerged historically as very different mathematical concepts: one (functions) being the object under study, and the other (symbols/variables) being a syntactic invention of the mathematician. So one was an object, and the other a meta-object, and it made sense to keep them separate. And 'environments' being a feature of the interpretation of symbols (which only existed in the mathematician's head, not in the mathematical object itself), they were not generally studied in as much rigor as functions, and when they were, they were studied with different tools.

However: from the post-Lisp perspective of computer science, which is quite comfortable with recursion at all levels, the classic paradoxes of logic don't seem to really get in the way of computation. And it looks very much as if 'symbols' as such don't exist, but only 'environments' do, and 'environments' are just functions from symbol names to symbol values. .

So: if I were building the simplest possible calculus about the binding of symbol names, I would naively start trying to build it out of:
* numbers
* string atoms (to be used as names)
* pairs (to be used to create blocking/quoting constructs)
* SKI combinators, to create anonymous, unambiguous functions from which I could construct an 'environment' function
* some extra combinators for operating on pairs - CONS, CAR, CDR or the equivalent
* an initial environment (function) which defines name-to-blocks-of-SKI-combinators

This would give us a system rather like Joy, where not only environments are just a particular type of function, but functions themselves are not quite the opaque black box that they are in lambda calculus; we can put functions together and we can also pull them apart (which as far as I understand, lambda doesn't allow).

A system like this would be very much like Kernel or any other language with fexprs, in the sense that 'code' would be just a block containing text names which (when passed to an environment function to evaluate it) would get resolved into actual SKI combinators. But I'm not sure that it would require such a large amount of mathematical infrastructure in the sense of defining different types of variables.

There wouldn't be a need for a difference between 'applicatives' and 'operatives', for instance. There would only be functions, some of which might operate on blocks containing text names, which might relate to entries in an environment, or might not. While I'm not sure what laws would apply, it seems like we should be able to infer more than nothing about the behaviour of these functions.

Does that seem like a sensible way to approach the variable-binding problem? It seems to me that several working, practical languages now have been built that demonstrate at least parts of this (Joy and Factor for instance; possibly Nock/Urbit, if it stabilises). Joy doesn't quite do what I'm thinking about, as its 'define' construct uses a different syntax from function evaluation, but its 'quotation' mechanism is what inspires me to think that reducing everything in a computational calculus to functions, names and block constructs is feasible.

Apologies if this is missing something really elementary; it just seems like an interesting relatively unexplored path to perhaps a simpler mathematical model.

3. I'm not sure what you have in mind. When you talk of taking functions apart, I wonder if you're envisioning something that lacks encapsulation, which I think is necessary for a realistic programming language (see my blog post on abstractive power). That would come much closer to the pathology Wand described, of a language with trivial equational theory. I note, in that regard, that you're talking about quotation, which is something one can do in Kernel but would be ill-advised to do, as in practice it strongly encourages hygiene violations.

I've wrestled myself with the question of what variables are, in their deep nature. Iirc I mentioned this once on LtU, sometime in the past few months. If I ever get my thoughts straight on it I might write a blog post, but I've been thinking about it for decades and haven't yet had a really satisfying insight... so something could come to me at any moment, but the half-life of the problem is rather long.

4. "When you talk of taking functions apart, I wonder if you're envisioning something that lacks encapsulation"

Pretty much, yes; I'm using the word "quotation" as Manfred von Thun defines it in his "The Mathematical Foundations of Joy", to mean what is essentially a list or sequence structure which can represent either programs or data and, unlike a lambda, can be decomposed and introspected by other programs at runtime.

http://web.archive.org/web/20111007025556/http://www.latrobe.edu.au/phimvt/joy/j02maf.html

This way of approaching abstraction certainly lacks encapsulation, but I don't understand why it's not a viable programming paradigm. Since 2002, Joy has spawned an entire ecosystem of "concatenative language" research, of which Factor seems to have built the most 'practical' language. Admittedly I've not tried programming in it, but they've built something approaching the complexity of Common Lisp including incremental runtime compilers, webservers and GUI frameworks. The focus of the Factor community especially seems to be on both functional and syntactic abstraction using quotations.

http://concatenative.org/wiki/view/Concatenative%20language

http://factorcode.org/

Factor itself borrows equally from Forth and Common Lisp for its syntax and isn't entirely postfix or pure-functional; another language which seems to be approaching this, however, is Om:

http://om-language.org/

Om is just a sketch at this stage (and not mine), but I'm intrigued by it for two reasons: one, treating all atoms as function names to be evaluated immediately and all lists as data seems to create a very minimal syntax that lends itself to implementing domain specific languages with maximal syntactic abstraction. Second, the prefix 'pipeline' style suggests to me that one could pipe input/output straight into the program, allowing pure functional code to do I/O in a way that's simple and natural but very hard to do in mainstream functional languages like Haskell.

Then there's Urbit, which I think is concatenative at its lowest level (Nock), but is evolving in very strange directions that aren't really to my taste.

http://urbit.org/

I'd like to understand more about the idea of trivial equational theory and what it implies about a language. It may well be the case that Joy and similar languages have a trivial theory, yet it seems that they are still capable of doing real work. Is this a case of 'in theory it can't work, but in practice it does'? Or does the lack of a well-behaved equational theory mean that there unavoidable security or efficiency pitfalls looming ahead for these languages?

4. As a worked example of my really dumb/naive approach (which may well be actually dumb), here's how I would implement fexprs in a concatenative language with a prefix syntax like Om. First, I would reserve a function symbol - let's say '\$' - to return 'the current environment' as a data structure. I would define the 'define' function in such a way that \$ was a reserved word and could not be redefined. (The Om define function is basically just an 'eval' - the first argument is a list representing definitions to be added to the current environment, the second is an expression to be evaluated in the new environment. I think Define is defined in such a way that its definition changes to include the new environment whenever a new environment is defined). Then:

define [my-applicative [+ 1]] [
my-applicative 42 // [this returns 43]
]

define [my-operative [modify-environment]] [
my-operative \$ // [this returns a new modified environment as a data structure]
]

So, an operative in this scheme would simply be an ordinary function that happens to receive as its first (or any) parameter a copy of the runtime environment as generated by the function '\$'. There would be no special calling convention for operatives; you would always have to explicitly pass an environment to them. This would always be indicated syntactically as the '\$' being present as a parameter (much like how Kernel operatives have a \$, but this would be enforced more rigidly).

The only thing somewhat unusual about '\$' is that it can't ever be redefined by the programmer, only by the 'define' function itself. Also, perhaps, it breaks introspection by allowing the program at any time to evaluate its own environment as data - but since a program can only ever do this by accessing this special function name, we should be able to statically analyse any function or program fragment at runtime to tell whether or not it does this, and therefore limit the effects of such an operation.

Is there something fundamentally unworkable or unpredictable about such a scheme? It does take a 'quote' or 'literal list' mechanism as fundamental; but it seems to me that such a thing _is_ pretty fundamental, at least as fundamental as any kind of data literal capability in a language (and I don't know how I would write anything without literals of some kind).

1. Commenting to myself after a few days reflection: yes, there are pitfalls with this naive language; I've been assuming that a 'define' syntax block would act like a closure, but Joy doesn't use closures, so environment accesses would 'leak between definition regimes. Messy.

I'm guessing Kernel's applicatives keep things manageable by strongly restricting what operations can be performed on environments.

2. Yeah, stability of Kernel environments is a sort of conspiracy between different features. You can capture the current environment. Given an environment you can modify it. But given an environment you can't (necessarily) modify its ancestors: any changes you make to it will be locally bound, and there's no general device for acquiring the environment's ancestor(s). So it's possible for an environment to be visible yet provably stable, if it's provably not modified by any of the combiners that have access to it. It's therefore important to know that particular combiners don't access their dynamic environment — which is why the use of #ignore isn't just cosmetic, it's crucial because its use in defining an operative can guarantee the defined combiner will be unable to access its dynamic environment. And, although it's possible to construct an applicative that accesses its dynamic environment, you wouldn't do it by accident, because the easiest way to construct applicatives is \$lambda, which constructs applicatives that #ignore their dynamic environment.

5. How do you visualize dynamic extents? I can think of continuations being co-values; but that is merely static, whereas dynamic extents are about as static as the maelstrom--or so they seem from my benighted vantage. When you visualize them...what exactly are you visualizing?

1. I picture two hierarchies; a static hierarchy of environments, and a dynamic hierarchy of continuations. (This is touched on in my blog post on guarded continuations.)

In the static hierarchy, when a compound function is called its local environment is a newly created child of its static environment; there are other ways to construct an environment in Kernel, but this primary local-child-of-static construction grows trees of environments downward. The ancestors of an environment are fixed from its creation; the static notion of scope is essentially the region of code "under" an environment.

In the dynamic hierarchy, when a function is called its return continuation is a newly created child of the return continuation for the computation that made the new call; there are other ways to construct a continuation in Kernel, but this primary means of construction grows trees of continuations downward. The ancestors of a continuation are fixed from its creation; the dynamic notion of scope is essentially the computation "under" a continuation.

Btw: When a continuation is "called", control may (in general) pass out of the dynamic extents of some continuations (out from under them, in the hierarchy), and into the dynamic extents of other continuations (under them in the hierarchy). Guarded continuations regulate control flow passing into and out of these dynamic extents. There are situations where, seemingly, one might want to place similar controls on access to environments, as a matter of hygiene, and I've wondered whether one could somehow arrange guards on the static extents of environments, by symmetry with the guards on the dynamic extents of environments.

2. Hierarchies: tree-shaped data structures with root at top? Assuming...

Dynamic hierarchies...but this primary means of construction grows trees of continuations downward...the dynamic notion of scope is essentially the computation "under" a continuation.

Guarded continuations regulate control flow passing into and out of these dynamic extents...

Whew. You know that glossaries are the Medusas of amateurs like me?

3. Sorry. It'd be easier to explain if I could use a whiteboard. :-S

6. Hi: have you made the article above (Quaternions: A Case Study in the Selection of Tools for Mathematical Physics) available as a pdf? Let me know... thanks!