Structural insight: May 2015

Sunday, May 17, 2015

Numberless differentiation

The method described in I.3 resulted in an analogy between the "discrete" space of index values Z=(1,2,...) and the continuous state space Ω of the mechanical system[...]. That this cannot be achieved without some violence to the formalism and to mathematics is not surprising.
— John von Neumann, Mathematical Foundations of Quantum Mechanics, trans. Robert T. Beyer, 1955, §I.4.

In this post, I mean to explore a notion that arose somewhat peripherally in an earlier post. I noted that the abstractive power of a programming language is, in a sense, the second derivative of its semantics. I've also noted, in another post, a curious analogy between term rewriting calculi and cosmological physics, which I used to inspire some off-the-wall thinking about physics. So I've been wondering if the abstractive power notion might be developed into a formal tool for altogether non-numeric discrete systems akin to the continuous differential calculus that plays such a central role in modern physics.

Developing a mathematical tool of this sort is something of a guessing game. One hopes to find one's way to some natural structure in the platonic realm of the mathematics, using whatever clues one can discern, wherever one finds them (be they in the platonic realm or elsewhere).

I've a story from my mother, from when she was in graduate school, about a student asking for motivation for something in an algebraic topology class. The professor replied, when you build a beautiful cathedral you tear down the scaffolding. To me, honestly, this always seemed to be missing the point. You should keep a record of where the scaffolding was. And the architectural plans. And, while you're at it, the draft architectural plans.

Candidly, when I started this I did not know what I was expecting to find, and the path of reasoning has taken several unexpected turns along the way. One thing I haven't found is a purely non-numeric discrete device analogous to continuous differential calculus; but I've got a plate full of deep insights that deserve their chance to be shared, and if the non-numeric discrete device is there to discover, it seems I must be closer to it than I started.

Contents
Discrete continuous physics
Differentiation in physics
Abstractiveness in programming
Causation in term rewriting
Draft analogy
Breaking the draft analogy
Retrenching

Discrete continuous physics

Physics has been struggling with the interplay of continuous and discrete elements for centuries (arguably since ancient times). Continuous math is a clean powerful tool and therefore popular; on the face of it, one wouldn't expect to get nearly as much useful physics done with purely discrete structures. On the other hand, our ordinary experience includes discrete quantities as well as continuous ones (recalling Kronecker's attributed remark, God made the integers; all else is the work of man). Nineteenth-century classical physics reconciled continuous and discrete aspects by populating space with continuous fields and point particles.

The field/particle approach collapsed with quantum mechanics, which is commonly described as saying that various things appearing continuous at a macroscopic level turn out to be discrete at a very small scale. I'd put it almost the other way around. Our notion of discrete things comes from our ordinary, macroscopic experience, but evidently these macroscopic discrete things can be broken up into smaller things; and if we keep pushing this ordinary notion down to a sufficiently small scale, eventually it can't be pushed any further, and we end up having to postulate an underlying (non-observable) continuous wave function which, having been so-to-speak squashed as nearly flat as it can be, smears even supposedly discrete elementary particles into a continuous complex-valued field. Which gives rise to discrete events by means of wave-function collapse, the ultimate kludge. Wave-function collapse is a different way of reconciling continuous and discrete aspects of the theory, but tbh it feels a bit contrived.

In this view, howsoever we wandered into it, we've assumed a continuous foundation and sought to derive discreteness from the continuous substratum. Traditional quantum theory derives discreteness by means of wave function collapse. Some modern theories use the concept of a standing wave to derive discreteness; string theory does this, with closed loops vibrating at some overtone of their length, and the transactional interpretation does too, with waves propagating forward and backward in time between spacetime events and summing to a standing wave.

An alternative sometimes considered is that the foundation could be discrete — so-called digital physics. I've some doubts about this, intuitively, because — how to put this — I don't think Nature likes to be neatly classified as "discrete" or "continuous". I remember, in a long-ago brainstorming session with my father, posing the semi-rhetorical question, "what is neither discrete nor continuous?", to which he mildly suggested, "reality". A purely discrete foundation seems as unlikely to me as a purely continuous one; in fact, it's possible the overly continuous character of the underlying wave function in quantum theory may be one subtle reason I'm not fond of that theory.

I have a particular, if round-about, reason for interest in the prospects for introducing some form of discrete, nonlocal, nonquantum element into physics. I'd actually been thinking in this direction for some time before I realized why I was tending that way, and was rather excited to figure out why because it gives me a place to start looking for what kind of role this hypothetical discrete element might play in the physics. My reasoning comes from, broadly, Mach's principle — the idea that the way the local universe works is a function of the shape of the rest of the universe. (Mach's principle, btw, was given its name by Albert Einstein, who drew inspiration from it for general relativity.)

Suppose we want to study a very small part of the universe; say, for example, a single elementary particle. This requires us either to disregard the rest of the universe entirely, or to assume the influence of the rest of the universe on our single particle is neatly summarized by some simple factor in our system, such as a potential field. But quantum mechanics starts us thinking, somewhat at least, in terms of everything affecting everything else. Suppose this is actually more so than quantum mechanics tells us. Suppose our single particle is interacting nonlocally with the whole rest of the universe. We don't know anything about these interactions with the rest of the universe, so their influence on our one particle is —for us— essentially random. If it were all local and neatly summed up, we might try to ask for it to be described by a potential field of some sort; but since we're supposing this is nonlocal interaction, and we don't have any sort of nonlocal structure in our theory by which to describe it, we really can't do anything with it other than expect it to smear our probability distribution for our one particle.

This nonlocal interaction I'm talking about is presumably not, in general, the sort of nonlocal interaction that arises in quantum mechanics. That sort of interaction, at least when it occurs within the system one is studying, is quantum entanglement, and it's really quite fragile: at large scales it becomes very likely to decohere, its nonlocality breaking down under interactions with the rest of the universe. I'm hypothesizing, instead, some more robust sort of nonlocality, that can work on a large scale. From my reasoning above, it seems that as soon as we hypothesize some such robust nonlocal interaction, we immediately expect the behavior of our single-particle system to appear nondeterministic, simply because we no longer can have an undisturbed system, unless we consider the entire universe as a whole. In contrast to the frequent presentation of quantum nondeterminism as "strange", under this hypothesis nondeterminism at small scales is unsurprising — though not uninteresting, because we may get clues to the character of our robust nonlocal interaction by looking at how it disturbs our otherwise undisturbed small system.

Following these clues is made more challenging because we have to imagine not one new element of the theory, but two: on one hand, some sort of robust nonlocal interaction, and on the other hand, some local behavior that our small system would have had if the robust nonlocal interaction had not been present. We expect the familiar rules of quantum mechanics to be the sum of these two elements. I have already implied that the local element is not itself quantum mechanics, as I suggested something of quantum nondeterminism might be ascribed to the robust nonlocal element. One might suppose the local element is classical physics, and our robust nonlocal interaction is what's needed to sum with classical physics to produce quantum mechanics. Or perhaps, if we subtract the robust nonlocal interaction from quantum mechanics, we get something else again.

This also raises interesting possibilities at the systemic level. The sum of these two hypothetical elements doesn't have to be quantum mechanics exactly, it only has to be close enough for quantum mechanics to be useful when looking only at a small system; so, quantum mechanics could be a useful approximation in a special case, rather as Newtonian mechanics is. We may expect the robust nonlocal element to be significant at cosmological scales, and we may have misread some cosmological phenomena because we were expecting effects on that scale to be local. But just at the moment my point is that, since spacetime is apparently our primal source of continuity, and our robust nonlocal interaction is specifically bypassing that continuity, I would expect the new interaction to be at least partly discrete rather than continuous; intuitively, if it were altogether continuous one might expect it to be more naturally part of the geometry rather than remaining "nonlocal" as such.

Differentiation in physics

We want a schematic sense of the continuous device we're looking for an analogy to.

Suppose 𝓜 is a manifold (essentially, a possibly-curved n-dimensional space), and F is a field on 𝓜. The value of the field at a point p on 𝓜, Fp, might be a scalar (an ordinary number, probably real or complex), or perhaps a vector (directed magnitude); at any rate Fp would likely be some sort of tensor — but I'm trying to assume as little as possible. So Fp is some sort of field value at p.

The derivative of F at p is... something... that describes how F is changing on 𝓜 at p. If you know the derivative F ' of F at every point in 𝓜, you know the whole "shape" of F, and from this shape F ' you can almost reconstruct F. The process of doing so is integration. The process doesn't quite reconstruct F because you only know how F changes across 𝓜, not what it changes relative to. This absolute reference for all of F doesn't change, across the whole manifold 𝓜, so it isn't part of the information captured by F '. So the integral of F ' is, at least conceptually, F plus a constant.

By assuming just a bit more about the values of field F, we can see how to do more with derivative F ' than merely integrate it to almost-reconstruct F. The usual supposition is that Fp is either a number (a scalar, continuous), or a number together with a configuration relative to the manifold 𝓜. The idea here is that the configuration is a simple relationship to the manifold, and the number gives it "depth" (more-or-less literally) by saying "how big" is the configured entity. The simplest such configured value is a vector: a directed magnitude, where the direction on the manifold is the configuration, and the magnitude in that direction is the number. One can also have a bivector, where the configuration, rather than a linear direction, is a two-dimensional plane orientation. There's a more subtle distinction between vectors and covectors (in general, multivectors and comultivectors), which has to do with integration versus differentiation. These distinctions of "type" (scalar, vector, covector, bivector, cobivector) are relatively crude; most of the configuration information is a continuous orientation, some variant of "direction". And ultimately, we use the configurations to guide relations between numbers. All that information about "shape" is channeled into numerical equations.

The continuity of the manifold comes into the situation in two ways. One, already obliquely mentioned, is that the configuration of the field is only partly discrete, the rest of it (in all but the scalar-field case) being a sort of pseudo-direction on the manifold whose value is continuous, and interacts in some continuous fashion with the numerical depth/intensity of the field. The other insinuation of manifold continuity into the continuous derivative is more pervasive: the derivative describes the shape of the field at a point p by taking the shape in an ε-neighborhood of p in the limit as ε approaches zero. The very idea of this limit requires that the neighborhood size ε (as well as the field intensity and probably the pseudo-direction) be continuous.

Abstractiveness in programming

The analogy needs a similar schematic of the discrete device. Abstractiveness, in my treatment, is a relation between programming languages, in which some programming language Q is formally as abstractive as programming language P.

The basic structure on which the treatment rests is an abstract state machine, whose states are "languages" (ostensibly, programming languages), and whose transitions are labeled by texts. The texts are members of a set T of terms generated over a context-free grammar; each text is treated as an atomic whole, so when we consider a sequence of texts, the individual texts within the sequence remain distinct from each other, as if T were the "alphabet" in which the sequence is written (rather than all the texts in the sequence being run together as a string over a more primitive alphabet). Some subset of the texts are designated as observables, which are used to judge equivalences between programs. In a typical interpretation, non-observable texts are units of program code, such as module declarations, while observable texts are descriptors of program output.

Everything that matters about a state of this abstract machine is captured by the set of text sequences possible from that state (a touch of Mach's principle, there); so formally we define each language to be the set of text sequences, without introducing separate objects to represent the states. A language P is thus a set of text sequences, P ⊆ T^*, that's closed under prefixing (that is, if xy ∈ P then x ∈ P). We write P/x for the language reached from language P by text sequence x ∈ P; that is, P/x = { y | xy ∈ P }.

These structures are used to define formally when one language "does" at least as much as another language. The criterion is existence of a function φ : P → Q, also written φP = Q, that maps each text sequence x ∈ P to a text sequence φx ∈ φP such that φ preserves prefixes (φx is always a prefix of φxy) and φ also preserves some property of P understood as defining what it "does". For language expressiveness, this additional preserved property is concerned with the observables, resulting in a formal statement that Q is at least as expressive as P. For language abstractiveness, the additional preserved property is concerned with both observables and expressiveness relations between arbitrary P/x and P/y, resulting in a formal statement that Q is at least as abstractive as P.

The function φ is understood as "rewriting" P texts as Q texts; and the less intrusive this rewriting is, the more easily Q can subsume the capabilities of P. For expressiveness, the class of macro (aka polynomial) transformations is traditionally of interest; operations that can be eliminated by this sort of transformation are traditionally called syntactic sugar.

Causation in term rewriting

Anyone with a bit of background in computer science has probably worked both with grammars and with term-rewriting calculi (though likely not in depth at the same time). Have you thought about the contrast between them? I hadn't, really, until I ended up doing a master's thesis formulating a novel grammar model (pdf), followed by a doctoral dissertation formulating a novel term-rewriting calculus (here). Since I suspect grammars tend to be thought of with less, well, gravitas than calculi, I'll explain a bit of how the grammars in my master's thesis work.

The formalism is called RAGs — Recursive Adaptive Grammars. Where an attribute grammar would have a terminal alphabet for syntax, a nonterminal alphabet for metasyntax (that is, for specifying classes of syntax), and various domains of attribute values for semantics, a RAG has a single unified domain of answers serving all three roles (except that for metasyntax I'll use the deeper name cosemantics). A rule form looks like

⟨v₀,p₀⟩ → t₁ ⟨p₁,v₁⟩ ... t_n ⟨p_n,v_n⟩ t_n+1

where each t_k is a syntax string, and each ⟨•,•⟩ is a cosemantics/semantics pair suitable for labeling a parent node in a parse tree. Conceptually, cosemantics is inherited, specifying how a tree can grow downward (the cosemantics on the root node is the "start symbol"); while semantics is synthesized, specifying the resultant meaning of the parsed string. The v_k are variables, provided from other parts of the tree: cosemantics of the parent node, provided from above; and semantics of the children, provided from below. The p_k are polynomials, determined by the current rule from these variables, hence, from the cosemantics of the parent and semantics of the children. (If the grammar is well-behaved, each child's cosemantics p_k only uses variables v₀,...v_k−1, so there are no circular dependencies in the parse tree.)

Playing the role of the rule set of an attribute grammar, a RAG has a rule function ρ which maps each answer a into a set of rule forms ρ(a); in selecting answers for the variables in a rule form r (producing a rule instance of r), the inherited cosemantics v₀ has to be some answer a such that r ∈ ρ(a).

If the polynomials were only allowed to construct answers, our grammars could recognize somewhat more than context-free languages; but we also want to compute semantics using Turing-powerful computation. For this, I permitted in the polynomials a non-constructive binary operator — the query operator, written as an infix colon, •:•. Conceptually, parsing starts with cosemantics, parses syntax, and in doing so synthesizes semantics; but in the usual grammatical derivation relation, cosemantics and semantics are both on the left side of the derivation, while syntax is on the right, thus:

⟨cosemantics, semantics⟩ ⇒⁺ syntax

The query operator rearranges these elements, because it has the meaning "find the semantics that results from this cosemantics on this syntax":

cosemantics : syntax ⇒⁺ semantics

So we can embed queries in our rule forms, hence the "recursive" in "Recursive Adaptive Grammars". However, I meant RAGs to be based on an elementary derivation step; so I needed a set of derivation step axioms that would induce the above equivalence. My solution was to introduce one more operator — not permitted in rule forms at all, but used during intermediate steps in derivation. The operator: inverse, denoted by an overbar.

Imho, the inverse operator is elegant, bizarre, and cool. What it does is reverse the direction of derivation. That is, for any derivation step c₁ ⇒ c₂, we have another step c₂ ⇒ c₁. The inverse of the inverse of any term c is just the same term c back again; and every answer a is its own inverse. So, given any three answers a₁, a₂, and a₃, if ⟨a₁,a₂⟩ ⇒⁺ a₃ then a₃ ⇒⁺ ⟨a₁,a₂⟩ .

The whole derivation step relation is defined by four axioms. We've already discussed the first two of them, and the third is just the usual rewriting property of compatibility:

If c₁ ⇒ c₂ then c₂ ⇒ c₁ .
If ⟨a₁,a₂⟩ → a₃ is a rule instance of r ∈ ρ(a₁), then ⟨a₁,a₂⟩ ⇒ a₃ .
If c₁ ⇒ c₂ and C is a context in which the missing subterm isn't inside an inverse operator, then C[c₁] ⇒ C[c₂] .

The fourth axiom is the spark that makes queries come alive.

a₁ : ⟨a₁,a₂⟩ ⇒ a₂ .

So now, from ⟨a₁,a₂⟩ ⇒⁺ a₃ we can deduce by the first axiom a₃ ⇒⁺ ⟨a₁,a₂⟩ , by the third axiom a₁ : a₃ ⇒⁺ a₁ : ⟨a₁,a₂⟩ , and stringing this together with the fourth axiom, a₁ : a₃ ⇒⁺ a₂ .

The converse, that a₁ : a₃ ⇒⁺ a₂ implies ⟨a₁,a₂⟩ ⇒⁺ a₃ , can also be proven without too much fuss... if one has already proven the basic result that an answer cannot be derived from another answer. That is, if answer a is derived from c in one or more steps, c ⇒⁺ a , then c is not an answer. This would be trivial if not for the inverse operator, which allows an answer a to be the left-hand side of infinitely many derivations (the inverses of all derivations that end with a); the theorem says that once you've derived a, further derivation from there can't reach another answer. It took me two rather messy pages to prove this in my master's thesis (simple proofs, if they exist, can take decades to find), and I realized at the time it was a sort of analog to the Church-Rosser theorem for a calculus — an elementary well-behavedness result that one really wants to prove first because without it one has trouble proving other things.

I didn't think more of it until, years later when explaining RAGs to my dissertation committee (though in the end I didn't use RAGs in my dissertation), I got to see someone else go through a WTF moment followed by an aha! moment over the virulent un-Church-Rosser-ness of RAGs. Somehow it hadn't registered on me just how very like a term-rewriting calculus my RAG formalism might look, to someone used to working with λ-like calculi. Mathematical systems are routinely classified according to their formal properties, but there's something deeper going on here. RAGs, despite being in a meaningful way Turing-powerful, don't just lack the Church-Rosser property, they don't want it. A λ-like calculus that isn't Church-Rosser is badly behaved, likely pathological; but a grammar that is Church-Rosser is degenerate. It's a matter of purpose, which is not covered by formal mathematical properties.

My point (approached obliquely, but I'm not sure one would appreciate it properly if one didn't follow a route that offers a good view of the thing on approach) is that the purpose of a calculus is mainly to arrive at the result of an operation, whereas the purpose of a grammar is to relate a starting point to the usually-unbounded range of destinations reachable from it. This requires us to think carefully on what we're doing with a discrete analog to the conventional continuous notion of differentiation, because when we take a conventional derivative, the thing we're differentiating is a single-valued function, akin in purpose to a calculus meant to produce a result, whereas when we describe expressiveness/abstractiveness as derivatives of semantics, the thing we're differentiating appears, on the face of it, to be very like a grammar, describing a usually-unbounded network of roads departing from a common starting point.

Draft analogy

As foundation for a notion of discrete differentiation, we're looking for a useful correspondence between elements of these two kinds of structures, continuous and discrete. So far, it might seem we've identified only differences, rather than similarities, between the two. Continuous differentiation is necessarily in the "calculus" camp, abstractiveness/expressiveness necessarily in the "grammar" camp. Continuous differentiation at a point p is fundamentally local, depending in its deepest nature on a family of ε-neighborhoods of p as ε varies continuously toward zero. Abstractiveness/expressiveness of a language P is fundamentally global, depending in its deepest nature on a family of languages reachable from P by means of (in general, unbounded) text sequences. The local structure of the continuous case doesn't even exist in the discrete case, where language P has a set of closest neighbors; and the global structure on which the discrete case depends is deliberately ignored by the continuous derivative.

A starting point for an analogy is hidden in plain sight (though perhaps I've given it away already, as I chose my presentation to set the stage for it). Recall again Mach's principle, that the way the local universe works is a function of the shape of the rest of the universe. Wikipedia offers this anecdote for it:

You are standing in a field looking at the stars. Your arms are resting freely at your side, and you see that the distant stars are not moving. Now start spinning. The stars are whirling around you and your arms are pulled away from your body. Why should your arms be pulled away when the stars are whirling? Why should they be dangling freely when the stars don't move?

(One is reminded of the esoteric principle "as above, so below".)

In our continuous and discrete cases, we have two ways to examine the properties of a global structure at a particular location (p or P), one of which draws on an unbounded family of ε-neighborhoods approaching arbitrarily close, and the other on an (in general) unbounded family of discrete structures receding arbitrarily far. Either way, we derive our understanding of conditions at the particular location by aggregating from an unbounded, coherent structure. One might call the two strategies "unbounded locality" and "unbounded globality". If we accept that the way the local universe works and the global structure of the universe are two faces of the same thing, then these strategies ought to be two ways of getting at the same thing.

This strategic similarity is reassuring, but lacks tactical detail. In the continuous case, field F gave us a value at point p, the derivative gave us another field with again a value at p, and we could keep applying the differentiation operation (if F was sufficiently well-behaved to start with) producing a series of higher-derivative fields, each providing again a value at p. Yet in the discrete case we don't appear to have a local "value" at P, the purely local "state" being so devoid of information that we actually chose to drop the states from the formal treatment altogether; and while we may intuit that expressiveness is the derivative of semantics, and abstractiveness the derivative of expressiveness, the actual formalisms we constructed for these three things were altogether different from each other. The semantics apparently had only the text sequences, set of observables, and a (typically, infinite) discrete structure called a "language"; expressiveness added to this picture functions between languages; and abstractiveness augmented each language with a family of functions between the different nodes within each language, then defining functions between these "languages with expressive structure". On the face of it, this doesn't appear to be a series of structures of the same kind (as fields F, F ', F '' are of the same kind, granting we do expect them to vary their configurations in some regular, generalizable way).

In my treatment of abstractive power, though, I noted that if, in augmenting languages P and Q with "expressive structure", the family of functions we use is the degenerate family of identity functions, then the relation "〈Q,Id〉 is as abstractive as 〈P,Id〉" is equivalent to "Q is as expressive as P". This, together with Mach's principle, gives us a tactical analogy. The underlying "state machine", whose states are languages and whose transitions are labeled by program texts, corresponds to manifold 𝓜. A particular state, taken together with the rest of the machine reachable from that state, corresponds to point p of the manifold. An "expressive structure" overlain on the machine, creating a web of functions between its states, corresponds to a field F (or F ', etc.). Differentiation is then an operation mapping a given "field" (overlying web of functions) to another. Thus, starting with our discrete machine 𝓜, overlaying the family of identity functions gives us the semantics of 𝓜, differentiating the semantics gives 𝓜's expressive structure, differentiating again gives its abstractive structure.

This analogy provides a way for us to consider the various parts of our discrete scheme in terms of the major elements of the continuous scheme (manifold, field, point). But on closer inspection, there are a couple of flaws in the analogy that need to be thought out carefully — and these flaws will lead us to a second philosophical difference comparable in depth to the difference between grammars and term-rewriting calculi.

Breaking the draft analogy

One mismatch in the analogy concerns the difference in substance between 𝓜 and F. In the continuous case, manifold 𝓜 consists entirely of the oriented distances between its points (we know the distances themselves are numbers because they are the ε needed for differentiation). If you start with a field F of uniform value across 𝓜, and take its integral, you can reconstruct 𝓜. The key point here is that 𝓜 and F are made of the same "stuff". In our discrete scheme, though, 𝓜 is a state machine with term-labeled transitions, while F is a family of functions between term sequences. The discrete 𝓜 and F are concretely different structures, made of different "stuff" rather than commensurate in the manner of the continuous 𝓜 and F. It might not be immediately obvious what the consequences would be of this analogy mismatch; but as happens, it feeds into the second mismatch, which is more directly concerned with the practical use of derivatives in physics.

The second mismatch concerns how the manifold changes over time — or, equivalently, its curvature. Our main precedent here is general relativity.

In general relativity, mass warps the space near its position, and the resulting curvature of space guides the movement of the mass over time. It's not apparent that abstractiveness/expressiveness has anything analogous to time in this sense. It should be apparent, though —once one takes a moment to consider— that general relativity doesn't have absolute time either. Observers moving relative to each other will disagree in their perceptions of time, and of space, and of which observed events are and are not simultaneous; so rather than absolute time and space, we have four-dimensional spacetime that looks different to different observers. Thinking of spacetime as a whole, rather than trying to factor it into absolute space and time, a particle with mass traces a one-dimensional curve across the four-dimensional manifold of spacetime. When we said, a moment ago, that the mass's position warps space while space guides the mass's movement, this mutual influence is described by mathematical equations involving the manifold 𝓜 and fields F, F ', etc. These equations are all about numbers (which don't occur in our discrete scheme) and rely on the fact that the manifold and the fields are both made out of this same numeric stuff (which they aren't in our discrete scheme). Evidently, the key element here which we have failed to carry across the analogy is that of equations encompassing both 𝓜 and F.

But, when we look more closely to try to find a discrete element to support an analogy with the continuous equations, we find that the omission is after all something more primal than numbers, or even commonality of substance between 𝓜 and F. The omission is one of purpose. In the continuous case, the purpose of those equations is to enable us to solve for 𝓜 and F. But the discrete device has no interest in solving for 𝓜 at all. We don't use our theories of expressiveness and abstractiveness to define a language; we only use them to study languages that we've devised by other means.

Retrenching

At this juncture in the narrative, imho we've done rather well at reducing the question of the analogy to its essence. But we still don't have an answer, and we've exhausted our supply of insights in getting this far. We need more ammunition. What else do we know about these continuous and discrete schemes, to break us out of this latest impasse?

For one thing, the asymmetric presence of solving in the continuous scheme is a consequence of a difference in the practical purposes for which the two mathematical schemes were devised. In physics, we try to set up a mathematical system that matches reality, and then use the math to translate observations into constraints and, thus, predictions. Solving is the point of the mathematical exercise, telling us what we should expect to deduce from observations if the theory is correct. Whereas in expressiveness/abstractiveness, we consider various possible programming languages —all of which are of our own devising— and study the different consequences of them. Rather than trying to figure out the properties of the world we're living in, we study the properties of different invented worlds in hopes of deciding which we'd like to live in.

The difference goes deeper, though. Even if we did want to solve for state machine 𝓜 in the discrete scheme —in effect, solve for the programming language— we wouldn't find expressiveness/abstractiveness, nor even semantics, at all adequate to the task. Recall from the above account of the continuous scheme, when observing that 𝓜 and F are made of the same "stuff", I suggested that from F one might reconstruct 𝓜. Strange to tell, when I wrote that, I wasn't even thinking yet in terms of this gap in the analogy over solving for 𝓜 (though, obviously, that's where my reasoning was about to take me). There is nothing in manifold 𝓜 that's in any fundamental way more complicated than fields F etc. In stark contrast to the discrete case. Expressiveness/abstractiveness are constructed from families of functions over 𝓜; to the extent semantic/expressive/abstractive comparisons are anything more than directed graphs on the states of 𝓜, their complexity is an echo of a deliberately very limited part of the structural complexity latent in 𝓜 itself. While the internal structure of a programming language —state machine 𝓜 itself— is immensely complicated; tbh, it seems way more complicated than a mere space-time manifold, which is a continuum whereas a programming language is discontinuity incarnate.

The point about discontinuity recalls a key feature of the other analogy I drew, in my earlier post, between cosmological physics and term rewriting. Under that analogy, fundamental forces in physics corresponded to substitution functions in term-rewriting; while physical geometry, with its close relation to the gravitational force, corresponded to α-renaming, with its close relation to λ-substitution. In our current analogy, we've been treating continuity, with its role in ε-neighborhoods, as the defining feature of the manifold, and finding nothing like it in the syntactic complexity of a programming language. What if for purposes of solving, what we need is not so much continuity as geometry?

For this, we need to find something α-renaming-like in the discrete scheme, which is challenging because expressiveness/abstractiveness is not based on term-rewriting. There's more to it, though. Whatever-it-is, α-renaming-like, we mean to be the analog to the manifold. In other words, it's the discrete 𝓜. And since the whole programming language, or equivalently the state machine, evidently isn't α-renaming-like, 𝓜 isn't the programming language after all, but something else. Some facet of the programming language. And not just any facet — one that guides our discrete differentiation analogously to how the geometry of the manifold guides continuous differentiation.

Sunday, May 10, 2015

Computation and truth

a very great deal more truth can become known than can be proven.
— Richard Feynman, "The Development of the Space-Time View of Quantum Electrodynamics", Nobel Lecture, 1965.

In this post I'm going to take a sharper look at the relationship between computation and truth. This relates to my previously blogged interests both in difficulties with the conventional notion of type (here) and in possible approaches to sidestepping Gödel's Theorem (here).

I was motivated to pursue these ideas further now by two recent developments: first, a discussion on LtU that — most gratifyingly — pushed me to re-review the more globally oriented aspects of Gödel's Theorem (here); and second, following on that, an abrupt insight into what the Curry–Howard correspondence implies about the practical dynamics of static typing.

My biggest achievement here is on something I've wanted to do for many years: to make clear exactly why Gödel's Theorem is true. By proving it. As an elementary result. If it's really as profound as it's made out to be, it should have a really simple proof; yet I've never seen the proof treated as an elementary result. At an elementary level one usually sees hand-waving about the proof, in a manner I find unsatisfying because the details omitted seem rather essential to understanding why the result is true. I don't expect to specify here every tiny detail of a proof, but any detail I omit should be obviously minor; it shouldn't feel as if anything problematic is being hidden.

I've comparatively far less to say about Curry–Howard. It's taken me years to come up with this insight into the connection between Curry–Howard and static typing, though; so if the difficulty of coming up with it isn't just me being thick, the fact that it doesn't take long to state is a good sign.

Contents
That which is most computable is most true
Church–Turing
Enumeration
Diagonalization
Uncomputability
Gödel
Attitude
Curry–Howard
What we can't know
Gödel's Second Theorem
Turing's dissertation
Where do we go from here?

That which is most computable is most true

We want to get at truth using mechanical reasoning. Why? Because mechanical reasoning is objective. We can see what all the rules are, and decide whether we agree with them, and we can check how the particular reasoning was done and know whether or not it follows the rules, and if it all checks out we can all agree on the conclusion. Everything is out in the open, unlike things like divine revelation where an unadorned "I don't believe it" is a credible counter-argument.

In mechanical reasoning, propositions are built up using symbols from a finite alphabet, and a finite set of rules of deduction allow propositions to be proven, either from first principles (axioms) or from a finite number of previously proven propositions (theorems). In essence, this is computation (aka mechanical computation); I'm not talking about a sophisticated correspondence here, just common-sense mechanics: manipulating propositions using finite rules of deduction is manipulating strings over an alphabet using a finite set of rules, which is computation. What you can and can't do this way is governed by what you can and can't do by computation.

Some results that came out gradually over the first several decades of the twentieth century appear to say there is no one unique most-powerful notion of truth accessible by mechanical reasoning. However, it turns out there is a unique most-powerful notion of computation. By the time this second point was fully appreciated, mathematicians had already long since been induced to reconcile themselves, one way or another, to the first point. However, looking at it all from nearly a century downstream, it seems advisable to me to start my systematic study with the historically later result, the unique most-powerful notion of computation.

Church–Turing

In the 1930s, three different formal models of general computation were proven to be equi-powerful: general recursive functions (1931, Gödel and Herbrand), λ-calculus (1936, Church), and Turing machines (1936, Turing). That is, for any computation you can do with one of these models, there's an equivalent computation you can do with either of the other models. Of these three models, Turing machines are perhaps the least mathematically elegant, rather nuts-and-bolts sort of devices, but it's also most intuitively obvious that you can do something by mechanical computation iff (if and only if) you can do it with a Turing machine. When the other models were proven equi-powerful with Turing machines, it didn't add much to the credibility of Turing machines; rather, it added credibility to the other models.

The "thesis", also called a "conjecture", is that any model of mechanical computation, if it isn't underpowered, is also equi-powerful with Turing machines and all these other "Turing-powerful" models of computation. It's a "thesis", "conjecture", etc., because it's an inherently informal statement and therefore isn't subject to formal proof. But it's demonstrated its unassailability for about three quarters of a century, now. Mathematicians study what it takes to build a model more powerful than Turing machines (in fact, Turing himself touched on this in his dissertation); you have to bring in something that isn't altogether mechanical. The maximum amount of computational power you can get mechanically is a robust quantity, somehow built into the fabric of Platonic mathematics much as the speed of light is built into the fabric of physics, and this quantity of power is the same no matter which way you get there (what computational model you use). And, that amount of computational power is realizable; it isn't something you can only approach, but something you can achieve (mathematically speaking).

Enumeration

When exploring the limits of Turing-powerful computation, the basic technique is to frame computations in terms of enumerations.

Enumeration is where you just generate a list, perhaps a list that never ends; typically you don't want the whole list, you just watch it as it grows to see whether the information you actually want ever gets listed. As long as you're only asking whether a computation can be done — not how efficiently it can be done — all computations can be rearranged this way to use enumeration.

Suppose you've got any Turing machine. It computes outputs from inputs. Maybe it doesn't even always complete its computation: maybe sometimes it diverges, computing forever instead of finally producing an answer, perhaps by going into an infinite loop, or perhaps finding some more exotic way to non-terminate. But, given any such machine, you can always define a second machine that enumerates all the input/output pairs where the first machine given that input would halt with that output. How would you do that? It's straightforward, though tedious. Call the first machine T₁. We can enumerate all the possible inputs, since they're all built up from a finite alphabet; just list them in order of increasing length, and within inputs of a given length, list them alphabetically. For each of these inputs, using the mechanical rules of T₁, you can enumerate the states of T₁: the initial state, where it has the input and is about to start computing, the second state that results from what it does next from that first state, and so on. Call the m^th state of T₁ on the n^th input S(n,m). Imagine an expanding table, which we slowly fill out, where n is the row number and m is the column number. Row n is the entire computation by T₁ for the n^th input. We can't fill out the whole table by completing each row before moving on to the next, because some rows might extend to the right forever, as T₁ doesn't halt on that input. But we can mechanically compute any particular cell in the table, S(n,m). So we just have to enumerate all the pairs n,m such that we'll eventually get to each of them — for example, we can do all the entries where the sum of n and m is two (that's just the leftmost topmost entry, S(1,1)), then all the ones where the sum is three (that's S(1,2) and S(2,1)), four (that's S(1,3), S(2,2), and S(3,1)), and so on — and whenever we find a cell where T₁ halts, we output the corresponding T₁ input/output pair. We now have a T₂ that enumerates all and only the input/output pairs for which T₁ halts.

Diagonalization

Way back in the late nineteenth century, though, Georg Cantor noticed that if you can enumerate a series of enumerations — even a series of infinite enumerations — you can find an enumeration that wasn't in the series. He did this with real numbers, to show that not all real numbers are rational. This is going to sound very similar to what we just did for Turing machines.

Consider the numbers less than one and not less than zero. We can enumerate the rational numbers in this interval: each such number is a non-negative integer divided by a strictly larger integer, so just list them in order of increasing denominator, and for a given denominator, by increasing numerator (0/1, 0/2, 1/2, 0/3 1/3, 2/3, et cetera ad infinitum). For each of these ratios, we can enumerate the digits in the decimal representation of that ratio, starting with the tenths digit. (If the denominator divides a power of ten, after some point the decimal representation will be all zeros.) Call the m^th digit of the n^th ratio S(n,m). Imagine a table where n is the row number and m is the column number. Row n is the entire decimal representation of the n^th ratio. We can mechanically compute what should go in any particular entry of this table. And now comes the trick. We can construct the decimal representation of a real number, in the interval, that isn't equal to any of the ones we've enumerated. To do this, we read off the entries on the diagonal of our table from the upper left toward the lower right (S(1,1), S(2,2), etc.), and add one to each digit (modulo 10, so if we read a 9 we change that to 0): our n^th digit is (S(n,n) + 1) mod 10. This is a perfectly good way to specify a real number — it's an infinite sum of the form Σa_n10⁻ⁿ — and we know it isn't rational because every rational number in the interval has some decimal digit on which it differs from our real number.

This general technique is called diagonalization: you have an enumerable set of inputs (the labels on the rows), for each input you have an enumerated sequence (the row with that label), and you then produce an enumerated sequence that differs from every row by reading off the diagonal from upper left downward to the right. Since diagonalization is used to show something isn't enumerated, and enumeration is at the heart of computation, naturally diagonalization is useful for showing things can't be computed.

Uncomputability

Because a Turing machine, like any other algorithmic device such as a general recursive function or λ-calculus term, is fundamentally finite, it's a straightforward (if tedious) exercise to describe it as an input to another machine that can then, straightforwardly, interpret the description to simulate the behavior of the described machine. This is a universal Turing machine, which takes as input a machine description and an input to the described machine, and outputs the output of the described machine on that input — or doesn't halt, if the described machine wouldn't halt on that input.

However, although simulating a described machine is straightforward, it is not possible to determine in general, by computation, whether or not the described machine will halt on the given input. That is, we cannot possibly construct a Turing machine that always halts that determines whether any described machine halts on a given input. We can show this by diagonalization.

We can enumerate all possible machine descriptions, readily enough, since they're just alphabetic strings that obey some simple (and therefore checkable) syntactic rules. We can also, of course, enumerate all possible inputs to the described machines. Imagine a table where the entry S(n,m) at row n and column m is "yes" if the n^th machine halts on the m^th input, or "no" if it doesn't halt. Suppose we can construct a machine that computes the entries S(n,m) of this table. Then by going down the diagonal of the table we can also construct a machine whose behavior differs from every row of the table: Let machine A on the m^th input compute S(m,m), and if it's "no", halt and say "no", while if it's "yes", go into an infinite loop and thus never halt. If machine A is the n^th machine in the table, then A halts on the n^th input if and only if S(n,n) is "no". Assuming that our computation of S works right, there can't be any such n, and A was never enumerated.

Gödel

Gödel's Theorem (aka Gödel's First Theorem) says that any sufficiently powerful formal system is either incomplete or inconsistent — in essence, either it can't prove everything that's true, or it can prove things that aren't true.

To pin this down, we first need to work out what "sufficiently powerful" means. Gödel wanted a system powerful enough to reason about arithmetic: we can boil this down to, for an arithmetic function f and integer i, does f(i)=j or doesn't it? The functions of interest are, of course, general recursive functions, which are equi-powerful with Turing machines and with λ-calculus; so we can equivalently say we want to reason about whether a given Turing machine with input i will or will not produce output j. But a formal system is itself essentially a Turing machine; so in effect we're talking about a Turing machine L (the formal system; L for logic) that determines whether or not a Turing machine (the function f) on given input produces given output. The system would be consistent and complete if it confirms every true statement about whether or not f on given input produces given output, and doesn't confirm any false such statements.

Enumerate the machines f and make them the rows of a table. Enumerate the input/output pairs and make them the columns. In the entry for row n and column m, put a "yes" if L confirms that the n^th machine has the m^th input/output pair, a "no" if L confirms that it doesn't. Suppose L is consistent and complete.

It can't be both true and false that the n^th machine has the m^th input/output pair; so if L only confirms true propositions, there can't be both a "yes" and a "no" in any one table entry. What about blank table entries? For centuries it was generally agreed that a proposition must be either true or false; but this idea had fallen into some disrepute during the three decades leading up to Gödel's results. This is just as well, because, based on our supposition that L is consistent and complete, we can easily show that the table must have some blank entries. Suppose the table has no blank entries. Then for any machine f₁, and any input i, we can determine whether f₁ halts on i, thus: construct another machine f₂ that runs f₁ on i and then halts with output confirm. Because there are no blank entries in the table, we know L can determine whether or not f₂(i)=confirm, and this also determines whether or not f₁ halts on i. But we already know from the previous section that we cannot correctly determine by computation whether or not an arbitrary machine halts on an arbitrary input; therefore, there must be some blank entries in the table.

Is it possible for a proposition of this kind — that a given machine on a given input produces a given output — to be neither true nor false? If you think this isn't possible, then we have already proven to your satisfaction that L cannot be both consistent and complete. However, since we're collectively no longer so sure that propositions have to be either true or false, let's see if we can find a difficulty with the consistent complete system without insisting that every table entry must be filled in. Instead, we'll look for a particular entry that we know should be filled in, but isn't.

We're going to diagonalize. First, let's restrict our yes/no table by looking only at columns where the output is confirm (and, being really careful, suppress any duplicate column labels, so each column label occurs only once). So now our table has rows for machines f, columns for inputs i, and each entry (n,m) contains a "yes" if L confirms that the n^th machine on the m^th input produces output confirm, while the entry contains a "no" if L confirms it does not produce output confirm. The entry is blank if L doesn't confirm either proposition. Construct a machine A as follows. For given input, go through the column labels till you find the one that matches it (we were careful there'd be only one); call that column number m. Use L to confirm a "no" in table entry m,m, and once you've confirmed that, output confirm. If L never confirms that "no", then A never halts, and never outputs confirm. Since A is a Turing machine, it is the label on some row n of the table. What is the content of table entry n,n? Remember, the content of the table entry is what L actually confirms about the behavior of A on the n^th input. By construction, if the entry contains "no", then A outputs confirm, and the "no" is incorrect. If the entry contains "yes", and the "yes" is correct, then A outputs confirm, and by construction it must have done so because the entry contains an incorrect "no" that caused A to behave this way. Therefore, if L doesn't confirm anything that's false, this table entry must be blank. But if we know the table entry is blank, then we know that, by failing to put a "no" there, L has failed to confirm something true, and is therefore incomplete.

If we are sure the formal system proves everything that's true, then we cannot possibly be sure it doesn't prove anything that's false; if we are sure it doesn't prove anything that's false, we cannot possibly be sure it proves everything that's true. Heisenberg's uncertainty principle comes to mind.

Attitude

Gödel's results are commonly phrased in terms of what a formal system can prove about itself, and treated in terms of the rules of deduction in the formal system. There are both historical and practical reasons for this.

In the first half of the nineteenth century, the foundations of mathematics underwent a Kuhnian paradigm shift, settling on building things up formally from a set of axioms. In the 1890s people started to notice cracks in the axiomatic foundations, in the form of antinomies — pairs of contradictory statements that were both provable from the axioms. Mathematicians generally reacted by looking for some axiom they'd chosen that doesn't hold in general — as geometry had done in the early nineteenth century to explore non-Euclidean geometries that lack the parallel postulate. As a source of antinomies, attention fell primarily on the Law of the Excluded Middle, which says a proposition is either true or false; as an off-beat alternative, Alonzo Church considered weakening reductio ad absurdum, which says that if assuming a proposition leads to a contradiction, then the proposition is false. Thus, emphasis on choice of rules of deduction.

David Hilbert proposed to use a subset of a formal system to prove the consistency of the larger system; this would have the advantage that one might be more confident of the subset, so that using the subset to prove consistency would increase confidence in the larger system. Gödel's result was understood to mean that the consistency of the whole formal system (for a powerful system) can only be proved in an even more powerful system. Thus, emphasis on what a formal system can prove about itself.

Explorations of how to cope with the Theorem have continued to focus on the system's rules of deduction; my own earlier post tended this way. Alan Turing's dissertation at Princeton also followed this route. The emphasis on rules of deduction naturally suggests itself when looking for a way around Gödel's Theorem, because if you want to achieve a mechanical means for deriving truth, as a practical matter you can't achieve that without working out the specific mechanical means.

However, in this post I've been phrasing and treating Gödel's Theorem differently.

I phrased myself in terms of what we can know about the system — regardless of how we come to know — rather than what the system can prove about itself. (I'm not distinguishing, btw, between "what we can know" and "what can be true"; either would do, in principle, but we're no longer sure what "truth" is, and while it's awkward to talk about multiple notions of truth, it's easier to talk about multiple observers. When convenient I'll conjure a hypothetical omniscient being, to dispense with quibbles about "true but unknowable".)

My treatment of the Theorem conspicuously omits any internals of the formal system, supposing only that its conclusions are computable (and below I'll dispense with even that supposition). By depicting Gödel's Theorem without any reference to the rules of deduction, this approach seems to throw a wet blanket on attempts to cope with Gödel's Theorem by means of one's choice of rules of deduction — and frankly, I approve of discouraging attempts by that route. I'm not looking for a clever loophole in Gödel's result — invoking, say, uncountable infinities or second-order logic as a sort of Get Out of Jail Free card. In my experience, when somebody thinks they've found a loophole in something as fundamental as Gödel's Theorem, it's very likely they've outsmarted themselves and ended up with a bogus result. What I want is an obvious way of completely bypassing the Theorem; something poetically akin to cutting the Gordian Knot.

That is, I'm looking for a way around Gödel's Theorem with a high profundity index. This is an informal device I use to characterize the sort of solutions I favor. Imagine you could use numerical values to describe how difficult conceptual tasks are: each such value is a positive number, and the more difficult the task, the higher the number. Now, for a given idea, take the difficulty of coming up with the idea the first time, and divide by the difficulty of understanding the idea once it's been explained to you. That ratio is the profundity index of the idea. So an idea is profound if it was really difficult to come up with, but is really obvious once explained. If an idea that's incredibly hard to come up with in the first place turns out to be even harder to figure out how to explain clearly, the denominator you want is the difficulty of understanding it after somebody has figured out how to explain it clearly, and the numerator should include the difficulty of coming up with the explanation.

The metaphor of getting around something implies a desire to get to the other side; and it may be illuminating to ask why one wants to do so. We have here two notions, one practical and one philosophical. The notion of truth is as philosophical as you can get; it's the whole purpose of philosophy. The notion of mechanical computation is — despite quibbles about infinite resources and such — quintessentially practical, to do with getting results by an explicitly objective and reproducible procedure. Mathematicians in the second half of the nineteenth century sought to access truth through computation. The protracted collapse of that agenda across the first three decades of the twentieth century, culminating in Gödel's Theorem, has left us without a clear understanding of the proper role of computation in investigating truth; and with yet another in philosophy's long tradition of ways to not be sure what is true. So I suppose, in trying to get around Gödel's Theorem, my hopes are

to find a robust maximum of truth, as Turing power is a robust maximum of computational power.
to find a robust maximum way of obtaining truth through computation.

Gödel's Theorem tells us that Turing power itself, despite its robustness, does not provide a straightforward maximum of either truth or proof.

Curry–Howard

Though there are (of course) situations in which the Curry–Howard correspondence is exactly what one needs, in general I see it as badly overrated.

The basic correspondence is between rules of deduction in formal proof systems, and rules of static typing in a programming language (classically, the typed λ-calculus). The canonical example is that modus ponens corresponds to typed function application: modus ponens says that if proposition A is provable and proposition A⇒B is provable, then proposition B is provable; typed function application says that if a is an expression of type A and f is an expression of type A→B, then fa is an expression of type B. Moving outward from this insight, when you construct a correctly typed program you are also constructing a proof; thus proofs correspond to correctly typed programs. A theorem corresponds to a type, so that asking whether a theorem has a proof is asking whether the corresponding type has a correctly typed expression of that type — that is, provability of the theorem corresponds to realizability of the type. And so on.

The folk-wisdom version of the correspondence is that logic and computation are the same thing. The folk-corollary is that all reasoning should be done with types. This is the basis of modern type theory, and there are folks trying to recast both programming language design, logic, and mathematics in the image of types. Curry–Howard has taken on a (one hopes, metaphorically) theological significance.

It strikes me, though, that the basic correspondence does not involve computation at all. If, in the realm of programming, a type system ever becomes Turing-powerful, that's a major mark against it because we want fast automatic type-checking and, even if we're willing to wait a little longer, we certainly want our type-checks to be guaranteed to halt. In any event, types are not the primary vehicle of computation, rather they're a means of reasoning about the form of programs — thus, not even reasoning directly about our computations, but rather about the way we specify our computations.

It's easy to get tangled up trying to make sense of this proof–program connection. For example, when we say that we want our automatic type-checking algorithm to always halt, that limits the computational power involved in checking an individual step of a proof, but puts no limit on the computational power of proof in general because the length of allowable proofs is unbounded, just as the size of program expressions is unbounded. There is no evident notion of what it means for a proof to "halt", and this corresponds, through Curry–Howard, to saying there is no such thing as a "largest" expression in λ-calculus that cannot be made a subexpression of a larger expression; it has nothing whatever to do with halting of λ-calculus computations. The reason one gets tangled up like this is that although proofs and programs are technically connected through Curry–Howard, they have different and often incompatible purposes.

The purpose of a proof, I submit, is to elucidate a chain of reasoning. The more lucid, the better. Ideally one wants what Paul Erdős used to call "one from The Book" — the idea being that God has a jealously guarded book of the most beautiful proofs of all theorems (Paul Erdős was an atheist; he said "You don't have to believe in God, but you should believe in The Book"). But the first duty of a program is to elucidate an algorithm. Seriously. This shouldn't be a controversial statement, and it's scary to realize that for some people it is. I'll say it again. The first duty of a program is to elucidate an algorithm. You should be able to tell at a glance what the program is doing. That is your first line of defense against getting the program wrong. Proofs of program correctness, with all the benefits and problems thereof, are a later line of defense, possibly useful but no substitute for being able to tell what the program does. (Yes, this is yet another of my dozen-or-so topics I'm trying to draft a blog post about, under the working title "How do we know it's right?".) And this is where the two sides of the Curry–Howard correspondence part company. If you relentlessly drive your program syntax toward more lucid expression of algorithms, you obfuscate the inference rules of your type system, which is to say, the deductive steps of the corresponding proof. If you drive to simplify the type system, you're no longer headed for maximally lucid algorithms. In practice, it seems, you try to get simultaneously the best of both worlds and end up sliding toward the worst of both. (My past blog post on types is yonder.)

What we can't know

Enough type-bashing. Back on the Gödel front we were hoping for robust maxima of truth and of computable truth. What does Gödel's Theorem actually tell us about these goals?

First of all we should recognize that Gödel's Theorem is not, primarily, a limit on what can be known by mechanical means. The uncomputability of the Halting problem (proven above) is a limit on what can be known by mechanical means. Gödel's Theorem is a limit on what can be known at all. That is, it bears more on truth than it does on truth through computation. I touched on this point above. To further clarify, we can use a notion that played a brief role in Alan Turing's doctoral dissertation at Princeton (supervised by Alonzo Church), called an oracle.

Suppose we attach a peripheral device, O, to a Turing machine. The Turing machine, chugging along mechanically, can at any time ask a question of O, and O will give an answer (based on the question) in one step of the machine. O is a black box; we don't say how it works, and there's no need for it to be mechanical inside. Maybe there's a djinn, or a deity, or some such, inside O that's producing the answer. We're only assuming that it will always immediately answer any question asked of it. That's what we mean by an oracle. We'll call a Turing machine equipped with this device an O-machine.

Let's revisit the Halting problem, and see what we can —and can't— do to change the situation by introducing an oracle. Our basic result, remember, is that there is no purely mechanical means to determine whether or not an arbitrary Turing machine will halt on an arbitrary input. Okay. What if we had a non-mechanical means? Precisely, let us suppose we have an oracle O that will always tell us immediately whether a given ordinary Turing machine will halt on a given input. This supposition doesn't appear to have, in itself, any unfortunate consequences. Under the supposition that we have such an oracle O, we can easily build an O-machine that determines whether a given ordinary Turing machine will halt on a given input. What we can't do, given this oracle O, is build an O-machine that determines whether a given O-machine halts on a given input. This is one of those results that's easier to show if we make it stronger. Without even knowing what an oracle O does, we can prove that there is no O-machine that always halts that determines whether any described O-machine will halt on a given input. This is because we can describe all possible O-machines without ever having to describe the internals of O. O is the same for all of them, after all; we only have to describe the parts of the O-machines that differ from each other, and that we can do finitely no matter what O is. So we can still enumerate our O-machines, and diagonalize just as we did in the earlier section to prove an ordinary Turing machine couldn't solve the ordinary Halting problem. No matter what O is, even if it's got an omniscient being hidden inside it, an O-machine can't solve the O-Halting problem.

Likewise, our diagonalization to prove Gödel's Theorem works just as well for O-machines as for ordinary Turing machines. For any oracle O, let L be an O-machine that confirms some propositions, and fails to confirm others, about whether or not given O-machines on given inputs produce given outputs. Reasoning just as before, if we know that L confirms all true claims, then we cannot know that L doesn't confirm any false claims; if we know it doesn't confirm any false claims, then we cannot know it confirms all true claims. So even if we're allowed to consult an oracle, we still can't achieve an understanding of truth, L, that confirms all and only true claims (about O-machines, lest we forget).

We are now placing limits on what we can know, or equivalently, on what can be true. To get such results we must have started with some constraints, and we've conspicuously placed no constraints at all on O, not even computability. It's worth asking what our constraints were, that led to the result.

For one thing, we have required truth to agree with the reasoning in the diagonalization argument of our proof of the Theorem. Interestingly, this makes clear that the oracle itself is not our notion of truth, for we didn't require the oracle to respect the reasoning in our diagonalization; rather, with no constraints on how the oracle derives its answers from the questions asked of it, we proved a limitation on what those answers could mean. The O-machine L, together with our reasoning about it, became our I Ching, the means by which we mapped the oracle's behavior into the realm of truth.

The reasoning in the diagonalization isn't all we introduced; we also introduced... what? The requirement that our djinn/deity/whatever-it-is feed its answers to a mechanical device. The requirement that it base its answers only on what question the mechanical device asked of it. Ultimately, the requirement that the truth be about discrete inputs and outputs and be expressed as discrete propositions.

The discreteness of the mechanical device makes it possible to enumerate, and therefore diagonalize. The discreteness of the question/answer interactions with the oracle is apparently a corollary to that. The requirement that the oracle base its answer only on the question asked... is thought-provoking, since modern physics leans toward the concept of entanglement. Heisenberg's uncertainty principle already came up once above. One might wonder if truth ought to be reimagined in some way that makes it inherently part of the system in which it exists, rather than something separate; but here, the discrete separations — between machine and oracle, between subject and truth — seem a rather natural complement to the discreteness of the machine.

The common theme of discreteness is reminiscent of the importance attached to "chunking" of information in my earlier post on sapience and language. Moreover, the relationship between discreteness and continuity seems to be cropping up in several of my current major avenues of investigation — Gödel, linguistics, sapience, fundamental physics — and I find myself suspecting we lack some crucial insight into this relationship — hopefully, an insight with a very high profundity index, because the time required to acquire the insight, for the numerator of the index, appears to be measured in millennia, so in order for us to grok it once found we should hope for the denominator to be a lot smaller. However, I've no immediate thoughts in this direction. In the current case, it's more than a little mind-bending to try to construct a notion of truth that's inherently not separable into discrete propositions; indeed, one might wonder if the consequences of the attempt could be rather Lovecraftian. So for the nonce I'll keep looking for profundity around Gödel's results without abandoning the concept of a proposition.

Gödel's Second Theorem

Gödel's Second Theorem says that if a sufficiently powerful formal system L can prove itself consistent, then L is not consistent.

Where Gödel's First Theorem was, as I interpret it, mainly about the relationship between L and truth, his Second Theorem is much more concerned with the specific behavior of L. I covered the "sufficiently powerful" clause in the First Theorem mostly by completeness, i.e., by requiring that L confirm everything we know is true. The Second Theorem isn't about completeness, though, so we need something else. Just as we have already required truth to agree with our diagonalization argument, we'll now separately require L to agree with the diagonalization argument too. And, we don't want to require any controversial behaviors from L, like the Law of the Excluded Middle or reductio ad absurdum, since they would compromise the generality of our result. Here's a set of relatively tame specific behaviors. Write [T:i⇒o] and [T:i⇏o] for the canonical representations, recognized by L, of the propositions that machine T on input i does or does not produce output o.

[a] If T(i)=o, then L confirms [T:i⇒o].
There is a canonical way, recognizable by L, to set up a machine (S∧T) that combines two other machines S and T, by running each of them on its input, requiring them both to produce the same output, and producing that output itself. We require L to do some reasoning about these machines:
- [b] L confirms [(T∧T):i⇏o] iff it confirms [T:i⇏o].
- [c] If L confirms [(S∧T):i⇏o], then it confirms [(T∧S):i⇏o].
There is a canonical way, recognizable by L, to set up a machine (S∘T) that combines two other machines S and T, by running T on its input, then running S on the output from T, and producing the output from S. We require L to do some reasoning about these machines:
- [d] If T(i)=v, then L confirms [(S∘T):i⇏o] iff it confirms [S:v⇏o].
- [e] If S₁(i)=S₂(i)=v, then L confirms [((R∘S₁)∧T):i⇏o] iff it confirms [((R∘S₂)∧T):i⇏o].
- [f] L confirms [(((Q∘R)∘S)∧T):i⇏o] iff it confirms [((Q∘(R∘S))∧T):i⇏o].
- [g] L confirms [((S₁∘T)∧(S₂∘T)):i⇏o] iff it confirms [((S₁∧S₂)∘T):i⇏o].
There is a canonical way, recognizable by L, to set up a machine (T⇒o) that combines a machine T with an output o, by constructing a proposition that T on the given input produces output o. That is, (T⇒o)(i)=[T:i⇒o]. We require L to do some reasoning about these machines:
- [h] L confirms [(S∧T):i⇏confirm] iff it confirms [((L∘(S⇒confirm))∧T):i⇏confirm].

For us, several of these deductions would likely be special cases of more general rules, and maybe they are for L; we only require that L reach these conclusions by some means.

In order for L to prove its own consistency, we need to define consistency as an internal, checkable feature of L's behavior — rather than by the external feature that L says nothing false according to the external notion of truth. Leading candidates are

L does not confirm any antinomy; that is, there is no proposition such that L confirms both it and its negation.
There is some proposition that L does not confirm.

The second of these was widely favored in the early twentieth century: inconsistency means confirming all propositions whatever. Given reductio ad absurdum and the Law of the Excluded Middle, the first implies the second, thus: Starting with a known proof of an antinomy, consider any proposition q. Supposing not-q, we can derive an antinomy (because we could have derived it even if we hadn't supposed not-q), therefore, by reductio ad absurdum, the supposition must be false; that is, not-q is false. But then, by the Law of the Excluded Middle, since q definitely isn't false, it must be true. So every proposition can be proven.

I'll use the first of these two internal notions of consistency, because it is the weaker of the two, so that any theorem we derive from it will be a stronger result. Construct machines T_yes and T_no that, given a machine and an input/output pair, output propositions asserting the machine does and does not have that input/output: T_yes(T,i,o)=[T:i⇒o], T_no(T,i,o)=[T:i⇏o]. As self-proof of consistency, require that L confirm, for every possible input i, proposition [((L∘T_yes)∧(L∘T_no)):i⇏confirm].

Start out just as in the diagonalization for the First Theorem. Imagine a table where the column labels are all possible inputs, the row labels are all possible machines, and each entry contains a "yes" if L confirms that machine on that input outputs confirm, a "no" if L confirms that machine on that input doesn't output confirm. Construct a machine A as follows. First, construct A₀ that, on the m^th input, outputs the machine/input/output needed to construct a proposition that the m^th machine on the m^th input does or does not produce output confirm. (Remember, A₀ can do this by counting the column labels till it finds the input, then counting the row labels to find the corresponding machine.) Then just let A=(L∘(T_no∘A₀)). Let n be the row number of the row labeled by A. We're interested in the behavior of A on the n^th input; but this time, instead of just considering what is true about this behavior, we're interested in what L confirms about it.

Call the n^th input i. We've got all our dominoes lined up; now watch them fall.

By construction of A₀, A₀(i)=(A,i,confirm).

By construction of T_yes, T_yes(A,i,confirm)=[A:i⇒confirm].

By construction of T_no, T_no(A,i,confirm)=[A:i⇏confirm].

By consistency self-proof, L confirms [((L∘T_yes)∧(L∘T_no)):(A,i,confirm)⇏confirm].

By [d], L confirms [(((L∘T_yes)∧(L∘T_no))∘A₀):i⇏confirm].

By [g], L confirms [(((L∘T_yes)∘A₀)∧((L∘T_no)∘A₀)):i⇏confirm].

By _[f] and [c], L confirms [((L∘(T_yes∘A₀))∧(L∘(T_no∘A₀))):i⇏confirm].

By definition of A, L confirms [((L∘(T_yes∘A₀))∧A):i⇏confirm].

By [e], L confirms [((L∘(A⇒confirm))∧A):i⇏confirm].

By [h], L confirms [(A∧A):i⇏confirm].

By [b], L confirms [A:i⇏confirm].

By [a] and definition of A, L confirms [A:i⇒confirm].

For a bonus, by [a], L also proves itself inconsistent by confirming [((L∘T_yes)∧(L∘T_no)):i⇒confirm].

I don't think this result has much to do with the specific assumptions [b]–[h] about the behavior of L; going through the proof leaves me impressed by how relatively innocuous those assumptions were. Which, to my mind, is an insight well worth the exercise of going through the proof.

Turing's dissertation

I've mentioned Turing's American doctoral dissertation several times. (American because he already had sufficient academic credentials in Europe.)

Since Gödel had shown a formal system can't prove itself consistent, it was then of interest to ask how much more than a given formal system would be needed to prove it consistent. Gerhard Gentzen produced some interesting results of this sort, exploring the formal consequences of postulating restricted forms of mathematical induction (before he was arrested by the Germans late in World War II, transferred to the custody of the Soviets, and starved to death in a Soviet POW camp). Turing's dissertation explored another approach: when considering a formal system L, simply construct a new system L₁ that adds to L a postulate saying L is consistent. Naturally you can't add that postulate to L since it would (presuming sufficient power) cause L to become inconsistent if wasn't already; but if L actually was consistent to start with, L₁ should be consistent too since L₁ can't prove its own consistency, only that of L. To prove the consistency of L₁, you can construct an L₂ that adds to L₁ a postulate saying L₁ is consistent. And so on. In fact, Turing didn't even stop with L_k for every positive integer k; he supposed a family of formal systems L_a for any a representing an ordinal. Ordinals are (broadly) the sizes of well-ordered sets; there are infinitely many countably infinite ordinals; there are uncountably infinite ordinals. The only restriction for Turing was that since all this was being done with formal systems, the ordinals α had to be finitely represented.

Turing made the — imho, eminently reasonable — suggestion that the value of this technique lies in explicitly identifying where additional assumptions are being introduced. The additional assumptions themselves, consistency of all the individual ordinal logics in the system, are justified only by fiat, so nothing has been apparently contributed toward reaching truth through computation; the axioms weren't reached through computation, merely acknowledged in computation. Interestingly, nothing has been contributed toward getting around Gödel's Theorems, either: in terms of the framework set up in this post, the entire system of logics can be handled by a single formal system that is itself still subject to the Theorems.

I set out to simplify the situation, pruning unnecessary complications in order to better understand the essence of Gödel's results. It seems an encouraging sign for the success of that agenda, that Turing's authentically interesting but elaborate dissertation has little to say about my treatment — since this means the complexities addressed by his approach have been pruned.

Where do we go from here?

Types. The above suggests correctness proof should be separated from detailed program syntax. As I've remarked elsewhere, it should be possible to maintain propositions and theorems as first-class objects, so that proof becomes the computational act of constructing a theorem object — if one can determine the appropriate rules of deduction, which makes this concern dependent on choosing the appropriate notion of proof.

Truth. Gödel's First Theorem, as treated here, says that truth about the behavior of a formal logic must conform to a certain constraint on the shape of that truth. The impression I've picked up, in constructing this post, is that this constraint should not be particularly troubling. A Euclidean triangle can't have more than one obtuse interior angle; and the truth about a formal logic can't include both completeness and consistency. I've been progressively mellowing about this since the moment I started thinking of a formal system as a filter on truth — an insight preserved above at the comparison to the I Ching.

Proof. Gödel's Second Theorem, as treated here, not only places a notably small constraint on proof (as the First Theorem does on truth), but does so in a way notably disengaged from controversial rules of deduction. The insight I take from this is that, in seeking a robust notion of proof, Gödel's Theorems aren't all that important. The great paradigm crisis of the early twentieth century (in the admittedly rather specialized area of mathematical foundations) was about antinomies implied by the axioms, not self-proofs of consistency. The self-proofs of consistency were just an envisioned possible countermeasure, and when Gödel showed the countermeasure couldn't work, the form of his Second Theorem, together with the lines of research people had already been pursuing, led them to the red herring of an infinite hierarchy of successively more powerful systems each able to show the consistency of those below it. Whatever interesting results may accompany the infinite hierarchy, I now suspect they have nothing much to say about a robust maximum of proof. Turing's ordinal logics — in essence an explicit manifestation of the infinite hierarchy — were, remember, about assuming consistency, not establishing it.

So it seems to me the problem of interest — even in my earlier post on bypassing no-go theorems, where Gödel's results served as the rallying point for my explorations — is how to fix the axioms. In that regard, I have a thought, which clearly calls for a separate post, but I'll offer it here to mull over.

I speculated in the no-go theorems post that where the proof of Russell's Paradox ends after a few steps, leaving a foundational problem, an analogous recursive predicate in Lisp simply fails to terminate, which might or might not be considered a bug in the program but doesn't have the broad foundational import of the Paradox. If a person reasoned about whether or not the set A of all sets that do not contain themselves contains itself, they might say, "If A does not contain itself, then by definition it does contain itself; but then, since it does contain itself, by definition it does not contain itself; but then, since it does not contain itself, by definition it does contain itself..." — and the person saying this quickly sees that the reasoning is going 'round in circles. This is much like

($define! A ($lambda (P) (not? (P P)))) (A A)

except that the Lisp interpreter running this code probably doesn't deduce, short of a memory fault or the like, that the computation won't halt, whereas the human quickly deduces the non-halting. The formal proof, however, does not go 'round in circles. It says, "Suppose A does not contain itself. By definition, since A does not contain itself, A does contain itself. That's a contradiction, therefore by reductio ad absurdum, A does not not contain itself. By the Law of the Excluded Middle, since A does not not contain itself, A does contain itself. By definition, since A does contain itself, A does not contain itself. That's an antinomy." Why did this not go 'round in circles? Because the initial premise, "Suppose A does not contain itself", was discharged when reductio ad absurdum was applied. The human reasoner, though, never forgot about the initial assumption.

This isn't quite the same as a conditional proof, in which you start with an assumption P, show that this assumption leads to a consequence Q, and then conclude from this that P⇒Q regardless of whether or not P holds; the assumption has been discharged, but it lingers embedded in the conclusion. It only appears to have discharged the assumption because we have a notion of implication that neatly absorbs the memory of our having assumed P in order to prove Q. Really, all proofs are conditional in the sense that they only hold if we accept the rules of deduction by which we reached them; but we can't absorb that into our propositions using logical implication. We could still preface all of our propositions with "suppose that such-and-such laws of deduction are valid"; but when writing it formally we'd need a different notation because it's not the same thing as logical implication. We might be tempted to use something akin to a turnstile, which is (in my experience) strictly a meta-level notation — but in the case of reductio ad absurdum, it seems we don't want a meta-level notation. We want to qualify our propositions by the memory of other propositions we've rejected because they led to contradictions.

I'd expect to explore what this does to paradoxes (not just Russell's); I'm skeptical that in itself it would eliminate all the classical paradoxes. I do rather like the idea that the classical paradoxes are all caused by a single flaw, but I suspect the single flaw isn't in a single axiom; rather, it may be a single systemic flaw. It seems that, in some sense, not discharging the assumption in reductio ad absurdum is a way to detect impredicativity, the same source of paradoxes type theory was meant to eliminate.

I look forward to exploring that. Along with my dozen-or-so other draft post topics.

By construction of A₀,	A₀(i)=(A,i,confirm).
By construction of T_yes,	T_yes(A,i,confirm)=[A:i⇒confirm].
By construction of T_no,	T_no(A,i,confirm)=[A:i⇏confirm].
By consistency self-proof,	L confirms [((L∘T_yes)∧(L∘T_no)):(A,i,confirm)⇏confirm].
By [d],	L confirms [(((L∘T_yes)∧(L∘T_no))∘A₀):i⇏confirm].
By [g],	L confirms [(((L∘T_yes)∘A₀)∧((L∘T_no)∘A₀)):i⇏confirm].
By _[f] and [c],	L confirms [((L∘(T_yes∘A₀))∧(L∘(T_no∘A₀))):i⇏confirm].
By definition of A,	L confirms [((L∘(T_yes∘A₀))∧A):i⇏confirm].
By [e],	L confirms [((L∘(A⇒confirm))∧A):i⇏confirm].
By [h],	L confirms [(A∧A):i⇏confirm].
By [b],	L confirms [A:i⇏confirm].
By [a] and definition of A,	L confirms [A:i⇒confirm].