Saturday, September 22, 2018

Discourse and language

"Of course, we practiced with computer generated languages.  The neural modelers created alien networks, and we practiced with the languages they generated."

Ofelia kept her face blank.  She understood what that meant:  they had created machines that talked machine languages, and from this they thought they had learned how to understand alien languages.  Stupid.  Machines would not think like aliens, but like machines.

Remnant Population, Elizabeth Moon, 1996, Chapter Eighteen.
A plague o' both your houses!
— Mercutio, Romeo and Juliet, William Shakespeare, circa 1591–5, act 3 scene 1.

I've realized lately that, for my advancement as a conlanger, I need to get a handle on discourse, by which I mean, the aspects of language that arise in coordinating texts beyond the scope of the ordinary grammatical rules of sentence formation.  Turns out that's a can of worms; in trying to get a handle on discourse, I find myself confronting what are, as best I can work out, some of the messiest controversies in linguistics in the modern era.

I think conlanging can provide a valuable service for linguistics.  My purpose in this post is overtly conlinguistic:  for conlanging theory, I want to understand the linguistic conceptual tangle; and for conlanging practice, I want methodology for investigating the discursive (that would be the adjectival form of discourse) dynamics of different language arrangements.  But linguistics —I've in mind particularly grammar— seems to me to have gotten itself into something of a bind, from a scientific perspective.  It's got a deficit of falsifiable claims.  Since we've a limited supply of natural languages, we'd like to identify salient features of the ones we have; but how can we make falsifiable claims about which features matter unless we have an unconstrained space of possibilities within which we can see that natural languages are not randomly distributed?  We have no such unconstrained space of possibilities; it seems we can only define a space of possibilities by choosing a particular model of how to describe language, and inability to choose between those models is part of the problem; they're all differently constrained, not unconstrained.  Conlanging, though, lets us imagine possibilities that may defy all our constrained models — if we don't let the models constrain our conlanging — and any tool that lets us do conlang thought-experiments without choosing a linguistic model should bring us closer to glimpsing an unconstrained space of possibilities.

As usual, I mean to wade in, and document my journey as well as my destination; such as it is.  In this case, though a concrete methodology is my practical goal, it will be all I can manage, through mapping the surrounding territory (theoretical and practical), to see the goal more clearly; actually achieving the goal, despite all my striving for it here, will be for some future effort.  The search here is going to take rather a lot of space, too, not least when I start getting into examples, since it's in the nature of the subject —study of large language structures— that the examples be large.  If one wants to get a handle on a technique suitable for studying larger language structures, though, I reckon one has to be willing to pay the price of admission.

Contents
Advice
Academe
Gosh
Easy as futhorc
Revalency
Relative clauses
The deep end
Advice

Some bits of advice I've picked up from on-line conlangers:

  • The concept of morphemes — word parts that compose to make a word form, like in‑ describe ‑able → indescribable — carries with it a bunch of conceptual baggage, about how to think about the structure of language, that is likely counterproductive to inject into conlanging.  David J. Peterson makes this point.

  • Many local features of language are best appreciated when one sees how they work in an extended discourse.  This has been something of a recurring theme on the Conlangery Podcast, advice they've given about a variety of features.

  • Ergativity, a feature apparently fascinating to many conlangers, may not even be a thing.  I was first put onto this objection by the Conlangery Podcast, who picked it up from the linguistic community; it surfaced, afaict, in the keynote address of a 2005 conference on ergativity.

The point about morphemes is that they are just one way of thinking about how word forms arise.  In fact, they are an unfalsifiable way to think about it:  anything that language does, one can invent a way to describe using morphemes (ever hear of a subtractive morpheme?).  That might be okay if you're trying to make sense of the overwhelming welter of natural languages, but that doesn't make it a good way to construct a language; at least, not a naturalistic artlang.  If you try to reverse-engineer morphemic analysis to construct a language, you'll end up with a conlang whose internal structure is the sort of thing most naturally described by morphemic analysis.  It's liable to feel artificial; which, from a linguistic perspective, may suggest that morphemic analysis isn't what we're doing when we use language.

There are a couple of alternatives available to morphemic analysis.  There's lexeme-based morphology; that's where you start with a "stem" and perform various processes on it to get its different word forms.  For a noun that's called declining, for a verb it's conjugating.  The whole collection of all the forms of the word is called a lexeme; yes, that's another -eme word for an abstract entity defined by some deeper structure (like a phoneme, abstracted from a set of different allophones that are all effectively the same sound in the particular language being studied; or a morpheme that is considered a single abstract entity although it may take different concrete forms in particular words, like English plural -s versus -es that might be analyzed as two allomorphs of the same morpheme; though afaik there's no allo- term corresponding to lexeme).  The other major alternative is word-based morphology, in which the whole collection of word forms is considered as a set, rather than trying to view it as a bunch of different transformations applied to a stem.  For naturalistic artlanging — or for linguistic experimentation — word-based morphology has the advantage that it doesn't try to artificially impose some particular sort of internal structure onto the word; but then again, not all natlangs are equally chaotic.  For example, morpheme-based morphology is more likely to make sense if you're studying an agglutinative language (which prefers to simply concatenate word parts, each with a single meaning), while becoming more difficult to apply with a fusional language (where a single word part can combine a whole bunch of meanings, like a Latin noun suffix for a particular combination of gender, case, and number).
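The contrast among the three approaches can be caricatured in a few lines of code.  This is a toy sketch invented purely for illustration (the function and table names are my own, not any linguist's formalism), using English plurals:

```python
# Morpheme-based: a word form is a concatenation of meaningful parts,
# with an "allomorph" of the plural morpheme chosen by the stem's ending.
def morphemic_plural(stem):
    suffix = "es" if stem.endswith(("s", "x", "z", "ch", "sh")) else "s"
    return stem + suffix

# Lexeme-based: start from a stem and apply a process (here, a function)
# to derive each form; the resulting set of forms is the lexeme.
def decline(stem):
    return {"singular": stem, "plural": morphemic_plural(stem)}

# Word-based: no internal analysis at all; the whole paradigm is simply
# listed as a set, which handles suppletion like "person/people" for free.
PARADIGMS = {
    "dog":    {"singular": "dog",    "plural": "dogs"},
    "box":    {"singular": "box",    "plural": "boxes"},
    "person": {"singular": "person", "plural": "people"},
}

print(morphemic_plural("box"))        # boxes
print(decline("dog")["plural"])       # dogs
print(PARADIGMS["person"]["plural"])  # people
```

The word-based table is the least "clever" of the three, which is exactly the point: it builds in no assumptions about internal word structure, at the cost of stating everything explicitly.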

As I've tried to increase my sophistication as a conlanger, more and more I've come up against things for which discourse is recommended.  But, I perceive a practical problem here.  Heavyweight conlangers tend to be serious polyglots.  Such people tend to treat learning a new language as something done relatively casually.  Study the use of this feature in longer discourses, such a person might suggest, to get a feel for how it works.  But to do that properly, it seems you'd have to reach a fairly solid level of comfort in the language.  Not everyone will find that a small investment.  And if you want to explore lots of different ways of doing things, multiplying by a large investment for each one may be prohibitive.

So one would really like to have a method for studying the discursive properties of different language arrangements without having to acquire expensive fluency in each language variant first.  Keeping in mind, it's not all that easy even to give one really extended example of discourse, nor to explain to the reader what they ought to be noticing about it.

Okay, so, what's the story with ergativity?  Here's how the 2005 paper explains the general concern:

when we limit a collection to certain kinds of specimens, there is a question whether a workshop on "ergativity" is analogous to an effort to collect, say, birds with talons — an important taxonomic criterion —, birds that swim — which is taxonomically only marginally relevant, but a very significant functional pattern —, or, say, birds that are blue, which will turn out to be pretty much a useless criterion for any biological purpose.
— Scott DeLancey, "The Blue Bird of Ergativity", 2005.
The paper goes on to discuss particular instances of ergativity in languages, and the sense I got in reading these discussions was (likely as the author intended) that what was going on in these languages was really idiosyncratic, and calling it "ergativity" didn't begin to do it justice.

Now, another thing often mentioned by conlangers in the next breath after ergativity is trigger languages.  Trigger languages are yet another way of marking the relations between a verb and its arguments, different again from nominative-accusative languages (like English) or ergative-absolutive languages (like Basque).  There's a catch, though.  Trigger languages are a somewhat accidental invention of conlangers.  They were meant to be a simplification of Austronesian alignment — but  (a) there may have been some misunderstanding of what linguists were saying about how Austronesian alignment works, and  (b) linguists are evidently still trying to figure out how Austronesian alignment works.  From poking around, the sense I got was that Austronesian alignment is only superficially about the relation of the verb to its arguments; really, it's about some other property of nouns — specificity? — and, ultimately, one really ought to... study its use in longer discourses, to get a feel for how it works.  Doh.

Another case where exploratory discourse is especially recommended is nonconfigurationality.  Simply put, it's the property of some languages that word order isn't needed to indicate the grammatical relationships of words within a sentence, so that word order within a sentence can be used to indicate other things — such as discursive structure.  Here again, though, there's a catch.  There's a distinct difference between "configurational" languages like English, where word order is important in determining the roles of words in a sentence (the classic example is dog bites man versus man bites dog), and "nonconfigurational" languages like ancient Greek (or conlang Na'vi), where word order within a given sentence is mostly arbitrary.  However, the massively polysyllabic terms for these phenomena, "configurational" and "nonconfigurational", come specifically from the phrase-structure school of linguistics.  That's, more-or-less, the Chomskyists.  Plenty of structural assumptions there, with a side order of controversy.  So, just as you take on more conceptual baggage by saying "bound morpheme" than "inflection", you take on more conceptual baggage by saying "nonconfigurational" than "free word order".
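The free-word-order point can be made concrete with a toy parser.  Everything here is invented for illustration: a pseudo-language fragment where a final -o marks the subject and a final -a the object (loosely echoing how case-marking languages free up word order), not any real natlang or any particular grammatical formalism.

```python
# In this invented fragment, grammatical role is read off case suffixes,
# so "dog bites man" keeps its meaning under any permutation of the words.
def parse(sentence, verbs=("bites",)):
    subject = obj = verb = None
    for word in sentence.split():
        if word in verbs:
            verb = word
        elif word.endswith("o"):   # subject marker (nominative-like)
            subject = word[:-1]
        elif word.endswith("a"):   # object marker (accusative-like)
            obj = word[:-1]
    return (subject, verb, obj)

# All orderings yield the same grammatical relations:
print(parse("dogo bites mana"))  # ('dog', 'bites', 'man')
print(parse("mana dogo bites"))  # ('dog', 'bites', 'man')
print(parse("bites mana dogo"))  # ('dog', 'bites', 'man')
```

Since the parse never consults word position, position is left free to carry something else — emphasis, topic, discursive structure — which is just the opportunity nonconfigurational languages exploit.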

Is there another way of looking at "nonconfigurationality", without the phrase-structure?  The Wikipedia article on nonconfigurationality notes that in dependency grammar, the configurational/‌nonconfigurational distinction is meaningless.  One thing about Wikipedia, though:  more subtle than merely "don't take it as gospel", you should think about the people who wrote what you're reading.  In this case, some digging turns up a remark in the archives of Wikipedia's Linguistics project, to the effect that, we really appreciate your contributions, but please try for a more balanced presentation of different points of view, as for example in the nonconfigurationality article you spend more time talking about how dependency grammar doesn't like it than actually talking about the thing itself.

It seems Chomskyism is not the only linguistic school that has its enthusiasts.

The thought did cross my mind, at this point, that if dependency grammar fails to acknowledge the existence of a manifest distinction between configurational and nonconfigurational languages, that's not really something for dependency grammar to brag about. (Granting, this failure-to-acknowledge may be based on a narrower interpretation of "nonconfigurationality" than the phrase-structurists actually had in mind.)

In reading up on morpheme-like concepts, I came across the term phonaestheme — which caught my attention since I was aware J.R.R. Tolkien's approach to language, both natural and constructed, emphasized phonaesthetics.  A phonaestheme is a bit of the form of a word that's suggestive of the word's meaning, without actually being a "unit of meaning" as a morpheme would be; that is, a word isn't composed of phonaesthemes, it just might happen to contain one or more of them.  Notable phonaesthemes are gl- for words related to light or vision, and sn- for words related to the nose or mouth.  Those two, so we're told, were mentioned two and a half millennia ago, by Plato.

The whole idea of phonaesthemes flies in the face of the principle of the arbitrariness of signs.  More competing schools of thought.  This is a pretty straightforward disagreement:  Swiss linguist Ferdinand de Saussure, 1857–1913, apparently put great importance on the principle that the choice of a sign is completely arbitrary, absolutely unrelated to its meaning; obviously, the idea that words of certain forms tend to have certain kinds of meanings is not consistent with that.

That name, Ferdinand de Saussure, sounded familiar to me.  Seems he was hugely influential in setting the course for twentieth-century linguistics, and he's considered the co-founder, along with C.S. Peirce, of semiotics — the theory of signs.

Semiotics definitely rang a bell for me.  Not a particularly harmonious one, alas; my past encounters with semiotics had not been altogether pleasant.

Academe

Back around 1990, when I first started thinking about abstraction theory, I did some poking around to get a broad sense of who in history might have done similar work.  There being no systematic database of such things (afaik) on the pre-web internet, I did my general poking around in a hardcopy Encyclopædia Britannica.  Other than some logical terminology to do with defining sets, and specialized use of the term abstraction for function construction in λ-calculus, I found an interesting remark that (as best I can now recall) while Scholasticism, the dominant academic tradition in Europe during the Dark Ages, was mostly concerned with theological questions, one (or did the author claim it was the one?) non-theological question Scholastics extensively debated was the existence of universals — what I would tend to call "abstractions".  There were three schools of thought on the question of universals.  One school of thought said universals have real existence, perhaps even more real than the mundane world we live in; that's called Platonism, after Plato who (at least if we're not misreading him) advocated it.  A second school of thought said universals have no real existence, but are just names for grouping things together.  That's called nominalism; perhaps nobody believes it in quite the most extreme imaginable sense, but a particularly prominent representative of that school was William of Ockham (after whom Occam's razor is named).  In between these two extremes was the school of conceptualism, saying universals exist, but only as concepts; John Locke, who wrote An Essay Concerning Human Understanding (quoted at the beginning of the Wizard Book for its definition of abstraction), is cited as representative.

That bit of esoterica didn't directly help me with abstraction theory.  Many years later, though, in researching W.V. Quine's dictum To be is to be the value of a variable (which I'd been told was the origin of Christopher Strachey's notion of first-class value), when I read a claim by Quine that the three early-twentieth-century schools of thought on the foundations of mathematics — logicism, intuitionism, and formalism — were latter-day manifestations of the three medieval schools of thought on universals, I was rather bemused to realize I understood what he was saying.

I kept hoping, though, I'd find some serious modern research relevant to what I was trying to do with abstraction theory.  So I expanded the scope of my literature search to my alma mater's university library, and was momentarily thrilled to find references to treatment of semiotics (I'd never heard the term before) in terms of sets of texts, which sounded a little like what I was doing.  It took me, iirc, one afternoon to be disillusioned.  Moving from book to book in the stacks, I gathered that the central figure in the subject in modern times was someone (whom I'd also never heard of before) by the name of Jacques Derrida.  But it also became very clear to me that the material was coming across to me as meaningless nonsense — suggesting that either the material was so alien it might as well have been ancient Greek (I hadn't actually learned the term nonconfigurational at that point, but yes, same language), or else that the material was, in fact, meaningless nonsense.

The modern growth of the internet, all of which has happened since my first literature searches on that subject, doesn't necessarily improve uniformly on what could be done by searching off-line through physical stacks of books and journals in a really good academic library (indeed, it may be inferior in some important ways), but what is freely available on-line can be found with a lot less effort (if you can devise the right keywords to search for; which was less of a problem for off-line searches before the old physical card catalogs were destroyed — but I digress).  Turns out I'm not alone in my reaction to Derrida; here are some choice quotes about him from Wikiquote:

Derrida's special significance lies not in the fact that he was subversive, but in the fact that he was an outright intellectual fraud — and that he managed to dupe a startling number of highly educated people into believing that he was onto something.
— Mark Goldblatt, "The Derrida Achievement," The American Spectator, 14 October 2004.
Those who hurled themselves after Derrida were not the most sophisticated but the most pretentious, and least creative members of my generation of academics.
— Camille Paglia, "Junk Bonds and Corporate Raiders : Academe in the Hour of the Wolf", Arion, Spring 1991.
Many French philosophers see in M. Derrida only cause for silent embarrassment, his antics having contributed significantly to the widespread impression that contemporary French philosophy is little more than an object of ridicule.  M. Derrida's voluminous writings in our view stretch the normal forms of academic scholarship beyond recognition. Above all — as every reader can very easily establish for himself (and for this purpose any page will do) — his works employ a written style that defies comprehension.
— Barry Smith et al. "Open letter against Derrida receiving an honorary doctorate from Cambridge University", The Times (London), Saturday, May 9, 1992.
This is the intellectual climate in which, in the 1990s, physicist Alan D. Sokal submitted a nonsense article to the peer-reviewed scholarly journal Social Text, to see what would happen — and it was published (link).

One might ask what academic tradition (if any) Derrida's work came from.  Derrida references Saussure.  Derrida's approach is sometimes called post-structuralism, as it critiques the structuralist tradition of the earlier twentieth century.  Structuralism, I gather, said that the relation between the physical world and the world of ideas must be mediated by the structures of language.  (In describing post-structuralism one may cover a multitude of sins with a delicate term like "critique", such as denying that the gap between reality and ideas can be bridged, or denying that there is such a thing as reality.)  Structuralism, in turn, grew out of structural linguistics, the theory that language could be understood as a hierarchy of discrete structures — phonemes, morphemes, lexical categories, and so on.  Structural linguistics is due in significant part to Ferdinand de Saussure.

It doesn't seem fair to blame Saussure for Derrida.  Apparently a large part of all twentieth-century linguistic theory traces back through Saussure.  Saussure's tidily structured approach to linguistics does appear to have led to both the Chomskyist and (rather less directly, afaict) the dependency grammar approaches — the phrase-structure approach is also called constituency grammar to contrast with dependency, as the key difference is whether one looks at the parts (constituents) or the connections (dependencies).  Despite my suspicion that both of those approaches may have inherited some over-tidiness, I'm not inclined to "blame" Saussure for them, either; it seems to me perfectly possible that the structural strategy may have been a good way to move things forward in its day, and also not be a good way to move things forward from where we are now.  The practical question is, where-to next?

That term phonaestheme, which reminded me of phonaesthetics associated with J.R.R. Tolkien?  Turns out phonaestheme was coined by J.R. Firth, an English contemporary of J.R.R. Tolkien.  Firth was big on the importance of context.  "You shall know a word by the company it keeps", he's quoted from 1957.  Apparently he favored "polysystematism", which afaict means that you don't limit yourself to just one structural system for studying language, but switch between systems opportunistically.  Since that's pretty much what a conlanger has to do — whatever works — I rather like the attitude.  "His work on prosody," says Wikipedia somewhat over-sagaciously, "[...] he emphasised at the expense of the phonemic principle".  It took me a few moments to untangle that; it says a lot less than it seems; as best I can figure, prosody is aspects of speech sound that extend beyond individual sound units (phonemes), and the phonemic principle basically says all you need are phonemes, i.e., you don't need prosody.  So... he emphasized prosody at the expense of the principle that you don't need prosody?  Doesn't sound as impressive, somehow.  Unsurprisingly, Saussure comes up in the early history of the idea of phonemes.

In hunting around for stuff about discourse, I've been aware for a while there's another whole family of approaches to grammar called functional grammar — as opposed to structural grammar.  So the whole constituent/‌dependency thing is all structural, and this is off in a different world altogether.  Words are considered for their purpose, which naturally puts a lot of attention on discourse because a lot of the purpose of words has to do with how they fit into their larger context (that's why the advice to consider discourses in the first place).  There are a bunch of different flavors of functional grammar, including one — systemic functional grammar — due to Firth's student Michael Halliday; Wikipedia notes that Halliday approaches language as a semiotic system, and lists amongst the influences on systemic functional grammar both Firth and Saussure.  (Oh what a tangled web we weave...)  I keep hoping to find, somewhere in this tangle, a huge improvement on the traditional —evidently Saussurean— approach to grammar/‌linguistics I'm more-or-less familiar with.  Alas, I keep being disappointed to find alien vocabulary and alien concepts, and keep gradually coming to suspect that a lot of what the structural approach can do well (there are things it does well) has been either sidelined or simply abandoned, while at the same time the terminology has been changed more than necessary, for the sake of being different.

It's a corollary to the way new scientific paradigms seek to gain dominance (somewhat relevant previous post:  yonder) that the new paradigm will always change more than it needs to.  A new paradigm will not succeed if it tries merely to improve those things about the old paradigm that need improvement.  Normal science gets its effectiveness from the fact that normal scientists don't have to spend time and effort defending their paradigm, so they can put all that energy into working within the paradigm, and thereby make rapid progress at exploring that paradigm's space of possible research.  Eventually this leads to clear recognition of the inadequacies of the paradigm; but even then, many folks will stick to the old paradigm, and we probably shouldn't think too poorly of them for doing so, even though we might think they're being shortsighted in the particular case.  Determination to make one or another paradigm work is the wind in science's sails.  But, exactly because abandoning the old paradigm for a new one is so traumatic, nobody's going to want to do it for a small reason.  And those who do want to do it are likely to want to disassociate themselves from the old paradigm entirely.  That means changing way more than necessary.  Change for its own sake, far in excess of what was really needed to deal with the problems that precipitated the paradigm shift in the first place.

Another thread in the neighborhood of functional grammar is emergent grammar, a view of linguistic phenomena proposed in a 1987 paper by British-American linguist Paul Hopper.  Looking over that paper gave me a better appreciation of the structuralism/‌functionalism schism as a struggle between rival paradigms.  As Thomas Kuhn noted, rival paradigms aren't just alternative theories; they determine what entities there are, what questions are meaningful, what answers are meaningful — so followers of rival paradigms can fail to communicate by not even agreeing on what the subject of discussion is.  Notably, even Hopper's definition of discourse isn't the same as what I thought I was dealing with when I started.  My impression, starting after all with traditional (structural) grammar by default, was that discourse is above the level of a sentence; but for functional grammarians, to my understanding, that sentence boundary is itself artificial, and they'd object to making any such strong distinction between intra-sentence and inter-sentence.  Hopper's paper is fairly willing to acknowledge that traditional grammatical notions aren't altogether illusions; its point is that they are only approximations of the pattern-matching reality assembled by language speakers, for whom the structure of language — abstract rules, idioms, literary allusions, whatever — is perpetually a work in progress.

Which sounds great... but, looking through some abstracts of more recent work in the emergent grammar tradition, one gets the impression that much of it amounts to "we don't yet have a clue how to actually do this".  So once again, it seems there's more backing away from traditional structural grammar than replacing it; I've sympathy for their plight, as anyone trying to develop an alternative to a well-established paradigm is sure to have a less developed paradigm than their main competition, but that sympathy doesn't change my practical bottom line.

It was interesting to me, looking through Hopper's paper, that while much of it was quite accessible, the examples of discourse were not so much.

Gosh

Fascinating though the broad sweep of these shifting paradigmatic trends may be, it seems kind of like overkill.  I do believe it's valuable big picture, but now that we've oriented to that big picture, it seems we ought to come down to Earth a bit if we're to deal with the immediate problem; I started just wanting a handy way to explore how different conlang features play out in extended discourse.  As a conlanger I've neither been a great fan of linguistic universals (forsooth), nor felt any burning need to overturn the whole concept of grammatical structure.  As I've remarked before, a structural specification of a conlang is likely to be the conlang's primary identity; most conlangs don't have a lot of L1 speakers with which to do field interviews.  Granting Kuhn's observation that a paradigm determines what questions and answers are possible, if a linguistic paradigm doesn't let me effectively answer the questions I need to answer to define my conlang, I won't be going in whole-hog for that linguistic paradigm.

Also, as remarked earlier, the various modern approaches — both structural and functional — analyze (natural) language, and there's no evident reason to suppose that running that analysis in reverse would make a good way to construct a language, certainly not if one hopes for a result that doesn't have the assumptions of that analysis built in.

So, for conlanging purposes, what would an ideal approach to language look like?

Well, it would be structural enough to afford a clear definition of the language.  It would be functional enough to capture the more free-form aspects of discourse that intrude even on sentences in "configurational" languages like English.  In all cases it would afford an easily accessible presentation of the language.  Moreover, we would really like it to — if one can devise a way to achieve this — avoid pre-determining the range of ways the conlang could work.  It might be possible, following a suitable structuralist paradigm, to reduce the act of building a language to a series of multiple-choice questions and some morpheme entries (or just tell the wizard to use a pseudo-random-number generator), but the result would not be art, just as paint-by-numbers isn't art; and, in direct proportion to its lack of artistry, it would lack value as an exploration of unconstrained language-space.  For my part, I see this as an important and rather lovely insight:  the art of conlanging is potentially useful to the science of linguistics only to the extent that conlanging is an art rather than a science.

The challenge has a definitional aspect and a descriptive aspect.  One way to define how a language works is to give a classical structural specification.  This can be relatively efficient and lucid, for the amount of complexity it can encompass.  As folks such as Hopper point out, though, it misses a lot of things like idioms, and proverbs, and overall patterns of discourse.  Not that they'd necessarily deny the classical structural description has some validity; it's just not absolute, nor complete.  We'd like to be able to specify such linguistic patterns in a way that includes the more traditional ones and also includes all these other things, in a spectrum.  Trouble is, we don't know how.  One might try to do it by giving examples, and indeed with a sufficient amount of work that might more-or-less do the job; but then the descriptive aspect rears its head.  Some of these patterns are apparently quite complicated and subtle, and by-example is quite a labor-intensive way to describe, and quite a labor-intensive way to learn, them.  Insisting on both aspects at once, definitional and descriptive, isn't asking "too much", it's asking for what is actually needed for conlanging — which makes conlanging a much more practical forum for thrashing this stuff out; an academic discipline isn't likely to reject a paradigm on the grounds that it isn't sufficiently lucid for the lay public.  The debatable academic merits of some occult theoretical approach to linguistics are irrelevant to whether an artlang's audience can understand it.

So what we're looking for is a lucid way to describe more-or-less-arbitrary patterns of the sort that make up language, ranging from ordinary sentence structure through large-scale discourse patterns and whatnot.  Since large-scale discourse patterns are, afaict, already both furthest from lucidity and furthest from being covered by the traditional structural approach, they seem a likely place to start.

Easy as futhorc

Let's take one of those extended examples that I found impenetrable in Hopper's paper.  It's a passage from the Anglo-Saxon Chronicle, excerpted from the entry for Anno Domini 755; that's the first year that has a really lengthy report (it's several times the length of any earlier year).  Here is the passage as rendered by Wikisource (as Hopper's paper did not fare well in conversion to html).  It's rather sparsely punctuated; instead it's liberally sprinkled with the symbol ⁊, shorthand for "and" (even at junctures where there is a period).  The alphabet used is a variant of Latin with several additional letters — æ and ð derived from Latin letters, þ and ƿ derived from futhorc runes (the runic alphabet in which Anglo-Saxon had been written in earlier times, whose first six runes are feoh ur þorn os rad cen — futhorc).

⁊ þa geascode he þone cyning lytle werode on wifcyþþe on Merantune ⁊ hine þær berad ⁊ þone bur utan beeode ær hine þa men onfunden þe mid þam kyninge wærun
The point Hopper is making about this passage has to do with the way its verbs and nouns are arranged, which wouldn't have to be arranged that way under a traditional structural description of the "rules" of Anglo-Saxon grammar.  Truthfully, coming to it cold, his point fell completely flat for me because only laborious scrutiny would allow me to even guess which of those words are verbs and which are nouns, let alone how the whole is put together.  And that is the basic problem, right there:  the pattern meant to be illustrated can't be seen without first achieving a level of comfort with the language that may be expensive.  If, moreover, you want to consider a whole range of different ways of doing things (as I have sometimes wanted to do, in my own conlanging), the problem is greatly compounded.

Since Hopper's point involves logical decomposition of the passage into segments, he does so and sets next to each its translation (citing Charles Plummer's 1899 translation); as Hopper's paper (at least, in html conversion) rather ran together each segment with its translation, making them hard to separate by eye, I've added tabular format:

Anglo-Saxon                     English
⁊ þa geascode he þone cyning    and then he found the king
lytle werode                    with a small band of men
on wifcyþþe                     a-wenching
on Merantune                    in Merton
⁊ hine þær berad                and caught up with him there
⁊ þone bur utan beeode          and surrounded the hut outside
ær hine þa men onfunden         before the men were aware of him
þe mid þam kyninge wærun        who were with the king
Table 1.
I looked at that and struggled to reason out which Anglo-Saxon word contributes what to each segment (and even then it was just a guess).  The problem is further highlighted by Hopper's commentary, where he chooses to remark particularly on which bits are verb-initial and which are verb-final — as if I (his presumed interested, generally educated but lay, reader) could see at a glance which words are the verbs, or, as he may have supposed, just see at a glance the whole structure of the thing.

We can glimpse another part of the same elephant from Tolkien's classic 1936 lecture "Beowulf: The Monsters and the Critics", in which he promoted the wonderfully subversive position that Beowulf is beautiful poetry, not just a subject for dry academic scholarship.  His lecture has been hugely influential ever since; but my point here is that he was one of those polyglots I was talking about earlier, and was able to appreciate the beauty of the poem because he was fluent in Old English (as well as quite a lot of other related languages, including, of all things, Gothic).  I grok that such beauty is best appreciated from the inside; but it really is difficult for mere mortals to get inside like that.  One suspects a shortfall of deep fluency even amongst the authors of academic treatises on Beowulf may have contributed significantly to the dryness Tolkien was criticizing.  My concern here is that we want to be able to illustrate (and even investigate) facets of the structure of discourses without requiring prior fluency; if these illustrations also contribute to later fluency, that'd be wicked awesome.

The two problems with Table 1, apparently, are that it doesn't show what's going on with the individual words, and that it doesn't show what's going on with the significant part (whatever that is) of the high-level structure.  There's a standard technique meant to explain what the individual words are doing: glossing.  There are a couple of good reasons why one would not expect glossing to be a good fit here, but we need to start somewhere, so here's an attempt at an interlinear gloss for this passage:

⁊     þa     geascode        he           þone         cyning
and   then   intensive-ask   he           the          king
             3rd;sg;past     3rd;nom;sg   acc;msc;sg
and then found he the king

lytle             werode
small             troop
instrumental;sg   dative;sg
with a small band of men

on            wifcyþþe
on/at/about   woman.knowledge
              dative;sg
about woman-knowledge

on            Merantune
on/in/about   Merton
              dative
in Merton

⁊     hine     þær      berad
and   he       there    catch.up.with.by.riding
      acc;sg   adverb   3rd;sg;past
and him there caught up with

⁊     þone         bur    utan           beeode
and   the          room   from.without   bego/surround
      acc;msc;sg          adverb         3rd;sg;past
and the hut outside surrounded

ær       hine     þa       men      onfunden
before   he       the      man      en-find
         acc;sg   nom;pl   nom;pl   subjunctive;pl;past
before him the men became aware of

þe               mid    þam      kyninge     wærun
who/which/that   with   the      king        be
                        dative   dative;sg   pl;past
who with the king were
Table 2.
Okay.  Those two reasons I had in mind, why glossing would not be a good fit here, are both in evidence.  Basically we get simultaneously too much and too little information.

I remarked in a previous post that it's very easy for glossing to fail to communicate its information ("too little" information).  I didn't "borrow" the above gloss from someone else's translation (though I did occasionally compare notes with one); I put it together word-by-word, and got far more out of that than is available in Table 2.  The internal structures of some of those words are quite fascinating.  Hopper was talking about verb-initial and verb-final clauses, and I was sidetracked by the fact that his English translations didn't preserve the positions of the verbs; I've tried to fix that in Table 2, by tying the translation more closely to the original; but I was also thrown off by the translation "a-wenching", because it gave me the impression that was a verb-based segment.  I do like the translation a-wenching rather more than other translations I've found, as it doesn't beat around the bush; I also found womanizing, with a woman, and, just when I thought I'd seen it all, visiting a lady, which forcefully reminded me of Malory's Le Morte Darthur.  The original is a prepositional phrase, with preposition on and object of the preposition wifcyþþe.

I first consciously noticed about thirty years ago that prepositions are spectacularly difficult to translate between languages, an awareness that has shaped the directions of my conlanging ever since.  Wiktionary defines Old English (aka Anglo-Saxon) on as on/in/at/among.  wifcyþþe is even more fun; it isn't listed as a whole in Wiktionary, but an educated guess suggests it's a compound whose parts are individually listed — wif, woman/wife, and an inflected form of cyþþu, with suggested definitions of either knowledge, or homeland (country which is known to you).  So the king was on about woman-knowledge.  Silly me; I'd imagined that "biblical knowledge" thing was a euphemism for the sake of Victorian sensibilities, which perhaps it was in part, but the origin is at least a thousand years earlier and not apparently trying to spare anyone's sensibilities.  It also doesn't involve any verb, so I adjusted the translation to reflect that.

The declension of cyþþu was rather insightful, too.  The Wiktionary entry is under cyþþu because that's the nominative singular.  The declension table has eight entries; columns for singular and plural, rows for nominative, accusative, genitive, and dative; no separate row for the instrumental case, though instrumental does show up separately for some Old English nouns.  But here's the kicker:  cyþþe is listed in all the singular entries except nominative, and as an alternative for the plural nominative and accusative.  I've listed it as dative singular, because in this context (as best I can figure) it has to be dative to be the object of the preposition, and as a dative it has to be singular, but that really isn't an intrinsic property of the word.  It really seems very... emergent.  This word is somewhere in an intermediate state between showing these different cases and not showing them.  Putting it another way, the cases themselves are in an intermediate state of being:  the "reality" of those cases in the language depends on the language caring about them, and evidently different parts of the language are having different ideas about how "real" they should be (in contrast to unambiguous, purely regular inflections in non-naturalistic conlangs such as Esperanto or, for that matter, my own prototype Lamlosuo).

There's also rather more going on than the gloss can bring out in geascode and berad, which involve productive prefixes ge- and be- added to ascian and ridan — ge-ask = discover by asking/demanding (interrogating?), be-ride = catch up with by riding.  All that, beneath the level of what the gloss brings out — as well as the difficulty the gloss has bringing it out.  The gloss seems most suited to providing additional information when focusing in on what a particular word is doing within a particular small phrase; it can't show the backstory of the word ("too little" information) at the same time that it clutters any attempt to view a large passage ("too much" information; the sheer size of Table 2 underlines this point).  Possibly, for this purpose, the separate line for the translation is largely redundant, and could be merged with the gloss to save space; but there's still too much detail there.  The next step would be to omit some of the information about the inflections; but this raises the question of just which information about the words does matter for the sort of higher-level structure we're trying to get at.

Here's a compactified form based on the gloss, merging the gloss with the translation and omitting most of the grammatical notes.

⁊ þa geascode he þone cyning
  and | then | ge-asked (found) | he (nominative) | the (accusative) | king

lytle werode
  with a small (instrumental) | band of men (dative)

on wifcyþþe
  on/about | woman.knowledge (dative; wenching)

on Merantune
  in | Merton (dative)

⁊ hine þær berad
  and | him (accusative) | there (adverb) | be-rode (caught up with)

⁊ þone bur utan beeode
  and | the (accusative) | hut | outside (adverb) | be-goed (surrounded)

ær hine þa men onfunden
  before | him (accusative) | the men (nominative) | en-found (noticed)

þe mid þam kyninge wærun
  who | with | the (dative) | king (dative) | were
Table 3.
Imho this is better, bringing out a bit more of the most important low-level information, less of the dispensable low-level clutter, and perhaps leaving more opportunity for glimpses of high-level structure.  In this particular case, since the higher-level structure Hopper wants to bring out is simply where the verbs are, one might do that by putting the verbs in boldface, thus:
⁊ þa *geascode* he þone cyning
  and | then | *ge-asked* (found) | he (nominative) | the (accusative) | king

lytle werode
  with a small (instrumental) | band of men (dative)

on wifcyþþe
  on/about | woman.knowledge (dative; wenching)

on Merantune
  in | Merton (dative)

⁊ hine þær *berad*
  and | him (accusative) | there (adverb) | *be-rode* (caught up with)

⁊ þone bur utan *beeode*
  and | the (accusative) | hut | outside (adverb) | *be-goed* (surrounded)

ær hine þa men *onfunden*
  before | him (accusative) | the men (nominative) | *en-found* (noticed)

þe mid þam kyninge *wærun*
  who | with | the (dative) | king (dative) | *were*
Table 4.
Hopper's point is, broadly, that this follows the pattern of "a verb-initial clause, usually preceded by a temporal adverb such as a 'then'; [...] [which] may contain a number of lexical nouns introducing circumstances and participants [...] followed by a succession of verb-final clauses".  And indeed, we can now see that that's what's going on here.

The technique used by Table 4, with some success, also has a couple of limitations.  (1) It is specific to this one type of structure, with no apparent generalization.  (2) It appears to be a means only for showing the reader a pattern that the linguist already recognizes, rather than for the linguist to discover patterns, or, even more insightfully, for the linguist to experiment with how the high-level dynamics would be changed by an alteration in the rules of the language.  Are those other things too much to ask?  Heck no.  Ask, otherwise ye should expect not to receive.

Revalency

For a second case study to move things forward, I have in mind something qualitatively different; not a single extended passage with some known property(-ies) to be conveyed to the reader, but a battery of examples exploring different ways to arrange a language, meant to be exhaustive within a limited range of options.  It is, in fact, a variant of the case study that set me on the road to the discourse-representation problem.  About ten years ago, after first encountering David J. Peterson's essay on Ergativity, I dreamed up a verb alignment scheme, alternative to nominative-accusative (NA) or ergative-absolutive (EA), called valent-revalent (VR), and was curious enough to try to get a handle on it by an in-depth systematic comparison with NA and EA.  The attempt was both a success and a failure.  I learned some interesting things about VR that were not at all apparent to start with, but I'm unsure how far to credit the lucidity of the presentation — by which we want to elucidate things for both the conlanger and, hopefully, their audience — for those insights; it seems to some extent I learned those things by immersing myself in the act of producing the presentation.  I also came away from it with a feeling of artificiality about VR, but it's taken me years to work out why; and in the long run I didn't stay satisfied with the way I'd explored the comparison between the systems, which is part of why I'm writing this blog post now.

First of all, we need to choose what form our illustrations will take — that is, we have to choose our example "language".  Peterson's essay defines a toy conlang — Ergato — with only a few words and suffixes so that simply working through the examples, with a wide variety of different ways for the grammar to work, is enough to confer familiarity.  I liked his essay and imitated it, using a subset of Ergato for an even smaller language, to illustrate just the specific issues I was interested in.  Another alternative, since we're trying to explore the structure itself, might be to use pseudo-English with notes, like the translational gloss in the previous section but without the Old English at the top.  Some objections come to mind, though pseudo-English is well worth keeping handy in the toolkit.  The pseudo-English may be distracting; Ergato is, gently, more immersive.  The pseudo-English representation would be less compact than Ergato.  And a micro-conlang Ergato has more of the fun of conlanging in it.

The basic elements of reduced Ergato:

Verbs
  English    Ergato
  to sleep   sapu
  to pet     lamu
  to give    kanu

Nouns
  English    Ergato
  panda      palino
  woman      kelina
  book       kitapo
  man        hopoko
  fish       tanaki

Pronoun
  English    Ergato
  she        li

Conjunction
  English    Ergato
  and        i

Suffixes
  English             Ergato
  Valency Reduction   -to
  Past Tense          -ri
  Plural              -ne

Suffixes
  English                 Ergato
  Default Case            (unmarked)
  Special Case            -r
  Recipient/Dative Case   -s
  Oblique Case            -k
  Extra Case              -m
Table 5.

Peterson's essay had more verbs, especially, so he could explore various subtle semantic distinctions; for the structures I had in mind, I just chose one intransitive (sleep), one transitive (pet), and one ditransitive (give).

Quick review:  NA and EA concern core thematic roles of noun arguments to a verb: S=subject, the single argument to an intransitive verb; A=agent, the actor argument to a transitive verb; P=patient, the acted-upon argument to a transitive verb.  In pure NA and pure EA, two of the three core thematic roles share a case, while one is different; in pure NA, the odd case is accusative patient, the other two are nominative; in pure EA, the odd case is ergative agent, the other two are absolutive.  There are other systems for aligning verb arguments, but I was, mostly, only looking at those two and VR.
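
As a sanity check on that review, the two pure alignments can be sketched as mappings from core thematic roles to cases.  This little illustration is mine, not from Peterson's essay; the role and case names follow the paragraph above.

```python
# A minimal sketch of pure NA and pure EA alignment: each maps the three
# core thematic roles (S, A, P) onto cases, with one role the odd one out.

def assign_cases(roles, alignment):
    """Map each core thematic role to a case label.

    roles: the roles present -- {"S"} for an intransitive clause,
           {"A", "P"} for a transitive one.
    alignment: "NA" (nominative-accusative) or "EA" (ergative-absolutive).
    """
    if alignment == "NA":
        # the patient is the odd one out; S and A share the nominative
        table = {"S": "nominative", "A": "nominative", "P": "accusative"}
    elif alignment == "EA":
        # the agent is the odd one out; S and P share the absolutive
        table = {"S": "absolutive", "A": "ergative", "P": "absolutive"}
    else:
        raise ValueError("unknown alignment: " + alignment)
    return {role: table[role] for role in roles}

print(assign_cases({"S"}, "NA"))       # {'S': 'nominative'}
print(assign_cases({"S"}, "EA"))       # {'S': 'absolutive'}
print(assign_cases({"A", "P"}, "NA"))  # agent nominative, patient accusative
print(assign_cases({"A", "P"}, "EA"))  # agent ergative, patient absolutive
```

The point the mapping makes concrete: the intransitive S patterns with A under NA but with P under EA, which is the entire difference between the two pure systems.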

Word order was a question.  Peterson remarks that he finds SOV (subject object verb) most natural for an ergative language, and I find that too.  (I'll have a suggestion as to why, a bit further below.)  I'm less sure of my sense that SVO (subject verb object) is natural for a nominative language, because my native English is nominative and SVO, which might be biasing me (or, then again, the evolution of English might be biased in favor of SVO because of some sort of subtle affinity between SVO and nominativity).  But I found verb-initial order (VSO) far the most natural arrangement for VR.  So, when comparing these, should one use a single word order so as not to distract from the differences, or let each one put its best foot forward by using different orders for the three of them?  I chose at the time to use verb-initial order for all three systems.

Okay, here's how VR works, in a nutshell (skipping some imho not-very-convincing suggestions about how it could have developed, diachronically); illustrations to follow.  Argument alignment is by a combination of word order with occasional case-like marking.  By default, all arguments have the unmarked case, called valent; the first argument is the subject/agent, the second is the patient, and if it's ditransitive the third is the recipient.  Arguments can be omitted by simply leaving them off the end.  If an argument is marked with suffix -t, it's in the revalent case, which means that an argument was omitted just before it; the omitted argument can be added back onto the end.  To cover a situation that can only come up with a ditransitive verb, there's also a double-revalent case, marked by -s, that means two arguments were omitted.  (The simplest, though not the only, reason VR prefers verb-initial order is that, in order to deduce the meaning of an argument from its VR marking, you have to already know the valency of the verb.)
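
Since the VR rule is essentially a small algorithm, it can be mechanized.  Here's a sketch — my own construction, under one reading of the rules above — of how a listener could decode argument roles: keep the verb's roles in a queue; a valent (unmarked) argument takes the role at the front; a revalent argument first rotates one role to the back (double-revalent, two); the roles rotated to the back are the "omitted" ones that can be added back onto the end.

```python
from collections import deque

def decode_vr(valency, args):
    """Decode VR argument roles for a verb-initial clause.

    valency: 1, 2, or 3; the roles, in order, are agent, patient, recipient.
    args: list of (stem, marker) pairs, marker in {"", "t", "s"}
          for valent, revalent, and double-revalent.
    Returns a dict mapping each expressed role to its noun stem.
    """
    queue = deque(["agent", "patient", "recipient"][:valency])
    skips = {"": 0, "t": 1, "s": 2}
    filled = {}
    for stem, marker in args:
        for _ in range(skips[marker]):
            queue.append(queue.popleft())  # skipped role goes to the back
        filled[queue.popleft()] = stem     # marked noun takes the front role
    return filled

# "Kanu palinos kitapot kelina." -- the panda is being given the book
# by the woman
print(decode_vr(3, [("palino", "s"), ("kitapo", "t"), ("kelina", "")]))
# {'recipient': 'palino', 'patient': 'kitapo', 'agent': 'kelina'}
```

This also makes vivid why VR wants the verb first: the decoder can't even be started without knowing the verb's valency.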

A first step is to illustrate the three systems side-by-side for ordinary sentences.  To try to bring out the structure, such as it is, the suffixes are highlighted.  This would come out better with a wider page; but we'd need a different format if there were more than three systems being illustrated, anyway.

The woman is sleeping.
The woman is petting the panda.
The woman is giving the book to the panda.

NA                             EA                             VR
Sapu kelina.                   Sapu kelina.                   Sapu kelina.
Lamu kelina palinor.           Lamu kelinar palino.           Lamu kelina palino.
Kanu kelina kitapor palinos.   Kanu kelinar kitapo palinos.   Kanu kelina kitapo palino.
Table 6.

Valency reduction changes a transitive verb to an intransitive one.  Starting with a transitive sentence, the default-case argument is dropped, the special-case argument is promoted to default-case, the verb takes valency-reduction marking to make it intransitive, and the dropped argument may be reintroduced as an oblique.  In NA, this is passive voice; in EA, it's anti-passive voice.  VR is different from both, in that the verb doesn't receive a valency-reduction suffix at all (in fact, I chose revalent suffix -t on the premise that it was descended from a valency-reduction suffix that somehow moved from the verb to one of its noun arguments), and both the passive and anti-passive versions are possible.
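
The promotion-and-demotion dance can likewise be sketched mechanically.  This is my own illustration, using the suffix shapes from the tables (-r for the special case, -k for the oblique, -to for valency reduction) and the verb-initial order used throughout.

```python
# Generate pure-NA and pure-EA transitive clauses in reduced Ergato,
# with their valency-reduced (passive / anti-passive) counterparts.

def na_clause(verb, agent, patient, passive=False):
    """Pure NA, verb-initial; nominative unmarked, accusative -r."""
    if not passive:
        return f"{verb} {agent} {patient}r."
    # passive: drop the agent, promote the patient to nominative,
    # optionally re-add the agent as an oblique (-k)
    words = [verb + "to", patient] + ([agent + "k"] if agent else [])
    return " ".join(words) + "."

def ea_clause(verb, agent, patient, antipassive=False):
    """Pure EA, verb-initial; absolutive unmarked, ergative -r."""
    if not antipassive:
        return f"{verb} {agent}r {patient}."
    # anti-passive: drop the patient, promote the agent to absolutive,
    # optionally re-add the patient as an oblique (-k)
    words = [verb + "to", agent] + ([patient + "k"] if patient else [])
    return " ".join(words) + "."

print(na_clause("Lamu", "kelina", "palino"))                    # Lamu kelina palinor.
print(na_clause("Lamu", "kelina", "palino", passive=True))      # Lamuto palino kelinak.
print(ea_clause("Lamu", "kelina", "palino"))                    # Lamu kelinar palino.
print(ea_clause("Lamu", "kelina", "palino", antipassive=True))  # Lamuto kelina palinok.
```

Note the symmetry the code exposes: the two functions differ only in which argument gets dropped and which gets promoted.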

The woman is petting the panda.
The woman is petting.
The panda is being petted.
The panda is being petted by the woman

NA                       EA                       VR
Lamu kelina palinor.     Lamu kelinar palino.     Lamu kelina palino.
                         Lamuto kelina.           Lamu kelina.
Lamuto palino.                                    Lamu palinot.
Lamuto palino kelinak.   Lamuto kelina palinok.   Lamu palinot kelina.
Table 7.
There may be a hint, here, of why Ergativity would have an affinity for SOV word order.  In this VSO order, anti-passivization moves the absolutive (unmarked) argument from two positions after the verb to one position after the verb (or to put it another way, it changes the position right after the verb from ergative to absolutive).  Under SVO, anti-passivization would move the absolutive argument from after the verb to before it (or, change the position before the verb from ergative to absolutive).  But under SOV, the absolutive would always be the argument immediately before the verb.

This reasoning doesn't associate NA specifically with SVO, but does tend to discourage NA from using SOV, since then passivization would move the nominative relative to the verb.  On the other hand, considering more exotic word orders (which conlangers often do), this suggests NA would dislike VOS but be comfortable with OVS or OSV, while EA would dislike OVS and OSV but be comfortable with VOS.

Passivization omits the woman.  Anti-passivization omits the panda.  Pure NA Ergato has no way to omit the woman, pure EA Ergato has no way to omit the panda.  Pure VR can omit either one.  English can also omit either one, because in addition to allowing passivization of to pet, English also allows it to be an active intransitive verb — "The woman is petting."  (One could say that to pet can be transitive or intransitive, or one might maintain that there are two separate verbs to pet, a transitive verb and an intransitive verb; it's an artificial question about the structural description of the language, not about the language itself.)

Table 7 is a broad and shallow study; and by that standard, imho rather successful, as the above is a fair amount of information to have gotten out of it.  However, it's too shallow to provide insight into why one might want to use these systems (if, in fact, one would, which is open to doubt since natlangs generally aren't "pure" NA or EA in this sense, let alone VR).  A particularly puzzling case, as presented here, is why a speaker of pure EA Ergato would want to drop the panda and then add it back in exactly the same position but with the argument cases changed; but on one hand this is evidently an artifact of the particular word order we've used, and on the other hand DeLancey was pointing out that different languages may have entirely different motives for ergativity.

Using VR, it's possible to specify any subset of the arguments to a verb, and put them in any order.

VR                             English
Kanu kelina kitapo palino.     The woman is giving the book to the panda.
Kanu kelina palinot kitapo.    The woman is giving to the panda the book.
Kanu kitapot palino kelina.    The book is being given to the panda by the woman.
Kanu kitapot kelinat palino.   The book is being given by the woman to the panda.
Kanu palinos kelina kitapo.    The panda is being given by the woman the book.
Kanu palinos kitapot kelina.   The panda is being given the book by the woman.

Kanu kelina kitapo.            The woman is giving the book.
Kanu kelina palinot.           The woman is giving to the panda.
Kanu kitapot palino.           The book is being given to the panda.
Kanu kitapot kelinat.          The book is being given by the woman.
Kanu palinos kelina.           The panda is being given to by the woman.
Kanu palinos kitapot.          The panda is being given the book.

Kanu kelina.                   The woman is giving.
Kanu kitapot.                  The book is being given.
Kanu palinos.                  The panda is being given to.
Table 8.
And this ultimately is why it fails.  You can do this with VR; and why would you want to?  In shallow studies of the pure NA and pure EA languages, we could suspend disbelief there would turn out to be some useful way to exploit them at a higher level of structure, because we know those pure systems at least approximate systems that occur in natlangs.  But VR isn't approximating something from a natlang.  It was dreamed up from low-level structural concerns; there's no reason to expect it will have some higher-level benefit.  It's not something one would do without need, either.  It requires tracking not just word order, not just case markings, but a correlation between the two such that case markings have only positional meaning about word order, rather than any direct meaning about the roles of the marked nouns, which seems something of a mental strain.  It's got no redundancy built into it, and it's perfectly unambiguous in exhaustively covering the possibilities — much too tidy to occur in nature.  There's also no leeway in it for the sort of false starts and revisions that take place routinely in natural speech; you can't decide after you've spoken a revalent noun argument to use a different word and then say that instead, because the meaning of the revalent suffix will be different the second time you use it.

It's still a useful experiment for exploring the dynamics of alignment systems, though.

But just a bit further up in scale, we meet a qualitatively different challenge.

Relative clauses

Consider relative clauses.  In my revalency explorations ten years ago, I seem to have simply chosen a way for relative clauses to work, and run with it.  There was a Conlangery Podcast about relative clauses a while back (2012), which made clear there are a lot of ways to do this.  Where to start?  Not with English; too worn a trail.  My decade-past choice looks rather NA-oriented; so, how about an EA language?  Lots of languages have bits of ergativity in them — even English does — but deeply ergative languages are thinner on the ground.  Here's a sample sentence from a 1972 dissertation on relative clauses in Basque (link).

Aitak irakurri nai du amak erre duen liburua.
Father wants to read the book that mother has burned.
I had a lot more trouble assembling a gloss for this sentence than for the earlier example in Anglo-Saxon.  You might think it would be easier, since Basque is a living language actively growing in use over recent decades, where Anglo-Saxon has been dead for the better part of a thousand years; and since the example is specifically explicated in a dissertation by a linguist — one would certainly like to think that being explicated intensely by a linguist would be in its favor.  The dissertation did cover more of these words than Wiktionary did.  My main problem was with du/duen; I worked out from general context, with steadily increasing confidence, they had to be finite auxiliary verbs, but my sources were most uncooperative about confirming that.
aitak        irakurri       nai du         amak         erre           duen            liburua
father       to read        wants          mother       to burn        has done        book
(ergative)   (infinitive)   (desire has)   (ergative)   (infinitive)   (relativized)   (absolutive)
Basque is a language isolate — a natlang that, as best anyone can figure, isn't related to any other language on Earth.  Suggested origins include Cro-Magnons and aliens.

Basque is thoroughly ergative (rather than merely split ergative — say, ergative only for the past tense).  It's not altogether safe to classify Basque by the order of subject object and verb, because Basque word order apparently isn't about which noun is the subject and which is the object; it's about which is the topic and which is the focus; I haven't tackled that deeply enough to fully grok it, but it makes all kinds of sense to me that a language that thoroughly embraces ergativity would also not treat subject as an important factor in choosing word order, since subject in this sense is essentially the nominative case.  That whole line of reasoning about why SOV would be more natural for an ergative language than SVO or VSO exhibits, in retrospect, a certain inadequacy of imagination.  Also, most Basque verbs don't have finite forms.  Sort-of as if most verbs could only be gerunds (-ing).  Nearly all conjugation is on an auxiliary verb, which also determines whether the clause is transitive or intransitive — as if instead of "she burned the book" you'd say "she did burning of the book" (with auxiliary verb did).  There are also more verbal inflections than in typical Indo-European languages; the auxiliary verb agrees with the subject, the direct object, and the indirect object (if those objects occur).  I was reminded of noted conlang Kēlen, which arranges to have, in a sense, no verbs; if you took the Basque verbal arrangement a bit further by having no conjugating verbs at all beyond a small set of auxiliaries, and replaced the non-finite verbs with nouns, you'd more-or-less have Kēlen.

When a relative clause modifies a noun, one or another of the nouns in the relative clause refers to the antecedent — although in Basque the relative clause occurs before the noun it modifies, so say rather one of them refers to the postcedent.  In my enthusiastically tidy mechanical tinkering ten years ago, I worried about how to specify such things unambiguously.  Basque's solution?  Omit the referring word entirely.  Which also means you omit all the affixes that would have been on that noun in the relative clause; and Basque really pours on the affixes.  So, as a result of omitting the shared noun from the relative clause, you may be omitting important information about its role in the relative clause, thus important information about how the relative clause relates to the noun it modifies, leaving lots of room for ambiguity which the audience just resolves from context.  Now that's a natural language feature; I love it.

The 1972 dissertation took time out (and space, and effort) to argue, in describing this omission of the shared noun, that the omitted noun is deleted in place, rather than moved somewhere else and then deleted.  This struck me as a good example of what can happen when you try to describe something (here, Basque) using a structure (here, conventional phrase-structure grammar) that mismatches the thing described, and have to go through contortions to make it come out right.  It reminded me of debating how many angels can dance on the head of a pin.  The sense of mismatch only got stronger when I noticed, early in the dissertation's treatment, parenthetical remark "(questions of definiteness versus indefiniteness will not be raised here)".  He'd put lots of attention into things dictated by his paradigm even though they don't correspond to obvious visible features of the language, while dismissing obvious visible things his paradigm said shouldn't matter.

Like I was saying earlier:  determination to make one or another paradigm work is the wind in science's sails.

It's tempting to perceive Basque as a bizarre and complicated language.  Unremitting ergativity.  Massive agglutinative affixing.  Polypersonal agreement on auxiliary verbs.  Even two different "s" phonemes (that is, in English they're both allophones of the alveolar fricative).  I'm given to understand such oddness continues as one drills down into details of the language.  The Conlangery Podcast's discussion of Basque notes that it has a great many exceptions, things that only occur in one corner of the language.  But, there's something wrong with this picture.  All I've picked up over the years suggests there is no such thing as an especially complicated or bizarre natlang.  Basque is agglutinative, the simply composable morphological strategy that lends itself particularly well to morpheme-based analysis.  The Conlangery discussion notes that Basque verbs are extremely regular.  Standard Basque phonology has the most boring imaginable set of vowels (if you're looking for a set of vowels for an international auxlang, and you want to choose phonemes basically everyone on Earth will be able to handle, you choose the same five basic vowel sounds as Basque).  From what I understand of the history of grammar, our grammatical technology traces its lineage back to long-ago studies of either Sanskrit, Greek, or Latin, three Indo-European languages whose obvious similarities famously led to the proposal of a common ancestor language.  It's to be expected that a language bearing no apparent genetic relationship whatsoever to any of those languages would not fit the resultant grammatical mold.  If somehow our theories of grammatical structure had all been developed by scholars who only knew Basque, presumably the Indo-European languages wouldn't fit that mold well, either. 

The deep end

All this discussion provides context for the problem and a broad sense of what is needed.  The examples thus far, though, have been simple; even the Anglo-Saxon, despite its length.  There's not much point charging blindly into complex examples without learning first what there is to be learned from more tractable ones.  Undervaluing conventional structural insights seems a likely hazard of the functional approach.

My objective from the start, though, has been to develop means for studying the internal structure of larger-scale texts.  Not these single sentences, about which hangs a pervasive sense of omission of larger structures intruding on them from above (I'm reminded (tangentially?) of the "network" aspect of subterms in my posts on co-hygiene).  Sooner or later, we've got to move past these shallow explorations, to the deep end of the pool.

We've sampled kinds of structure that occur toward the upper end of the sentence level.  (I could linger on revalency for some time, but for this post that's only a means to an end.)  Evidently we can't pour dense information into our presentation without drowning out what we want to exhibit — interlinear glosses are way beyond what we can usefully do — so we should expect an effective device to let us exhibit aspects of a large text one aspect at a time, rather than displaying its whole structure at once for the audience to pick things out of.  It won't be "automatic", either; we expect any really useful technique to be used over a very wide range of structural facets with sapient minds at both the input and output of the operation — improvising explorations on the input side and extracting patterns insightfully from the output.  (In other words, we're looking not for an algorithm, but for a means to enhance our ability to create and appreciate art.)

It would be a mistake, I think, to scale up only a little, say to looking at how one sentence relates to another; that's still looking down at small structures, rather than up at big ones.  It would also be self-defeating to impose strict limitations on what sort of structure might be illustrable, though it's well we have some expectations to provide a lower bound on what might be there to find.  One limitation I will impose, for now:  I'm going to look at reasonably polished written prose, rather than the sort of unedited spoken text sometimes studied by linguists.  Obviously the differences between polished prose and unedited speech are of interest — for both linguistics and conlanging — but ordinary oral speech is a chaotic mess of words struggling for the sort of coherent stream one finds in written prose.  So it should be possible to get a clearer view of the emergent structure by studying the polished form, and then as a separate operation one might try to branch outward from the relatively well-defined structures to the noisily spontaneous compositional process of speech.  The definition of a conlang seems likely to be more about the emergent structure than the process of emergence, anyway.

So, let's take something big enough to give us no chance of dwelling in details.  The language has got to be English; the point is to figure out how to illustrate the structure, and a prerequisite to that is having prior insight (prior to the illustrative device, that is) into all the structure that's there to be illustrated.  Here's a paragraph from my Preface to Homer post; I've tried to choose it (by sheer intuition) to be formidably natural yet straightforward.  I admit, this paragraph appeals to me partly because of the unintentional meta-irony of a rather lyrical sentence about, essentially, how literate society outgrows oral society's dependence on poetic devices.

Such oral tradition can be written down, and was written down, without disrupting the orality of the society.  Literate society is what happens when the culture itself embraces writing as a means of preserving knowledge instead of an oral tradition.  Once literacy is assimilated, set patterns are no longer needed, repetition is no longer needed, pervasive actors are no longer needed, and details become reliably stable in a way that simply doesn't happen in oral society — the keepers of an oral tradition are apt to believe they tell a story exactly the same way each time, but only because they and their telling change as one.  When the actors go away, it becomes possible to conceive of abstract entities.  Plato, with his descriptions of shadows on a cave wall, and Ideal Forms, and such, was (Havelock reckoned) trying to explain literate abstraction in a way that might be understood by someone with an oral worldview.
Considering this as an example text in a full-fledged nominative-accusative SVO natlang, with an eye toward how the nouns and verbs are arranged to create the overall effect — there's an awful lot going on here.  The first sentence starts out with an example of topic sharing (the second clause shares the subject of the first; that's another thing I explored for revalency, back when), and then an adverbial clause modifying the whole thing; just bringing out all that would be a modest challenge, but it's only a small part of the whole.  I count a little over 150 words, with at least 17 finite verbs and upwards of 30 nouns; and I sense that almost everything about the construction of the whole has a reason to it, to do with how it relates to the rest.  But even I (who wrote it) can't see the whole structure at once.  How to bring it into the light, where we can see it?

The only linguistic tradition I've noticed marking up longer texts like this is incompatible with my objectives.  Corpus linguistics is essentially data mining from massive quantities of natural text; in terms of the functions of a Kuhnian paradigm, it's strong on methodology, weak on theory.  The method is to do studies of frequencies of patterns in these big corpora (the Brown Corpus, for example, has a bit over a million words); really the only necessary theoretical assumption is that such frequencies of patterns are useful for learning about the language.  Interestingly, btw, there is no apparent way to reverse-engineer the corpus-linguistics method so as to construct a language.  There is disagreement amongst researchers as to whether the corpus should be annotated, say for structure or parts of speech (which does entail some assumption of theory); but annotation even if provided is still meant to support data mining of frequencies from corpora, whereas I'm looking to help an audience grok the structure of a text of perhaps a few hundred words.  Philosophically, corpus linguistics is about algorithmically extracting information from texts that cannot be humanly apprehended at once, whereas I'm all about humanly extracting information from a text by apprehension.
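To make the contrast concrete, here's a minimal sketch of the corpus-linguistics style of pattern-frequency counting — a toy in Python, run over a made-up miniature corpus.  Real studies involve millions of words and far more careful tokenization; the point here is just how little theory the method needs.

```python
from collections import Counter

def ngram_frequencies(corpus, n):
    """Count the frequency of each n-word sequence in a corpus.

    A toy illustration of the corpus-linguistics method: the only
    theoretical assumption is that such frequencies are informative.
    """
    words = corpus.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

# A miniature "corpus" (one sentence of the example paragraph).
corpus = ("literate society is what happens when the culture itself "
          "embraces writing as a means of preserving knowledge "
          "instead of an oral tradition")
freqs = ngram_frequencies(corpus, 2)
print(freqs[("oral", "tradition")])  # → 1
```

Notice the asymmetry:  the function consumes text and emits numbers, and there's no inverse operation that consumes numbers and emits a language — which is why the method can't be reverse-engineered for conlang construction.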

We'd like display techniques to bring out issues in how the text is constructed; why various nouns were arranged in certain ways relative to their verbs and to other nouns, say.  Why did the first sentence say "the orality of the society" rather than "the society's orality"?  Why did the second sentence say "Literate society is what happens when" rather than "Literate society happens when" (or, for that matter, "When [...], literate society happens")?  More widely, why is most of the paragraph written in passive voice?  We wouldn't expect to directly answer these, but they're the sorts of things we want the audience to be able to get insight into from looking at our displays.

Patterns of use of personal pronouns (first, second, third, fourth), and/or animacy, specificity, or the like are also commonly recommended for study 'to get a feel for how it works'; though this particular passage is mostly lacking in pronouns.

A key challenge here seems to be getting just enough information into the presentation without swamping it in too much information.  We can readily present the text with a few elements —words, or perhaps affixes— flagged out, by means of bolding or highlighting, and show a small amount of text structure by dividing it into lines and perhaps indenting some of them.  Trying to use more than one means of flagging out could easily get confusing; multiple colors would be hard to reconcile with various forms of color-blindness; conceivably one might get away with about two forms of flagging by some monochromatic means.  But, how to deal with more than two kinds of elements; and, moreover, how to show complex relationships?

One way to handle more complex flags would be to insert simple tags of some sort into the text and flag the tags rather than the text itself.  Relationships between the tags, one might try to make somewhat more apparent through the text formatting (linebreaks and indentation).

Trying to ease into the thing, here is a simple formatting of the text, with linebreaks and a bit of indentation.

Such oral tradition        can be written down,
                               and was     written down,
  without disrupting the orality of the society.
Literate society is what happens when the culture itself embraces
    writing                               as a means of preserving knowledge
    instead of an oral tradition.
Once literacy is assimilated,
  set patterns         are no longer needed,
  repetition            is   no longer needed,
  pervasive actors are no longer needed,
  and details become reliably stable
    in a way that simply doesn't happen in oral society —
  the keepers of an oral tradition are apt to believe
    they tell a story exactly the same way each time,
  but only because they and their telling change as one.
When the actors go away,
  it becomes possible to conceive of abstract entities.
Plato, with his        descriptions of shadows on a cave wall,
                        and Ideal Forms,
                        and such,
  was (Havelock reckoned) trying to explain literate abstraction
    in a way
      that might be understood
        by someone
          with an oral worldview.
This brings out a bit of the structure, including several larger or smaller cases of parallelism; just enough, perhaps, to hint that there is much more there that is still just below the surface.  One could imagine discussing the placement of each noun and verb relative to the surrounding structures, resulting in an essay several times the length of the paragraph itself.  No wonder displaying the structure is such a challenge, when there's so much of it.

One could almost imagine trying to mark up the paragraph with a pen (or even multiple colors of pens), circling various words and drawing arrows between them.  Probably creating a tangled mess and still not really conveying how the whole is put together.  Though this does remind us that there's a whole other tradition for representing structure called sentence diagramming.  Granting that sentence diagramming, besides its various controversies, doesn't bring out the right sort of structure, brings out too much else, and is limited to structure within a single sentence, it's still another sort of presentational strategy to keep in mind.

Adding things up:  we're asking for a simple, flexible way to flag out a couple of different kinds of words in an extended text and show how they're grouped... that can be readily implemented in html.  The marking-two-kinds-of-words part is relatively easy; set the whole text in, say, grey, one kind of marked words in black, and a second kind of marked words (better perhaps to choose the less numerous marked kind) in black boldface.  For grouping, indentation such as above seems rather clumsy and extremely space-consuming; as an experimental alternative, we could try red parentheses.
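To make the html side of this concrete, here's a rough sketch (in Python, generating the html) of the two-kinds-of-marked-words scheme.  The word lists below are hypothetical stand-ins — deciding which words to mark is, of course, the whole interesting problem — and the red-parenthesis grouping is left out for simplicity.

```python
import html

# Hypothetical word classes for a fragment of the example text; which
# words actually deserve marking is the judgment call at issue.
MARKED_PLAIN = {"tradition", "orality", "society"}  # e.g. nouns: black
MARKED_BOLD = {"written", "disrupting"}             # e.g. verbs: black bold

def flag_words(text):
    """Set the whole text in grey, one marked class in black, and the
    other (preferably the less numerous) in black boldface."""
    out = []
    for word in text.split():
        key = word.strip(".,;—").lower()  # ignore trailing punctuation
        escaped = html.escape(word)
        if key in MARKED_BOLD:
            out.append('<b style="color:black">%s</b>' % escaped)
        elif key in MARKED_PLAIN:
            out.append('<span style="color:black">%s</span>' % escaped)
        else:
            out.append('<span style="color:grey">%s</span>' % escaped)
    return " ".join(out)

print(flag_words("Such oral tradition can be written down,"))
```

The grouping parentheses could be layered on similarly, say by emitting each "(" and ")" wrapped in a `<span style="color:red">` — which is roughly the experiment tried below.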

Taking this one step at a time, here are the nouns and verbs:

Such oral tradition can be written down, and was written down, without disrupting the orality of the society.  Literate society is what happens when the culture itself embraces writing as a means of preserving knowledge instead of an oral tradition.  Once literacy is assimilated, set patterns are no longer needed, repetition is no longer needed, pervasive actors are no longer needed, and details become reliably stable in a way that simply doesn't happen in oral society — the keepers of an oral tradition are apt to believe they tell a story exactly the same way each time, but only because they and their telling change as one.  When the actors go away, it becomes possible to conceive of abstract entities.  Plato, with his descriptions of shadows on a cave wall, and Ideal Forms, and such, was (Havelock reckoned) trying to explain literate abstraction in a way that might be understood by someone with an oral worldview.
Marking that up was something of a shock for me.  The first warning sign, if I'd recognized it, was the word "disrupting" in the first sentence; should that be marked as a noun, or a verb?  Based on the structure of the sentence, it seemed to belong at the same level as, and parallel to, the two preceding forms of write, so I marked "disrupting" as a verb and moved on.  The problem started to dawn on me when I hit the word "writing" in the second sentence, which from the structure of that sentence wanted to be a noun.  The word "preserving", later in the sentence, seems logically more of an activity than a participant, so feels right as a verb although one might wonder whether it has some common structure with "writing".  The real eye-opener though —for me— was the word "descriptions" in the final sentence.  Morphologically speaking, it's clearly a noun.  And yet.  Structurally, it's parallel with "trying to explain"; that is, it's an activity rather than a participant.

The activity/participant semantic distinction is a common theme in my conlanging.  I see this semantic distinction as unavoidable, although the corresponding grammatical and lexical noun/verb distinctions are more transitory.  My two principal conlang efforts each seek to eliminate one of these transitory distinctions.  Lamlosuo, the one I've blogged about, shuns grammatical nouns and verbs, though it has thriving lexical noun and verb classes.  My other conlang, somewhat younger and less developed with current working name Refactor, has thriving grammatical nouns and verbs yet no corresponding lexical classes.  (The semantic distinction is scarcely mentioned in my post on Lamlosuo; my draft post on Refactor, not nearly ready for prime time, has a bit more to say about activities and participants.)

In this case, had "descriptions" been replaced by a gerund —which grammatically could have been done, though the prose would not have flowed as smoothly (and why that should be is a fascinating question)— we'd have had the precedent, from earlier in the paragraph, of choosing to call a gerund a noun or verb depending on what better fits the structure of the passage.  Imagine replacing "descriptions", or perhaps "descriptions of", by "describing".  (An even more explicitly activity-oriented transformation would be to replace "with his descriptions of" by "when describing".)

The upshot is that I'm now tempted to think of noun and verb as "blue birds", in loose similarity to Delancey's doubts about ergativity.  I'm starting to feel I no longer know what grammar is.  Which may be in part a good thing, if you believe (as I do; cf. my physics posts) that shaking up one's thinking keeps it limber; but let's not forget, we're trying to aid conlanging, and the grammar of a conlang is apt to be its primary definition.

Meanwhile, building on the noun/verb assignments such as they are, here's a version with grouping parentheses:

(Such oral tradition ((can be written down), and (was written down)), without (disrupting (the orality (of the society.))))  (Literate society (is what happens (when the culture itself (embraces (writing as a (means of (preserving knowledge))) instead of (an oral tradition.)))))  ((Once literacy (is assimilated,)) (set patterns (are no longer needed,)) (repetition (is no longer needed,)) (pervasive actors (are no longer needed,)) and (details (become reliably stable (in a way that simply (doesn't happen (in oral society))))))(the keepers (of an oral tradition) (are apt to believe (they (tell (a story) (exactly the same way (each time,))))) but only because ((they and their telling) (change (as one))))(When (the actors (go (away,))) it (becomes possible (to (conceive of (abstract entities.)))))  (Plato, (with his descriptions of (shadows (on a cave wall,)) and (Ideal Forms,) and (such,)) (was (Havelock reckoned) trying to explain literate abstraction (in a way that (might be understood by someone (with an oral worldview.)))))
Maybe I should have been prepared for it this time, after the noun/verb marking shook my confidence in the notions of noun and verb.  Struggling to decide where to add parentheses here showing the nested, tree structure of the prose has convinced me that the prose is not primarily nested/tree-structured.  This fluent English prose (interesting word, fluent, from Latin fluens meaning flowing) is more like a stream of key words linked into a chain by connective words, occasionally splitting into multiple streams depending in parallel from a common point — very much in the mold of Lamlosuo.  Yes, that would be the conlang whose structure I figured could not possibly occur in a natural human language, motivating me to invent a thoroughly non-human alien species of speakers; another take on the anadew principle of conlanging, in which conlang structures judged inherently unnatural turn out to occur in natlangs after all.  In fairness, imho Lamlosuo is more extreme about the non-tree principle than English, as there really is an element of "chunking" apparent in human language that Lamlosuo studiously shuns; but I'm still not seeing, in this English prose, nearly the sort of syntax tree that grade-school English classes, or university compiler-construction classes, had primed me to expect.  (The tree-structured approach seems, afaict, to derive from sentence diagramming, which was promulgated in 1877 as a teaching method.)

So here I am.  I want to be able to illustrate the structure of a largish prose passage, on the order of a paragraph, so that the relationships between words, facing upward to large-scale structure, leap out at the observer.  I've acquired a sense of the context for the problem.  And I've discovered that I'm not just limited by not knowing how to display the structure — I don't even know what the structure is, not even in the case of my own first language, English.  Perhaps the tree-structure idea is due to having looked at the structure facing inward toward small-scale structure rather than outward to large-scale; but I'm facing outward now, and thinking our approach to grammatical structure may be altogether wrong-headed.  Which, as a conlanger, is particularly distressing since conlangs tend to use a conventionally structured grammar in the primary definition of the language.

Saturation point reached.  Any further and I'd be supersaturated, and start to lose things as I went along.  Time for a "reset", to clear away the general clutter we've accumulated along the path of this post.  Give it some time to settle out, and a fresh post with a new specific focus can select the parts of this material it needs and start on its own path.