Thursday, March 16, 2023

Word break conventions and emergent typology

I've been doing a lot of free-form writing in Koa this year and it's been a pretty revealing experience. There's nothing that exposes semantic gaps and structural shortcomings like trying to write complex, expressive prose; initially all my writing felt unbelievably clumsy, with none of the grace, sophistication or subtlety that I try to embody when I write in other languages I know well. After a month or two, though, I feel like I'm starting to find my voice in Koa -- or maybe more accurately, Koa is finding its own nascent voice.

This is really the first constructed language in which I've navigated this process and it's fascinating (and intimidating): coming from a place of only having written single, unconnected example sentences, how does the language in question construct, say, a whole paragraph? How does it flow structurally? I feel so practiced in other areas of language design, but here I'm just doing my best to move through it all in an intuitive way without getting hung up on my own anxiety. Someday I'll have to try to actually articulate some of these emergent principles, but I think they need time to emerge further first.

In the meantime, another thing that came up as I began keeping a regular journal in Koa was a discovery I only made when I tried to read what I'd written later on. For one thing, I knew theoretically that production and comprehension were different disciplines, but I wasn't quite prepared for just how unpracticed I was at understanding my own language. It makes sense: I'd really never had the opportunity to try to interpret speech or writing coming at me before! In response I added a word recognition module to my vocab learning program; previously it had only been testing me on production in the target language.

More surprisingly, though, it turned out that the way I've always represented Koa is kind of hard to parse. Here's an example block of text written in the traditional style:

Ta lai la ka ásulo ta la ko vúakupu e ko mivami, sii, ta mene la ko kóuva e tule lai la ni. Ni si vima poli lo kopato ve hua i cu misucu, ala he lopu poka i pea pono e ka lila ni sai i si kali. E ka tana i kali i koe ka sena. Hala kehe nu lu nike la ko mova ka kecu, ka nu lu ete la ko mupea ka háote nu ne kene koa.

As soon as I start to read it my eyes sort of go out of focus; with such a rapid stream of little words it's hard for me to keep track of where I am in the text, let alone where I am in the syntactic tree. As a result, over the past month I've been experimenting with writing roots with their particles attached to them. The precise rules about what should be attached and what should be left separate are still developing, but the essence of the system has come together nicely. Here's what that previous paragraph looks like with the new conventions:

Talai lakaásulota lakovúakupu e komivami, sii, tamene lakokóuva e tule lai laní. Nisivima poli lokopato ve hua i cumisucu, ala helopu poka i pea pono e kalílani sai i sikali. E katana i kali i koe kasena. Hala kehe nulunike lakomova kakecu, kanuluete lakomupea kaháotenu nekene koa.

Even though this was unfamiliar, I instantly found it massively easier to parse. Allison said that made sense to the extent that there were many more word shapes now for the brain to grab onto; it's also entirely clear which particles belong to which roots, and morpheme clusters mirror natural intonation groups. Here's an attempt to articulate the principles of the system.

1. Particles whose scope is a predicate -- regardless of how complex it is -- are written together with that predicate. This may require the use of additional accentuation where possessive pronouns and directionals are suffixed to the root.

ninasitemuláheta = "I couldn't make him leave"

2. Particles whose scope is a clause with a pronominal subject are joined to that clause (but see point 6)

nisánota lakomutulu kakúmumani = "I said it to make my teacher angry"

3. Particles whose scope is a clause with a full subject NP are separated from surrounding words

nitovo ko le Kéoni i cutule = "I hope that John will come"

4. Predicate clusters -- compounds and incorporated objects -- are written together, but plain adjectival phrases are not joined to their head nouns

kalopuviko = "the weekend," but
kapasano vime = "the last statement"

5. Pronominal particles follow the same rules as predicates when used as the head of an NP, but must be marked with an accent.

laní = to me
nahunú = none of us

6. Certain particles, principally with clause-level scope, are always written separately: i, e when it means "and," au, ai, ha when it means "if," ve when used as a complementizer, and ko when used alone as a complementizer (this list may not be exhaustive). Le is also separated from its head to avoid muddlement with capitalization and foreign words.
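To make the mechanics a bit more concrete, here's a minimal Python sketch of how rules 1, 5 and 6 might be applied to text that has already been analyzed by hand. It is not a real Koa tool and it does not work out scope on its own: the caller tags every morpheme, and the tag names ("particle", "root", "suffix", "pronoun", "free") as well as the function names are my own assumptions for this illustration. It also assumes that the extra accent always lands on the first vowel of the root or pronominal head, which matches the examples above but may not hold in general.

```python
import unicodedata

ACUTE = "\u0301"  # combining acute accent
VOWELS = set("aeiou")


def accent_first_vowel(word):
    """Put an acute accent on the first vowel, unless the word already has one."""
    if ACUTE in unicodedata.normalize("NFD", word):
        return word
    for i, ch in enumerate(word):
        if ch.lower() in VOWELS:
            return unicodedata.normalize("NFC", word[: i + 1] + ACUTE + word[i + 1:])
    return word


def join_morphemes(tagged):
    """Join hand-tagged (morpheme, kind) pairs into written words.

    kind is one of (hypothetical labels):
      "particle" - attaches to the front of the next root or pronoun
      "root"     - a predicate root
      "suffix"   - anything written after the root; triggers the accent
      "pronoun"  - a pronominal particle heading an NP (always accented)
      "free"     - a particle the caller has decided to write separately
    """
    words = []          # each entry is [prefix, head, suffixes]
    pending = []        # particles waiting for a head
    suffixable = False  # True only right after a root or pronominal head
    for morpheme, kind in tagged:
        if kind == "particle":
            pending.append(morpheme)
            suffixable = False
        elif kind == "free":
            # Stranded particles are written as one cluster here; the
            # right convention for this case is still an open question.
            if pending:
                words.append(["".join(pending), "", ""])
                pending = []
            words.append(["", morpheme, ""])
            suffixable = False
        elif kind in ("root", "pronoun"):
            head = accent_first_vowel(morpheme) if kind == "pronoun" else morpheme
            words.append(["".join(pending), head, ""])
            pending = []
            suffixable = True
        elif kind == "suffix":
            if not suffixable:
                raise ValueError(f"suffix {morpheme!r} has no root to attach to")
            words[-1][1] = accent_first_vowel(words[-1][1])
            words[-1][2] += morpheme
        else:
            raise ValueError(f"unknown kind: {kind!r}")
    if pending:
        words.append(["".join(pending), "", ""])
    return " ".join(prefix + head + tail for prefix, head, tail in words)


print(join_morphemes([
    ("ni", "particle"), ("na", "particle"), ("si", "particle"),
    ("te", "particle"), ("mu", "particle"), ("lahe", "root"), ("ta", "suffix"),
]))  # ninasitemuláheta
print(join_morphemes([("la", "particle"), ("ni", "pronoun")]))  # laní
```

Because the tagging is done by the caller, the hard part (deciding what a particle's scope actually is, and therefore whether it counts as "free") stays with the human writer; the sketch only shows that once that decision is made, the joining and accent placement are mechanical.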

One point of uncertainty: when a particle is written separately from its head but is itself within the scope of other particles, are those particles also separated or should they be attached to the "frontmost" one? For example, which of the following should be the convention?

nisánota lakole Kéoni i cutule
nisánota lako le Kéoni i cutule
nisánota la ko le Kéoni i cutule
"I said it so that John would come"

I'm not sure yet; I'll get back to you after more experimentation. I suspect a standard will shape itself over time.

A bunch of this, incidentally, may actually be an artifact of trying to smoosh Koa into an alphabetic writing system. If the language could be written with a syllabary rather than an alphabet, and if there were some marking that identified the stressed syllable of predicates -- in other words, if predicates were instantly differentiated visually from particles -- then there would be a much closer match between writing and Koa's native structure.

But what, then, is Koa's native structure? I had always thought of it as a basically isolating language, but one thing that really surprised me when I first saw text written with these new conventions is how...agglutinating it looks. I'm sort of shocked that I've never asked this question before, but...where does the structure of Koa really fit, typologically?

The language is certainly about as close as you can get to monoexponential in that each morpheme is (theoretically) encoding one and only one semantic, and since I've been thinking of all particles and predicates as individual "words," my unconsidered classification of isolating seemed justified. But looking at forms like this one from above...

ninasitemuláheta = "I couldn't make him leave"

...I really wonder on what grounds I would not call that a "word." A word constituting a complete sentence, with seven morphemes, which a Turkish speaker could feel right at home with. And if that resemblance isn't just incidental but in fact diagnostic, then classifying those first five morphemes as "particles" is obscuring something important: they're actually prefixes. Occupying slots, in a specific order. Like an agglutinative language.

I'm actually not sure how to make a ruling on this, and more thought and research may be required. Some of those particles certainly can stand on their own in certain contexts -- nate "no, I can't," or keka sa? ni "who is it? me" -- and maybe more revealingly, the pronouns can appear to be gapped: 

"the one who couldn't make him leave"

"the one I couldn't make leave"

On the other hand, I've vigorously maintained previously that gapping is in fact not the best explanation for these structures, despite the fact that it's possible to draw the trees that way. It may be that this new word break convention and the kinds of apparent agglutinative "words" it produces are themselves also obscuring some of the true nature of the base structures. Ultimately this is not a question of graphical representation -- whether we write ni na si te mu lahe ta or ninasitemuláheta -- but of what's really happening below the surface. And I'm starting to tie my brain in knots, which is a pretty clear sign that I need to put down this problem for a bit.

More to come, clearly.
