Author: Linguistic History

2. Me

In the last post, we saw that the origin of Language (with a capital L) could maybe be placed back at least a couple of hundred thousand years. But going that far back we know nothing at all about any specific, lower-case languages. There’s a truly enormous expanse of time after the origin of Language during which we can say very little concrete. We don’t even know, for instance, if humans started out speaking a single, “Proto-Human” language (a monogenesis scenario), or whether there were multiple languages from the get-go (polygenesis). A subtly different question is whether all human languages known today descend from a single Proto-World language (different, because it’s possible that there were multiple original languages, but only one happened to leave any surviving descendant languages). These and other related questions were discussed by Piotr Gąsiorowski in a series over on his excellent Language Evolution blog, so head on over there if you want to read more.

For my part, I’m going to fast-forward us through hundreds of millennia. Human languages, like humanity itself, probably developed in Africa, but already very long ago spread to every continent but Antarctica. I like to think of the main landmasses of the world as forming a rough three-pointed star: Africa is one point, another extends through Southeast Asia to Australia (when sea levels were lower, Australia formed a combined landmass with New Guinea — a continent called Sahul — which was separated from mainland Asia only by a fairly narrow strait), while the third crosses the Bering Strait (or, in earlier times, land bridge) to extend to the tip of South America.

Other than the fact that languages ended up across all these regions, we don’t know too much about what they were like, or what their ancient relationships were. In the Americas, for instance, some have claimed that most Native languages belong to a single super-family called “Amerind”, often thought to be a single language (or closely related group) brought over the Bering land bridge at some ancient date. But the existence of Amerind is not supported by any strong linguistic evidence, and the languages of Native America could well have developed from any number (large or small) of languages that were either wholly unrelated, or had already drastically diverged long before they came to the Americas. All we can say is that, however similar or different early human languages were in distant prehistory, by the time we catch any glimpses of direct linguistic data they had significantly diverged the world over into thousands of distinct languages, many of which (probably around half) have no discernible relationship to any other language. These are called language isolates.

This sort of divergence is not unexpected. Languages always change from generation to generation, and if a language comes to be spoken across an area of any size, different parts will change in different ways. Think of Latin spoken across the Roman Empire, and eventually diverging into the various Romance languages and dialects. This happened just over the last couple of millennia, in a language group that was, for a fair chunk of this period, limited to a relatively small geographic zone in Europe — and many of the Romance dialects remained in contact with one another, and changed along similar lines. Now imagine this sort of diversification, but multiply the timescale a hundredfold or more, and make the geographical span global and some of the geographical boundaries very sharp, and you’ll see why, even if there was a single “Proto-World”, the languages of this planet have gone so far down their own paths that any “deep” linguistic relationships are thoroughly obscured.

Before charging on about languages and prehistory, I want to pause for a moment on the word “relationship”. Linguists regularly use terms like “(linguistic) descent”, “relationship”, and “family”, but these are just metaphors. We’re talking only about the languages, not about the descent or genetics of the speakers, which can have its own, often very divergent history. Not all speakers of Romance languages are descended from ancient Italians, even though Latin came from Italy. Not all speakers of English are descended from people who once lived in the British Isles.

Back to the main story, I was saying that we can really say nothing at all about the truly deep linguistic relationships, going back tens of thousands of years or more. Still, scholars have of course tried to go as far back as they can, pushing the linguistic evidence to (and often beyond) the limit. The Amerind case is one example, and something similar has been proposed for the group of languages that includes English. Here, some linguists have attempted to reconstruct an ancient Old World macro-family called “Nostratic” (from the Latin adjective nostrat– “of our (country)” — you’ll be shocked to learn that this name was coined by a European scholar, Holger Pedersen). This hypothesis holds that a huge number of primarily (though not exclusively) Eurasian language groups all developed from a single early language, spoken (it is claimed) perhaps 14 000 to 17 000 years ago somewhere in Eurasia (the Fertile Crescent has been proposed as a location), and which eventually spread (in many long stages) across a large territory and many populations, diverging enormously in the process. The languages people have classified as Nostratic include Indo-European (much more on that in future posts), some of the languages of the Caucasus (including Georgian), the Uralic group (with languages such as Finnish and Hungarian), the circumpolar Eskimo-Aleut family, the Dravidian languages of southern India (e.g. Tamil), the enormous Afro-Asiatic family (including all the Semitic languages, as well as African languages like Ancient Egyptian, Berber, and Chadic), and even the first language to ever be written down, Sumerian (in what is now southern Iraq; Sumerian itself died out a long time ago). That’s a lot of languages, though still only a portion of those spoken in the Old World.

Map of the different branches of Nostratic, from Bomhard 2018, p. 312.

One potential relic of Nostratic grammar is the system of personal pronouns, where there are some similarities (superficial or otherwise) across at least a fair number of possibly-Nostratic languages. Take English me, for instance. We’re very certain that this pronoun has a long history, and we can reconstruct a Proto-Indo-European (again, more on what that is in future posts!) oblique pronoun stem *me– (“oblique” basically means that it’s used for functions other than the grammatical subject). Nostraticists point to first-person pronouns with m in many other languages (though many of these are not specifically oblique forms). For example, Allan Bomhard cites (pp. 339ff.), among other forms mostly meaning “I”, Georgian me-, men-, mena-; Finnish (Uralic) minä/minu-; Chukchi (spoken near the Bering Strait) ɣəm (the –m part is meant to be the original pronoun); Etruscan (an ancient language of Italy) mi; and Sumerian me-e (among other variants). All of this could, Nostraticists argue, reflect an ancient first-person pronoun *mi (the asterisk is important – it marks the form as a hypothetical reconstruction rather than a form actually directly attested).

How plausible is this Nostratic theory? It’s hard to evaluate. It’s certainly not a crackpot theory — people propose all sorts of fanciful linguistic connections all the time, most of which haven’t the slightest basis in reality. They can do this because if you compare enough languages, you’ll be bound to run across some curious linguistic coincidences. Finding a few vaguely similar-looking words with vaguely similar meanings in some languages doesn’t come remotely close to demonstrating, or even suggesting, a linguistic relationship. You need a lot more than one example like *mi-, and in the coming posts I’ll go a bit more into the kinds of standards that a linguistic theory needs to meet to be taken seriously. Nostraticists, unlike the crackpots, take these standards seriously, and they (or at least some of them) do attempt to meet the challenge of linguistically demonstrating that Nostratic once existed.

But… there are problems with doing work on Nostratic. For one thing, the number of languages involved is massive. I’ve already mentioned Allan Bomhard’s book, which runs to over 2700 pages! Treating all the linguistic data with the kind of close, detailed rigour that we ideally want is a tall order, and relatively few scholars are willing to invest the necessary time to fully engage with what might be a fruitless hypothesis. It’s hard to get the kind of critical dialogue going that we really want to see in scientific conversation. Furthermore, many of the potentially Nostratic languages and language families are under-researched in their own right, so that essential scholarly tools like reliable etymological dictionaries and historical grammars for particular languages often don’t exist. Another problem is that Nostratic ultimately rests on the comparison of proto-languages which are themselves reconstructed. Since each reconstruction involves a certain amount of uncertainty (often we have to decide which of several plausible hypotheses seems most likely, not which one is absolutely certain), Nostratic involves a compounding of uncertainties: different choices in, say, how to reconstruct Proto-Dravidian or Proto-Afroasiatic could have a significant impact on how comparable these proto-languages appear to be to other possibly Nostratic languages. One dramatic case is the potential “Altaic” language family. Some scholars have proposed that the Turkic languages, the Mongolic languages, and a group called Tungusic formed a linguistic family called Altaic. But many specialists now think that Altaic is not actually a real family, and that the arguments once used to support it don’t hold up under closer scrutiny. This is a clear case where changing judgements (and increasing knowledge) about one smaller group of languages could have obvious implications for the larger-scale Nostratic hypothesis, which assumes that Altaic is real, and a sub-group within Nostratic.

In short, Nostratic is now usually viewed as a highly speculative, unproven possibility. We might call it Schrödinger’s proto-language.

Whether or not Nostratic actually existed, we can be sure there were languages of some sort spoken throughout Eurasia, and that there was probably a long history of some languages spreading around, displacing others, diversifying, being displaced, and so on and so forth. We see all these things happening in more recent periods, and there’s no reason to think they didn’t also happen in the more distant past. One strand of these languages is what would eventually develop into English. Perhaps this pre-English passed through a Nostratic stage, or maybe not. Maybe English is related only to some of the supposedly Nostratic languages (maybe it’s distantly related to the Uralic languages, for example), or maybe not. However all that may be, eventually one particular distinct language did emerge from this mass of language, and by perhaps 5000 years ago (more on dates later!) had developed into a language we are able to talk about with much more certainty: Proto-Indo-European. This is the oldest stage of English that we can reliably cite specific words for, or talk about the grammar of.

In the following posts, we’ll start to explore Proto-Indo-European as the earliest reconstructible stage of (what would become) English. But as old as Proto-Indo-European is, we should remember that the past 5000 years probably constitutes a mere 2.5% or so (5000 out of roughly 200 000 years), or even less, of the time since the origin of human language. This post and the one before have already covered the overwhelming majority of the history of English, chronologically speaking.

Further Reading

Much of the literature on early language prehistory is fairly speculative, but there’s been some interesting and useful work done. Johanna Nichols has published prolifically, and her 1997 article “Modelling Ancient Population Structures and Movement in Linguistics” (Annual Review of Anthropology 26, pp. 359-384) is a good starting point, giving a global overview.

The classification and prehistory of Native American languages is a complex and fascinating area. For an overview, I recommend Lyle Campbell’s 1997 book American Indian Languages: The Historical Linguistics of Native America, especially chapters 3, 7, and 8. He also gives ample references to the earlier literature, if you’re interested. If anyone knows of a good survey that’s a little more recent, please let me know about it in the comments!

For Nostratic, a one-stop-shop is Allan Bomhard’s book, A Comprehensive Introduction to Nostratic Comparative Linguistics, which he has made freely available online. He’s obviously convinced of Nostratic’s validity, and interprets his linguistic data in that light (sometimes generously so), but it’s an intelligent and useful (and often wryly funny) survey of most aspects of the subject, including the history of research. For more on the me-type pronouns, and the interesting pattern where m-forms in the first person (“I/me”) recur with t-forms in the second person (archaic English thou, Latin/Romance tu, etc.), I recommend the page on the World Atlas of Language Structures Online (which is just in general a fantastic website if you’re interested in language). You might sometimes find the term Mitian used of languages with this pattern, coming from mi-ti-; this is often discussed as part of the Nostratic hypothesis, but it’s been looked at from other angles too.

There’s an interesting essay collection on Nostratic called Nostratic: Sifting the Evidence (ed. Salmons and Joseph, 1998), which contains essays from a wide range of perspectives, and is probably the closest thing we have to a critical dialogue about the family. This book contains Don Ringe’s important article on Indo-Uralic, “A Probabilistic Evaluation of Indo-Uralic” (pp. 153-197), which does not reject the idea of a connection between Indo-European and Uralic outright, but urges extreme caution: “Indo-Uralic is probably the part of the Nostratic hypothesis that is MOST likely to be correct; yet sober statistical testing of the relationship can barely establish it even probabilistically” (p. 187).

March 31, 2026
1. What

Let’s begin at the beginning. For the history of English, that ultimately means the beginnings of human language. Unfortunately, this is a rather hard subject to say much concrete about. The oldest written records in any language extend only about 5000 years into the past, and comparative linguistic reconstruction (something we’ll get into more in later installments) can only go a few millennia further back at best. We don’t know exactly when “modern human linguistic capacity” developed, but it was definitely at least many tens of thousands of years ago. To say anything about how human language developed, we have to make a lot of indirect inferences (also known as guesses), using evidence from primatology (looking at how our nearest evolutionary relatives communicate), genetics, the human fossil record (limited by the fact that language is most closely associated with squishy bits that preserve poorly, like the brain and larynx), archaeology, and the nature of language itself. Each of these things on its own makes for a rather complicated subject, with its own evidence and problems, and there’s no way I can really give even an adequate summary in one post.

Instead, I’m just going to touch on two key questions: when did humans develop the ability to use language, and — maybe more importantly — just what is “language” anyway?

The “when” of language is probably the easier question, but it’s still hard to pin down. One concrete anchor is that we’re pretty sure humans had fully-developed “linguistic capacity” by around 40 000 years ago at the latest. Why? Because that’s when the Aboriginal peoples of Australia — speaking the precursors to the native languages of Australia — started to become relatively isolated from the rest of the world.

But 40 000 years ago is just the absolutely most recent limit. Humans might have been, and probably were, speaking for a long time before this. There are a few considerations that bear on this question (when did the human larynx move into a position to allow vocalization? did sign language perhaps precede spoken language? when did certain genetic mutations possibly connected with linguistic capacity take place?), but none of them are really conclusive. Quite a few people working in these areas now see the period of around 200 000 to 250 000 years ago as a promising period, but I don’t think that we’re really at the point (at least not yet) where we can confidently talk about when language developed.

And there’s a complicating factor, which takes us to my second point: what is language? If we’re talking about when “language” developed, what features specifically do we mean? Defining “language” in any precise way gets a bit complicated, and takes us into debates I don’t really want to get into, but if we boil things down there are a few key things we can see as essential to language:

1) The existence of abstract signs to stand for things (the syllable cat can evoke the small furry creature sleeping on my lap right now, even though there’s no inherent connection between those sounds and the animal in question). Note that some non-human animals can do this, at least on a basic level, with arbitrary cries that distinguish between, for example, a threatening snake and a threatening eagle.

2) The ability for abstraction and generalization (we can say not just that there’s a snake there right now and you should be careful, but that there are usually snakes in a certain place).

3) The ability to qualify information in terms of things like when it happened, how trustworthy a source is, or the probability of a future occurrence.

4) The ability to ask questions of varying levels of complexity.

It’s not clear whether all of these (and other) key formal properties of language developed suddenly or gradually, together as a bundle or scattered across a long period of time. We could have a slow and steady development of language, a sudden “great leap forward”, or something in between, like a series of smaller hops, skips, and jumps. Ideally, we’ll eventually be able to tell a satisfactory story about exactly how and when each part of the human language capacity came to be. But for now, at least, we don’t have much in the way of answers, and the best we can do at the moment is to figure out how to ask relevant questions that we might be able to answer.

This is already a rather long post, but there’s one more point about the nature of language I want to touch on before wrapping this up, and I’ll approach it by quoting one of my favourite authors, J.R.R. Tolkien:

The human mind, endowed with the powers of generalisation and abstraction, sees not only green-grass, discriminating it from other things (and finding it fair to look upon), but sees that it is green as well as being grass. But how powerful, how stimulating to the very faculty that produced it, was the invention of the adjective: no spell or incantation in Faërie is more potent. And that is not surprising: such incantations might indeed be said to be only another view of adjectives, a part of speech in a mythical grammar. The mind that thought of light, heavy, grey, yellow, still, swift, also conceived of magic that would make heavy things light and able to fly, turn grey lead into yellow gold, and the still rock into a swift water. If it could do the one, it could do the other; it inevitably did both. When we can take the green from grass, blue from heaven, and red from blood, we already have an enchanter’s power — upon one plane…
(Tolkien, On Fairy Stories, quoted from Flieger and Anderson 2008, p. 41)

Tolkien goes on to invoke the phrase green sun as an example of how the basic possibilities of language can be exploited by the imagination to extend far beyond the communication of reality. His comments aren’t actually entirely sound from a linguistic perspective — for one thing, not all languages even have adjectives (though all can express attribution and predication, which are the main functions of adjectives) — but I like Tolkien’s discussion because it really hits at two key properties of language: its incredible flexibility, and its importance for humanity in ways that extend far beyond basic communication. Language can transmit all sorts of information (even information that is not simply false, but unreal), and this is unquestionably one of its distinctive and useful traits — but it is just one part of how language makes us human. As Tolkien points out, language is closely bound up with mythmaking and storytelling. And more prosaically, language is an essential part of how we socialize with other humans. When a group of humans hangs out chatting at the pub (say), no one involved may learn much — the information content transmitted over a long evening may be close to zero — but language will still play a crucial social role.

Everything I’ve said so far has been about Language, with a capital “L”: the human linguistic capacity. For most of the rest of this series, I’ll be focusing not on Language, but on a language, English (including its precursors, known by various labels at various stages). There’s nothing at all to say about specific languages during the time periods covered by this post, since those sorts of details won’t become available to us until a much more recent date. The next post in this series will race through the vast expanses of prehistory for which we can make only the most general of observations. After that we’ll finally be able to sink our teeth into something solid!

Note: The actual etymology of what isn’t terribly involved, since it’s a question word as far back as we can trace it. It’s undergone its share of sound changes, developing from something like *kʷod a few thousand years ago (this is preserved quite well in Latin quod), but we’ll get to these sorts of things in due course. It’s also shifted some of its specific uses a bit over time — among other things, in Old English hwæt served as an exclamative adverb, often occurring at or near the beginning of sections of discourse, famously including its use as the first word of Beowulf. It seemed appropriate to start this series off with this word as well.

Further Reading

The origin of language is an understandably fraught topic: it’s a question whose interest (touching even on the question of what it means to be human) extends far beyond our ability to discuss it concretely in a data-driven way. For a good overview of the subject, I recommend Andrew Carstairs-McCarthy’s chapter “Origins of Language” in The Handbook of Linguistics (2008, ed. Aronoff and Rees-Miller). For an interesting comparison of two very different approaches, you might contrast the anthropological discussion of Lieberman and McCarthy, “The Evolution of Speech and Language”, Handbook of Paleoanthropology (2015, ed. Henke and Tattersall) with the influential article “The Faculty of Language: What Is It, Who Has It, and How Did It Evolve?” by Hauser, Chomsky, and Fitch (2002, Science 298.22, pp. 1569-1579), written (unsurprisingly, given Chomsky’s involvement) from a more generative perspective. The full-length monograph Origins of Human Communication by Tomasello (2008) makes an argument based especially on the evidence from primatology, and makes for interesting reading.

I really want to emphasize that the literature on this topic is substantial, and fragmented across a number of different disciplines and specializations, and I am by no means an expert in this arena (which can sometimes seem a bit remote from the day-to-day work of a lot of linguistics). If you’ve noticed any mistakes, or think I’ve overlooked something important, please just let me know, and I’ll make any necessary corrections.

March 30, 2026
