Corpus

Oxford Dictionaries have been making proud announcements recently about their research resource, the Oxford English Corpus (OEC).

A corpus is a collection of written material in machine-readable form that has been put together for linguistic research. The word is the Latin for body and is the source of several other English words, such as corpse, corporeal, corpulent, corpuscle, corps, and corporal. Corporate, of a business or firm, has the same origin, so body corporate is, etymologically speaking, a tautology. Crime writers are fond of corpus delicti, for the facts and circumstances surrounding a crime (literally, it means “body of offence”). And there are many medical terms for bits of the body that include it, such as corpus callosum and corpus luteum. English writers have been using corpus for a body of writing since the early eighteenth century (the first known usage was in Chambers’ Cyclopaedia in 1727); linguists started to use it in this specialised sense only in the 1950s.

The OEC began in 2000 and by April this year had grown to contain more than a billion (a thousand million) words — not all different, of course: the humble word the alone appears about 50 million times. It represents every type of English, from literary novels, specialist journals, newspapers and magazines to the text of Hansard and the contents of chatrooms, e-mails, and weblogs. All the material has been trawled from the World Wide Web using a custom-built web crawler (similar to those that search engines like Google use to index the Web), so there are probably some World Wide Words pages in it somewhere. The OEC also includes every regional variety of the language, not only the major ones of the UK and US that make up about 80% of the total, but also material from the Caribbean, India, Singapore, Hong Kong, and South Africa.
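The difference between running words and different words can be made concrete with a small sketch. The Python below is only an illustration on a made-up sentence, not the OEC's own tooling; the function name token_stats and the crude regular-expression tokeniser are assumptions introduced here. It counts the running words (tokens), the different words (types), and the share of the total taken up by one chosen word.

```python
import re
from collections import Counter

def token_stats(text, word):
    """Return (running words, different words, share of `word`).

    Hypothetical helper: the tokeniser simply lower-cases the text
    and keeps runs of letters and apostrophes.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    share = counts[word] / len(tokens) if tokens else 0.0
    return len(tokens), len(counts), share

# Toy sentence, invented for this example.
sample = "The cat sat on the mat and the dog sat on the rug"
total, distinct, share = token_stats(sample, "the")
print(total, distinct, round(share, 3))  # 13 8 0.308

# The article's own figures: about 50 million occurrences of "the"
# in a corpus of about a billion words.
print(50_000_000 / 1_000_000_000)  # 0.05
```

On the figures quoted above, then, the alone accounts for roughly one word in every twenty in the corpus.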

One of its more valuable features is that researchers can discover which words most often appear together, for which the dictionary makers’ term is collocation. When, for example, researchers examine the corpus for verbs that are most often used with man or boy but not woman or girl, they find that men assault, hijack, crouch, kidnap, rob, grin, shoot, dig, stagger, leap, invent, or brandish. But they don’t consent, faint, sob, cohabit, undress, clutch, scorn, or gossip because, according to the corpus, that’s what women do. Eccentric usually appears with words like endearingly, old, and millionaire, suggesting that only elderly, wealthy people can be eccentric, the rest of us just being plain crazy. The word vivacious most often appears with beautiful, young, blonde, outgoing, and intelligent, and the corpus evidence makes clear that women may be vivacious but men may not. Evidence like this helps to tie down what we really mean by words; for example, vivacious is now defined in the Oxford Dictionary of English as “(especially of a woman) attractively lively and animated”.
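One simple way to look for collocations is to count which words fall within a few words of a chosen target. The Python sketch below does only that, on three invented sentences; it is not how the OEC software works, which is not described here and is certainly more sophisticated. The function name collocates, the window of two words on either side, and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter

def collocates(text, target, window=2):
    """Count words found within `window` tokens of each `target`.

    Hypothetical helper: it ignores grammar and sentence boundaries,
    so a neighbour may come from the sentence before or after.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target word itself
                counts[tokens[j]] += 1
    return counts

# Toy text, invented for this example.
sample = (
    "An endearingly eccentric old man lived next door. "
    "The eccentric millionaire kept llamas. "
    "Her eccentric old aunt collected clocks."
)
result = collocates(sample, "eccentric")
print(result.most_common(1))  # [('old', 2)]
```

Raw counts like these favour very common words such as the, which is one reason real collocation tools usually weight the counts statistically rather than report them as they stand.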

To mark the publication of the revised 11th edition of the Concise Oxford Dictionary, a list of the 25 most common nouns in the corpus has appeared this week. It is a mark of our hurrying lifestyle that the word at the top of the list is time. The complete list, in decreasing order of frequency, is: time, person, year, way, day, thing, man, world, life, hand, part, child, eye, woman, place, work, week, case, point, government, company, number, group, problem, and fact. You could erect a novel on that scaffolding.