
All the Words in All the Books

A few years ago I was involved in the secret negotiations with Google over the lawsuit by authors who objected to the company’s unauthorized copying of our books onto their servers. (We reached a good settlement, pleasing all sides, but there have been some objections, and court approval remains in doubt.)

As everyone recognized, Google was compiling a database that amounted to the holy grail: All the World’s Books. Apart from enriching the search engine, that raised some interesting possibilities. One was the idea of making this vast corpus available to scholars for pure research, into the statistics of language, cultural trends over history, the evolution of syntax, the life histories of catch phrases and clichés. What fun!

The settlement, if approved, will allow this to happen. While we wait for the judge, Google seems to be going ahead anyway. Patricia Cohen writes about it today in The Times. The journal Science has just published a paper titled “Quantitative Analysis of Culture Using Millions of Digitized Books,” signed by 14 scholarly authors, one of whom is “The Google Books Team.” They announce grandly:

We report the creation of a corpus of 5,195,769 digitized books containing ~4% of all books ever published. Computational analysis of this corpus enables us to observe cultural trends and subject them to quantitative investigation.

This is brilliant. I’m not in love with the unpretty catchphrase they’re promoting: “culturomics,” defined as the introduction of “quantitative methods into the study of culture.” But there is a lot to be learned from this extraordinary resource, and the authors lay out a buffet of possibilities: In political history, the detection of censorship and suppression. In lexicography, the discovery of some very low-frequency words. In celebritology, the rise and fall of famous names: they plotted trajectories for Virginia Woolf, Felix Frankfurter, Bill Clinton, and Steven Spielberg, and then “all 42,358 people in the databases of Encyclopaedia Britannica.”

A few facts:

  • The corpus contains more than 500 billion words, mostly from books in English but also in French, Spanish, German, Chinese, Russian, and Hebrew.
  • In the 1500s, new books were published at the rate of a few hundred thousand words per year. By 1800 it was 60 million words per year. By 2000 it was 8 billion.

The first cut at the data has been to count frequencies of individual words, then all the two-word combinations, and so on up to five-word combinations. These are the “n-grams” up to “5-grams.” They are sorted by year. So the trigram “I love you” can be seen to have taken off around 1920, soared during the next decades, peaked around 1975, and fallen off since then.
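For the curious, here is roughly what that kind of counting looks like in miniature. This is only a sketch in Python, not Google's actual pipeline: the function names `ngrams` and `count_by_year`, the whitespace tokenizer, and the two toy "books" are all my own assumptions for illustration.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def count_by_year(books, max_n=5):
    """Map each year to a Counter of n-gram tuples for n = 1..max_n.

    `books` is an iterable of (year, text) pairs (a made-up input format).
    """
    counts = defaultdict(Counter)
    for year, text in books:
        # Naive tokenizer: lowercase and split on whitespace (an assumption;
        # a real corpus would need careful handling of punctuation and OCR errors).
        tokens = text.lower().split()
        for n in range(1, max_n + 1):
            counts[year].update(ngrams(tokens, n))
    return counts


# Toy data, invented purely for this example.
books = [
    (1920, "I love you and I love you still"),
    (1975, "I love you"),
]
counts = count_by_year(books)
trigram = ("i", "love", "you")
for year in sorted(counts):
    print(year, counts[year][trigram])
```

A real comparison across years would presumably divide each count by the total number of words published that year, so that the sheer growth in publishing doesn't swamp the trend.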

What that tells us about love or about literature remains to be seen. Humanities scholars are bound to debate whether “quantitative methods” are all that desirable; whether we’re going to start counting the frequency of the letter e in Shakespeare’s sonnets. I don’t think it matters. The corpus is there. Let the games begin.

Best of all, anyone can play. You don’t have to show your scholarly credentials. Google is offering the whole world a simple window into the data. It’s still a work in progress—more books to scan, more scanning errors to correct—but one can go to the Books Ngram Viewer right now, type in some n-grams and see the results, colorfully graphed. Do your own culturomics. I tried dinosaurs:

Who knew that tyrannosaurus rex was such a late starter, culturally speaking?

More ideas? I’d love to hear what people discover.
