Distant Reading: Jane Austen

Step 1: Choosing the Corpus

I wanted a corpus that I was familiar with but not yet well-versed in. I have been taking a Jane Austen class this semester and had only read a few of the novels by the time this project came around, so this seemed like an interesting task to take up. I knew that themes of marriage and social status run through a good portion of the novels, which seemed like a good starting point to branch out from and gain a different perspective on Austen’s work as a whole.

Step 2: Collecting the Corpus

I got all of the text from Project Gutenberg. I chose to only focus on her novels. For pre-washing, I only took out the text that was added in by Project Gutenberg.
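For anyone who wants to script this clean-up rather than do it by hand, here is a minimal sketch. It assumes the plain-text files wrap the book between lines beginning "*** START OF THE PROJECT GUTENBERG EBOOK" and "*** END OF THE PROJECT GUTENBERG EBOOK" (some older files say THIS instead of THE). The function name strip_gutenberg is my own, and the markers should be checked against the actual downloads.

```python
import re

# Assumed marker format; verify against the files you actually downloaded.
START_RE = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_gutenberg(raw: str) -> str:
    """Return only the text between the START and END marker lines.

    If a marker is missing, fall back to the start or end of the file.
    """
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw)
    return raw[begin:finish].strip()
```

This only removes what sits outside the markers, so title pages or transcriber's notes inside them would still need a manual look.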

Step 3: The Cycles

Note: Taken from my previous post.

Cycle 1

First question: How big of a theme is marriage throughout Jane Austen’s novels?

Going into this, I knew that most, if not all, of Jane Austen’s novels touch upon marriage in some way, whether as a major focus or as something that occurs to fulfill the typical marriage plot ending. What I wanted to look at was which novels were most concerned with marriage, and what circumstances might make it more or less prevalent.

To do this, I kept my search terms fairly general: marry*, marriage*, wedding*, husband*, and wife*. The words marry, marriage, wife, and husband were consistently at the top when looking at trending words, and were especially high in Emma, Persuasion, and Pride and Prejudice. Across all the books, the word wife stood out to me the most: it was the highest of these words for Emma and very close to the top for several of the other books.
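The same kind of wildcard search can be roughly approximated outside Voyant. The sketch below is not how Voyant works internally; it simply treats "stem*" as "any word starting with stem" and reports both raw counts and counts per 10,000 words, so that novels of different lengths can be compared. The names tokenize and wildcard_counts are my own.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:['’][a-z]+)?", text.lower())


def wildcard_counts(text, stems):
    """Count tokens that start with each stem, raw and per 10,000 words."""
    tokens = tokenize(text)
    total = len(tokens)
    counts = Counter()
    for tok in tokens:
        for stem in stems:
            if tok.startswith(stem):
                counts[stem] += 1
    per_10k = {s: (10000 * counts[s] / total if total else 0.0) for s in stems}
    return counts, per_10k
```

One thing this makes visible: a prefix search for marry catches marrying but not married, and wife does not catch wives, so the choice of stems shapes the counts.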

Looking at the context of the term, it seems to often come up in relation to men and the idea of having a wife, or being a wife.
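Looking at a term in context is usually called a keyword-in-context (KWIC) view. Here is a minimal sketch of the idea, assuming a plain string of text and a single-word keyword; the function name kwic and the default window of five words are my own choices, not Voyant's.

```python
import re


def kwic(text, keyword, width=5):
    """Return each occurrence of keyword with `width` words on either side."""
    words = re.findall(r"\S+", text)
    target = keyword.lower()
    lines = []
    for i, word in enumerate(words):
        # Compare on letters only so that "wife," and "Wife" both match "wife".
        if re.sub(r"[^a-z]", "", word.lower()) == target:
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{word}] {right}".strip())
    return lines
```

Run on the opening sentence of Pride and Prejudice with a width of three, this returns the single line "want of a [wife.]".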

Cycle 2

With this I go to my second question: What is the importance of having a title within these novels, specifically for men?

I looked up common English honorifics such as: Mr., Sir, and Lord. I also looked up the word gentleman, because it can be used to refer to a specific type of man within society, especially at the time these books were written.

The title Mr. is one of the most used words across all six Austen novels, appearing a total of 3,011 times, so it was not surprising that it was at the top of the trending terms. Sir comes second in all of the books, though it was not used nearly as much as Mr. Lastly, Lord and gentleman were both rarely used across all six novels.

This was interesting because it allowed me to think about the exclusivity of titles such as “Sir” and “Lord” as compared to a more general one such as “Mr.” The former titles can only be used by people of a specific status, while the latter is much more general. Another search using the word “baronet” shows that it’s only used 26 times. Perhaps this can be used to understand the audience that Austen is writing to, by looking at the class status of the characters she is writing about.

Cycle 3

This brings me to the final question: What role does wealth play across all of the novels?

I ended up searching for words such as fortune, pounds, money, and rich (keeping in mind that rich may also be used non-monetarily, and as a shortened version of Richard).
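Part of the ambiguity with rich can be reduced by matching only whole words, so that Richard and riches are not counted. This is a small sketch using Python's standard regular expressions; whole_word_count is a name I made up for illustration. It cannot tell the adjective from the nickname Rich or from a non-monetary use, which still needs a look at the context.

```python
import re


def whole_word_count(text, word):
    """Count case-insensitive occurrences of `word` as a complete word."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return len(pattern.findall(text))
```

Comparing this number with a plain substring count gives a quick sense of how many hits come from longer words such as Richard.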

Fortune ended up at the top of the trending terms across all of the books, and looking at the context in the different books, it was mostly used in relation to money. The word pounds was actually at the bottom for four out of the six books. It was second to last for Pride and Prejudice. But for Sense and Sensibility it was actually rather close in trending to fortune. Looking at the context, there seemed to be much more talk of money and specific amounts in Sense and Sensibility than in the other Jane Austen novels. The word money was interesting because in a lot of the novels it seems to be tied to talk of marriage.


This was an extremely interesting process, because it allows you to look at a corpus without having to actually read and understand the full context of everything that is being shown. The short context lines shown for individual words don’t reveal how the story is playing out. Instead, looking at single words allows concepts and ideas to be highlighted within a large body of text. This left room for more nuanced ideas to form about Austen’s novels, all starting from my original general question about marriage.

Step 4: NGram

For Google NGram, I decided to take a look at the words “money” and “marriage.” I set the year range from 1765 (10 years before Austen was born) to 1820 (3 years after her death) and used the British English corpus.

Across the whole range, “money” was far above “marriage” in frequency and peaked the highest in 1794, when Austen was about 19 years old. While I can’t tell whether “money” was this frequent in fictional literary texts specifically, there seemed to be more talk of money than of marriage during Austen’s lifetime.

I did another search, adding the words “wife” and “husband.” The words “wife” and “money” were often intertwined throughout the years, with “wife” even overtaking “money” a few times and going the lowest around 1815. “Husband” was just above “marriage” in the timeline, starting in 1865, but before that it fluctuated between the two. Just as in my Voyant searches, the word wife sticks out the most, especially now that its frequency is similar to that of money. Of course, the saying “correlation does not imply causation” applies here, but it is interesting to consider the possibilities, especially considering how marriage is often viewed as an economic institution.


As I stated in the overview of step 3, this was an extremely interesting experience. There are so many ways to look at themes and ideas within these novels that I had not considered before. Looking at highlighted words or phrases, even without context, gives the general frequency of themes and brings to the surface just how prevalent some things are across a large body of text, as well as how some things are left out, and what that could possibly mean. A lot of the analysis was interpretation, but I think that’s a good starting place from which to look even further.

I think Voyant and Google NGram could both be incredibly good starting points when looking at large bodies of text, as they allow general ideas to form. From there, further research can be done to narrow down the ideas, but this is a good process to get started. For a text analysis newbie, I repeat the idea that “correlation does not imply causation” and that this is simply a stepping stone to further research. Just because words are frequently used does not mean they are used in relation to one another, but the frequencies do allow ideas to form and point to whatever the next step is in order to find out more.

One thought on “Distant Reading: Jane Austen”

  1. Indeed, it’s critical to keep correlation and cause apart in distant reading!

    I found your analysis of Mr./Sir/Lord/gentleman to be really interesting. Would it be possible to see this as a class lexicon? I.e. Mr. is more middle class while “sir” is more aristocratic? In that case, perhaps it’s possible to chart this distinction across the corpus but also to see where there are spikes and valleys from novel to novel? I wonder too how Mr. and Sir figure in the NGrams, i.e. is there a rise in Mr/Mister vs. sir as the 18th century closes and the 19th century begins? And, might this show a shift in class relations and especially the classed audiences of the English novel?
