Text mining the Burney Corpus in R Part I.

This series of blogposts will document my adventures in applying textual analysis in R to Frances Burney’s corpus.

A screenshot of RStudio showing a corpus loaded.

Frances Burney (1752 – 1840) was a British novelist, born in King’s Lynn, UK, just before the Seven Years’ War. In what can be described as a ‘full life’, she wrote four novels between 1778 and 1814 and formed close (and fraught) friendships with Edmund Burke, Hester Thrale Piozzi, Warren Hastings, Samuel Crisp, and the daughters of Queen Charlotte and King George III. During five horrendous years at court as Second Keeper of the Robes (a position considered a great honour both for her and for her father, the musicologist (and total bastard) Charles Burney (1726 – 1814)), she saw the first pangs of George’s illness, Dr Willis’ attempts at a cure, and the Regency Crisis at first hand. She left court in 1790 with the friendship of the royal family and a pension of £100 a year, and went on to meet the French émigré constitutionalist Alexandre d’Arblay. He had been present at the storming of the Tuileries and the flight to Varennes, and had escaped capture by his own insurrectionist regime to flee, penniless, to Juniper Hall in Mickleham, Surrey, where he joined his friends de Staël, Lafayette, and Narbonne. It was with him that Frances would spend ten years on the Continent during the Napoleonic Wars, during which she wrote her final novel.

As I argue in my PhD thesis, about to be submitted to the History Faculty at Cambridge, Burney’s early interest in philosophy and history and her informed correspondence with politicians and philosophers, combined with recent work by Miranda Burgess on the philosophical and political function of romance plots, point to a wider project of political philosophy. Considering her own Catholic roots and connections, and the religious and political references which pepper her work, this has profound implications for what it meant to be an Anglican woman.

Nevertheless, and for good reason, this was well hidden. My PhD project has relied on extracting these opinions and accounts through context and paratexts in emended letters, diary entries, and an examination of the novel form.

This project forms part 1 of a wider – and much more difficult! – attempt to automate transcription of the archival research on which much of my own work was based. Once that’s done, applying machine learning and GIS will hopefully excavate details Burney expunged from her and her family’s letters in the final decades of her life.

I’d already installed RStudio in order to play around with the possum mapping project (more on which here), and had recently installed ggplot2 for another, forthcoming project about my book-reading habits, following the guide found here. A brief search revealed this ‘gentle introduction’, which is what I’ll be more or less following in this series of blogposts.

The first step, of course, was to assemble a data set. I chose her novels and diaries, downloaded them from Gutenberg, then amalgamated them into a single UTF-8 text file. Along the way I realised that the letters _to_ Frances Burney included in the public domain version of the diaries and letters would corrupt the corpus. Thankfully, Burney’s novels are quite long (to put it mildly): all four, along with a pamphlet she wrote in 1793 in support of the émigré French Catholic clergy, give us 1,172,137 words to play with.
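
For anyone wanting to try the amalgamation step themselves, here is a minimal sketch in base R of how it might be done. It isn't a record of exactly what I ran. The helper names (`combine_texts`, `count_words`) and the file names are made up for illustration, and the two tiny sample files stand in for the real Gutenberg downloads. It also doesn't strip the Gutenberg licence header and footer, and it counts words by splitting on whitespace, so a different tokeniser will give a somewhat different total.

```r
# Hypothetical helper: read several plain-text files and join them,
# in order, into a single UTF-8 text file.
combine_texts <- function(paths, out_path) {
  # Read each file line by line, marking the strings as UTF-8
  texts <- lapply(paths, readLines, encoding = "UTF-8", warn = FALSE)
  combined <- unlist(texts, use.names = FALSE)

  # Write the bytes out as UTF-8 regardless of the local encoding
  con <- file(out_path, open = "w")
  on.exit(close(con))
  writeLines(enc2utf8(combined), con, useBytes = TRUE)
  invisible(out_path)
}

# Hypothetical helper: rough word count, splitting on runs of whitespace.
count_words <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  tokens <- unlist(strsplit(lines, "\\s+"))
  sum(nzchar(tokens))  # ignore empty strings from leading whitespace
}

# Stand-in files so the sketch runs on its own; swap in the real downloads.
tmp_dir <- tempdir()
novel_a <- file.path(tmp_dir, "novel_a.txt")
novel_b <- file.path(tmp_dir, "novel_b.txt")
writeLines(c("Evelina entered the room.", ""), novel_a)
writeLines("Cecilia left  it again.", novel_b)

corpus_path <- combine_texts(c(novel_a, novel_b),
                             file.path(tmp_dir, "corpus.txt"))
n_words <- count_words(corpus_path)
print(n_words)
```

Keeping everything in one file makes the later steps simpler, at the cost of losing track of which novel a given passage came from; keeping the per-file list returned by `lapply` would be one way to hold on to that if it turns out to matter.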