MTA subway data project Part One.

Webscraping is a pretty fun beginner’s project, and something I’ve already touched upon in Sweigart’s Automate the Boring Stuff with Python. Sweigart’s downloading of every xkcd, while fun, isn’t much use for grabbing data en masse for data science.

Enter the NYC subway data!

While I could have manually downloaded each txt file, it’s much more useful experience to get Python to do it. I followed the instructions here, which used the standard BeautifulSoup package and syntax to grab more than enough files.
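The link-gathering step can be sketched roughly as follows. This is not the code from the instructions I followed: it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages, and the sample HTML, the file names, and the base URL are all made-up placeholders rather than the real MTA page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TxtLinkParser(HTMLParser):
    """Collect the href of every <a> tag that points at a .txt file."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".txt"):
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


def find_txt_links(html, base_url):
    """Return absolute URLs for all .txt links in an HTML string."""
    parser = TxtLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical stand-in for the downloaded index page.
SAMPLE_HTML = """
<a href="data/turnstile_170101.txt">Week one</a>
<a href="about.html">About</a>
<a href="data/turnstile_170108.txt">Week two</a>
"""
BASE_URL = "http://example.com/developers/turnstile.html"  # placeholder URL

links = find_txt_links(SAMPLE_HTML, BASE_URL)
```

Each URL in links could then be saved to disk with something like urllib.request.urlretrieve, ideally with a short pause between requests.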

The next instructions for cleaning up the data came from here. The problem, however, was that they assumed one single file, and not a set of more than twenty 25 MB files.

The first thing to do was to merge the files into one CSV. I used a simple brute-force method found on Stack Exchange to create a 1 GB+ txt file. But when I ran it through the code in tabular rasa, it resulted in a number of dtype errors about “mixed data types in columns 9/10”. Hm. After a good hour or so of trying to specify the data types and searching Stack Exchange, I… realised that I’d merged the files without paying enough attention to the content of each individual file. In other words, by brute-forcing the merge I had repeated the header of every file partway through the data, and the main program was attempting to interpret those column names as entrances / times respectively.
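With hindsight, a quick check for stray header rows would have saved that hour. Here is a minimal sketch, assuming the header is simply the first line of the merged file; the function name and the column names in the sample are invented for illustration and are not the real turnstile layout.

```python
def find_repeated_headers(lines):
    """Return the 1-based line numbers where the first line reappears."""
    lines = iter(lines)
    header = next(lines, None)
    if header is None:
        return []
    # Line 1 is the header itself, so counting starts at line 2.
    return [n for n, line in enumerate(lines, start=2) if line == header]


# Hypothetical miniature of two files concatenated, headers and all.
merged = [
    "STATION,DATE,TIME,ENTRIES\n",
    "59 ST,01/01/2017,03:00:00,5991546\n",
    "STATION,DATE,TIME,ENTRIES\n",
    "59 ST,01/08/2017,03:00:00,6001234\n",
]
repeats = find_repeated_headers(merged)
```

Because an open file object can be passed in place of the list, the same function should work on a very large merged file without loading it all into memory.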

Back to square one. Thanks to Stack Exchange, I found another way of doing it, which can be found below:

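import glob

# Every .txt file in the current directory (output.csv is not matched).
interesting_files = glob.glob("*.txt")
header_saved = False
with open('output.csv', 'w') as fout:
    for filename in interesting_files:
        with open(filename) as fin:
            # The first line of every file is its header.
            header = next(fin)
            # Write the header once only, from the first file.
            if not header_saved:
                fout.write(header)
                header_saved = True
            # Copy the remaining data rows unchanged.
            for line in fin:
                fout.write(line)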

This, however, didn’t work at first: my inputs were text files in CSV format, and the script had been written for Python 2.x while I was using Python 3.x. Changing the output file mode from wb to w fixed this, and produced a monstrous 3 GB CSV.
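My best guess at why wb failed: in Python 3, a file opened in binary mode only accepts bytes, whereas lines read from a text-mode file are str. The snippet below is a small demonstration of that behaviour under that assumption; it is not a record of the actual error I saw, and the sample line is invented.

```python
import os
import tempfile

line = "STATION,DATE,TIME\n"  # a str, as yielded by a file opened in text mode

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.csv")

    # Binary mode expects bytes, so writing a str fails on Python 3.
    try:
        with open(path, "wb") as fout:
            fout.write(line)
        binary_mode_error = None
    except TypeError as exc:
        binary_mode_error = exc

    # Text mode accepts str directly.
    with open(path, "w") as fout:
        fout.write(line)
    with open(path) as fin:
        written = fin.read()
```

The alternative would be to keep wb and open the inputs with rb as well, so that everything stays as bytes from start to finish.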

In the next post, we’ll go on to data visualisation, and in the third, we’ll discuss applications.

Evelina project.

So as part of my wider attempts to automate archival transcription, I’ve become interested in automatic identification of authors. I can see two main uses for this in Burney studies: first, figuring out who wrote what in the redrafting of correspondence late in Burney’s life by her wider circle; and second, settling once and for all which books by Mrs Meeke were written by Bessie Allen, Burney’s ne’er-do-well stepsister.

As a first step, I’ve been following this project. AFAICT, if everything runs smoothly, the only issue will be cleaning up the data before feeding it in if we want to apply it to authors beyond Kaggle’s data set.

Web-scraping in Python I.

As previously stated, I’ve had some experience with using tweepy and Python to extract data via Twitter’s API. This project, however, dispenses with Twitter altogether. Some weeks ago, I was lucky enough to find three books from the early eighteenth century in the Amnesty International bookstore on Mill Road, Cambridge.

A quick google later suggested a _large_ investment. But early modern books are rare, their appearance on the market volatile, and so keeping track of their price and judging when to sell is difficult.

This project aims to collect several years’ worth of data to make these decisions smarter.

Text mining the Burney Corpus in R Part I.

This series of blogposts will document my adventures in applying textual analysis in R to Frances Burney’s corpus.

A screenshot of RStudio showing a corpus loaded.

Frances Burney (1752 – 1840) was a British novelist, born in King’s Lynn, UK, just before the Seven Years’ War. In what can be described as a ‘full life’, she wrote 4 novels between 1778 and 1814, forming close (and fraught) friendships with Edmund Burke, Hester Thrale Piozzi, Warren Hastings, Samuel Crisp, and the daughters of Queen Charlotte and King George III. During five horrendous years at court as Second Keeper of the Robes (a position which was considered to be a great honour for her and for her father, the musicologist (and total bastard) Charles Burney (1726 – 1814)), she saw the first pangs of George’s illness, Dr Willis’ attempts at a cure, and the Regency Crisis at first hand. Leaving court with the friendship of the royal family and a pension of £100/year in 1790, she met the French emigre constitutionalist Alexandre d’Arblay, who’d been present at the storming of the Tuileries and the Flight to Varennes, and had escaped capture by his own insurrectionist regime to flee, penniless, to Juniper Hall in Mickleham, Surrey, where he joined his friends de Staël, Lafayette, and Narbonne. It was with him that Frances would spend ten years on the continent during the Napoleonic Wars, during which she wrote her final novel.

As I argue in my PhD thesis, about to be submitted to the History Faculty at Cambridge, Burney’s early interest in philosophy and history and her informed correspondence with politicians and philosophers, combined with recent work on the philosophical and political function of romance plots by Miranda Burgess, point to a wider project of political philosophy. Considering her own Catholic roots and connections, and the religious and political references which pepper her work, this has profound implications for what it meant to be an Anglican woman.

Nevertheless, and for good reason, this was well hidden. My PhD project has relied on extracting these opinions and accounts through context and paratexts in emended letters, diary entries, and an examination of the novel form.

This project forms part 1 of a wider – and much more difficult! – attempt to automate transcription of the archival research on which much of my own work was based. Once that’s done, applying ML + GIS will hopefully excavate details Burney expunged from her and her family’s letters in the final decades of her life.

I’d already installed RStudio in order to play around with the possum mapping project (more on which here), and had recently installed ggplot2 (for another, forthcoming project about my book-reading habits, following the guide found here). A brief search revealed this ‘gentle introduction’, and this is what I’ll be more or less following in this series of blogposts.

The first step, of course, was to assemble a data set. I chose her novels and diaries, downloaded them from Gutenberg, then amalgamated them into a single UTF-8 text file. Along the way I realised that the letters _to_ Frances Burney included in the public domain version of the diaries and letters would corrupt the corpus. Thankfully, Burney’s novels being quite long (to put it mildly), all 4, along with a pamphlet she wrote in 1793 in support of the emigre French Catholic clergy, give us 1,172,137 words to play with.
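For the curious, the amalgamate-and-count step looks roughly like this. It is in Python rather than R, to match the merging script from the subway post, and it is only a sketch: splitting on whitespace is an assumed definition of a ‘word’ (a different tokeniser will give a slightly different total), and the two sample strings are stand-ins for the Gutenberg texts.

```python
def amalgamate(texts):
    """Join several texts into one corpus, one after another."""
    return "\n".join(texts)


def count_words(text):
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(text.split())


# Hypothetical stand-ins for the downloaded Gutenberg files.
sample_texts = [
    "EVELINA.\nLetter I.",
    "CECILIA; or, Memoirs of an Heiress.",
]
corpus = amalgamate(sample_texts)
total = count_words(corpus)
```

With real files, each text would be read with open(path, encoding="utf-8") before being passed to amalgamate.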

First article.

My first article, on Frances Burney d’Arblay’s attitude to politeness in the eighteenth century, is now available. You can find the pdf by clicking here, and the abstract is below:

The influence of courtesy literature on Frances Burney’s Cecilia (1782) has been well documented. Yet the question of religion remains overlooked. This article both reasserts the Anglican nature of Cecilia’s behaviour and asserts the Catholicism of the Delvile family. It argues that Cecilia constitutes a sustained engagement with the Gordon riots of 1780 and critiques the utility of female politeness as a social glue. In a romance plot that reflects contemporary legal attempts to reconcile Britons after centuries of religious warfare, Burney ultimately suggests that politeness lacks the vocabulary with which to confront social and economic inequalities.


PhD progress.

I have reached the beginning of Easter term, though for postgraduates such terms (ha!) are pretty much meaningless. I know it’s something of a cliche, but the time really does fly, and I’m getting to the point where the hand-in date, though 2.5 years away, doesn’t look half as remote as it did at the start. I hope my creative writing discipline of 1k words a day, every day, come rain or shine, kicks into action. Indeed, having written 3 full manuscripts already, an 80k thesis doesn’t seem quite so… daunting, if that’s the right word. Of course, everything about the PhD is daunting, but those things – archives, notes, general research – are mostly connected to meticulous note-keeping rather than sitting down and writing the damned thing.


I have never blogged about what my PhD is about, exactly. But then ‘what a PhD is about’ is rather like asking a writer what their novel is ‘about.’ The short answer is ‘religious toleration and British national ident[y]ies in the 18th century’; if you’ve made the mistake of looking interested, I’d add ‘how Frances Burney’s novels reflected her inner struggle between her conservatism, Anglicanism, and French and Catholic sympathies.’ Most people, unless they have a secret love for 18th century literature or have studied for an English degree, won’t have heard of Burney, so to avoid people’s eyes glazing over I tend to eliminate that part and move the conversation swiftly on.


But since you’ve read this far: Frances Burney d’Arblay (1752 – 1840) wrote 4 novels (each declining in popularity) and several dramas, and has only in the last 30 years grown out of being a footnote to Austen. Critics and biographers have all argued that her love for her French Roman Catholic grandmother Francis Sleepe was formative in the development of her social criticism, a criticism which, their work argued, marked her out as a great, socially astute writer. Yet despite the critical attention given to this love, almost nothing has been said about religion in her novels or wider life. This is obviously a bit odd, especially when we consider not only the garish yet sympathetic Franco-British character of Madame Duval in Evelina and the prominence of religion in the 18th century, but also how quietly yet persistently sympathetic she is to France and Catholics in her novels as a whole. Then there’s her life: her father (the musicologist Charles Burney) certainly feared her love for her grandmother was a potential source of conversion, and though she never did convert as far as we can tell, she did marry a Roman Catholic French emigre general in a Roman Catholic ceremony, and spent ten years in Paris amongst a group of Catholic friends at the height of the Napoleonic Wars.

What I broadly argue is that her novels not only show her to be deeply sympathetic to the Britishness of Roman Catholics in the late eighteenth century, but also reveal how her own national identity was split between the need to conform to the sectarian Protestantism of Britishness, and an understanding of how this Britishness was predicated on the misrepresentation of loyal English Catholics as ‘papists.’ British identity, she understood, was a voracious, colonising thing, kicking out Catholics from history and community, denying their local toleration, and seeking to assert the pure claims of Protestantism to British bodies, culture, history, and land.

It’s particularly interesting, then, that her long life involved correspondence and friendships with some of the highest cultural, political, and theological forces. She spent five years at court, sparking a loving correspondence with the princesses that lasted for decades. She knew Burke, Johnson, and Thrale, and attended the trial of Warren Hastings. Her husband and his group of friends – the Juniper Hall set – were French constitutionalist emigres. Her own and her family’s correspondents reach to Pitt, to Shute Barrington, to the Plowdens. Her life was interwoven with an eighteenth-century society struggling with questions of emancipation and revolution. That’s what makes my project so interesting for historians, and partially why I’m in the history faculty: now that we’ve (well, I’ve) noticed these connections, it gives us a new perspective on the formation of national identity, Catholic emancipation, and the lived experience of ‘Britishness.’

A lot of the historiography so far on British national identity has focussed on the big questions: i.e. whether it was anti-papist or not, to what extent the state was confessional (i.e. sectarian) or not, the various theological and political wrangling that went on around the government, the extent to which it – and Protestantism – was influenced by what went on in Europe, and more recently, how Britishness rubbed against other national identities in Britain. But very little has been said about how local identities rubbed against this, or how a British subject weighed their own local identities against the overbearing legal force of the state (tentative answer: with a lot of angst and use of toleration-filled kinship networks to get around the worst of state repression). Burney gives us such a record.

Similarly, historians of Catholicism have spent a lot of time in the last 50 years dragging the discipline out of its recusant corner and into the wider historiography. Many of the early Catholic historians were Catholics themselves: either lay members or, in Aveling’s case, a member of a religious community. There’s nothing wrong with this, of course, but a tendency to write insular histories seals off the discipline from wider historiographical currents and tends towards narratives of self-fashioning. (Of course, all history and historians can be guilty of the charge, hence why it’s important to be read and critiqued as widely as possible.) Again, current trends have focussed on the big questions still to be answered in the wider period: what was Catholic life like in the first half of the 18th century? To what extent were Catholics integrated into society? (Answer: much more than we thought.) They have also picked away at the wider issue of Catholic involvement in public life in a deeply oppressive society. But again, little has been said about the actual day-to-day lives of English Catholics under Britishness.

To some extent this is the fault of what survives: little enough primary material of 18th-century lives survives, and criticising the government was risky business for anyone, let alone an Anglican, Catholic-sympathising woman.

This is why Burney’s lives and selves, hidden and whispered between the lines, are so exciting.

Cambridge offer.

Good news in my inbox the other day: I’m going to receive a Vice-Chancellor’s award from the University of Cambridge (fees + stipend) for a PhD in History at Queens’, Cambridge from Michaelmas 2015. I may still receive an AHRC award from Cambridge on top of that, apparently.

Several days before that, I was put forward to the second round of the AHRC competition at York.