MTA subway data project Part One.

Web scraping is a pretty fun beginner's project, and something I've already touched on in Sweigart's Automate the Boring Stuff with Python. Sweigart's example of downloading every xkcd comic, while fun, isn't much use for grabbing data en masse for data science.

Enter the NYC subway data!

While I could have manually downloaded each txt file, it's much more useful experience to get Python to do it. I followed the instructions here, which used the standard Beautiful Soup package and syntax to grab more than enough files.
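To give a flavour of the link-grabbing step, here's a minimal sketch of how Beautiful Soup picks the txt links out of a listing page. The HTML below is made up for illustration (the real script would first fetch the live MTA page with requests or urllib, and join the relative hrefs to the base URL before downloading):

```python
from bs4 import BeautifulSoup

# Made-up sample of the sort of listing page the MTA provides;
# a real script would fetch the live page over HTTP first.
html = """
<html><body>
<a href="data/nyct/turnstile/turnstile_150328.txt">Saturday, March 28, 2015</a>
<a href="data/nyct/turnstile/turnstile_150321.txt">Saturday, March 21, 2015</a>
<a href="about.html">About</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Keep only the links that point at the weekly turnstile txt files
txt_links = [a["href"] for a in soup.find_all("a") if a["href"].endswith(".txt")]
print(txt_links)
```

From there it's one download call per link, which is exactly the kind of tedium Python is for.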

The next set of instructions, for cleaning up the data, came from here. The problem, however, was that it assumed one single file, not a set of more than twenty 25 MB files.

The first thing to do was to merge the files into one CSV. I used a simple brute-force method found on Stack Exchange to create a 1 GB+ txt file. But when I ran it through the code in tabula rasa, it threw a number of dtype errors about "mixed data types in columns 9/10". Hm. After a good hour or so of trying to specify the data types and searching Stack Exchange, I realised that I'd merged the files without paying enough attention to the content of each individual file. In other words, by brute-forcing the files together I had repeated each file's header line, which the main program was attempting to interpret as entrance/time data respectively.
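The failure mode is easy to reproduce in miniature. In this toy example (my own invented data, not the MTA files), naive concatenation leaves the second file's header stranded in the middle of the data, and pandas dutifully reports the column as a mixed-type mess:

```python
import io
import pandas as pd

# Two toy "turnstile" files, each starting with its own header line
file_a = "ENTRIES,EXITS\n100,50\n200,75\n"
file_b = "ENTRIES,EXITS\n300,125\n"

# Brute-force concatenation repeats file_b's header mid-file
merged = file_a + file_b

df = pd.read_csv(io.StringIO(merged))
# The stray "ENTRIES" string drags the whole column to dtype object
print(df["ENTRIES"].dtype)
```

On a 1 GB file the same repeated headers surface as the "mixed data types" DtypeWarning I was chasing.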

Back to square one. Thanks to Stack Exchange I found another way of doing it, which can be found below:

import glob

interesting_files = glob.glob("*.txt")

header_saved = False
with open('output.csv', 'w') as fout:
    for filename in interesting_files:
        with open(filename) as fin:
            # Read each file's header, but only write it out once
            header = next(fin)
            if not header_saved:
                fout.write(header)
                header_saved = True
            # Copy the remaining data lines straight through
            for line in fin:
                fout.write(line)

This, however, didn't work at first: I was using text files in CSV format, and the snippet had been written for Python 2.x while I was running Python 3.x. Changing the file mode from 'wb' to 'w' fixed this, and produced a monstrous 3 GB CSV.

In the next post, we'll go on to data visualisation, and in the third, we'll discuss applications.

Evelina project.

So, as part of my wider attempts to automate archival transcription, I've become interested in the automatic identification of authors. I can see two main uses for this in Burney studies: first, figuring out who wrote what in the redrafting of correspondence by Burney's wider circle late in her life; and second, settling once and for all which books by Mrs Meeke were written by Bessie Allen, Burney's ne'er-do-well stepsister.

As a first step, I've been following this project. AFAICT, if everything runs smoothly, the only issue will be cleaning up the data before feeding it in, if we want to apply it to authors beyond Kaggle's data set.
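To sketch the underlying idea (this is my own toy baseline, not the Kaggle project's actual method, and the quoted samples are invented placeholders rather than real Burney or Meeke text): build a character n-gram frequency profile for each candidate author, then attribute a disputed passage to whichever profile it is most similar to by cosine similarity.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Frequency profile of overlapping character n-grams."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two Counter profiles."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def attribute(disputed, candidates):
    """Return the candidate author whose sample profile is closest."""
    profile = char_ngrams(disputed)
    return max(candidates,
               key=lambda name: cosine(profile, char_ngrams(candidates[name])))

# Invented stand-in samples, purely for illustration
samples = {
    "Burney": "I am ashamed of confessing that I have nothing to confess.",
    "Meeke": "The stranger bowed low, and begged leave to address the company.",
}
print(attribute("I have nothing at all to confess, I am ashamed to say.", samples))
```

Real stylometry needs far more text per author and better features (function words, burstiness, and so on), which is where the data cleaning comes in.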