MTA subway data project Part One.

Web scraping is a pretty fun beginner’s project, and something I’ve already touched upon in Sweigart’s Automate the Boring Stuff with Python. Sweigart’s example of downloading every xkcd comic, while fun, isn’t much use for grabbing data en masse for data science.

Enter the NYC subway data!

While I could have manually downloaded each txt file, it’s a much more useful experience to get Python to do it. I followed the instructions here, which used the standard BeautifulSoup package and syntax to grab more than enough files.
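
As a rough sketch of the link-gathering step, the snippet below pulls every .txt link out of an HTML page. It is not the tutorial's code: to stay dependency-free it swaps BeautifulSoup for the standard library's html.parser, and the sample HTML, the example.com URL and the names TxtLinkParser and find_txt_links are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TxtLinkParser(HTMLParser):
    """Collect the href of every <a> tag that points at a .txt file."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".txt"):
            # Resolve relative links against the page's own URL.
            self.links.append(urljoin(self.base_url, href))


def find_txt_links(html, base_url):
    """Return absolute URLs for all .txt links found in the HTML string."""
    parser = TxtLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content and URL, standing in for the real listing page.
SAMPLE_HTML = """
<a href="data/turnstile_1.txt">Week 1</a>
<a href="data/turnstile_2.txt">Week 2</a>
<a href="about.html">About</a>
"""
links = find_txt_links(SAMPLE_HTML, "http://example.com/developers/index.html")
for link in links:
    print(link)
```

Each URL in links could then be saved to disk with something like urllib.request.urlretrieve(url, filename), ideally with a short pause between requests.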

The next instructions, for cleaning up the data, came from here. The problem, however, was that they assumed one single file, and not a set of more than 20 files of about 25 MB each.

The first thing to do was to merge the files into one csv. I used a simple brute-force method found on Stack Exchange to create a 1gb+ txt file. But when I ran it through the code in tabular rasa, it resulted in a number of dtype errors about “mixed data types in columns 9/10”. Hm. After a good hour or so of trying to specify the data types and searching Stack Exchange, I… realised that I’d merged the files without paying enough attention to the content of each individual file. In other words, by brute-forcing the merge I had repeated the header row once per file, and the main program was attempting to interpret those header strings as entrances and times respectively.
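
In hindsight, a quick check along these lines would have caught the problem before the dtype errors did. This is only a sketch: the function name count_header_repeats and the two-column sample data are invented for the example, and it assumes every input file starts with an identical header line.

```python
import os
import tempfile


def count_header_repeats(path):
    """Return how many times the first line reappears later in the file."""
    with open(path) as f:
        header = next(f, None)  # None if the file is empty
        return sum(1 for line in f if line == header)


# Build a tiny merged file the naive way, header and all (made-up sample data).
one_file = "STATION,ENTRIES\nA,1\nB,2\n"
with tempfile.TemporaryDirectory() as tmp:
    merged = os.path.join(tmp, "merged.txt")
    with open(merged, "w") as out:
        out.write(one_file * 3)  # three files concatenated blindly
    repeats = count_header_repeats(merged)

print(repeats)
```

Any count above zero means header rows are sitting among the data rows, which is enough to make a numeric column look like it holds mixed types.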

Back to square one. Thanks to Stack Exchange, I found another way of doing it, which can be found below:

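import glob

# Every .txt file in the current directory is treated as one input file.
interesting_files = glob.glob("*.txt")

header_saved = False
with open('output.csv', 'w') as fout:
    for filename in interesting_files:
        with open(filename) as fin:
            # The first line of each file is its header row.
            header = next(fin)
            if not header_saved:
                # Write the header once, from the first file only.
                fout.write(header)
                header_saved = True
            # Copy the remaining (data) lines as-is.
            for line in fin:
                fout.write(line)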

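This, however, didn’t work at first. The snippet had been written for Python 2.x and opened the output file in binary mode (wb), whereas I was using Python 3.x, where lines read from a text file are strings and can’t be written to a file opened in binary mode. Changing the wb to w fixed this, and produced a monstrous 3gb csv.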
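
The difference between the two modes can be reproduced in a few lines. This is a minimal sketch run against a throwaway temporary file, not the real data; the sample line and variable names are invented for the example.

```python
import os
import tempfile

# What iterating over a text-mode file yields in Python 3: a str, not bytes.
line = "some,text,line\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.csv")

    # Binary mode expects bytes, so writing a str fails in Python 3.
    try:
        with open(path, "wb") as fout:
            fout.write(line)
        binary_write_failed = False
    except TypeError:
        binary_write_failed = True

    # Text mode accepts str directly.
    with open(path, "w") as fout:
        fout.write(line)
    with open(path) as fin:
        written = fin.read()

print(binary_write_failed, written == line)
```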

In the next post, we’ll move on to data visualisation, and in the third, we’ll discuss applications.