A Data-driven Perspective on the Notorious MTA

How geographic data analysis led to a canvassing strategy for optimum impact and engagement

Nick Wilders
6 min read · Sep 29, 2020


There isn’t much that will get New Yorkers talking quite as much as the MTA. The Metropolitan Transportation Authority encompasses the many trains, buses, and other forms of transit that keep the 22.82 square miles of Manhattan and beyond in motion. Whether you find peace in the daily commute or dread it while you sip your morning coffee, it’s a staple of busy life in NYC. COVID-19 has substantially diminished my personal subway usage, so my first project with Metis brought pangs of nostalgia and, incidentally, more attention to the turnstiles than I ever thought possible.

For fellow data science enthusiasts, feel free to check out and follow along with the resources on our GitHub repository.

The Basics

For our first challenge with Metis, my team of four was charged with recommending a street team schedule for WomenTechWomenYes (WTWY), a thriving and completely fictitious non-profit with the mission of promoting awareness and opportunities for women in tech. (Sounds cool, right? Don’t fret: Women Who Code, or WWCode, is an organization that really does this work!) WTWY had big plans for a Fall gala (which we chose to plan for a future Fall season, global pandemic notwithstanding) and the desire to “fill [their] event space with individuals passionate about increasing the participation of women in technology, and to concurrently build awareness and reach”. The (admittedly imaginary) stakes were high, and the path to a solution was unclear: how could we target potentially high donors while spreading awareness in the communities that needed WTWY? The answer, of course, was in the data.

Our Tools

The curricular goal of this project was to learn how to gather, use, and present data with some fairly straightforward tools. Turns out, the MTA tracks every turn of a turnstile in this city, from Inwood to Far Rockaway. We used this data, cross-referenced with MTA station geographical data collected using GeoPy and our extensive domain knowledge of the NYC subway system. With over 400 stations and nearly 5,000 unique turnstiles, that’s a lot of data to handle. Thankfully, we were in good hands with Python (the native data science language, of sorts) and its libraries pandas, matplotlib, NumPy, and the aforementioned powerhouse that is GeoPy, all built on Python, the mother (serpentine?) tongue.

The most important tool at our disposal was, satisfyingly enough, our own expertise. With a decade-long history of working with non-profits, I brought extensive domain knowledge that was well complemented by teammates with computer science, data visualization, and external research expertise. Our varied skill levels in programming were an easily surmountable challenge, as we used these new tools to solve a key problem: where do we send this hypothetical street team?

Breaking Down the Boroughs

One thing that was clear to our team was the need to reach a wide geographic area within the five boroughs. After a cursory look at the data and a couple of days of cleaning, we sought out more data to help us address this need. (One cleaning discovery: each turnstile showed well over 10 million “entries” per four-hour period, which led us to conclude that the counts were cumulative, representing total entries over the turnstile’s lifespan.)
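Because the counters are cumulative, the number of people entering in a given four-hour window is the difference between consecutive readings of the same turnstile. Here is a minimal sketch of that step in pandas. The sample rows, the `DATETIME` and `INTERVAL_ENTRIES` column names, and the `max_per_interval` cutoff are illustrative assumptions, not our project code.

```python
import pandas as pd

# Made-up cumulative readings for two turnstiles (illustrative values only).
readings = pd.DataFrame({
    "UNIT": ["R001", "R001", "R001", "R002", "R002", "R002"],
    "SCP": ["00-00-00"] * 6,
    "DATETIME": pd.to_datetime([
        "2019-09-01 00:00", "2019-09-01 04:00", "2019-09-01 08:00",
        "2019-09-01 00:00", "2019-09-01 04:00", "2019-09-01 08:00",
    ]),
    # The last R002 reading simulates a counter reset.
    "ENTRIES": [10_000_000, 10_000_150, 10_000_900, 500, 520, 5],
})


def interval_entries(df, max_per_interval=20_000):
    """Turn cumulative ENTRIES into per-interval counts for each turnstile.

    max_per_interval is an assumed sanity cap, not an official MTA figure.
    """
    df = df.sort_values(["UNIT", "SCP", "DATETIME"]).copy()
    # Difference between consecutive readings of the same turnstile.
    df["INTERVAL_ENTRIES"] = df.groupby(["UNIT", "SCP"])["ENTRIES"].diff()
    # Drop each turnstile's first reading (NaN), counter resets (negative),
    # and implausibly large jumps.
    plausible = df["INTERVAL_ENTRIES"].between(0, max_per_interval)
    return df[plausible]


per_interval = interval_entries(readings)
print(per_interval[["UNIT", "DATETIME", "INTERVAL_ENTRIES"]])
```

Grouping before calling `diff()` matters: without it, the first reading of one turnstile would be subtracted from the last reading of another.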

A classic NYC map and its counterpart, a map of every subway station using latitude/longitude data

One of our greatest joys in this project was finding latitude and longitude data for each individual subway station. By joining it on the UNIT column, which is commonly used as a numerical identifier for an individual subway stop, we were able to create the subway map you see above, with each blue dot representing a unique subway station. Some of the observations are no-brainers for a New Yorker: central Manhattan is busy, and Queens is dependent on two major lines. However, there’s something deeply satisfying in seeing your native intuition and domain knowledge validated by a visual you created. The boroughs were becoming clearer, but we needed one more step to finally assign this data.
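A join like the one described above might look roughly like this in pandas. The station names, coordinates, and the `LAT`/`LON` column names are placeholders invented for the example; only the `UNIT` join key comes from our data.

```python
import pandas as pd

# Made-up turnstile totals keyed by UNIT (illustrative values only).
turnstiles = pd.DataFrame({
    "UNIT": ["R001", "R002", "R999"],
    "STATION": ["STATION A", "STATION B", "STATION C"],
    "ENTRIES": [1200, 800, 50],
})

# Made-up coordinate lookup; R999 is deliberately missing.
station_coords = pd.DataFrame({
    "UNIT": ["R001", "R002"],
    "LAT": [40.75, 40.69],
    "LON": [-73.98, -73.99],
})

# A left join keeps every turnstile row, even when no coordinates are found.
# validate="many_to_one" raises if a UNIT appears twice in the lookup table.
merged = turnstiles.merge(
    station_coords, on="UNIT", how="left",
    validate="many_to_one", indicator=True,
)

# Rows with no match are worth inspecting before plotting anything.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["UNIT", "STATION"]])
```

With coordinates attached, a scatter plot of `LON` against `LAT` is enough to reproduce a dot-per-station map.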

Nominatim, an open-source tool in GeoPy, helped us correlate each individual subway stop to a rough address and, consequently, a borough. That led us to the information pictured at right, including that Manhattan accounts for over 36 million entries, or 41% of the 88 million total turnstile entries in September 2019. With Staten Island accounting for <1% of entries, and “Unknown” accounting mostly for PATH or irrelevant stations, we chose to focus on the top four boroughs: Manhattan, The Bronx, Brooklyn, and Queens.
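For the curious, here is one way the coordinate-to-borough step could be sketched with GeoPy’s Nominatim geocoder. This is not our exact code: the address keys checked (`borough`, `suburb`, `city_district`, `county`) are assumptions about what the service returns for New York, and the `wtwy-demo` user agent is a made-up name. The county-to-borough mapping itself is standard, since each borough is coextensive with one county.

```python
# Each NYC borough is coextensive with one county.
COUNTY_TO_BOROUGH = {
    "New York County": "Manhattan",
    "Kings County": "Brooklyn",
    "Queens County": "Queens",
    "Bronx County": "The Bronx",
    "Richmond County": "Staten Island",
}
BOROUGH_ALIASES = {name: name for name in COUNTY_TO_BOROUGH.values()}
BOROUGH_ALIASES["Bronx"] = "The Bronx"


def borough_from_address(address):
    """Pick a borough out of a Nominatim-style address dict, else 'Unknown'."""
    # Assumed keys: the field that carries the borough name can vary by result.
    for key in ("borough", "suburb", "city_district"):
        value = address.get(key)
        if value in BOROUGH_ALIASES:
            return BOROUGH_ALIASES[value]
    # Fall back to the county name; anything outside NYC becomes 'Unknown'.
    return COUNTY_TO_BOROUGH.get(address.get("county"), "Unknown")


def lookup_borough(lat, lon, user_agent="wtwy-demo"):
    """Reverse-geocode one coordinate pair (needs geopy and network access)."""
    from geopy.geocoders import Nominatim  # imported lazily on purpose

    geolocator = Nominatim(user_agent=user_agent)
    location = geolocator.reverse((lat, lon))
    if location is None:
        return "Unknown"
    return borough_from_address(location.raw.get("address", {}))


# Offline examples using hand-written address dicts:
print(borough_from_address({"county": "Kings County"}))
print(borough_from_address({"city": "Jersey City", "county": "Hudson County"}))
```

The public Nominatim service is rate limited, so with a few hundred stations it makes sense to geocode each station once and cache the result rather than look up every turnstile row.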

At this point, we took more time to clean up our data. This included removing duplicate values (did you know there are four separate 86th St stops?), determining how far we could afford to go with outliers (using Pearson’s mode skewness), and assigning each date value its corresponding day of the week (more on that later). With a clean data set, we were ready to hear the story our data told.
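Pearson’s mode skewness is simply (mean − mode) / standard deviation: a value near zero suggests a roughly symmetric distribution, while a large positive value points to a long right tail of unusually big readings. A minimal sketch using only the standard library follows. The sample numbers are invented, and using the sample standard deviation is a choice made for this example.

```python
import statistics


def pearson_mode_skewness(values):
    """Pearson's first skewness coefficient: (mean - mode) / std deviation."""
    mean = statistics.mean(values)
    mode = statistics.mode(values)  # most common value; first one wins on ties
    spread = statistics.stdev(values)  # sample standard deviation (assumed)
    return (mean - mode) / spread


# Invented per-interval entry counts, for illustration only.
symmetric = [1, 2, 2, 3]
right_skewed = [100, 100, 100, 120, 150, 900]  # one suspiciously large reading

print(pearson_mode_skewness(symmetric))
print(pearson_mode_skewness(right_skewed))
```

Recomputing the coefficient after trimming the largest readings gives a rough sense of how much the outliers are distorting the distribution.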

“Please use all available doors…”

A slideshow from our final presentation (available here!)

One of the clearest observations across the four boroughs we chose to study was the relationship between turnstile turns and the number of turnstiles in a station. Moving from Manhattan to the Bronx, in decreasing order of total turns (and number of turnstiles), the linear regression fit gets steadily weaker. To put it simply, there’s a direct correlation between the number of turnstiles and the number of turnstile turns in Manhattan, and we start to see that relationship break down as we move into the Bronx. For us, this connoted less reliability in the Bronx and Queens: we would have to look to the few specific outliers that consistently delivered (familiar stops like Flushing-Main St and 161st St-Yankee Stadium) to make use of the foot traffic there. We have more flexibility in Manhattan and Brooklyn, where we can hit heavy commuter hubs (42nd St, 34th St-Penn Station) and bustling but less “connected” areas (DeKalb Ave, Canal St) alike.
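One way to put a number on how well the line fits is the coefficient of determination (R²) of a straight-line fit between turnstile count and total turns, computed per borough. The sketch below uses NumPy. The two sample “boroughs” are invented to show a tight fit and a loose one, and are not our actual figures.

```python
import numpy as np


def fit_quality(turnstile_counts, total_entries):
    """Fit entries = slope * turnstiles + intercept and return (slope, intercept, R^2)."""
    x = np.asarray(turnstile_counts, dtype=float)
    y = np.asarray(total_entries, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)  # least-squares straight line
    predicted = slope * x + intercept
    ss_res = np.sum((y - predicted) ** 2)  # variation left unexplained
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation in entries
    return slope, intercept, 1 - ss_res / ss_tot


# Invented station-level data: turnstiles per station and total entries.
turnstiles = [4, 8, 12, 16, 20]
tight_borough = [4000, 8000, 12000, 16000, 20000]  # entries scale with turnstiles
loose_borough = [9000, 3000, 30000, 8000, 12000]   # a few outliers dominate

_, _, r2_tight = fit_quality(turnstiles, tight_borough)
_, _, r2_loose = fit_quality(turnstiles, loose_borough)
print(f"tight fit R^2: {r2_tight:.3f}, loose fit R^2: {r2_loose:.3f}")
```

A low R² does not mean a borough has no good stations. It means the turnstile count stops being a useful guide, so the strong stations have to be picked out individually.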

As you might expect, Manhattan ridership is largely concentrated in the afternoon rush hour, especially on weekdays. For all four boroughs, foot traffic was heaviest during the week and a bit lighter on the weekends. However, a closer look at the Brooklyn stations helped us realize that the relationship to time of day was not quite so clear-cut. The distribution below represents the time of day of turnstile turns in the top 20 Brooklyn stations specifically:

Top 10 Busiest Stations in Brooklyn (at different times of day) vs Stations 11–20

The Top 10 busiest stations, like Brooklyn stalwarts Jay St-MetroTech and Bedford Ave, consistently perform well in the afternoon and nighttime. However, we found a surprise in Flatbush Ave: while its morning and afternoon numbers were comparatively unremarkable, it held decent performance into the nighttime hours. On the other hand, Nevins St and Brighton Beach had outstanding morning and afternoon performance, but absolutely crashed in the evening. This disparity between stations drove home a key point: high volume does not guarantee consistent performance, and the relative strength of stations 11–20 in our Brooklyn ranking was not to be easily dismissed.
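A station-by-time-of-day comparison like this one can be sketched by labeling each reading with a day of the week and a time-of-day bucket, then comparing each station’s share of traffic per bucket. The bucket boundaries in `label_period`, the column names, and the sample numbers are all assumptions made for the example.

```python
import pandas as pd


def label_period(hour):
    """Assign an hour of day to an assumed time-of-day bucket."""
    if 4 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 21:
        return "evening"
    return "night"


# Invented per-interval entries for two placeholder stations.
intervals = pd.DataFrame({
    "STATION": ["STATION A"] * 4 + ["STATION B"] * 4,
    "DATETIME": pd.to_datetime([
        "2019-09-03 08:00", "2019-09-03 16:00",
        "2019-09-03 20:00", "2019-09-04 00:00",
    ] * 2),
    "INTERVAL_ENTRIES": [100, 300, 400, 200, 500, 400, 80, 20],
})

intervals["DAY"] = intervals["DATETIME"].dt.day_name()
intervals["PERIOD"] = intervals["DATETIME"].dt.hour.map(label_period)

# Total entries per station and period, then each period's share of the station total.
profile = intervals.pivot_table(
    index="STATION", columns="PERIOD",
    values="INTERVAL_ENTRIES", aggfunc="sum", fill_value=0,
)
shares = profile.div(profile.sum(axis=1), axis=0)
print(shares.round(2))
```

Comparing shares rather than raw totals is what lets a mid-sized station with a strong evening stand out next to a busier station that empties out after rush hour.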

Exiting the Station

Ultimately, we would need more time and resources to truly dig into a geographically effective distribution. Non-profits like WTWY must tap into the population they serve, not just the wealthy homes that keep them running. We saw a number of areas for improvement and further partnership in these studies:

  • Look across multiple years of data, specifically addressing the growth of tech in NYC over the past five years
  • Incorporate U.S. Census data, targeting high-income areas in addition to areas with prior exposure to STEM through nearby universities or tech jobs
  • Engage with residential New Yorkers in their home neighborhoods in order to create lasting relationships and, ideally, annual fund contributions

This project drew on new and old skills alike, and definitely challenged me. I enjoyed the opportunity to lean into my speaking and presentation background, along with the chance to apply some new programming skills. I will certainly miss the group format, but I am looking forward to our next several projects at Metis!

In the meantime, I’ll never look at a subway turnstile the same way again…


Nick Wilders

NYC musician, Metis data scientist, ponderer. Aiming to amplify and harness the power of data utilization in arts and entertainment.