Let’s take a closer look at the data. This dataset is based on the “Destination Inference” step of our ridership model, which we detailed in a previous blog post. As that post explains, the model rests on the assumption that the destination of a subway trip is the station where the rider next swipes or taps. If a MetroCard is swiped at Bowling Green at 9:15, and that same MetroCard is swiped at the 103rd Street stop in East Harlem later that afternoon, we make the imperfect (but fairly good) inference that the 9:15 trip went from Bowling Green to 103rd Street. These “linked trips” form the basis of our understanding of how riders move through the system (Note 1).
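To make the “next swipe” assumption concrete, here is a minimal Python sketch of the linking idea. It is an illustration only, not our production model: the function name `link_trips`, the tuple layout, and the choice to skip consecutive swipes at the same station are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime


def link_trips(taps):
    """Infer destinations from consecutive entry swipes on the same card.

    taps: iterable of (card_id, station, timestamp) entry events.
    Returns a list of (card_id, origin, destination, entry_time) tuples.
    Hypothetical helper for illustration only.
    """
    # Group entry events by card.
    by_card = defaultdict(list)
    for card_id, station, ts in taps:
        by_card[card_id].append((ts, station))

    linked = []
    for card_id, events in by_card.items():
        events.sort()  # chronological order per card
        # Pair each swipe with the next swipe on the same card.
        for (ts, origin), (_, next_station) in zip(events, events[1:]):
            # Assumption of this sketch: a repeat swipe at the same
            # station tells us nothing about the destination, so skip it.
            if next_station != origin:
                linked.append((card_id, origin, next_station, ts))
    return linked


example_taps = [
    ("card-A", "Bowling Green", datetime(2023, 5, 1, 9, 15)),
    ("card-A", "103 St", datetime(2023, 5, 1, 17, 30)),
]
example_linked = link_trips(example_taps)
```

Note that the last swipe of each card has no following swipe, so this sketch infers no destination for it.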
In this subway Origin-Destination (OD) dataset, we took the inferred destinations produced by our destination inference process and aggregated them by origin-destination station complex pair and time of day. These totals are then averaged over a calendar month. Removing personally identifiable information, such as MetroCard ID numbers, and averaging ridership over a calendar month protect the privacy of MTA riders by preventing any single MetroCard swipe or subway trip from being associated with a specific person or time. The aggregated format lets users see approximately how many people traveled between two subway complexes at, for example, “an average 9 a.m. in May.”
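The aggregation can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the exact procedure behind the published dataset: the function name `monthly_average` is hypothetical, the time-of-day bin is assumed to be the clock hour of the entry swipe, and the average is assumed to be taken over every day of the month.

```python
from collections import defaultdict
from datetime import datetime


def monthly_average(linked_trips, days_in_month):
    """Average linked trips per day for each (origin, destination, hour).

    linked_trips: iterable of (origin, destination, entry_time) tuples,
                  all falling within the same calendar month.
    days_in_month: number of days to average over (an assumption of
                   this sketch; the real averaging may differ).
    """
    totals = defaultdict(int)
    for origin, destination, entry_time in linked_trips:
        # Card IDs are already gone here: only the OD pair and the
        # hour-of-day bin of the entry swipe are kept.
        totals[(origin, destination, entry_time.hour)] += 1
    return {key: count / days_in_month for key, count in totals.items()}


may_trips = [
    ("Bowling Green", "103 St", datetime(2023, 5, 1, 9, 15)),
    ("Bowling Green", "103 St", datetime(2023, 5, 2, 9, 40)),
    ("Bowling Green", "103 St", datetime(2023, 5, 9, 9, 5)),
]
may_average = monthly_average(may_trips, days_in_month=31)
```

Three trips spread over a 31-day month average to a fraction of a trip per day, which is one reason the published estimates are not whole numbers.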
It is important to keep a few things in mind when using this data:
- Because this data is the result of a modeling process, the ridership figures for each origin-destination pair are estimates, not exact counts. The modeling process, together with the monthly averaging, produces fractional ridership values. We have intentionally left the ridership estimates in decimal form to reflect the uncertainty inherent in this dataset.
- Because this data represents a monthly average, users should keep in mind that holidays, construction, or other major events during a given month may affect the ridership estimates.
- Because the modeling process only considers subway station entries, we cannot quantify how many of these trips actually started and ended at these subway station complexes and how many included a transfer to or from another mode of transportation (e.g., a bus) at one or both ends.
- When using the data to examine arrivals at a subway station, users should note that the timestamp for each OD pair is the rounded time of the entry swipe (or tap) and does not account for travel time between entry and arrival at the destination (Note 2).
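For readers who want to approximate arrivals despite the last caveat, one rough option is to shift each entry-time bin by an assumed travel time. The Python sketch below is only an illustration: the dataset does not include travel times, so the `travel_minutes` lookup is hypothetical and would have to come from another source, and the sketch assumes hourly bins with every rider entering at the midpoint of the hour.

```python
from collections import defaultdict


def estimate_arrivals_by_hour(od_rows, travel_minutes, destination):
    """Roughly re-bin entry-time ridership estimates by arrival hour.

    od_rows: iterable of (origin, destination, entry_hour, riders).
    travel_minutes: hypothetical {(origin, destination): minutes} lookup
                    supplied by the user; it is not part of the dataset.
    destination: the station complex whose arrivals we want.
    """
    arrivals = defaultdict(float)
    for origin, dest, entry_hour, riders in od_rows:
        if dest != destination:
            continue
        # Crude assumption: all riders in the bin enter at half past the hour.
        minutes = entry_hour * 60 + 30 + travel_minutes[(origin, dest)]
        arrival_hour = (minutes // 60) % 24  # wrap past midnight
        arrivals[arrival_hour] += riders
    return dict(arrivals)


sample_rows = [
    ("Bowling Green", "103 St", 9, 12.4),
    ("Bowling Green", "103 St", 10, 3.1),
    ("Bowling Green", "Fulton St", 9, 5.0),
]
sample_travel = {("Bowling Green", "103 St"): 45}  # made-up value
sample_arrivals = estimate_arrivals_by_hour(sample_rows, sample_travel, "103 St")
```

The ridership values and the 45-minute travel time above are made up for the example; any real analysis would need travel times from a schedule or other source, and the result remains an estimate layered on an estimate.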