At a recent transit conference, an IT professional attending a data session, which one of the authors moderated, stood up and said:
“I ride the bus to work every day and rely on my agency’s app. The app provides scheduled arrival times and predicted arrival times. My bus is scheduled to arrive at 8:00 a.m. Most days my app says my bus will arrive at 8:00 a.m. and yet it sometimes arrives 10 minutes early, 10 minutes late, or not at all, and the app doesn’t reflect this. It is very frustrating. I don’t know how many customers we are losing because of this, but I assume it is a lot.”
Getting predicted arrival times right is difficult, and the causes range from driver shortages to technology shortcomings. Whatever the cause, the result is the same: poor data quality and dissatisfied customers.
The management and effective use of data have become essential for high-performing transit operations. Buses have become computers on wheels that produce massive amounts of data from systems such as computer-aided dispatch and automatic vehicle location (CAD-AVL), passenger counting, GNSS, vehicle health monitoring, and payment systems.
A 2023 National Academies report concluded that “the sheer volume and diversity of that data is a problem for many agencies. They are not able to view or use all the data they collect; and, as a result, they may not be able to comprehend the value of the data available.” Many agencies struggle to make full use of the data they already have.
Most agencies that provide bus service make their data available to the public through the General Transit Feed Specification (GTFS) Schedule and GTFS Realtime (GTFS-RT), open standards used to distribute relevant information about transit systems to riders.
GTFS Schedule data feeds include seven underlying text files (agency, stops, routes, trips, stop times, calendar, and calendar dates) that are recommended to be updated on a weekly basis. GTFS-RT is an additional standard built on real-time vehicle data; it carries three kinds of feeds (trip updates, service alerts, and vehicle positions) and is recommended to be updated every 30 seconds.
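As a rough illustration of what a basic feed check might look like, the Python sketch below tests whether a GTFS Schedule zip file contains the core text files under their conventional names. It is a simplified reading of the specification, not an official validator: the function name `missing_core_files` and the in-memory demo feed are assumptions made for this example. The GTFS reference treats calendar.txt and calendar_dates.txt as conditionally required, so the sketch accepts a feed that ships either one.

```python
import io
import zipfile

# Files this sketch treats as always required (a simplification of the spec).
ALWAYS_REQUIRED = {"agency.txt", "stops.txt", "routes.txt", "trips.txt", "stop_times.txt"}
# The spec requires at least one of these two service-calendar files.
CALENDAR_FILES = {"calendar.txt", "calendar_dates.txt"}


def missing_core_files(feed):
    """Return a list describing core files absent from a GTFS zip (path or file object)."""
    with zipfile.ZipFile(feed) as zf:
        # Keep only the base name so feeds zipped inside a folder still match.
        present = {name.rsplit("/", 1)[-1] for name in zf.namelist()}
    missing = sorted(ALWAYS_REQUIRED - present)
    if not (CALENDAR_FILES & present):
        missing.append("calendar.txt or calendar_dates.txt")
    return missing


# Hypothetical demo: an empty in-memory feed that has no calendar files at all.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    for name in sorted(ALWAYS_REQUIRED):
        zf.writestr(name, "")
demo.seek(0)
print(missing_core_files(demo))
```

A check like this only confirms that the files exist; it says nothing about whether their contents are accurate.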
Transit agencies are not required to use these standards, but they have become the norm for most agencies and a minimum requirement for an improved rider experience.
The GTFS feed is essential for planning and communications purposes. Unfortunately, its value is only as good as the quality of the data provided. As one transit agency put it, “making GTFS and GTFS-RT publicly available in real time is problematic. We are not comfortable with data quality. We don’t have enough time to reconcile it.”
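One way to put a number on this kind of data quality concern is to compare predicted arrival times with the times buses were actually observed. The Python sketch below is a minimal illustration, not any agency's method: the function names, the two-minute tolerance, and the sample times are all assumptions chosen for this example, and it does not handle buses that never arrive.

```python
from datetime import datetime


def error_minutes(predicted, actual):
    """Signed error in minutes; positive means the bus arrived later than predicted."""
    fmt = "%H:%M"
    delta = datetime.strptime(actual, fmt) - datetime.strptime(predicted, fmt)
    return delta.total_seconds() / 60


def summarize(pairs, tolerance_min=2):
    """Return (mean absolute error in minutes, share of predictions within tolerance)."""
    errors = [error_minutes(p, a) for p, a in pairs]
    mae = sum(abs(e) for e in errors) / len(errors)
    within = sum(abs(e) <= tolerance_min for e in errors) / len(errors)
    return mae, within


# Made-up observations: (predicted arrival, observed arrival) for one stop.
sample = [("08:00", "07:50"), ("08:00", "08:10"), ("08:00", "08:01"), ("08:00", "08:00")]
mae, within = summarize(sample)
print(f"mean absolute error: {mae:.2f} min, within tolerance: {within:.0%}")
# prints: mean absolute error: 5.25 min, within tolerance: 50%
```

Tracked over time, a measure like this could give an agency a more direct signal of prediction quality than the volume of customer complaints.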
Accurate and Reliable Data
Agencies are required to report data to the National Transit Database (NTD), which is maintained by the Federal Transit Administration (FTA).
The NTD is a national database that records the financial, operating, and asset condition of transit systems, helping to keep track of the industry and provide the public with information and statistics. Formula grant allocations are impacted by this data. An FTA representative stated that “NTD financial data is generally good. The quality of the data gets worse as you move to operational data. The quality of what is reported varies greatly across agencies.”
Many transit agencies do not have the resources or appropriate incentives to make sure their data is accurate and reliable. It is hard to determine a distinct ROI for cleaning and engineering data. Should the measure be customer satisfaction ratings, riders lost or not recovered, or financial savings? The challenge is how best to draw a direct correlation between quality data and customer satisfaction.
One transit professional said he believes his agency’s data is good “because we are no longer getting as many customer complaints about our arrival predictions.”
Another creative transportation professional was able to derive a financial ROI from an investment in data quality, stating, “We saved hundreds of thousands of dollars last year by using data to help drive route decisions. We were able to eliminate an unnecessary route.”
The first step to ensuring data is reliable is to make sure it is clean. Dirty data, which is incomplete, incorrect, inaccurate, or irrelevant, can lead to misinformed decisions and missed opportunities.
To clean its data, a transit agency must remove corrupt or inaccurate data and then enhance what remains to ensure that it is complete, up to date, and reliable. Cleaning GTFS data, for example, can be a time-consuming and resource-intensive process that includes multiple steps. Much of it can be done by outside consultants who run the data through machine learning algorithms to identify anomalies. Even so, it still requires visual inspection by agency staff at various stages.
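The machine learning step is beyond the scope of a short example, so the Python sketch below substitutes a few simple rule-based checks to show the general idea of flagging suspect records for human review. It reads rows shaped like GTFS stop_times.txt (using the real `trip_id`, `arrival_time`, and `stop_sequence` fields) and flags blank times, malformed times, and arrivals that go backward within a trip. The function names and sample rows are hypothetical. Blank arrival times can be legitimate in GTFS for stops that are not timepoints, which is one reason the output is a review list rather than a delete list.

```python
import csv
import io


def to_seconds(hms):
    """Convert GTFS HH:MM:SS to seconds; hours may exceed 23 for trips past midnight."""
    h, m, s = (int(part) for part in hms.split(":"))
    if not (0 <= m < 60 and 0 <= s < 60):
        raise ValueError(hms)
    return h * 3600 + m * 60 + s


def flag_stop_times(rows):
    """Return (trip_id, stop_sequence, reason) tuples for rows a person should review."""
    flags = []
    latest = {}  # trip_id -> latest valid arrival time seen so far, in seconds
    ordered = sorted(rows, key=lambda r: (r["trip_id"], int(r["stop_sequence"])))
    for row in ordered:
        trip, seq = row["trip_id"], int(row["stop_sequence"])
        raw = row["arrival_time"].strip()
        if not raw:
            flags.append((trip, seq, "blank arrival_time"))
            continue
        try:
            t = to_seconds(raw)
        except ValueError:
            flags.append((trip, seq, "malformed arrival_time"))
            continue
        if trip in latest and t < latest[trip]:
            flags.append((trip, seq, "arrival earlier than previous stop"))
        latest[trip] = max(t, latest.get(trip, t))
    return flags


# Made-up rows for illustration only.
SAMPLE = """trip_id,arrival_time,stop_sequence
T1,08:00:00,1
T1,08:05:00,2
T1,08:03:00,3
T1,8:7,4
T2,25:10:00,1
T2,,2
"""
for flag in flag_stop_times(list(csv.DictReader(io.StringIO(SAMPLE)))):
    print(flag)
```

Rules like these catch only obvious structural problems; a time that is well formed but simply wrong still needs comparison against observed operations.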
One consultant who focuses on helping agencies with transit data commented that “agencies will pay only for data management if it supports their NTD reporting obligations.” Our experience is that most large agencies and a number of small to mid-size agencies are making this investment. The common attribute they all share is that there is somebody on staff who understands the importance of quality data and is willing to be an evangelist.
The second key to obtaining quality data is data standards that establish the ground truth.
The Need for Standardization
The purpose of standardization is to ensure an organization’s data is consistent for all of its users and there is a uniform structure that makes the data easier to manage, analyze, and exchange across different systems and organizations.
While GTFS and GTFS-RT are open data standards, they are not consistently used across the country, across agencies, or even within the same organization. For example, an agency might have a GTFS-driven public schedule but not use GTFS-RT for on-time predictions.
Additionally, many agencies have mobile app providers that generate their own arrival predictions rather than rely on the agency’s GTFS feed, because of concerns about its quality.
Standardization requires defining and implementing rules and formats to organize the data. It can provide numerous benefits to all organizations and the transit industry in particular. Key benefits can include:
- Enabling different systems and applications to work together seamlessly. In transit this means ticketing, route scheduling, and passenger information systems would all be relying on the same set of information.
- Data is higher quality because it is less prone to errors, inconsistencies, and duplication. Standards reduce uncertainty and help ensure that data is accurate, complete, and up to date.
- Many organizations use a variety of software applications and data sources. Following a standard simplifies the process of integrating data from those sources by providing a common structure and format for all data.
- Data is easier to analyze and report on. IT professionals will spend less time cleaning and preparing data and more time analyzing to make informed decisions.
- It becomes easier to share with external vendors, customers, and stakeholders. This is especially important in transit when agencies rely on numerous vendors to help them supply services to customers and data sharing is an integral part of providing real-time travel information.
- Data will maintain its relevance longer as technology and business requirements evolve. Consistently applying the same standard enables historical data sets to influence future planning decisions.
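The interoperability and integration benefits listed above can be illustrated with a small Python sketch. It assumes two hypothetical vendors that report the same vehicle position using different field names and time formats, and it converts both into one common record. The `VehiclePosition` type, the vendor field names, and the sample values are all invented for this example and do not represent any real vendor's format or any published standard.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class VehiclePosition:
    """Common record that every downstream system would consume (hypothetical schema)."""
    vehicle_id: str
    lat: float
    lon: float
    timestamp: int  # seconds since the Unix epoch, UTC


def from_vendor_a(rec):
    """Hypothetical vendor A: numeric vehicle number, epoch-second timestamp."""
    return VehiclePosition(str(rec["veh"]), float(rec["lat"]), float(rec["lon"]), int(rec["ts"]))


def from_vendor_b(rec):
    """Hypothetical vendor B: 'lat,lon' string and an ISO 8601 timestamp with offset."""
    lat, lon = (float(x) for x in rec["position"].split(","))
    ts = int(datetime.fromisoformat(rec["time"]).timestamp())
    return VehiclePosition(rec["vehicle_id"], lat, lon, ts)


# The same made-up observation, as each vendor might report it.
a = from_vendor_a({"veh": 1204, "lat": "44.9778", "lon": "-93.2650", "ts": 1700000000})
b = from_vendor_b({
    "vehicle_id": "1204",
    "position": "44.9778,-93.2650",
    "time": "2023-11-14T22:13:20+00:00",
})
print(a == b)  # the two records are identical once normalized
```

Without a shared standard, every agency and vendor pair needs its own adapter of this kind; with one, the adapters disappear and the common record becomes the native format.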
The Benefits Derived from Standardization
Because of the lack of clear, consistent, and enforceable standards, regional solutions have proliferated. Efforts such as the Transit ITS Data Exchange Specification (TIDES) Google group, the MobilityData Slack group, and the MnDOT transit data specification have sprung up as various regional groups seek the benefits of common standards. Some of these groups are even beginning to require their vendors to comply with their data standards, which creates its own set of challenges because vendors must present their data in different formats for different clients.
Leadership at the federal and state levels to create and enforce national standards would ensure conformity across the public and private sectors. This would increase the utility of the data gathered and reduce the inefficiency in gathering and distributing real-time information, improving agencies’ operations and the customer experience.
In response to this need, the FTA has created a new Standards Development Program (SDP) to develop voluntary standards, best practices, guidance, and tools for the transit industry.
Unfortunately, this process will be a long-term investment. In the meantime, regional groups will grow and, ideally, evangelists and consultants will educate more agencies about the value of continuing to improve the quality of their data.
Standardized data provides a foundation for better decision-making. When data is consistent and reliable, organizations can make better informed strategic and operational decisions that drive performance and customer satisfaction.
Transit agencies can benefit in several areas: fleet management, regulatory compliance, safety and security, and passenger experience.
Data cleansing and standardization in public transportation are essential for creating a more efficient, integrated, and passenger-friendly transit system. They promote interoperability, efficiency, cost savings, and data quality; unlock innovation; and improve the overall rider experience. Data standardization is a key enabler for the modernization and improvement of public transportation services.
About the Authors: Scott Belcher is President and CEO at SFB Consulting LLC, and Mark Talbot is Principal for EFCT Consulting LLC.