If you've ever downloaded a GTFS feed twice from the same URL and gotten two different zip files, you've encountered a surprisingly common problem in the transit data ecosystem. The data inside hasn't changed (the same stops, routes, and schedules) but the zip archive itself is different. Maybe the server regenerated it dynamically. Maybe the compression level changed. Maybe the file timestamps inside the archive shifted by a second.
For Transitland, which archives and imports thousands of GTFS feeds from around the world, this matters. We need to know: has this feed actually changed, or is it just the same data in a new wrapper?
The answer lies in how we calculate checksums — and you can use the same tools we do.
The problem with hashing zip files
The naive approach to detecting changes is straightforward: compute a SHA1 hash of the zip file and compare it to what you had before. If the hashes differ, something changed.
But zip archives are containers, and containers have metadata. A zip file's checksum can change for reasons that have nothing to do with the transit data inside:
- Dynamic generation: Some transit agencies serve GTFS through vendor platforms that create the zip on the fly for each download. Same data, different zip bytes every time.
- Compression differences: Repackaging a zip with a different compression level produces a different file.
- Timestamps and ordering: The internal file modification times and the order of entries within the archive can vary between builds.
If Transitland treated every new zip checksum as a new feed version, we'd be archiving and importing duplicate data constantly, wasting storage, processing time, and cluttering the archive with phantom "updates."
Two checksums, two purposes
Transitland calculates two SHA1 checksums for every feed version:
Zip SHA1: a conventional hash of the entire zip archive file. This is fragile by nature (it changes whenever the packaging changes), but it's useful as a unique identifier for the exact file you downloaded. Transitland uses this as the primary key for feed versions in its REST API and website URLs.
Directory SHA1: a content-aware hash that looks inside the zip, directly at the GTFS CSV files. This is the one that solves the deduplication problem. Two zip files with completely different Zip SHA1 values will share the same Directory SHA1 if the actual transit data is identical.
When Transitland's feed fetching service downloads a new copy of a feed, it checks both hashes against its archive. If either one matches an existing feed version, it knows the data isn't new and skips the import.
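That check can be sketched as follows; `isAlreadyArchived` and the in-memory set of known hashes are hypothetical stand-ins for transitland-lib's actual feed-version lookup:

```go
package main

import "fmt"

// isAlreadyArchived reports whether either hash matches a known feed
// version. Hypothetical helper and data shape, for illustration only:
// the real service checks against its feed-version database.
func isAlreadyArchived(zipSHA1, dirSHA1 string, known map[string]bool) bool {
	return known[zipSHA1] || known[dirSHA1]
}

func main() {
	// Directory SHA1 of a previously archived feed version.
	known := map[string]bool{
		"34e21ac222935b7acdd395e5f6b36dc996da0d60": true,
	}
	// A re-zipped download: new zip bytes, same contents inside.
	skip := isAlreadyArchived(
		"0000000000000000000000000000000000000000", // fresh Zip SHA1
		"34e21ac222935b7acdd395e5f6b36dc996da0d60", // matching Directory SHA1
		known,
	)
	fmt.Println(skip) // true: the data isn't new, so the import is skipped
}
```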
Inside the Directory SHA1
The Directory SHA1 algorithm in transitland-lib is deliberately simple, but each design choice is intentional:
- Open the zip and enumerate its entries. We look at what's inside the archive, not the archive itself.
- Filter to GTFS content files. Only lowercase `.txt` files in the root directory of the archive are included: `stops.txt`, `routes.txt`, `trips.txt`, and so on. Directories, hidden files (starting with `.`), files in subdirectories, and non-`.txt` files are all excluded.
- Sort the files alphabetically by name. This neutralizes any variation in the order files appear within the zip.
- Stream each file's bytes sequentially into a single SHA1 hash. The contents of `agency.txt`, then `calendar.txt`, then `calendar_dates.txt`, and so on, are all fed into one running hash computation.
The result is a 40-character hex string that represents the actual transit data, independent of how it was packaged. Re-zip the same GTFS files with different compression? Same Directory SHA1. Download the feed again tomorrow from a server that regenerates the zip? Same Directory SHA1, as long as the CSV contents haven't changed.
Notably, the algorithm does not reorder rows within files or normalize field ordering within CSVs. It operates on the raw byte contents of each file. This is a pragmatic choice: it avoids the complexity and potential bugs of CSV parsing at the checksum stage, while still achieving the goal of being packaging-independent.
There's also a diff command that can do deeper, row-level comparisons between two feed versions, but that's a topic for another blog post.
Try it yourself
The transitland command-line interface includes a checksum command that computes both hashes for any GTFS feed on your local machine.
Here's a real example using MBTA's static feed:
$ wget https://cdn.mbta.com/MBTA_GTFS.zip
$ transitland checksum MBTA_GTFS.zip
Zip SHA1 (archive file): 6d768217c1e441cc13f313595856669d4c24c013
Directory SHA1 (feed contents): 34e21ac222935b7acdd395e5f6b36dc996da0d60
Find via Transitland website: https://www.transit.land/feed-versions/6d768217c1e441cc13f313595856669d4c24c013
Find via Transitland REST API: https://transit.land/api/v2/rest/feed_versions/6d768217c1e441cc13f313595856669d4c24c013?apikey=YOUR_API_KEY
The CLI gives you everything you need to cross-reference the local file against Transitland's global archive. Click the website link and you'll land directly on that feed version's page, where you can see when it was fetched, what date range it covers, and explore its stops, routes, and agencies. The CLI's output can also be compared against another zip file you have locally.
This is useful in a few scenarios:
- Agency staff can verify that Transitland has successfully fetched and archived their latest feed. Publish an update, wait for the next fetch cycle, then run `transitland checksum` on the same file and check if the Zip SHA1 appears in the archive.
- Developers responsible for legacy systems can ensure their pipelines are only emitting new GTFS feeds when the Directory SHA1 changes from its previously cached value.
- Analysts who download GTFS feeds directly from agencies can confirm they're working with the same version that Transitland has processed, which is helpful when comparing your analysis against Transitland's API results.
- Researchers working with historical GTFS data can use the checksum to determine exactly which feed version in Transitland corresponds to a file they've been handed. They can then use Transitland's additional metadata to inform their research.
Installing the CLI
The transitland CLI is a single binary. You can:
- download it prebuilt for Linux and macOS from the transitland-lib releases page on GitHub
- install it using the Homebrew package manager
- build it from source with Go if you prefer
A small detail that makes a big difference
Checksumming transit feeds might sound like plumbing, but it's the kind of plumbing that makes a global transit data platform reliable. Without content-aware hashing, Transitland would either miss real updates (by relying on unreliable zip hashes) or drown in false positives (by treating every re-downloaded zip as new data).
The transitland checksum command also puts this capability in your hands. Whether you're an agency verifying that your feed updates are being picked up, or an analyst reconciling local data against the Transitland archive, you can generate the same fingerprints that Transitland uses internally.
If you have questions or want to explore further, check out the Transitland documentation or browse the transitland-lib source code on GitHub.