GeoJSONL: An optimized format for large geographic datasets



March 8, 2019: We've published a follow-up blog post: Even more geospatial tools supporting the "GeoJSONL" format. Now you can stream GeoJSONL files up to HERE XYZ, down to a web browser, or between your own programs using a growing list of software libraries.

GeoJSON is a popular format for exchanging geographic data. It builds on the success and flexibility of JSON to describe structured data, and simplifies many of the more complicated aspects of traditional GIS file formats. Initially developed through an ad hoc process, GeoJSON is now formalized by IETF RFC 7946, with a working group to guide future additions. The universality of the underlying JSON format and the WGS84 coordinate system make GeoJSON nearly ideal for web-based mapping projects.

However, very large data sets are common in GIS, and the structure of a GeoJSON file generally requires the entire file to be read into memory and decoded all at once. A large file (e.g. 1 GB) can easily exhaust the resources of a desktop computer. While streaming JSON parsers exist, they are generally not part of standard libraries, so using one adds another dependency and more complexity.

One potential solution to the problem of large JSON files is to restructure the file as a sequence of objects, one per line, rather than a single root object. Because the JSON specification prohibits an unescaped newline character (\n) inside string literals, any JSON object can be serialized on a single line. The resulting file is compatible with standard UNIX text-processing tools and can be read line by line with existing JSON parsers.

This convention has arisen independently in many contexts, and is known as Newline Delimited JSON (ndjson), JSON Lines (jsonl), or GeoJSON Text Sequences.* For GeoJSON specifically, the root-level FeatureCollection object is replaced with a plain sequence of features, one per line, with no enclosing array or object. This file can then be read line-by-line, feature-by-feature, and easily integrated with other tools that use newline-delimited records such as GNU parallel. Newline-delimited GeoJSON (GeoJSONL) has been casually proposed several times before, and is natively supported by some tools such as the Osmium export command, Mapbox’s Tippecanoe, and jq, the Swiss-army knife of JSON processors.
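To make the restructuring concrete, here is a minimal sketch of turning a FeatureCollection into GeoJSONL text using only the Python standard library. The function name and the two sample features are made up for illustration, and the sketch assumes the FeatureCollection is small enough to have been decoded already.

```python
import json

def feature_collection_to_geojsonl(collection):
    """Return GeoJSONL text: one compact JSON-encoded feature per line."""
    lines = []
    for feature in collection["features"]:
        # json.dumps escapes a newline inside a string as the two
        # characters \ and n, so each feature stays on a single line.
        lines.append(json.dumps(feature, separators=(",", ":")))
    return "\n".join(lines) + "\n"

# A tiny, made-up FeatureCollection (not taken from any real extract).
example = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [-157.85, 21.30]},
         "properties": {"name": "Line one\nline two"}},
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [-157.80, 21.28]},
         "properties": {"name": "Second point"}},
    ],
}

geojsonl_text = feature_collection_to_geojsonl(example)
print(geojsonl_text, end="")
```

Note that the first feature's name contains a newline, yet the output still has exactly one line per feature, because the newline is escaped during encoding.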

At Interline, our OSM Extract service now provides these files as well as traditional GeoJSON and OSM PBF formats, with the goal of providing a simple basis for reading and filtering OSM extracts without the need for more complicated and specialized libraries. Here is an example GeoJSONL file, which you can compare to regular GeoJSON.

While the file sizes are nearly identical, reading and parsing the entire ~27 MB GeoJSON file requires several seconds and ~240 MB of memory. Iterating through the GeoJSONL file line-by-line takes a similar amount of time, but uses negligible amounts of memory:

import json

# Read one line at a time; only the current feature is held in memory.
with open('honolulu_hawaii.geojsonl') as f:
    for line in f:
        feature = json.loads(line)
        print(feature)

The above Python snippet also demonstrates that no additional dependencies are required to read the GeoJSONL file; iterating over the lines and parsing each one separately works well. Still, libraries that provide native support are available for many languages.
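The same line-by-line approach extends naturally to filtering. The following is a rough sketch, not a definitive implementation: the function names are hypothetical, and the three sample features are invented so the example runs without any file on disk. With real data, open file objects could be passed in place of the in-memory buffers.

```python
import io
import json

def filter_geojsonl(source, destination, keep):
    """Copy features for which keep(feature) is true; return how many were written."""
    written = 0
    for line in source:
        if not line.strip():
            continue  # tolerate blank lines
        feature = json.loads(line)
        if keep(feature):
            destination.write(json.dumps(feature) + "\n")
            written += 1
    return written

def has_highway_tag(feature):
    """True when the feature carries a non-null highway property."""
    properties = feature.get("properties") or {}
    return properties.get("highway") is not None

# Invented sample input standing in for a GeoJSONL file.
source = io.StringIO(
    '{"type":"Feature","geometry":null,"properties":{"highway":"primary","name":"Kamehameha Highway"}}\n'
    '{"type":"Feature","geometry":null,"properties":{"amenity":"cafe"}}\n'
    '{"type":"Feature","geometry":null,"properties":{"highway":"residential"}}\n'
)
destination = io.StringIO()

count = filter_geojsonl(source, destination, has_highway_tag)
print(count)
print(destination.getvalue(), end="")
```

Because each feature is decoded, tested, and written before the next line is read, memory use stays roughly constant regardless of the input size.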

As mentioned above, GeoJSONL also provides excellent input for traditional text-processing tools. The following (contrived) example runs jq to find the frequency of names for features with the highway tag, without loading the entire file into memory:

$ cat honolulu_hawaii.geojsonl | jq 'select(.properties.highway!=null) | .properties.name' | sort | uniq -c | sort -n
...snip...
  36 "State Highway 83"
  49 "John A. Burns Freeway"
  59 "Makakilo Drive"
 106 "Farrington Highway"
 107 "Kamehameha Highway"

Given the increasing complexity and importance of geographic data in nearly every domain, and with increasing file sizes, GeoJSONL is useful and timely. We’re glad to now support the format in Interline OSM Extracts and we encourage other tools to offer native support when possible.


Notes

* Minor differences exist between ndjson and jsonl.

GeoJSON Text Sequences adds a record separator (RS) character, in addition to the newline used by ndjson and jsonl.

Based on these existing options, we recommend the use of ndjson, which has a simple spec document and allows the presence of blank lines. This supports the widest range of tools for parsing and consuming. We use geojsonl as the file extension, although this is just a matter of taste.
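The differences described in these notes are small enough that a reader can tolerate all three variants. The sketch below is one possible way to do that, assuming the RS character (0x1E) appears only at the start of a record, as in GeoJSON Text Sequences; the function name and sample records are hypothetical.

```python
import json

RS = "\x1e"  # ASCII record separator used by GeoJSON Text Sequences

def parse_records(lines):
    """Yield decoded features, skipping blank lines and any leading RS character."""
    for line in lines:
        text = line.lstrip(RS).strip()
        if not text:
            continue  # blank line, which ndjson permits
        yield json.loads(text)

# Invented input mixing an RS-prefixed record, a blank line, and a plain record.
mixed = [
    RS + '{"type":"Feature","geometry":null,"properties":{"id":1}}\n',
    "\n",
    '{"type":"Feature","geometry":null,"properties":{"id":2}}\n',
]

ids = [feature["properties"]["id"] for feature in parse_records(mixed)]
print(ids)  # [1, 2]
```

One caveat when adapting this: Python's str.splitlines() treats 0x1E as a line boundary, so it is safer to iterate over a file object or split on "\n" explicitly.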

Originally posted to the Interline blog on September 11, 2018.

Written by: