Spatial grid systems in BigQuery

Spoiler: postal codes are not polygons

Travis Webb
3 min read · Jul 11, 2024

We recently published a new guide in our BigQuery documentation on using grid systems for spatial analysis. I’d like to share the inside scoop on how this guide came to be, why we wrote it, and how it can help you improve the way you do geospatial analysis in BigQuery.

Ever since BigQuery added geospatial capabilities in 2018, it has used S2 geometry under the hood to organize the data in spatial tables. About six months later, the ability to cluster tables on a GEOGRAPHY column was made generally available to our Cloud customers. This made most kinds of spatial queries much faster and cheaper to run.

S2 is a fine system for organizing spatial data into regular grid cells, as is H3, developed by Uber. Both slice the world into smaller sections that can be reasoned about, indexed, queried, and compared. They are very useful. But many of the customers I work with are not using them.

That’s how we always did it

This isn’t to say they are using nothing, because many use cases require grouping things together somehow. Unfortunately, one of the most common methods I run across is to group nearby features by postal code, census tract, or county.

Retailers like to compare demographics by postal code or county. Insurers often tally results by census tract.

These use cases can work reasonably well much of the time, but maintenance headaches and data quality issues inevitably creep in. Census tracts change. Counties annex other counties. And postal codes are not actually polygons at all, and they never were: a US ZIP Code identifies a set of mail delivery routes, not a bounded area on a map.

The US Census Bureau has dutifully published ZIP Code Tabulation Area (ZCTA) datasets, lending this dubious method the imprimatur of the US government and making it appear that postal codes are more useful for “tabulation” than they really are. For a long time, they were the only readily available way to subdivide Earth’s surface into manageable chunks, so their ubiquity is not surprising. But there is a better way.

S2 and H3 grid systems

When you’re comparing areas, you don’t usually want those areas to be arbitrary shapes drawn by humans, or worse, by politicians. You want a regular subdivision that you (and your database) can reason about mathematically and logically. You want simple four-, five-, or six-sided shapes instead of a complex boundary with thousands of points. You want to be able to easily “zoom” in and out to expand or narrow your analysis’s area of interest. You want to quickly and efficiently navigate from one area to an adjacent area without needing to perform a spatial intersection.
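To make the "zoom" and "adjacent area" ideas concrete, here is a minimal Python sketch of a toy hierarchical grid. To be clear, this is not S2 or H3: it is a plain latitude/longitude quadtree that I am assuming purely for illustration, and the names `Cell`, `cell_for_point`, `parent`, `children`, and `neighbors` are hypothetical. Its cells are not equal-area, which is one of the problems the real grid systems are designed to reduce. What it does show is that moving between zoom levels and between adjacent cells is integer arithmetic, with no geometry involved.

```python
from typing import List, NamedTuple


class Cell(NamedTuple):
    """A toy grid cell: level L splits the globe into 2**L rows and 2**L columns."""
    level: int
    row: int
    col: int


def cell_for_point(lat: float, lng: float, level: int) -> Cell:
    """Return the cell at `level` that contains the point (degrees)."""
    n = 2 ** level
    # Scale each coordinate to [0, n) and clamp the upper edge (lat=90, lng=180).
    row = min(int((lat + 90.0) / 180.0 * n), n - 1)
    col = min(int((lng + 180.0) / 360.0 * n), n - 1)
    return Cell(level, row, col)


def parent(cell: Cell) -> Cell:
    """Zoom out one level: integer division, no geometry needed."""
    if cell.level == 0:
        raise ValueError("the level-0 cell has no parent")
    return Cell(cell.level - 1, cell.row // 2, cell.col // 2)


def children(cell: Cell) -> List[Cell]:
    """Zoom in one level: each cell splits into four."""
    return [
        Cell(cell.level + 1, 2 * cell.row + dr, 2 * cell.col + dc)
        for dr in (0, 1)
        for dc in (0, 1)
    ]


def neighbors(cell: Cell) -> List[Cell]:
    """Edge-adjacent cells, found by index arithmetic instead of intersection."""
    n = 2 ** cell.level
    found: List[Cell] = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        row = cell.row + dr
        if not 0 <= row < n:
            continue  # nothing beyond the poles in this toy grid
        # Columns wrap around the antimeridian.
        candidate = Cell(cell.level, row, (cell.col + dc) % n)
        if candidate != cell and candidate not in found:
            found.append(candidate)
    return found


# A point in New York City, indexed at an arbitrary level.
nyc = cell_for_point(40.7, -74.0, 10)
print(nyc, parent(nyc), neighbors(nyc))
```

Real grid systems offer the same kinds of operations (parent, children, neighbors) on properly designed cells, so in practice you would reach for those rather than roll your own.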

Crucially for machine-learning use cases, which often need to look at examples over long periods of time for training, a regular grid system provides your training process with an apples-to-apples comparison of area statistics over time. Because administrative boundaries (like postal codes) change over time, these changes can heavily influence and distort the training process.
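The apples-to-apples point can be sketched in a few lines of Python. This is only an illustration under assumptions of mine: the events are made up, the 0.5-degree cell size is arbitrary, and the fixed-degree `cell_id` function is a hypothetical stand-in for a real S2 or H3 cell ID. The idea is that the cell key depends only on the coordinates, so a count for a given cell in 2019 and a count for the same cell in 2024 always describe the same patch of ground.

```python
import math
from collections import Counter
from typing import Iterable, Tuple

CELL_DEGREES = 0.5  # arbitrary cell size, assumed for this example

CellId = Tuple[int, int]


def cell_id(lat: float, lng: float) -> CellId:
    """Map a point to a fixed grid cell; the mapping never changes over time."""
    return (math.floor(lat / CELL_DEGREES), math.floor(lng / CELL_DEGREES))


def counts_by_cell_and_year(
    events: Iterable[Tuple[int, float, float]]
) -> Counter:
    """Count events per (year, cell). Each event is (year, lat, lng)."""
    counts: Counter = Counter()
    for year, lat, lng in events:
        counts[(year, cell_id(lat, lng))] += 1
    return counts


# Made-up events, for illustration only.
events = [
    (2019, 40.71, -74.01),
    (2019, 40.73, -74.05),
    (2024, 40.72, -74.02),
    (2024, 34.05, -118.24),
]

counts = counts_by_cell_and_year(events)
nyc = cell_id(40.71, -74.01)

# Both years are keyed by the same cell, so they are directly comparable.
print(counts[(2019, nyc)], counts[(2024, nyc)])
```

With postal codes or tracts as the key, the same comparison would also require knowing which version of the boundary each year's count was tallied against.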

Regular grid systems like S2 and H3 give you all these properties. They are easier on your database engine (and, therefore, your wallet), and easier on your eyes when you need to visualize the results of your analysis.

To learn more about how to incorporate these into your BigQuery analyses, check out the full guide.
