I’m going to kick off a multi-part series on US Census data by offering a totally free download, in XLS or CSV format, of something strangely hard-to-Google: the 2010 US Census population by Zip code (technically, by ZCTA). Splitwise is offering these files free of charge and in the public domain, and I can’t believe how many other sites are charging for them!
But the difficulty I had in creating this data set and using the US Census website has inspired me to write a bit more about how to use one of the world’s most interesting open data sources.
The US Census data sets are incredibly valuable, despite their origin as a matter of mere political bookkeeping. Ancient astronomers watched the movements of the sky for the practical task of navigating ships, without knowing that Kepler and then Newton would use their recordings to discover the laws of gravity. In a similar way, while the US Census was created to turn the crank of representative democracy, its beautiful data sets have surely been the basis for countless demographic, civic, and business insights unforeseen by the Census itself.
The US Census is both “big data” and “big science” – $13B for the 2010 Census (~$42 per capita), and the Census Bureau’s annual budget in 2012 was $1B. For comparison, the Large Hadron Collider at CERN, the world’s biggest physics experiment, had a budget of $9B and an annual operating cost of $1.1B in 2012.
Considering just the decennial census, and not any of the supplementary work, the summary tables alone contain 73,028 census tracts (not even the smallest geographic region in use) and 8,940 different query variables, many as arcane as “P0410001, Concept P41: Grandchildren under 18 years living with grandparent householder.” Using US Census data can be very intimidating indeed, and my sense is that the Census Bureau itself, faced with a Herculean task, has only a limited understanding of what summary data products would be most useful to publish.
Luckily for normal people who don’t enjoy hunting through messy Excel with strange jargon, the US Census has at long last released a simple, lightweight, JSON API for pulling data out of these arcane databases. This makes using the data much, much more straightforward. (All that you have to do is untangle the jargon-filled API documentation.)
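To give a flavor of what that looks like, here is a minimal Python sketch, not an official recipe. It assumes the API lives at the endpoint shown, that a query is just a URL with `get` and `for` parameters, and that the response is a JSON list of rows whose first row holds the column names. The endpoint path, the variable name `P001001` (which I'm taking to be total population), and the helper names `build_query_url` and `rows_to_records` are all my own assumptions for illustration, and the numbers in the sample are made-up placeholders, not real Census figures.

```python
from urllib.parse import urlencode

# Assumed endpoint for the 2010 decennial summary file; check the
# Census API documentation for the current path before relying on it.
BASE_URL = "https://api.census.gov/data/2010/dec/sf1"


def build_query_url(variables, geography, api_key=None):
    """Build a query URL (hypothetical helper, not part of any Census library)."""
    params = {"get": ",".join(variables), "for": geography}
    if api_key:
        params["key"] = api_key
    return BASE_URL + "?" + urlencode(params)


def rows_to_records(rows):
    """Turn a header-first list of rows into a list of dicts, one per data row."""
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


# Placeholder rows shaped like the assumed response format.
# These values are invented for illustration only.
sample_response = [
    ["P001001", "zip code tabulation area"],
    ["12345", "00001"],
    ["678", "00002"],
]

url = build_query_url(["P001001"], "zip code tabulation area:*")
records = rows_to_records(sample_response)

for record in records:
    print(record["zip code tabulation area"], record["P001001"])
```

To run it against the live service you would fetch `url` with something like `urllib.request.urlopen` and pass the decoded JSON to `rows_to_records`. Note that the values come back as strings in this sketch, so convert them with `int()` before doing any arithmetic.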
In a series of posts to follow, I will document my journey through the US Census data as a newcomer, and share the tools I used to make the data so much easier to work with. The posts will assume that you are “a normal data analyst or consultant,” by which I mean you are good at Excel and/or Google Docs, but don’t know much about APIs and might not even know what an API is.
P.S. An FAQ on why the hell Splitwise is doing this
Q: Wait, what? I thought you made that bill-splitting app that my roommates and I use?
A: Well, yeah, but we’re also working on a new set of fairness calculators and needed to do some background research. And also, we’re just nerding out.