One of the most vexing datasets for a data analyst to make usable is the earthquake dataset available via the NCEDC search site. The site delivers results quickly enough to an anonymous FTP site, but the results contain no columns for country, state, county, or city. Those columns are some of the most useful for analyzing questions such as: “True or false: Oklahoma now rivals and even exceeds California for the number of significant earthquakes?”
Believe it or not, the answer to the preceding question is “True,” and it is easy to confirm once the earthquakes have been reverse-geocoded and can be analyzed with relatively simple SQL queries.
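To give a feel for what "relatively simple SQL" means here, the following is a minimal sketch using Python's built-in `sqlite3` module. The table name `quakes`, its columns, the magnitude cutoff of 3.0, and the handful of rows are all assumptions made up for illustration; they are not the real dataset or its schema.

```python
import sqlite3

# Hypothetical schema and placeholder rows, NOT real earthquake data.
PLACEHOLDER_ROWS = [
    ("2015-01-01", 4.1, "US", "Oklahoma"),
    ("2015-01-02", 3.2, "US", "Oklahoma"),
    ("2015-01-03", 2.1, "US", "Oklahoma"),
    ("2015-01-04", 3.5, "US", "California"),
    ("2015-01-05", 2.9, "US", "California"),
]


def load_placeholder_db():
    """Create an in-memory database holding the placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE quakes (event_date TEXT, mag REAL, cc TEXT, admin1 TEXT)"
    )
    conn.executemany("INSERT INTO quakes VALUES (?, ?, ?, ?)", PLACEHOLDER_ROWS)
    return conn


def count_significant(conn, min_mag=3.0):
    """Count quakes at or above min_mag per state, largest count first."""
    sql = """
        SELECT admin1, COUNT(*) AS n
        FROM quakes
        WHERE cc = 'US' AND mag >= ?
        GROUP BY admin1
        ORDER BY n DESC, admin1
    """
    return conn.execute(sql, (min_mag,)).fetchall()


if __name__ == "__main__":
    for state, n in count_significant(load_placeholder_db()):
        print(state, n)
```

The point is only that once `cc` and `admin1` columns exist, a state-by-state comparison is a single `GROUP BY`.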
The difficulty was in reverse-geocoding the latitudes and longitudes to their respective countries, states, counties, and cities. Originally, I had authored a Java program that used various ESRI shape files to determine which administrative units a lat/long belonged to. It worked, if you were willing to wait 12 hours for it to run.
Given the long run time and inconvenience of obtaining the shape files, I declined to publish it except as source code with little, if any, explanation as to its operation and use. I just didn’t think it was suitable for public consumption yet, as the reverse-geocoding was only tediously repeatable. I knew there was a better way, and as of a few weeks ago, after some research, I authored a better mousetrap:
- a Python 3.4 script
- using the reverse-geocoder package
- which uses K-D trees
- and datasets from GeoNames
- reverse-geocoding 2.8 million rows in approximately 210 seconds
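To illustrate why a K-D tree makes this kind of lookup fast, here is a small standard-library sketch of nearest-place search. It is not the script described above and does not use the reverse-geocoder package; the names `Place`, `build`, `nearest`, and `reverse_geocode` are hypothetical, and the five hand-typed cities with approximate coordinates stand in for a GeoNames dataset.

```python
import math
from typing import NamedTuple, Optional


class Place(NamedTuple):
    lat: float
    lon: float
    name: str
    admin1: str
    cc: str


class Node(NamedTuple):
    place: Place
    xyz: tuple
    axis: int
    left: Optional["Node"]
    right: Optional["Node"]


def to_xyz(lat, lon):
    """Map lat/long to a point on the unit sphere, so straight-line
    distance ranks neighbours the same way great-circle distance does."""
    la, lo = math.radians(lat), math.radians(lon)
    return (math.cos(la) * math.cos(lo), math.cos(la) * math.sin(lo), math.sin(la))


def build(places, depth=0):
    """Recursively split the places on x, y, z in turn (a 3-D K-D tree)."""
    if not places:
        return None
    axis = depth % 3
    ordered = sorted(places, key=lambda p: to_xyz(p.lat, p.lon)[axis])
    mid = len(ordered) // 2
    place = ordered[mid]
    return Node(place, to_xyz(place.lat, place.lon), axis,
                build(ordered[:mid], depth + 1),
                build(ordered[mid + 1:], depth + 1))


def nearest(node, target, best=None):
    """Return (squared_distance, place) for the place closest to target."""
    if node is None:
        return best
    d2 = sum((a - b) ** 2 for a, b in zip(node.xyz, target))
    if best is None or d2 < best[0]:
        best = (d2, node.place)
    diff = target[node.axis] - node.xyz[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    # Only search the far side if the splitting plane is closer than the best hit.
    if diff * diff < best[0]:
        best = nearest(far, target, best)
    return best


def reverse_geocode(tree, lat, lon):
    """Look up the nearest known place for one lat/long pair."""
    return nearest(tree, to_xyz(lat, lon))[1]


# Stand-in gazetteer: approximate city coordinates, typed by hand.
GAZETTEER = [
    Place(35.47, -97.52, "Oklahoma City", "Oklahoma", "US"),
    Place(36.15, -95.99, "Tulsa", "Oklahoma", "US"),
    Place(34.05, -118.24, "Los Angeles", "California", "US"),
    Place(37.77, -122.42, "San Francisco", "California", "US"),
    Place(35.68, 139.69, "Tokyo", "Tokyo", "JP"),
]
tree = build(GAZETTEER)

if __name__ == "__main__":
    hit = reverse_geocode(tree, 35.5, -97.4)
    print(hit.name, hit.admin1, hit.cc)
```

Because each lookup discards whole branches of the tree instead of testing every polygon or every place, the cost per row grows roughly with the logarithm of the gazetteer size, which is the kind of saving that turns hours into minutes.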
So, in the next few weeks, I will push the Python script to GitHub and the reverse-geocoded earthquake dataset to http://frackingdata.info/downloads. I will publish a post, or several, when this is done.
How cool, from 12 hours to 210 seconds.
Finally, some progress…
12 April 2016