
Jesse Lecy lecy

@technickle
technickle / ValidateOpen311GeoReportBulk.r
Last active December 15, 2022 19:42
R validator script for Open311 GeoReport Bulk specification compatibility
# this R script evaluates a data file for compatibility with the Open311 GeoReport Bulk specification.
# see here for the most recent version of the specification:
# http://wiki.open311.org/GeoReport/bulk
#
# it implements nearly all of the checks identified in this document
# https://docs.google.com/document/d/1GLRniiT3xvmG-i6PPeZPZDK_FhBDGCpuVh5fCexEiys/preview
# however, it is very bare bones and the results need to be interpreted.
#
# written by Andrew Nicklin (@technickle) with contributions from the Open311 community.
#

Instructions

These instructions will help you analyze the IRS 990 public dataset. Start by reading through the documentation over at Amazon. There's a ~108MB index file called index.json.gz that contains metadata describing the entire corpus.

To download the index.json.gz metadata file, issue the following command: curl -O https://s3.amazonaws.com/irs-form-990/index.json.gz (without -O, curl writes the compressed bytes to the terminal instead of saving a file). Once you've downloaded the index.json.gz file, you can extract its contents with the following command: gunzip index.json.gz. To take a peek at the extracted contents, use the following command: head index.json.
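If you'd rather peek at the index without extracting it first, the same idea can be sketched in Python. This is a minimal sketch, not part of the original instructions: the function name peek_gzip and the 500-character default are illustrative assumptions, and it assumes index.json.gz has already been downloaded into the current directory.

```python
import gzip


def peek_gzip(path, n_chars=500):
    """Return the first n_chars characters of a gzip-compressed text file.

    peek_gzip is a hypothetical helper name used only for this sketch.
    Reading a fixed number of characters keeps the output bounded even
    when the file is large.
    """
    # "rt" opens the compressed stream in text mode, decompressing on the fly.
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return fh.read(n_chars)


# Example usage (assumes the file was downloaded as described above):
# print(peek_gzip("index.json.gz"))
```

Because gzip.open decompresses as it reads, only the start of the file is touched and no extracted copy is written to disk.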

Looking at the index.json file, you'll notice that it holds JSON stored as plain text: an array of JSON objects that look like the following:

{"EIN": "721221647", "SubmittedOn": "2016-02-05", "TaxPeriod": "201412", "DLN": "93493309001115", "LastUpdated": "2016-03-21T17:2