Removing unwanted units from data with “chomp” in Google Refine

I recently wanted to make a county-by-county map of food stamp use, and a Google search turned up this article which has data from 2009. It’s a bit out of date, but as the data is meant to drive a map-making example, it will serve my purpose – especially since the authors were good enough to make the data available:

datasource

The source data is structured and complete, but there’s a problem: there’s a “%” sign on the end of all the readings. We won’t be able to do any sort of math on this or use it as a source for a Fusion Tables choropleth map unless we can remove those first.

There are a few different ways to do this, but I prefer using Google Refine (soon to be Open Refine). It’s powerful, versatile, free, and open-source. All the good things. Plus, you can point it directly to the URL of your data, like so:

URLDataSource

It also happily opens files from your computer as well.

Click “Next” and then in the preview pane, check “Ignore first…lines”:

updatePreview

Change the value from 0 to 1, and click “Update Preview” so that the preview changes from:

BeforeUpdate

to:

afterUpdate

Be sure to make a note of the source and contact information, and keep that original URL handy.

Now we’re ready to tackle those “%” signs. Click the down arrow next to the header of the PctOnFoodStamps column, and under “Edit column” choose “Add column based on this column..”:

addColumn

In the resulting popup, enter a new column name, and in the expression box, enter chomp(value, "%"). The preview pane should show just the number, sans percent sign, in the right-hand column:

chompformula

Copy the formula to your clipboard, then click “OK”.

Your file will now have two copies of the same data; one with the “%”, one without. You’ll want to delete the original column, by once again clicking the down arrow next to the header of the PctOnFoodStamps column, and under “Edit column” selecting “Remove this column”:

removeColumn

You can then repeat this process with the remaining columns, pasting in the formula from your clipboard each time. Once you’ve finished, you can export your project in your format of choice (I often prefer tab-separated) and continue with your analysis or visualization.

The chomp function isn’t limited to removing just percent signs, of course. You can have it chomp any string from the end of your data. Use it to remove whole words, phrases, or sentences, or even extra zeroes.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>