GIS data processing with Python

1. Introduction to GIS data processing

There are many sources of public data available in the US that can be used to create spatial visualizations, such as the Census Bureau, the American Community Survey, the Agricultural Census, and the County Health Rankings. However, datasets typically come in a wide variety of formats and require some processing before they are suitable for use with GIS software. This tutorial uses the Local Area Personal Income (CA1-3) dataset from the Bureau of Economic Analysis as its example because of its relative complexity, involving multiple files and format conversion. The data will then be joined with the US county map distributed by the Census Bureau.

2. Preparations

The following is the list of open-source, cross-platform software used in this tutorial.

QGIS - http://www.qgis.org/en/site/
Enthought Canopy - https://www.enthought.com/products/canopy/
Python dbf library - https://pypi.python.org/pypi/dbf/0.95.004

The following data is used with this tutorial.

TIGER/Line Shapefile US County - ftp://ftp2.census.gov/geo/tiger/TIGER2013/COUNTY/tl_2013_us_county.zip
Local Area Personal Income CA1-3 - http://www.bea.gov/regional/downloadzip.cfm

If you’re new to Python, Enthought Canopy is a good IDE for learning. Download the software, create a new file, and save it to a folder, making sure the file name ends with the .py extension. In the iPython window in the lower right, click “Change to Editor Directory” and check “Keep Directory Synced to Editor”. The folder you saved your script to is now your working directory; it should be the same folder where you unzipped the CA1-3 files, and it is also where you should put the dbf.py file from the dbf library.
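
If you prefer to set the working directory from code rather than through Canopy, a minimal sketch looks like this (the path is a placeholder; substitute the folder that holds your script, the unzipped CA1-3 files, and dbf.py):

import os

#example path only - point this at your own project folder
os.chdir("/path/to/your/working/folder")

#confirm the change took effect
print os.getcwd()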

3. Importing Data into Python

The CA1-3 data unzips into 59 files: 52 state files, 2 metadata files, and 5 files for Metropolitan / Micropolitan subdivisions. This tutorial concentrates on pairing the county information with the TIGER/Line shapefile, so keep only the 52 state files in the folder and move the rest elsewhere.

The GIS county shapefile uses the 5-digit FIPS county code to identify each county, and the CA1-3 data uses the same code in its GeoFIPS column, so the goal is to match the two datasets on this column. However, upon closer inspection of the CA1-3 data, each FIPS code appears 3 times, once each for Total Personal Income, Total Population, and Per Capita Personal Income. To make this data work for GIS, we need to write code that picks out only one of these measures. This can easily be done with the pandas library in Python.

#import the pandas library
import pandas as pd

#locate one of the csv files
read_csv = ("./CA1-3/CA1-3_AK.csv")

#import the csv file into python
st = pd.read_csv(read_csv)

Upon running the code, an EOF error is returned pointing at line 106. Lines 104 to 106 of the csv file contain 3 lines of notes. Since pandas cannot parse those notes, we need another piece of code to eliminate them, and this is an opportunity to introduce the csv library for reading and writing data files as another way to manage the data.
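
Depending on your pandas version and on how the note lines are quoted, you may also be able to skip them directly at read time with the skipfooter argument (spelled skip_footer in some older releases); a minimal sketch, assuming the notes always occupy the last three lines of each state file:

import pandas as pd

#skipfooter needs the slower python parsing engine
#the value 3 assumes exactly three trailing note lines
st = pd.read_csv("./CA1-3/CA1-3_AK.csv", skipfooter=3, engine='python')

This tutorial sticks with the csv-module approach below, since writing a cleaned copy of each file is also useful later when batch processing all 52 states.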

Since all the state files contain these 3 lines of notes beginning with “Note: See the included footnote file.”, we can take advantage of this pattern: read the original data file and write each row directly into a new file until the note is reached, then stop writing.

import csv

#locate one of the csv files
read_csv = ("./CA1-3/CA1-3_AK.csv")
#write the modified data to a new csv file
write_csv = ("./CA1-3/CA1-3_AK_mod.csv")

with open(write_csv,'wb') as f, open(read_csv,'rb') as w:
    writer = csv.writer(f, delimiter = ',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
    reader = csv.reader(w, delimiter = ',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
    for row in reader:
        if row == ['Note: See the included footnote file.']:
            break
        else:
            writer.writerow(row)

Once the code has been executed we get a new file named CA1-3_AK_mod.csv. This is the file we will bring into Python with pandas.

import csv
import pandas as pd

#locate one of the csv files
read_csv = ("./CA1-3/CA1-3_AK.csv")
#write the modified data to a new csv file
write_csv = ("./CA1-3/CA1-3_AK_mod.csv")

with open(write_csv,'wb') as f, open(read_csv,'rb') as w:
    writer = csv.writer(f, delimiter = ',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
    reader = csv.reader(w, delimiter = ',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
    for row in reader:
        #print row
        if row == ['Note: See the included footnote file.']:
            break
        else:
            writer.writerow(row)

st = pd.read_csv(write_csv)

Now that the data is in Python through pandas, we can continue processing it as described above. The idea is to pick out only the entries that relate to per capita personal income, which have a LineCode of 3, so we can use the following code to trim the dataset to that specific criterion.

import csv
import pandas as pd

#locate one of the csv files
read_csv = ("./CA1-3/CA1-3_AK.csv")
#write the modified data to a new csv file
write_csv = ("./CA1-3/CA1-3_AK_mod.csv")

with open(write_csv,'wb') as f, open(read_csv,'rb') as w:
    writer = csv.writer(f, delimiter = ',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
    reader = csv.reader(w, delimiter = ',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
    for row in reader:
        #print row
        if row == ['Note: See the included footnote file.']:
            break
        else:
            writer.writerow(row)

st = pd.read_csv(write_csv)

st = st[st["LineCode"]==3]

Looking at the dataset now with the command print st, we can see the number of rows has been reduced from 102 to 34. The next steps are to reduce the number of columns, since many of them do not contain information relevant for GIS, and to change the no-data value, represented in this dataset by “(NA)”, to 0 so that all the data can be represented as numeric values. Finally, we need to fix the FIPS codes: because the column is read as a number, any leading zero has been dropped, so we need to treat it as text instead.

import csv
import pandas as pd

#locate one of the csv files
read_csv = ("./CA1-3/CA1-3_AK.csv")
#write the modified data to a new csv file
write_csv = ("./CA1-3/CA1-3_AK_mod.csv")

with open(write_csv,'wb') as f, open(read_csv,'rb') as w:
    writer = csv.writer(f, delimiter = ',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
    reader = csv.reader(w, delimiter = ',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
    for row in reader:
        #print row
        if row == ['Note: See the included footnote file.']:
            break
        else:
            writer.writerow(row)

st = pd.read_csv(write_csv)

#replace the (NA) no-data markers with 0 and zero-pad GeoFIPS to 5 digits of text
for s in xrange (0, len(st)):
    st.loc[s] = st.loc[s].replace('(NA)',0)
    st.loc[s,['GeoFIPS']] = (format(st['GeoFIPS'].values[s],'05d'))

#keep only entries with LineCode 3
st = st[st["LineCode"]==3]

#dropping the irrelevant fields
st = st.drop(['Region'], axis = 1)
st = st.drop(['Table'], axis = 1)
st = st.drop(['LineCode'], axis = 1)
st = st.drop(['IndustryClassification'], axis = 1)
st = st.drop(['Description'], axis = 1)

Looking at the data again with print st, you should notice the index changed because of the line that drops all entries other than the ones with LineCode == 3. The leftmost column is the index value, and you can see it now skips two numbers between each row. Although it is not a big problem right now, we need to re-index the rows to start from 0 for data processing further down the line, then write the modified version back to the same CSV file under the same name. Add the following code to the end.

#reindex the data to start from 0
st = pd.DataFrame(st.values, index = range(0,len(st)), columns = st.columns)

st.to_csv(write_csv, index = False, header = True, quotechar = '"', quoting=csv.QUOTE_NONNUMERIC)
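
As an aside, pandas has a built-in shortcut for this re-indexing step; a one-line alternative to rebuilding the DataFrame, if you prefer it:

#drop the old index and renumber the rows from 0
st = st.reset_index(drop=True)

The rest of the tutorial keeps the explicit reconstruction shown above.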

4. Convert CSV format to DBF

Shapefiles store their attribute data in the dBASE (DBF) format, an old fixed-width format. Although the format is quite robust, there are a few limitations we have to deal with, such as field names that are limited to 10 characters and must start with a letter. The following code illustrates the nuances of working with DBF files.

We need to define the field names, field types, and field widths. Essentially we need to convert the column names into a single string of field definitions. To see the column names we use the following command.

In [34]: st.columns
Out[34]: Index([u'GeoFIPS', u'GeoName', u'1969', u'1970', u'1971', u'1972', u'1973', u'1974', u'1975', u'1976', u'1977', u'1978', u'1979', u'1980', u'1981', u'1982', u'1983', u'1984', u'1985', u'1986', u'1987', u'1988', u'1989', u'1990', u'1991', u'1992', u'1993', u'1994', u'1995', u'1996', u'1997', u'1998', u'1999', u'2000', u'2001', u'2002', u'2003', u'2004', u'2005', u'2006', u'2007', u'2008', u'2009', u'2010', u'2011', u'2012'], dtype='object')

In the DBF format each field name needs a definition of its field type. C(5) means a character field 5 characters wide; N(11,0) means a numeric field with 11 digits and 0 decimal places. The first column is the FIPS code: it should be treated as text so the leading zero does not get dropped, and 5 characters is enough. The second column, the county name, varies a lot in length, but 50 characters should be plenty. The rest of the columns are income values, and 11 digits appears to be the maximum. Note that the year columns are given an N prefix in the code below because DBF field names cannot begin with a digit.

import dbf

fnames = ''
for n in xrange (0, len(st.columns)):
    if n == 0:
        fname = (st.columns[n] + ' C(5); ')
        fnames = fnames + fname
    elif n == 1:
        fname = (st.columns[n] + ' C(50); ')
        fnames = fnames + fname
    elif n > 1 and n < len(st.columns)-1:
        fname = ('N' + st.columns[n] + ' N(11,0); ') 
        fnames = fnames + fname
    elif n == len(st.columns)-1:
        fname = ('N' + st.columns[n] + ' N(11,0)')
        fnames = fnames + fname

In [35]: fnames
Out[35]: 'GeoFIPS C(5); GeoName C(50); N1969 N(11,0); N1970 N(11,0); N1971 N(11,0); N1972 N(11,0); N1973 N(11,0); N1974 N(11,0); N1975 N(11,0); N1976 N(11,0); N1977 N(11,0); N1978 N(11,0); N1979 N(11,0); N1980 N(11,0); N1981 N(11,0); N1982 N(11,0); N1983 N(11,0); N1984 N(11,0); N1985 N(11,0); N1986 N(11,0); N1987 N(11,0); N1988 N(11,0); N1989 N(11,0); N1990 N(11,0); N1991 N(11,0); N1992 N(11,0); N1993 N(11,0); N1994 N(11,0); N1995 N(11,0); N1996 N(11,0); N1997 N(11,0); N1998 N(11,0); N1999 N(11,0); N2000 N(11,0); N2001 N(11,0); N2002 N(11,0); N2003 N(11,0); N2004 N(11,0); N2005 N(11,0); N2006 N(11,0); N2007 N(11,0); N2008 N(11,0); N2009 N(11,0); N2010 N(11,0); N2011 N(11,0); N2012 N(11,0)'
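
The same field-definition string can also be built a little more compactly; a sketch of an equivalent construction, assuming the first two columns are GeoFIPS and GeoName as shown above:

#character fields for the FIPS code and the county name
specs = ['GeoFIPS C(5)', 'GeoName C(50)']

#numeric fields for the year columns, prefixed with N so the field names do not start with a digit
for col in st.columns[2:]:
    specs.append('N' + col + ' N(11,0)')

fnames = '; '.join(specs)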

Now that all the column names are in a single string, we will use it to make a new DBF file.

table = dbf.Table("CA1-3_AK", fnames)

You should see a new file named CA1-3_AK.dbf in your working directory. The next step is to take the data from the CSV file and format it for the DBF. From the Python dbf website - http://pythonhosted.org//dbf/ - we learn that each record needs to be appended in a tuple format such as ('John Doe', 31, dbf.Date(1979, 9, 13)) or ('Ethan Furman', 102, dbf.Date(1909, 4, 1)). That means we need to turn each row of the data into a tuple and append it to the DBF table.

#open the table, turn each DataFrame row into a tuple, and append it as a record
table.open()
for s in xrange (0, len(st)):
    record = []
    for fn in st.columns:
        record.append(st.loc[s,fn])
    table.append(tuple(record))
table.close()
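
To check that the records made it in, you can reopen the table and look at the first record; a quick sketch using the dbf library's record iteration:

#reopen the table and print the first record as a sanity check
table.open()
for record in table:
    print record
    break
table.close()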

This is all the processing we need to convert a single CSV file from the BEA dataset into a DBF file that can be used with GIS. The whole code looks like this.

import csv
import pandas as pd
import dbf

#locate one of the csv files
read_csv = ("./CA1-3/CA1-3_AK.csv")
#write the modified data to a new csv file
write_csv = ("./CA1-3/CA1-3_AK_mod.csv")

with open(write_csv,'wb') as f, open(read_csv,'rb') as w:
    writer = csv.writer(f, delimiter = ',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC) 
    reader = csv.reader(w, delimiter = ',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)

    for row in reader:
        #print row 
        if row == ['Note: See the included footnote file.']:
            break
        else:
            writer.writerow(row)

st = pd.read_csv(write_csv)

#replace the (NA) no-data markers with 0 and zero-pad GeoFIPS to 5 digits of text
for s in xrange (0, len(st)):
    st.loc[s] = st.loc[s].replace('(NA)',0)
    st.loc[s,['GeoFIPS']] = (format(st['GeoFIPS'].values[s],'05d'))

#keep only entries with LineCode 3
st = st[st["LineCode"]==3]

#dropping the irrelevant fields
st = st.drop(['Region'], axis = 1)
st = st.drop(['Table'], axis = 1)
st = st.drop(['LineCode'], axis = 1)
st = st.drop(['IndustryClassification'], axis = 1)
st = st.drop(['Description'], axis = 1)

#reindex the data to start from 0
st = pd.DataFrame(st.values, index = range(0,len(st)), columns = st.columns) 

st.to_csv(write_csv, index = False, header = True, quotechar = '"', quoting=csv.QUOTE_NONNUMERIC)

fnames = '' 
for n in xrange (0, len(st.columns)):
    if n == 0:
        fname = (st.columns[n] + ' C(5); ')
        fnames = fnames + fname

    elif n == 1: 
        fname = (st.columns[n] + ' C(50); ')
        fnames = fnames + fname

    elif n > 1 and n < len(st.columns)-1:
        fname = ('N' + st.columns[n] + ' N(11,0); ')
        fnames = fnames + fname

    elif n == len(st.columns)-1: 
        fname = ('N' + st.columns[n] + ' N(11,0)') 
        fnames = fnames + fname 

table = dbf.Table("CA1-3_AK", fnames) 

table.open()
for s in xrange (0, len(st)):
    record = []
    for fn in st.columns:
        record.append(st.loc[s,fn]) 
    table.append(tuple(record))

#change dbf status to not read/writable
table.close()

Now, since there are 52 files to process, the code needs to be generalized so it can cycle through all of them. In addition, some error-trapping code needs to be added to avoid appending over existing csv or dbf files. Here is the full code that cycles through all 52 files inside the ./CA1-3 folder and writes a single DBF file.

import csv
import pandas as pd
import os.path
import dbf

list_csv = "./CA1-3/"
list_names = []
linecode = 1   #LineCode of the entries to keep (the walkthrough above used 3, per capita personal income)

for file in os.listdir(list_csv):
    if file.endswith(".csv"):
        list_names.append(file)

for i in list_names:

    read_csv = ("./CA1-3/" + i )
    write_csv = ("./CA1-3_mod/" + i )

    #check if write folder exist
    d = os.path.dirname(write_csv)
    if not os.path.exists(d):
        print "path don't exist... making folders for your file"
        os.makedirs(d)

    if os.path.isfile(write_csv):
        print (write_csv + " file already exists")

    else:
        with open(write_csv,'wb') as f, open(read_csv,'rb') as w:
            writer = csv.writer(f, delimiter = ',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
            reader = csv.reader(w, delimiter = ',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)

            for row in reader:

                #print row
                if row == ['Note: See the included footnote file.']:
                    break
                else:
                    writer.writerow(row)

        st = pd.read_csv(write_csv)

        #replace the (NA) no-data markers with 0 and zero-pad GeoFIPS to 5 digits of text
        for s in xrange (0, len(st)):
            st.loc[s] = st.loc[s].replace('(NA)',0)
            st.loc[s,['GeoFIPS']] = (format(st['GeoFIPS'].values[s],'05d'))

        #cast the year columns (everything after the 7 descriptive columns) to float
        #so the 0 values substituted for (NA) are treated as numbers
        tt = st.transpose()

        for t in xrange (1969, 1969+len(tt)-7):
            tt.loc[str(t)] = tt.loc[str(t)].astype(float)

        st = tt.transpose()

        #keep only entries with the selected LineCode
        st = st[st["LineCode"]==linecode]

        #dropping the irrelevant fields
        st = st.drop(['Region'], axis = 1)
        st = st.drop(['Table'], axis = 1)
        st = st.drop(['LineCode'], axis = 1)
        st = st.drop(['IndustryClassification'], axis = 1)
        st = st.drop(['Description'], axis = 1)

        #reindex the data to start from 0
        st = pd.DataFrame(st.values, index = range(0,len(st)), columns = st.columns)

        st.to_csv(write_csv, index = False, header = True, quotechar = '"', quoting=csv.QUOTE_NONNUMERIC)

        fnames = ''
        for n in xrange (0, len(st.columns)):
            if n == 0:
                fname = (st.columns[n] + ' C(5); ')
                fnames = fnames + fname
            elif n == 1:
                fname = (st.columns[n] + ' C(50); ')
                fnames = fnames + fname
            elif n > 1 and n < len(st.columns)-1:
                fname = ('N' + st.columns[n] + ' N(11,0); ')
                fnames = fnames + fname
            elif n == len(st.columns)-1:
                fname = ('N' + st.columns[n] + ' N(11,0)')
                fnames = fnames + fname

        #aggregate all data to single dbf file.
        if os.path.isfile(list_csv[2:-1]+'.dbf'):
            print (i[:-4] + '...')
        else:
            #make table and use filename with the last 4 characters truncated
            table = dbf.Table(list_csv[2:-1], fnames)

        #change dbf status to read/write
        table.open()

        for s in xrange (0, len(st)):
            record = []
            for fn in st.columns:
                record.append(st.loc[s,fn])
            table.append(tuple(record))

        #change dbf status to not read/writable
        table.close()

        print (i+" completed successfully...")

5. Joining with Shapefile

Finally, drag and drop the TIGER/Line county shapefile as well as the DBF file into QGIS. Double-click the shapefile layer and go to the Joins tab. Click Add and your DBF file should show up as the join layer; set the join field to GeoFIPS and the Target Field to GEOID, and the data is now paired with the county map.

