Geocoding for Data Scientists – KDnuggets



When data scientists need to know everything there is to know about the “where” of their data, they often turn to Geographic Information Systems (GIS). GIS is a complicated set of technologies and programs that serve a wide variety of purposes, but the University of Washington provides a fairly comprehensive definition, saying “a geographic information system is a complex arrangement of associated or connected things or objects, whose purpose is to communicate knowledge about features on the surface of the earth” (Lawler et al). GIS encompasses a broad range of methods for processing spatial data from acquisition to visualization, many of which are valuable tools even if you are not a GIS specialist. This article provides an overview of geocoding, with Python demonstrations of several practical applications. Specifically, you will determine the exact location of a pizza parlor in New York City, New York using its address and connect it to data about nearby parks. While the demonstrations use Python code, the core concepts can be applied in many programming environments to integrate geocoding into your workflow. These tools provide the basis for transforming data into spatial data and open the door to more complex geographic analysis.





Geocoding is most commonly defined as the transformation of address data into mapping coordinates. Usually, this involves detecting a street name in an address, matching that street to the boundaries of its real-world counterpart in a database, then estimating where on the street to place the address using the street number. As an example, let’s go through the process of a simple manual geocode for the address of a pizza parlor in New York on Broadway: 2709 Broadway, New York, NY 10025. The first task is finding appropriate shapefiles for the road system of the location of your address. Note that in this case the city and state of the address are “New York, NY.” Fortunately, the city of New York publishes detailed road information on the NYC Open Data page (CSCL PUB). Second, examine the street name “Broadway.” You now know that the address can lie on any street called “Broadway” in New York City, so you can execute the following Python code to query the NYC Open Data SODA API for all streets named “Broadway.”

import geopandas as gpd
import requests
from io import BytesIO

# Request the data from the SODA API
req = requests.get(
    " resource/gdww-crzy.geojson?stname_lab=BROADWAY"
)
# Convert to a stream of bytes
reqstrm = BytesIO(req.content)
# Read the stream as a GeoDataFrame
ny_streets = gpd.read_file(reqstrm)


There are over 700 results for this query, but that doesn’t mean you have to check 700 streets to find your pizza. Visualizing the data, you can see that there are three main Broadway streets and a few smaller ones.




The reason for this is that each street is broken up into sections that correspond roughly to a block, allowing for a more granular look at the data. The next step of the process is determining exactly which of these sections the address is on using the ZIP code and street number. Each street segment in the dataset contains address ranges for the buildings on both the left and right sides of the street. Similarly, each segment contains the ZIP code for both the left and right sides of the street. To locate the correct segment, the following code applies filters to find the street segment whose ZIP code matches the address’s ZIP code and whose address range contains the street number of the address.

import pandas as pd

# Address to be geocoded
address = "2709 Broadway, New York, NY 10025"
zipcode = address.split(" ")[-1]
street_num = int(address.split(" ")[0])

# House numbers may arrive as text, so convert them before comparing numerically
l_low = pd.to_numeric(ny_streets["l_low_hn"], errors="coerce")
l_high = pd.to_numeric(ny_streets["l_high_hn"], errors="coerce")
# Find street segments whose left side address ranges contain the street number
potentials = ny_streets.loc[(l_low <= street_num) & (l_high >= street_num)]
# Find street segments whose ZIP code matches that of the address
potentials = potentials.loc[potentials["l_zip"] == zipcode]


This narrows the list to the one street segment seen below.
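
The filtering logic can also be illustrated without any GIS libraries. The following is a minimal sketch on made-up segments: the field names mirror the ones used above, while the values and the `find_segments` helper are hypothetical. It also shows why house numbers are best compared as numbers and not as text.

```python
# Hypothetical street segments; field names mirror the ones used above,
# but the values are made up for illustration.
segments = [
    {"id": "A", "l_low_hn": "2601", "l_high_hn": "2699", "l_zip": "10025"},
    {"id": "B", "l_low_hn": "2701", "l_high_hn": "2799", "l_zip": "10025"},
    {"id": "C", "l_low_hn": "2701", "l_high_hn": "2799", "l_zip": "10463"},
    {"id": "D", "l_low_hn": "300", "l_high_hn": "398", "l_zip": "10025"},
]


def find_segments(segments, street_num, zipcode):
    """Return ids of segments whose left-side range and ZIP code both match."""
    number = int(street_num)
    matches = []
    for seg in segments:
        # Compare as integers: as strings, "2709" < "300" would be True.
        in_range = int(seg["l_low_hn"]) <= number <= int(seg["l_high_hn"])
        if in_range and seg["l_zip"] == zipcode:
            matches.append(seg["id"])
    return matches


matched = find_segments(segments, "2709", "10025")
print(matched)
```

Only segment "B" satisfies both the range filter and the ZIP code filter, which mirrors how the real query collapses to a single segment.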




The final task is to determine where the address lies on this line. This is done by placing the street number inside the address range for the segment, normalizing to determine how far along the line the address should be, and applying that constant to the coordinates of the endpoints of the line to get the coordinates of the address. The following code outlines this process.

import numpy as np
from shapely.geometry import Point

# Calculate how far along the street to place the point
denom = (
    potentials["l_high_hn"].astype(float) - potentials["l_low_hn"].astype(float)
).values[0]
normalized_street_num = (
    float(street_num) - potentials["l_low_hn"].astype(float).values[0]
) / denom

# Define a point that far along the street
# Move the line to start at (0,0)
pizza = np.array(potentials["geometry"].values[0].coords[1]) - np.array(
    potentials["geometry"].values[0].coords[0]
)
# Multiply by normalized street number to get coordinates on line
pizza = pizza * normalized_street_num
# Add starting segment to place line back on the map
pizza = pizza + np.array(potentials["geometry"].values[0].coords[0])
# Convert to geometry array for geopandas, reusing the CRS of the street data
pizza = gpd.GeoDataFrame(
    {"address": [address], "geometry": [Point(pizza[0], pizza[1])]},
    crs=ny_streets.crs,
    geometry="geometry",
)
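
The interpolation step can be checked by hand on a toy segment. This is a minimal sketch with made-up endpoints and a made-up address range; `interpolate_address` is a hypothetical helper, not part of geopandas or shapely.

```python
def interpolate_address(start, end, low_hn, high_hn, street_num):
    """Place a house number proportionally between two segment endpoints."""
    if high_hn == low_hn:
        fraction = 0.0  # single-address segment: fall back to the start point
    else:
        fraction = (street_num - low_hn) / (high_hn - low_hn)
    # Shift to the origin, scale by the fraction, then shift back
    x = start[0] + fraction * (end[0] - start[0])
    y = start[1] + fraction * (end[1] - start[1])
    return fraction, (x, y)


# Made-up segment from (0, 0) to (100, 50) covering house numbers 2700-2800
fraction, point = interpolate_address((0.0, 0.0), (100.0, 50.0), 2700, 2800, 2725)
print(fraction, point)
```

Here the house number sits a quarter of the way through the range, so the point lands a quarter of the way along the segment. The approach assumes addresses are evenly spaced along a straight segment, which is only an approximation of real streets.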


Having finished geocoding the address, it is now possible to plot this pizza parlor on a map to see where it sits. Since the code above looked at information pertaining to the left side of a street segment, the actual location will be slightly left of the plotted point, in a building on the left side of the road. You finally know where you can get some pizza.




This process covers what is most commonly referred to as geocoding, but it is not the only way the term is used. You may also see geocoding refer to the process of transferring landmark names to coordinates, ZIP codes to coordinates, or coordinates to GIS vectors. You may even hear reverse geocoding (which will be covered later) referred to as geocoding. A more lenient definition for geocoding that encompasses these would be “the transfer between approximate, natural language descriptions of locations and geographic coordinates.” So, any time you need to move between these two types of data, consider geocoding as a solution.

As an alternative to repeating this process whenever you need to geocode addresses, a variety of API endpoints, such as the U.S. Census Bureau Geocoder and the Google Geocoding API, provide an accurate geocoding service for free or with a free usage tier. Some paid options, such as Esri’s ArcGIS, Geocodio, and Smarty, even offer rooftop accuracy for select addresses, which means that the returned coordinate lands exactly on the roof of the building instead of on a nearby street. The following sections outline how to use these services to fit geocoding into your data pipeline, using the U.S. Census Bureau Geocoder as an example.



In order to get the highest possible accuracy when geocoding, you should always begin by ensuring that your addresses are formatted to fit the standards of your chosen service. This will differ slightly between services, but a common format is the USPS format of “PRIMARY# STREET, CITY, STATE, ZIP” where STATE is an abbreviation code, PRIMARY# is the street number, and all mentions of suite numbers, building numbers, and PO boxes are removed.
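
As a rough illustration of this cleanup step, the sketch below assembles the “PRIMARY# STREET, CITY, STATE, ZIP” layout and drops a trailing suite or unit designator. The `format_address` helper and its short list of designators are assumptions made for this example; real address standardization handles many more cases, including PO boxes.

```python
import re

# Secondary-unit designators to drop; this short list is an assumption
# and is not exhaustive.
_UNIT_PATTERN = re.compile(
    r",?\s*\b(?:suite|ste|apt|apartment|unit|bldg|building)\b\.?\s*#?\s*[\w-]+",
    flags=re.IGNORECASE,
)


def format_address(street, city, state, zipcode):
    """Build a 'PRIMARY# STREET, CITY, STATE, ZIP' string."""
    # Remove suite/unit/building mentions from the street line
    street = _UNIT_PATTERN.sub("", street)
    # Collapse repeated whitespace and trim stray commas
    street = re.sub(r"\s+", " ", street).strip(" ,")
    return f"{street}, {city.strip()}, {state.strip().upper()}, {zipcode.strip()}"


formatted = format_address("2709 Broadway, Suite 2B", "New York", "ny", "10025")
print(formatted)
```

A simple pattern like this can misfire on street names that happen to contain a designator word, so it is worth spot-checking the output before submitting a large batch.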

Once your address is formatted, you need to submit it to the API for geocoding. In the case of the U.S. Census Bureau Geocoder, you can either manually submit the address through the One Line Address Processing tab or use the provided REST API to submit the address programmatically. The U.S. Census Bureau Geocoder also allows you to geocode entire files using the batch geocoder and to specify the data source using the benchmark parameter. To geocode the pizza parlor from earlier, the address can be passed to the REST API in Python with the following code.

# Submit the address to the U.S. Census Bureau Geocoder REST API for processing
response = requests.get(
    " with=2709+Broadway%2C+New+York%2C+NY+10025&benchmark=Public_AR_Current&format=json"
).json()


The returned data is a JSON file, which is decoded easily into a Python dictionary. It contains a “tigerLineId” field which can be used to match the shapefile for the nearest street, a “side” field which can be used to determine which side of that street the address is on, and “fromAddress” and “toAddress” fields which contain the address range for the street segment. Most importantly, it contains a “coordinates” field that can be used to locate the address on a map. The following code extracts the coordinates from the JSON file and processes them into a GeoDataFrame to prepare them for spatial analysis.

# Extract coordinates from the JSON file
coords = response["result"]["addressMatches"][0]["coordinates"]
# Convert coordinates to a Shapely Point
coords = Point(coords["x"], coords["y"])
# Extract matched address
matched_address = response["result"]["addressMatches"][0]["matchedAddress"]
# Create a GeoDataFrame containing the results
# (the CRS is assumed to match the longitude/latitude CRS of the street data)
pizza_point = gpd.GeoDataFrame(
    {"address": [matched_address], "geometry": [coords]},
    crs=ny_streets.crs,
    geometry="geometry",
)


Visualizing this point shows that it is slightly off the road, to the left of the point that was geocoded manually.
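
In a pipeline, some addresses will not match at all, so it can help to guard the extraction step. The sketch below uses made-up payloads shaped like the fields described above; `extract_match` is a hypothetical helper, and the assumption that an unmatched address yields an empty “addressMatches” list should be checked against the geocoder’s documentation.

```python
def extract_match(payload):
    """Return (matched_address, (x, y)) for the first match, or None."""
    matches = payload.get("result", {}).get("addressMatches", [])
    if not matches:
        return None  # no candidate was found for this address
    first = matches[0]
    coords = first["coordinates"]
    return first["matchedAddress"], (coords["x"], coords["y"])


# Made-up payloads shaped like the fields described above
sample = {
    "result": {
        "addressMatches": [
            {
                "matchedAddress": "2709 BROADWAY, NEW YORK, NY, 10025",
                "coordinates": {"x": -73.97, "y": 40.8},
            }
        ]
    }
}
empty = {"result": {"addressMatches": []}}

print(extract_match(sample))
print(extract_match(empty))
```

Returning `None` for unmatched addresses lets the caller decide whether to retry with a reformatted address or to drop the record.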





Reverse geocoding is the process of taking geographic coordinates and matching them to natural language descriptions of a geographic region. When applied correctly, it is one of the most powerful techniques in the data science toolkit for attaching external data. The first step of reverse geocoding is determining your target geographies. This is the region that will contain your coordinate data. Some common examples are census tracts, ZIP codes, and cities. The second step is determining which, if any, of those regions the point is in. When using common regions, the U.S. Census Geocoder can be used to reverse geocode by making small changes to the REST API request. The same approach can request the Census geographies that contain the pizza parlor from before, and the result of that query can be processed using the same methods as before. However, creatively defining the region to fit an analysis need and manually reverse geocoding to it opens up many possibilities.

To manually reverse geocode, you need to determine the location and shape of a region, then determine if the point is on the interior of that region. Determining if a point is inside a polygon is actually a fairly difficult problem, but the ray casting algorithm can be used to solve it in most cases: a ray starting at the point and travelling infinitely in one direction intersects the boundary of the region an odd number of times if the point is inside the region and an even number of times otherwise (Shimrat). For the mathematically inclined, this is a direct application of the Jordan curve theorem (Hosch). As a note, if you are using data from around the world, the ray casting algorithm can fail, since a ray will eventually wrap around the Earth’s surface and become a circle. In this case, you will instead have to find the winding number (Weisstein) for the region and the point. The point is inside the region if the winding number is not zero. Fortunately, Python’s geopandas library provides the functionality necessary to both define the interior of a polygonal region and test if a point is inside it without all the complex mathematics.
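
For readers curious about what happens under the hood, here is a minimal sketch of the ray casting idea for planar coordinates. It casts a horizontal ray to the right of the point and counts edge crossings. The `point_in_polygon` helper and the test shapes are made up for illustration, and points lying exactly on the boundary are not treated specially.

```python
def point_in_polygon(point, polygon):
    """Ray casting test; polygon is a list of (x, y) vertices in order."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside  # one more crossing to the right
    return inside


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
# An L-shaped (non-convex) region with a notch in its upper right
l_shape = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 4), (0, 4)]

print(point_in_polygon((2, 2), square))
print(point_in_polygon((2, 2), l_shape))
```

The same point is inside the square but falls in the notch of the L-shaped region, so an odd crossing count is found for the first shape and an even count for the second.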

While manual geocoding can be too complex for many applications, manual reverse geocoding can be a practical addition to your skill set since it allows you to easily match your points to highly customized regions. For example, assume you want to take your slice of pizza to a park and have a picnic. You may want to know if the pizza parlor is within a short distance of a park. New York City provides shapefiles for its parks as part of the Parks Properties dataset (NYC Parks Open Data Team), and they can also be accessed via the SODA API using the following code.

# Pull NYC park shapefiles
parks = gpd.read_file(
    BytesIO(
        requests.get(
            " resource/enfh-gkve.geojson?$limit=5000"
        ).content
    )
)
# Limit to parks with green area for a picnic
# ("typecategory" is assumed to be the column holding these park types)
parks = parks.loc[
    parks["typecategory"].isin(
        [
            "Nature Area",
            "Community Park",
            "Neighborhood Park",
            "Flagship Park",
        ]
    )
]


These parks can be added to the visualization to see which parks are near the pizza parlor.




There are clearly some options nearby, but figuring out the distance using the shapefiles and the point can be difficult and computationally expensive. Instead, reverse geocoding can be applied. The first step, as mentioned above, is determining the region you want to attach the point to. In this case, the region is “a 1/2-mile distance from a park in New York City.” The second step is calculating whether the point lies inside a region, which can be done mathematically using the previously mentioned methods or by applying the “contains” function in geopandas. The following code adds a 1/2-mile buffer to the boundaries of the parks before testing to see which parks’ buffered regions now contain the point.

# Project the coordinates from latitude and longitude into feet (EPSG:2263) for distance calculations
buffered_parks = parks.to_crs(epsg=2263)
pizza_point = pizza_point.to_crs(epsg=2263)
# Add a buffer to the regions extending the border by 1/2 mile = 2640 ft
buffered_parks = buffered_parks.buffer(2640)
# Find all parks whose buffered region contains the pizza parlor
pizza_parks = parks.loc[buffered_parks.contains(pizza_point["geometry"].values[0])]


This buffer reveals the nearby parks, which are highlighted in blue in the image below.




After successful reverse geocoding, you have learned that there are 8 parks within a half mile of the pizza parlor in which you could have your picnic. Enjoy that slice.

Pizza Slice by j4p4n




  1. Lawler, Josh and Schiess, Peter. ESRM 250: Introduction to Geographic Information Systems in Forest Resources. Definitions of GIS, 12 Feb. 2009, University of Washington, Seattle. Class Lecture.
  2. CSCL PUB. New York OpenData.
  3. U.S. Census Bureau Geocoder Documentation. August 2022.
  4. Shimrat, M. “Algorithm 112: Position of point relative to polygon.” Communications of the ACM, Volume 5, Issue 8, Aug. 1962.
  5. Hosch, William L. “Jordan curve theorem.” Encyclopedia Britannica, 13 Apr. 2018.
  6. Weisstein, Eric W. “Contour Winding Number.” From MathWorld–A Wolfram Web Resource.
  7. NYC Parks Open Data Team. Parks Properties. April 14, 2023.
  8. j4p4n. “Pizza Slice.” From OpenClipArt.

Evan Miller is a Data Science Fellow at Tech Impact, where he uses data to support nonprofit and government agencies with a mission of social good. Previously, Evan used machine learning to train autonomous vehicles at Central Michigan University.