In the scenario we are going to explore in this post, we have a GIS data in a proprietary format (not geojson), which comes from an API of a data provider. We want to parse this format and insert the data into a Post GIS database, or a Postgresql with postgis
extension installed.
Analysing input data
The API response is in .json
format. After exploration of the data sample, we notice that the related GIS data fields take one of the following forms (examples below).
Example 1:
"points": [
[
11.37975,
11.3847
],
[
47.37138,
33.8887
],
...
]
Example 2:
"points": [
{
"id": 0,
"lat": 11.76508,
"lng": 11.39344
},
{
"id": 1,
"lat": 11.74897,
"lng": 11.40683
},
...
]
Example 3:
"points": [
[
{
"id": 0,
"lat": 11.46365,
"lng": 11.58313
},
{
"id": 1,
"lat": 11.39815,
"lng": 11.35217
},
...
],
[
...
],
...
]
So when parsing these records with Python, we have the following different cases:
- list of lists
- list of dicts
- list of lists of dicts
After having analysed the API response sample data carefully, and comparing it with the actual output of the web application / service that the API belongs to, we come to the conclusion that these different cases represent different Post GIS geometries. And so:
- list of lists represents a
MULTIPOINT
geometry - list of dicts represents a
LINESTRING
geometry, and we can generalize it toMULTILINESTRING
- list of lists of dicts represents
MULTIPOLYGON
geometry
The later insertion of data into Post GIS can be done using a convenient ST_GeomFromText
function. This function converts a text string in a specific format into a relevant geometry upon data insertion into PostGIS. But in order to succesfully execute the insert operation, we need to correctly construct those text strings.
Converting input data into desired format
Looking at the documentation, for each cases we need to construct strings in the following formats:
#MULTIPOINT ( (lng lat),(lng lat),(lng lat) )
#MULTILINESTRING ( (lng lat,lng lat,lng lat),(lng lat,lng lat,lng lat) )
#MULTIPOLYGON ( ((lng lat,lng lat,lng lat)),((lng lat,lng lat,lng lat)) )
where lng
and lat
are longitude and latitude of each individual point in the input data.
Looking at the examples of the text strings that we need to construct, we can see that different parts of geometries are seprated by brackets. We can also categorize brackets into different types as follows:
- inner brackets encapsulate each
lng lat
pair representing a point - middle brackets encapsulate collections / groups of points
- outer brackets encapsulate the entire geometry definition that follows the word indicating the geometry (
POINT
,MULTIPOINT
etc).
Looking at the geometries in our input data, we have the following:
#MULTIPOINT ( (lng lat),(lng lat),(lng lat) )
# ^inner bracket ^outer bracket
#MULTILINESTRING ( (lng lat,lng lat,lng lat),(lng lat,lng lat,lng lat) )
# ^middle bracket ^middle bracket ^outer bracket
#MULTIPOLYGON ( ((lng lat,lng lat,lng lat)),((lng lat,lng lat,lng lat)) )
# ^^middle brackets ^^middle brackets ^outer bracket
We want our parser / converter function to be generic, so we are going to parametrize numbers of brackets:
num_of_brackets_inner
num_of_brackets_middle
num_of_brackets_outer
This is it as far as brackets are concerned.
In addition (based on common sense and data insert tests that we carried out), for polygons - each polygon in a multipolygon needs to be “closed” i.e. it needs to be a closed loop, so the last point in a polygon needs to be the same as the first point in the polygon. After inspecting the input data samples, we realize that this is not the case, so this requirement would have to be met “synthetically” i.e. in the output data set we will need to ensure that there is a point that allows to close the polygon. So we need a paramter that will indicate whether we need a “closed” geometry or not:
closing_flag
And lastly, we will also use a parameter that will indicate if we have a “deep list” or not i.e. if we have just list of objects, or the list of list of objects. This will determine if we need to iterate through one or two levels of the list. This parameter will be called:
deep_list_flag
workflow
In the conversion workflow, when we are parsing a record from a input .json data, we need to first detect what points
record type we have. This can be realized by a relatively simple if
:
if isinstance(record["points"][0], list):
postgis_geometry = "MULTIPOLYGON"
else:
postgis_geometry = "MULTILINESTRING"
(Please note that the code above is just to illustrate the mechanism, and it is not complete as it does not cover the case of “MULTIPOINT” geometry from our scenario.)
Once the geometry is detected, we then pass this parameter along with the record to the next function in the workflow:
value = await construct_geometry(postgis_geometry, record["points"])
This function is defined as follows, and it is a higher level function constructing the desired text string representing the geometry:
async def construct_geometry(postgis_geometry: str, points: list):
if postgis_geometry == "MULTIPOLYGON":
#MULTIPOLYGON ( ((lng lat,lng lat,lng lat)),((lng lat,lng lat,lng lat)) )
num_of_brackets_outer = 1
num_of_brackets_middle = 2
num_of_brackets_inner = 0
# list of list of dicts
deep_list_flag = True
closing_flag = True
elif postgis_geometry == "MULTILINESTRING":
#MULTILINESTRING ( (lng lat,lng lat,lng lat),(lng lat,lng lat,lng lat) )
num_of_brackets_outer = 1
num_of_brackets_middle = 1
num_of_brackets_inner = 0
# list of dicts
deep_list_flag = False
closing_flag = False
elif postgis_geometry == "MULTIPOINT":
# MULTIPOINT ( (lng lat),(lng lat),(lng lat) )
num_of_brackets_outer = 1
num_of_brackets_middle = 0
num_of_brackets_inner = 1
# list of list of ints or strings
deep_list_flag = False
closing_flag = False
else:
logging.error(f"function does not yet have a case to handle this type of geometry: {postgis_geometry}")
value = postgis_geometry+"("*num_of_brackets_outer
if deep_list_flag:
for i, points_in_deeper_list in enumerate(points):
if (postgis_geometry == "MULTIPOLYGON") and (len(points_in_deeper_list) < 3):
# in this case we are skipping this list of points because it is too short
# to construct a polygon, which should have a minimum of three corners i.e. be a triangle
continue
else:
value = value + await geom_points_to_text(num_of_brackets_middle, num_of_brackets_inner, points_in_deeper_list, closing_flag)
# now we need comma before the next polygon (or more general: collection of points), but we need to avoid comma after the last polygon
if i < (len(points)-1):
value = value + ","
else:
#this is a one level list
value = value + await geom_points_to_text(num_of_brackets_middle, num_of_brackets_inner, points, closing_flag)
value = value + ")"*num_of_brackets_outer
return value
This higher level function above relies in turn on the following function to construct the middle part of the text string:
async def geom_points_to_text(num_of_brackets_middle: int, num_of_brackets_inner:int, points: list, closing_flag: bool):
# polygon_points should be list of dicts with "lng" and "lat" keys
value = num_of_brackets_middle*"("
for j, point_lat_lng in enumerate(points):
if isinstance(point_lat_lng, dict):
value = value + "("*num_of_brackets_inner + str(point_lat_lng["lng"]) + " " +str(point_lat_lng["lat"]) + ")"*num_of_brackets_inner
elif isinstance(point_lat_lng, list):
value = value + "("*num_of_brackets_inner + str(point_lat_lng[1]) + " " +str(point_lat_lng[0]) + ")"*num_of_brackets_inner
# now we need comma before the next point, but we need to avoid comma after the last point
if j < (len(points)-1):
value = value + ","
else:
# now we reached to the last point in the list,
# if we are constructing a polygon geometry
# we need to ensure we close the polygon
# so we have to add the first point in the list
# this is actually assuming each point is in the format dict
if closing_flag:
value = value + "," + str(points[0]["lng"]) + " " +str(points[0]["lat"])
value = value + num_of_brackets_middle*")"
return value
Such conversion function in pretty generic and can be reused / expanded to handle more geometries, should we encounter data sets with similar formats.