Commit 3cb68b1d authored by Nicolas Garcia Ospina

Included build_dictionary and fixed main program

parent 8e7f8eb4
Pipeline #20861 passed with stage
in 4 minutes and 13 seconds
......@@ -2,7 +2,7 @@ SOURCES=obmgapanalysis tests setup.py
LENGTH=96
check: $(SOURCES)
-	flake8 --max-line-length=$(LENGTH) $^
+	flake8 --max-line-length=$(LENGTH) --ignore=E203,W503 $^
black --check --line-length $(LENGTH) $^
pylint -E $^
......
......@@ -14,6 +14,7 @@ data such as Global Human Settlement Layer (GHSL) and World Settlement Footprint
### Python libraries
* `numpy`
* `pyproj>=3.0.0.post1`
* `shapely`
* `babelgrid`
......
database:
host: your_host.dir
dbname: gis
port: 5433
username: postgres
password: secret_pass
roads_table:
tablename: planet_osm_roads
geometry_field: way
datasource:
crs: epsg:3857
pathname: /your/path/to/datasource
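For illustration, the `database` block above maps directly onto the connection string built in `Database.create_connection_and_cursor`. A minimal sketch (values are the placeholders from the config; the actual connection call is shown commented):

```python
# Sketch: how the database config block maps onto a psycopg2-style
# connection string, following the format used by the Database class.
db_config = {
    "host": "your_host.dir",
    "dbname": "gis",
    "port": 5433,
    "username": "postgres",
    "password": "secret_pass",
}

connection_string = "host={} dbname={} user={} password={} port={}".format(
    db_config["host"],
    db_config["dbname"],
    db_config["username"],
    db_config["password"],
    db_config["port"],
)
# psycopg2.connect(connection_string) would then open the session
```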
......@@ -17,4 +7,18 @@ datasource:
tiles:
use_txt: False
tiles_list: ['122100200320321020', '122100200320321021', '122100200320321022', '122100200320321023']
-  txt_file: tiles.txt
+  txt_filepath: tiles.txt
output_pathname: ./results
number_cores: 10
chunk_size: 1000
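The `number_cores` and `chunk_size` options control how the quadkey list is split into batches for the multiprocessing pool. A minimal sketch of that batching, assuming a plain Python list of quadkeys (the helper name `make_batches` is hypothetical; the slicing mirrors the main program):

```python
def make_batches(tiles_list, chunk_size):
    """Split the quadkey list into batches of at most chunk_size entries."""
    return [
        tiles_list[i : i + chunk_size] for i in range(0, len(tiles_list), chunk_size)
    ]

# Four quadkeys split into batches of two; each batch goes to one worker
batches = make_batches(
    [
        "122100200320321020",
        "122100200320321021",
        "122100200320321022",
        "122100200320321023",
    ],
    2,
)
```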
database:
host: your_host.dir.request_data
dbname: gis
port: 5433
username: postgres
password: secret_pass
roads_table:
tablename: planet_osm_roads
geometry_field: way
......@@ -5,11 +5,26 @@ shape of a settlement by assigning pixel values to different status (e.g. 1=buil
There are currently some models based on remote sensing observations that depict the settlement
distributions among the world (e.g. [GHSL-BUILT](https://ghsl.jrc.ec.europa.eu/download.php)
-available in 30m resolution or [WSF-2015](https://www.esa.int/Applications/Observing_the_Earth/Mapping_our_global_human_footprint) available in 30m resolution).
+available in 30m resolution or [WSF-2015](https://www.esa.int/Applications/Observing_the_Earth/Mapping_our_global_human_footprint) available in 10m resolution).
have different ways to describe the surface but can be reduced to a binary solution. This
document gives a short description of the datasets; for more information, check their
official sites.
## Method_id
The input datasets can be instantiated with a `method_id` if desired. This is recommended if the aim is
to create a multi-source product. This option can be activated by setting the `method_id` argument to
any integer value. By assigning a `method_id`, the output can be queried by this ID. A multi-source
output may be organized like this:
| method_id | description |
|:------:|:------:|
| 1 | Built areas produced with GHSL data from 2014|
| 2 | Built areas produced with GHSL data from 1975|
| 3 | Built areas produced with WSF data |
| ... | ... |
| n | Built areas retrieved with n dataset |
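A multi-source output organized this way can then be queried by `method_id`. A small illustration with hypothetical rows, whose keys follow the dictionary produced by `TileProcessor.build_dictionary`:

```python
# Hypothetical rows of a combined multi-source output; keys follow
# TileProcessor.build_dictionary (tile_id, method_id, size_built_area, ...)
records = [
    {"tile_id": "122100200320321020", "method_id": 1, "size_built_area": 1250.0},
    {"tile_id": "122100200320321020", "method_id": 3, "size_built_area": 980.5},
]

# Query the output by ID, e.g. keep only method_id 1 (GHSL data from 2014)
ghsl_2014 = [row for row in records if row["method_id"] == 1]
```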
# Data structure
This program makes use of the original data structure from the [GHSL-BUILT](https://ghsl.jrc.ec.europa.eu/download.php)
......
......@@ -38,18 +38,26 @@ class Database:
includes filepaths and raster footprints. These sources can either be:
Args:
host (str): Postgres database host address.
dbname (str): Postgres database name.
port (str): Port where the Postgres database can be found.
username (str): User to connect to the Postgres database.
password (str or getpass.getpass): Password for username argument
Attributes:
self.host (str): Postgres database host address.
self.dbname (str): Postgres database name.
self.port (str): Port where the Postgres database can be found.
self.username (str): User to connect to the Postgres database.
self.password (str or getpass.getpass): Password for username argument
"""
......@@ -74,9 +82,6 @@ class Database:
self.cursor (psycopg2.extensions.cursor):
Cursor object to run commands on the database.
Returns:
None
"""
connection_string = "host={} dbname={} user={} password={} port={}".format(
......@@ -94,7 +99,9 @@ class Database:
Args:
schema (str): Database schema where the database is located. Default = "public"
tablename (str): Table name within database for searching (e.g. "obm_buildings")
geometry_field (str): Name of the column with geometries. Default = "geometry"
Returns:
......@@ -116,11 +123,14 @@ class Database:
Args:
tile (tile.Tile): Tile object with quadkey, crs and geometry attributes.
tablename (str): Table name within database for searching (e.g. "obm_buildings")
crs_number (str): Number of the epsg/srid code for the specified table.
schema (str): Database schema where the database is located. Default = "public"
geometry_field (str): Name of the column with geometries. Default = "geometry"
Returns:
output_features (geopandas.geodataframe.GeoDataFrame): GeoPandas dataframe
......
......@@ -41,22 +41,27 @@ class DataSource:
raster_files_index (str): Filename of the raster index.
method_id (int): ID related to a dataset/method used (optional). Default = False
Attributes:
self.crs (str): Coordinate system associated with the DataSource, given as a EPSG code.
self.pathname (str): Pathname where the dataset and explanatory dataframe are located.
self.raster_files_index (geopandas.geodataframe.GeoDataFrame): GeoPandas dataframe
with raster relative filepaths and respective geometries.
self.built_pixel_values (list): List with pixel values that represent built areas.
Hints on pixel built_pixel_values available at docs/02_Input_datasets.md
self.method_id (int): ID related to the settlement dataset/method used (optional)
"""
-    def __init__(self, crs, pathname, raster_files_index, built_pixel_values):
+    def __init__(self, crs, pathname, raster_files_index, built_pixel_values, method_id=False):
self.crs = crs
self.pathname = pathname
self.raster_files_index = geopandas.read_file(
os.path.join(pathname, raster_files_index)
)
self.built_pixel_values = built_pixel_values
self.method_id = method_id
#!/usr/bin/env python3
# Copyright (C) 2021:
# Helmholtz-Zentrum Potsdam Deutsches GeoForschungsZentrum GFZ
#
# This program is free software: you can redistribute it and/or modify it
# under the terms of the GNU Affero General Public License as published by
# the Free Software Foundation, either version 3 of the License, or (at
# your option) any later version.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero
# General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see http://www.gnu.org/licenses/.
import os
import logging
import geopandas
# Initialize log
logger = logging.getLogger(__name__)
class FileProcessor:
@staticmethod
def write_tiles_to_csv(
list_of_dictionaries, output_pathname, column_geometry="built_area", crs="epsg:4326"
):
"""Write a csv file from a list of dictionaries.
Args:
list_of_dictionaries (list): List of dictionaries with built-up areas to
write.
output_pathname (str): Target path name to write the csv file.
column_geometry (str): Name of the field that contains geometries.
Default = "built_area"
crs (str): EPSG code of the data projection. Default = "epsg:4326"
"""
tiles_gdf = geopandas.GeoDataFrame(
list_of_dictionaries, geometry=column_geometry, crs=crs
)
filepath_out = os.path.join(
output_pathname, "{}_{}.csv".format(tiles_gdf.tile_id.iloc[0], len(tiles_gdf.index))
)
logger.info("Creating {}".format(filepath_out))
tiles_gdf.to_csv(filepath_out, index=False)
......@@ -21,119 +21,107 @@ import logging
import os
import sys
import yaml
import numpy
from obmgapanalysis.datasource import DataSource
from obmgapanalysis.tile import Tile
-from obmgapanalysis.processor import TileProcessor
+from obmgapanalysis.tileprocessor import TileProcessor
from obmgapanalysis.fileprocessor import FileProcessor
from obmgapanalysis.database import Database
import multiprocessing
# Add a logger printing error, warning, info and debug messages to the screen
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.info("Launching obmgapanalysis")
# Get program configuration from config.yml
with open("config.yml", "r") as ymlfile:
config = yaml.load(ymlfile, Loader=yaml.FullLoader)
db_config = config["database"]
datasource_config = config["datasource"]
tiles_config = config["tiles"]
output_pathname = os.path.abspath(config["output_pathname"])
if not os.path.exists(output_pathname):
os.makedirs(output_pathname)
-def main():
+def multiprocess_chunk(quadkey_batch):
"""
Wrapper function that writes a CSV file with built-up areas found in the
quadkey_batch. The output is based on TileProcessor.build_dictionary.
This wrapper function is built to work with the multiprocessing library.
-    logger.info("obmgapanalysis has started")
Output filenames are: <first quadkey>_<number of tiles>.csv
Args:
quadkey_batch (list): List of quadkeys to process with settlement data
"""
# Create DataSource instance to be used by all child processes
datasource = DataSource(**datasource_config)
-    logger.info(
+    logger.debug(
"DataSource raster files are registered in {}".format(
-            os.path.join(datasource.pathname, datasource.raster_files_index)
+            os.path.join(datasource.pathname, datasource_config["raster_files_index"])
)
)
# Connect to the OBM database
roads_database = Database(**db_config)
roads_database.create_connection_and_cursor()
-    roads_table_crs_number = roads_database.get_crs_from_geometry_field(
+    roads_database_crs_number = roads_database.get_crs_from_geometry_field(
**db_config["roads_table"]
)
logger.info(
"Connection established to {} in {}".format(db_config["dbname"], db_config["host"])
)
-    # List all the
+    # Build a list with all tiles with reported built-up areas
build_up_areas = []
try:
for quadkey in quadkey_batch:
result = TileProcessor.get_build_up_area(
quadkey=quadkey,
database=roads_database,
datasource=datasource,
database_crs_number=roads_database_crs_number,
table_config=db_config["roads_table"],
)
if result is not None:
build_up_areas.append(result)
if build_up_areas:
# Write output into a csv file
FileProcessor.write_tiles_to_csv(build_up_areas, output_pathname)
finally:
roads_database.connection.close()
def main():
# Read tiles to be processed
if tiles_config["use_txt"]:
-        with open(tiles_config["txt_file"]) as fid:
-            tiles_list = [line.strip() for line in fid]
+        tiles_list = list(numpy.loadtxt(tiles_config["txt_filepath"], dtype=str))
else:
tiles_list = tiles_config["tiles_list"]
logger.info("{} tiles will be processed".format(len(tiles_list)))
# Loop through each quadkey and retrieve information
for quadkey in tiles_list:
# Create a tile with the same crs as the DataSource
tile = Tile(quadkey, datasource.crs)
logger.debug(str(tile.geometry))
logger.info(type(tile))
# Find raster files with information for the given tile
raster_files = TileProcessor.intersect_datasource_with_tile(
datasource=datasource, tile=tile
)
# Load the raster files and extract only the pixels in the cells
pixel_values, pixel_georeferences = TileProcessor.get_raster_pixels_in_tile(
raster_filepaths=raster_files, tile=tile
)
# Transform the pixels into a polygon
built_geometry = TileProcessor.polygonize_array(
pixel_values, pixel_georeferences, datasource
)
logger.info(
"The built area polygon with complete pixels is: {}".format(str(built_geometry))
)
clip_built_geometry = TileProcessor.clip_to_tile_extent(built_geometry, tile)
logger.info(
"The built area polygon cropped to the extent of tile {}: {}".format(
tile.quadkey, str(clip_built_geometry)
)
)
roads_in_tile = roads_database.get_features_in_tile(
tile=tile, crs_number=roads_table_crs_number, **db_config["roads_table"]
)
logger.info(
"{} road features were found for tile {}".format(
len(roads_in_tile.index), tile.quadkey
)
)
# Process the roads query
roads_processed = TileProcessor.process_dataframe_with_tile(
roads_in_tile, tile=tile, buffer_magnitude=3.0
)
# Generate a parallel process pool with each quadkey batch and process
num_processes = config["number_cores"]
chunk_size = config["chunk_size"]
quadkey_batchs = [
tiles_list[i : i + chunk_size] for i in range(0, len(tiles_list), chunk_size)
]
# Subtract result from built area
refined_built_area = TileProcessor.polygon_difference(
clip_built_geometry, roads_processed
)
logging.info("Creating multiprocessing pool")
with multiprocessing.Pool(processes=num_processes) as pool:
logging.info("Start parallel processing")
pool.map(multiprocess_chunk, quadkey_batchs)
area = TileProcessor.albers_area_calculation(refined_built_area, tile.crs)
logger.info(
"Within tile {}, there are {} squared meters of built area.".format(
tile.quadkey, area
)
)
roads_database.connection.close()
logging.info("Parallel processing finished, closing pool")
pool.close()
pool.join()
# Leave the program
logger.info("Task finished, closing obmgapanalysis")
sys.exit()
......
......@@ -18,6 +18,9 @@
import logging
import os
from datetime import date
import rasterio
from rasterio.mask import mask
from shapely.geometry import mapping
......@@ -26,8 +29,7 @@ import pyproj
from shapely.ops import transform
from functools import partial
import geopandas
import os
from obmgapanalysis.tile import Tile
# Initialize log
logger = logging.getLogger(__name__)
......@@ -40,33 +42,19 @@ class TileProcessor:
that intersect the given tile.geometry
Args:
datasource (datasource.DataSource): DataSource instance with
crs, pathname and raster_files_index attributes.
datasource (datasource.DataSource): DataSource instance with crs,
pathname and raster_files_index attributes.
tile (tile.Tile): Tile object with quadkey, crs and geometry attributes
Examples:
>>> from obmgapanalysis.datasource import DataSource
>>> from obmgapanalysis.tile import Tile
>>> datasource = DataSource(
        crs="epsg:3857",
        pathname=os.environ["OBMGAPANALYSIS_BASE_PATH"],
        raster_files_index="GHS_TEST_INDEX.shp",
        built_pixel_values=[6, 5, 4, 3],
    )
>>> tile = Tile("122100203023320202", datasource.crs)
>>> TileProcessor.intersect_datasource_with_tile(datasource, tile)
Returns:
index_list (list): List of the filepaths of the raster files that
intersect with the Tile.geometry.
"""
-        raster_files_index = datasource.raster_files_index
+        raster_files_index = datasource.raster_files_index.copy()
raster_files_index = raster_files_index[raster_files_index.intersects(tile.geometry)]
-        raster_files_index["location"] = raster_files_index["location"].apply(
+        raster_files_index.loc[:, "location"] = raster_files_index["location"].apply(
lambda x: os.path.join(datasource.pathname, x)
)
index_list = raster_files_index.location.tolist()
......@@ -84,9 +72,9 @@ class TileProcessor:
Returns:
pixel_values (list of numpy.ndarray): List of arrays with pixel values that
intersect with the tile.geometry
pixel_georeferences (list of affine.Affine): List with affine transformation
matrices of pixel_values (allows geocoding)
"""
pixel_values = []
......@@ -109,13 +97,13 @@ class TileProcessor:
Args:
pixel_values (list of numpy.ndarray): List of arrays with pixel values that
intersect with the tile.geometry
pixel_georeferences (list of affine.Affine): List with affine transformation
matrices of pixel_values (allows geocoding)
datasource (datasource.DataSource): DataSource instance with
crs, pathname and raster_files_index attributes.
Returns:
built_area_geometry (shapely.geometry.multipolygon.MultiPolygon): Shapely
......@@ -272,3 +260,92 @@ class TileProcessor:
polygon = TileProcessor.reproject_polygon(input_polygon, polygon_crs, "aea")
return polygon.area
@staticmethod
def build_dictionary(tile, datasource, built_polygon):
"""Returns a dictionary with the built-up area related attributes
associated to the Tile and a given DataSource.
Contains:
quadkey (str): Tile quadkey
method_id (int): Integer associated to a predefined method
built_area (str): Polygon string projected to WGS84 coordinates.
size_built_area (float): Area measured in squared meters.
last_update (str): Date when the dictionary was generated.
Args:
tile (tile.Tile): Tile object with quadkey, crs and geometry attributes.
datasource (datasource.DataSource): DataSource instance with crs,
pathname, method_id and raster_files_index attributes.
built_polygon (shapely.geometry.multipolygon.MultiPolygon): Shapely
polygon or multipolygon of the built area.
Returns:
results (dictionary): Dictionary with built-up area information.
"""
if built_polygon.is_empty:
logging.info("No built area found in {}".format(tile.quadkey))
return
results = {
"tile_id": tile.quadkey,
"method_id": datasource.method_id,
"built_area": TileProcessor.reproject_polygon(built_polygon, tile.crs, "epsg:4326"),
"size_built_area": TileProcessor.albers_area_calculation(built_polygon, tile.crs),
"last_update": str(date.today()),
}
if not results["method_id"]:
del results["method_id"]
return results
@staticmethod
def get_build_up_area(quadkey, datasource, database, database_crs_number, table_config):
"""Run the complete processing of a quadkey and returns a dictionary
created with TileProcessor.build_dictionary.
Args:
quadkey (str): Quadkey code associated with a Bing quadtree tile.
datasource (datasource.DataSource): DataSource instance with crs,
pathname, method_id and raster_files_index attributes.
database (database.Database): Database instance with credentials
and connection ready to perform data queries.
database_crs_number (str): SRID number of the target table.
table_config (dict): Dictionary with table name, schema and geometry_field.
This is part of the config.yml file.
Returns:
results (dictionary): Dictionary with built-up area information.
"""
tile = Tile(quadkey, datasource.crs)
# Find raster files and extract relevant pixels
raster_files = TileProcessor.intersect_datasource_with_tile(
datasource=datasource, tile=tile
)
pixel_values, pixel_georeferences = TileProcessor.get_raster_pixels_in_tile(
raster_filepaths=raster_files, tile=tile
)
# Transform the pixels into a polygon
built_geometry = TileProcessor.polygonize_array(
pixel_values, pixel_georeferences, datasource
)
if built_geometry is not None:
clip_built_geometry = TileProcessor.clip_to_tile_extent(built_geometry, tile)
roads_in_tile = database.get_features_in_tile(
tile=tile, crs_number=database_crs_number, **table_config
)
roads_processed = TileProcessor.process_dataframe_with_tile(
roads_in_tile, tile=tile, buffer_magnitude=3.0
)
refined_built_area = TileProcessor.polygon_difference(
clip_built_geometry, roads_processed
)
result = TileProcessor.build_dictionary(tile, datasource, refined_built_area)
return result
......@@ -30,6 +30,7 @@ setup(
license="AGPLv3+",
author="Helmholtz-Zentrum Potsdam Deutsches GeoForschungsZentrum GFZ",
install_requires=[
"numpy",
"pyproj==3.0.0.post1",
"shapely",
"babelgrid",
......
......@@ -33,6 +33,7 @@ def test_init():
pathname=os.environ["TEST_BASE_PATH"],
raster_files_index="GHS_TEST_INDEX.shp",
built_pixel_values=[6, 5, 4, 3], # Built pixels in GHSL
method_id=1,
)