Commit e0d482e5 authored by Nicolas Garcia Ospina's avatar Nicolas Garcia Ospina

Improved config file and added documentation

parent c1ad7d11
Pipeline #21418 passed in 4 minutes and 22 seconds
@@ -3,6 +3,7 @@ datasource:
   pathname: /your/path/to/datasource
   raster_files_index: GHS_TEST_INDEX.shp
   built_pixel_values: [6, 5, 4, 3]
+  # source_id: 1  # optional
 tiles:
   use_txt: False
@@ -10,8 +11,8 @@ tiles:
   txt_filepath: tiles.txt
 output_pathname: ./results
-number_cores: 10
-chunk_size: 1000
+number_cores: 1
+batch_size: 1000
 database:
   host: your_host.dir.request_data
@@ -22,6 +23,7 @@ database:
   roads_table:
     tablename: planet_osm_roads
     geometry_field: way
+  process_buffer_magnitude: False
 target_database:
   host: your_host.dir.request_data
......
@@ -10,7 +10,7 @@ have different ways to describe the surface but can be reduced to a binary solution
 document a short description of the datasets can be found. For more information, check the datasets
 official sites.
-## Method_id
+## source_id
-The input datasets can be instantiated with a `method_id` if desired. This is recommended if the aim is
-to create a multi-source product. This option can be activated by setting the `method_id` argument to
+The input datasets can be instantiated with a `source_id` if desired. This is recommended if the aim is
+to create a multi-source product. This option can be activated by setting the `source_id` argument to
@@ -33,10 +33,10 @@ datasets. The structure has two main components to follow (`file paths` and `ras
 ## File paths
-The program searches for an environment variable called `OBMGAPANALYSIS_BASE_PATH`, which is the path
+The program config file uses the path name found under `datasource.pathname`, which points to the path
 where the different datasets are stored and should follow this structure.
-OBMGAPANALYSIS_BASE_PATH
+datasource.pathname
 +-- dataset_1_raster_files_index.shp
 +-- dataset_1_directory
 |   +-- dataset_subdirectory_1
......
# Configuration file
The OpenBuildingMap (OBM) gap analysis program can be configured to fit user needs. This is done by
changing the different parameters within the `config.yml` file. A sample file can be found in
the package directory under the filename `config-example.yml`.
## config.yml
The `config.yml` file follows a hierarchical structure, with levels defined by 2-space indentation.
The different parameters are explained below.
The `datasource` section configures the path names for the raster files and the
`raster_files_index`. It also sets the remaining parameters described in the [Input dataset docs](./02_Input_datasets.md).
```
datasource: Configure your settlement layer source
  crs (str): Coordinate system associated with the DataSource, given as an EPSG code.
  pathname (str): Pathname where the dataset and raster_files_index are located.
  raster_files_index (str): Filename of the raster index.
  built_pixel_values (list): List with pixel values that represent built areas.
  source_id (int): ID related to a dataset/method used (optional).
```
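As a sketch, the corresponding fragment of `config-example.yml` might look like the following. The values are placeholders; the `crs` value shown is an assumption taken from the package tests, not necessarily what the sample file ships with:

```yaml
datasource:
  crs: "epsg:3857"                   # EPSG code of the settlement layer (assumed value)
  pathname: /your/path/to/datasource
  raster_files_index: GHS_TEST_INDEX.shp
  built_pixel_values: [6, 5, 4, 3]
  # source_id: 1                     # optional
```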
The `tiles` section inputs the list of Quadkeys to be processed. The `use_txt` argument defines how
the tiles are input: if set to `True`, Quadkeys are read from `txt_filepath` instead of from `tiles_list`.
```
tiles:
  use_txt (bool): If set to True, get tiles from `txt_filepath`; if False, from `tiles_list`.
  tiles_list (list): List of Quadkeys as strings.
  txt_filepath (str): File path of a text file with all Quadkeys to be read.
```
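For instance, a sketch of this section with an inline list (the Quadkey shown is the one used in the package tests):

```yaml
tiles:
  use_txt: False
  tiles_list: ["122100200320321022"]  # Quadkeys as strings
  txt_filepath: tiles.txt             # only read when use_txt is True
```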
The following parameters define the processing output and can improve the performance of the program.
First, `output_pathname` is the directory where CSV files are stored and read for further import into
SQL. The `number_cores` parameter is the maximum number of parallel processes to run, i.e. the number
of cores that can be dedicated to the program execution. Finally, `batch_size` sets the maximum
number of tiles to be handled per process; each CSV file contains at most this number of tiles
(when all of them yield built areas).
```
output_pathname (str): Target path name for CSV file writing and reading.
number_cores (int): Desired maximum number of parallel processes to execute.
batch_size (int): Maximum number of tiles to be handled per process.
```
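To illustrate how `batch_size` drives the batching, the following minimal sketch mirrors the list comprehension used in the program's `main()`; the Quadkey values here are made up:

```python
# Split a list of Quadkeys into batches of at most `batch_size` tiles.
# Each batch is then handed to one worker process by the multiprocessing pool.
tiles_list = ["120", "121", "122", "123", "130"]
batch_size = 2

quadkey_batches = [
    tiles_list[i : i + batch_size] for i in range(0, len(tiles_list), batch_size)
]

print(quadkey_batches)  # [['120', '121'], ['122', '123'], ['130']]
```

Note that the last batch may be smaller than `batch_size`, so the last CSV file may hold fewer tiles.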
The last sections configure database connections. `database` holds a database from which roads can be
extracted to refine built areas; it may also contain buildings if a tile-based built-up area is to be
calculated. The `process_buffer_magnitude` parameter defines how the OBM roads (stored as lines, not
polygons) are processed, giving them a width. Be careful to use the same units as `datasource.crs`
(e.g. meters or degrees). `target_database` is a second database with the table into which processed
tiles will be imported.
```
database:
  host (str): Postgres Database host address.
  dbname (str): PostgreSQL database name.
  port (int): Port to connect to the PostgreSQL database.
  username (str): User to connect to the PostgreSQL database.
  password (str or getpass.getpass): Password for `username` argument.
  roads_table:
    tablename (str): Table name within database for searching.
    geometry_field (str): Name of the column with geometries.
  process_buffer_magnitude (float): Numeric magnitude for the
    polygon buffer (units are equal to the coordinate system units).
target_database:
  host (str): Postgres Database host address.
  dbname (str): PostgreSQL database name.
  port (int): Port to connect to the PostgreSQL database.
  username (str): User to connect to the PostgreSQL database.
  password (str or getpass.getpass): Password for `username` argument.
  tiles_table:
    tablename (str): Table name within database for writing.
    geometry_field (str): Name of the column with geometries.
```
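To illustrate what `process_buffer_magnitude` does, here is a minimal sketch using Shapely (which the program already depends on). The road geometry and magnitude are made-up example values, not data from the database:

```python
from shapely.geometry import LineString

# A hypothetical road stored as a line (no width), in a metric CRS.
road = LineString([(0, 0), (100, 0)])

# Buffering gives the line a width; 3.0 means 3 meters here, because the
# buffer magnitude is expressed in the units of the coordinate system.
road_polygon = road.buffer(3.0)

# The resulting polygon can then be subtracted from the built-up area,
# which is how roads refine the settlement layer.
print(road_polygon.area)  # roughly 100 * 6 plus the two rounded end caps
```

This is why the magnitude must match `datasource.crs`: a value of 3.0 is 3 meters in a metric CRS but 3 degrees (hundreds of kilometers) in a geographic one.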
@@ -37,16 +37,18 @@ class DataSource:
     Args:
         crs (str): Coordinate system associated with the DataSource, given as a EPSG code.
-        pathname (str): Pathname where the dataset and explanatory dataframe are located.
+        pathname (str): Pathname where the dataset and raster_files_index are located.
         raster_files_index (str): Filename of the raster index.
-        method_id (int): ID related to a dataset/method used (optional). Default = False
         built_pixel_values (list): List with pixel values that represent built areas.
+        source_id (int): ID related to a dataset/method used (optional). Default = False

     Attributes:
         self.crs (str): Coordinate system associated with the DataSource, given as a EPSG code.
-        self.pathname (str): Pathname where the dataset and explanatory dataframe are located.
+        self.pathname (str): Pathname where the dataset and raster_files_index are located.
         self.raster_files_index (geopandas.geodataframe.GeoDataFrame): GeoPandas dataframe
             with raster relative filepaths and respective geometries.
@@ -54,14 +56,14 @@ class DataSource:
         self.built_pixel_values (list): List with pixel values that represent built areas.
             Hints on pixel built_pixel_values available at docs/02_Input_datasets.md
-        self.method_id (int): ID related to the settlement dataset/method used (optional)
+        self.source_id (int): ID related to the settlement dataset/method used (optional)
     """

-    def __init__(self, crs, pathname, raster_files_index, built_pixel_values, method_id=False):
+    def __init__(self, crs, pathname, raster_files_index, built_pixel_values, source_id=False):
         self.crs = crs
         self.pathname = pathname
         self.raster_files_index = geopandas.read_file(
             os.path.join(pathname, raster_files_index)
         )
         self.built_pixel_values = built_pixel_values
-        self.method_id = method_id
+        self.source_id = source_id
@@ -95,6 +95,7 @@ def multiprocess_chunk(quadkey_batch):
             datasource=datasource,
             database_crs_number=roads_database_crs_number,
             table_config=db_config["roads_table"],
+            buffer_magnitude=db_config["process_buffer_magnitude"],
         )
         if result is not None:
             build_up_areas.append(result)
@@ -172,15 +173,15 @@ def main():
     # Generate a parallel process pool with each quadkey batch and process
     num_processes = config["number_cores"]
-    chunk_size = config["chunk_size"]
-    quadkey_batchs = [
-        tiles_list[i : i + chunk_size] for i in range(0, len(tiles_list), chunk_size)
+    batch_size = config["batch_size"]
+    quadkey_batches = [
+        tiles_list[i : i + batch_size] for i in range(0, len(tiles_list), batch_size)
     ]
     logging.info("Creating multiprocessing pool")
     with multiprocessing.Pool(processes=num_processes) as pool:
-        logging.info("Start parallel processing")
-        pool.map(multiprocess_chunk, quadkey_batchs)
+        logging.info("Start parallel processing of {} batches".format(len(quadkey_batches)))
+        pool.map(multiprocess_chunk, quadkey_batches)
         logging.info("Parallel processing finished, closing pool")
         pool.close()
......
@@ -197,7 +197,7 @@ class TileProcessor:
         return geometry

     @staticmethod
-    def process_dataframe_with_tile(input_dataframe, tile, buffer_magnitude=False):
+    def process_dataframe_with_tile(input_dataframe, tile, buffer_magnitude=0.0):
         """
         Returns a (multi)polygon processed with a tile object and, if desired, buffered
         by a certain magnitude.
@@ -216,7 +216,7 @@ class TileProcessor:
         """
         geometry = input_dataframe.unary_union
         geometry = TileProcessor.reproject_polygon(geometry, input_dataframe.crs, tile.crs)
-        if buffer_magnitude:
+        if buffer_magnitude > 0.0:
             geometry = geometry.buffer(buffer_magnitude)
         geometry = TileProcessor.clip_to_tile_extent(geometry, tile)
@@ -267,7 +267,7 @@ class TileProcessor:
         associated to the Tile and a given DataSource.

         Contains:
             quadkey (str): Tile quadkey
-            method_id (int): Integer associated to a predefined method
+            source_id (int): Integer associated to a predefined method
             built_area (str): Polygon string projected to WGS84 coordinates.
             size_built_area (float): Area measured in squared meters.
             last_update (str): Date when the pickle was generated.
@@ -276,7 +276,7 @@ class TileProcessor:
             tile (tile.Tile): Tile object with quadkey, crs and geometry attributes.
             datasource (datasource.DataSource): DataSource instance with crs,
-                pathname, method_id and raster_files_index attributes.
+                pathname, source_id and raster_files_index attributes.
             built_polygon (shapely.geometry.multipolygon.MultiPolygon): Shapely
                 polygon or multipolygon of the built area.
@@ -291,18 +291,20 @@ class TileProcessor:
         results = {
             "quadkey": tile.quadkey,
-            "method_id": datasource.method_id,
+            "source_id": datasource.source_id,
             "built_area": TileProcessor.reproject_polygon(built_polygon, tile.crs, "epsg:4326"),
             "size_built_area": TileProcessor.albers_area_calculation(built_polygon, tile.crs),
             "last_update": str(date.today()),
         }
-        if not results["method_id"]:
-            del results["method_id"]
+        if not results["source_id"]:
+            del results["source_id"]
         return results

     @staticmethod
-    def get_build_up_area(quadkey, datasource, database, database_crs_number, table_config):
+    def get_build_up_area(
+        quadkey, datasource, database, database_crs_number, table_config, buffer_magnitude
+    ):
         """Run the complete processing of a quadkey and returns a dictionary
         created with TileProcessor.build_dictionary.
@@ -310,7 +312,7 @@ class TileProcessor:
             quadkey (str): Quadkey code associated with a Bing quadtree tile.
             datasource (datasource.DataSource): DataSource instance with crs,
-                pathname, method_id and raster_files_index attributes.
+                pathname, source_id and raster_files_index attributes.
             database (database.Database): Database instance with credentials
                 and connection ready to perform data queries.
@@ -319,6 +321,9 @@ class TileProcessor:
             table_config (dict): Dictionary with table name, schema and geometry_field.
                 This is part of the config.yml file.
+            buffer_magnitude (float): Numeric magnitude for the polygon buffer (units are
+                equal to the coordinate system units)

         Returns:
             results (dictionary): Dictionary with built-up area information.
         """
@@ -342,7 +347,7 @@ class TileProcessor:
             tile=tile, crs_number=database_crs_number, **table_config
         )
         roads_processed = TileProcessor.process_dataframe_with_tile(
-            roads_in_tile, tile=tile, buffer_magnitude=3.0
+            roads_in_tile, tile=tile, buffer_magnitude=buffer_magnitude
         )
         refined_built_area = TileProcessor.polygon_difference(
             clip_built_geometry, roads_processed
......
@@ -33,7 +33,7 @@ def test_init():
         pathname=os.environ["TEST_BASE_PATH"],
         raster_files_index="GHS_TEST_INDEX.shp",
         built_pixel_values=[6, 5, 4, 3],  # Built pixels in GHSL
-        method_id=1,
+        source_id=1,
     )
     assert datasource.crs == "epsg:3857"
......
@@ -35,7 +35,7 @@ datasource = DataSource(
     pathname=os.environ["TEST_BASE_PATH"],
     raster_files_index="GHS_TEST_INDEX.shp",
     built_pixel_values=[6, 5, 4, 3],
-    method_id=1,
+    source_id=1,
 )
 tile = Tile("122100200320321022", datasource.crs)
......