Commit e6ae396b authored by Cecilia Nievas

Added docs 05_Testing_Scripts and 06_Important_Definitions

parent 145115c5
# General
For each core script, the enumerated configurable parameters are those that are specific to that script, i.e. defined in the configuration file under a subtitle that matches the name of the file. General parameters are not explained herein but in `03_Config_File.md`.
# SERA_testing_rebuilding_exposure_from_cells_alternative_01.py
## Configurable parameters:
The parameters that need to be specified under the `SERA_testing_rebuilding_exposure_from_cells_alternative_01` section of the configuration file are:
- countries = Countries to process. If more than one, separate with comma and space.
- admin_ids_to_ignore = IDs of administrative units (e.g. 1110101) that should not be processed within those countries. This is useful for running only parts of countries; it can be left empty or omitted.
- sera_disaggregation_to_consider = area, gpw_2015_pop, ghs, sat_27f or sat_27f_model. Select the parameter to use to distribute the SERA model to the grid.
- occupancy_cases = Res, Com, Ind. Occupancy cases to process.
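As an illustration of how these parameters might be read, the following is a minimal sketch that assumes the configuration file follows an INI-like layout readable with Python's `configparser`. The country names, the helper `split_list` and the example values are hypothetical and are not taken from the actual scripts.

```python
import configparser

# Hypothetical configuration fragment; the values are examples only.
EXAMPLE_CONFIG = """
[SERA_testing_rebuilding_exposure_from_cells_alternative_01]
countries = Greece, Italy
admin_ids_to_ignore = 1110101
sera_disaggregation_to_consider = area
occupancy_cases = Res, Com, Ind
"""


def split_list(raw):
    """Split a comma-separated value into items, tolerating empty input."""
    return [item.strip() for item in raw.split(",") if item.strip()]


parser = configparser.ConfigParser()
parser.read_string(EXAMPLE_CONFIG)
section = parser["SERA_testing_rebuilding_exposure_from_cells_alternative_01"]

countries = split_list(section.get("countries", fallback=""))
# An empty or missing admin_ids_to_ignore simply yields an empty list.
admin_ids_to_ignore = split_list(section.get("admin_ids_to_ignore", fallback=""))
occupancy_cases = split_list(section.get("occupancy_cases", fallback=""))
disaggregation = section.get("sera_disaggregation_to_consider", fallback="").strip()
```

Reading from a string keeps the sketch self-contained; with a real file, `parser.read(path)` would take the place of `read_string`.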
## What the code does:
This code rebuilds the SERA model from the SERA-on-a-grid model resulting from running `SERA_distributing_exposure_to_cells.py`, and checks that the rebuilt model is the same as the original SERA model.
For a particular country and occupancy case (Res, Com, Ind), this code iterates over administrative units. The list of administrative units is retrieved from the PSQL tiles database. Firstly, the total number of buildings, dwellings, people and costs is retrieved from the full SERA files, which are read as Pandas DataFrames. Secondly, the code determines the list of cell IDs associated with this administrative unit of this country and goes cell by cell. For each cell, the total number of buildings, dwellings, and people is determined in two alternative ways: (i) by reading the total for that particular intersection of cell and country_admin_ID (i.e. the attributes of the country_admin_ID subgroup in the HDF5 files), and (ii) by deriving them from the values of dwellings/building and people/dwelling, as well as by converting the total number of buildings into number of buildings per building class. The total cost of all buildings in the administrative unit is derived from the values of dwellings/building, area/dwelling and cost/area.
The components of the HDF5 files being retrieved in this test are:
- The attributes `Num_Bdgs`, `Num_Dwells` and `Num_Ppl` of the `country_admin_ID` subgroups of a cell group in the grid cells HDF5 file.
- The `SERA_classes` and `SERA_vals` datasets of the `country_admin_ID` subgroups of a cell group in the grid cells HDF5 file.
- The `Locations` and `Parameters` datasets of each taxonomy group in the building classes HDF5 file.
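The derivation described in (ii) can be sketched as follows. This is a simplified illustration, not the code of the script: it assumes that `SERA_vals` holds the proportion of buildings of each class listed in `SERA_classes`, and it uses plain dictionaries as stand-ins for the HDF5 contents. The function name, the dictionary keys and the example classes and values are all hypothetical.

```python
def derive_cell_totals(num_bdgs, class_shares, class_params):
    """Derive buildings, dwellings, people and cost for one cell and admin unit.

    num_bdgs: total number of buildings (stand-in for the Num_Bdgs attribute).
    class_shares: {class name: proportion of buildings}, a stand-in for the
        SERA_classes / SERA_vals datasets (assumed to hold proportions).
    class_params: {class name: dict with dwell_per_bdg, ppl_per_dwell,
        area_per_dwell and cost_per_area}, a stand-in for the values stored
        in the building classes HDF5 file.
    """
    totals = {"bdgs": 0.0, "dwells": 0.0, "ppl": 0.0, "cost": 0.0}
    for name, share in class_shares.items():
        params = class_params[name]
        bdgs = num_bdgs * share                      # buildings of this class
        dwells = bdgs * params["dwell_per_bdg"]      # dwellings of this class
        totals["bdgs"] += bdgs
        totals["dwells"] += dwells
        totals["ppl"] += dwells * params["ppl_per_dwell"]
        # cost = dwellings x area/dwelling x cost/area
        totals["cost"] += dwells * params["area_per_dwell"] * params["cost_per_area"]
    return totals


# Hypothetical cell with 100 buildings split between two classes.
example = derive_cell_totals(
    num_bdgs=100.0,
    class_shares={"CR/LFINF": 0.25, "MUR": 0.75},
    class_params={
        "CR/LFINF": {"dwell_per_bdg": 4.0, "ppl_per_dwell": 2.0,
                     "area_per_dwell": 80.0, "cost_per_area": 1000.0},
        "MUR": {"dwell_per_bdg": 1.0, "ppl_per_dwell": 3.0,
                "area_per_dwell": 100.0, "cost_per_area": 500.0},
    },
)
```

Summing such per-cell results over all cells of an administrative unit gives the derived values that are compared against the SERA CSV totals.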
The output of this code (per occupancy case and per country) is:
- A CSV file in which each row is an administrative unit. The columns compare the number of buildings, dwellings, people and cost in the original SERA files against those obtained by rebuilding the model from the HDF5 files generated by `SERA_distributing_exposure_to_cells.py`. The rebuilt model is correct when the values in the difference columns are zero. The columns are:
    - admin_id
    - CSV_bdgs: number of buildings stemming from the original SERA CSV files.
    - HDF_bdgs: number of buildings stemming from the `Num_Bdgs` of the `country_admin_ID` subgroups in the grid cells HDF5 file.
    - HDF_derived_bdgs: number of buildings calculated using `SERA_classes`, `SERA_vals`, `Locations` and `Parameters` from the grid cells and building classes HDF5 files.
    - diff_bdgs: CSV_bdgs - HDF_bdgs (i.e., SERA - rebuilt).
    - diff_bdgs_derived: CSV_bdgs - HDF_derived_bdgs (i.e., SERA - rebuilt).
    - CSV_dwells: number of dwellings stemming from the original SERA CSV files.
    - HDF_dwells: number of dwellings stemming from the `Num_Dwells` of the `country_admin_ID` subgroups in the grid cells HDF5 file.
    - DF_derived_dwells: number of dwellings calculated using `SERA_classes`, `SERA_vals`, `Locations` and `Parameters` from the grid cells and building classes HDF5 files.
    - diff_dwells: CSV_dwells - HDF_dwells (i.e., SERA - rebuilt).
    - diff_dwells_derived: CSV_dwells - DF_derived_dwells (i.e., SERA - rebuilt).
    - CSV_ppl: number of people stemming from the original SERA CSV files.
    - HDF_ppl: number of people stemming from the `Num_Ppl` of the `country_admin_ID` subgroups in the grid cells HDF5 file.
    - HDF_derived_ppl: number of people calculated using `SERA_classes`, `SERA_vals`, `Locations` and `Parameters` from the grid cells and building classes HDF5 files.
    - diff_ppl: CSV_ppl - HDF_ppl (i.e., SERA - rebuilt).
    - diff_ppl_derived: CSV_ppl - HDF_derived_ppl (i.e., SERA - rebuilt).
    - CSV_cost: cost stemming from the original SERA CSV files.
    - HDF_derived_cost: cost calculated using `SERA_classes`, `SERA_vals`, `Locations` and `Parameters` from the grid cells and building classes HDF5 files.
    - diff_cost_derived: CSV_cost - HDF_derived_cost (i.e., SERA - rebuilt).
- A TXT file summarising the total values for the whole country. The rebuilt model is correct when these total values are the same. The summarised values are:
    - Total number of buildings from SERA CSV files.
    - Total number of buildings from cells summation.
    - Total number of dwellings from SERA CSV files.
    - Total number of dwellings from cells summation.
    - Total number of people from SERA CSV files.
    - Total number of people from cells summation.
    - Total cost of buildings from SERA CSV files.
    - Total cost of buildings from cells summation.
When reading the output CSV files, the difference columns need to be interpreted relative to the magnitude of the values being compared. It is common for the costs to result in non-zero differences which, compared against the actual cost, represent very small percentages. These differences are related not only to floating-point precision when running the code but also to the precision to which numbers are calculated and rounded in the original SERA model.
The summary TXT file refers to the whole country. If only part of the country has been processed with `SERA_distributing_exposure_to_cells.py`, the totals will not match and will need to be interpreted in light of independent knowledge of what is to be expected for the administrative units that have been processed.
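One possible way of judging the difference columns relative to the magnitude of the values is sketched below. The tolerance of `1e-6` and the function names are hypothetical choices for illustration; the scripts themselves only report the absolute differences.

```python
import math


def relative_difference(csv_value, rebuilt_value):
    """Return (csv - rebuilt) / csv, handling a zero reference value."""
    if csv_value == 0:
        return 0.0 if rebuilt_value == 0 else math.inf
    return (csv_value - rebuilt_value) / csv_value


def matches(csv_value, rebuilt_value, rel_tol=1e-6):
    """True when both values agree within a relative tolerance (assumed 1e-6)."""
    return math.isclose(csv_value, rebuilt_value, rel_tol=rel_tol, abs_tol=0.0)


# Hypothetical cost values: an absolute difference of 10 is negligible
# for a total cost of 1.25e9 but not for a total of 1000.
large_ok = matches(1.25e9, 1.25e9 - 10.0)
small_ok = matches(1000.0, 990.0)
```

A non-zero `diff_cost` value can therefore be acceptable for one administrative unit and a sign of a problem for another, depending on the size of `CSV_cost`.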
# SERA_testing_rebuilding_exposure_from_cells_alternative_02.py
## Configurable parameters:
The parameters that need to be specified under the `SERA_testing_rebuilding_exposure_from_cells_alternative_02` section of the configuration file are:
- countries = Countries to process. If more than one, separate with comma and space.
- min_grid_cell_id = Cell IDs with numbers below this one will be ignored (useful while running pieces of countries and not whole countries). Leave empty if no constraint applies.
- sera_disaggregation_to_consider = area, gpw_2015_pop, ghs, sat_27f or sat_27f_model. Select the parameter to use to distribute the SERA model to the grid.
- occupancy_cases = Res, Com, Ind. Occupancy cases to process.
## What the code does:
This code rebuilds the SERA model from the SERA-on-a-grid model resulting from running `SERA_distributing_exposure_to_cells.py`, and checks that the rebuilt model is the same as the original SERA model.
For a particular country and occupancy case (Res, Com, Ind), this code iterates over the cells associated with the country. The list of cell IDs is retrieved from the PSQL tiles database. For each cell, the total number of buildings (from the `Total` subgroup of a cell group) is transformed into number of buildings per class and per country_admin_ID by means of the proportions associated with each building class (the `SERA_classes` and `SERA_vals` datasets) and the proportion of buildings of each class within each country_admin_ID (the `SERA_subclasses` dataset of `Total`). The code then iterates over the building classes listed in `SERA_classes` and, within each class, over the country_admin_IDs involved. It retrieves the values of dwellings/building, people/dwelling, area/dwelling and cost/area from the building classes HDF5 file for a particular class at a particular country_admin_ID. In this way, the total number of dwellings, people and costs associated with each building class within each country_admin_ID is determined. The subtotals per country_admin_ID are then collected in arrays to which new administrative unit IDs are added as they are encountered. Once all cells have been assessed, results are collected per administrative unit and compared against the total number of buildings, dwellings, people and costs per administrative unit retrieved from the SERA full files, which are read as Pandas DataFrames.
The components of the HDF5 files being retrieved in this test are:
- The attribute `Num_Bdgs` of the `Total` subgroup of a cell group in the grid cells HDF5 file.
- The `SERA_classes`, `SERA_vals` and `SERA_subclasses` datasets of the `Total` subgroup of a cell group in the grid cells HDF5 file.
- The `Locations` and `Parameters` datasets of each taxonomy group in the building classes HDF5 file.
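The cell-by-cell accumulation of subtotals per administrative unit described above can be sketched as follows. This is a simplified illustration with plain dictionaries standing in for the HDF5 contents; the layout assumed for the class and administrative unit proportions, the function name and all example values are hypothetical.

```python
from collections import defaultdict


def accumulate_cell(totals, cell_num_bdgs, class_shares, admin_shares, params):
    """Add one cell's contribution to the running per-admin-unit subtotals.

    class_shares: {class: share of the cell's buildings} (assumed layout).
    admin_shares: {class: {admin_id: share of that class's buildings}}.
    params: {(class, admin_id): (dwell_per_bdg, ppl_per_dwell,
                                 area_per_dwell, cost_per_area)}.
    """
    for cls, class_share in class_shares.items():
        for admin_id, admin_share in admin_shares[cls].items():
            dwell_per_bdg, ppl_per_dwell, area_per_dwell, cost_per_area = params[(cls, admin_id)]
            bdgs = cell_num_bdgs * class_share * admin_share
            dwells = bdgs * dwell_per_bdg
            sub = totals[admin_id]  # created on first encounter of admin_id
            sub["bdgs"] += bdgs
            sub["dwells"] += dwells
            sub["ppl"] += dwells * ppl_per_dwell
            sub["cost"] += dwells * area_per_dwell * cost_per_area


totals = defaultdict(lambda: {"bdgs": 0.0, "dwells": 0.0, "ppl": 0.0, "cost": 0.0})
params = {("A", "101"): (2.0, 3.0, 50.0, 10.0), ("A", "102"): (1.0, 2.0, 100.0, 10.0)}

# Hypothetical cell shared by two administrative units.
accumulate_cell(totals, 40.0, {"A": 1.0}, {"A": {"101": 0.5, "102": 0.5}}, params)
# Hypothetical cell fully inside administrative unit 102.
accumulate_cell(totals, 10.0, {"A": 1.0}, {"A": {"102": 1.0}}, params)
```

A `defaultdict` is used here so that administrative units are added as they are encountered, which mirrors the growing arrays mentioned in the description.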
The output of this code (per occupancy case and per country) is:
- A CSV file in which each row is an administrative unit. The columns compare the number of buildings, dwellings, people and cost in the original SERA files against those obtained by rebuilding the model from the HDF5 files generated by `SERA_distributing_exposure_to_cells.py`. The rebuilt model is correct when the values in the difference columns are zero. The columns are:
    - admin_id
    - CSV_bdgs: number of buildings stemming from the original SERA CSV files.
    - HDF_bdgs: number of buildings calculated using `SERA_classes`, `SERA_vals`, `SERA_subclasses`, `Locations` and `Parameters` from the grid cells and building classes HDF5 files.
    - diff_bdgs: CSV_bdgs - HDF_bdgs (i.e., SERA - rebuilt).
    - CSV_dwells: number of dwellings stemming from the original SERA CSV files.
    - HDF_dwells: number of dwellings calculated using `SERA_classes`, `SERA_vals`, `SERA_subclasses`, `Locations` and `Parameters` from the grid cells and building classes HDF5 files.
    - diff_dwells: CSV_dwells - HDF_dwells (i.e., SERA - rebuilt).
    - CSV_ppl: number of people stemming from the original SERA CSV files.
    - HDF_ppl: number of people calculated using `SERA_classes`, `SERA_vals`, `SERA_subclasses`, `Locations` and `Parameters` from the grid cells and building classes HDF5 files.
    - diff_ppl: CSV_ppl - HDF_ppl (i.e., SERA - rebuilt).
    - CSV_cost: cost stemming from the original SERA CSV files.
    - HDF_cost: cost calculated using `SERA_classes`, `SERA_vals`, `SERA_subclasses`, `Locations` and `Parameters` from the grid cells and building classes HDF5 files.
    - diff_cost: CSV_cost - HDF_cost (i.e., SERA - rebuilt).
- A TXT file summarising the total values for the whole country. The rebuilt model is correct when these total values are the same. The summarised values are:
    - Total number of buildings from SERA CSV files.
    - Total number of buildings from cells summation.
    - Total number of dwellings from SERA CSV files.
    - Total number of dwellings from cells summation.
    - Total number of people from SERA CSV files.
    - Total number of people from cells summation.
    - Total cost of buildings from SERA CSV files.
    - Total cost of buildings from cells summation.
When reading the output CSV files, the difference columns need to be interpreted relative to the magnitude of the values being compared. It is common for the costs to result in non-zero differences which, compared against the actual cost, represent very small percentages. These differences are related not only to floating-point precision when running the code but also to the precision to which numbers are calculated and rounded in the original SERA model.
The summary TXT file refers to the whole country. If only part of the country has been processed with `SERA_distributing_exposure_to_cells.py`, the totals will not match and will need to be interpreted in light of independent knowledge of what is to be expected for the administrative units that have been processed.
# SERA_testing_rebuilding_exposure_from_cells_alternative_03.py
## Configurable parameters:
The parameters that need to be specified under the `SERA_testing_rebuilding_exposure_from_cells_alternative_03` section of the configuration file are:
- countries = Countries to process. If more than one, separate with comma and space.
- admin_ids_to_ignore = IDs of administrative units (e.g. 1110101) that should not be processed within those countries. This is useful for running only parts of countries; it can be left empty or omitted.
- sera_disaggregation_to_consider = area, gpw_2015_pop, ghs, sat_27f or sat_27f_model. Select the parameter to use to distribute the SERA model to the grid.
- occupancy_cases = Res, Com, Ind. Occupancy cases to process.
## What the code does:
This code rebuilds the SERA model from the SERA-on-a-grid model resulting from running `SERA_distributing_exposure_to_cells.py`, and checks that the rebuilt model is the same as the original SERA model.
For a particular country and occupancy case (Res, Com, Ind), this code iterates over administrative units. The list of administrative units is retrieved from the PSQL tiles database. For each administrative unit, the SERA model is read directly from the full CSV files, and the repeated taxonomy strings are merged together. At the same time, the model is also reconstructed from the HDF5 files by first determining the list of cell IDs associated with this administrative unit of this country, and then going cell by cell. For each cell, the number of buildings, dwellings, people and cost per `taxonomy*` (see `06_Important_Definitions.md`) is derived from the values of dwellings/building, people/dwelling, area/dwelling and cost/area, as well as by converting the total number of buildings (in the cell) into number of buildings per building class (in that cell). The results for each cell are stacked together to form the results for the whole administrative unit, merging together repeated `taxonomy*` strings. Once these two datasets are constructed (i.e. a DataFrame containing non-repeated taxonomy strings retrieved from the CSV files and another DataFrame containing non-repeated `taxonomy*` strings derived from the HDF5 files), the code goes one by one through the taxonomies in the CSV-derived DataFrame, identifies the associated cases of `taxonomy*`, and compares the number of buildings, dwellings, people and cost.
The output CSV file contains this comparison, with each row corresponding to a value of taxonomy in an administrative unit, and columns showing values for each of the two datasets and the difference between them.
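The merging of repeated strings and the matching of `taxonomy` against `taxonomy*` can be sketched as follows. This is a deliberately simplified illustration: it handles only the number of buildings, it sums all `taxonomy*` cases associated with a taxonomy, and it uses lists of tuples instead of DataFrames. The function names and example values are hypothetical.

```python
from collections import defaultdict


def merge_repeated(rows):
    """Sum the building counts of rows that share the same key."""
    merged = defaultdict(float)
    for key, bdgs in rows:
        merged[key] += bdgs
    return dict(merged)


def compare_by_taxonomy(csv_rows, hdf_rows):
    """Return {taxonomy: (CSV buildings, HDF buildings, difference)}.

    csv_rows: (taxonomy, buildings) pairs from the SERA CSV files.
    hdf_rows: (taxonomy*, buildings) pairs rebuilt from the HDF5 files; the
        part before '///' is taken as the taxonomy.
    """
    csv_merged = merge_repeated(csv_rows)
    hdf_merged = merge_repeated(
        (key.split("///")[0], bdgs) for key, bdgs in hdf_rows
    )
    return {
        tax: (csv_bdgs, hdf_merged.get(tax, 0.0), csv_bdgs - hdf_merged.get(tax, 0.0))
        for tax, csv_bdgs in csv_merged.items()
    }


# Hypothetical administrative unit with one repeated taxonomy in the CSV.
comparison = compare_by_taxonomy(
    csv_rows=[("CR/LFINF", 10.0), ("MUR", 5.0), ("CR/LFINF", 2.0)],
    hdf_rows=[
        ("CR/LFINF///urban/Res/4.0/80.0", 7.0),
        ("CR/LFINF///rural/Res/1.0/100.0", 5.0),
        ("MUR///urban/Res/1.0/100.0", 5.0),
    ],
)
```

Because GEM Taxonomy strings contain single slashes themselves, splitting on the triple slash is what allows the taxonomy to be recovered unambiguously.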
The output of this code (per occupancy case and per country) is:
- A CSV file in which each row is a `taxonomy*` in an administrative unit. The columns compare the number of buildings, dwellings, people and cost in the original SERA files against those obtained by rebuilding the model from the HDF5 files generated by `SERA_distributing_exposure_to_cells.py`. The rebuilt model is correct when the values in the difference columns are zero. The columns are:
    - admin_id
    - CSV_bdgs: number of buildings stemming from the original SERA CSV files.
    - HDF_bdgs: number of buildings calculated using `SERA_classes`, `SERA_vals`, `Locations` and `Parameters` from the grid cells and building classes HDF5 files.
    - diff_bdgs: CSV_bdgs - HDF_bdgs (i.e., SERA - rebuilt).
    - CSV_dwells: number of dwellings stemming from the original SERA CSV files.
    - HDF_dwells: number of dwellings calculated using `SERA_classes`, `SERA_vals`, `Locations` and `Parameters` from the grid cells and building classes HDF5 files.
    - diff_dwells: CSV_dwells - HDF_dwells (i.e., SERA - rebuilt).
    - CSV_ppl: number of people stemming from the original SERA CSV files.
    - HDF_ppl: number of people calculated using `SERA_classes`, `SERA_vals`, `Locations` and `Parameters` from the grid cells and building classes HDF5 files.
    - diff_ppl: CSV_ppl - HDF_ppl (i.e., SERA - rebuilt).
    - CSV_cost: cost stemming from the original SERA CSV files.
    - HDF_cost: cost calculated using `SERA_classes`, `SERA_vals`, `Locations` and `Parameters` from the grid cells and building classes HDF5 files.
    - diff_cost: CSV_cost - HDF_cost (i.e., SERA - rebuilt).
When reading the output CSV files, the difference columns need to be interpreted relative to the magnitude of the values being compared. It is common for the costs to result in non-zero differences which, compared against the actual cost, represent very small percentages. These differences are related not only to floating-point precision when running the code but also to the precision to which numbers are calculated and rounded in the original SERA model.
# SERA_testing_compare_visual_output_vs_OQ_input_files.py
It compares the number of buildings, people and cost per cell reported in the OpenQuake input file (generated from the grid) against those in the visual output CSV.
# SERA_create_outputs_QGIS_for_checking.py
It creates a summary of the parameters mapped (GHS, GPW, Sat, etc.) in CSV format to be read with QGIS, enabling a visual check of the results.
# SERA_testing_mapping_admin_units_to_cells_qualitycontrol.py
It checks the areas of the cells mapped for the administrative units for which step 3 was run.
# GDE_check_consistency.py
It carries out different consistency checks on the resulting GDE model (see detailed description of this script).
# GDE_check_OQ_input_files.py
It prints to screen some summary values of the files and checks that the asset ID values are all unique.
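A uniqueness check of this kind could look like the sketch below. It assumes that the asset IDs are stored in a CSV column called `id`; the column name, the function name and the sample rows are assumptions made for illustration and are not taken from the script.

```python
import csv
import io
from collections import Counter


def duplicated_asset_ids(csv_text, id_column="id"):
    """Return the sorted asset IDs that appear more than once."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[id_column] for row in reader)
    return sorted(asset_id for asset_id, n in counts.items() if n > 1)


# Hypothetical file content in which the ID Res_1 is repeated.
SAMPLE = """id,lon,lat,taxonomy,number
Res_1,23.7,37.9,MUR,3
Res_2,23.8,37.9,MUR,1
Res_1,23.9,38.0,CR/LFINF,2
"""

duplicates = duplicated_asset_ids(SAMPLE)
all_unique = not duplicates
```

An empty list of duplicates means that all asset IDs are unique.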
# GDE_check_tiles_vs_visual_CSVs.py
It reads the visual CSV output by cell and the corresponding GDE tiles HDF5 files and compares the number of buildings, cost and number of people in each cell according to each of the two. An output CSV file collects the discrepancies found, if any.
# Taxonomy*
In the SERA exposure model, the `taxonomy` field contains the string that defines the building class as per the GEM Taxonomy. The (already outdated) preliminary version of the SERA exposure model over which this code was developed required a series of parameters (apart from `taxonomy`) to define a building class unequivocally, so that each distinct class had only one value of the parameters dwellings/building, area/dwelling, people/dwelling and cost/area. This led to the concept of `taxonomy*`, i.e., an extended value of `taxonomy` that includes other fields. In the present code, `taxonomy*` is defined in the following way for each occupancy case:
- Res: `taxonomy///settlement_type/occupancy_type/dwell_per_bdg/area_per_dwell`
- Com: `taxonomy///settlement_type/occupancy_type/area_per_dwell`
- Ind: `taxonomy///settlement_type/occupancy_type/cost_per_area`
The use of the triple slash makes it possible to recover `taxonomy` as `taxonomy*.split('///')[0]`. The HDF5 files generated from the process of distributing the SERA model to the 10-arcsec grid (with `SERA_distributing_exposure_to_cells.py`) store `taxonomy*`.
These definitions of `taxonomy*` will most likely change in the future.
Not every country and occupancy case contains the same columns in the SERA full CSV files. The code adds missing columns so as to be able to treat all countries and cases in a homogeneous way. For example, if the `settlement_type` column does not exist, it is added with empty strings. The `taxonomy*` in this case will be something like `taxonomy////occupancy_type…` (note that four slashes are present: the three that go after taxonomy and the one that goes after the empty settlement type). For commercial and industrial exposure, the `dwellings` column does not exist, but it is inferred from the total costs and the intermediate dwelling-dependent variables that 1 building = 1 dwelling in these occupancy cases. Therefore, the `dwellings` column is added with values equal to the `buildings` column.
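The construction and splitting of `taxonomy*` can be illustrated with the minimal sketch below. The function `build_taxonomy_star`, the example taxonomy strings and field values, and the use of `str()` to format numbers are assumptions made for illustration; the actual code may format the fields differently.

```python
def build_taxonomy_star(taxonomy, occupancy_case, settlement_type="",
                        occupancy_type="", dwell_per_bdg="",
                        area_per_dwell="", cost_per_area=""):
    """Build a taxonomy* string following the definitions per occupancy case."""
    if occupancy_case == "Res":
        extra = [settlement_type, occupancy_type, dwell_per_bdg, area_per_dwell]
    elif occupancy_case == "Com":
        extra = [settlement_type, occupancy_type, area_per_dwell]
    elif occupancy_case == "Ind":
        extra = [settlement_type, occupancy_type, cost_per_area]
    else:
        raise ValueError("Unknown occupancy case: " + str(occupancy_case))
    # The triple slash separates the GEM Taxonomy string from the extra fields.
    return taxonomy + "///" + "/".join(str(value) for value in extra)


# Hypothetical residential class with all fields present.
res_star = build_taxonomy_star("CR/LFINF/HEX:4", "Res", settlement_type="urban",
                               occupancy_type="Res", dwell_per_bdg=4.0,
                               area_per_dwell=80.0)

# Hypothetical commercial class with an empty settlement type: four slashes in a row.
com_star = build_taxonomy_star("MUR", "Com", occupancy_type="Com",
                               area_per_dwell=120.0)

# Recover the original taxonomy from taxonomy*.
res_taxonomy = res_star.split("///")[0]
com_taxonomy = com_star.split("///")[0]
```

Even with an empty settlement type, splitting on the first triple slash still returns the original `taxonomy`, because the fourth slash ends up in the second element of the split.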