# Intro
Surrogate playground is an automated machine learning framework (a kind of AutoML) written for rapidly
screening a large number of different ML models to serve as [surrogates](https://en.wikipedia.org/wiki/Surrogate_model) for a slow running simulator.
This was written for a reactive transport application where
a fluid flow model (hydrodynamics) is coupled to a geochemistry simulator.
Geochemistry simulators are quite slow compared to fluid
dynamics and constitute the main bottleneck for producing
highly detailed simulations of such application scenarios.
# Materials
For the proof-of-concept research pioneering the use of surrogate models for reactive transport geochemistry, see our [paper here](https://www.sciencedirect.com/science/article/pii/S1876610216310050) and our [EGU Poster presentation summarizing this work here](https://presentations.copernicus.org/EGU2016-12923_presentation.pdf).
# Quick start
While this project was written with a very specific application
of surrogate models in mind, it can be re-used for more general
purpose ML model creation. See the example files to get an
idea of how to quickly launch a large number of ML model fitting
experiments (a minimal launch sketch follows the list):
* `Experiment_controller_simple_example_general.R` for a general purpose ML fitting approach example for a regression experiment
* `Experiment_controller_surrogate_example.R` for an example of how to use this for surrogate model fitting the way it was designed to be used (model coupling not included)
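Assuming the controllers are self-contained R scripts (as the file names suggest), a run can presumably be launched like this:

```r
# from an interactive R session:
source("Experiment_controller_simple_example_general.R")

# or non-interactively from a shell:
# Rscript Experiment_controller_surrogate_example.R
```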
The key difference between coupled surrogate models and the general
purpose examples is that for coupled simulation experiments there are sometimes special fields and filtering considerations that you can use for the fitting experiments. Among them are:
* the ability to ensure that concentration values can never be negative, as some models incorrectly output slightly negative values, which is not possible in the real world (see the sketch after this list)
* the fact that some fields that are expected by the ML model as input will be supplied at run-time from the hydrodynamics or thermal model, but are not produced as outputs from the surrogate
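As an illustration of the first point, such a post-prediction filter can be as simple as clamping negative values to zero. The helper below is a hypothetical sketch, not part of the package API, and the column layout is assumed:

```r
# Hypothetical helper: clamp physically impossible negative concentrations
# in a table of surrogate predictions to zero.
# predictions: data.frame of predicted outputs; conc_cols: names of concentration columns
clamp_concentrations <- function(predictions, conc_cols) {
  predictions[conc_cols] <- lapply(predictions[conc_cols], pmax, 0)
  predictions
}
```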
# Work-flow concept
The way surrogate playground is designed to be used is fairly simple:
* Take input and output data, do some basic filtering and apply a set of preprocessing transforms. For each output, do this:
* Fit a bunch of ML models with all specified preprocessing methods and preset hyperparameters/hyperparameter search grids, using mostly the caret interface in Surrogate_playground
* Validate all the models that succeed at fitting without crashing, using the same random samples
* Collect validation results using all commonly used error metrics as well as performance metrics for all the models, including speed at predicting and training
* Select the best-suited model with regard to speed and accuracy, depending on the criteria specified
* Create a function called SelectedSurrogate in the current environment which, when called with an input table with the same columns used in training, will return a table with all the output columns, similarly to the supplied input-output tables. Each output column will be predicted by a different output model, optionally in parallel. This is a simple ensembling approach, sometimes called a [bucket of models](https://en.wikipedia.org/wiki/Ensemble_learning#Bucket_of_models); a minimal sketch of the idea follows this list.
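To make the last two steps concrete, here is a minimal sketch of the per-output bucket-of-models idea written directly against caret. This is not the project's actual API: `fit_bucket`, `predict_bucket` and the short candidate method list are made up for illustration, and the framework itself screens many more models, metrics and preprocessing options.

```r
library(caret)

# inputs: data.frame of predictors; outputs: data.frame with one column per target
fit_bucket <- function(inputs, outputs, methods = c("lm", "rpart", "knn")) {
  lapply(outputs, function(y) {
    # fit every candidate learner for this one output column
    fits <- lapply(methods, function(m)
      train(x = inputs, y = y, method = m,
            trControl = trainControl(method = "cv", number = 5)))
    # keep the candidate with the lowest cross-validated RMSE
    fits[[which.min(sapply(fits, function(f) min(f$results$RMSE, na.rm = TRUE)))]]
  })
}

# the selected surrogate: predict every output column from one input table
predict_bucket <- function(models, newdata)
  as.data.frame(lapply(models, predict, newdata = newdata))
```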
# Some additional features
* automated parallelization for serial ML models on Unix-like systems, using the plyr `.parallel` capability for preprocessing transforms as well as model fitting and prediction.
* custom models can be added, as demonstrated directly with DiceEval models and rpart_anova
* support for most commonly used error functions as well as [MASE (Mean Absolute Scaled Error)](https://en.wikipedia.org/wiki/Mean_absolute_scaled_error), which is quite useful when comparing prediction accuracy across multiple output variables with different value ranges (a small MASE computation is sketched after this list)
* fine-tuning of parallelization granularity at each key stage – data transform, fitting and prediction (be careful when using models with their own internal parallelization capability)
* use of the [data.table](https://github.com/Rdatatable/data.table/wiki/Benchmarks-:-Grouping) library wherever possible for performance optimization purposes
* control of random seed throughout the process
* saving of resulting models with automatic experiment naming and full environment (be mindful that this can be many GB in some cases – depending on the models being used and the number of models in the ensemble)
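As a small illustration of why MASE is handy here: it scales the model's mean absolute error by the in-sample mean absolute error of a naive one-step forecast (the definition in the linked article), so scores for outputs with very different value ranges are directly comparable. The snippet below is just that textbook definition, with made-up numbers; the exact scaling used inside Surrogate playground may differ.

```r
# textbook MASE: model MAE scaled by the MAE of a naive one-step
# forecast computed on the training values
mase <- function(actual, predicted, train_actual) {
  scaling <- mean(abs(diff(train_actual)))
  mean(abs(actual - predicted)) / scaling
}

# made-up values for illustration only
mase(actual       = c(1.0, 1.2, 0.9),
     predicted    = c(1.1, 1.1, 1.0),
     train_actual = c(0.8, 1.0, 1.1, 0.9))
```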