Commit 2b0c392e authored by Janis Jatnieks

Update README.md

parent c670c76b
@@ -11,7 +11,6 @@ geochemistry simulators are quite slow compared to fluid
dynamics and constitute the main bottleneck for producing
highly detailed simulations of such application scenarios.
# Quick start
While this project was written with a very specific application
of surrogate models in mind, it can be re-used for more general
@@ -34,7 +33,22 @@ in the real world
will be supplied at run-time from the hydrodynamics or thermal
model, but are not produced as outputs from the surrogate
# Work-flow concept
Surrogate playground is designed to be used in a fairly simple way:
* Take input and output data, do some basic filtering and apply a set of preprocessing transforms. For each output, do this:
* Fit a set of ML models with all specified preprocessing methods and preset hyperparameters, hyperparameter search grids or optimizer settings, mostly through the caret interface in Surrogate_playgound
* Validate all the models that fit without crashing, using the same random samples for every model
* Collect validation results for all the models, using the commonly used error metrics as well as performance metrics, including prediction and training speed
* Select the best-suited model with regard to speed and accuracy, depending on the criteria specified
* Create a function called SelectedSurrogate in the current environment. When called with an input table that has the same columns used in training, it returns a table with all the output columns, laid out like the supplied input-output tables. Each output column is predicted by a different model, optionally in parallel. This is a simple ensembling approach, sometimes called a bucket of models.
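The steps above can be sketched in a few lines. The project itself is written in R around caret; the Python below is only a language-neutral illustration, and every name in it (`build_surrogate`, `CANDIDATES`, `fit_mean`, `fit_linear`, `selected_surrogate`) is hypothetical rather than part of the project's API. It assumes a single numeric input column, two toy candidate models and mean absolute error as the only selection criterion.

```python
import random
from statistics import mean


def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """Candidate: least-squares line y = a + b*x on a single input."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


# Hypothetical model registry; the real project uses caret models instead.
CANDIDATES = {"mean": fit_mean, "linear": fit_linear}


def mae(model, xs, ys):
    """Mean absolute error of a fitted model on (xs, ys)."""
    return mean(abs(model(x) - y) for x, y in zip(xs, ys))


def build_surrogate(inputs, outputs, validation_fraction=0.25, seed=1):
    """inputs: list of floats; outputs: dict of column name -> list of floats."""
    rng = random.Random(seed)  # fixed seed: every model sees the same split
    idx = list(range(len(inputs)))
    rng.shuffle(idx)
    n_val = max(1, int(len(idx) * validation_fraction))
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    x_tr = [inputs[i] for i in train_idx]
    x_va = [inputs[i] for i in val_idx]

    chosen = {}
    for name, ys in outputs.items():
        y_tr = [ys[i] for i in train_idx]
        y_va = [ys[i] for i in val_idx]
        scores = {}
        for label, fit in CANDIDATES.items():
            try:
                model = fit(x_tr, y_tr)
            except Exception:
                continue  # drop candidates that fail to fit
            scores[label] = (mae(model, x_va, y_va), model)
        best = min(scores, key=lambda k: scores[k][0])
        chosen[name] = (best, scores[best][1])

    def selected_surrogate(xs):
        """Predict every output column with its own selected model."""
        return {n: [m(x) for x in xs] for n, (_, m) in chosen.items()}

    return selected_surrogate, {n: label for n, (label, _) in chosen.items()}


xs = [float(i) for i in range(20)]
outs = {"linear_out": [2.0 * x + 1.0 for x in xs], "const_out": [5.0] * 20}
surrogate, picks = build_surrogate(xs, outs)
prediction = surrogate([2.0])
```

Here `picks` records which candidate was selected for each output column, and `surrogate` plays the role that SelectedSurrogate plays in the description above: one call returns all output columns, each from its own model.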
# Some additional features
* automated parallelization for serial ML models on Unix-like systems, using the plyr .parallel capability for data transforms as well as model fitting and prediction
* custom models can be added, as demonstrated with the DiceEval models and rpart_anova
* support for the most commonly used error functions and for MASE (Mean Absolute Scaled Error), which is quite useful when comparing prediction accuracy across multiple output variables with different value ranges
* fine-tuning of parallelization granularity at each key stage: data transform, fitting and prediction (be careful when using models with their own internal parallelization capability)
* use of the data.table library wherever possible for performance
* control of the random seed throughout the process
* saving of the resulting models with automatic experiment naming and the full environment (be mindful that this can be many GB in some cases, depending on the models being used and the number of models in the ensemble)
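
To illustrate why a scaled error helps when outputs have different value ranges, here is a minimal Python sketch. It rests on an assumption: the model's mean absolute error is divided by the mean absolute error of a baseline that always predicts the training mean. The README does not say which baseline the project uses, and the classic time-series definition scales by a naive previous-value forecast instead. The function name `mase` and its arguments are hypothetical.

```python
from statistics import mean


def mase(actual, predicted, train_actual):
    """Scaled error: model MAE divided by the MAE of a mean-only baseline.

    Assumption: the baseline always predicts the mean of `train_actual`.
    """
    mae_model = mean(abs(a - p) for a, p in zip(actual, predicted))
    baseline = mean(train_actual)
    mae_baseline = mean(abs(a - baseline) for a in train_actual)
    return mae_model / mae_baseline


# Two outputs with the same relative error but value ranges 1000x apart.
train_small = [1.0, 2.0, 3.0, 4.0]
actual_small = [2.0, 3.0]
pred_small = [2.5, 2.5]

scale = 1000.0
train_large = [v * scale for v in train_small]
actual_large = [v * scale for v in actual_small]
pred_large = [v * scale for v in pred_small]

mase_small = mase(actual_small, pred_small, train_small)
mase_large = mase(actual_large, pred_large, train_large)
```

The raw mean absolute errors of the two outputs differ by a factor of 1000, while the scaled errors coincide, so the metric can be compared directly across output columns.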