At the Nextcloud link below you can find the sample datasets and information packs for the problems that we will be tackling in the upcoming hackathon. These datasets and information packs have been prepared by Kathrin Ward and Michael Rudolf. If you have the time before the event, it might help to familiarise yourself with the datasets and create functions for reading in these data. The full datasets will be provided on the day. There might be one additional dataset, but we are waiting for confirmation.
Link to dataset samples and information: https://nextcloud.gfz-potsdam.de/s/kKZBJ4C2NH6iXmH
The venue is Haus H on the Telegrafenberg Campus of GFZ.
## Problems
### Laboratory shearing data
This software package contains (or will contain) a broad variety of scripts that split the raw experimental data generated in the analog laboratory into smaller subsets, which are then used to train a machine learning model on a certain set of predictors. The experiments are shear experiments in which a granular material is deformed under different stresses and at variable velocity. The applied shear stress and the current thickness of the layer are recorded, resulting in 2-channel continuous data with 3 constant variables per file (details outlined below). The complete dataset contains several experiments which, in the ideal case, would all be used for the machine learning analysis. For this short documentation we assume that the data is already split into appropriate sets in which the experimental conditions are constant, i.e. each file is a measurement taken at constant velocity and constant normal stress.
Sampling frequency of data: 625 Hz
Explanation of the labels in the `eqs` dataset:
- `eqi` -> peak stress before a slip event
- `eqe` -> minimum stress after a slip event
- `eqd` -> onset of dynamic failure, where v_slip > 0.02 m/s
- `eqm` -> peak slip velocity
- `eqf` -> end of dynamic failure, where v_slip < 0.02 m/s
- `creep` -> point where the load curve deviates from linear reloading
- `stiff` -> reloading stiffness (not an integer index but the slope used for the creep determination)
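As a starting point for picking some of these labels automatically, the `eqi` and `eqe` picks can be found as local extrema of the shear-stress channel. The sketch below is purely illustrative: it runs SciPy's `find_peaks` on a synthetic sawtooth stand-in for a stick-slip stress signal, since the real file format and any smoothing/threshold choices are not specified here.

```python
import numpy as np
from scipy.signal import find_peaks

fs = 625                           # sampling frequency in Hz (see above)
t = np.arange(0, 10, 1 / fs)
stress = (t * 0.5) % 1.0           # synthetic sawtooth "stick-slip" cycles

eqi, _ = find_peaks(stress)        # local maxima -> candidate eqi picks
eqe, _ = find_peaks(-stress)       # local minima -> candidate eqe picks
print(len(eqi), len(eqe))          # one pick per slip event
```

On real, noisy data one would typically pass `prominence` and `distance` arguments to `find_peaks` to suppress spurious picks.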
### LUCAS soil spectral library
LUCAS (Land Use/Cover Area frame statistical survey) is the most consistent and complete soil spectral library worldwide at continental scale. It consists of ~20,000 soil samples that were collected in 25 EU member states in 2012. LUCAS includes a range of soil properties, e.g. soil organic carbon (SOC) and clay content, and additional parameters characterizing the sampling location such as land use (LU), land cover (LC) and the GPS position. Additionally, all samples have been measured with a laboratory point spectrometer (FOSS) with a very high spectral resolution, resulting in 4200 spectral bands in the visible to short-wave infrared (400-2500 nm).
### Keras
[Keras](https://keras.io) is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation, since being able to go from idea to result with the least possible delay is key to doing good research.
Install it with `conda install -c conda-forge keras`, then use it with `import keras`.
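To check that the installation works, a minimal model can be built and run on random data. This is only a smoke test, assuming Keras with a working backend (e.g. TensorFlow) is installed; the layer sizes are arbitrary.

```python
import numpy as np
import keras
from keras import layers

# Tiny fully connected network: 10 inputs -> 1 sigmoid output.
model = keras.Sequential([
    keras.Input(shape=(10,)),
    layers.Dense(8, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Forward pass on random data just to confirm the model runs.
x = np.random.rand(4, 10).astype("float32")
y = model.predict(x, verbose=0)
print(y.shape)  # (4, 1)
```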
### Scikit-learn
[Scikit-learn](https://scikit-learn.org) is a free machine learning library for Python. It features various algorithms like support vector machine, random forests, and k-neighbours, and it also supports Python numerical and scientific libraries like NumPy and SciPy.
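A short end-to-end example with one of the algorithms mentioned above, a random forest, on scikit-learn's built-in iris dataset (all names here are real scikit-learn APIs):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a small built-in dataset and hold out 25% for testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit a random forest and report held-out accuracy.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
```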
# Methods
## LSTM (Long short-term memory)
Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in the field of deep learning. Unlike standard feedforward neural networks, LSTM has feedback connections. It can not only process single data points (such as images), but also entire sequences of data (such as speech or video). For example, LSTM is applicable to tasks such as unsegmented, connected handwriting recognition, speech recognition and anomaly detection in network traffic or IDS's (intrusion detection systems).
A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell.
LSTM networks are well-suited to classifying, processing and making predictions based on time series data, since there can be lags of unknown duration between important events in a time series. LSTMs were developed to deal with the exploding and vanishing gradient problems that can be encountered when training traditional RNNs. Relative insensitivity to gap length is an advantage of LSTM over plain RNNs, hidden Markov models and other sequence learning methods in numerous applications.
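The cell-and-gates structure described above can be sketched in a few lines of NumPy. This is one common formulation of a single LSTM step with random placeholder weights, not a trained model; the gate ordering in the stacked weight matrices is an arbitrary choice for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One time step of an LSTM cell.
    W: (4n, d) input weights, U: (4n, n) recurrent weights, b: (4n,) bias,
    stacked as [input gate, forget gate, cell candidate, output gate]."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:n])          # input gate: how much new information enters
    f = sigmoid(z[n:2 * n])      # forget gate: how much old cell state survives
    g = np.tanh(z[2 * n:3 * n])  # candidate cell update
    o = sigmoid(z[3 * n:4 * n])  # output gate: how much of the cell is exposed
    c = f * c_prev + i * g       # new cell state (the long-term memory)
    h = o * np.tanh(c)           # new hidden state (the cell's output)
    return h, c

rng = np.random.default_rng(0)
d, n = 3, 4                      # input and hidden sizes
W = rng.normal(size=(4 * n, d))
U = rng.normal(size=(4 * n, n))
b = np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
for t in range(5):               # unroll over a short random sequence
    h, c = lstm_step(rng.normal(size=d), h, c, W, U, b)
print(h.shape)  # (4,)
```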
## RNN (Recurrent neural network)
A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior. Derived from feedforward neural networks, RNNs can use their internal state (memory) to process variable length sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition.
The term “recurrent neural network” is used indiscriminately to refer to two broad classes of networks with a similar general structure, one with finite impulse response and the other with infinite impulse response. Both classes exhibit temporal dynamic behavior. A finite impulse recurrent network is a directed acyclic graph that can be unrolled and replaced with a strictly feedforward neural network, while an infinite impulse recurrent network is a directed cyclic graph that cannot be unrolled.
Both finite impulse and infinite impulse recurrent networks can have additional stored states, and the storage can be under direct control of the neural network. The storage can also be replaced by another network or graph if that incorporates time delays or has feedback loops. Such controlled states are referred to as gated state or gated memory and are part of long short-term memory networks (LSTMs) and gated recurrent units; networks of this kind are also called feedback neural networks.
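The basic recurrence that gives an RNN its internal state can be written as h_t = tanh(W x_t + U h_{t-1} + b). Below is a minimal NumPy sketch of unrolling a simple Elman-style network over a variable-length input sequence; the weights are random placeholders, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 2, 3                        # input size, hidden size
W = rng.normal(size=(n, d))        # input-to-hidden weights
U = rng.normal(size=(n, n))        # hidden-to-hidden (recurrent) weights
b = np.zeros(n)

h = np.zeros(n)                    # internal state ("memory")
for x in rng.normal(size=(6, d)):  # any sequence length works
    h = np.tanh(W @ x + U @ h + b)
print(h.shape)  # (3,)
```

Because the same `W`, `U` and `b` are reused at every step, the network handles sequences of any length with a fixed number of parameters.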
## CNN (Convolutional neural network)
In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural networks, most commonly applied to analyzing visual imagery. They are also known as shift invariant or space invariant artificial neural networks (SIANN), based on their shared-weights architecture and translation invariance characteristics.
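The shared-weights idea boils down to sliding one small kernel over the whole input, reusing the same weights at every position. A minimal NumPy sketch of this "valid" 2-D cross-correlation (the operation CNN layers actually compute), with a hand-made edge-detection kernel:

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Slide `kernel` over `img` with no padding; same weights everywhere."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

img = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
edge = np.array([[1.0, 0.0, -1.0]] * 3)          # vertical-edge kernel
out = conv2d_valid(img, edge)
print(out.shape)  # (3, 3)
```

Real frameworks implement the same operation with highly optimized vectorized code, but the arithmetic is identical.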
## Confusion matrix
In the field of machine learning and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class (or vice versa). The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabeling one as another).
It is a special kind of contingency table, with two dimensions ("actual" and "predicted"), and identical sets of "classes" in both dimensions (each combination of dimension and class is a variable in the contingency table).
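A quick example with scikit-learn's `confusion_matrix`. Note that scikit-learn fixes one of the two conventions mentioned above: rows are true classes and columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

y_true = ["cat", "cat", "dog", "dog", "dog", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "bird"]

# Rows: true class; columns: predicted class, in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=["bird", "cat", "dog"])
print(cm)
# [[1 0 0]
#  [0 1 1]
#  [0 1 2]]
```

Off-diagonal entries show exactly which classes get confused: here one true cat was labeled dog and one true dog was labeled cat.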