Commit 2cc30ee4 authored by Michael Rudolf

Ignorelist, First functions, Terminology

 - Added git-ignorelist
 - Added a todo list
 - Added a document describing the terminology
 - Added a first notebook and module for data preparation
parent fb6fe4d9
# ignore the cached variables from running the script
# ignore pictures, pdfs, and data files
# Byte-compiled / optimized / DLL files
# C extensions
# Distribution / packaging
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
# Installer logs
# Unit test / coverage reports
# Translations
# Django stuff:
# Flask stuff:
# Scrapy stuff:
# Sphinx documentation
# PyBuilder
# Jupyter Notebook
# IPython
# pyenv
# celery beat schedule file
# SageMath parsed files
# Environments
# Spyder project settings
# Rope project settings
# mkdocs documentation
# mypy
# Pyre type checker
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Preparation\n",
    "This notebook contains the steps needed to separate an original experiment into smaller subsets, which can then be used for feature generation.\n",
    "## A first look at the data\n",
    "First we briefly show what kind of data there is and what we need to do with it. The experiment is stored in an HDF5 file with the following contents:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the necessary modules\n",
    "import importlib\n",
    "import preparation\n",
    "\n",
    "# Workflow\n",
    "file_path = 'C:/Users/Michael/ownCloud/DocStelle/GitRepos/shear-madness/0-data-preparation/ExampleData/b_5kPa_371-01-27-GB300.h5'"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
#!/usr/bin/env python3
'''
Module containing functionality for splitting original ringsheartester data
into separate files and extracting sets/subsets for feature generation.

__AUTHOR__: Michael Rudolf
__DATE__: 13-Feb-2019
'''

# Imports
import numpy as np
import h5py
import logging
from scipy import signal
from ete3 import Tree

import eventfinder  # local module providing the peakdet() peak detector
# Public Functions
def show_h5_contents(cur_file):
    ''' Displays the contents of an HDF5 file in a structured format '''
    with h5py.File(cur_file, 'r') as exp:
        tree_string = _extract_keys_asTree(exp)
    tree = Tree(tree_string + ';', format=1)
    print(tree.get_ascii(show_internal=True))
def separate_velocities(velo_in, dv=0.1):
    '''
    Separates the velocity measurements into intervals with the same velocity.
    The output contains the interval indices and the mean velocities.
    '''
    # Downsample by a factor of 100 for quicker calculation
    interv = np.arange(0, len(velo_in), 100)
    # First take log10 and then median-filter the signal
    with np.errstate(divide='ignore'):  # ignore the division-by-zero warning
        velo = np.log10(velo_in[interv])
    fvelo = signal.medfilt(velo, 101)
    dvelo = fvelo[0:len(fvelo) - 10] - fvelo[10:len(fvelo)]
    dvelo[dvelo < 0] = 0
    # Use peak detection to find the changes
    [mini, maxi] = eventfinder.peakdet(dvelo, dv)
    # Transfer back to the original (non-downsampled) indices
    change = [n * 100 for n in maxi[0]]
    # Add the first index, then all detected change points
    intvls = [0]
    intvls.extend(change)
    # Calculate the mean velocity in each interval
    velint = []
    to_add = [0]  # interval indices to keep; the first one is always taken
    for i in range(len(intvls) - 1):
        new_vel = np.round(np.nanmean(velo_in[intvls[i]:intvls[i + 1]]), 4)
        # Check whether this value is actually a new velocity or just noise
        try:
            old_vel = velint[-1]
            if abs(old_vel - new_vel) < 0.0004:
                continue  # below noise level, merge with the previous interval
            velint.append(new_vel)
            to_add.append(i)
        except IndexError:
            # velint is still empty, so the first value is always taken
            velint.append(new_vel)
    new_intvls = [intvls[i] for i in to_add]
    new_intvls.append(len(velo_in))  # also add the last index for completeness
    return [new_intvls, velint]
def ram_per_feature(setfilelist,
                    window=30,
                    step_frac=1,
                    min_cycles=3,
                    min_win=10):
    '''
    Calculates the amount of RAM needed per feature.

    Keyword arguments:
    setfilelist -- A list with all set files for one experiment
    window      -- Window size that should be used
    step_frac   -- Step size as a fraction of the window size
    min_cycles  -- Minimum number of cycles covered at low shear rates
    min_win     -- Minimum required number of windows during a short cycle
    '''
# Internal Functions
def _extract_keys_asTree(f):
    ''' Generates a string for use with the ete3 Tree object '''
    tree_string = '('
    for key in f.keys():
        add = ''
        if isinstance(f[key], h5py.Group):
            # Recurse into groups
            tree_string += _extract_keys_asTree(f[key])
        else:
            # Datasets are annotated with their shape and dtype
            typ = str(f[key].dtype)
            shp = str(f[key].shape)
            if shp.endswith('()') or shp.endswith(',)'):
                shp = shp[0:-1] + '0)'
            shp = shp.replace(',', 'x')
            add = '(' + shp + '--->' + typ + ')'
        tree_string += add + '--' + key + ','
    tree_string = tree_string[0:-1] + ')'
    return tree_string
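The pipeline inside `separate_velocities` (downsample, take log10, median-filter, lagged difference, then pick peaks) can be illustrated standalone on synthetic data. This sketch uses only NumPy/SciPy; the local `eventfinder.peakdet` call is replaced here by a simple threshold on the filtered derivative:

```python
import numpy as np
from scipy import signal

# Synthetic loading velocity: three constant-rate segments
velo = np.concatenate([np.full(3000, 1e-3),
                       np.full(3000, 1e-2),
                       np.full(3000, 1e-1)])

# Same idea as in separate_velocities: log10, median filter, lagged difference
logv = np.log10(velo)
fvelo = signal.medfilt(logv, 101)
dvelo = np.abs(fvelo[:len(fvelo) - 10] - fvelo[10:])

# Velocity changes show up as peaks in the lagged difference;
# a fixed threshold stands in for the peak detector here
change_idx = np.flatnonzero(dvelo > 0.5)
print(change_idx.min(), change_idx.max())  # clustered around the two change points
```

Each detected index would then be multiplied by the downsampling factor to map it back onto the full-resolution channel, exactly as the `n * 100` step in the module does.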
- Feature creation
- SciKit Learn
- Slice Data:
  - Each Velocity
  - Each Normal Load
- Function that calculates the amount of RAM needed depending on:
  - WindowSize=30, StepSizeAsFraction=1, MinimumLengthOfCyclesCovered=3, MinimumWindowsPerCycle=10, MinFeatures=10
  - RAM-Proxy
  - Iteratively remove experiments
  - Output the Number of Windows and which Experiments would be needed
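The RAM item above can be sketched as a back-of-the-envelope estimate (a hypothetical helper, not the project's `ram_per_feature`): each window contributes one float64 value per feature, so memory scales with the number of windows.

```python
import numpy as np

def ram_per_feature_bytes(n_samples, window=30, step_frac=1.0):
    '''Rough bytes of RAM per feature: one float64 value per window.'''
    step = max(1, int(window * step_frac))
    n_windows = max(0, (n_samples - window) // step + 1)
    return n_windows * np.dtype('float64').itemsize

# One hour of a 625 Hz channel, non-overlapping 30-sample windows
print(ram_per_feature_bytes(625 * 3600))  # 75000 windows * 8 bytes = 600000
```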
# Terminology and Naming Style for StickSlipLearning
This is a short guide to the terms used in this project and how they are defined.
## Data
Terms used to describe the dataset itself.
|Term |Explanation
|--- |---
|Experiment |A file containing the raw measurement data including some intermediate processing results. The usual format is HDF5 (*e.g. b\_5kPa\_371-01-27-GB300.h5*) and its contents can be visualized by using `preparation.show_h5_contents()`.
|Set |Extracted data from an `Experiment` which has been taken at constant normal load and loading velocity. A set contains complete time series data from two `Channels` (`friction` and `lid displacement`), the respective average `normal load`, and `loading velocity`. Because the duration of each velocity step is different but load point displacement is constant, the length of a set is different depending on the current velocity. During data preparation several of these sets are created and saved in HDF5 format.
|Subset |A sliced version of a `Set` with the same properties but a smaller amount of samples. During data preparation a subset is generated from each set according to several prerequisites and saved in HDF5 format.
|Channel |A channel is a time series of measured data points, usually in the form of a one-dimensional array. The channels recorded by the testing machine are: `loading velocity`, `shear stress`, `normal stress`, and `lid displacement`. The usual acquisition frequency is 625 Hz. For advanced processing the channel data is sometimes converted into a different framework, e.g. the `shear stress` is converted to non-dimensional `friction`.
|Sample |A single measurement point in a `Channel`.
|Window |A small slice of data extracted from a `Channel`, covering a small number of samples. The window size may vary depending on the task and the available memory.
|Step(size) |Distance in samples between individual `Windows`. When `step size == window size` the windows follow each other without overlap; if `step size < window size` the individual windows overlap.
|Feature |A specific variable calculated in each window which is used as an input for the machine learning algorithm.
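The window/step relationship in the table can be made concrete with a small sketch (the helper name here is made up for illustration):

```python
def count_windows(n_samples, window, step):
    '''Number of complete windows of `window` samples, starting every `step` samples.'''
    if n_samples < window:
        return 0
    return (n_samples - window) // step + 1

n = 625 * 10                       # ten seconds of one channel at 625 Hz
print(count_windows(n, 30, 30))    # step == window: no overlap
print(count_windows(n, 30, 15))    # step < window: 50 % overlap, roughly twice as many
```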
## Setup
### Results
Terms that describe interpreted results from a qualitative analysis of the data. Here is a small illustration showing what the data looks like:
![PeakDetectionConvention.png]( "Data with description")
|Term |Explanation
|--- |---
|(Seismic) Cycle|During the experiment the shear stress shows a repetitive stick-slip pattern that is described by rate-and-state-dependent friction. A single occurrence of this pattern is called a 'cycle' because it resembles the seismic cycle that is produced by periodic loading and unloading of a seismically active fault. The pattern is characterized by a (non-)linear increase in shear stress, followed by a rapid drop of shear stress. For high loading rates the duration of such a cycle is very small, i.e. a few seconds, whereas at low loading rates the duration of a cycle can reach several tens of minutes.
|Dynamic slip |Rapid decay of shear stress at the end of a `Cycle`. Dynamic slip is distinguished by a slip velocity that is larger than the current loading rate.
|Slow slip |Slow decrease in shear stress, usually in the second half of a `Cycle`. May occur multiple times in one cycle and has a slip velocity that is at least one order of magnitude slower than `Dynamic Slip`.
|Creep |Difference between the interpolated linear reloading and the actually measured non-linear increase in shear stress during a `Cycle`. In the initial phase of a cycle the shear stress increases linearly and starts to deviate from this linear trend at a certain point, which denotes the onset of creep in the sample. While `Slow Slip` actually reduces shear stress, creep only influences the increase in shear stress and does not cause a decrease.
### Introduction of the measurement apparatus (WIP)