1 Exploratory Data Analysis

“Exploratory data analysis is detective work” [Tukey, 1977, p.2]. This package provides graphical tools for finding ‘quantitative indications’ that lead to a better understanding of the data at hand. “As all detective stories remind us, many of the circumstances surrounding a crime are accidental or misleading. Equally, many of the indications to be discerned in bodies of data are accidental or misleading” [Tukey, 1977, p.3]. The remedy is to compare the output of several graphical tools, looking either for agreement between them or for a hypothesis that can then be confirmed with statistical methods. This package serves as a starting point.

1.2 Distribution Analysis

"A scientifically sound procedure for the identification and analysis of empirical distributions is a comparison to a known theoretic distribution. The quantile/quantile plot (QQ-plot) allows comparing an empirical distribution to a known distribution [Michael, 1983]. Here, in 100 quantiles the model of a Gaussian distribution is compared to the data, and a straight line confirms a good data fit of the model. The Gaussian distribution is the canonical starting point for such a comparison […]

[t]he precise form, i.e., the type, nature and parameters of the formal model of the probability density function (pdf) is the […] goal of [Distribution] analysis. Usually, this is performed using kernel density estimators. The simplest of such a density estimation is the histogram. However, histograms are often misleading and require critical parameters such as the width of the bin [Keating and Scott, 1999]. A specially designed density estimation, which has been successfully proved in many practical applications is the “Pareto Density Estimation” (PDE). PDE consists of a kernel density estimator representing the relative likelihood of a given continuous random data [Ultsch, 2005]. PDE has been shown to be particularly suitable for the discovery of structures in continuous data hinting at the presence of distinct groups of data and particularly suitable for the discovery of mixtures of Gaussians [Ultsch, 2005]. The parameters of the kernels are auto-adapted to the data using an information theoretic optimum on skewed distributions [Ultsch, Thrun, Hansen-Goos, and Lötsch, 2015]." [Thrun/Ultsch 2018].
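
The QQ-plot comparison quoted above can be illustrated with a minimal sketch. It is written in plain Python (standard library only) rather than with this package's own plotting functions, and the helper names `qq_points` and `straightness` are hypothetical. It pairs 100 quantiles of a fitted Gaussian with the matching empirical quantiles and uses the correlation of those pairs as a rough indicator of how straight the QQ line is.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def qq_points(data, n_quantiles=100):
    """Return (theoretical, empirical) quantile pairs for a Gaussian QQ-plot.

    Hypothetical helper: the Gaussian model is fitted with the sample mean
    and sample standard deviation of the data.
    """
    xs = sorted(data)
    n = len(xs)
    model = NormalDist(mean(xs), stdev(xs))
    points = []
    for i in range(n_quantiles):
        p = (i + 0.5) / n_quantiles  # probability level strictly inside (0, 1)
        # Empirical quantile by linear interpolation between order statistics.
        pos = p * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        empirical = xs[lo] + (pos - lo) * (xs[hi] - xs[lo])
        points.append((model.inv_cdf(p), empirical))
    return points


def straightness(points):
    """Pearson correlation of the QQ points; values near 1 indicate a straight line."""
    tx = [t for t, _ in points]
    ex = [e for _, e in points]
    mt, me = mean(tx), mean(ex)
    cov = sum((t - mt) * (e - me) for t, e in points)
    var_t = sum((t - mt) ** 2 for t in tx)
    var_e = sum((e - me) ** 2 for e in ex)
    return cov / math.sqrt(var_t * var_e)


random.seed(1)
gaussian_sample = [random.gauss(10, 2) for _ in range(2000)]
skewed_sample = [random.expovariate(1.0) for _ in range(2000)]

r_gauss = straightness(qq_points(gaussian_sample))
r_skewed = straightness(qq_points(skewed_sample))
print("Gaussian sample:", round(r_gauss, 4))
print("Skewed sample:  ", round(r_skewed, 4))
```

The Gaussian sample should yield a value very close to 1, while the skewed sample bends away from the line and scores lower. A single number is only a summary; the plot itself shows where the deviation occurs.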

1.3 Mirrored Density Plots (MD-plots): PDE-Optimized Violin Plots

A clear model behind density estimation can outperform conventional visualization approaches. The approach is published in [Thrun et al., 2019]. The MD-plot is also available in Python.
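
To make the idea of a mirrored density concrete, the following sketch estimates a density on a grid and mirrors it around a vertical axis, as a violin-style outline does. It is an illustration only: it uses a plain Gaussian kernel with the normal-reference rule-of-thumb bandwidth, not the Pareto Density Estimation that the MD-plot relies on, and the function names `gaussian_kde` and `mirrored_outline` are hypothetical.

```python
import math
import random
from statistics import stdev


def gaussian_kde(data, grid, bandwidth=None):
    """Gaussian kernel density estimate evaluated at each grid value.

    Assumption: the default bandwidth is the normal-reference rule of thumb
    (1.06 * sd * n^(-1/5)), a stand-in for PDE, which is not implemented here.
    """
    n = len(data)
    if bandwidth is None:
        bandwidth = 1.06 * stdev(data) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - x) / bandwidth) ** 2) for x in data)
        for g in grid
    ]


def mirrored_outline(data, n_grid=200):
    """Return (left, value, right) triples: the density mirrored around zero."""
    lo, hi = min(data), max(data)
    pad = 0.1 * (hi - lo)  # extend the grid slightly beyond the data range
    step = (hi - lo + 2 * pad) / (n_grid - 1)
    grid = [lo - pad + i * step for i in range(n_grid)]
    density = gaussian_kde(data, grid)
    return [(-d, g, d) for g, d in zip(grid, density)]


random.seed(2)
# Bimodal toy data: two groups that a box plot would hide.
sample = [random.gauss(0, 1) for _ in range(150)] + [random.gauss(6, 1) for _ in range(150)]
outline = mirrored_outline(sample)

# Trapezoidal integral of the one-sided density as a sanity check (should be near 1).
area = sum(
    0.5 * (outline[i][2] + outline[i + 1][2]) * (outline[i + 1][1] - outline[i][1])
    for i in range(len(outline) - 1)
)
print("area under the estimated density:", round(area, 3))
```

Drawing the `left` and `right` columns against `value` gives the mirrored shape. The quality of that shape depends entirely on the underlying density estimator, which is the point made above.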

## [1]   43.48 1620.69

1.4 Correlation Analysis

Often it is better to visualize the density of scatter plots before calculating correlation coefficients.
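
A small, hedged illustration of why the visual check comes first: in the sketch below, `y` is completely determined by `x`, yet the Pearson coefficient is essentially zero because the relationship is not linear. The coarse 2D bin count is a crude stand-in for a density scatter plot; both helper names (`pearson`, `bin_counts_2d`) are hypothetical and not part of this package.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def bin_counts_2d(xs, ys, bins=5):
    """Count points per cell of a bins-by-bins grid (row 0 holds the lowest y values).

    Assumes both variables have a non-zero range.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    counts = [[0] * bins for _ in range(bins)]
    for x, y in zip(xs, ys):
        col = min(int((x - x_min) / (x_max - x_min) * bins), bins - 1)
        row = min(int((y - y_min) / (y_max - y_min) * bins), bins - 1)
        counts[row][col] += 1
    return counts


xs = list(range(-10, 11))
ys = [x * x for x in xs]  # perfect, but non-linear, dependency

r = pearson(xs, ys)
grid = bin_counts_2d(xs, ys)

print("Pearson r:", round(r, 6))
for row in reversed(grid):  # print the highest y values at the top
    print(" ".join(str(c) for c in row))
```

The printed grid shows the U-shaped structure that the coefficient alone misses.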

A shortcut for visualizing correlation coefficients when many features have to be compared against each other:

## Warning in cbind(Lsun3D$Data, runif(n), rnorm(n), rt(n, 2), rlnorm(n),
## rchisq(100, : number of rows of result is not a multiple of vector length
## (arg 6)