Configuration

Datasets configuration can be provided in a yaml file or as a nested list. Below you can find a detailed description of possible options.

Data frames in a dataset

A single YAML file can include multiple data frames. Entry for each will be used as name of the data frame when it comes to generating data.

first_data_frame:
  ...
second_data_frame:
  ...
third_data_frame:
  ...

Data frame configuration

Data frame configuration includes two sections:

  1. columns - where you can describe columns of your data frame.
  2. default_size - optional value that describes default size of the data frame.

Columns

Each column of your data frame should be described in a separate entry in columns section. Entry name will be used as column name.

Currently there are three major types of columns implemented:

  1. Built-in basic columns (integer, numeric, string, boolean and set)
  2. Columns that use custom function to be generated.
  3. Columns calculated from other columns.

Type of column is set by choosing a proper type value in column description. Check following sections for more details.

The order of columns will be the same as the order of entries in the configuration.

Built-in columns

Basic column types. For an example YAML configuration check this

integer

Random integers from a range

Parameters:

  • type: integer - column type
  • unique (optional, default: FALSE) - boolean, should values be unique
  • min (optional, default: 0) - integer, minimum value to occur in the column.
  • max (optional, default: 999999) - integer, maximum value to occur in the column.

Example:

data_frame:
  columns:
    integer_column:
      type: integer
      min: 2
      max: 10
numeric

Random float numbers from a range

Parameters:

  • type: numeric - column type
  • unique (optional, default: FALSE) - boolean, should values be unique
  • min (optional, default: 0) - numeric, minimum value to occur in the column.
  • max (optional, default: 999999) - numeric, maximum value to occur in the column.

Example:

data_frame:
  columns:
    numeric_column:
      type: numeric
      min: 2.12
      max: 10.3
string

Random string that follows given pattern

Parameters:

  • type: string - column type
  • unique (optional, default: FALSE) - boolean, should values be unique
  • length (optional, default: NULL) - integer, string length. If NULL, string length will be random (see next parameters).
  • min_length (optional, default: 1) - integer, minimum length if length is random.
  • max_length (optional, default: 15) - integer, maximum length if length is random.
  • pattern (optional, default: “[A-Za-z0-9]”) - string pattern, for details check this.

Example:

data_frame:
  columns:
    string_column:
      type: string
      length: 3
      pattern: "[ACGT]"
boolean

Random boolean

Parameters:

  • type: boolean - column type

Example:

data_frame:
  columns:
    boolean_column:
      type: boolean
set

Column with elements from a set

Parameters:

  • type: set - column type
  • set (optional, default: NULL) - set of possible values, if NULL, will use a random set.
  • set_type (optional, default: NULL) - type of random set, can be “integer”, “numeric” or “string”.
  • set_size (optional, default: NULL) - integer, size of random set
  • If set is random, you can add parameters required by type of set (e.g. min, max, pattern, etc.)

Example:

data_frame:
  columns:
    set_column_one:
      type: set
      set: ["aardvark", "elephant", "hedgehog"]
    set_column_two:
      type: set
      set_type: integer
      set_size: 3
      min: 2
      max: 10
date

Column with dates

Parameters:

  • type: date - column type
  • min_date - beginning of the time interval to sample from
  • max_date - end of the time interval to sample from
  • format (optional, default: NULL) - date format, for details check this

Example:

data_frame:
  columns:
    date_column:
      type: date
      min_date: 2012-03-31
      max_date: 2015-12-23
time

Column with times

Parameters:

  • type: time - column type
  • min_time (optional, default: “00:00:00”) - beginning of the time interval to sample from
  • max_time (optional, default: “23:59:59”) - end of the time interval to sample from
  • resolution (optional, default: “seconds”) - one of “seconds”, “minutes”, “hours”, time resolution

Example:

data_frame:
  columns:
    time_column:
      type: time
      min_time: "12:23:00"
      max_time: "15:48:32"
      resolution: "seconds"
datetime

Column with datetimes

Parameters:

  • type: datetime - column type
  • min_date - beginning of the time interval to sample from
  • max_date - end of the time interval to sample from
  • date_format (optional, default: NULL) - date format, for details check this
  • min_time (optional, default: “00:00:00”) - beginning of the time interval to sample from
  • max_time (optional, default: “23:59:59”) - end of the time interval to sample from
  • time_resolution (optional, default: “seconds”) - one of “seconds”, “minutes”, “hours”, time resolution
  • tz (optional, default: “UTC”) - time zone name

Example:

data_frame:
  columns:
    time_column:
      type: datetime
      min_date: 2012-03-31
      max_date: 2015-12-23
      min_time: "12:23:00"
      max_time: "15:48:32"
      time_resolution: "seconds"

Special columns

Special predefined types of columns. For an example YAML configuration check this

id

Id column - ordered integer that starts from defined value (default: 1).

Parameters:

  • type: id - column type
  • start (optional, default: 1) - first value

Example:

data_frame:
  columns:
    id_column:
      type: id
      start: 2
distribution

Column filled with values that follow given statistical distribution. You can use one of the distributions available here. You can use function name (e.g. rnorm) or regular distribution name (e.g. “Normal”). For available names, check this file.

Parameters:

  • type: distribution - column type
  • distribution_type - distribution name
  • ... - all arguments required by distribution function

Example:

data_frame:
  columns:
    normal_distribution:
      type: distribution
      distribution_type: Gaussian
    bernoulli_distribution:
      type: distribution
      distribution_type: binomial
      size: 1
      prob: 0.5
    poisson_distribution:
      type: distribution
      distribution_type: Poisson
      lambda: 3
    beta_distribution:
      type: distribution
      distribution_type: rbeta
      shape1: 20
      shape2: 30
    cauchy_distribution:
      type: distribution
      distribution_type: Cauchy-Lorentz

Custom columns

There are two levels of custom generator that can be used. You can provide a function that generates a single value or a function that provides a whole column. For examples check this configuration and this R script with functions.

custom value generator

Generate column values using custom function available in your environment. Function should return a single value.

Parameters:

  • type: custom - column type
  • custom_generator - name of the function that will provide values.
  • All parameters required by your custom function.

Example:

return_sample_paste <- function(vector_of_values) {
  values <- sample(vector_of_values, 2)
  paste(values, collapse = "_")
}
data_frame:
  columns:
    custom_column:
      type: custom
      custom_generator: return_sample_paste
      vector_of_values: ["a", "b", "c", "d"]
custom column generator

Generate column using custom function available in your environment. Function should accept argument size and return a vector of length equal to it.

Parameters:

  • type: custom_column - column type
  • custom_column_generator - name of the function that will generate column.
  • All parameters required by your custom function except size.

Example:

return_repeated_value <- function(size, value) {
  rep(value, times = size)
}
data_frame:
  columns:
    custom_column:
      type: custom_column
      custom_column_generator: return_repeated_value
      value: "Ask me about trilobites!"

Calculated columns

Calculate columns that depend on other columns. For examples check this configuration and this R script with functions.

Parameters:

  • type: calculated - column type
  • formula - calculation that has to be performed to obtain column

In general, formula can be a simple expression or a call of more complex function. In both cases formula has to include names of the columns required for the calculations. When using a function, make sure that it returns a vector of the same size as inputs.

Example:

check_column <- function(column) {
  purrr::map_lgl(column, ~.x >= 10)
}
data_frame:
  columns:
    basic_column:
      type: integer
      min: 1
      max: 10
    second_basic_column:
      type: integer
      min: 1
      max: 10
    calculated_column:
      type: calculated
      formula: basic_column + second_basic_column
    second_calculated_column:
      type: calculated
      formula: check_column(calculated_column)

Default size

Data frame can have a default number of rows that will be returned if size argument is not provided. Default size can be one of:

Example:

data_frame:
  columns:
    ...
  default_size: 10

Example:

random_number_of_rows:
  columns:
    ...
  default_size:
    arguments:
      min: 10
      max: 20
static_random_number_of_rows:
  columns:
    ...
  default_size:
    arguments:
      min: 5
      max: 10
    static: TRUE

For sample YAML configuration check this.

Arrange data frame

Data frame can be arranged by columns by providing a list of column names as arange field.

Example:

data_frame:
  columns:
    a:
      ...
    b:
      ...
    c:
      ...
    d:
      ...
  arrange: [a, c]