A Step-by-step Tutorial for Interaction Graphs Package ‘integr’

Petar Markovic

2019-05-24

Introduction

This vignette provides a step-by-step tutorial for using the Interaction graphs package “integr”. The package is an implementation of Aleks Jakulin’s Interaction Analysis methodology (http://stat.columbia.edu/~jakulin/Int/), inspired by the implementation in the Orange 2 data mining software (https://orange.biolab.si/).

The Concept¹

In the context of supervised machine learning, an interaction (i.e. a statistically relevant dependence) between two attributes \(X\) and \(Y\) in the presence of the context (i.e. class) attribute \(C\) is called a 3-way interaction. The strength of such an interaction is measured with the 3-way Interaction gain: \(I(X;Y;C) = I(X,Y;C) - I(X;C) - I(Y;C)\), which can equivalently be written as \(I(X;Y;C) = I(X;Y|C) - I(X;Y)\). Here, \(I(X;Y|C) = H(X|C) + H(Y|C) - H(X,Y|C)\) is the conditional Information gain (i.e. conditional Mutual information) between \(X\) and \(Y\) in the context \(C\), and \(I(X;Y) = H(X) + H(Y) - H(X,Y)\) is a measure of dependence (i.e. “correlation”) between \(X\) and \(Y\) regardless of the context, where \(H(X) = -\sum_{i} P_i \log_{2} P_i\) is Shannon’s entropy measured in bits, and \(P_i\) is the probability of the \(i\)-th value of \(X\). The 2-way Interaction gains of the single attributes \(X\) and \(Y\) are \(I(X;C) = InfoGain_{c}(X) = \sum_{x}\sum_{c}P(x,c)\log_{2}\frac{P(x,c)}{P(x)P(c)}\) and \(I(Y;C) = InfoGain_{c}(Y) = \sum_{y}\sum_{c}P(y,c)\log_{2}\frac{P(y,c)}{P(y)P(c)}\), respectively, and \(I(X,Y;C)\) is the same quantity computed for the pair \((X,Y)\) treated as a single attribute.
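The quantities above can be sketched in a few lines of base R. This is only an illustration of the formula \(I(X;Y;C) = I(X,Y;C) - I(X;C) - I(Y;C)\), not the package's internal code; the helper names `entropy`, `infoGain` and `interactionGain`, as well as the XOR toy data, are hypothetical and introduced here for the example.

```r
# Shannon entropy (in bits) of the joint distribution of one or more factors.
entropy <- function(...) {
  tab <- table(...)
  p <- as.vector(tab) / sum(tab)
  p <- p[p > 0]                      # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

# Information gain (mutual information): I(A;B) = H(A) + H(B) - H(A,B)
infoGain <- function(a, b) entropy(a) + entropy(b) - entropy(a, b)

# 3-way interaction gain: I(X;Y;C) = I(X,Y;C) - I(X;C) - I(Y;C)
interactionGain <- function(x, y, cls) {
  xy <- interaction(x, y, drop = TRUE)   # the pair (X, Y) as one attribute
  infoGain(xy, cls) - infoGain(x, cls) - infoGain(y, cls)
}

# Hypothetical toy data: the class is the XOR of two binary attributes,
# so neither attribute is informative alone, but the pair determines the class.
x   <- factor(c(0, 0, 1, 1))
y   <- factor(c(0, 1, 0, 1))
cls <- factor(c(0, 1, 1, 0))

xorGain <- interactionGain(x, y, cls)    # positive: X and Y are synergistic
dupGain <- interactionGain(x, x, x)      # negative: a duplicated attribute is redundant
```

In the XOR case the gain is positive because all the information about the class comes from the combination of the two attributes; in the duplicated case it is negative because the second attribute adds nothing new.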

Interaction graphs (Figure 1) are a graphical representation of the \(k\) most significant 3-way interactions (\(2 \leq k \leq 20\)). The graph consists of nodes, which represent interacting attributes (with their 2-way interaction gains indicated below the name), and weighted edges, which represent the strength of the 3-way interactions. There are two types of edges:

Figure 1: Interaction graph based on the toy-dataset ‘Golf’

Hence, interaction graphs can be used as a tool for understanding the most important interactions and for selecting attributes suitable for grouping or inclusion in a machine learning model.

The toy-data description

In this tutorial, the ‘Golf’ toy-dataset will be used. It is included in the package, and its structure is presented in the table below. It is a 14-row discrete data.frame (i.e. all columns are factors) with 6 attributes, of which 5 are input attributes and 1 is the class attribute. The input attributes are used to determine whether a game of golf was played given the conditions, and the decision is recorded in the class attribute:

Outlook   Temperature  Humidity  Windy  Others  Play
overcast  hot          high      FALSE  yes     yes
overcast  cool         normal    TRUE   yes     yes
overcast  mild         high      TRUE   yes     yes
overcast  hot          normal    FALSE  yes     yes
rainy     mild         high      FALSE  yes     yes
rainy     cool         normal    FALSE  yes     yes
rainy     cool         normal    TRUE   no      no
rainy     mild         normal    FALSE  yes     yes
rainy     mild         high      TRUE   no      no
sunny     hot          high      FALSE  no      no
sunny     hot          high      TRUE   no      no
sunny     mild         high      FALSE  no      no
sunny     cool         normal    FALSE  yes     yes
sunny     mild         normal    TRUE   yes     yes

Step-by-step tutorial

Reading the data

First, the ‘integr’ package and a dataset need to be loaded. The dataset must be discrete and must have a class attribute. Here the ‘Golf’ toy-dataset will be used.
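A minimal sketch of this step is given here. It assumes the toy-dataset is shipped with the package under the name `golf`; check the package index if the name differs.

```r
library(integr)

# Load the bundled toy-dataset (assumed to be exported as `golf`).
data("golf")

# Inspect the structure: all columns should be factors, with 'Play' as the class.
str(golf)
```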

Generating the interaction graph object

Once the data is loaded, an interaction graph object needs to be created. A data.frame containing the data must be provided, as well as the name of the class attribute as a string.
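As a sketch, and assuming the constructor is called `interactionGraph()` and takes the data.frame and the class attribute name as its first two arguments (they are passed by position here so as not to guess the argument names), the call could look like this:

```r
library(integr)
data("golf")   # assumed name of the bundled toy-dataset

# Build the interaction graph object; 'Play' is the class attribute.
g <- interactionGraph(golf, "Play")
```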

The additional parameters intNo (integer) and speedUp (boolean) are optional. The first sets the number of interactions to be displayed on the interaction graph (2 <= intNo <= 20, default 16). The second controls whether all attributes whose 2-way interaction gain is equal to zero (to the 4th decimal place) are pruned before the interactions are computed. Pruning speeds up the computation for larger datasets, but it can lead to less precise results, so it is turned off (i.e. set to FALSE) by default.

If the intNo parameter is set to an inappropriate value (i.e. less than 2, greater than 20, or larger than the theoretically possible number of interactions for the given dataset), it is automatically adjusted to fit and a warning message is printed.
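The optional parameters could be used as in the sketch below. The function name `interactionGraph()` and the dataset name `golf` are assumptions. The upper bound computed at the end is also an assumption: it counts the possible interactions as unordered pairs of input attributes, which the text does not spell out.

```r
library(integr)
data("golf")   # assumed name of the bundled toy-dataset

# Ask for the 5 strongest interactions and enable pruning of attributes
# whose 2-way interaction gain rounds to zero.
g5 <- interactionGraph(golf, "Play", intNo = 5, speedUp = TRUE)

# Assumed upper bound on intNo for this dataset: the number of unordered
# pairs of input attributes (all columns except the class attribute).
nInputs  <- ncol(golf) - 1
maxPairs <- choose(nInputs, 2)
```

A value of intNo above this bound would be expected to trigger the automatic adjustment and the warning described above.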

Plotting the interaction graph object

After the interaction graph object has been obtained, it can be plotted using plotIntGraph().

It only requires an interaction graph object as an input. Here the result of the previous step is used.
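A short sketch of the plotting step, again assuming the constructor is called `interactionGraph()` and the dataset is available as `golf`:

```r
library(integr)
data("golf")   # assumed name of the bundled toy-dataset

# Build the graph object, then plot it.
g <- interactionGraph(golf, "Play")
plotIntGraph(g)
```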

The result of this command is Figure 1.

Exporting the interaction graph object

The integr package allows interaction graphs to be exported to a file. The supported formats are: a Graphviz graph, SVG image, PNG image, PostScript (PS) file, or PDF. Each export is described below.

Export to a Graphviz binary file

g is the interaction graph object.

path is a string indicating the path (folder) in which the output should be saved.

fName is a string indicating the name of the output file. It should be given without an extension and without spaces. The default is ‘InteractionGraph’.
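A sketch of a Graphviz export follows. The export function name `igToGrViz()` is an assumption based on the package's naming, as are `interactionGraph()` and `golf`; the arguments g, path and fName are the ones described above. A temporary folder is used so that the example does not write into the working directory.

```r
library(integr)
data("golf")   # assumed name of the bundled toy-dataset
g <- interactionGraph(golf, "Play")

# Write the graph to a temporary folder; fName has no extension and no spaces.
outDir <- tempdir()
igToGrViz(g, path = outDir, fName = "golfGraph")
```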

Export to an SVG image

g is the interaction graph object.

path is a string indicating the path (folder) in which the output should be saved.

fName is a string indicating the name of the output file. It should be given without an extension and without spaces. The default is ‘InteractionGraph’.

h is the desired height of the output image in pixels. The default is 2000.
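For SVG, a sketch with a non-default height could look like this. The function name `igToSVG()` is an assumption, as are `interactionGraph()` and `golf`; the value 1000 is an arbitrary illustration.

```r
library(integr)
data("golf")   # assumed name of the bundled toy-dataset
g <- interactionGraph(golf, "Play")

# Export an SVG image that is 1000 pixels high instead of the default 2000.
svgDir <- tempdir()
igToSVG(g, path = svgDir, fName = "golfGraphSvg", h = 1000)
```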

Export to a PNG image

g is the interaction graph object.

path is a string indicating the path (folder) in which the output should be saved.

fName is a string indicating the name of the output file. It should be given without an extension and without spaces. The default is ‘InteractionGraph’.

h is the desired height of the output image in pixels. The default is 2000.
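For PNG, a sketch that relies on the default file name and height, so only g and path are passed. The function name `igToPNG()` is an assumption, as are `interactionGraph()` and `golf`.

```r
library(integr)
data("golf")   # assumed name of the bundled toy-dataset
g <- interactionGraph(golf, "Play")

# Export a PNG image with the default name ('InteractionGraph') and height (2000).
pngDir <- tempdir()
igToPNG(g, path = pngDir)
```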

Export to a PDF image

g is the interaction graph object.

path is a string indicating the path (folder) in which the output should be saved.

fName is a string indicating the name of the output file. It should be given without an extension and without spaces. The default is ‘InteractionGraph’.

h is the desired height of the output image in pixels. The default is 2000.
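For PDF, the call has the same shape. The function name `igToPDF()` is an assumption, as are `interactionGraph()` and `golf`.

```r
library(integr)
data("golf")   # assumed name of the bundled toy-dataset
g <- interactionGraph(golf, "Play")

# Export a PDF file named after the dataset, using the default height.
pdfDir <- tempdir()
igToPDF(g, path = pdfDir, fName = "golfGraphPdf")
```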

Export to a PS image

g is the interaction graph object.

path is a string indicating the path (folder) in which the output should be saved.

fName is a string indicating the name of the output file. It should be given without an extension and without spaces. The default is ‘InteractionGraph’.

h is the desired height of the output image in pixels. The default is 2000.
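For PostScript, a sketch with all four arguments given explicitly. The function name `igToPS()` is an assumption, as are `interactionGraph()` and `golf`.

```r
library(integr)
data("golf")   # assumed name of the bundled toy-dataset
g <- interactionGraph(golf, "Play")

# Export a PostScript file, stating the default height explicitly.
psDir <- tempdir()
igToPS(g, path = psDir, fName = "golfGraphPs", h = 2000)
```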


  1. See http://stat.columbia.edu/~jakulin/Int/ for more details on the methodology