The GoodmanKruskal package: Measuring association between categorical variables

The standard association measure between numerical variables is the product-moment correlation coefficient introduced by Karl Pearson at the end of the nineteenth century. This measure characterizes the degree of linear association between numerical variables and is both normalized to lie between -1 and +1 and symmetric: the correlation between variables x and y is the same as that between y and x. Categorical variables arise commonly in many applications and the best-known association measure between two categorical variables is probably the chi-square measure, also introduced by Karl Pearson. Like the product-moment correlation coefficient, this association measure is symmetric, but it is not normalized. This lack of normalization provides one motivation for Cramer’s V, defined as the square root of a normalized chi-square value; the resulting association measure varies between 0 and 1 and is conveniently available in the vcd package. An interesting alternative to Cramer’s V is Goodman and Kruskal’s tau, which is not nearly as well known and is asymmetric. This asymmetry arises because the tau measure is based on the fraction of variability in the categorical variable y that can be explained by the categorical variable x. In particular, the fraction of variability in x that is explainable by variations in y may be very different from the fraction of variability in y that is explainable by variations in x, as examples presented here demonstrate. While this asymmetry is initially disconcerting, it turns out to be extremely useful, particularly in exploratory data analysis. This combination of utility and relative obscurity motivated the GoodmanKruskal package, developed to make this association measure readily available to the R community.

1. Introduction

Both in developing predictive models and in understanding relations between different variables in a dataset, association measures play an important role. In the case of numerical variables, the standard association measure is the product-moment correlation coefficient introduced by Karl Pearson at the end of the nineteenth century, which provides a normalized measure of linear association between two variables. In building linear regression models, it is desirable to have high correlations - either positive or negative - between the prediction covariates and the response variable, but small correlations between the different prediction covariates. In particular, large correlations between prediction covariates lead to the problem of collinearity in linear regression, which can result in extreme sensitivity of the estimated model parameters to small changes in the data, incorrect signs of some model parameters, and large standard errors, causing the statistical significance of some parameters to be greatly underestimated. In addition, the presence of highly correlated predictors can also cause difficulties for newer predictive model types: the tendency for the original random forest model class to preferentially include highly correlated variables at the expense of other predictors was one of the motivations for developing the conditional random forest method included in the party package (see the paper by Strobl et al., “Conditional Variable Importance for Random Forests,” BMC Bioinformatics, 2008, 9:307, http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-307). Alternatives to the product-moment correlation for numerical data include Spearman’s rank correlation and Kendall’s tau, both of which measure monotone association between variables (i.e., the tendency for “large” values of one variable to be associated with “large” values of the other).
All three of these correlation measures may be computed with the cor function in the base R stats package by specifying the method parameter appropriately. The Kendall and Spearman measures are easily extended to ordinal variables (i.e., ordered factors), but none of these measures are applicable to categorical (i.e., unordered factor) variables. Finally, note that all three of these association measures are symmetric: the correlation between $x$ and $y$ is equal to that between $y$ and $x$.
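As a brief illustration of the method parameter just mentioned, the following sketch computes all three correlation measures for two numerical columns of the built-in mtcars dataframe and checks the symmetry property. The choice of the columns mpg and wt is arbitrary and is made here only for illustration.

```r
# Three correlation measures from the cor function in the base R stats package.
# The columns mpg and wt are an arbitrary choice for this illustration.
x <- mtcars$mpg
y <- mtcars$wt

r_pearson  <- cor(x, y, method = "pearson")   # linear association (the default)
r_spearman <- cor(x, y, method = "spearman")  # rank-based monotone association
r_kendall  <- cor(x, y, method = "kendall")   # concordance-based monotone association

print(c(pearson = r_pearson, spearman = r_spearman, kendall = r_kendall))

# Symmetry: exchanging the two arguments leaves each measure unchanged
print(all.equal(cor(x, y, method = "kendall"), cor(y, x, method = "kendall")))
```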

Categorical variables arise frequently in practice, either because certain variables are inherently categorical (e.g., state, country, political affiliation, merchandise type, color, medical condition, etc.) or because numerical variables are frequently grouped in some application areas, converting them to categorical variables (e.g., replacing age with age group in demographic analysis). There is loss of information in making this conversion, but the original numerical data values are often not available, leaving us with categorical data for analysis and modeling. Also, numerical values are sometimes used to code categorical variables (e.g., numerical patient group identifiers), and some integer-valued variables can be treated as categorical with relatively little loss of information (e.g., the variables cyl, gear, and carb in the mtcars dataset). In any of these cases, quantitative association measures may be of interest, and the most popular measures available for categorical variables are the chi-square and Cramer’s V measures defined in Section 1.1 and available via the assocstats function in the vcd package. Like the correlation measures described above for numerical data, both of these association measures are symmetric: the association between $x$ and $y$ is the same as that between $y$ and $x$.

Much less well known than the chi-square and Cramer’s V measures is Goodman and Kruskal’s tau measure, which is described in Section 1.2 and forms the basis for the GoodmanKruskal package. In contrast to all of the association measures discussed in the preceding two paragraphs, Goodman and Kruskal’s tau measure is asymmetric: the association between variables $x$ and $y$ is generally not the same as that between $y$ and $x$. This asymmetry is inherent in the way Goodman and Kruskal’s tau is defined and, although it may be initially disconcerting, this characteristic of the association measure can actually be quite useful, as examples presented in Sections 2 and 3 illustrate.

It is possible to apply any of the categorical association measures just described to numerical data, but the results are frequently not useful. There are exceptions - the cases noted above where numerical variables are either effectively encodings of categorical variables or nearly so - but in the case of “typical” numerical data with few or no repeated values, these categorical association measures generally give meaningless results, a point discussed in detail in Section 4. Similarly, these association measures perform poorly in assessing relationships between a “typical” numerical variable and a categorical variable with a moderate number of levels. Because it is sometimes desirable to attempt to measure the association between mixed variable types, the function GroupNumeric has been included in the GoodmanKruskal package to convert numerical variables into categorical ones, which may then be used as a basis for association analysis between mixed variable types. This function is described and illustrated in Section 5, but it should be noted that this approach is somewhat experimental: there is loss of information in grouping a numerical variable into a categorical variable, but neither the extent of this information loss nor its impact is clear. Also, it is not obvious how many groups should be chosen, or how the results are influenced by different grouping strategies (the GroupNumeric function is based on the classInt package, which provides a number of different grouping methods). Nevertheless, the preliminary results presented in Section 5 suggest that this strategy does have promise for mixed-variable association analysis, and a simple rule-of-thumb is offered for selecting the number of groups.

The rest of this note is organized as follows. Section 1.1 presents a detailed problem formulation and describes the chi-square and Cramer’s V association measures. Section 1.2 then describes Goodman and Kruskal’s tau measure, and Section 2 gives a brief overview of the functionality included in the GoodmanKruskal package based on this association measure. Section 3 presents three examples to illustrate the kinds of results we can obtain with Goodman and Kruskal’s tau measure, demonstrating its ability to uncover unexpected relations between variables that may be very useful in exploratory data analysis. Section 4 then considers the important special case where the variable $x$ has no repeated values - e.g., continuously-distributed random variables or categorical variables equivalent to “record indices” - showing how Goodman and Kruskal’s tau measure breaks down completely in this case. Section 5 then introduces the GroupNumeric function to address this problem and describes its use. Finally, this note concludes with a brief summary in Section 6.

1.1 Problem formulation, chi-square, and Cramer’s V

The basic problem of interest here may be formulated as follows. We are given two categorical variables, $x$ and $y$, having $K$ and $L$ distinct values, respectively, and we wish to quantify the extent to which these variables are associated or “vary together.” It is assumed that we have $N$ records available, each listing values for $x$ and $y$; for convenience, introduce the notation $x \rightarrow i$ to indicate that $x$ assumes its $i^{th}$ possible value. The basis for all categorical association measures is the contingency table $N_{ij}$ which counts the number of times $x \rightarrow i$ and $y \rightarrow j$: \[ \begin{equation} N_{ij} = |\{ k \; | \; x_k \rightarrow i, y_k \rightarrow j\}|, \end{equation} \] where $| {\cal S} |$ indicates the number of elements in the set $\cal S$. The raw counts in this contingency table may be turned into simple probability estimates by dividing by the number of records $N$: \[ \begin{equation} \pi_{ij} = \frac{N_{ij}}{N}. \end{equation} \] The chi-square association measure is given by: \[ \begin{equation} X^2 = N \sum_{i=1}^{K} \sum_{j=1}^{L} \; \frac{(\pi_{ij} - \pi_{i+} \pi_{+j})^2}{\pi_{i+} \pi_{+j}}, \end{equation} \] where the marginals $\pi_{i+}$ and $\pi_{+j}$ are defined as: \[ \begin{eqnarray} \pi_{i+} & = & \sum_{j=1}^{L} \; \pi_{ij}, \\ \pi_{+j} & = & \sum_{i=1}^{K} \; \pi_{ij}. \end{eqnarray} \] The idea behind this association measure is based on the observation that, if $x$ and $y$ are regarded as discrete-valued random variables, then $\pi_{ij}$ is an empirical estimate of their joint distribution, while $\pi_{i+}$ and $\pi_{+j}$ are estimates of the corresponding marginal distributions. If $x$ and $y$ are statistically independent, the joint distribution is simply the product of the marginal distributions, and the $X^2$ measure characterizes the extent to which the estimated probabilities depart from this independence assumption.
Unfortunately, the $X^2$ measure is not normalized, varying between a minimum value of $0$ under the independence assumption to a maximum value of $N \min \{ K-1, L-1 \}$ (see Alan Agresti’s book, Categorical Data Analysis, Wiley, 2002, second edition, page 112). This observation motivates Cramer’s V measure, defined as: \[ \begin{equation} V = \sqrt{ \frac{X^2}{N \min \{ K-1, L-1 \} } }. \end{equation} \] This normalized measure varies from a minimum value of $0$ when $x$ and $y$ are statistically independent to a maximum value of $1$ when one variable is perfectly predictable from the other.
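The definitions in this section translate almost line by line into base R. The following is a minimal sketch, not the implementation used by the vcd package; the mtcars columns cyl and gear are an arbitrary choice of two few-level variables, and the result is cross-checked against the chisq.test function from the base R stats package.

```r
# Chi-square and Cramer's V computed directly from the definitions in Section 1.1.
# The variables cyl and gear from mtcars are an arbitrary choice for illustration.
tab <- table(mtcars$cyl, mtcars$gear)   # contingency table N_ij
N   <- sum(tab)                         # number of records
p   <- tab / N                          # joint estimates pi_ij
p_i <- rowSums(p)                       # marginals pi_i+
p_j <- colSums(p)                       # marginals pi_+j

expected <- outer(p_i, p_j)             # pi_i+ * pi_+j, the independence model
X2 <- N * sum((p - expected)^2 / expected)
V  <- sqrt(X2 / (N * min(nrow(tab) - 1, ncol(tab) - 1)))

# Cross-check against base R; the small-count warning is suppressed because
# only the value of the statistic is of interest here, not the p-value
X2_ref <- unname(suppressWarnings(chisq.test(tab, correct = FALSE))$statistic)

print(c(X2 = X2, X2_ref = X2_ref, V = V))
```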

1.2 Goodman and Kruskal’s tau measure

Goodman and Kruskal’s $\tau$ measure of association between two variables, $x$ and $y$, is one member of a more general class of association measures defined by: \[ \begin{equation} \alpha(x, y) = \frac{V(y) - E [V(y|x)]}{V(y)} \end{equation} \] where $V(y)$ denotes a measure of the unconditional variability in $y$ and $V(y|x)$ is the same measure of variability, but conditional on $x$, and its expectation is taken with respect to $x$. Different members of this family are obtained by selecting different definitions of these variability measures, as discussed in Section 2.4.2 of Agresti’s book. The specific choices that lead to Goodman and Kruskal’s $\tau$ measure are: \[ \begin{eqnarray} V(y) & = & 1 - \sum_{j=1}^{L} \; \pi_{+j}^2, \\ E [V(y|x)] & = & 1 - \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{\pi_{ij}^2}{\pi_{i+}}. \end{eqnarray} \] These equations form the basis for the function GKtau included in the GoodmanKruskal package. Before concluding this discussion, however, it is worth noting that substituting these expressions into the general expression for $\alpha(x, y)$ given above and simplifying (using $\sum_{i} \pi_{i+} = 1$ to write $\sum_{j} \pi_{+j}^2 = \sum_{i} \sum_{j} \pi_{i+} \pi_{+j}^2$), we obtain the following explicit expression for Goodman and Kruskal’s $\tau$ measure: \[ \begin{equation} \tau(x, y) = \frac{ \sum_{i=1}^K \sum_{j=1}^L \; \left( \frac{ \pi_{ij}^2 - \pi_{i+}^2 \pi_{+j}^2 }{ \pi_{i+} } \right) }{ 1 - \sum_{j=1}^L \pi_{+j}^2 }. \end{equation} \] It follows from the fact that $i$ and $j$ are not interchangeable on the right-hand side of this equation that $\tau(y, x) \neq \tau(x, y)$, in general.
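To make the two variability measures concrete, the following sketch evaluates $V(y)$ and $E[V(y|x)]$ directly from a contingency table in base R. The helper name tau_sketch is hypothetical and this is not the GKtau function itself; the mtcars columns cyl and vs are used only because they are built in and have few levels.

```r
# Goodman and Kruskal's tau from its definition: (V(y) - E[V(y|x)]) / V(y).
# tau_sketch is a hypothetical helper name, not part of the GoodmanKruskal package.
tau_sketch <- function(x, y) {
  p     <- prop.table(table(x, y))   # joint estimates pi_ij
  p_x   <- rowSums(p)                # marginals pi_i+
  p_y   <- colSums(p)                # marginals pi_+j
  v_y   <- 1 - sum(p_y^2)            # unconditional variability V(y)
  ev_yx <- 1 - sum(p^2 / p_x)        # E[V(y|x)]; row i is divided by pi_i+
  (v_y - ev_yx) / v_y
}

tau_forward  <- tau_sketch(mtcars$cyl, mtcars$vs)   # from cyl to vs
tau_backward <- tau_sketch(mtcars$vs, mtcars$cyl)   # from vs to cyl

# The two directions differ, illustrating the asymmetry of the measure
print(round(c(forward = tau_forward, backward = tau_backward), 3))
```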

2. The GoodmanKruskal R package

The GoodmanKruskal package includes four functions to compute Goodman and Kruskal’s $\tau$ measure and support some simple extensions. These functions are:

GKtau is the basic function to compute both the forward association $\tau(x, y)$ and the backward association $\tau(y, x)$ between two categorical vectors $x$ and $y$;
GKtauDataframe computes the Goodman Kruskal association measures between all pairwise combinations of variables in a dataframe;
GroupNumeric groups a numeric vector, returning a factor that can be used in association analysis, for reasons discussed in Sections 4 and 5;
plot.GKtauMatrix is a plot method for the S3 objects of class GKtauMatrix returned by the GKtauDataframe function.

As noted, GKtau is the basic function on which the GoodmanKruskal package is built. This function is called with two variables, $x$ and $y$, and it returns a single-row dataframe with six columns, giving the name of each variable, the number of distinct values each exhibits, and both the forward association $\tau(x,y)$ and the backward association $\tau(y,x)$. By default, missing values are treated as a separate level, if they are present (i.e., the presence of missing data increases the number of distinct values by one); alternatively, any valid value for the useNA parameter of the table function in base R may be specified for the optional includeNA parameter in the GKtau call. The other optional parameter for the GKtau function is dgts, which specifies the number of digits to retain in the results; the default value is $3$.
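Since the includeNA parameter accepts the same values as the useNA parameter of the table function, the effect of the missing-value options can be previewed with table alone. The sketch below uses two made-up vectors and shows only the base R behavior, not the GKtau call itself.

```r
# Effect of the useNA options of the base R table function on missing values.
# The vectors x and y are made-up illustration data.
x <- c("a", "b", NA, "a", "b", "a")
y <- c("u", "v", "v", "u", NA, "u")

tab_ifany <- table(x, y, useNA = "ifany")  # NA is counted as an extra level where present
tab_no    <- table(x, y, useNA = "no")     # records containing NA are dropped

print(tab_ifany)
print(tab_no)

# With "ifany", each variable gains one level and all six records are counted
print(dim(tab_ifany))
print(dim(tab_no))
```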

As a specific illustration of the GKtau function, consider its application to the categorical variables Manufacturer and Cylinders from the Cars93 dataframe in the MASS package, a dataframe considered further in Section 3.1:

GKtau(Cars93$Manufacturer, Cars93$Cylinders)

##                 xName            yName Nx Ny tauxy tauyx
## 1 Cars93$Manufacturer Cars93$Cylinders 32  6 0.364 0.058

This example illustrates the asymmetry of the Goodman-Kruskal tau measure: knowledge of Manufacturer is somewhat predictive of Cylinders, but the reverse association is much weaker; knowing the number of cylinders tells us almost nothing about who manufactured the car. An even more dramatic example from the same dataframe is the association between the Manufacturer variable and the variable Origin, with levels “USA” and “non-USA”:

GKtau(Cars93$Manufacturer, Cars93$Origin)

##                 xName         yName Nx Ny tauxy tauyx
## 1 Cars93$Manufacturer Cars93$Origin 32  2     1 0.046

Here, knowledge of the manufacturer is enough to completely determine the car’s origin - implying that each manufacturer has been characterized as either “foreign” or “domestic” - but knowledge of Origin provides essentially no ability to predict manufacturer, since each of the two origin groups includes many different manufacturers. (It is interesting to note that the Cramer’s V value returned by the assocstats function from the vcd package for this pair of variables is $1$, correctly identifying the strength of the relationship between these variables, but giving no indication of its extreme directionality.)
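The same pattern can be reproduced with a small made-up dataset in which $y$ is a deterministic function of $x$. In the sketch below, the vectors and the helper name tau_sketch are hypothetical and base R is used throughout; the forward tau and Cramer’s V both equal $1$, while the backward tau is much smaller.

```r
# Made-up data: four "manufacturers" (x), each belonging to exactly one "origin" (y)
x <- c("A", "A", "B", "B", "C", "C", "D", "D")
y <- c("dom", "dom", "dom", "dom", "for", "for", "for", "for")

# tau_sketch is a hypothetical helper implementing the definition from Section 1.2
tau_sketch <- function(x, y) {
  p   <- prop.table(table(x, y))
  v_y <- 1 - sum(colSums(p)^2)
  (v_y - (1 - sum(p^2 / rowSums(p)))) / v_y
}

tau_xy <- tau_sketch(x, y)   # x determines y completely
tau_yx <- tau_sketch(y, x)   # y only narrows x down to two candidates

# Cramer's V from the definition in Section 1.1 (X^2 / N is the sum below)
p <- prop.table(table(x, y))
e <- outer(rowSums(p), colSums(p))
V <- sqrt(sum((p - e)^2 / e) / min(dim(p) - 1))

print(c(tau_xy = tau_xy, tau_yx = tau_yx, V = V))
```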

The function GKtauDataframe is a wrapper that applies GKtau to all pairs of variables in a dataframe. This function returns an S3 object of class “GKtauMatrix” that consists of a square matrix with one row and one column for each variable included in the dataframe. The diagonal elements of this matrix give the number of unique values for each variable, and the off-diagonal elements contain the forward and backward tau measures for each variable pair. The GoodmanKruskal package includes a plot method for the S3 objects returned by the GKtauDataframe function, based on the corrplot package; detailed demonstrations of both the GKtauDataframe function and the associated plot method are given in Section 3.
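The layout of the matrix just described can be mimicked in a few lines of base R. This is only a sketch of the structure, not the package’s implementation: the helper name tau_sketch and the choice of four few-level mtcars columns are assumptions made for illustration, and the result is a plain matrix rather than an object of class “GKtauMatrix”.

```r
# tau_sketch is a hypothetical helper implementing the definition from Section 1.2
tau_sketch <- function(x, y) {
  p   <- prop.table(table(x, y))
  v_y <- 1 - sum(colSums(p)^2)
  (v_y - (1 - sum(p^2 / rowSums(p)))) / v_y
}

# Four few-level columns of mtcars, chosen only for illustration
df   <- mtcars[, c("cyl", "vs", "am", "gear")]
vars <- names(df)

# Diagonal: number of distinct values; off-diagonal: tau from row variable to column variable
tau_mat <- matrix(NA_real_, length(vars), length(vars), dimnames = list(vars, vars))
for (i in vars) {
  for (j in vars) {
    tau_mat[i, j] <- if (i == j) length(unique(df[[i]])) else tau_sketch(df[[i]], df[[j]])
  }
}

print(round(tau_mat, 3))
```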

The GroupNumeric function converts numeric variables into categorical variables, to serve as a basis for association analysis between variables of different types. Motivation for this function comes from the fact, discussed in Sections 4 and 5, that a continuously distributed random variable $x$ exhibits no “ties” or duplicated values, implying that the number of levels for $x$ is equal to $N$, the number of records. As shown in Section 4, this means $\tau(x, y) = 1$ for any other variable $y$, rendering the Goodman-Kruskal measure useless in such situations. Grouping numerical variables reduces the number of distinct values, and this approach can - at least in some cases - provide a useful basis for characterizing the association between numerical and categorical variables. The GroupNumeric function is based on the classIntervals function from the classInt R package, which provides a variety of different procedures for grouping numerical variables. A more detailed discussion of the GroupNumeric function is given in Section 5, which illustrates its use.

3. Three examples

The following examples illustrate the use of Goodman and Kruskal’s $\tau$ measure of association between categorical variables in uncovering possibly surprising features in a dataset. The two examples presented in Section 3.1 are both based on the Cars93 dataframe in the MASS package, and they illustrate two key points, each based on a different subset of the 27 columns from the dataframe. The first example provides a useful illustration of the general behavior of Goodman and Kruskal’s $\tau$ measure, including its asymmetry, while the second example illustrates an important special case, discussed in detail in Section 4. The third example is presented in Section 3.2 and it illustrates the utility of Goodman and Kruskal’s $\tau$ measure in exploratory data analysis, uncovering a relationship that is not obvious, although easily understood once it is identified.

3.1 The Cars93 dataframe: two examples

The Cars93 dataframe from the MASS package characterizes 93 different cars in terms of 27 attributes. The plot below gives a graphical summary of the results obtained using the GKtauDataframe procedure described in Section 2, applied to a subset of five of these attributes. As noted, this function returns an S3 object of class “GKtauMatrix” and the plot shown here was generated using the default options of the plot method for this object class. For this example, the resulting plot is in the form of a $5 \times 5$ array, with the variable names across the top and down the left side. The diagonal entries in this display give the numbers of unique levels for each variable, while the off-diagonal elements give both numeric and graphical representations of the Goodman-Kruskal $\tau$ values. Specifically, the numerical values appearing in each row represent the association measure $\tau(x, y)$ from the variable $x$ indicated in the row name to the variable $y$ indicated in the column name. Looking at the upper left $2 \times 2$ sub-array from this plot provides a graphical representation of the second GKtau function example presented in Section 2 to emphasize the extreme asymmetry possible for the Goodman-Kruskal $\tau$ measure. Specifically, the association from Manufacturer to Origin is $\tau(x, y) = 1$, as indicated by the degenerate ellipse (i.e., straight line) in the $(1, 2)$-element of this plot array. In contrast, the opposite association - from Origin to Manufacturer - has a $\tau$ value of only $0.05$, small enough to be regarded as zero. As noted, this result means that Origin is perfectly predictable from Manufacturer, but Origin gives essentially no information about Manufacturer; in practical terms, this suggests we can uniquely associate an origin (i.e., “USA” or “non-USA”) with every manufacturer, but that each of these origin designations includes multiple manufacturers.
Looking carefully at the $32 \times 2$ contingency table constructed from these variables confirms this result - every manufacturer appears in only one of the two groups, with 48 of the 93 cars in the “USA” group and 45 in the “non-USA” group - but it is much easier to see this from the plot shown below.

varSet1 <- c("Manufacturer", "Origin", "Cylinders", "EngineSize", "Passengers")
CarFrame1 <- subset(Cars93, select = varSet1)
GKmatrix1 <- GKtauDataframe(CarFrame1)
plot(GKmatrix1)

More generally, it appears from the plot of $\tau$ values that the variable Origin explains essentially no variability in any of the other variables, while all of the reverse associations are larger, ranging from a small association seen with Cylinders ($0.14$) to the complete predictability from Manufacturer just noted. The variable Cylinders exhibits a slight ability to explain variations in the other variables (ranging from $0.06$ to $0.14$), but two of the reverse associations are much larger: the $\tau$ value from Manufacturer to Cylinders is $0.36$, while that from EngineSize is $0.85$, indicating quite a strong association. Again, after carefully examining the underlying data, it appears that larger engines generally have more cylinders, but that for each cylinder count, there exists a range of engine sizes, with significant overlap between some of these ranges.

The next plot is basically the same as that just considered, with the addition of a single variable: Make, which completely specifies the car described by each record of the Cars93 dataframe. Because of the way it is constructed, the only new features are the bottom row and the right-most column. Here, the asymmetry of the Goodman-Kruskal $\tau$ measure is even more extreme, since the variable Make is perfectly predictive of all other variables in the dataset. As shown in Section 4, this behavior is a consequence of the fact that Make exhibits a unique value for every record in the dataset, meaning it is effectively a record index. Conversely, note that these other variables are at best moderate predictors of the variations seen in Make (specifically, Manufacturer and EngineSize are somewhat predictive). Also, it is important to emphasize that perfect predictors need not be record indices, as in the case of Manufacturer and Origin discussed above.

varSet2 <- c("Manufacturer", "Origin", "Cylinders", "EngineSize", "Passengers", "Make")
CarFrame2 <- subset(Cars93, select = varSet2)
GKmatrix2 <- GKtauDataframe(CarFrame2)
plot(GKmatrix2)

3.2 The Greene dataframe

The third and final example presented here is based on the Greene dataframe from the car package, which has 384 rows and 7 columns, with each row characterizing a request to the Canadian Federal Court of Appeal filed in 1990 to overturn a rejection of a refugee status request by the Immigration and Refugee Board. A more detailed description of these variables is given in the help file for this dataframe, but a preliminary idea of its contents may be obtained with the str function:

str(Greene)

## 'data.frame':    384 obs. of  7 variables:
##  $ judge   : Factor w/ 10 levels "Desjardins","Heald",..: 2 2 2 5 1 9 8 5 5 8 ...
##  $ nation  : Factor w/ 17 levels "Argentina","Bulgaria",..: 11 17 5 4 11 11 7 16 16 3 ...
##  $ rater   : Factor w/ 2 levels "no","yes": 1 1 1 1 2 2 1 1 2 1 ...
##  $ decision: Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 1 1 ...
##  $ language: Factor w/ 2 levels "English","French": 1 1 1 2 2 1 1 1 2 1 ...
##  $ location: Factor w/ 3 levels "Montreal","other",..: 3 3 3 1 1 3 3 3 1 2 ...
##  $ success : num  -1.099 -0.754 -1.046 0.405 -1.099 ...

Applying the GKtauDataframe function to this dataframe yields the association plot shown below, which reveals several interesting details. The most obvious feature of this plot is the fact that the variable success is perfectly predictable from nation (i.e., $\tau(x, y) = 1$ for this association). Similarly, the reverse association, while not perfect, is also quite strong ($\tau(y, x) = 0.85$); taken together, these results suggest a very strong connection between these variables. Referring to the help file for this dataframe, we see that success is defined as the “logit of success rate, for all cases from the applicant’s nation,” which is completely determined by nation, consistent with the results seen here. Conversely, an examination of the numbers reveals that, while most success values are unique to a single nation, a few are duplicates (e.g., Ghana and Nigeria both exhibit the success value $-1.20831$), explaining the strong but not perfect reverse association between these variables. Note, however, that this case of perfect association is not due to the “record index” issue seen in the second Cars93 example and discussed further in Section 4, since the number of levels of nation is only $17$, far fewer than the number of data records ($N = 384$).

GKmatrix3 <- GKtauDataframe(Greene)
plot(GKmatrix3)

The other reasonably strong association seen in this plot is that between location and language, where the forward association is $0.8$ and the reverse association is $0.5$; these numbers suggest that location (which has levels “Montreal”, “Toronto”, and “other”) is highly predictive of language (which has levels “English” and “French”), which seems reasonable given the language landscape of Canada. We can obtain a more complete picture of this relationship by looking at the contingency table for these two variables:

table(Greene$language, Greene$location)

##          
##           Montreal other Toronto
##   English       13    53     187
##   French       125     2       4

This table also suggests the reason for the much weaker reverse association between these variables: while almost all French petitions are heard in Montreal, there is a significant split in the English petitions between Toronto and the “other” locations. The key point here is that, while the contingency table provides a more detailed view of what is happening here than the forward and reverse Goodman and Kruskal’s $\tau$ measures do, the plot of these measures helps us quickly identify which of the $21$ pairs of the seven variables included in this dataframe are worthy of further scrutiny. Finally, it is worth noting that, if we look at the decision variable, none of the other variables in the dataset are strongly associated, in either direction. This suggests that none of the refugee characteristics included in this dataset are strongly predictive of the outcome of their appeal.
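The forward and reverse associations quoted for location and language can be recovered from the contingency table alone. The sketch below re-enters the six counts printed above as a matrix, so the car package is not needed; the helper name tau_from_rows is hypothetical.

```r
# Counts copied from the language-by-location contingency table shown above
tab <- matrix(c( 13, 53, 187,
                125,  2,   4),
              nrow = 2, byrow = TRUE,
              dimnames = list(language = c("English", "French"),
                              location = c("Montreal", "other", "Toronto")))
p <- tab / sum(tab)   # joint estimates pi_ij

# tau_from_rows is a hypothetical helper: association from the row variable
# to the column variable of a table of joint proportions
tau_from_rows <- function(p) {
  v  <- 1 - sum(colSums(p)^2)        # variability of the column variable
  ev <- 1 - sum(p^2 / rowSums(p))    # expected variability given the row variable
  (v - ev) / v
}

tau_language_to_location <- tau_from_rows(p)
tau_location_to_language <- tau_from_rows(t(p))   # transpose swaps the roles

print(round(c(location_to_language = tau_location_to_language,
              language_to_location = tau_language_to_location), 2))
```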

4. An important special case: $K = N$

The special case $K = N$ arises in two distinct but extremely important circumstances. The first is the case of effective record labels like Make in the Cars93 example discussed in Section 3.1, while the second is the case of continuously-distributed numerical variables discussed in Section 5. The point of the following discussion is to show that if $K = N$, then $\tau(x, y) = 1$ for any other variable $y$.

To see this point, proceed as follows. First, note that if $K = N$, there is a one-to-one association between the record index $k$ and the levels of $x$; labelling the levels so that $x_k \rightarrow k$, the contingency table matrix $N_{ij}$ may be re-written as: \[ \begin{equation} N_{ij} = | \{ k \; | \; k = i, \; y_k \rightarrow j \} | = \left\{ \begin{array}{ll} 1 & \mbox{if $y_i \rightarrow j$}, \\ 0 & \mbox{otherwise,} \end{array} \right. \end{equation} \] which implies: \[ \begin{equation} \pi_{ij} = \left\{ \begin{array}{ll} 1/N & \mbox{if $y_i \rightarrow j$}, \\ 0 & \mbox{otherwise.} \end{array} \right. \end{equation} \] From this result, it follows that: \[ \begin{equation} \pi_{i+} = \sum_{j=1}^{L} \; \pi_{ij} = 1/N, \end{equation} \] since only the single nonzero term $1/N$ appears in this sum. Thus, we have: \[ \begin{eqnarray} E [V(y|x)] & = & 1 - \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{\pi_{ij}^2}{\pi_{i+}} \\ & = & 1 - \sum_{i=1}^{N} \sum_{j = 1}^{L} \; N \pi_{ij}^2 \\ & = & 1 - \sum_{i=1}^{N} \; N (1/N)^2 \\ & = & 1 - \sum_{i=1}^{N} \; (1/N) \\ & = & 0. \end{eqnarray} \] Substituting this result into the defining equation for Goodman and Kruskal’s $\tau$, we obtain the final result: \[ \begin{equation} K = N \; \Rightarrow \; \tau(x, y) = 1, \end{equation} \] for any variable $y$. Before leaving this discussion, it is important to emphasize that the condition $K = N$ is sufficient for $\tau(x, y) = 1$, but not necessary. This point was illustrated by the fact that the variable Manufacturer completely explains the variability in Origin in the Cars93 example discussed in Section 3.1, despite the fact that $K = 32$ for the Manufacturer variable but $N = 93$.
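This result is easy to confirm numerically. The sketch below uses a made-up record index $x$ with $K = N = 20$ and an arbitrary three-level variable $y$; the helper name tau_sketch is hypothetical. The forward association equals $1$ up to floating-point rounding, while for this construction the reverse association works out to $(L - 1)/(N - 1)$.

```r
# tau_sketch is a hypothetical helper implementing the definition from Section 1.2
tau_sketch <- function(x, y) {
  p   <- prop.table(table(x, y))
  v_y <- 1 - sum(colSums(p)^2)
  (v_y - (1 - sum(p^2 / rowSums(p)))) / v_y
}

N <- 20
x <- seq_len(N)                            # a record index: K = N distinct values
y <- rep(c("a", "b", "c"), length.out = N) # an arbitrary three-level variable (L = 3)

tau_forward <- tau_sketch(x, y)   # from the record index to y
tau_reverse <- tau_sketch(y, x)   # from y back to the record index

print(c(forward = tau_forward, reverse = tau_reverse))
```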

5. Grouping numeric variables

The basic machinery of Goodman and Kruskal’s $\tau$ can be applied to numerical variables, but the results may or may not be useful, depending strongly on circumstances. Specifically, for continuously distributed numerical variables (e.g., Gaussian data), repeated values or “ties” have zero probability, so if $x$ is continuously distributed it follows from the results presented in Section 4 that $\tau(x, y) = 1$ for any variable $y$, regardless of its degree of association with $x$. Thus, for continuously distributed numerical variables, Goodman and Kruskal’s tau should not be used to measure association; either the standard product-moment correlation coefficient or the other options available from the cor function should be used instead. Also, continuity arguments suggest that numerical variables with “only a few ties,” for which $K$ is strictly less than $N$ but not a lot less than $N$, tend to give inflated association values under the Goodman and Kruskal tau measure. The mtcars dataframe is a case in point: this dataframe has $N = 32$ records and $11$ variables, all numeric, but with between $2$ and $30$ distinct values. Here, the forward associations from the $30$-level variable qsec range from $0.87$ for the binary variable am to $1.00$ for the binary variable vs. Similarly, the forward associations from the $27$-level numerical variable disp range from $0.84$ for qsec to $1.00$ for the few-level variables cyl, vs, am, and gear. The reverse associations are much smaller.
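The qsec values quoted here can be checked directly with base R, treating each numerical column as if it were categorical. The helper name tau_sketch is hypothetical and implements only the definition from Section 1.2.

```r
# tau_sketch is a hypothetical helper implementing the definition from Section 1.2
tau_sketch <- function(x, y) {
  p   <- prop.table(table(x, y))
  v_y <- 1 - sum(colSums(p)^2)
  (v_y - (1 - sum(p^2 / rowSums(p)))) / v_y
}

n_levels_qsec <- length(unique(mtcars$qsec))   # nearly one level per record

# Forward associations from the many-level variable qsec are inflated
tau_qsec_am <- tau_sketch(mtcars$qsec, mtcars$am)
tau_qsec_vs <- tau_sketch(mtcars$qsec, mtcars$vs)

# Reverse associations, from the binary variables back to qsec, are much smaller
tau_am_qsec <- tau_sketch(mtcars$am, mtcars$qsec)
tau_vs_qsec <- tau_sketch(mtcars$vs, mtcars$qsec)

print(n_levels_qsec)
print(round(c(qsec_to_am = tau_qsec_am, qsec_to_vs = tau_qsec_vs,
              am_to_qsec = tau_am_qsec, vs_to_qsec = tau_vs_qsec), 2))
```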

Conversely, integer variables often have many repeated values, implying that the number of levels $K$ is much smaller than the number $N$ of observations, and in these cases Goodman and Kruskal’s tau measure may yield useful association results. Also, it frequently happens that numerical variables are used to encode what are effectively categorical phenomena, either ordered or unordered. As a specific example, the cyl variable in the mtcars dataframe is numeric but has only three distinct levels (corresponding to 4-, 6-, and 8-cylinder cars): the forward associations from this variable to the other 10 in the dataframe vary from $0.06$ for the $30$-level variable qsec to $0.67$ for the $2$-level variable vs. The vs variable encodes the general engine geometry - either a “V-shaped” (when vs = 0) or a “straight” design (when vs = 1) - and looking at the contingency table between these variables reveals that the V-shaped design is strongly associated with engine designs having more cylinders:

table(mtcars$cyl, mtcars$vs)

##    
##      0  1
##   4  1 10
##   6  3  4
##   8 14  0
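
As a check on the value $0.67$ quoted above, the forward association from cyl to vs can be recomputed by hand from these counts. The sketch below assumes the standard definition of Goodman and Kruskal's $\tau$ (the fraction of the variability in $y$ explained by $x$); the object names are illustrative and not part of the package.

```r
# Counts copied from the contingency table above (rows: cyl, columns: vs)
counts <- matrix(c( 1, 10,
                    3,  4,
                   14,  0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(cyl = c("4", "6", "8"), vs = c("0", "1")))

N      <- sum(counts)        # 32 cars in total
rowTot <- rowSums(counts)    # cars per cyl level
colTot <- colSums(counts)    # cars per vs level

Vy   <- 1 - sum((colTot / N)^2)          # variability of vs on its own
EVyx <- 1 - sum(counts^2 / rowTot) / N   # expected variability of vs given cyl
tauCylVs <- (Vy - EVyx) / Vy

print(round(tauCylVs, 2))   # 0.67, matching the value quoted in the text
```

The 8-cylinder row contributes no residual variability at all (every such car has vs = 0), which is a large part of why the forward association is this high.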

To provide more useful association measures between continuous numerical variables with no ties or few ties and categorical variables with few- to moderate-levels, the strategy proposed here is to group these numerical variables, creating a related categorical variable with fewer levels. This grouping strategy does entail a loss of information relative to the original numerical variable, but to the extent that the groups are representative, applying Goodman and Kruskal’s tau measure to this categorical variable should give a more reasonable measure of the degree of association between the approximate value of the numerical variable and the other categorical - or few-level numerical - variables under consideration. It is also worth noting that, despite the loss of information, this grouping strategy is popular in business applications (e.g., demographic analysis is often done by age group instead of by age itself).

In the GoodmanKruskal package, this grouping is accomplished with the function GroupNumeric, which takes the following parameters:

the required parameter x, specifying the numerical vector to be grouped;
the optional parameter n, an integer specifying the number of groups for the resulting categorical variable; if this value is not specified, it is inferred from the groupNames parameter;
the optional parameter groupNames, a character vector giving the names of the n groups formed in creating the categorical variable returned by the function; if this value is not specified, the value of n must be specified and the default names from the cut function in base R will be used;
the optional parameter orderedFactor, a logical variable with default value FALSE specifying whether the categorical variable returned should be ordered or not (note that this option has no influence on the computed value of Goodman and Kruskal’s tau measure, but it is included as a convenience for those who may wish to use this function for other purposes);
the optional parameter style, passed to the classIntervals function from the classInt package, on which GroupNumeric is based;
unnamed optional parameters (…) to be passed to the classIntervals function for certain grouping methods.
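
To make the default behavior concrete, the following base-R sketch mimics quantile-based grouping with the cut and quantile functions. The helper group_quantile is hypothetical: it is only assumed to approximate what GroupNumeric does with style = "quantile", it is not the package's implementation, and the other grouping styles are not covered.

```r
# Hypothetical stand-in for quantile-style grouping (not the package code):
# split x at its empirical quantiles and return a factor with n levels.
group_quantile <- function(x, n, groupNames = NULL, orderedFactor = FALSE) {
  breaks <- quantile(x, probs = seq(0, 1, length.out = n + 1), names = FALSE)
  # cut() stops with an error if heavy ties make the breaks non-unique
  cut(x, breaks = breaks, labels = groupNames,
      include.lowest = TRUE, ordered_result = orderedFactor)
}

g <- group_quantile(mtcars$qsec, n = 5)
print(table(g))   # five groups of roughly equal size
```

When groupNames is left as NULL, cut supplies its default interval labels, which mirrors the behavior described for the groupNames parameter above.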

To illustrate the results obtained with this function and its potential utility, consider the application of Goodman and Kruskal’s tau measure to the mtcars dataframe. Applying the GKtauDataframe function to the unmodified mtcars dataframe gives the results summarized in the plot below. Note that - consistent with the discussion above - the forward associations between any variable with more than $20$ distinct levels and any other variable tend to be quite large, while the reverse associations with variables having few levels are typically quite small. The $27$-level variable disp provides a good illustration of this point: the forward associations range from $0.84$ to $1.00$, with the four variables having two or three unique levels (cyl, vs, am, and gear) all perfectly predictable. In contrast, the reverse associations for all of these variables are less than $0.10$.

GKmat <- GKtauDataframe(mtcars)
plot(GKmat, diagSize = 0.8)

Goodman-Kruskal tau matrix for the mtcars dataframe.

To see the effect of grouping on these numerical variables with few ties, the following R code uses the GroupNumeric function to construct grouped versions of the six variables with $K > 20$. These grouped variables are then used to build the modified dataframe groupedMtcars, replacing each variable with a factor with $n = 5$ groups, constructed with the default option style = ‘quantile’:

groupedMpg <- GroupNumeric(mtcars$mpg, n = 5)
groupedDisp <- GroupNumeric(mtcars$disp, n = 5)
groupedHp <- GroupNumeric(mtcars$hp, n = 5)
groupedDrat <- GroupNumeric(mtcars$drat, n = 5)
groupedWt <- GroupNumeric(mtcars$wt, n = 5)
groupedQsec <- GroupNumeric(mtcars$qsec, n = 5)
groupedMtcars <- mtcars
groupedMtcars$mpg <- NULL
groupedMtcars$groupedMpg <- groupedMpg
groupedMtcars$disp <- NULL
groupedMtcars$groupedDisp <- groupedDisp
groupedMtcars$hp <- NULL
groupedMtcars$groupedHp <- groupedHp
groupedMtcars$drat <- NULL
groupedMtcars$groupedDrat <- groupedDrat
groupedMtcars$wt <- NULL
groupedMtcars$groupedWt <- groupedWt
groupedMtcars$qsec <- NULL
groupedMtcars$groupedQsec <- groupedQsec
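
The same construction can be written more compactly by selecting the many-level columns programmatically. The sketch below is an alternative, not the code used above: the names nDistinct, manyLevels, and groupedAlt are illustrative, and, unlike the code above, it overwrites each column in place (keeping the original names and column order) rather than appending renamed columns. The GroupNumeric call is guarded so that the snippet still runs where the package is not installed.

```r
# Count distinct values per column and pick the variables with K > 20
nDistinct  <- sapply(mtcars, function(v) length(unique(v)))
manyLevels <- names(nDistinct)[nDistinct > 20]
print(manyLevels)

# Replace each selected column by a 5-group factor (default style = "quantile")
groupedAlt <- mtcars
if (requireNamespace("GoodmanKruskal", quietly = TRUE)) {
  for (v in manyLevels) {
    groupedAlt[[v]] <- GoodmanKruskal::GroupNumeric(mtcars[[v]], n = 5)
  }
}
```

Selecting columns by their number of distinct values avoids hard-coding the six variable names, which may be convenient when the same screening is applied to other dataframes.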

Applying the GKtauDataframe function to this modified mtcars dataframe yields the plot shown below:

GKmat2 <- GKtauDataframe(groupedMtcars)
plot(GKmat2, diagSize = 0.8)

Goodman-Kruskal tau matrix for the grouped mtcars dataframe.

Comparing this plot with the previous one, we see that grouping the variables with more than $20$ distinct levels greatly reduces most of their associations. For example, the original qsec variable has $K = 30$ distinct values in $N = 32$ records and it exhibits very large forward Goodman-Kruskal tau values, ranging from $0.87$ for the two-level am variable to $1.00$ for the two-level vs variable. Replacing qsec with the 5-level variable groupedQsec dramatically reduces these values, which now range from a minimum of $0.19$ for the 5-level groupedWt variable to $0.89$ for the two-level vs variable. To better understand why this last value is so large, we can look at the underlying contingency table:

table(groupedQsec, mtcars$vs)

##              
## groupedQsec   0 1
##   [14.5,16.7] 7 0
##   (16.7,17.3] 5 1
##   (17.3,18.2] 6 0
##   (18.2,19.3] 0 6
##   (19.3,22.9] 0 7

The qsec variable is described in the help file for the mtcars dataframe as the “quarter mile time,” so smaller values correspond to faster cars. It is clear from this contingency table that all of the “V-shaped” engine designs correspond to faster cars, with quarter-mile times between $14.5$ and $18.2$ seconds, while all but one of the “straight” engine designs have quarter mile times greater than $18.2$ seconds. The point of this example is to show that grouping numerical covariates and applying Goodman and Kruskal’s tau measure can provide useful insights into the relationships between variables of mixed types, a setting where working directly with the ungrouped numerical variables gives spuriously high association measures.

An important practical question is how many levels to select when grouping a numerical variable, a question to which there appears to be no obvious answer. One possibility is to take the number of groups $n$ as the square root of the number of data observations, $N$, rounded to the nearest integer. This strategy was adopted in the previous example: the numbers of levels in the original numerical variables ranged from $22$ to $30$, and this rule of thumb led to the choice $n = 5$ used here. Also, the default grouping method - style = “quantile” - was used in this example because it probably represents the most familiar grouping strategy for numerical variables. As noted, the GroupNumeric function is based on the classIntervals function in the classInt package, which supports 10 different grouping methods, but takes “quantile” as the default option. The obvious questions - how many groups, and what method do we use in constructing them - appear to be fruitful areas for future research, and a key reason for including the GroupNumeric function in the GoodmanKruskal package is to facilitate work in this area.

6. Summary

This note has described Goodman and Kruskal’s tau measure of association between categorical variables and its implementation in the GoodmanKruskal R package. In contrast to the more popular chi-square and Cramer’s V measures, Goodman and Kruskal’s tau is asymmetric, an unusual characteristic that can be exploited in exploratory data analysis. Specifically, the tau measure belongs to a family of association measures that attempt to quantify the variability in a target variable $y$ that can be explained by variations in a source variable $x$. Since this relationship is not symmetric, Goodman and Kruskal’s $\tau$ can be used to identify cases where one variable is highly predictable from another, but the reverse implication is not true. As a specific and extreme example, applying this measure to the variables Manufacturer and Origin in the Cars93 dataframe from the MASS package shows that Origin - with values “USA” and “non-USA” - is completely predictable from Manufacturer, but knowledge of Origin has essentially no power to predict Manufacturer since each of the two origin classes is represented by many (approximately $45$) different manufacturers.

A limitation of Goodman and Kruskal’s tau measure is that it is not applicable to numerical variables with few ties (e.g., continuously distributed random variables where the probability of duplicated values is zero), as demonstrated in Section 4. This is not a problem by itself since much better-known correlation measures are available for this case: the Pearson product-moment correlation coefficient, Spearman’s rank correlation, and Kendall’s tau, all computable as options of the cor function in base R. Where this failure does become an issue is in the assessment of associations between numerical variables with few ties or no ties and categorical variables with a moderate number of levels. For example, applying Goodman and Kruskal’s tau measure between mileage (mpg in the mtcars dataframe, with $25$ distinct values in $32$ records) and the number of cylinders (cyl, a numerical variable with only three levels) suggests near-perfect predictability of cylinders from gas mileage ($\tau(x, y) = 0.90$) but essentially no predictability in the other direction ($\tau(y, x) = 0.08$). This limitation prompted the numerical variable grouping strategy described in Section 5 and embodied in the GroupNumeric function included in the GoodmanKruskal package. Replacing mpg with the 5-level categorical variable groupedMpg created using this function gives association measures that appear more reasonable for these two variables: the forward association remains quite large ($\tau(x, y) = 0.70$), but the reverse association is no longer negligible ($\tau(y, x) = 0.36$). As noted in Section 5, the questions of “how many groups?” and “which of many grouping methods should be used?” appear to be open research questions at present, and one purpose for including the function GroupNumeric in the GoodmanKruskal package is to encourage research in this area.

More immediately, the function GKtauDataframe and its associated plot method can be an extremely useful screening tool for exploratory data analysis. In particular, in cases where we have many categorical variables, plots like those shown in Section 3 can be useful in identifying variables that appear to be related. More complete understanding of any relationship seen in these plots can be obtained by using the table function to construct and carefully examine the contingency table on which Goodman and Kruskal’s tau measure is based, but the advantage of plots like those presented in Section 3 is that they allow us to focus our attention on interesting variable pairs. Given that the number of pairs grows quadratically with the number of variables in a dataset, this ability to identify interesting pairs for further analysis can be extremely useful in the increasingly common situation where we have a dataset with many variables.

Ron Pearson

2020-03-18