The goal of airt is to evaluate the performance of a portfolio of algorithms using Item Response Theory (IRT). The IRT models are fitted using the R packages EstCRM and mirt. The function in EstCRM is slightly modified to account for a broader set of parameters.
This example is on classification algorithms. The dataset classification_cts contains performance data from 10 classification algorithms on 235 datasets. This data is discussed in Muñoz et al. (2018) and can be found at the test instance library MATILDA (Smith-Miles 2019). Let’s have a look at this dataset.
```r
data("classification_cts")
df <- classification_cts
head(df)
#>          NB       LDA       QDA      CART       J48       KNN     L_SVM
#> 1 0.7199042 0.7602850 0.7459878 0.7546605 0.7435086 0.7430308 0.7694158
#> 2 0.8358182 0.8404234 0.1045281 0.8270254 0.8347175 0.8328870 0.8284820
#> 3 0.8581818 0.8763636 0.8300000 0.8372727 0.8672727 0.8218182 0.8763636
#> 4 0.7467141 0.7356707 0.4451558 0.7356707 0.7356707 0.7530488 0.7356707
#> 5 0.9650329 0.1408706 0.1408706 0.9397915 0.9619915 0.9277021 0.9647557
#> 6 0.7739130 0.8536232 0.5060660 0.8536232 0.8507246 0.8391304 0.8521739
#>    poly_SVM   RBF_SVM     RandF
#> 1 0.7225890 0.7788021 0.7655677
#> 2 0.8212273 0.8300809 0.1045281
#> 3 0.8763636 0.8672727 0.8672727
#> 4 0.7704268 0.7685976 0.7557249
#> 5 0.7341198 0.8483237 0.1408706
#> 6 0.8579710 0.8521739 0.8637681
```
In this dataset the columns represent algorithms and the rows represent datasets/instances. The values are performance values. That is, the performance of algorithm Naive Bayes (NB) on dataset 1 is 0.7199042. This dataframe is the input to our AIRT model. We fit it by calling cirtmodel.
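The fitting call can be sketched as follows. This is a minimal sketch, assuming the airt package is installed and that cirtmodel accepts the performance dataframe with its default arguments. The object name modout is chosen to match the name used in the parameter listing further down. The guard simply skips the example when airt is not available.

```r
# Sketch only: the block is skipped when the airt package is not installed.
if (requireNamespace("airt", quietly = TRUE)) {
  library(airt)

  # Load the performance data: rows are datasets, columns are algorithms.
  data("classification_cts")
  df <- classification_cts

  # Fit the continuous IRT model (assumption: default arguments are suitable).
  modout <- cirtmodel(df)

  # Show the top-level components of the fitted object.
  str(modout, max.level = 1)
}
```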
Now the model is fitted. Let’s have a look at traditional IRT parameters.
```r
paras <- modout$model$param
paras
#>                    a           b      alpha
#> NB       1.027368059  -1.1185870 1.05927334
#> LDA      0.697568059  -1.9584262 0.94906831
#> QDA      0.008604556 -37.6665493 0.01731587
#> CART     1.598441089  -1.0209521 1.41547455
#> J48      1.558295477  -1.1640940 1.52036690
#> KNN      1.796892905  -0.8412235 1.64579669
#> L_SVM    2.846510834  -1.4371875 1.50572702
#> poly_SVM 1.743909296  -1.1499008 1.31318614
#> RBF_SVM  3.766472502  -1.4019959 1.53615811
#> RandF    0.999442464  -1.7509568 1.43550771
```
The parameter a denotes discrimination, b denotes difficulty and alpha is a scaling parameter. These are traditional IRT parameters. Using these parameters we will find AIRT algorithm attributes. These are algorithm anomalousness, consistency and the difficulty limit.
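To make the link between the IRT parameters and the three attributes concrete, here is a rough base R sketch. The definitions used (anomalous when a is negative, consistency as 1/|a|, difficulty limit as -b) are assumptions: they are not stated in this section, although they agree with the rankings described in the text. The parameter values are copied from the table above and rounded.

```r
# IRT parameters copied from the table above (rounded to four decimals).
paras <- data.frame(
  a = c(1.0274, 0.6976, 0.0086, 1.5984, 1.5583,
        1.7969, 2.8465, 1.7439, 3.7665, 0.9994),
  b = c(-1.1186, -1.9584, -37.6665, -1.0210, -1.1641,
        -0.8412, -1.4372, -1.1499, -1.4020, -1.7510),
  row.names = c("NB", "LDA", "QDA", "CART", "J48",
                "KNN", "L_SVM", "poly_SVM", "RBF_SVM", "RandF")
)

# Assumed definitions (hypothetical, not taken from this section):
#   anomalous        = 1 if discrimination a is negative, else 0
#   consistency      = 1 / |a|  (low discrimination means high consistency)
#   difficulty_limit = -b
attrs <- data.frame(
  anomalous        = as.integer(paras$a < 0),
  consistency      = 1 / abs(paras$a),
  difficulty_limit = -paras$b,
  row.names        = rownames(paras)
)

print(round(attrs, 4))
```

Under these assumed definitions, a near-zero discrimination parameter produces both a very large consistency value and, through a large negative b, a very large difficulty limit.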
If an algorithm is anomalous then the anomalous indicator is 1. In this algorithm portfolio, none of the algorithms are anomalous, because all anomalous indicators are 0. Anomalous algorithms give good performances for difficult problems and poor performances for easy problems.
The difficulty limit gives the highest difficulty level that an algorithm can handle. In this scenario, QDA has the highest difficulty limit, so according to the fitted model QDA can handle the hardest problems. KNN has the lowest difficulty limit, so it can only handle very easy problems.
The consistency attribute measures how consistent an algorithm is. An algorithm can be consistently good for most problems or consistently poor for most problems. The performance of many algorithms, on the other hand, varies with the problem/dataset. In this portfolio, QDA is the most consistent algorithm.
Let’s look at these algorithms visually. The heatmaps_crm function plots the heatmaps. The part crm stands for continuous response model.
Let’s discuss these heatmaps. Theta (x axis) represents the dataset easiness and z (y axis) represents the normalized performance values. The heatmaps show the probability density of the fitted IRT model over Theta and z values for each algorithm.
Apart from QDA, every heatmap has a line (a bit like a lightsaber) going through it. If the lightsaber has a positive slope, then the algorithm is not anomalous. We see that some lightsabers are sharper than others. Algorithms with sharper lightsabers are more discriminating. Algorithms with no lightsaber (QDA) or a blurry lightsaber are more consistent. In this portfolio, QDA is the most consistent as it doesn’t have a lightsaber at all. LDA and NB are also somewhat consistent. RBF_SVM is the least consistent (most discriminating) as it has a very sharp line.
We can also look at the algorithm performance with respect to the dataset difficulty. This is called the latent trait analysis. The function latent_trait_analysis does this for you. We need to pass the IRT parameters to do this analysis.
When you use plottype = 1, it plots all algorithms in a single plot. To have a separate plot for each algorithm we use plottype = 2.
From these plots we see that certain algorithms give better performances for different problem difficulty values. To get a better sense of which algorithms are better for which difficulty values we fit smoothing splines to the above data. By using plottype = 3 in autoplot we can see these smoothing splines.
From this plot, we can get the best algorithm for a given problem difficulty. We can use these splines to compute the proportion of the latent trait spectrum occupied by each algorithm. We call this the latent trait occupancy (LTO). These are strengths of algorithms.
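The idea behind latent trait occupancy can be illustrated with base R alone. The sketch below uses synthetic data and three made-up algorithms (alg_A, alg_B, alg_C are hypothetical names), fits a smoothing spline per algorithm with smooth.spline, and reports the share of the difficulty range on which each spline is highest. It is an illustration of the concept, not the computation that airt performs internally.

```r
set.seed(1)

# Hypothetical difficulty scores for 200 datasets.
difficulty <- sort(runif(200, -2, 2))

# Synthetic performance of three made-up algorithms.
perf <- data.frame(
  alg_A = 0.80 - 0.10 * difficulty + rnorm(200, sd = 0.02),  # best on easy problems
  alg_B = 0.72 + rnorm(200, sd = 0.02),                      # flat performance
  alg_C = 0.55 - 0.02 * difficulty + rnorm(200, sd = 0.02)   # never the best
)

# Fit one smoothing spline per algorithm and evaluate it on a common grid.
grid <- seq(min(difficulty), max(difficulty), length.out = 500)
curves <- sapply(perf, function(y) predict(smooth.spline(difficulty, y), grid)$y)

# For each grid point, find the algorithm whose spline is highest.
best <- colnames(curves)[max.col(curves, ties.method = "first")]

# Occupancy: proportion of the grid on which each algorithm is best.
lto <- table(factor(best, levels = colnames(curves))) / length(grid)
print(round(lto, 2))
```

Replacing the maximum with the minimum in the same calculation gives the corresponding weakness proportions.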
The column Proportion gives the latent trait occupancy of the algorithm. In this scenario, J48 has the highest latent trait occupancy.
Similar to strengths, we can say an algorithm is weak if it has the lowest performance for a given difficulty.
In this example QDA is the weakest algorithm. QDA is weak for a proportion of 0.99 of the latent trait spectrum. But now there is a big question. If QDA is the weakest algorithm, why did it have such a high difficulty limit? It had the highest difficulty limit of all the algorithms. What happened here?
We see latent trait occupancy in the graph above. The 5 algorithms J48, KNN, L_SVM, poly_SVM and RBF_SVM occupy parts of the latent trait spectrum. That is, for some dataset easiness values, these algorithms display superiority.
In this example we have used epsilon = 0. That would give a unique strength/weakness to each point in the problem space. If we make epsilon > 0, then we can get overlapping strengths and weaknesses. That is, we will get algorithms that are epsilon-away from the best algorithm in our strengths/weakness diagram. Let’s do that.
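The effect of epsilon can be sketched in base R. In the example below, occupancy is a hypothetical helper (it is not an airt function) that marks an algorithm as strong at a point when its curve is within epsilon of the best curve, and as weak when it is within epsilon of the worst. The three curves are made up for illustration.

```r
# Made-up performance curves for three hypothetical algorithms on a difficulty grid.
grid <- seq(-2, 2, length.out = 401)
curves <- cbind(
  alg_A = 0.80 - 0.10 * grid,
  alg_B = rep(0.72, length(grid)),
  alg_C = 0.55 - 0.02 * grid
)

# Hypothetical helper: proportion of the grid on which each algorithm is
# within epsilon of the best (strength) or of the worst (weakness) curve.
occupancy <- function(curves, epsilon = 0, type = c("strength", "weakness")) {
  type <- match.arg(type)
  if (type == "strength") {
    ref  <- apply(curves, 1, max)
    near <- curves >= ref - epsilon
  } else {
    ref  <- apply(curves, 1, min)
    near <- curves <= ref + epsilon
  }
  colMeans(near)
}

s0 <- occupancy(curves, epsilon = 0)      # one winner per point, apart from ties
s2 <- occupancy(curves, epsilon = 0.02)   # regions may overlap
w0 <- occupancy(curves, epsilon = 0, type = "weakness")

print(round(rbind(eps_0 = s0, eps_0.02 = s2), 3))
print(round(w0, 3))
```

With epsilon = 0 the proportions add up to roughly 1, while with epsilon > 0 they can add up to more than 1 because several algorithms may count as strong at the same point.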
Now we see some overlapping strengths and weaknesses. For very easy problems, many algorithms have strengths, and for more difficult problems, we see that KNN and J48 are strong. QDA is weak for most of the problem space.
All this is good, but is the fitted IRT model good? To check this, we have a couple of measures. One is the Model Goodness Curve. We first call the model_goodness_crm function to compute the model goodness metrics. Then by calling autoplot we can plot the curves. The letters crm stand for continuous response model.
In the above graph, we’re looking at the distribution of errors – that is, the difference between the predicted and the actual values for different algorithms. The x-axis has the absolute error scaled to [0,1] and the y-axis shows the empirical cumulative distribution of errors for each algorithm. For a given algorithm a model is well fitted if the curve goes up to 1 on the y-axis quickly. That is, if the Area Under the Curve (AUC) is closer to 1. We can check the AUC and the Mean Square Error (MSE) for these algorithms.
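The relationship between the error distribution and the AUC can be illustrated in base R. The sketch below uses synthetic actual and predicted values in [0,1], and goodness_auc is a hypothetical helper, not the airt implementation. It builds the empirical CDF of the absolute errors with ecdf and integrates it over [0,1] with the trapezoidal rule.

```r
set.seed(42)

# Synthetic values in [0, 1] for 235 datasets.
actual         <- runif(235)
predicted_good <- pmin(pmax(actual + rnorm(235, sd = 0.05), 0), 1)  # close fit
predicted_poor <- runif(235)                                        # unrelated to actual

# Hypothetical helper: area under the empirical CDF of the absolute errors.
goodness_auc <- function(actual, predicted, n_grid = 101) {
  err  <- abs(actual - predicted)          # already in [0, 1] for inputs in [0, 1]
  grid <- seq(0, 1, length.out = n_grid)
  cdf  <- ecdf(err)(grid)                  # proportion of errors <= each grid value
  # Trapezoidal rule over the grid.
  sum(diff(grid) * (head(cdf, -1) + tail(cdf, -1)) / 2)
}

auc_good <- goodness_auc(actual, predicted_good)
auc_poor <- goodness_auc(actual, predicted_poor)

print(c(good = auc_good, poor = auc_poor))
```

Because the area under the CDF of an error confined to [0,1] equals one minus the mean error, a curve that rises to 1 quickly corresponds to a small average absolute error.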
```r
cbind.data.frame(AUC = modelgood$goodnessAUC, MSE = modelgood$mse)
#>                AUC         MSE
#> NB       0.9409628 0.006942944
#> LDA      0.9164625 0.036060743
#> QDA      0.6239630 0.216081387
#> CART     0.9377391 0.005200573
#> J48      0.9274232 0.006983641
#> KNN      0.9235547 0.007294912
#> L_SVM    0.8999140 0.011336967
#> poly_SVM 0.9189985 0.008500814
#> RBF_SVM  0.8944122 0.012270339
#> RandF    0.8545669 0.032752149
```
From the graph and the table we see that the IRT model fits all algorithms well apart from QDA. We have another goodness metric called effectiveness. Effectiveness generally tells us how good the algorithms are.
This first plot tells us how well the algorithms actually perform, without fitting an IRT model.