## Abstract

Classification systems are evaluated in countless papers. However, we find that evaluation practice is often nebulous. Frequently, metrics are selected without justification, and blurry terminology invites misconceptions. For instance, many works use so-called ‘macro’ metrics to rank systems (e.g., ‘macro F1’) but do not clearly specify what they would expect from such a ‘macro’ metric. This is problematic, since the choice of metric can affect research findings, and thus any clarity in the process should be maximized. Starting from the intuitive concepts of *bias* and *prevalence*, we perform an analysis of common evaluation metrics. The analysis helps us understand the metrics’ underlying properties and how they align with expectations as expressed in papers. We then reflect on the practical situation in the field and survey evaluation practice in recent shared tasks. We find that metric selection is often not supported by convincing arguments, an issue that can make a system ranking seem arbitrary. Our work aims to provide overview and guidance for more informed and transparent metric selection, fostering meaningful evaluation.

## 1 Introduction

Classification evaluation is ubiquitous: We have a system that predicts some classes of interest (aka classifier) and want to assess its prediction skill. We study a widespread and seemingly clear-cut setup of multi-class evaluation, where we compare a classifier’s predictions against reference labels in two steps. First, we construct a *confusion matrix* that has a designated dimension for each possible prediction/label combination. Second, an aggregate statistic, which we denote as *metric*, summarizes the confusion matrix as a single number.

Already from this it follows that ‘the perfect’ metric cannot exist, since important information is bound to get lost when reducing the confusion matrix to a single dimension. Still, we require a metric to rank and select classifiers, and thus it should characterize a classifier’s ‘skill’ or ‘performance’ as well as possible. But what exactly is ‘performance’, and how should we measure it? Such questions do not seem to arise (as much) in other kinds of ‘performance’ measurement known to humans. For example, a marathon’s result derives from a clear and broadly accepted criterion (time over distance) that can be measured with validated instruments (a clock). In machine learning, however, criterion and instrument are often less clear and lie entangled in the term ‘metric’.

Since metric selection can influence which system we consider better or worse in a task, one would think that metrics are selected with great care. But when searching through papers for reasons that would support a particular metric choice, we mostly find (at best) only weak or indirect arguments. For example, it is observed (Table 1) that ‘labels are imbalanced’, or it is wished that ‘all class labels have equal weight’.^{1} These perceived problems or needs are then often supposedly addressed with a ‘macro’ metric (Table 1).

Table 1: Examples of arguments cited in support of metric choices.

| metric | cited motivation/argument |
|---|---|
| macro F1 | “macro-averaging (...) implies that all class labels have equal weight in the final score” |
| macro P/R/F1 | “because (...) skewed distribution of the label set” |
| macro F1 | “Given the strong imbalance between the number of instances in the different classes” |
| Accuracy, macro F1 | “the labels are imbalanced” |
| MCC | “balanced measurement when the classes are of very different sizes” |
| MCC, F1 | “(...) imbalanced data (...)” |
| macro F1 | “(...) imbalanced classes (...) introduce biases on accuracy” |
| macro F1 | “due to the imbalanced dataset” |


However, what is meant by phrases like ‘imbalanced data’ or ‘macro’ is rarely made explicit, and how the metrics then selected in this context actually address a perceived ‘imbalance’ remains unclear. Judging from etymology, evaluation with a ‘macro’ metric may involve the expectation that we are told a *bigger picture* of classifier capability (Greek: makrós, ‘long/large’), whereas a smaller picture (Greek: mikrós, ‘small’) would perhaps bind the assessment to a more local context. Regardless of such musings, it is clear that blurry terminology in the context of classifier evaluation can lead to results that may be misconceived by readers and authors alike.

This paper aims to serve as a handy reference for anyone who wishes to better understand classification evaluation, how evaluation metrics align with expectations expressed in papers, and how we might construct an individual case for (or against) selecting a particular evaluation metric.

### Paper Outline.

After introducing *Preliminaries* (§2) and five *Metric Properties* (§3), we conduct a thorough *Metric Analysis* (§4) of common classification measures: ‘Accuracy’ (§4.1), ‘Macro Recall and Precision’ (§4.2, §4.3), two different ‘Macro F1’ (§4.4, §4.5), ‘Weighted F1’ (§4.6), as well as ‘Kappa’ and ‘Matthews Correlation Coefficient (MCC)’ in §4.7. We then show how to create simple but meaningful *Metric Variants* (§5). We wrap up the theoretical part with a *Discussion* (§6) that includes a short *Summary* (§6.1) of our main analysis results in Table 5. Next we study *Metric Selection in Shared Tasks* (§7) and give *Recommendations* (§8). Finally, we contextualize our work against some *Background and Related Work* (§9), and finish with *Conclusions* (§10).

## 2 Preliminaries

We introduce a set of intuitive concepts as a basis.

*Classifier*, *Confusion Matrix*, and *Metric*.

For any classifier $f: D \to C = \{1, \dots, n\}$ and finite set *S* ⊆ *D* × *C*, let $m^{f,S} \in \mathbb{R}_{\geq 0}^{n \times n}$ be a confusion matrix where $m_{i,j}^{f,S} = |\{s \in S \mid f(s_1) = i \wedge s_2 = j\}|$.^{2} We omit superscripts whenever possible. So generally, $m_{i,j}$ contains the mass of events where the classifier predicts *i*, and the true label is *j*. That is, on the diagonal of the matrix lies the mass of correct predictions, and all other fields indicate the mass of specific errors. A $metric: \mathbb{R}_{\geq 0}^{n \times n} \to (-\infty, 1]$ then allows us to order confusion matrices and, thereby, rank classifiers (the bounds are chosen for convenience). We say that (for a data set *S*) a classifier *f* is better than (or preferable to) a classifier *g* iff $metric(m^{f,S}) > metric(m^{g,S})$, i.e., a higher score indicates a better classifier.
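The two-step setup can be sketched in a few lines; the function name and the toy prediction/label pairs below are our own illustration, with classes indexed 0..n−1:

```python
def confusion_matrix(pairs, n):
    """m[i][j] = mass of events where the classifier predicts i
    and the true label is j (classes indexed 0..n-1)."""
    m = [[0] * n for _ in range(n)]
    for pred, label in pairs:
        m[pred][label] += 1
    return m

# Toy (prediction, reference label) pairs for a 3-class problem.
pairs = [(0, 0), (0, 1), (1, 1), (2, 2), (2, 1), (0, 0)]
m = confusion_matrix(pairs, n=3)
# Diagonal cells hold correct mass; off-diagonal cells hold specific errors.
```

A metric is then any function that reduces `m` to a single number.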

Let us now define five basic quantities:

### Class *Bias*, *Prevalence* and *Correct*

For a class *i*, its bias is the mass predicted as *i*, its prevalence is the mass truly labeled *i*, and its correct mass lies on the diagonal: $bias(i) = \sum_j m_{i,j}$, $prevalence(i) = \sum_j m_{j,i}$, and $correct(i) = m_{i,i}$.

### Class Precision.

$P_i$ is the precision for class *i*: $P_i = \frac{correct(i)}{bias(i)}$.

### Class Recall.

$R_i$ denotes the recall for class *i*: $R_i = \frac{correct(i)}{prevalence(i)}$.
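These class-wise quantities can be read directly off the confusion matrix; a minimal sketch (function names are ours), with rows indexing predictions and columns indexing true labels as in §2:

```python
def bias(m, i):        # mass predicted as class i (row sum)
    return sum(m[i])

def prevalence(m, j):  # mass truly labeled as class j (column sum)
    return sum(row[j] for row in m)

def precision(m, i):   # correct mass among predictions of class i
    return m[i][i] / bias(m, i) if bias(m, i) else 0.0

def recall(m, i):      # correct mass among true members of class i
    return m[i][i] / prevalence(m, i) if prevalence(m, i) else 0.0

# Toy confusion matrix for three classes.
m = [[4, 1, 0],
     [1, 2, 1],
     [0, 1, 5]]
# e.g. precision(m, 0) = 4/5 and recall(m, 0) = 4/5
```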

## 3 Defining Metric Properties

To understand and distinguish metrics in more precise ways, we define five metric properties: *Monotonicity*, *class sensitivity*, *class decomposability*, *prevalence invariance,* and *chance correction*.

### 3.1 Monotonicity (PI)

Take a classifier that receives an input. If the prediction is correct, we would naturally expect that the evaluation score does not decrease, and if it is wrong, the evaluation score should not increase. We cast this clear expectation into

*metric* has PI iff:

$$\frac{\partial\, metric(m)}{\partial m_{i,i}} \geq 0 \;\; \forall i \in C \qquad \text{and} \qquad \frac{\partial\, metric(m)}{\partial m_{i,j}} \leq 0 \;\; \forall (i,j) \in C^2,\, i \neq j,$$

i.e., diagonal fields of the confusion matrix (correct mass) should yield a non-negative ‘gradient’ in the metric, while for all other fields (containing error mass) it should be non-positive. PI assumes differentiability of a metric, but it can simply be extended to the discrete case.^{3}
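In the discrete case, PI can be probed numerically: adding one unit of correct mass must not lower the score, and adding one unit of error mass must not raise it. A sketch with Accuracy as the example metric (`is_monotonic_at` is our own helper name):

```python
import copy

def accuracy(m):
    """Ratio of correct mass (trace) to total mass."""
    total = sum(sum(row) for row in m)
    return sum(m[i][i] for i in range(len(m))) / total

def is_monotonic_at(metric, m):
    """Discrete PI check at matrix m: +1 on a diagonal cell must not
    lower the score; +1 on an off-diagonal cell must not raise it."""
    base = metric(m)
    n = len(m)
    for i in range(n):
        for j in range(n):
            m2 = copy.deepcopy(m)
            m2[i][j] += 1
            if i == j and metric(m2) < base:
                return False
            if i != j and metric(m2) > base:
                return False
    return True

m = [[8, 2], [1, 9]]
# Accuracy passes the discrete PI check at this matrix.
```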

### 3.2 Macro Metrics Are Class-sensitive (PII)

A ‘macro’ metric needs to be sensitive to classes, or else it could not yield a ‘balanced measurement’ for ‘classes having different sizes’ (c.f. Table 1). By contrast, a ‘micro’ metric should care only about whether predictions are wrong or right, which would bind its score more to a local context of a specific data set and its class distribution. This means for macro metrics that they should possess

PII is given iff $\exists m \in \mathbb{R}_{\geq 0}^{|C| \times |C|}$ with $\frac{\partial\, metric(m)}{\partial m_{i,i}} \neq \frac{\partial\, metric(m)}{\partial m_{j,j}}$ for some $(i, j) \in C^2$, or $\frac{\partial\, metric(m)}{\partial m_{i,j}} \neq \frac{\partial\, metric(m)}{\partial m_{k,l}}$ for some $(i, j, k, l) \in C^4$, $i \neq j$, $k \neq l$.^{4}

A metric without PII is not a macro metric.

### 3.3 Macro Average: Mean Over Classes (PIII)

‘Macro’ metrics are sometimes named ‘macro-average’ metrics. This suggests that they may be perceived as an average over classes. We introduce

$$metric(m) = \Big(\frac{1}{n}\sum_{i=1}^{n} score_i(m)^{p}\Big)^{1/p}, \quad (4)$$

i.e., a (generalized, by default *p* = 1) unweighted mean over class-specific scores, each computed from the input examples related to a specific class ($c = i \lor f \to i$). For example, later we will see that ‘macro F1’ is a specific parameterization of Eq. 4.

### 3.4 Strictly “Treat All Classes Equally” (PIV)

A common argument for using metrics other than the ratio of correct predictions is that we want to ‘not neglect *rare* classes’ and ‘show classifier performance *equally* [w.r.t.] all classes’.^{5}

Nicely, with the assumption of class prevalence being the only important difference across data sets, we could then even say that classifier *f* is *generally* better than (or preferable to) *g* iff *metric*(*m*^{f}) > *metric*(*m*^{g}), further disentangling a classifier comparison from a specific data set. At first glance, PIII seems to capture this wish already, by virtue of an *unweighted mean* over classes. However, the score w.r.t. one class is still influenced by the prevalence of other classes, and thus the result of the mean can change in non-transparent ways if class frequency is varied.

Therefore it makes sense to define such an expectation (‘treat all classes equally’) more strictly. We simulate different class prevalences with a

#### Prevalence Scaling.

We scale class prevalences by multiplying each column *j* of the confusion matrix with a factor $\lambda_{j,j} > 0$, i.e., we use a diagonal matrix $\lambda$ and set

$$m^{\lambda} = m\lambda. \quad (5)$$

An example with $\lambda = diag(1, 2)$:

| | | c = x | c = y |
|---|---|---|---|
| $f \to$ | x | 15 · 1 | 5 · 2 |
| | y | 10 · 1 | 10 · 2 |

A *metric* has PIV iff for every pair of diagonal matrices $(\lambda, \lambda') \in \mathbb{R}_{>0}^{n \times n} \times \mathbb{R}_{>0}^{n \times n}$: $metric(m\lambda) = metric(m\lambda')$.

#### Prevalence Calibration.

A special case of prevalence scaling fixes one particular *λ*. We select $\tilde{\lambda}$ s.t. all classes have the same prevalence, e.g., $\tilde{\lambda}_{i,i} = 1/prevalence(i)$. We call this prevalence calibration:

$$m^{\tilde{\lambda}} = m\tilde{\lambda}. \quad (6)$$
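One convenient calibration rescales each column to unit mass, so that every class carries the same prevalence; a sketch (helper name is ours):

```python
def calibrate(m):
    """Prevalence calibration: rescale column j by 1/prevalence(j),
    so every class ends up with equal (unit) prevalence."""
    n = len(m)
    prev = [sum(row[j] for row in m) for j in range(n)]
    return [[m[i][j] / prev[j] for j in range(n)] for i in range(n)]

m = [[15, 5],
     [10, 10]]          # prevalences: 25 and 15
mc = calibrate(m)       # now both classes carry the same mass
```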

### 3.5 Chance Correction (PV)

Two simple ‘baseline’ classifiers are: Predicting classes uniformly randomly, or based on observed prevalence. A macro metric can be expected to show robustness against *any* such chance classifier and be *chance corrected*, assigning a clear and comparable baseline score. Thus, it should have

*metric* has PV iff for any (large) dataset *S* with *n*^{S} classes and any set *A* of arbitrary random classifiers:

$$\max_{r \in A}\, metric(m^{r,S}) \leq \omega(n^S).$$

Here, *ω* returns an upper-bound baseline score from the number of classes *n*^{S} alone. If it also holds that $\max_{r \in A} metric(m^{r,S}) = \min_{r \in A} metric(m^{r,S}) = \omega(n^S)$, we say that *metric* is *strictly chance corrected*, and in the case where ∀*S*, *r* : *metric*(*m*^{r, S}) = Ω (constant) we speak of *complete chance correction*.

Less formally, chance correction means that the metric score attached to any chance baseline has a bound that is known to us (the bound generalizes over data sets but not over the number of classes). Strict chance correction means additionally that any chance classifier’s score will be the same, and just depends on the number of classes. Finally, complete chance correction means that every chance classifier always yields the same score, regardless of the number of classes. Note that strictness or completeness may not always be desired, since they can marginalize empirical overall correctness in a data set. Any chance correction, however, increases the evaluation interpretability by contextualizing the evaluation with an interpretable baseline score.

## 4 Metric Property Analysis

Equipped with the appropriate tools, we are now ready to start the analysis of classification metrics. We will study ‘Accuracy’, ‘Macro Recall’, ‘Macro Precision’, ‘Macro F1’, ‘Weighted F1’, ‘Kappa’, and ‘Matthews Correlation Coefficient’ (MCC).

### 4.1 Accuracy (aka Micro F1)

#### Property Analysis.

As a ‘micro’ metric, Accuracy has only PI (monotonicity). This is expected, since PII–V aim at *macro* metrics. Interestingly, in multi-class evaluation, *Accuracy* equals the ‘*micro Precision, micro Recall and micro F1*’ that sometimes occur in papers. See Appendix A for proofs.
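The equality is easy to verify numerically: pooling true positives, false positives, and false negatives over all classes (‘micro’ averaging) collapses to Accuracy, because every off-diagonal cell is simultaneously a false positive (for its row class) and a false negative (for its column class). A sketch (toy matrix and names are ours):

```python
def accuracy(m):
    n = len(m)
    return sum(m[i][i] for i in range(n)) / sum(sum(r) for r in m)

def micro_f1(m):
    """Micro-averaged F1: pool TP/FP/FN over all classes first."""
    n = len(m)
    tp = sum(m[i][i] for i in range(n))
    fp = sum(m[i][j] for i in range(n) for j in range(n) if i != j)
    fn = fp  # every off-diagonal cell is both an FP and an FN
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

m = [[4, 1, 0], [1, 2, 1], [0, 1, 5]]
# micro_f1(m) == accuracy(m)
```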

#### Discussion.

Accuracy is an important statistic, estimating the probability of observing a correct prediction in a data set. But this means that it is strictly tied to the class prevalences in a specific data set. And so, in the pursuit of some balance or a more generalizable score, researchers seem interested in other metrics.

### 4.2 Macro Recall: Ticks Five Boxes

#### Property Analysis.

Macro Recall has all five properties (Proofs in Appendix B). It is also *strictly* chance corrected with *ω*(*n*) = 1/*n*.

#### Discussion.

Since macro Recall has all five properties, including prevalence invariance (PIV), it may be a good pick for evaluation, particularly through a ‘macro’ lens. It also offers three intuitive interpretations: *Drawing an item from a random class*, *Bookmaker metric*, and *prevalence-calibrated Accuracy*.

In the first interpretation, we draw a random item from a randomly selected class. What’s the probability that it is correctly predicted? *MacR* estimates the answer: $\sum_i \frac{1}{n} \cdot P(f \to i \mid c = i)$.

In the second interpretation, we play against *a (fair) Bookmaker*.^{6} For every prediction (bet), we pay 1 coin and gain coins per fair (European) odds. The odds for a correct bet, when the class is *i*, are $odds(i) = \frac{|S|}{prevalence(i)}$. So for each data example (*x*, *y*), our bet is evaluated ($\mathbb{I}[f(x) = y] \in \{0, 1\}$), and we thus incur a positive total net gain iff *macR* > 1/*n*.

The third interpretation views *macro Recall as Accuracy after prevalence calibration*. Setting *λ* as in Eq. 6, we obtain

$$Accuracy(m^{\tilde{\lambda}}) = macR(m). \quad (8)$$
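This reading can be checked numerically; a sketch with our own helper names, where `calibrate` rescales each column to unit prevalence:

```python
def macro_recall(m):
    """Unweighted mean over per-class recalls."""
    n = len(m)
    return sum(m[i][i] / sum(row[i] for row in m) for i in range(n)) / n

def accuracy(m):
    n = len(m)
    return sum(m[i][i] for i in range(n)) / sum(sum(r) for r in m)

def calibrate(m):  # scale each column to unit prevalence
    n = len(m)
    prev = [sum(row[j] for row in m) for j in range(n)]
    return [[m[i][j] / prev[j] for j in range(n)] for i in range(n)]

m = [[15, 5], [10, 10]]
# accuracy(calibrate(m)) equals macro_recall(m)
```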

### 4.3 Macro Precision: Is the Bias an Issue?

#### Property Analysis.

While properties I, II, III, V are fulfilled, macro Precision does not have prevalence invariance (Proofs in Appendix C). With some *λ*, the max. score difference ($macP(m)$ vs. $macP(m^{\lambda})$) approaches $1 - \frac{1}{n}$.^{7} Like macro Recall, it is strictly chance corrected (*ω*(*n*) = 1/*n*).

#### Discussion.

Macro Precision aims to approximate the probability of observing a correct prediction when drawing a random prediction from a uniformly chosen predicted class. Hence, *macP* seems to provide us with an interesting measure of ‘prediction trustworthiness’. An issue is that the score does not generalize across different class prevalences, since $bias(i) \propto P(f \to i) = \sum_j P(f \to i, c = j)$ is subject to change if the prevalences of other classes *j* ≠ *i* vary (∝: proportional to). Therefore, even though *macP* is decomposed over classes (PIII), it is not invariant to prevalence changes (PIV), and if we have *f*, *f*′ with different biases, score differences are difficult to interpret, particularly under a ‘macro’ expectation that a metric be robust to class prevalence.

A remedy is a prevalence-calibrated *macP*^{∼} that employs a prior belief that all classes have the same prevalence. Like macro Recall, *macP*^{∼} is then detached from the class distribution in a specific data set, treating all classes more literally ‘equally’.

### 4.4 Macro F1: Metric of Choice in Many Tasks

#### Property Analysis.

Again, all properties except PIV are fulfilled (Proofs in Appendix D). Interestingly, while macro F1 has PV (chance correction), the chance correction isn’t strict, differentiating it from other macro metrics: Indeed, its chance baseline upper bound *ω*(*n*) = 1/*n* is achieved only if $P(f \to i) = P(c = i)$, meaning that macro F1 not only corrects for chance, but also factors in more data set accuracy (like a ‘micro’ score). Additionally, rewriting the per-class F1 as $\frac{2 P_i R_i}{P_i + R_i} = \frac{2 m_{i,i}}{bias(i) + prevalence(i)}$ shows that macro F1 is invariant to how a class’s error mass is spread between false positives and false negatives.

#### Discussion.

Macro F1 wants the distribution of prediction and class prevalence to be similar (a micro feature), but also high correctness for every class, by virtue of the unweighted mean over classes (a macro feature). Thus it seems useful to find classifiers that do well in a given data set, but probably also in others, a ‘balance’ that could explain its popularity. However, macro F1 inherits an interpretability issue of Precision. It doesn’t strictly ‘treat all classes equally’ as per PIV, at least not without prevalence calibration (Eq. 6).

### 4.5 Macro F1: A Doppelganger

#### Property Analysis.

In contrast to its name twin, one further property is missing: PIII, since *macF*1′ cannot be decomposed over classes (Proofs in Appendix E). It is *strictly* chance corrected with *ω*(*n*) = 1/*n*. Opitz and Burst (2019) prove that Eq. 11 and Eq. 10 can diverge by up to 0.5.
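The two name twins are computed differently: the common macro F1 averages per-class F1 scores, whereas the doppelganger takes the harmonic mean of macro Precision and macro Recall. A sketch (toy matrix and function names are ours) showing that the two can disagree:

```python
def prf(m, i):
    """Per-class precision and recall for class i."""
    p = m[i][i] / sum(m[i]) if sum(m[i]) else 0.0
    col = sum(row[i] for row in m)
    r = m[i][i] / col if col else 0.0
    return p, r

def mac_f1(m):
    """Mean over per-class F1 scores (the common 'macro F1', Eq. 10)."""
    n = len(m)
    f1s = []
    for i in range(n):
        p, r = prf(m, i)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return sum(f1s) / n

def mac_f1_prime(m):
    """Harmonic mean of macro precision and macro recall (Eq. 11)."""
    n = len(m)
    mp = sum(prf(m, i)[0] for i in range(n)) / n
    mr = sum(prf(m, i)[1] for i in range(n)) / n
    return 2 * mp * mr / (mp + mr) if mp + mr else 0.0

m = [[90, 30], [10, 20]]
# On skewed matrices the two 'macro F1' scores differ.
```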

#### Discussion.

*macF*1′ seems to stay a tad more true to the emphasis in its name (F1, a.k.a. harmonic mean). However, *macF*1′ does not seem as easy to interpret, since the numerator involves the cross-product of all class-wise recall and precision values. We might view it through the lens of an inter-annotator agreement (IAA) metric, though, treating classifier and reference as two annotators; still, it lacks *macR*’s clear interpretation(s).

### 4.6 Weighted F1

#### Property Analysis.

Weighted F1 is sensitive to classes (PII). The other four properties are not featured, which means that it is also non-monotonic. See Appendix F for proofs.

#### Discussion.

While measuring performance ‘locally’ for each class, the results are weighted by class prevalence. Imagining metrics on a spectrum from ‘micro’ to ‘macro’, *weightF*1 sits next to Accuracy, the prototypical micro metric. This is also made obvious by its featured properties, of which only one marks a ‘macro’ metric (PII). Due to its lowered interpretability and non-monotonicity, we may wonder why *weightF*1 would be preferred over Accuracy. Finally, with prevalence calibration, it reduces to macro F1, $weightF1(m^{\tilde{\lambda}}) = macF1(m^{\tilde{\lambda}})$, similar to how calibrated Accuracy reduces to macro Recall.

### 4.7 Birds of a Feather: Kappa and MCC

Using vector notation,^{8} we can state both metrics as concisely as possible. Let **1** be a vector of ones of dimension *n*. At index *i* of vector **p**, we find *prevalence*(*i*), and at index *i* of vector **b**, we find *bias*(*i*), each normalized by the total mass |*S*|. Further, let $acc = \sum_i m_{i,i}/|S|$.

#### Generalized Matthews Correlation Coefficient (MCC).

$$MCC(m) = \frac{acc - \mathbf{p}^T\mathbf{b}}{\sqrt{(1 - \mathbf{b}^T\mathbf{b})(1 - \mathbf{p}^T\mathbf{p})}}$$

#### Cohen’s Kappa

$$KAPPA(m) = \frac{acc - \mathbf{p}^T\mathbf{b}}{1 - \mathbf{p}^T\mathbf{b}}$$

#### Property Analysis.

MCC and Kappa have PII and PV (complete chance correction: Ω = 0). However, they are *non*-monotonic (PI), not class-decomposable (PIII), and not prevalence-invariant (PIV); Proofs in Appendix G. Further note that *sgn*(*MCC*) = *sgn*(*KAPPA*) and |*MCC*|≥|*KAPPA*|, since $\mathbf{p}^T\mathbf{b} \leq \sqrt{(\mathbf{b}^T\mathbf{b})(\mathbf{p}^T\mathbf{p})}$ by Cauchy–Schwarz.

#### Discussion.

Kappa and MCC are similar measures. Since $chance \approx \sum_i P(c = i) \cdot P(f \to i)$ allows the interpretation of observing a prediction that is correct just by chance, Kappa and MCC can be viewed as a standardized Accuracy.
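A sketch of both measures computed from a raw confusion matrix, assuming the standard generalized (multi-class) definitions with prevalence and bias normalized to probabilities; function and variable names are ours:

```python
import math

def kappa_mcc(m):
    """Kappa and generalized MCC via the prevalence vector p
    and bias vector b (all masses normalized by the total)."""
    n = len(m)
    total = sum(sum(row) for row in m)
    acc = sum(m[i][i] for i in range(n)) / total
    b = [sum(m[i]) / total for i in range(n)]                 # bias(i)
    p = [sum(row[j] for row in m) / total for j in range(n)]  # prevalence(j)
    chance = sum(pi * bi for pi, bi in zip(p, b))             # p^T b
    kappa = (acc - chance) / (1 - chance)
    mcc = (acc - chance) / math.sqrt((1 - sum(x * x for x in b)) *
                                     (1 - sum(x * x for x in p)))
    return kappa, mcc

m = [[40, 10], [10, 40]]
k, c = kappa_mcc(m)
# Here bias equals prevalence, so Kappa and MCC coincide.
```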

But how can we interpret $\mathbf{b}^T\mathbf{b}$ and $\mathbf{p}^T\mathbf{p}$ in MCC? Given two random items drawn from two random classes, $\mathbf{b}^T\mathbf{b}$ seems to measure the chance that the classifier randomly predicts the same label for both, while $\mathbf{p}^T\mathbf{p}$ measures the chance that the true labels are the same. This adds complexity to the MCC formula that can make classifier comparison less clear. The stronger dependence on *classifier bias* through $\mathbf{b}^T\mathbf{b}$ also favors classifiers with uneven biases, regardless of the actual class distribution in a data set. This reduced interpretability is still evident when the measures are prevalence-calibrated (Eq. 6): calibrated Kappa reduces to an affine (and thus rank-preserving) transformation of *macR* (Eq. 16), while calibrated MCC retains its dependence on the classifier bias through $\mathbf{b}^T\mathbf{b}$.

## 5 Metric Variants

### 5.1 Mean Parameterization in PIII

We can set *p* ≠ 1 in the generalized mean (Eq. 4). For example, for macro Recall, setting $p \to 0$ yields the geometric mean: $GmacR(m) = (\prod_{i=1}^{n} R_i)^{1/n}$. Like *macR*, it has all five properties. Given *n* random items, one from every class, *GmacR* approximates a (class-count normalized) probability that all are correctly predicted. Hence, *GmacR* can be useful when it’s important to perform well in *all* classes. Thinking further along this line, we can employ *HmacR* (*p* = −1) with the harmonic mean $HM(R_1, \dots, R_n) = n\left(\frac{1}{R_1} + \dots + \frac{1}{R_n}\right)^{-1}$.
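A sketch of both variants (function names are ours); note that the harmonic mean is undefined when some class recall is zero:

```python
import math

def recalls(m):
    """Per-class recalls from the confusion matrix."""
    n = len(m)
    return [m[i][i] / sum(row[i] for row in m) for i in range(n)]

def gmac_r(m):
    """Geometric-mean macro recall (p -> 0 in the generalized mean)."""
    rs = recalls(m)
    return math.prod(rs) ** (1 / len(rs))

def hmac_r(m):
    """Harmonic-mean macro recall (p = -1); requires all recalls > 0."""
    rs = recalls(m)
    return len(rs) / sum(1 / r for r in rs)

m = [[80, 40], [20, 60]]
# HM <= GM <= AM: hmac_r(m) <= gmac_r(m) <= arithmetic macro recall
```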

### 5.2 Prevalence Calibration

Property PIV (prevalence invariance) is rare, but we saw that it can be artificially enforced. Indeed, if we standardize the confusion matrix by making sure every class has the same prevalence (Eq. 5, Eq. 6), we ensure prevalence invariance (PIV) for a measure. As an effect of this, we found that Kappa and Accuracy reduce to macro Recall, and weighted F1 becomes the same as macro F1. For a more detailed interpretation of prevalence calibration, see our discussion for macro Precision (§4.3).

When does a prevalence calibration make sense? Since prevalence calibration offers a gain in ‘macro’-features, it can be used with the aim to push a metric more towards a ‘macro’ metric.

## 6 Discussion

### 6.1 Summary of Metric Analyses

Table 5 shows an overview of the visited metrics. We make some observations: i) macro Recall has all five properties, including class prevalence invariance (PIV), i.e., ‘it treats all classes equally’ (in a strict sense). However, through prevalence calibration, all metrics obtain PIV. ii) Kappa, MCC, and weighted F1 do not have property PI. Under some circumstances, errors can increase the score, possibly lowering interpretability. iii) All metrics except Accuracy and weighted F1 show chance baseline correction. Strict chance baseline correction isn’t a feature of Macro F1, and complete (class-count independent) chance correction is only achieved with MCC and Kappa.

Table 5: Overview of metric properties; ✗_{(✓)} indicates that the property is obtained after prevalence calibration.

| metric | PI (mono.) | PII (class sens.) | PIII (decomp.) | PIV (prev. invar.) | PV (chance correct.) |
|---|---|---|---|---|---|
| Accuracy (=Micro F1) | ✓ | ✗_{(✓)} | ✗_{(✓)} | ✗_{(✓)} | ✗_{(✓)} |
| macro Recall (macR) | ✓ | ✓ | ✓ | ✓ | ✓: 1/n, strict |
| as GmacR or HmacR | ✓ | ✓ | ✓ | ✓ | ✓: 1/n |
| macro Precision | ✓ | ✓ | ✓ | ✗_{(✓)} | ✓: 1/n, strict |
| macro F1 (macF1) | ✓ | ✓ | ✓ | ✗_{(✓)} | ✓: 1/n |
| macro F1’ (macF1′) | ✓ | ✓ | ✗ | ✗_{(✓)} | ✓: 1/n, strict |
| weighted F1 | ✗ | ✓ | ✗_{(✓)} | ✗_{(✓)} | ✗_{(✓)} |
| Kappa | ✗ | ✓ | ✗ | ✗_{(✓)} | ✓: 0, complete |
| MCC | ✗ | ✓ | ✗ | ✗_{(✓)} | ✓: 0, complete |


Macro Recall and Accuracy seem to complement each other. Both have a clear interpretation and relate to each other through a simple prevalence calibration. Indeed, macro Recall can be understood as a prevalence-calibrated version of Accuracy. Macro F1, on the other hand, is interesting since it does not strictly correct for chance (as macro Recall does) but also factors in more of the test set correctness (as a ‘micro’ score does).

MCC and Kappa are similar measures, where Kappa tends to be slightly more interpretable and shows more robustness to classifier biases. Accuracy and weighted F1 are also somewhat similar, both being greatly affected by class prevalence. As discussed in §4.6, we could not determine clear reasons for favoring weighted F1 over Accuracy.

### 6.2 What are Other ‘Balances’?

The concept of ‘balance’ seems positively flavored, and thus we may wish to reflect on more ‘balances’ other than prevalence invariance (PIV).

Another type of ‘balance’ is introduced by *GmacR* (or *HmacR*). By virtue of the geometric (harmonic) mean that puts more weight on low outliers, they favor a classifier that equally distributes its correctness over classes. This is also reflected by *macR* being *strictly* chance corrected with 1/*n*, while its siblings have 1/*n* as the upper bound *only* achieved by the *uniform random baseline*, and the metrics’ gradients that scale with low-recall outliers (Appendix B.5, B.6).

Yet another type of ‘balance’ we saw in macro F1 (*macF*1), which selects a classifier with high recall over many classes (as featured by a ‘macro’ metric) while also maximizing empirical data set correctness (as featured by a ‘micro’ metric), an attribute that is also visible in its chance baseline upper bound (1/*n*), which is *only* achieved by a prevalence-informed baseline.

Finally, a ‘meta balance’ could be achieved when we are unsure which metric to use, by ensembling a score from a set of selected metrics.

### 6.3 Value of Class-wise Recall

The class-wise recall $R_i$ estimates $P(f \to i \mid c = i)$, a quantity that does not depend on class prevalence. The expected accuracy on a data set with class prevalences $P(c = i)$ will then equal $\sum_i P(c = i) \cdot R_i$, which makes class-wise recall scores valuable for projecting performance onto other class distributions.

## 7 Reflecting on SemEval Shared Tasks

So far, we have focused on theory. Now we want to take a look at applied evaluation practice. We study works from the *SemEval* series, a large annual NLP event where teams compete in various tasks, many of which are classification tasks.

### 7.1 Example Shared Task Study

As an example, we first study the popular SemEval shared task (Rosenthal et al., 2017) on tweet sentiment classification (positive/negative/neutral) with team predictions thankfully made available.

#### Insight: Different Metric → Different Ranking.

In Table 6, we see the results of the eight best of 37 systems.^{9} The two ‘winning’ systems (A, B, Table 6) were determined with *macR*, which is legitimate. Yet, system E also does quite well: It obtains the highest Accuracy, and it achieves a better balance over the three classes (*R*_{1} = 69.8, *R*_{2} = 64.0, *R*_{3} = 66.8, max. Δ = 5.8) as opposed to, e.g., system B (*R*_{1} = 87.8, *R*_{2} = 51.4, *R*_{3} = 65.2, max. Δ = 36.4), as also indicated, on average, by the *GmacR* and *HmacR* metrics. So if we want to ensure performance under high uncertainty of class prevalence (as expected in Twitter data?), we may prefer system E, a system that would also be ranked higher when using an ensemble of metrics.

Table 6: Scores of the eight best systems under different metrics; ‘off. r’ is the official rank, and the columns marked (cal.), together with r1–r3, are computed after prevalence calibration.

| sys | off. r | citations | Acc | macR | macP | macF1 | macF1’ | weightF1 | Kappa | MCC | GmacR | HmacR | r1 | r2 | r3 | macF1 (cal.) | macF1’ (cal.) | Kappa (cal.) | MCC (cal.) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 1 | 555 | 65.1 | 68.1 | 65.5 | 65.4 | 66.8 | 64.5 | 46.5 | 48.0 | 66.8 | 65.4 | 82.9 | 51.2 | 70.2 | 67.7 | 68.2 | 52.1 | 52.6 |
| B | 1 | 271 | 65.8 | 68.1 | 67.4 | 66.0 | 67.8 | 65.1 | 47.3 | 49.2 | 66.5 | 65.0 | 87.8 | 51.4 | 65.2 | 67.7 | 68.7 | 52.2 | 53.1 |
| C | 3 | 20 | 66.1 | 67.6 | 66.2 | 66.0 | 66.9 | 65.7 | 47.2 | 48.1 | 66.8 | 66.0 | 81.7 | 56.0 | 65.2 | 67.5 | 68.0 | 51.4 | 51.8 |
| D | 4 | 12 | 65.2 | 67.4 | 64.9 | 65.1 | 66.1 | 64.8 | 46.3 | 47.3 | 66.5 | 65.6 | 80.3 | 54.2 | 67.6 | 67.1 | 67.5 | 51.0 | 51.3 |
| E | 5 | 23 | 66.4 | 66.9 | 65.4 | 66.0 | 66.1 | 66.4 | 47.0 | 47.0 | 66.8 | 66.8 | 69.8 | 64.0 | 66.8 | 67.3 | 67.5 | 50.3 | 50.5 |
| F | 6 | 11 | 64.8 | 65.9 | 63.9 | 64.5 | 64.9 | 64.7 | 45.0 | 45.4 | 65.6 | 65.4 | 73.5 | 58.7 | 65.6 | 66.1 | 66.3 | 48.9 | 49.0 |
| G | 7 | 2 | 63.3 | 64.9 | 63.6 | 63.4 | 64.2 | 63.1 | 43.0 | 43.8 | 64.2 | 63.5 | 77.4 | 53.9 | 63.5 | 64.9 | 65.4 | 47.4 | 47.7 |
| H | 8 | 30 | 64.3 | 64.5 | 63.1 | 63.7 | 63.8 | 64.4 | 43.6 | 43.6 | 64.5 | 64.5 | 65.3 | 63.6 | 64.5 | 64.9 | 65.2 | 46.7 | 47.0 |


Figure 1 shows a pair-wise Spearman *ρ* correlation of team rankings of all 37 teams according to different metrics. There are only 4 pairs of metrics that are in complete agreement (*ρ* = 100): Macro Recall agrees with calibrated Kappa and calibrated Accuracy. This makes sense: As we noted before (Eq. 8, Eq. 16), they are equivalent (the same applies to weighted F1 and macro F1, after calibration). Looking at single classes, it seems that the second class is the one that can tip the scale: *R*_{2} disagrees in its team ranking with *all* other metrics (*ρ* ≤ 14).
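Such a pairwise ranking comparison can be reproduced with a few lines of stdlib code; the per-system scores below are toy values, and the classic rank formula assumes no ties:

```python
def rankdata(scores):
    """Ranks (1 = highest score); assumes no ties for simplicity."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(xs, ys):
    """Spearman rho via the classic sum of squared rank differences."""
    n = len(xs)
    rx, ry = rankdata(xs), rankdata(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

metric_a = [66.4, 66.1, 65.8, 65.2, 64.8]  # toy per-system scores
metric_b = [66.9, 67.6, 68.1, 67.4, 65.9]
# rho = 1 only when two metrics rank all systems identically
```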

#### Do Rankings Impact Paper Popularity?

For the eight best systems in Table 6, we retrieve their citation counts from Google Scholar as a (very coarse) proxy for their popularity. The result invites the speculation that even a superficial ordering can have an effect: While team 1 (A) and team 2 (B) were both explicitly selected as winners of the shared task, the first-listed system has almost twice as many citations (even though B performs better as per most metrics). The citations of the other systems (C–H) do not exceed the lower double digits, although we saw that a case can even be made for E (rank 6), which achieves stable performance over all classes.

### 7.2 Examining Metric Argumentation

Our selected example task used macro Recall for winner selection. While the measure had been selected with care (its value also became evident in our analysis), we saw that systems were very close and arguments could have been made for selecting a slightly different (set of) winner(s). To our surprise, in a large proportion of shared tasks, the situation seems worse. Annotating 42 classification shared task overview papers from the last five years,^{10} we find that

- only 23.8% of papers provide a formula;
- only 10.9% provide a sensible argument for their metric;
- 14.3% use a weak argument similar to the ones shown in Table 1;
- a large number (73.8%) do not state an argument or employ a ‘trope’ like “As is standard, ...”.

The most frequent metric in our sample is ‘macro F1’ (*macF*1). In some cases, its doppelganger *macF*1′ seems to have been used, or a ‘balanced Accuracy’.^{11} Sometimes, in the absence of further description or formula, it can be hard to tell which ‘macro F1’ metric has been used, also due to deviating naming (macro-average F1, mean F1, macro F1, etc.). In at least one case, this has led to a disadvantage: Fang et al. (2023) report that “*During model training and validation, we were not aware that the challenge organizers used a different method for calculating macro F1, namely by using the averages of precision and recall*”, a measure that is much different (cf. §4.4).

We may also wonder about the situation beyond SemEval. While a precise characterization of the broader picture is beyond the scope of this paper, a cursory glance shows the same unclarities in research papers from all kinds of domains.

## 8 Recommendations

Overall, we would like to refrain from making bold statements as to which metric is ‘better’, since different contexts may easily call for different metrics. Still, from our analyses, we can synthesize some general recommendations:

- **State the evaluation metric clearly**, ideally with a formula. This also protects against ambiguity, e.g., as induced through homonyms such as *macF1* and *macF1′*.

- **Try building a case for a metric.** As a starting point, we can think about how the class distribution in our data set aligns with the distribution we could expect in an application. With greater uncertainty, more macro metric features may be useful. For finer selection, consider viewing our analyses and checking any desirable metric features. Our summary (§6.1) can provide guidance.

- **Consider presenting more than a single number.** In particular, complementary metrics such as Accuracy and macro Recall can be indicative of i) a classifier’s empirical data set correctness (as in a micro metric) and ii) its robustness to class distribution shifts (as in a macro metric). If the number of classes *n* is low, consider presenting class-wise recall scores for their generalizability. If *n* is larger and a metric is decomposable over classes (PIII), reporting the variance of a ‘macro’ metric over classes can also be of value.

- **Consider admitting multiple ‘winners’ or ‘best systems’.** If we are not able to build a strong case for a single metric, it may be sensible to present a *set of well-motivated metrics* and select a *set of best systems*. If *one* ‘best’ system needs to be selected, an average over such a set can be a useful heuristic.
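Reporting complementary numbers as suggested above can be sketched in a few lines (a sketch under the matrix convention of Appendix G, rows = predictions and columns = gold labels; the function name and dictionary keys are ours):

```python
import statistics

def report(m):
    """Summarize a confusion matrix m (m[i][j]: predicted i, gold j)."""
    n = len(m)
    total = sum(sum(row) for row in m)
    # class-wise recall: correct(j) / prevalence(j)
    recalls = [m[j][j] / sum(m[i][j] for i in range(n)) for j in range(n)]
    return {
        'accuracy': sum(m[k][k] for k in range(n)) / total,  # micro view
        'macro_recall': sum(recalls) / n,                    # macro view
        'classwise_recall': recalls,                         # useful for small n
        'recall_variance': statistics.pvariance(recalls),    # useful for larger n
    }

m = [[10, 43, 0],
     [1, 1, 0],
     [0, 0, 1]]
r = report(m)
print(r)
```

On this skewed example, Accuracy and macro Recall diverge sharply, illustrating why a single number can hide a classifier’s behavior.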

## 9 Background and Related Work

### Meta Studies of Classification Metrics.

Surveys or dedicated book chapters can provide a useful introduction to classification evaluation (Sokolova and Lapalme, 2009; Manning, 2009; Tharwat, 2020; Grandini et al., 2020). Deeper analysis has been provided mostly in the two-class setting: In a series of articles by David M. W. Powers, we find the (previously mentioned) Bookmaker’s perspective on metrics (Powers, 2011), a critique of the F1 (Powers, 2015) and Kappa (Powers, 2012). Delgado and Tibau (2019) study binary MCC and Kappa (favoring MCC), while Sebastiani (2015) defines axioms for binary evaluation, including a monotonicity axiom akin to a stricter version of our PI, advocating a ‘K-metric’.^{12} Luque et al. (2019) analyze binary confusion matrices, and Chicco and Jurman (2020) compare F1, Accuracy and MCC, concluding that “MCC should be preferred (…) by all scientific communities”. The mathematical relationship between the two macro F1s is further analyzed by Opitz and Burst (2019).

Overall, we want to advocate a mostly *agnostic stance* as to which metric is picked in a given case (provided the choice is made sensibly), remembering our premise from the introduction: *the perfect metric doesn’t exist*. Thus, we aimed at balancing intuitive interpretation and analysis of metrics, while acknowledging desiderata as worded in papers.

### Other Classification Evaluation Methods.

These can be required when class labels reside on an ordinal ‘Likert’ scale (O’Neill, 2017; Amigo et al., 2020), or in a hierarchy (Kosmopoulos et al., 2015), or when they are ambiguous and need to be matched (e.g., ‘none’/‘null’/‘other’) across annotation styles (Fu et al., 2020), or when their number is unknown (Ge et al., 2017). Classifiers are also evaluated with *P-R curves* (Flach and Kull, 2015) or a *receiver operating characteristic* (Fawcett, 2006; Honovich et al., 2022). *CheckList* (Ribeiro et al., 2020) proposes behavioral testing of classifiers, and the *NEATCLasS* workshop series (Ross et al., 2023) is an effort to find novel ways of evaluation.

## 10 Conclusion

Starting from a definition of the two basic and intuitive concepts of *classifier bias* and *class prevalence*, we examined common classification evaluation metrics, resolving unclear expectations such as those that pursue a ‘balance’ through ‘macro’ metrics. Our metric analysis framework, including definitions and properties, can aid in the study of other or new metrics. A main goal of our work is to provide guidance for more informed decision making in metric selection.

## Acknowledgments

We are grateful to three anonymous reviewers and Action Editor Sebastian Gehrmann for their valuable comments. We are also thankful to Julius Steen for helpful discussion and feedback.

## Notes

We use ℝ (instead of ℕ) to allow for cases where matrix fields contain, e.g., ratios, or accumulated ‘soft’ scores.

Assume any data set $S$ and splits $S'$, $S''$, $S'''$ s.t. $S' \cup S'' \cup S''' = S$ and $S'' = \{(x, y)\}$ (i.e., $|S''| = 1$). Then for any classifier $f$ we want to ensure $f(x) = y \Rightarrow metric(m^{f, S' \cup S''}) \ge metric(m^{f, S'})$, else $metric(m^{f, S' \cup S''}) \le metric(m^{f, S'})$.

Discrete case: Assume any data set $S$ and splits $S'$, $S''$, $S'''$, $S''''$ s.t. $S' \cup S'' \cup S''' \cup S'''' = S$ and $S'' = \{(x, y)\}$, $S''' = \{(w, z)\}$ (i.e., $|S''| = |S'''| = 1$). Then *metric* is not a ‘micro’ metric if there is any $f$ with $[f(x) = y \wedge f(w) = z] \vee [f(x) \ne y \wedge f(w) \ne z]$ and $metric(m^{f, S' \cup S''}) \ne metric(m^{f, S' \cup S'''})$.

Consider a matrix with ones on the diagonal, and large numbers in the first column (yielding low class-wise precision scores). With *λ* where *λ*_{1,1} is very small (reducing the prevalence of class 1), we obtain high precision scores.
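This counterexample can be reproduced in a few lines (a sketch; the concrete numbers are our own, chosen to match the described matrix shape):

```python
def mac_precision(m):
    """Arithmetic mean of class-wise precision (rows = predictions)."""
    n = len(m)
    return sum(m[i][i] / sum(m[i]) for i in range(n)) / n

# ones on the diagonal, large numbers in the first column
m = [[1, 0, 0],
     [100, 1, 0],
     [100, 0, 1]]

# shrink the prevalence of class 1 (the first column) via a small lambda
lam = 0.01
m_scaled = [[v * lam if j == 0 else v for j, v in enumerate(row)] for row in m]

print(mac_precision(m))         # low: rows 2 and 3 are dominated by errors
print(mac_precision(m_scaled))  # high: class-wise precision scores recover
```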

$m_{i,j} = \frac{1}{|S|} \left|\{s \in S \mid f(s_1) = i \wedge s_2 = j\}\right| \in [0, 1]$, s.t. $\sum_{(i,j)} m_{i,j} = 1$. This models ratios in the matrix fields but does not change *MCC* or *KAPPA*.
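This invariance can be checked numerically for MCC; a sketch using the multi-class (generalized, $R_K$-form) MCC formula (the example matrix is our own):

```python
import math

def mcc(m):
    """Multi-class MCC from a (possibly unnormalized) confusion matrix."""
    n = len(m)
    s = sum(sum(row) for row in m)
    c = sum(m[k][k] for k in range(n))
    p = [sum(m[k][j] for j in range(n)) for k in range(n)]  # row sums (bias)
    t = [sum(m[i][k] for i in range(n)) for k in range(n)]  # col sums (prevalence)
    num = c * s - sum(pk * tk for pk, tk in zip(p, t))
    den = (math.sqrt(s * s - sum(pk * pk for pk in p))
           * math.sqrt(s * s - sum(tk * tk for tk in t)))
    return num / den

m = [[5, 1, 0], [1, 4, 1], [0, 2, 6]]
s = sum(sum(row) for row in m)
m_norm = [[v / s for v in row] for row in m]  # ratios summing to 1

# counts and ratios yield the same MCC
assert abs(mcc(m) - mcc(m_norm)) < 1e-12
```

Both numerator and denominator scale by $s^2$ under normalization, so the ratio is unchanged.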

A csv-file with our annotations is accessible at https://github.com/flipz357/metric-csv.

We did not find a formula. As per scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html, 2024/02/05) it may equal macro Recall.

Same as Powers’ (2011) *Informedness*, the *K-metric* can be understood as two-class macro Recall.

For example, cf. Sokolova and Lapalme (2009).


#### A Accuracy a.k.a. Micro Precision/Recall/F1

Accuracy can be understood as ‘micro F1’,^{13} the harmonic mean (*HM*) of ‘micro Precision’ and ‘micro Recall’, where micro Precision and micro Recall both reduce to Accuracy; the claim then follows from *HM*(*a*, *a*) = *a*.

##### A.1 Monotonicity $\u2713$

If $i \ne j$: $\frac{\partial Acc(m)}{\partial m_{i,j}} = -\frac{\sum_k correct(k)}{|S|^2} = -\frac{Acc}{|S|} \le 0$; else $\frac{\partial Acc(m)}{\partial m_{i,i}} = \frac{|S| - \sum_k correct(k)}{|S|^2} = \frac{1 - Acc}{|S|} \ge 0$.
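The off-diagonal case can be sanity-checked with a finite difference (the example matrix is our own):

```python
def acc(m):
    """Accuracy from an unnormalized confusion matrix."""
    total = sum(sum(row) for row in m)
    return sum(m[i][i] for i in range(len(m))) / total

m = [[10.0, 43.0, 0.0],
     [1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
size = sum(sum(row) for row in m)  # |S| = 56

# perturb an off-diagonal entry, i.e., add a tiny extra error
h = 1e-6
m_h = [row[:] for row in m]
m_h[0][1] += h
numeric = (acc(m_h) - acc(m)) / h

analytic = -acc(m) / size  # -Acc / |S|, as stated above
assert abs(numeric - analytic) < 1e-8
```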

##### A.2 Other Properties

It is easy to see that properties other than Monotonicity are not featured by Accuracy.

#### B Macro Recall

##### B.1 Monotonicity ✓

If $i \ne j$: $\frac{\partial macR(m)}{\partial m_{i,j}} = -\frac{correct(j)}{n \cdot prevalence(j)^2} \le 0$; else $\frac{\partial macR(m)}{\partial m_{i,i}} = \frac{prevalence(i) - correct(i)}{n \cdot prevalence(i)^2} \ge 0$.

##### B.2 Class Sensitivity ✓

Follows from above.

##### B.3 Class Decomposability ✓

In Eq. 4 set $g(row, col, x) = \frac{row_x}{\sum_i col_i}$ and $p = 1$.

##### B.4 Prevalence Invariance ✓

$R_i' = \frac{\lambda_{i,i} m_{i,i}}{\sum_j \lambda_{i,i} m_{j,i}} = \frac{\lambda_{i,i} m_{i,i}}{\lambda_{i,i} \sum_j m_{j,i}} = R_i$.
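This can also be checked numerically; a sketch with an arbitrary column rescaling (the example values are our own; columns hold gold labels):

```python
def mac_recall(m):
    """Arithmetic mean of class-wise recall (columns = gold labels)."""
    n = len(m)
    return sum(m[j][j] / sum(m[i][j] for i in range(n)) for j in range(n)) / n

m = [[10, 43, 0], [1, 1, 0], [0, 0, 1]]

# rescale the prevalence of class 1 (the first column) by an arbitrary lambda
lam = 0.01
m_scaled = [[v * lam if j == 0 else v for j, v in enumerate(row)] for row in m]

# macro Recall is unchanged: the scaling cancels within each column
assert abs(mac_recall(m) - mac_recall(m_scaled)) < 1e-12
```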

##### B.5 Chance Correction ✓

*macR* is strictly chance corrected.
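For intuition, a constant classifier (predictions independent of the input) scores exactly $\frac{1}{n}$ in *macR* regardless of the class distribution; a quick check (the matrices are our own examples):

```python
def mac_recall(m):
    """Arithmetic mean of class-wise recall (columns = gold labels)."""
    n = len(m)
    return sum(m[j][j] / sum(m[i][j] for i in range(n)) for j in range(n)) / n

# always predict class 0, for two very different class distributions
skewed = [[30, 20, 5], [0, 0, 0], [0, 0, 0]]
uniform = [[7, 7, 7], [0, 0, 0], [0, 0, 0]]

# both land at 1/n = 1/3: recall 1 for class 0, recall 0 elsewhere
assert abs(mac_recall(skewed) - 1/3) < 1e-12
assert abs(mac_recall(uniform) - 1/3) < 1e-12
```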

##### B.6 Gradients for *GmacR* and *HmacR*

Recall *macR* with arithmetic mean (*AM*). Let $\mu_x = (n \cdot prevalence(x))^{-1}$; then ($i, j$ implies $i \ne j$): [gradient equations omitted]

#### C Macro Precision

##### C.1 Monotonicity ✓

If $i \ne j$: $\frac{\partial macP(m)}{\partial m_{i,j}} = -\frac{correct(i)}{n \cdot bias(i)^2} \le 0$; else $\frac{\partial macP(m)}{\partial m_{i,i}} = \frac{bias(i) - correct(i)}{n \cdot bias(i)^2} \ge 0$.

##### C.2 Class Sensitivity ✓

Follows from above.

##### C.3 Class Decomposability ✓

In Eq. 4 set $g(row, col, x) = \frac{row_x}{\sum_i row_i}$ and $p = 1$.

##### C.4 Prevalence Invariance

It is not prevalence-invariant; cf. the counterexample in the notes (a matrix with ones on the diagonal and large numbers in the first column).

##### C.5 Chance Correction ✓

#### D Macro F1

##### D.1 Monotonicity ✓

Let $Z_x = bias(x) + prevalence(x)$. If $i \ne j$: [gradient equations omitted]

##### D.2 Class Sensitivity ✓

Follows from above.

##### D.3 Class Decomposability ✓

In Eq. 4 set $p = 1$, $g(row, col, x) = \frac{2\, row_x}{\sum_i row_i + col_i}$.

##### D.4 Prevalence Invariance

It is not prevalence-invariant, cf. C.4.

##### D.5 Chance Correction ✓

#### E Macro F1 (Name Twin)

##### E.1 Monotonicity ✓

Let $(macR + macP - macP \cdot macR) = \epsilon \ge 0$. We have $\frac{\partial macF1'(m)}{\partial m_{i,j}} = \frac{2x\epsilon}{macR + macP}$ where $x = \frac{\partial macR(m)}{\partial m_{i,j}} + \frac{\partial macP(m)}{\partial m_{i,j}}$. Since *macR* and *macP* are monotonic, *macF1′* also has monotonicity.

##### E.2 Label Sensitivity ✓

Follows from above.

##### E.3 Class Decomposability

Not possible.

##### E.4 Prevalence Invariance

It is not prevalence-invariant, cf. C.4.

##### E.5 Chance Correction ✓

Since *macF1′* is the *HM* of the (strictly chance corrected) macro Precision and macro Recall, it is also strictly chance corrected with value $\frac{1}{n}$.

#### F Weighted F1

##### F.1 Monotonicity

Denote $prevalence(i)$, $bias(i)$, $correct(i)$ by $x_i$, $y_i$, $z_i$. If $i \ne j$: [gradient equations omitted] In certain cases, *weightF1* increases even though a classifier made an error.

##### F.2 Label Sensitivity ✓

Follows from above.

##### F.3 Class Decomposability

Not possible.

##### F.4 Prevalence Invariance

Trivial.

##### F.5 Chance Correction

Trivial.

#### G Kappa and MCC

##### G.1 Monotonicity

We revert to the Kappa and MCC formulas for non-normalized confusion matrices, multiplying numerator and denominator by $s^2$, where $s = \sum_{(i,j)} m_{i,j}$ is the number of data examples, and write $r$ for $\sum_i correct(i)$.

###### Kappa.

Let $z_{ij} = prevalence(i) + bias(j)$. For $i \ne j$: [gradient equations omitted]

###### MCC.

For $i \ne j$: [gradient equations omitted] $N_K$ (Kappa) or $N_M$ (MCC) may tend to 0, but not $(r - z_{i,j}) \cdot D_{M|K}^2 \to 0$.

|  |  | *c* = x | *c* = y | *c* = z |
|---|---|---|---|---|
| $f \to$ | x | 10 | 43 | 0 |
|  | y | 1 | 1 | 0 |
|  | z | 0 | 0 | 1 |

##### G.2 Class Sensitivity ✓

Trivial.

##### G.3 Class Decomposability

Trivial.

##### G.4 Prevalence Invariance

Trivial.

##### G.5 Chance Correction ✓
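Chance correction for Kappa can be illustrated numerically: for a rank-one matrix $m_{i,j} = bias(i) \cdot prevalence(j)$, i.e., predictions statistically independent of gold labels, Kappa is 0. A sketch (the distributions are illustrative):

```python
def kappa(m):
    """Cohen's Kappa from a confusion matrix (rows = predictions)."""
    n = len(m)
    s = sum(sum(row) for row in m)
    acc = sum(m[k][k] for k in range(n)) / s
    # expected agreement under independence: sum of bias(i) * prevalence(i)
    expected = sum(sum(m[i]) * sum(m[k][i] for k in range(n))
                   for i in range(n)) / (s * s)
    return (acc - expected) / (1 - expected)

bias = [0.5, 0.3, 0.2]        # classifier's prediction distribution
prevalence = [0.6, 0.3, 0.1]  # gold label distribution
m = [[b * p for p in prevalence] for b in bias]

# a 'chance' classifier scores 0, however skewed bias and prevalence are
assert abs(kappa(m)) < 1e-12
```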

## Author notes

Action Editor: Sebastian Gehrmann