Agreement statistics play an important role in the evaluation of coding schemes for discourse and dialogue. Unfortunately there is a lack of understanding regarding appropriate agreement measures and how their results should be interpreted. In this article we describe the role of agreement measures and argue that only chance-corrected measures that assume a common distribution of labels for all coders are suitable for measuring agreement in reliability studies. We then provide recommendations for how reliability should be inferred from the results of agreement statistics.
School of Computer Science, University of Manchester, Manchester, M13 9PL, U.K.