Data cleaning is an essential step in bibliometric analysis. Methods and tools for identifying duplicates, inconsistencies, and errors in data sets drawn from bibliographic databases such as OpenAlex and Dimensions are of particular interest. Existing data cleaning procedures usually handle each column of a data set independently. This article introduces definitions and tools, together with applications, that make better use of the often implicit dependencies between columns in a given data set. It defines two relations between columns, the parent-child relation and the coarser-finer relation. As tools, it introduces two diagrams, the monotony diagram and the dependency diagram, which give an overview of the most relevant relations, and one statistic, the conditional proportion of observed values, which evaluates the completeness of one field relative to the observed values of another. Applications show how identifying relations and generating diagrams can be used to explore and clean data. The tools are not limited to exploration and data cleaning, however, and further potential applications are discussed.
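
As an informal illustration of the statistic named in the abstract, the sketch below computes the share of records with an observed value in one field among the records with an observed value in another. The abstract does not give the formal definition, so this reading is an assumption rather than the article's method. The function name `conditional_proportion_observed`, the use of `None` for a missing value, and the sample records are all hypothetical.

```python
from typing import Optional, Sequence


def conditional_proportion_observed(
    target: Sequence[Optional[str]],
    given: Sequence[Optional[str]],
) -> Optional[float]:
    """Share of rows with an observed `target` among rows with an observed `given`.

    Assumption: a value counts as "observed" when it is not None.
    Returns None when `given` has no observed values at all.
    """
    if len(target) != len(given):
        raise ValueError("both columns must have the same number of rows")

    # Keep only the target values from rows where the conditioning field is observed.
    target_where_given = [t for t, g in zip(target, given) if g is not None]
    if not target_where_given:
        return None

    observed = sum(t is not None for t in target_where_given)
    return observed / len(target_where_given)


# Hypothetical records: a DOI column and a publication-year column.
doi = ["10.1/a", None, "10.1/c", None]
year = ["2020", "2021", None, None]

# Two rows have a year; one of those two also has a DOI.
print(conditional_proportion_observed(doi, year))
```

Under this reading the statistic is not symmetric in general: conditioning the completeness of `year` on `doi` looks at a different subset of rows than the reverse.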

Author notes

Handling Editor: Vincent Larivière

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.
