Fine-grained classification of social science journal articles using textual data: A comparison of supervised machine learning approaches

We compare two supervised machine learning algorithms—Multinomial Naïve Bayes and Gradient Boosting—to classify social science articles using textual data. The high level of granularity of the classification scheme used and the possibility that multiple categories are assigned to a document make this task challenging. To collect the training data, we query three discipline specific thesauri to retrieve articles corresponding to specialties in the classification. The resulting data set consists of 113,909 records and covers 245 specialties, aggregated into 31 subdisciplines from three disciplines. Experts were consulted to validate the thesauri-based classification. The resulting multilabel data set is used to train the machine learning algorithms in different configurations. We deploy a multilabel classifier chaining model, allowing for an arbitrary number of categories to be assigned to each document. The best results are obtained with Gradient Boosting. The approach does not rely on citation data. It can be applied in settings where such information is not available. We conclude that fine-grained text-based classification of social sciences publications at a subdisciplinary level is a hard task, for humans and machines alike. A combination of human expertise and machine learning is suggested as a way forward to improve the classification of social sciences documents.


INTRODUCTION
Disciplines have long been considered as the fundamental units of division within the sciences (Stichweh, 2003). These units are knowledge production and communication systems, and can as such serve important classificatory functions (Hammarfelt, 2018;Stichweh, 1992Stichweh, , 2003Sugimoto & Weingart, 2015;van den Besselaar & Heimeriks, 2006). The subjects of interest for scientometricians (i.e., scientific documents) are classified according to disciplines to facilitate research into knowledge production and dissemination. Over the past few decades, however, we have faced continuous growth of the number of new disciplines and specialties (i.e., internal differentiation), resulting in increasing dynamism and "intensification of the interactions between […] disciplines" (Stichweh, 2003, p. 85).
General classification systems such as the Web of Science ( WoS) Subject Categories (SC) or the OECD's Fields of Science are too broad to adequately capture the more complex, finegrained cognitive reality. Several concerns have been raised in this regard-here we mention a n o p e n a c c e s s j o u r n a l Citation: Eykens, J., Guns, R., & Engels, T. C. E. (2021). Fine-grained classification of social science journal articles using textual data: A comparison of supervised machine learning approaches. Quantitative Science Studies. Advance publication. https://doi.org/10.1162/qss_a_00106 two central ones. First, Glänzel, Schubert, and Czerwon (1999) point out that the SC approach on the journal level works well for classifying publications in highly specialized journals, but that it is problematic for those appearing in multidisciplinary or general journals. Second, research such as Waltman and van Eck's (2012) large-scale clustering study, grouping publications based on their citation relations, indicated the feasibility of more fine-grained classification schemes. The authors cluster documents on three different levels, the most detailed of which can be conceived of as "small subfields" and consists of 21,412 clusters. While most bibliometric studies still make use of more general classification schemes for publications, these are limited in scope, only indicating broad scientific fields or general disciplines. Empirical studies such as the one conducted by Waltman and van Eck (2012), as well as theoretical arguments raised by sociologists of science, amplify the need for fine-grained classification schemes. More recently, Ahlgren (2018, 2019) have shown that fine-grained specialized communities can be determined based on citation relations, and these communities in their turn might possibly exhibit specific citation and publication practices.
In Flanders, the Dutch speaking region of Belgium, the Flemish Research Discipline Standard ("Vlaamse Onderzoeksdiscipline Standaard" or VODS) has been introduced to facilitate a detailed classification of research, including research output (Vancauwenbergh & Poelmans, 2019a, 2019b. The VODS builds upon the OECD Fields of Science (2007), adding two more fine-grained levels. While the introduction of the VODS will open up new possibilities for understanding knowledge production and dissemination on a more detailed level, it also poses important challenges, the classification of publications in the social sciences being one of them.
Current bibliometric approaches to classification of publications are not entirely fit for the social sciences. This mainly has to do with lack of coverage in major citation databases (Ossenblok, Engels, & Sivertsen, 2012) and differences in publication and citation practices within the fields (Kulczycki, Engels et al., 2018;Nederhof, 2006). One possible way to address these concerns is including nonsource items in citation-based bibliometric maps (Boyack & Klavans, 2014). An alternative solution is making use of text-based methods.

Using Textual Data and Machine Learning to Cluster or Classify (Social Science) Publications
Compared to classification approaches making use of reference/citation data (or other metadata), the usage of purely textual information (titles, abstracts, full-texts, etc.) has thus far received less attention. Nevertheless, the theoretical relevance of an article's textual content for this task has already been emphasized since the seminal work by Rip and Courtial (1984). Michel Callon and colleagues have further developed this long tradition of co-word analysis research, which aims to map and describe scientific interaction and the formation of specialist communities (Callon, Courtial et al., 1983;Callon, Courtial, & Laville, 1991). More recently there has been a resurgence of interest in textual data, mainly due to increased computing resources and availability of potential data sources.
Machine learning methods currently spearhead a lot of research that is based on textual data. We can distinguish between supervised and unsupervised approaches. In unsupervised learning, no predefined classes or categories are available to learn from. Supervised learning, on the other hand, starts from a set of predefined categories, each of which has a number of instances or records assigned to it. An algorithm is then trained on these labeled instances, from which it tries to deduce the common characteristics of instances in each category to predict to which category a new, unseen instance might belong. The present article uses such a supervised approach.
In scientometric studies, unsupervised clustering of documents is common. Hybrid approaches to document clustering in which citation information and textual data are used have shown that adding textual information can ameliorate the outcomes of document clustering (see for example Janssens, Zhang et al., 2009;Yau, Porter et al., 2014). Unsupervised clustering of documents based only on textual similarity (Boyack et al. 2011) has gained traction in the bibliometric community as well. Arguably, supervised ML has been less popular, presumably because in most scientometric clustering studies a granular ground truth classification at the article level is lacking.
An exploration of supervised ML algorithms combined with basic NLP techniques has been described by Read (2010), who used supervised learning to classify documents in, among others, the Ohsumed data set, part of MedLINE. The author reports F1 scores for different multilabel classification techniques, ranging from 0.1 up to 0.43. Classifier Chains (CCs) are proposed by Read (2010) as a possible solution to the task of multilabel, multiclass classification problems. The latter are tasks in which a document can be assigned to multiple categories at the same time. This kind of learning task is considerably more challenging than the single label classification problem.
Recent supervised ML algorithms with neural networks and word embeddings or BERT (Bidirectional Encoder Representations from Transformers) models, respectively, have also been used to vectorize and classify scientific documents. While these recent studies do not deal with multilabel, multiclass classification, they are relevant in that they apply these relatively new NLP techniques to vectorize scientific publications. Kandimalla, Rohatgi et al. (2020) report on a large-scale classification study in which they categorize papers according to WoS Subject Categories by making use of neural networks and word embedding models. The authors show that such classification systems work well, achieving an average F-score of 0.76. For the individual SC, the scores range from 0.5 to 0.95. In this study, however, the subcategories with too few records are merged or omitted from the analysis, as they "decrease the performance of the model." Documents that are labeled with more than one category are also dropped. The authors conclude that their experiment shows that the supervised learning approach scales better than citation clustering-based methods. Dunham, Melot, and Murdick (2020) train SciBERT classifiers on arXiv metadata and subject labels. This model is then used to identify AI-relevant publications in WoS, Digital Science Dimensions and Microsoft Academic. The authors report F1 scores ranging from 0.59 to 0.86 for the four categories within the field of AI.
Annif, an automated subject indexing tool currently being tested and implemented at the National Library of Finland, is also comparable to our approach (Suominen, 2019). Annif annotates terms from different subject vocabularies and thesauri to documents based on textual information, such as abstracts and/or titles. The ML module consists of an ensemble of classification algorithms. Annif annotates documents on a granular level, as the tested module was able to assign up to five indexing terms to documents. The module was evaluated on four corpora, including both academic and nonacademic texts, yielding F1 scores ranging from 0.14 to 0.46.
The present paper is an extension of work presented at ISSI 2019, where we applied supervised ML to classify sociology publications into subdisciplinary categories (Eykens, Guns, & Engels, 2019), reaching 81% accuracy. Note, though, that that paper only worked with publications assigned to one specialty. In this article, we study the use of textual data to classify publications from three social science disciplines into one or more subdisciplines. Much like Read (2010) and Kandimalla et al. (2020), we thus primarily aim to exploit textual characteristics of (social science) documents to categorize them into predefined disciplinary categories. As we will describe in more detail further on, we aim to categorize these social science abstracts into granular subcategories. Multiple categories can be assigned to one document at the same time. The novelty of this study resides in the fact that we have used a procedure to validate the data collected for our ML experiment, and that multiple granular subdisciplinary categories can be assigned to one single document.

Outline
In Section 2 we describe the classification scheme used in detail. Section 3 describes the data sources used (Sociological Abstracts, ERIC, EconLit), and the collection and processing procedure. We have developed a structured way of collecting and validating textual data based on well-established disciplinary thesauri in tandem with a validation round by experts from the respective fields. This validation procedure is discussed in Section 3.3. Next, Section 4 further details the supervised ML algorithms and feature extraction techniques that we compare. Section 5 describes the results of the comparison, where we evaluate performance on two dimensions: the individual labels and the instances. Finally, we discuss our ML setup and contrast our approach to existing automatic classification techniques. We conclude with some reflections and pathways for future research, and briefly discuss practical applications.

THE FLEMISH RESEARCH DISCIPLINE STANDARD ( VODS)
We make use of the Flemish Research Discipline Standard ("Vlaamse Onderzoeksdiscipline Standaard", abbreviated VODS, in Dutch), which is available at https://researchportal.be/en /disciplines and has been described in the literature (Vancauwenbergh & Poelmans, 2019a, 2019b. The VODS was introduced in the Flemish Research Information Space (FRIS, see https://researchportal.be/en), an aggregation platform of publicly funded research in Flanders, in 2019. In the future, all scientific output produced by scholars in Flanders may be classified according to the VODS. The VODS is structured as a hierarchical tree with four levels. To allow for international comparison, the first level overlaps with the seven broad fields of science present at the highest level of the OECD Fields of Science (OECD, 2007) coding scheme (hereafter referred to as OECD FOS). For the case of sociology, for example, at the top level of the OECD FOS we find category 5 "social sciences" and subcategory "5.4 sociology and anthropology" (Figure 1). This category is present in the VODS as well.
The VODS adds two more granular layers representing further subdivisions of the second layer of the OECD FOS. The third layer of the VODS might be interpreted as containing subdisciplinary categories, while items on the fourth level can be considered research specialties. To construct and define this scheme, experts from the corresponding fields were consulted by the creators of the VODS. In total, on the most granular level the VODS contains 2,493 codes. For further technical details on this classification scheme, we refer interested readers to Poelmans (2019a, 2019b). Our objective is to automatically classify articles (based on abstracts and titles) into categories at level 3 of the coding scheme (e.g., 050402 Applied sociology and/or 050405 Social Change and/or …) for three fields within the social sciences, namely (0502) economics & business (10 classes at the third level), (0503) pedagogical & educational sciences (nine classes at the third level), and (0504) sociology & anthropology (12 classes at the third level). At level 3, we have 31 subdisciplinary categories for the three disciplines together. Section 3.3 will further detail the reasons why our approach operates at level 3 rather than level 4. In the following part we introduce the data sources used to collect the titles and abstracts for the three disciplines.

DATA SOURCES: SOCIOLOGICAL ABSTRACTS, ERIC AND ECONLIT
The data used for our study were downloaded from ProQuest (https://search.proquest.com). ProQuest provides good journal coverage of the social science literature compared to, for example, Scopus or WoS (Norris & Oppenheim, 2007). ProQuest offers access to a range of existing abstracting services and disciplinary databases. For the purpose of our analyses, we have used Sociological Abstracts to download bibliographic records from sociology & anthropology, EconLit for records from business & economics, and ERIC for records from the pedagogical & educational sciences.

Combinations of Indexing Terms as Proxies for Subject Specialties
A clear advantage of all three databases is that they make use of controlled vocabularies (or thesauri) for the records which are indexed. The Thesaurus of Sociological Indexing Terms is a well-developed and highly regarded indexing system used by Sociological Abstracts' service. Within EconLit, the Journal of Economic Literature ( JEL) classification, also known as the American Economic Association Classification System, is used. Within ERIC, the Thesaurus of ERIC Descriptors is used. In addition, ProQuest's search engine allows us to filter on publication types and publication years. We selected all journal articles published between 2000 and 2018. These controlled vocabularies allow us to query ProQuest's command line search page for abstracts on a very fine levels of granularity. Figure 2 shows an example of the query we used for the category "law & economics," within business & economics. The full set of queries for all categories is available online (Eykens & Guns, 2020).
To design the queries, we manually coupled the indexing terms of these different data sets to the fourth level categories in the VODS and downloaded the abstracts found for each query. The first author went through the list of VODS categories and per category collected all relevant indexing terms from the thesaurus at hand.
The VODS provides semantic definitions of each category, which were formulated together with field experts. We used this information to manually retrieve the relevant indexing terms. In many cases, this was straightforward, because there was a perfect overlap with the indexing terms (for EconLit, this was the case for nearly all categories). In other cases, some additional indexing terms were found to be relevant (see for example Figure 2).
The indexing terms were then used to query ProQuest. The records retrieved for each of the 214 level 4 categories were subsequently downloaded (with an upper limit set to 1,000 records per VODS level four category) and saved in a separate folder, which was labeled with the corresponding VODS category. The data collection was carried out between December 2018 and February 2019. After collecting and processing, the merged sets (all files for the three fields together) resulted in a raw set consisting of 148,341 records (see Table 1).

Data Cleaning and Processing
To clean the raw data set, we followed a protocol consisting of four steps. At step 1 we removed all records that were missing an abstract or title. Although we limited our search to records published between 2000 and 2018 (step 2), there were still some in our data set that were published before 2000 or after 2018. These were omitted as well. Lower and upper boundaries were set for the word count of the abstracts (step 3): minimum 50 and maximum 1,000, respectively. These limits were found to adequately weed out cases where the abstract field replicated either the title or the entire full text.
Whereas we expected only journal articles resulting from our queries, other publication types were present as well. The reason for this might have to do with the fact that all three data sets have been designed by different organizations, which results in a diverse range of variable names to describe the different publication types used within the data sets. At step 4, for each data set, we compiled a list of unique variable names present in the collected records and filtered out those describing publication types that we did not want to take into consideration (e.g., book reviews, interviews, editorial material, instructional material). A list of the remaining, relevant publication types was used to restrict our data set to research articles published in journals. Table 1 provides an overview of the number of records in each set before and after cleaning. For Sociological Abstracts and EconLit, our initial collection of records was reduced by a little over 20%. For ERIC, the total number of records was reduced by almost 40%. The large intergroup difference observed is mainly due to a large number of records classified as "instructional material" in ERIC. The large intragroup difference is due to the smaller number of subcategories present in the VODS. For business & economics we queried 84 categories, for pedagogical & educational sciences 53 categories, and for sociology & anthropology 77 categories.
As discussed above, we have designed queries for each level 4 category in the VODS and collected records from the respective databases. Some records appeared multiple times-that is, some records were retrieved with different queries. After deduplication and relabeling, the

Expert Validation: Interindexer Consistency and F1 Scores
To validate the reliance on controlled vocabularies described above, a domain expert from each of the three disciplines was contacted. The three experts were given a random sample of 45 abstracts and titles, which they were asked to classify according to the VODS level 4 categories corresponding to their field of expertise (i.e., sociology & anthropology, business & economics, and pedagogical & educational sciences). Each expert was presented a set of abstracts and titles from their own discipline. The expert working in the field of business & economics, for example, was given abstracts and titles originating from EconLit (business & economics) only. No limitations were set on the number of categories the indexers were allowed to assign.
Next, the classification by the experts on the one hand and the classification based on the controlled vocabularies of each database on the other were compared. To do so, the interindexer consistency (IIC) was calculated for each record in the sample. For every sample, we calculated the average IIC using the method described by Rollin (Leininger, 2000). First, a percentage of consistency between two indexers (here: the expert and the controlled vocabulary) is calculated for each document d: Here, A denotes the number of categories on which both indexers agree, B is the number of categories assigned by indexer 1 (expert) and C the number of categories assigned by indexer 2 (controlled vocabulary). The IIC at document level is the Dice coefficient of the two sets of categories assigned by the indexers. The average IIC for the whole sample is calculated by dividing the sum of the IICs for all individual documents by the total number of documents N (in our case, equal to 45). In addition, we calculated F1 scores for each disciplinary sample. We have calculated these scores for levels 3 and 4 of the VODS. Table 2 displays the results of the IIC and F1 calculations. The F-scores are also included for the assessment of the performance of the ML models. On level three of the VODS, the IIC varies between 45.2% for the sample from Sociological Abstracts and 62.2% for EconLit. On level four, the IIC scores are considerably lower, with a minimum of 23.7% for EconLit and a maximum of 39.7% for ERIC. Previous research into IIC in the case of the PsycINFO database shows similar results to those obtained for level 3 of the VODS. Leininger (2000) evaluates IIC for a similar classification scheme, based on research areas within psychology. Using Rollins' method, he finds an average IIC of 45% (Rollin, 1981, as cited in Leininger, 2000). Sievert and Andrews (1991) study the IIC for a subset from Information Science Abstracts. The authors report average consistency scores of about 50%. Funk and Reid (1983) study the IIC for MEDLINE. They report a consistency score of 61.1% for the MeSH terms assigned to documents. While our scenario is somewhat different (i.e., the first author "reclassified" publications according to the indexing terms and experts were consulted to validate this reclassification), it seems that these low scores rather indicate the difficulty of the problem at hand. Therefore, we conclude that the level 3 classification is sufficiently robust to be used for our ML learning experiment, and hence we limit ourselves to the classification of journal articles at this level. For matters of interpretation of the differences in scores, at level 3 we have 31 subdisciplinary categories in total, compared to 214 research specialties at level 4.

METHODS
For 30 years, the dominant paradigm of text classification (TC) has consisted of ML approaches. ML algorithms are deployed such that "a general inductive process automatically builds an automatic text classifier by learning, from a set of preclassified documents, the characteristics of the categories [or labels] of interest" (Sebastiani, 2002, p. 2). ML approaches have already been applied to classify abstracts or full texts of journal articles. Langlois, Nie et al. (2018) classify papers into two broad domains: empirical and nonempirical. Our approach is different from such studies as the level of granularity of categories into which we classify texts is far greater. Consequently, articles from two different level 3 subdisciplinary categories are overall much more similar than what is encountered in most other classification tasks.
The classification problem discussed in this paper belongs to the domain of multiclass multilabel classification. Multiclass classification refers to assigning one of more than two classes to an instance. Multiclass multilabel classification is an extension of this problem where we assign one or more of multiple classes to an instance (Read, Pfahringer et al., 2009). Some abstracts were thus assigned to multiple classes (up to a maximum of six).
A popular strategy is to transform the multilabel problem into different single-label classification tasks. This can be done making use of binary relevance. As a baseline classifier, we make use of Multinomial Naïve Bayes (MNB). We optimize this classifier to explore the best feature engineering techniques as described below. Next, we compare the results obtained with MNB to those obtained by a Gradient Boosting (GB) model. After discussing the feature engineering steps in the following part, we will present a short description of the algorithms and the metrics that were used to evaluate performance on different aspects.

Feature Engineering
Feature engineering for multilabel TC is done in the same way as for single-label TC. The "features" or columns of the matrix are representations of words in the abstracts and titles of the publications. The Bag of Words approach (BoW) is a traditional, popular, and simple yet powerful way of vectorizing documents for TC. The BoW approach consists of slicing a text into words or phrases (without taking word order into account). We have built customized tokenizer functions in Python to extract four different textual features: lemma unigrams, lemma bigrams (combined with unigrams), nouns, and noun phrases (Figure 3). Although previous research has shown that for the BoW approach more advanced document representations such as nouns and noun phrases are "not adequate to improve TC accuracy," we wanted to explore this for our specific use-case (Moschitti & Basili, 2004).
We made use of the natural language processing packages NLTK (Loper & Bird, 2002) and SpaCy (Honnibal & Montani, 2018) to parse the texts and to perform part of speech tagging and stemming. For stemming, we made use of NLTK's implementation of the snowball stemming algorithm. Scikit-learn's count vectorizer and TF-IDF (term frequency-inverse document frequency) transformer were used to process the outcomes of the different feature extraction methods (Pedregosa, Varoquaux et al., 2011). With each tokenizer, we tested the performance of both (normalized) TF and TF-IDF. This resulted in eight different feature spaces (see Figure 4).
Feature sparseness is a common problem in TC. Transformation methods that make use of bigrams can easily bring about feature matrices with hundreds of thousands or even millions of columns, leading to very high dimensionality. To reduce the dimensionality, we make use of a feature selection method based on randomized decision trees. After extracting textual features, we fit a shallow extra trees classifier (maximum depth of 10) to the data to select the most relevant ones.

Multinomial Naïve Bayes (MNB)
MNB is one of the most popular TC algorithms used by the ML community. It is a fast, scalable (i.e., iterates very fast over large data sets), and successful approach for many TC problems. Over the years, it has become a popular baseline method, "the punching bag of classifiers" (Lewis, 1998, p. 2). MNB makes use of Bayes' theorem to construct histograms based on the feature vectors-in our case counts or probabilities of the textual features present in a document-for every single instance. The classifier associates these histograms with the labels and estimates likelihoods of a label and a distribution of feature counts occurring together.
If, however, a feature-class combination has zero counts, the probability will be set to zero. This mitigates the necessary information of the other probabilities by multiplying them by zero. For the algorithm to be able to deal with such problems a smoothing parameter is used. Another way of  dealing with this problem is transforming the feature space into a TF-IDF normalized matrix (Rennie, Shih et al., 2003).

Gradient Boosting Decision Trees (LightGBM)
Gradient Boosting Decision Trees (GBDTs) are, as the name indicates, tree-based learning algorithms. These algorithms build ensemble models, or groups of decision trees aimed at reducing residual errors for a split point in a decision tree. Boosting is a specific ensemble technique that sequentially builds the models on random subsets of the features and instances. When an instance is misclassified, its weight is increased and the next model tries to correct for this error.
In practice, this algorithm can be very time-consuming. Ke, Meng et al. (2017) have come up with a solution to this problem by optimizing the randomness of the feature and instance selection step. They combine Gradient-based One-Side Sampling (GOSS) with Exclusive Feature Bundling (EFB) to speed up the training process. The GOSS procedure pays more attention to instances with larger gradients (i.e., having more impact on the classification error of a model) "and randomly drop those instances with small gradients" (Ke et al., 2017, p. 2). This approach is implemented in the LightGBM software package (https://lightgbm.readthedocs.io).
The EFB implementation exploits feature sparseness, which is a very common problem in TC. It bundles sparse features together into a single feature, efficiently reducing the dimensionality. In a previous study, we have found the LightGBM implementation of GB to be the best performing algorithm to classify publications in sociology & anthropology, achieving accuracy scores well over 80% (Eykens et al., 2019). Different from this previous study, in this paper we assess classifier performance for a vastly more complex multilabel setting.
Decision tree-based models, however, come at a cost. They require tuning a wide range of parameter settings. For LightGBM, one can set well over 100 parameters 1 . For our purposes, we have chosen to optimize for 11 core parameters: • the number of trees that will be built; • the maximum depth of the trees: to limit tree growth; • the number of leaves of the decision trees: last splits made in the model when reaching the optimal number of splits for a given loss function, or when reaching the predefined maximum depth; • the learning rate: sets the weight of the outcomes of each tree for the final output; • maximum in bin: handles the maximum number of bins in which the feature values will be grouped; • regularization alpha (L1): limits the impact of the leaves encouraging sparsity (i.e., weights to zero); • regularization lambda (L2): limits the impact of the leaves by encouraging smaller weights; • minimum child weight: the minimum sum of instance weight which is needed in a leaf (child); • bagging fraction: the fraction of the data set used for each iteration; • bagging frequency: the number of trees training per random subsample of the data set; and • minimum data in leaf: the minimum number of samples which should be captured in a leaf.
We will describe how we optimized the parameters in Section 4.2.4. 1 For a complete overview of the parameters used in LightGBM, see https://lightgbm.readthedocs.io/en/latest /Parameters.html.

Multilabel classification: Classifier Chains (CC)
Two main approaches to multilabel classification exist: problem transformation and algorithm adaptation. The most popular and computationally least expensive approach is problem transformation, where a multilabel classification problem is transformed into N single-label classification problems. An example of problem transformation is turning the multilabel task into N-labels binary classification problems, wherein each binary classification problem is treated by a separate classifier. This is also known as the Binary Relevance method and has proven success in the domain of multilabel TC (Zhang, Li et al., 2018).
As each label is treated separately, however, the algorithm effectively ignores label dependence. Read et al. (2009) have suggested an improvement of the Binary Relevance method by "chaining" the results of each classifier to the input space so that the next training round takes the results of previous classifiers into account. As different disciplinary categories might be closer to each other in terms of concepts and topics studied, we do not expect labels to be completely independent of each other. Hence, we opt for the CC approach. It should be noted that other approaches exist, but these come at a cost of computational complexity as well as intuitive understanding of the models.

Cross-validation
After vectorization, dimensionality reduction, and problem transformation with a binary relevancebased CC algorithm, a holdout set (25% of the complete data set) was sliced from the initial data set using an iterative stratification technique as proposed by Sechidis, Tsoumakas, and Vlahavas (2011). This stratification method handles class imbalance for multilabel learning problems in such a way that the distribution of instances over classes in the validation set is kept as close to the actual distribution as possible. Figure 5 visualizes the cross-validation procedure. The test data (0.25 of the total set) will be used for the final evaluation of our models. For each iteration, a different subset of the remaining 75% of the data (training data) are used to evaluate different parameter settings for the feature engineering options presented above. We make use of randomized parameter grid search and threefold cross-validation to evaluate different parameter settings on parts or "folds" of the training data. This means we run three new random experiments, each of which again divides the training data into two different parts, using 66.66% of the training data to train a model with a random parameter setting, and evaluating that setting on unseen data (the darker grey area represented above). We make use of three different slices of training and test data to make sure that our findings are robust.

Evaluation Metrics
Evaluating the performance of multilabel classification is not as straightforward as is the case for single-label classification. Single-label classifiers' predictive performance can be evaluated Figure 5. Visualization of training set-validation set folds and test data. Lighter grey represents training samples, and darker grey represents validation samples. using the accuracy measure (i.e., the fraction of correctly classified instances over the total number of instances). The Accuracy Acc is calculated as follows: I here is the indicator function. Y i is the set of true labels (subdisciplinary categories, in our case) for document i, andŶ i is the set of predicted labels for document i. For multilabel classification assessing such a score based on the full set of labels per instance would be too harsh, "since even a single false positive or false negative label makes the example incorrect" (Read, 2010). Using multiple metrics to capture different dimensions of the multilabel prediction is advised (Read, 2010;Zhang & Zhou, 2014). Two main dimensions can be assessed: the individual labels and the entire training or testing label sets per instance (Zhang & Zhou, 2014, p. 1822. For full label set evaluation, we calculate accuracy, and for label-based evaluation, we calculate precision and recall.
Per label ℓ from the set of labels L, we can determine the set test ℓ of documents to which this label has been assigned and the set pred ℓ of documents for which the classifier predicts this label. Weighted average precision P is determined as follows: where |·| denotes set cardinality. Similarly, weighted average recall R is: The F1 score is the weighted average of precision and recall. Precision and recall are first "macroaveraged" by calculating the weighted mean of precision and recall for each label, and these are used to calculate the final F1 scores. These measures give an indication of the performance of our algorithm across the three different disciplinary data sets. Precision (Eq. 3) in a multilabel setting is "the fraction of predicted relevances which are actually relevant" (Read, 2010, p. 41). In addition, Schapire andSinger (2000, as cited in Tsoumakas andKatakis, 2009) propose Hamming Loss to take into account the fraction of labels that are predicted incorrectly. Hamming Loss is calculated as follows (see Sorower, 2010): Here, È is the exclusive-or operator, y i,k is 1 if document i has label k and 0 otherwise, and similarly,ŷ i,k is 1 if document i is predicted to have label k and 0 otherwise. We average these scores over the total number of classes |L| and predictions N. Hamming Loss thus denotes the fraction of incorrectly predicted labels and its optimal value is 0.

RESULTS
In the first part, we present the best results obtained for Multinomial Naïve Bayes. As detailed above, we have vectorized the abstracts and titles making use of three slightly different textual characteristics: lemmas, nouns, and noun phrases. Because of the computational requirements, the ML steps were carried out on the High Performance Computing infrastructure of VSC (the Flemish Supercomputer Center) at the University of Antwerp.

Multinomial Naïve Bayes
For the Multinomial Naïve Bayes classifier, we aim to optimize the smoothing parameter alpha. We randomly sample a value from a loguniform distribution, ranging from very small (i.e., 10 −10 ) up to 1 (i.e., add-one or Laplace smoothing). After finding the optimal value for alpha (0.13883) by fitting the algorithm to the three folds of the training set, we make a prediction for the holdout test set. The results for the best feature representation method are presented in Table 3. The optimal representation strategy turns out to be lemma bigrams without IDF normalization.
Making use of bigrams for lemmas decreased the Hamming Loss and increased the other scores. We achieved quite similar results with TF-IDF transformed vectors. Interestingly, noun phrases, except for the Hamming Loss evaluation metric, do not yield improved results.

Gradient Boosting (LightGBM)
For the GB algorithm, we randomly sample values for 11 different parameters. For a more detailed explanation of these parameters, refer to Section 4.2.2. To reduce computing time, we limited the number of random iterations to 100. If we were to perform a full parameter grid search, the number of model fits would be far too high. Keeping in mind that 25 fits take about three hours, this is not desirable.
Compared to the best results achieved with MNB, the GB implementation scores better on almost all evaluation metrics, except for precision (see Table 4). It is interesting to note that MNB scores better for the precision metric in some scenarios. Accuracy scores, however, strongly increase for GB, and Hamming Loss also falls back. The same feature transformation strategy seems to work best for GB. For the lemma bigrams feature extraction method without IDF normalization we achieve an F1 score of 0.55. Hamming Loss is considerably lower as well, with a fraction of 0.3% of the labels wrongly assigned. 46% of the label combinations predicted by the algorithm were the same as those in the test set. It is noteworthy that the differences between TF-IDF and TF feature transformations are insignificant. Figure 6 shows how the F1 scores are distributed across all 31 subdisciplines. We observe that the scores for GB are not only higher on average but also less spread out, with the exception of three poorly scoring subdisciplines. These three are all subdisciplines of educational & pedagogical sciences: Informal learning, General pedagogical & educational sciences, and Parenting & Figure 7. Relation between the number of records and F1 scores for MNB and GB for each of the 31 subdisciplines studied. Preprocessing: lemma bigrams, no IDF. family education. Except for these subdisciplinary categories, overall no discipline performs clearly better or worse than the others, although the number of training records seems to have some influence: subdisciplines with fewer training records tend to get lower F1 scores (Figure 7). While this relation is somewhat stronger for MNB, the three cases for GB with exceptionally low F1 scores all have few (between 174 and 780) records.

DISCUSSION
Classifying research output into disciplinary categories is of fundamental importance for nearly all bibliometric analyses. In the introduction to this paper, we touched upon the issue of differentiation in the sciences, leading to an ever-increasing number of research communities and disciplines (Stichweh, 2003). This emergence of new disciplines can lead, among other things, to the formation of new research specialties, the organization of new conferences, the formation of new scientific societies and the foundation of new journals (see Shneider, 2009). As the landscape of disciplines grows more diverse, classification schemes are being updated to better fit this dynamic reality.
The development of such an updated classification scheme is exemplified by the implementation of the VODS in Flanders (see Vancauwenbergh & Poelmans, 2019a). Such a diverse and fine-grained classification scheme makes it possible to study interactions between disciplines (i.e., inter-and intradisciplinary knowledge flows) more closely, and map discrepancies between different classification systems with more detail. Yet, it requires new ways of approaching classification tasks as well, in particular in settings such as the classification of expertise, projects, and outputs for which citation data are not available. In this article we take up the specific challenge of a fine-grained classification of social sciences journal articles using the text of their abstracts and titles.
To summarize, our study consists of three elements. First, we constructed a labeled data set. As the VODS classification scheme is relatively new, we lack a data set of classified publications or other documents that can readily be used for ML purposes. This led us to manually construct a training data set consisting of data extracted from EconLit, ERIC, and Sociological Abstracts. Each of the 31 VODS subdisciplines of economics & business, pedagogy & educational sciences, and sociology & anthropology was translated to a thesaurus-based query for the respective databases. Second, the query results were validated by human experts. IIC and F1 scores indicate that categories at level 3 (subdisciplines) and 4 (specialties) of the VODS can sometimes be hard to distinguish between. At the same time, the IIC scores for level 3 categories are comparable to those obtained in earlier IIC studies.
Third, the labeled data set at level 3 was used to train Multinomial Naïve Bayes and GB ML models. If we compare Figure 6 to Table 2, the configuration with the best results yields F1 scores slightly below those for the validation by human experts. This indicates that the models might still be improved somewhat, but very high scores are probably unrealistic or indicative of overfitting. Taken together, the results suggest that level 3 of VODS is so fine grained that some categories are hard to discern in practice and as a result a certain degree of ambiguity becomes unavoidable, at least for the disciplines studied here.
While some of the reported indicators, such as F-scores, are relatively low, we think it is instructive to compare our results to those of the recent studies by Kandimalla et al. (2020) and Dunham et al. (2020). While these authors report better accuracy, it should be highlighted that in this paper we specifically look at the applicability of supervised learning in the context of social sciences. As Kandimalla and colleagues note, this is not an easy task given the large overlap in terminology and the proximity of the categories. Kandimalla et al. (2020) have for that reason dropped or collapsed 120 out of 235 of the SC from their data set. In addition, they drop documents assigned to multiple disciplines. It should be noted that WoS SC are less granular than the ones used in our study (i.e., at the level of disciplines instead of subdisciplines). Dunham et al. (2020) report good scores for their model, which classifies AI publications into subdisciplinary categories, but their model is restricted to only four categories in AI, hence it is also less prone to errors. Our system works with 31 subcategories, divided over three social science disciplines. Taking these elements into account, it becomes clear that the lower scores are to a large extent a result of the difficulty of the task at hand.
A matter of concern that can be raised in this regard is to what extent classification of documents at a level of granularity that is finer than that of disciplines is feasible. Disciplines, and especially subdisciplines and research specialties, are in constant flux. Whereas most publications might belong to the knowledge base of just one discipline, their contents may be of relevance to two or more subdisciplines and research specialties. Theoretical work such as actor-network theory in the social sciences, for example, has been of relevance for many disciplines, subdisciplines, and research specialties, not only in the social sciences. Interdisciplinary studies, in which an integration of different disciplinary knowledge sources takes place to tackle a research question, may classify under several research specialties, subdisciplines, and disciplines. As these examples illustrate, a multilabel approach, as applied in this paper, is needed in view of the validity of a classification.
This framework requirement needs to be balanced with requirements in terms of the accuracy, feasibility, and reliability of a classification scheme. As the results of our study show, the classification of social sciences publications into subdisciplines ( VODS level 3) on the basis of abstracts and titles is a hard task for both humans and machines; classification into research specialisms ( VODS level 4) probably is not all that meaningful any more (cf. the IIC and F1 scores in Table 2). We argue that classification at the subdiscipline level should be further explored and fine-tuned, as this level of granularity corresponds to actual policy needs and might be improved by smart combinations of human input and ML. For example, a recommender system might be improved through validation by the authors of papers and machine classifications might gain accuracy through the use of larger sets of texts describing expertise, projects, and publications classified by humans.

Limitations
Four limitations of this paper should be highlighted. First, we could not compare our results to any benchmark. Although there have been some experiments in which supervised ML techniques are used to classify (or study elements of ) scientific articles (see for example Langlois et al., 2018;Matwin & Sazonova, 2012), to date no comparable applications or data sets exist (i.e., medium-sized annotated sets of social science publications classified according to fine-grained disciplinary categories)-at least not to our knowledge. The lack of previous work in this line of research makes it hard to benchmark our results for this specific problem setting.
Second, given that the records in our data set were extracted from EconLit, ERIC, or Sociological Abstracts, each record has been assigned to only one (but possibly multiple subdisciplines of the same) discipline of the VODS level 2 (i.e., to economics & business, to pedagogy & educational sciences, or to sociology & anthropology). Hence, interdisciplinary cases are not present in our initial training data. We cannot compare the performance of the models deployed in this study at different levels of granularity, in particular the discipline and subdiscipline levels. However, our results do show that the subdiscipline level is, at least for articles in social sciences and using their abstracts and titles only, the most fine-grained level that makes sense for classification exercises.
Third, we have coupled classification systems with two entirely different functions. On the one hand, we have the indexing systems based on the thesauri. These are systems that are designed for information retrieval purposes and have no limit to the number of indexing terms that can be assigned to a document. In such a system, there is no purpose in trying to fit a document into one to six subdisciplinary categories. Thus, we have reduced the complexity and granularity of the thesaurus-based classification to a fixed number of disciplinary groups. This "mismatch" between the two classification systems might lead to relatively low-scoring results when an ML algorithm is tasked with reproducing this classification.
Fourth, as discussed in Section 3.1, the queries have been manually constructed by the first author. The indexing terms in the thesauri were coupled to VODS discipline codes based on the semantic definition of each field in the VODS. It can be argued that this is a highly subjective task, as previous research has shown that disagreement between indexers when annotating records with indexing terms is commonplace. For many categories, however, the indexing terms nicely overlapped with the categories of the VODS. This gave us confidence in the construction. As the expert validation yielded results comparable to previous exercises of this kind, we believe this procedure to be of sufficient quality to allow for an automated (re)classification experiment. On the other hand, one can also interpret the relatively low IIC scores as indicative of the inherent ambiguity at this level of granularity.

Future Research and Practical Applications
The use of a minimum of textual data makes the approach presented in this study practical to generalize to other data sets (e.g., projects and project applications). Using additional bibliographic metadata would presumably increase the performance of the classification algorithms. Full-text documents would be an interesting path forward, yielding more textual data and a better sensitivity of TF-IDF transformations. In addition, it would be interesting to study ambiguities of the classification resulting from the predictions made by the algorithm and study those in detail.
With regard to the ML modules used, we acknowledge that more advanced and complex language-processing techniques have a good track record when it comes to automatically classifying text documents (e.g., BERT and related models). Dunham and colleagues (2020) have shown that SciBERT models outperform other NLP methods when applying them to classify publications in the field of Artificial Intelligence. For our purposes, however, we have opted to keep the setup relatively straightforward. The main motivation behind this study was to investigate and compare the feasibility of using supervised ML algorithms for this particular, challenging fine-grained classification task. We leave comparisons of other methods and feature transformation procedures for future research.
Questions surrounding the properties of interdisciplinarity demand for a clear operationalization of disciplines, which is not straightforward. This is in itself also the main reason why many different classification schemes are used in different contexts, each pointing to insights about different aspects-organizational, cognitive, etc.-of a discipline (Guns, Sı le et al., 2018). Textual approaches might lead to other insights regarding the cognitive structure of disciplines, but these same disciplines are in constant flux (Yan, Ding et al., 2012). A fixed classification scheme will not meet future developments in science; "… human assigned subject categories are akin to using a rearview mirror to predict where a fast-moving car is heading" (Suominen & Toivanen, 2016, p. 2464. To this end, the team working on the VODS has provided a "not elsewhere classified" category for all the subfields (Vancauwenbergh & Poelmans, 2019a). This particular category has not been studied in this article. Future deployments of the classification system in Flanders will allow researchers to identify themselves and/ or their projects with this category and assign documents to it, and, following from this, we could study text residing in these categories to discover emerging research problems and topics.
Once researchers employed by the Flemish universities start to label their expertise, projects, and outputs using the VODS, supervised ML algorithms can be trained on a broader range of disciplinary categories, allowing for a broader evaluation of the method proposed in this paper. This approach will enable us in practice to assist with annotating unlabeled work, or it can serve to underpin an online recommendation system for researchers, embedded in current research information systems. The output of supervised text classifications can also be compared to other existing classification schemes. We can, for example, contrast the publication level classification with journal level classifications of the same publications to study the disciplinary or interdisciplinary diversity of journals.
Finally, we should highlight that measuring IIC consistency is not straightforward. While there exists a long tradition of research that makes use of scoring systems such as IIC or F1 scores to assess the reliability and functionality of classification systems, there have been attempts to include semantic relations between indexing terms or categories to develop more realistic measures of indexing accuracy. Medelyan and Witten (2006) propose calculating the cosine similarity between word vectors of vocabularies or semantic definitions of categories. This is an interesting approach, but, to our knowledge, there are no systematic comparisons with other scoring systems available to date. It would be interesting to use such an approach when assessing classification systems in which a semantic definition of categories is available. The classification error could for example be weighted by the cosine distance between sentence embeddings of semantic definitions of the disciplinary categories. If a classification error is made whereby two distant categories are mistaken for each other, then the error is greater than when these categories are closer to each other in terms of cosine distance.

CONCLUSION
In this article we present a supervised ML approach to classify social science journal articles into multiple fine-grained disciplinary categories. By making use of GB with CC we are capable of assigning one or more disciplinary categories to text documents (i.e., abstracts and titles). To do so, we have compiled a new data set consisting of 113,909 records originating from three disciplinary databases in the social sciences (EconLit, ERIC, and Sociological Abstracts).
The novelty of this study lies in two aspects: the construction of the labeled data set, based on discipline-specific thesauri, and the application of supervised ML algorithms to classify social science journal articles into one or more fine-grained disciplinary categories using text. We show in detail how we have collected the data and how we have validated the labeling based on the subject indexing terms from the thesauri. With regard to the ML methods, we compare different feature engineering techniques and two well-established classification algorithms. The Gradient Boosting classifier (LightGBM) in a Classifier Chaining framework is capable of predicting approximately 46% of the exact label combinations correctly, with a fraction of 0.3% of labels assigned incorrectly. The F1 score is 0.55.
In a previous study (Eykens et al., 2019) we assessed the performance of four different ML algorithms for the classification of sociology and anthropology journal articles extracted from Sociological Abstracts into fine-grained disciplinary categories (level 4 of the VODS). Making use of the same LightGBM module (Ke et al., 2017), we were able to correctly classify over 80% of the publications. In this previous study, we made use of simple feature engineering (i.e., lemmas and unigrams) and we did not assess whether multilabel classification was possible. Aside from the work by Read (2010) to date, we are unaware of studies making use of similar methods to achieve fine-grained disciplinary classifications. To our knowledge, no work exists that studies the performance of supervised ML algorithms to classify social science documents on such a granular level.
Because we have significantly scaled up our data set, this study adds more nuance to the previous experimental study (Eykens et al., 2019). We have added textual data from two additional disciplinary databases, namely ERIC and EconLit, and we have assessed more complex feature engineering techniques as well. Importantly, we assess whether multilabel classification is manageable. The results confirm the robustness of our previous work and expand it to additional data sources. We further demonstrate that to a certain extent the approach is indeed generalizable to a multilabel classification task. To achieve this, the quality of the data collection and data validation is crucial. Hence, we encourage others to develop a thought-through data collection and validation procedure to make sure that the complete ML experiment is reproducible, from data collection and processing onwards.
To summarize, this study shows that supervised ML algorithms are capable of classifying social science journal articles into predefined, fine-grained categories based on the limited textual data of abstracts and titles only. However, for both human experts and machines, such classification at the subdisciplinary level proves very hard, to the extent that the question can be raised of whether such an attempt makes sense. Given the need for fine-grained classification in view of assessments, evaluations, and policy, we suggest that the informetric community further explores the possibilities for such fine-grained classification. For example, can the results obtained in this study be improved with different or more advanced NLP techniques, and by combining human expertise with advanced ML techniques? Like others (Boyack & Klavans, 2014;Suominen & Toivanen, 2016), we do not believe it to be fruitful to consider one or other classification system superior. We do instead insist that each approach has its merits, especially when contrasted to others. We hope that our work will spur others to conduct similar studies that explore the limits of the feasibility of classification through algorithms and human experts.