We describe a gold standard corpus of protest events that comprises various local and international English-language sources from several countries. The corpus contains document-, sentence-, and token-level annotations. It facilitates the creation of machine learning models that automatically classify news articles and extract protest event-related information, and in turn the construction of knowledge bases that enable comparative social and political science studies. For each news source, the annotation starts with random samples of news articles and continues with samples drawn using active learning. Each batch of samples is annotated by two social and political scientists, adjudicated by an annotation supervisor, and improved by identifying annotation errors semi-automatically. We found that the corpus possesses the variety and quality necessary to develop and benchmark text classification and event extraction systems in a cross-context setting, contributing to the generalizability and robustness of automated text processing systems. This corpus and the reported results will establish a common foundation for automated protest event collection studies, which is currently lacking in the literature.

Socio-political event knowledge bases enable comparative social and political studies. The spatiotemporal distribution of these events sheds light on the causes and effects of government policies and political discourses that resonate in society. Protest event databases have been of particular prominence in social and political science in this respect, and this is the type of event we focus on. We define protest events within the scope of contentious politics (CP), referred to as a “repertoire of contention” by Tilly et al. [1, 2], which includes demonstrations, riots, strikes, and many other types of collective action.

Since news media provide a continuous flow of data over time and enable researchers to determine the significance of the events that are reported, social and political scientists turn to news data to create knowledge bases of protest events [3, 4, 5]. Protest event data collection has been carried out manually [6], semi-automatically [7], and automatically [8, 9, 10, 11]. While manual approaches tend to be too expensive, automatic methods have proven limited in the information they are able to collect [12, 13, 14]. Moreover, there has not been any common ground across projects that would enable a comparison of the results of distinct studies [15]. Therefore, as members of the Emerging Welfare (EMW) project, we took on the challenge of creating a common foundation consisting of the required high-quality data and state-of-the-art tools for fully automating the creation of reliable and valid protest knowledge bases. This foundation serves as a benchmark from which protest event collection studies can benefit. The effort has yielded a gold standard corpus (GSC) that will also serve the machine learning (ML) and computational linguistics communities in studying text processing tool development for constructing knowledge bases of protest events.

The Corpus consists of English news articles from various international sources and from local sources in India, China, and South Africa. The variety of the sources enables researchers to study cross-context robustness and generalizability, which are critical requirements of ML models, by addressing changes in style and content across sources. The annotations were applied at the document, sentence, and token levels in a consecutive manner. Each level was annotated by two people and adjudicated by the annotation supervisor. Moreover, detailed manual and semi-automatic quality checks and error analyses were applied to the annotations.

The Corpus contains more than 12,000 news articles labeled as protest or non-protest. Over 1,000 articles that contain protest events have undergone sentence-level annotation, whereby each sentence in the article was labeled according to whether it contains event information. Finally, these articles were annotated at the token level for detailed event information such as the trigger(s), trigger semantic category, place, time, and actor(s) of events, the semantic category of the actor(s), and event coreference information. The corpus has enabled the development of a pipeline of ML models that extract a knowledge base of protest events from archives of news articles. Moreover, parts of the GSC have enabled shared tasks on cross-context document and sentence classification as well as token extraction [16, 17] and event sentence coreference identification [18].

The contributions of this paper include 1) an effective methodology for creating a corpus that enables the creation of robust ML-based text processing tools, 2) insights from applying this methodology to news archives across contexts, 3) a corpus that contains data from multiple contexts and annotations at various levels of granularity, 4) results of a pipeline of automated tools created using the corpus, and 5) the first quantification of recall on unfiltered raw data, as opposed to recall measured on keyword-filtered data, which is common in event knowledge base projects.

We describe the context of our work in reference to recent related work in Section 2. Next, we introduce our methodology, the manuals we prepared, and the corpus we have created in Sections 3, 4, and 5, respectively. Section 6 reports the results of the ML tools that were created using the corpus. Finally, Section 7 concludes this report by presenting overall results and pointing to the steps we plan to pursue next.

The quality of automated socio-political event collection is determined by language resources, automated tools that exploit these resources, the assumptions made in designing an event collection system, and the data sets that are inputs or outputs of these systems. Existing language resources are scarce, and few accessible tools are available. Moreover, the assumptions made in delivering a resulting data set are not examined in diverse settings.

Automated tools for event information collection are designed as pipelines that receive news articles from one or more news sources and yield records of event information. Each tool is inherently limited to the language resource it utilizes for development and the setting within which it is validated. Therefore, when an automated tool is used to analyze different sources, the quality of the result, i.e., the cross-context performance of these pipelines, is rarely evaluated. The first step of these pipelines, which is discriminating between relevant and irrelevant documents, has been studied extensively by Croicu and Weidmann [19] and Hanna [20]. Keyword lists and labeled documents aid in determining which news reports contain relevant events. These studies provide their own keyword list and describe the way they use it. Moreover, labeled documents are presented only as URLs or document IDs in proprietary collections such as LexisNexis, without their content [21]. Accessing a data set with such limited information, and the necessity of purchasing subscriptions to these databases, are significant limitations. Our approach and database are novel in the sense that they do not restrict themselves to keywords and apply state-of-the-art ML models to select documents that contain event information, which is known as the report selection problem in this field [4].

Once protest event-related documents are determined, what remains is to extract event information at the token level. There are several established event ontologies and relevant language resources that can be exploited for this task, the most prominent of which are ACE [22], TAC-KBP [23], and CAMEO [24]. All three frameworks include wide ranges of event types and sub-types that can serve the needs of diverse domains. However, in terms of comparative social and political studies that focus on contentious politics, data based on the ACE and TAC-KBP (Rich ERE) annotation frameworks are limited in size and scope. Both frameworks include the ATTACK and DEMONSTRATE event categories as those relevant to the collection of CP events. The ATTACK category does not discriminate between the authors of the actions enumerated under ATTACK, i.e., attack, clash, and bomb, and thus includes state actions that are excluded from contentious politics. In other words, these event ontologies adhere only to syntactic rules, which do not allow the semantics of the event triggers and their arguments to affect the annotation. On the other hand, the ACE and Rich ERE category of DEMONSTRATE is in itself too restrictive to be applicable to a broad understanding of CP, for two reasons. First, as it limits the scope of this event type to spontaneous (unorganized) gatherings of people, it excludes certain actions of political and/or grassroots organizations such as political parties and non-governmental organizations (NGOs). Protest actions of such organizations sometimes do not involve mass participation despite aiming at challenging authorities, raising political agendas, or issuing certain demands. Putting up posters, distributing brochures, and holding press declarations in public spaces are examples of such protest events. Second, the requirement of mass participation in a public area leaves out many protest actions, such as online mass petitions and boycotts, which are not necessarily tied to specific locations where people gather, and actions of individuals or small groups, such as hunger strikes and self-immolation. Unlike ACE and Rich ERE, the CAMEO framework seems to offer event categories better suited to the collection of CP events, as it is directed towards the domain of international politics and governmental actions. CAMEO and its simplified version PLOVER include the PROTEST category, which covers most but not all CP event categories. Important elements of CP action, such as violent clashes, group confrontations, and armed attacks, are found in other categories, which include state actions as well as non-state actions, that is, CP events. In sum, the multi-purpose and comprehensive nature of existing language resources appears as a shortcoming and a complication for a database that focuses exclusively on CP events.

Consequently, protest-event-specific annotation schemas and data sets have been proposed for creating automated or semi-automated event knowledge bases [4, 11, 24, 25]. In their corpus, Makarov et al. [26] used an ontology of CP events that is very similar to ours and identified ten event types. They coded event actors and the issues that correspond to each event. Our definition is broader in the sense that we include election rallies as a form of CP event, while they exclude them on the grounds that they belong to institutional politics. The Mass Mobilization in Autocracies Database (MMAD), on the other hand, defines protest events as specifically anti-regime political actions, thereby excluding industrial actions (strikes, etc.) and conflicts between social groups. They also exclude protest events with fewer than 25 participants and events involving the systematic use of armed force. Apart from the differences between these projects and ours in terms of event ontology, these resources are mainly created using keywords for a single context, and it is a challenge to obtain the data sets based on the limited information shared. We follow the detailed protest event information tradition proposed by Lorenzini et al. [15] and Gerner et al. [24], working on data unrestricted by keywords and making our data available to all researchers through shared tasks and sufficient information.

To sum up, recent studies on event data often make one or more of the following assumptions: 1) analyzing thousands or millions of sources will compensate for the low recall performance of the tools, 2) a news report contains information about a single event, 3) analyzing a sentence individually is sufficient for extracting relevant information about an event, and 4) tool performance on a new source will be comparable to the performance on the validation setting [27, 28, 29]. Quantifying the effect of these assumptions is not a simple task; therefore, they are rarely tested. We make a point of providing observations that shed light on the effect of these assumptions, or we refrain from making such assumptions at all.

A GSC of protest events that can enable large-scale, multi-source socio-political studies should be representative of the content it aims to capture. Moreover, it should enable quantification of the automation performance across contexts; therefore, using available corpora such as English Gigaword [30] is not an option for this setting. In order to satisfy these requirements, our methodology is designed to incorporate multiple sources and countries and to apply a detailed annotation methodology without reducing the content quality.

We collected online local news articles from India, China, and South Africa, and relied on international sources when local sources were not accessible. We first downloaded URLs of the freely accessible parts of the online news archives of daily newspapers, including Indian Express (IEX), New Indian Express (NIEX), The Hindu (TH), Times of India (ToI), South China Morning Post (SCMP), and People's Daily (PD). Then, for each source, we took a random sample of these URLs and downloaded their content for labeling and annotation.

The random sampling approach made the task challenging, as the resulting samples contained few relevant documents compared to keyword-filtered samples, yet it approximated the real data universe more accurately. Keyword lists run the risk of excluding events that are reported without the use of common protest terms (the phrase “classrooms remained empty” can be used to refer to a teachers' strike, for instance) [16, 31, 32]. Moreover, lexical variance across contexts cannot always be captured using keywords. For instance, phenomena such as “bandh” and “dharna” are protest event types specific to India and are therefore not covered by any general-purpose protest keyword list. Our evaluation of four keyword lists, which are utilized by Huang et al. [25], Wang et al. [12], Weidmann and Rød [4], and Makarov et al. [26], yielded at best .68 precision and .80 recall on our randomly sampled batches.
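The keyword-list evaluation amounts to treating the presence of any keyword in a document as a positive prediction and scoring it against the gold document labels. The snippet below is a minimal sketch of such an evaluation; the keyword set and function names are illustrative placeholders, not the actual lists used in the cited studies.

```python
# Minimal sketch of scoring a keyword filter against gold document labels.
# PROTEST_KEYWORDS is a hypothetical list, not the one from the cited studies.
from sklearn.metrics import precision_score, recall_score

PROTEST_KEYWORDS = {"protest", "strike", "rally", "demonstration", "riot"}

def keyword_predict(text: str) -> int:
    """Label a document as protest (1) if it contains any keyword."""
    tokens = set(text.lower().split())
    return int(bool(PROTEST_KEYWORDS & tokens))

def evaluate_keyword_filter(documents, gold_labels):
    """Precision and recall of the keyword filter on a randomly sampled batch."""
    predictions = [keyword_predict(doc) for doc in documents]
    return precision_score(gold_labels, predictions), recall_score(gold_labels, predictions)
```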

The annotation of this randomly sampled raw data is based on annotation manuals created by an expert and applied to document-, sentence-, and token-level annotation for each target context, while the annotation team continuously monitored the annotations to achieve a high inter-annotator agreement (IAA). The same manuals were applied to data collected from different sources and countries, enabling comparable measures of automatic tool performance across contexts. Finally, in order to eliminate the risk of incorrect labeling due to a lack of knowledge about a country, a domain expert in the politics of the target country instructed the annotators before they started the annotation.

The annotation team was led by a supervisor, a social scientist responsible for maintaining the annotation manuals and resolving annotator disagreements. The annotators, who worked in pairs, were master's students or Ph.D. candidates in the social or political sciences. Throughout the annotation, the overlap ratio of annotated articles between the members of a pair was 100%. The annotation started by labeling whether a news article mentioned a past or ongoing protest. Then, the sentence(s) that contained protest information were identified. Finally, protest information such as participants, place, and time was detected in the protest-related sentences at the token level. The three levels of annotation were separate but integrated, in the sense that they formed a pipeline in which each batch of documents went through each step, each building upon the result of the previous step. The aim was to maximize time and resource efficiency and performance by utilizing the feedback of each level of annotation for the whole process. The lack of clear boundaries between these levels at the beginning of the annotation project caused a relatively low IAA, and more time was spent on the quality check and correction of the data set. In order to ameliorate this, we added sentence-level annotation as an additional step to the main steps of protest event pipelines. This order of tasks enabled error analysis and optimization during annotation and tool development.

Each batch at the document and sentence levels was corrected by means of:

  1. Spot-checks: 10% of the agreements were checked by the annotation supervisor

  2. ML-internal: 80% of the batch was used to create an ML model. Next, the remaining 20% was predicted using this model. This procedure was repeated until all instances were used at least once in training and once in test data.

  3. ML-external: The annotated and corrected data from previous batches were used to create an automated classifier, which then classified the newly annotated batch.

The disagreements between the classifiers and the annotations were checked manually for ML-internal and ML-external. Of the annotations checked via spot-checks, ML-internal, and ML-external procedures, around 2%, 50%, and 10%, respectively, were found to be incorrectly annotated. In total, around 10% of the annotations were corrected using these measures.
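The ML-internal check can be viewed as a cross-validation loop over a batch: each document is predicted exactly once by a model trained on the remaining 80%, and the prediction/annotation disagreements are routed to the supervisor. Below is a minimal sketch under that reading; the TF-IDF plus logistic regression classifier is an illustrative stand-in for whichever model is trained on a given batch.

```python
# Sketch of the ML-internal quality check: rotate train/test splits over a
# batch so that every document is predicted once, then flag disagreements
# between the model and the annotators for manual re-checking.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

def flag_suspect_annotations(texts, labels, n_splits=5):
    """Return indices of documents whose annotation disagrees with the model."""
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    predictions = cross_val_predict(model, texts, labels, cv=n_splits)
    return [i for i, (pred, gold) in enumerate(zip(predictions, labels)) if pred != gold]
```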

In order to increase the time efficiency of annotation by increasing the number of relevant documents, we applied recall-optimized active learning (AL) sampling from news archives, utilizing ML-based classifiers already trained on random samples. We followed this procedure when we needed to improve the performance of a tool on a source that had already been covered, to train the tool on a new source from a country that had already been reviewed, or to adapt the tools to a different period. For these AL samples, we first trained multiple ML-based classifiers (three or more) on the available corpus and then predicted a random batch from the new context. To achieve elevated recall, we took the logical OR of all classifiers as the final prediction and selected the positive samples to be annotated. Although the recall decreased from 100% to 97% in such a sample, the precision increased from around 5% to around 70% in comparison to a random sample; AL thus significantly decreased the effort the annotators needed to spend on annotation [33]. Since annotators encountered many more positive samples in this setting, we consider the decrease in recall to be a minor issue for now.
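A minimal sketch of this recall-oriented selection step is shown below, assuming each previously trained classifier exposes a scikit-learn-style predict method that returns binary labels; the function name is illustrative.

```python
# Sketch of recall-optimized active learning sampling: a document from the
# new context is sent to annotation if ANY of the classifiers trained on
# earlier batches predicts it as protest-related (logical OR).
import numpy as np

def select_for_annotation(batch_texts, classifiers):
    """Return indices of documents that at least one classifier labels as positive."""
    votes = np.zeros(len(batch_texts), dtype=bool)
    for clf in classifiers:
        votes |= np.asarray(clf.predict(batch_texts), dtype=bool)  # OR keeps recall high
    return np.flatnonzero(votes).tolist()
```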

Annotation tasks on document-, sentence- and token-levels each have their respective annotation manuals that define the task and establish the rules of annotation, enumerate cases that might be confusing, and elaborate on the rules in the context of concrete situations and examples.

The EMW Document-Level Protest Annotation Manual (DOLPAM) was created for document-level annotation. This manual lays out the protest event ontology, that is, the protest event definition that specifies the range of CP events included in the scope of the project. It also contains the rules by which news articles are identified as containing CP events. In brief, CP events cover any politically motivated collective action that lies outside the official mechanisms of political participation associated with formal government institutions. This broad event definition is further developed in two sections. The first section identifies three abstract categories of collective action, namely political mobilizations, social protests, and group confrontations, to define the broad range of socio-political events that the project simply refers to as protest events. Next, five specific categories of CP events are identified as concrete manifestations of the types of collective action already defined. Demonstrations, industrial actions, group clashes, political violence and armed militancy, and electoral mobilization events are the concrete types of events that our event ontology encompasses. Once the event definition is laid out, the manual establishes criteria for determining which news stories that report protest events can be classified as protest news articles. These criteria include the necessity of civilian actors and the existence of concrete time and place information confirming that the event(s) the report mentions have indeed taken place. Only news reports that mention events that have taken place in the past or are taking place at the time of writing are labeled as protest news articles. References to future (i.e., planned, threatened, announced, or expected) events are not labeled as protest, with the exception of threats of violent events or attempts to carry them out.

The EMW Sentence-Level Protest Annotation Manual (SELPAM) establishes rules for classifying the sentences of news reports as protest event sentences or non-event sentences. Similar to document-level annotation, sentences that contain references to protest events are labeled as protest event sentences. These are defined as sentences that give information about an event in the news report and contain at least one direct reference to a protest event; that is, every event sentence must contain an expression that denotes the event.

The EMW Token-Level Protest Annotation Manual (TOLPAM) acts as the guide to annotation at the token level. TOLPAM defines all variables and pieces of information about protest events that the EMW project aims to extract, and it establishes the rules according to which expressions in the event sentences are annotated using tags. Event arguments are tagged only within event sentences. This ensures that arguments belong to their respective events unambiguously, which improves IAA significantly. There are general rules that apply to all tags, as well as specific rules that apply to individual tags.

There are two main categories of tags: syntactic and semantic tags. Syntactic tags label expressions according to whether they are triggers (events) or event arguments. They are grouped under event, participant, organizer, and target characteristics. Event characteristics contain the trigger tags, i.e., event expressions that either directly denote the event (event anchor) or refer to it (event mention), as well as tags for the time, place, facility, and centrality (i.e., whether the event is urban or rural) arguments of events. Each event can have only a single anchor and zero or more event mentions that refer to the event anchor. Actor arguments of events are labeled under two categories: participants and organizers. Participants are personal actors (i.e., individuals or groups) who actively engage in the protest action. Organizers are organizations (political parties, NGOs, unions, etc.) that hold or take part in the protest events. In some cases, influential individuals or leaders might be the organizers of the protest events. Persons are annotated as organizers only in special cases where the article designates them explicitly as organizers or leaders of the protests. Each actor argument is labeled with participant (or organizer) type, name, ideology, religious, ethnic, and caste identity, and socioeconomic status labels. Finally, target arguments of events are annotated with target type and target name labels. These labels also designate the possible antagonists of the protest events, including governments, officials, leaders, political organizations, or, in the case of group clashes, other social groups.

Semantic tags classify events and participant and organizer arguments into sub-types identified per the requirements of the EMW project. Every event trigger, participant actor, and organizer actor is labeled with a syntactic tag and one of the semantic sub-type tags. For events, semantic tags correspond to types of collective action. Demonstrations (rallies, marches, sit-ins, slogan shouting, gatherings, etc.), industrial actions (strikes, slow-downs, picket lines, gheraos, etc.), group clashes (fights, clashes, lynching, etc.), armed militancy (attacks, bombings, assassinations, etc.), and electoral politics (election rallies) are the sub-types. Participant expressions are categorized semantically into peasant (people who work in agriculture and/or live in rural areas), worker (any kind of public or private sector worker, blue or white collar), small producer (owners of small shops, small traders, and artisans, including transport owners such as owner taxi drivers), employer/executive (owners and managers of medium and large-sized businesses), professional (university-educated professionals such as physicians, lawyers, academics, and journalists, who work in the private or public sector), student (students at all levels of education), people (general categories that refer to citizens such as women, residents, and religious or ethnic community members, and expressions such as mob and crowds), activist (ordinary members of political parties, grassroots organizations, and NGOs), politician (members of political parties who are members of legislative organs and/or executive branches of government), and militant (members of armed political organizations such as Islamic fundamentalist militants and members of armed revolutionary organizations) sub-types. Any category of participants who cannot be placed into any of the above categories is marked as 'other.' Organizer sub-types are political party, NGO, trade union, armed/militant organization, and chamber of professionals.

Below are examples of event information annotated at the token level. The bold tokens are the event triggers. The underlined tokens are event arguments which are listed in order as event time (sentence 2); event time, organizer name, participant type, facility type, target name, event time, facility name, and participant type (sentence 3); as well as event place, participant count, participant type, and facility type (sentence 4). In the first sentence, the event triggers are also labeled with the semantic sub-type label “group clash” as the event is a communal clash. It should be noted that an event that has not taken place, the rally, in the second sentence is not annotated. The tokens that refer to the event that took place are also labeled with semantic sub-type “demonstration.” In the third sentence, there are two event references. ‘Gathered’ and ‘shouted slogans’ are the triggers for the first event, which are also labeled with the “demonstration” semantic tag. The triggers of the second event reference are “attack” and “killed,” which are labeled with the “armed militancy” semantic sub-type tag. The actor arguments in the third sentence are labeled with semantic sub-types “political party” (‘BJP’), “activist” (‘workers’), and “militant” (‘militants’). In the fourth sentence, event triggers are labeled with the “demonstration” semantic tag, while the participant actor token (‘workers’) is labeled with the “worker” semantic tag.

  1. It took a communal turn that had resulted in stone-pelting, arson and loot.

  2. The Bhim Army and other Dalit groups were refused permission to organize a rally against atrocities on May 9, sparking off violence and vandalism, with several vehicles and buses burnt.

  3. At noon, BJP workers gathered in the square and shouted slogans, condemning the failure of the Union Government in delivering justice to the victims of last year's terror attack at the train station where armed militants killed 25 people.

  4. In Bangalore, hundreds of workers participated in the rally in front of the collectorate.

We have annotated the corpus at three levels: document, sentence, and token. The document level refers to what a reader sees as a news article; it consists of a title, a publication time, and the article text. The sentence level refers to a text unit that ends with a sentence-completing punctuation mark. The token level refers to a punctuation mark or a sequence of alphanumeric characters that is characterized as a word in English. The following subsections describe the number of instances at each level, the quality of the annotations, and the storage and release of the corpus.

5.1 Data Statistics

The document counts for each document-level batch are reported in Table 1. Each batch is named after the source it was sampled from. In cases where we annotated data from an already released corpus, such as EventStatus (ES) [25] and RCV1 [34], we use the corpus name as the batch name. Suffixes are added to distinguish between different batches from the same source. For instance, SCMP1 and SCMP2 differ in terms of the period they cover, which is 2000–2002 and 2000–2017, respectively.

Table 1.
Document label statistics and sampling method.
              ES    INT   IEX   NIEX  PD    RCV1  SCMP1  SCMP2  TH    ToI
Protest       151   262   296   71    69    802   17     19     264   481
Non-Protest   149   738   265   630   732   367   985    483    782   1,985
Sampling      AL AL AL AL R&AL

Note: K, R, and AL indicate Keyword, Random, and Active Learning, respectively. ES, INT, and RCV1 stand for EventStatus, International, and Reuters (filtered for China using meta-information), respectively.

Active learning was applied to create three batches of articles, INT2 (Guardian), SCMP3, and NIEX2 (New Indian Express), which were annotated at the sentence level and are reported in Table 2. We sampled the full documents and annotated their sentences. The high number of non-protest sentence annotations stems from documents that do not contain any protest information.

Table 2.
Sentence level statistics.
              INT2    SCMP3   NIEX2
Protest       1,658   511     1,299
Non-protest   9,045   2,847   7,083

Note: The total number of sentences and their annotations as protest and non-protest is reported.

Sentences of a subset of the positive documents were annotated at the token level. The counts of the information types in these annotated documents, of which 704 are from India and 135 from China, are reported in Table 3. A news article was annotated at the token level only if the event happened or was happening in the country of the source, or in the country under focus for international sources, because protest event characteristics that differ across countries can affect the quality of the annotation.

Table 3.
Token level statistics and IAA in terms of Krippendorff's alpha.
Tag name   Time    Trigger   Place   Facility   Participant   Organizer   Target
India      822     1,378     645     392        2,283         1,260       1,453
China      144     142       82      52         272           88          109
IAA        60.07   50.02     41.82   39.10      39.50         47.44       34.38

The country-specific total counts of the documents, the separate events in these documents, and the triggers referring to these events are presented in Table 4. The number of triggers that refer to multiple events is provided as well. Each document contains 1.8 events and each event is described using 3.1 triggers on average across the countries in scope. The overall percentages of events in the documents, triggers in events, and tokens in triggers are shown in Figure 1. The figure shows that 41%, 65%, and 18% of the documents, events, and triggers contain more than one event, trigger, and token, respectively. These statistics are comparable across the data collected from each country and jointly provide the first quantification, in the socio-political events domain, of reporting multiple events in a news article, using multiple triggers to describe an event, and using multiple tokens to denote a trigger. This information indicates the importance of coreference information for event information collection tasks and shows that the assumption that news articles contain event information only in the title, the leading sentences, or an indexed summary inherently leads to disregarding a significant amount of event information. This assumption was made, for instance, by Tanev et al. [28] and Jenkins et al. [35], and its effect was also observed by Johnson et al. [29] through extrinsic evaluation of protest event databases.

Figure 1. The overall percentages of events, triggers, and tokens in documents, events, and triggers, respectively.
Table 4.
Number of documents, events, triggers, and multi-event triggers per country.
               #documents   #events   #triggers   #multi-event triggers
India          639          1,171     3,577       71
China          73           123       416         11
South Africa   184          357       1,060       16

The distribution of the semantic event categories for each country in scope and the overall percentage of the categories in the data set are provided in Table 5. Most of the events fall into the demonstration and armed militancy categories. The remaining categories comprise only 28% of the events. The event category incidence ratios are imbalanced both in individual countries and across countries. This difference is one potential reason for the performance gap of the automated tools for event information collection in cross-context settings [17].

Table 5.
Number of demonstration, political violence and armed militancy, group clash, industrial action, and electoral mobilization events per country, and their overall percentages.
                #demonstration   #armed militancy   #group clash   #industrial act   #electoral politics   #other
India           1,177            787                656            366               62                    20
China           358              64                 50
South Africa    269              39                 57
Overall ratio   53%              19%                17%            8%                2%                    1%

The semantic categories of the actors, i.e., participants and organizers, were annotated at the token level for the data collected from India. The annotated participant categories and their counts are Activist in 416, People in 416, Militant in 275, Worker in 249, Professional in 153, Student in 146, Politician in 118, Peasant in 54, and Small Producer in 44 cases. The organizer counts are 518 for Political Party, 267 for NGO, 139 for Labor Union, 123 for Militant Organization, and 47 for Chamber of Professionals. There are 409 participant and 161 organizer annotations, mostly first and last names, that were not semantically labeled because they do not carry any semantic information per our annotation schema. Finally, the Employer/Executive category labels were removed since they occurred only four times in the whole data set.

5.2 Corpus Quality

The inter-annotator agreement (IAA) for the document and sentence levels is above .75 and .65 Krippendorff's alpha [36] on average, respectively. The IAA for the token level is less consistent than for the other levels, as can be seen in Table 3 for each information type. We interpret these scores as an indication of the difficulty of the task and of the extent to which the annotation manuals can facilitate consistent annotation.
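As an illustration, the document-level IAA can be computed with the open-source krippendorff Python package, assuming the two annotators' binary protest/non-protest labels are aligned per document; the function below is a sketch under that assumption rather than our exact tooling.

```python
# Sketch of a nominal-level Krippendorff's alpha computation for two
# annotators' document labels (1 = protest, 0 = non-protest).
import numpy as np
import krippendorff  # assumes the open-source `krippendorff` package

def document_level_alpha(labels_annotator_a, labels_annotator_b):
    reliability_data = np.array([labels_annotator_a, labels_annotator_b], dtype=float)
    return krippendorff.alpha(reliability_data=reliability_data,
                              level_of_measurement="nominal")
```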

The semantic information for event triggers, participants, and organizers was annotated after the token-level annotations were completed and adjudicated for most of the documents. The annotations for the semantic category of the event trigger were applied following the double-annotation and quality-control approach used at the token level. The IAA for the semantic category of the event trigger is .86 and .85 for the data from India and China, respectively. However, the semantic annotations for participants and organizers were applied at the token level by only one experienced annotator, who resolved any issues with the annotation supervisor. Hence, these are the only annotations for which we cannot provide an IAA score.

We selected the evaluation method proposed by Denis and Baldridge [37] and used in the CoNLL-2012 shared task [38] for measuring the quality of the event coreference annotation at the token level. We treated the annotations of one annotator as the gold standard and the annotations of the other annotator as predictions, repeated this procedure for each annotator, and calculated the average of these scores. The average is .77, which we believe to be sufficient for modeling purposes.
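To illustrate the procedure, the sketch below scores one annotator's event clusters against the other's and averages both directions, showing only the B-cubed component; the score reported above is the CoNLL-2012 average of MUC, B-cubed, and CEAF, so this is a simplified stand-in rather than the exact scorer we used.

```python
# Sketch of annotator-vs-annotator coreference scoring with the B-cubed metric.
# gold_clusters and pred_clusters are lists of sets of mention identifiers.
def b_cubed_f1(gold_clusters, pred_clusters):
    def per_mention_overlap(key, response):
        total, n = 0.0, 0
        for cluster in key:
            for mention in cluster:
                resp = next((c for c in response if mention in c), set())
                total += len(cluster & resp) / len(cluster)
                n += 1
        return total / n if n else 0.0

    recall = per_mention_overlap(gold_clusters, pred_clusters)
    precision = per_mention_overlap(pred_clusters, gold_clusters)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def annotator_agreement(clusters_a, clusters_b):
    # Treat each annotator as gold in turn and average, as described above.
    return (b_cubed_f1(clusters_a, clusters_b) + b_cubed_f1(clusters_b, clusters_a)) / 2
```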

The quality of all annotations in the corpus was improved semi-automatically. Adjudications, spot-checks, and manual analysis of the system predictions allowed us to fix at least 10% of the annotations for each task. The agreement scores should be perceived as an indication of the complexity of the concepts we attempted to process automatically.

In short, we suggest serious consideration of the following factors for creating a high-quality cross-context corpus that reliably captures a certain phenomenon, in a way that it can be modelled using machine learning models:

  1. Variety of sources: Each different data source enriches the representation of the phenomena in the corpus. Use of the automated tools in any context that is not represented in the corpus will yield outputs that are less reliable and less valid. Therefore, each target context must be represented with at least one source in the corpus.

  2. Random sampling: Raw data demonstrate how the target phenomenon is presented. Therefore, working with random samples is critical until an operational setting that can quantify performance of any other method of sampling the documents to be annotated, such as keyword selection or active learning, can be established.

  3. Supervisor: Consistency is the key element of automated tool development. Therefore, an expert, preferably the same person throughout the whole corpus preparation task, should prepare and maintain the annotation manual, train the annotators, adjudicate and spot-check the annotations, and double-check the incompatibilities between automated predictions and manual annotations.

  4. Annotation manual: The first version of an annotation manual should be a minimally viable product, consisting of a generic description of the target phenomena and basic instructions for the annotators. It should be updated as more data are observed throughout the annotation. In case of backward incompatible updates, the previous annotations should be updated semi-automatically using data annotated with the new manual.

  5. Annotating at various levels: The text should be annotated at multiple levels, e.g., document, sentence, and token, consecutively, in a way that ensures the quality of the annotations at each individual level. Any annotation error detected at a preceding level should be corrected. For instance, a result of sentence-level annotation, e.g., the absence of any relevant sentence, can indicate an error committed during document-level annotation, e.g., a document incorrectly labeled as relevant.

  6. Tracking irrelevant information: Identifying relevant and irrelevant information is equally important for a gold standard corpus. This information enhances automated tool performance and enables the quantification of recall. If any update is required to the annotation manual, the irrelevant documents should be considered relevant only if they satisfy the new conditions.

  7. Training of the annotators: Each context should be understood as well as possible by the people who will analyze data and make decisions. Therefore, annotators should be trained on the contexts within the scope. This practice increases consistency and decreases time spent on supervision of the annotators and adjudication of the annotations.

  8. Double annotation: The formal, abstract description of the phenomena in the annotation manual may differ drastically from how it is referred to within the text. Annotators' respective backgrounds may also affect their interpretations. Therefore, each annotation should be performed by at least two annotators, the annotators in a pair should be changed frequently, and different pairs of annotators should work on individual levels of annotation for each batch of documents.

  9. Multiple expert annotators: Cross-context work on a non-trivial phenomenon requires the attention of multiple experts. Therefore, an annotation team must consist of more than two annotators, which would prevent systematically erroneous agreements. Multiple annotation teams would further necessitate an expert annotation supervisor to ensure consistency and compliance with the manuals. Regular and consistent communication between annotators and the supervisor is critical for assuring consistency and reducing time and effort spent on retrospective corrections.

  10. Adjudications and spot-checks: Annotation disagreements between annotators should be adjudicated by the annotation supervisor. Moreover, some portion of the annotation agreements should be double-checked as spot-checks. Each disagreement or incorrect annotation provide insights to the supervisor about the data, the annotation manual, and the target phenomena. Errors and disagreements should be considered feedback and serve to inform changes to the annotation manuals during their maintenance. Annotation manuals should be updated recurrently in light of feedback from annotators, though not in a way that would reverse existing rules or create inconsistencies.

  11. Inter-annotator agreement (IAA): Keeping track of the IAA scores serves as a guide for determining points to be improved in the annotation manual, revealing differences in approaches and viewpoints of annotators, improving the rules of annotation, and even adjusting the annotation tasks. These scores also serve as information sources about the performance expected when automated tools that handle the target phenomena are created. High IAA will enable good performance.

  12. Semi-automated error correction: Batches of documents should be used to train and test ML models as soon as their annotation and quality assurance is complete. Any inconsistencies between the model predictions and the annotations must be double-checked by the annotation supervisor. This step frequently enables the identification of annotation errors caused by disagreement between annotators.

  13. Determination of the baseline performance: State-of-the-art ML models should be trained and validated on the corpus. The performance scores on held-out test data and in a leave-one-context-out setting offer information about the usability of the corpus for the contexts in the scope of the study.

We suggest following all of these essential steps, particularly at the beginning of a text annotation study. Some steps may be altered as the project progresses. For instance, keywords or active learning can be used to select the documents to be annotated, but only after testing them on randomly sampled and annotated data. Similarly, if the annotation manuals become complete and stable after the first iterations of annotation, the training of the annotators can be relatively simpler in subsequent iterations.

5.3 Corpus Release

The document- and sentence-level annotations are stored in JSON format, and the token-level data are stored in FoLiA [39] format. We have distributed the corpus in a way that does not violate the copyright of the news sources. This involves sharing only the information needed to reproduce the corpus from the source in cases where distributing the news articles themselves may be problematic. To this end, the document- and sentence-level data can be downloaded using software we developed and packaged in a Docker image. These software tools download, clean, and align text, and they are provided in a Docker image to facilitate ease of use and reproducibility. The validation of this software was performed during the aforementioned shared tasks.

Protest database building demands special attention to ethical dimensions of research because of the risks associated with political dissent. We have developed a data sharing protocol to minimize these risks and prevent the use of our data for malicious purposes against activists. Our data sharing policy is shaped by our belief in the power of scientific collaboration and research transparency for theory and method development as much as our concern about the well-being of precarious groups and/or individuals that are engaged in political activities for social change. In order to balance these sometimes-conflicting priorities, we have embraced a two-tier data sharing method. At the first layer, we share a processed and limited version of our data in a visualized form on an open-access website of the GLOCON Project. The protest data visualized on country maps on this website provide only macro-level information about the main protest categories, such as year, event-place (province or city), event-type (five major aggregated event-type semantic categories we have produced: demonstrations, industrial actions, group clashes, armed militancy, and electoral politics), urban/rural locations, ethnicity, and ideology. At the second layer, we share our detailed protest data sets (including detailed event, organizer, participant, place, facility and time information) and our GSC and computational tools only with researchers and parties (including social movements themselves) who comply with ethical standards of social movements research and norms for the protection of research subjects. In line with this policy, we do not share our detailed data with government or law enforcement agencies or researchers who collaborate or receive funding from intelligence or defense agencies.

We have exploited the data from India to train ML-based models using BERT [40] for document and sentence classification and token extraction in various scenarios. We fine-tuned the pre-trained BERT-based model with our data. The hyperparameters are the same as those of the original authors of each model for each level, except for our sentence classifier, which restricts the maximum sequence length to 128 instead of 512. Table 6 provides F1-macro scores for document and sentence classification and an F1 score based on the CoNLL 2003 evaluation script for the token extraction models. The test data for India are a held-out sample, since the training data are from India; for China and the international data, all corresponding data in the corpus are used as test data. The “Token” row in this table reports only the trigger detection performance.
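A minimal sketch of the fine-tuning setup is given below, assuming the Hugging Face transformers library; apart from the 128-token maximum sequence length used for the sentence classifier, the hyperparameter values here are illustrative defaults rather than the exact values behind the reported results.

```python
# Sketch of fine-tuning a BERT document/sentence classifier on protest labels.
import torch
from transformers import (BertForSequenceClassification, BertTokenizerFast,
                          Trainer, TrainingArguments)

class ProtestDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.encodings = tokenizer(texts, truncation=True, padding="max_length",
                                   max_length=max_length)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

def fine_tune_classifier(train_texts, train_labels, dev_texts, dev_labels):
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
    args = TrainingArguments(output_dir="protest-classifier", num_train_epochs=3,
                             per_device_train_batch_size=16)
    trainer = Trainer(model=model, args=args,
                      train_dataset=ProtestDataset(train_texts, train_labels, tokenizer),
                      eval_dataset=ProtestDataset(dev_texts, dev_labels, tokenizer))
    trainer.train()
    return trainer
```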

Table 6.
F1-macro of document and sentence classification and F1 for trigger detection.
           India   China   Int-China   South Africa
Document   .89     .82     .83         .85
Sentence   .85     .79     .83         .85
Token      .74     .67     N/A         N/A

The token-level scores are based on the BERT-based model fine-tuned on our GSC and are generated using a held-out part of it (see Table 7). Additionally, we fine-tuned the Flair NER model [41], which is trained on CoNLL 2003 NER data [42], with our data by mapping our place, participant, and organizer tags to “LOC”, “PER”, and “ORG” in the CoNLL data, respectively. In comparison to the BERT-based model, this model yielded significantly better results of .780, .697, and .652 for the place, participant, and organizer types, respectively. Finally, we ran the same test data through an event extraction model, which was also a BERT-based model, trained on ACE event extraction data. We measured the trigger detection performance of this model based on its CONFLICT category predictions. The F1 scores for the CONFLICT type are .543 on its own data and .479 on our new data. The difference between the scores obtained using ACE and our training data shows that our efforts contribute significantly to protest event collection studies.
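The tag mapping used for the Flair experiment can be expressed as a simple lookup from our argument types to CoNLL-2003 entity types; the BIO tag strings below are illustrative of how the corpus tags might be encoded and are not the exact label strings of the released files.

```python
# Sketch of mapping corpus argument tags to CoNLL-2003 NER tags for the
# Flair fine-tuning experiment; tags without a counterpart become 'O'.
EMW_TO_CONLL = {"place": "LOC", "participant": "PER", "organizer": "ORG"}

def map_tag(bio_tag: str) -> str:
    """Map e.g. 'B-participant' to 'B-PER'."""
    if bio_tag == "O":
        return "O"
    prefix, _, tag = bio_tag.partition("-")
    return f"{prefix}-{EMW_TO_CONLL[tag]}" if tag in EMW_TO_CONLL else "O"
```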

Table 7.
Token level information extraction scores per information type.
            Trigger   Time    Place   Facility   Participant   Organizer   Target
Precision   0.756     0.663   0.724   0.436      0.649         0.568       0.497
Recall      0.691     0.704   0.646   0.436      0.564         0.619       0.485
F1          0.722     0.683   0.683   0.436      0.604         0.593       0.491

We integrate the tools reported in Table 6 and report their performance in Table 8 on a separate data set of 200 news articles from India, which consists of 100 positively and 100 negatively predicted documents at the document level. Doc, Sent, and Tok correspond to the tools applied, in the order in which they appear in the configuration name. The highest precision, recall, and F1-macro were yielded by Doc+Sent+Tok, Tok, and Doc+Tok, respectively. The score reported for the Tok configuration is that of the event trigger detection model alone. The trigger detection performance is lower than that reported in Table 7, since this evaluation setting contains non-protest documents. The obvious result is that each additional component improves precision but decreases recall. The interesting result is that integrating only the document classification tool (Doc+Tok) enhances precision with only a slight decrease in recall in comparison to the other configurations.
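How the configurations in Table 8 compose can be sketched as a simple cascade in which each component filters the input of the next; doc_clf, sent_clf, and token_extractor below stand for the fine-tuned models described earlier and are assumptions of this sketch rather than the exact interfaces of our tools.

```python
# Sketch of the Doc/Sent/Tok cascade: the document classifier filters
# articles, the sentence classifier filters sentences, and the token-level
# extractor runs only on whatever survives.
def extract_triggers(article_sentences, token_extractor, doc_clf=None, sent_clf=None):
    """Return token-level predictions for one article under a given configuration."""
    if doc_clf is not None and doc_clf.predict([" ".join(article_sentences)])[0] == 0:
        return []  # Doc component: drop articles predicted as non-protest
    sentences = article_sentences
    if sent_clf is not None:
        sentences = [s for s in sentences if sent_clf.predict([s])[0] == 1]
    return [token_extractor.predict(s) for s in sentences]  # Tok component
```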

Table 8.
Trigger detection performances in various configurations of a pipeline.
            Tok    Sent+Tok   Doc+Tok   Doc+Sent+Tok
Precision   .624   .696       .660      .701
Recall      .663   .561       .647      .547
F1          .643   .621       .653      .614

Additional results obtained using parts of this corpus can be found in various publications. The results of a shared task on cross-context document and sentence classification and token extraction were reported in the overview paper of the ProtestNews Lab [17], which was held in the scope of the Conference and Labs of the Evaluation Forum (CLEF 2019). Participants of this shared task reported results comparable to the performance reported in this paper. Moreover, the data set was used in the event sentence coreference identification task, in which participants developed systems to identify sentences about the same event, in the scope of the workshop Automated Extraction of Socio-political Events from News (AESPEN) at the Language Resources and Evaluation Conference (LREC 2020) [18].

We created three sentence classification models exploiting data from India for the semantic categories of the event trigger, participant, and organizer. The sentence-level labels were inferred from the token-level labels. A BERT-based model with a maximum sequence length of 128 was fine-tuned for each of these tasks using the same BERT settings reported above.
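The label inference can be sketched as follows: a sentence inherits the semantic category of the event triggers it contains, and sentences with conflicting categories are set aside (cf. Table 9). The tuple layout of the token annotations below is an assumption for illustration.

```python
# Sketch of inferring a sentence-level semantic label from token-level
# trigger annotations; returns None for sentences without a trigger or with
# conflicting categories.
def sentence_semantic_label(token_annotations):
    """token_annotations: list of (token, syntactic_tag, semantic_category) triples."""
    categories = {sem for _, tag, sem in token_annotations
                  if "trigger" in tag and sem is not None}
    return categories.pop() if len(categories) == 1 else None
```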

The data exploited for the creation and evaluation of the event trigger semantic classification model are reported in Table 9. Sentences that contain event triggers belonging to multiple semantic categories were excluded from all subsets of the data, i.e., train, development, and test.

Table 9.
Sentence counts for classification of the semantic category of the event triggers.
                     Train   Development   Test
Armed Militancy      361     48            72
Demonstration        847     113           169
Electoral Politics   31
Group Clash          278     37            56
Industrial Action    202     27            41
Other                11

The performance scores of the model for trigger semantic categorization are provided in Table 10. The data points labeled as “Electoral Politics” and “Other” were changed to “Demonstration”, since their size is relatively small and the model that we trained by including these classes yielded an F1-score of only .60. The final model yielded relatively low scores for the Group Clash category. Our investigation via a confusion matrix showed that the model most frequently confuses the Group Clash category with the Demonstration category. We observed that demonstrations and group clashes frequently occur together in the data.

Table 10.
The precision, recall, and F1-macro for the semantic event trigger classification at the sentence level.
                    Precision   Recall   F1
Armed Militancy     .85         .95      .90
Demonstration       .82         .90      .86
Group Clash         .74         .51      .61
Industrial Action   .90         .73      .81
Overall average     .83         .77      .79

Next, we prepared the data for the participant and organizer semantic classification models. The “No” label, which is maintained as a separate class, is attached to event sentences that do not contain any semantic information pertaining to participants or organizers. There are three kinds of cases where the “No” label is used. First, a participant or organizer annotation may not occur in an event sentence, as is the case in 1,018 and 1,776 instances, respectively. Second, some first and last names do not indicate any semantic information, as in 99 and 173 cases, respectively. Third, some participants or organizers may carry semantic information that is not covered by our annotation schema. For instance, managers and employers, which occur only four times in the data set, are not covered by the participant categories. The remaining sentence counts were 1,336 for the participant and 1,094 for the organizer categories. We observed 139 sentences that contained multiple participant semantic categories and 38 that contained multiple organizer semantic categories. We consider these sentences conflicting and have thus excluded them from the training and development sets when creating an ML model using these data. However, for a fair evaluation, we included them in the test data by repeating the same sentence once for each semantic type that occurs in it. Consequently, there are 387 and 377 unique sentences out of 422 and 385 samples in the participant and organizer test sets, respectively.
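The handling of multi-category sentences can be sketched as below: sentences with a single semantic label go to the training pool, while conflicting sentences are excluded from training and development but repeated once per label in the test data; the data layout is an assumption for illustration.

```python
# Sketch of the train/test preparation for participant and organizer
# semantic classification.
def split_multilabel_sentences(sentences_with_labels):
    """sentences_with_labels: list of (sentence, set_of_semantic_labels) pairs."""
    single_label, test_repeats = [], []
    for sentence, labels in sentences_with_labels:
        if len(labels) == 1:
            single_label.append((sentence, next(iter(labels))))
        else:
            # Conflicting sentence: excluded from train/dev, repeated in test.
            test_repeats.extend((sentence, label) for label in labels)
    return single_label, test_repeats
```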

Table 11 shows the data sizes of each semantic participant class and the performance scores in terms of precision, recall, and F1-macro. The average score is .60 for all of these metrics. The performance of the model is best for the militant and activist categories and worst for the politician and small producer categories. Although the data size for a class affects the performance, we observe that the peasant category is predicted better than the politician category, which has more data. Our analysis using a confusion matrix shows that six out of 14 instances of politician annotations are predicted as activist. Thus, any further work on this task should focus both on increasing the data for each class and on revising the annotation schema.

Table 11.
Data size and performance scores for the semantic participant classification at sentence level.
                  Data set                     Performance
                  Train   Development   Test   Precision   Recall   F1-macro
People            234     31            59     .60         .61      .60
Militant          167     22            39     .83         .76      .80
Activist          236     32            62     .61         .70      .65
Peasant           33                           .44         .57      .50
Student           69                    18     .70         .66      .68
Politician        52                    14     .26         .28      .27
Professional      64                    17     .55         .64      .59
Worker            126     17            35     .72         .51      .60
Small Producer    20                           .50         .33      .40
No                838     112           168    .86         .85      .86
Overall average                                .60         .59      .59

The data size for each class and the performance for the organizer semantic classification task are provided in Table 12. The data for this task enable the model to obtain relatively high scores for most of the classes. The average performance scores are around .70 for precision, recall, and F1-macro. The predictions for the militant organization and labor union classes are the most accurate. Conversely, the chamber of professionals and NGO classes are predicted more poorly.

Table 12.
Data size and performance scores for the semantic organizer classification at sentence level.
                           Data set                     Performance
                           Train   Development   Test   Precision   Recall   F1-macro
Militant Organization      55                    12     .90         .75      .81
Political Party            234     31            52     .68         .71      .69
Chamber of Professionals   23                           .25         .40      .30
Labor Union                60                    14     .78         .78      .78
NGO                        119     16            27     .68         .55      .61
No                         1,369   182           275    .93         .94      .93
Overall average                                         .70         .69      .69

The chamber of professionals class does not show any clear pattern in the confusion matrix, whereas the NGO class is mostly confused with the political party and chamber of professionals classes.

Various parts of this corpus were used in shared tasks, held in 2019 and 2020, that were open to participation by any research team. This subsection summarizes the participants' insights pertaining to the quality of the corpus.

The Lab ProtestNews [16, 17] was organized within the Conference and Labs of the Evaluation Forum (CLEF 2019). The evaluation setting required participants to perform unsupervised domain adaptation from India to China on English data. The Lab consisted of three subtasks: document, sentence, and token classification. Although deep learning approaches outperformed traditional ML techniques, the gap between system performances on India and China remained significant. The team ProTestA [43] measured the difference between the data distributions from India and China using Jensen-Shannon (J-S) divergence and the out-of-vocabulary (OOV) token ratio, and observed a significant positive correlation between the J-S similarity scores and the system performances across the three tasks. Moreover, the team DeepNEAT [44] reported nontrivial differences between the longest sentences in the training and test data, which are 440 and 643 tokens, respectively.
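For illustration, the two corpus-distance measures mentioned above can be computed as in the following minimal sketch; the token lists are toy examples and the snippet is not the participants' code.

```python
from collections import Counter
import numpy as np
from scipy.spatial.distance import jensenshannon

def unigram_dist(tokens, vocab):
    # Relative frequency of each vocabulary item in a token list.
    counts = Counter(tokens)
    freqs = np.array([counts[w] for w in vocab], dtype=float)
    return freqs / freqs.sum()

def js_divergence(tokens_a, tokens_b):
    vocab = sorted(set(tokens_a) | set(tokens_b))
    p, q = unigram_dist(tokens_a, vocab), unigram_dist(tokens_b, vocab)
    # scipy returns the J-S distance (square root of the divergence).
    return jensenshannon(p, q, base=2) ** 2

def oov_rate(train_tokens, test_tokens):
    # Share of test tokens that never occur in the training data.
    train_vocab = set(train_tokens)
    return sum(t not in train_vocab for t in test_tokens) / len(test_tokens)

india = "protesters marched in delhi against the bill".split()
china = "workers staged a strike at the factory in guangdong".split()
print(js_divergence(india, china), oov_rate(india, china))
```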

The shared task Event Sentence Coreference Identification (ESCI) was organized within the workshop Automated Extraction of Socio-political Events from News [18]. The ESCI task required participants to group event sentences extracted from the same document when they refer to the same event. One of the participating teams reported that the titles of the news articles co-occurred in the same sentence with meta-data such as publication time and place. This team remedied the issue by using regular expressions to remove the meta-data from these sentences [45].
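As an illustration of this kind of cleanup, the following minimal sketch strips a hypothetical place/date prefix from a title sentence with a regular expression; the pattern and the example string are ours, not those of [45].

```python
import re

# Hypothetical pattern for prefixes such as "NEW DELHI, Aug. 15 - " fused with a title.
META_PREFIX = re.compile(r"^[A-Z][A-Za-z ]+,\s+\w+\.?\s+\d{1,2}\s*[-:]\s*")

def strip_metadata(sentence: str) -> str:
    return META_PREFIX.sub("", sentence).strip()

print(strip_metadata("NEW DELHI, Aug. 15 - Farmers blocked the highway during the protest."))
# -> "Farmers blocked the highway during the protest."
```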

Finally, as supplementary information to Table 8, we report the effect of these configurations on a random sample from four news archives from India. We aggregated the results of the tools on 20,000 randomly sampled news articles from each source. Figure 2 provides the temporal event distribution from July 2002 to January 2018 for each configuration listed in Table 8. We observe a comparable pattern across the sub-figures. However, the intensity of the events, measured in terms of event count, differs across these configurations. Moreover, the Tok and Doc+Sent+Tok configurations, which yield the highest recall and precision respectively, may not be comparable. An evaluation against the real event count, which is beyond the scope of our study, would indicate which configuration to prefer.
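For illustration, the aggregation underlying such temporal distributions can be computed as in the following minimal pandas sketch; the column names and values are made up.

```python
import pandas as pd

# One row per detected event, with its publication date and the pipeline
# configuration (e.g., "Tok" or "Doc+Sent+Tok") that produced it.
events = pd.DataFrame({
    "date": pd.to_datetime(["2002-07-03", "2002-07-19", "2002-08-02"]),
    "configuration": ["Tok", "Doc+Sent+Tok", "Tok"],
})

# Monthly event counts per configuration, comparable to the curves in Figure 2.
monthly = (events
           .set_index("date")
           .groupby("configuration")
           .resample("MS")
           .size()
           .rename("event_count")
           .reset_index())
print(monthly)
```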

Figure 2. Temporal distribution of news from four sources from India.

We introduced a gold standard corpus (GSC) that enables benchmarking and the creation of automated tools for collecting contentious political event-related information across contexts. We have reported in detail the methodology we developed to ensure the quality of the corpus, our observations during its application, and the results obtained with the automated tools created using the corpus. The clear performance drop when the test data differ from the training data, known as the domain or covariate shift problem [46], shows how critical it is to incorporate cross-context aspects into corpus creation, tool development, and evaluation. Handling each context separately, at least in the evaluation phase, is an indispensable part of measuring and improving the reliability of the performance scores; this is achieved by testing the models on data collected from the target contexts. Our recent ML models were mainly created using training data from a single country. The next steps should incorporate data from multiple contexts and engage with domain adaptation techniques at the model creation phase [47, 48].

We have kept track of what is included and excluded at each level to better automate the task and to allow quantification of recall, which has been a notable gap in this field. Restricting data sets by using keywords, or basing a protest knowledge base on a subset of a source for practical reasons, harms the validity and reliability of the resulting data sets. Starting with a random sample and proceeding with recall-optimized active learning during the creation of the gold standard corpus ensures that the training data will improve the quality of the final gold standard data set. We obtained 97% recall and 60% precision when we used recall-optimized ML models to extend the corpus.
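One possible way to make a document classifier recall-oriented, shown below as an assumption rather than our exact procedure, is to select the highest decision threshold on a development set that still reaches the target recall, so that precision is kept as high as the recall constraint allows.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def recall_oriented_threshold(y_true, y_scores, target_recall=0.97):
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    # thresholds has one fewer entry than precision/recall; align and filter.
    valid = [t for p, r, t in zip(precision[:-1], recall[:-1], thresholds)
             if r >= target_recall]
    return max(valid) if valid else 0.0  # highest threshold preserving recall

# Toy development-set labels and classifier scores.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_scores = np.array([0.9, 0.4, 0.8, 0.35, 0.3, 0.2, 0.6, 0.55])
thr = recall_oriented_threshold(y_true, y_scores)
print(thr, (y_scores >= thr).astype(int))
```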

Our approach, based on random sampling from multiple contexts, has enabled us to create a test-bed for reliably measuring the performance of state-of-the-art ML methodology on the target tasks and settings. We have invested in the line of research showing that, in cross-context settings, preparing data from random samples and employing deep learning methodologies such as BERT improves performance compared to using keyword-filtered data sets and traditional ML [49]. However, Yörük et al. [49] reported the performance on random samples only for the document classification task. We will continue this investigation on the sentence classification and event extraction tasks.

The immediate next step following this study is to evaluate models trained with English data from India, China, and South Africa. We have already extended the GSC with news sources in Portuguese and Spanish, and the cross-lingual extension and validation of the models will follow shortly. Other extensions will cover semantic categories such as violent vs. non-violent, urban vs. rural, and economic vs. non-economic demands. Furthermore, we will handle documents and sentences that contain multiple events. The current models assume there is a separate event for each event trigger identified by the token extractor [27]. However, our observations directed us towards identifying and linking the triggers that denote the same event [50, 51]; we will therefore develop tools for linking event triggers about the same event in our pipeline [52, 18]. We will work on news sources in Portuguese and Spanish to cover at least Brazil, Mexico, and Argentina. This work will facilitate the creation of cross-lingual ML models, so that data from one language are expected to enhance the work in a new language [53]. Any error found will be corrected, and the corpus will be extended with new data to avoid its becoming a temporally restricted and biased collection.

A. Hürriyetoğlu ([email protected]) led the study, wrote and submitted the paper, developed text processing tools, and created the machine learning models reported in this study. E. Yörük ([email protected]) determined the need for this effort and, together with Ç. Yoltar ([email protected]) and B. Gürel ([email protected]), operationalized the concept of contentious politics applied in this study, identified the sources used for collecting data, and ensured quality. O. Mutlu ([email protected]) wrote the results of the paper with A. Hürriyetoğlu, collected data, maintained an online annotation platform, ensured the quality of the annotations, developed text processing tools, and created the deep learning models reported in this study. F. Duruşan ([email protected]) wrote the annotation manual and the parts of the paper related to the annotation setting, proofread the paper, and supervised the annotation team. D. Yüret ([email protected]) advised the team on the annotation methodology and state-of-the-art machine learning model creation.

This study is funded by the European Research Council (ERC) Starting Grant 714868 awarded to Dr. Erdem Yörük for his project Emerging Welfare.

International sources were filtered based on meta-information to focus on the case countries.

The period covered in these archives is between 2000 and 2017.

Only publicly accessible online information is processed and shared, in the form of online URLs. We designed our data collection, annotation, and tool development so that they would not yield sensitive information about individuals or information that could be used to target individuals by malicious state actors. The precautions taken are: using and distributing data via URLs, and expressing personal characteristics in terms of broad categories such as student or worker.

This performance was measured on an AL sample that was predicted as positive and on 200 news articles that were excluded from annotation because they were predicted as negative in this sampling operation. The training data consisted of around 4,000 news articles that were randomly sampled and annotated from the same country as the resulting AL batch.

Although planned events and protest threats could have a role in our analysis [24], they are not relevant in the CP context, and their prevalence, which is below 0.5% of a random sample according to our observations, does not allow their automated analysis.

We treat event triggers and any other expressions joined by a hyphen as a single token, e.g., ‘stone-pelting’. When there is no hyphen between the words, as in ‘shouting slogans’, the expression consists of two tokens: the first token is annotated as B-trigger and the following token(s) as I-trigger.
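The following minimal sketch only illustrates this tagging convention; it is not part of our annotation tooling.

```python
def bio_tags_for_trigger(trigger: str):
    # Whitespace tokenization keeps hyphenated expressions as one token,
    # so 'stone-pelting' gets a single B-trigger tag, while multi-word
    # expressions get B-trigger followed by I-trigger tags.
    tokens = trigger.split()
    return list(zip(tokens, ["B-trigger"] + ["I-trigger"] * (len(tokens) - 1)))

print(bio_tags_for_trigger("stone-pelting"))     # [('stone-pelting', 'B-trigger')]
print(bio_tags_for_trigger("shouting slogans"))  # [('shouting', 'B-trigger'), ('slogans', 'I-trigger')]
```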

https://catalog.ldc.upenn.edu/LDC2017T09, accessed on November 25th, 2019.

https://trec.nist.gov/data/reuters/reuters.html, accessed on November 25th, 2019.

Around 10% of the documents annotated as positive at the document level in a random sample report a protest event that does not occur in a country under focus.

An event sentence may contain more than one participant or organizer. Therefore, the event sentence and annotation counts do not match.

Please follow the instructions on the Global Contentious Politics Gold Standard Data (GLOCON GOLD) repository to obtain the corpus: https://github.com/emerging-welfare/glocongold.

https://glocon.ku.edu.tr/, accessed on April 7, 2021.

For our document model, at inference time, splitting the text into sub-parts smaller than 512 tokens and taking the logical OR of each sub-part's prediction as that document's prediction increases the performance by 2–3 F1-macro points compared to using only the first 512 tokens of a document.
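A minimal sketch of this inference strategy is given below, with a hypothetical classify_chunk predictor standing in for the document model.

```python
from typing import Callable, List

def predict_document(tokens: List[str],
                     classify_chunk: Callable[[List[str]], int],
                     max_len: int = 512) -> int:
    # Split the document into consecutive sub-parts of at most max_len tokens
    # and label it positive if any sub-part is predicted positive (logical OR).
    chunks = [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]
    return int(any(classify_chunk(chunk) == 1 for chunk in chunks))

# Usage with any classifier mapping a token list to 0 or 1, e.g.:
# label = predict_document(long_article_tokens, my_bert_classifier_predict)
```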

The international data are our Guardian sample that is filtered using active learning for China.

https://github.com/flairNLP/flair, accessed on April 5, 2020.

We exclude 15 documents that contain events not related to India.

Please find working notes of all participants on http://ceur-ws.org/Vol-2380/, accessed on January 21, 2021.

This is only to temporally compare results of each configuration. Any additional comment on these figures requires further investigation.

The data sets generated and/or analyzed during the current study are not publicly available; we explain in Section 5.3 how researchers should contact us to obtain the whole data set. Supplementary data, including the annotation manuals, are available in the Science Data Bank repository, 10.11922/sciencedb.j00104.00092, under an Attribution 4.0 International (CC BY 4.0) license. The detailed protest data sets are available from the corresponding author on reasonable request.

[1] Giugni, M.G.: Was it worth the effort? The outcomes and consequences of social movements. Annual Review of Sociology 24, 371–393 (1998)
[2] Tarrow, S.: Power in movement: Social movements, collective action and politics. Cambridge University Press, New York (1994)
[3] Chenoweth, E., Lewis, O.A.: Unpacking nonviolent campaigns: Introducing the NAVCO 2.0 dataset. Journal of Peace Research 50(3), 415–423 (2013)
[4] Weidmann, N.B., Rød, E.G.: The Internet and political protest in autocracies, chapter coding protest events in autocracies. Oxford University Press, Oxford (2019)
[5] Raleigh, C., et al.: Introducing ACLED: An armed conflict location and event dataset: Special data feature. Journal of Peace Research 47(5), 651–660 (2010)
[6] Yörük, E.: The politics of the Turkish welfare system transformation in the neoliberal era: Welfare as mobilization and containment. PhD dissertation, The Johns Hopkins University, Baltimore (2012)
[7] Nardulli, P.F., Althaus, S.L., Hayes, M.: A progressive supervised-learning approach to generating rich civil strife data. Sociological Methodology 45(1), 148–183 (2015)
[8] Leetaru, K., Schrodt, P.A.: GDELT: Global data on events, location, and tone, 1979–2012. In: Annual Meeting of the International Studies Association, pp. 1–49 (2013)
[9] Boschee, E., Natarajan, P., Weischedel, R.: Automatic extraction of events from open source text for predictive forecasting. In: Subrahmanian, V.S. (ed.) Handbook of Computational Approaches to Counterterrorism, pp. 51–67. Springer, New York (2013)
[10] Schrodt, P.A., Beieler, J., Idris, M.: Three's a charm?: Open event data coding with EL: Diablo, Petrarch, and the open event data alliance. In: Annual Meeting of the International Studies Association, pp. 1–24 (2014)
[11] Sönmez, C., Özgür, A., Yörük, E.: Towards building a political protest database to explain changes in the welfare state. In: Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pp. 106–110. Association for Computational Linguistics, Stroudsburg (2016)
[12] Wang, W., et al.: Growing pains for global monitoring of societal events. Science 353(6307), 1502–1503 (2016)
[13] Ettinger, A., et al.: Towards linguistically generalizable NLP systems: A workshop and shared task. In: Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems, pp. 1–10. Association for Computational Linguistics, Stroudsburg (2017)
[14] Ward, M.D., et al.: Comparing GDELT and ICEWS event data. Event Data Analysis 21(1), 267–297 (2013)
[15] Lorenzini, J., et al.: Towards a dataset of automatically coded protest events from English-language newswire documents. Paper presented at the Amsterdam Text Analysis Conference (2016)
[16] Hürriyetoğlu, A., et al.: A task set proposal for automatic protest information collection across multiple countries. In: Azzopardi, L., et al. (eds.) Advances in Information Retrieval, pp. 316–323. Springer International Publishing, Cham (2019)
[17] Hürriyetoğlu, A., et al.: Overview of CLEF 2019 Lab ProtestNews: Extracting protests from news in a cross-context setting. In: Crestani, F., et al. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction, pp. 425–432. Springer International Publishing, Cham (2019)
[18] Hürriyetoğlu, A., et al.: Automated extraction of socio-political events from news (AESPEN): Workshop and shared task report. In: Proceedings of the Workshop on Automated Extraction of Socio-political Events from News, pp. 1–6. European Language Resources Association (ELRA), Luxemburg (2020)
[19] Croicu, M., Weidmann, N.B.: Improving the selection of news reports for event coding using ensemble classification. Research & Politics 2(4), 2053168015615596 (2015)
[20] Hanna, A.: MPEDS: Automating the generation of protest event data. Available at: osf.io/preprints/socarxiv/xuqmv. Accessed 5 March 2021
[21] Makarov, P., et al.: Towards automated protest event analysis. In: New Frontiers of Automated Content Analysis in the Social Sciences, pp. 1–14 (2015). Available at: https://doi.org/10.5167/uzh-143877. Accessed 5 March 2021
[22] Doddington, G., et al.: The automatic content extraction (ACE) program – tasks, data, and evaluation. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04), pp. 837–840. European Language Resources Association (ELRA), Luxemburg (2004)
[23] Mitamura, T., Liu, Z.Z., Hovy, E.H.: Overview of TAC-KBP 2015 event nugget track. In: Proceedings of the 2015 Text Analysis Conference, TAC 2015. Available at: https://tac.nist.gov/publications/2015/additional.papers/TAC2015.KBP_Event_Nugget_overview.proceedings.pdf. Accessed 5 March 2021
[24] Gerner, D.J., et al.: Conflict and mediation event observations (CAMEO): A new event data framework for the analysis of foreign policy interactions. Paper presented at the International Studies Association (2002)
[25] Huang, R.H., et al.: Distinguishing past, on-going, and future events: The EventStatus corpus. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 44–54. Association for Computational Linguistics, Stroudsburg (2016)
[26] Makarov, P., Lorenzini, J., Kriesi, H.: Constructing an annotated corpus for protest event mining. In: Proceedings of the First Workshop on NLP and Computational Social Science, pp. 102–107. Association for Computational Linguistics, Stroudsburg (2016)
[27] Weischedel, R., Boschee, E.: What can be accomplished with the state of the art in information extraction? A personal view. Computational Linguistics 44(4), 651–658 (2018)
[28] Tanev, H., Piskorski, J., Atkinson, M.: Real-time news event extraction for global crisis monitoring. In: Kapetanios, E., Sugumaran, V., Spiliopoulou, M. (eds.) Natural Language and Information Systems, pp. 207–218. Springer, Berlin (2008)
[29] Johnson, E.W., Schreiner, J.P., Agnone, J.: The effect of New York Times event coding techniques on social movement analyses of protest data. In: Hancock, L.E. (ed.) Narratives of Identity in Social Movements, Conflicts and Change, pp. 263–291. Emerald Group Publishing, Bingley (2016)
[30] Parker, R., et al.: English Gigaword (5th ed.). Linguistic Data Consortium, Philadelphia (2011)
[31] Strawn, K.D.: Protest records, data validity, and the Mexican media: Development and assessment of a keyword search protocol. Social Movement Studies 9(1), 69–84 (2010)
[32] Maney, G.M., Oliver, P.E.: Finding collective events: Sources, searches, timing. Sociological Methods & Research 30(2), 131–169 (2001)
[33] Settles, B.: Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison (2009)
[34] Lewis, D.D., et al.: RCV1: A new benchmark collection for text categorization research. The Journal of Machine Learning Research 5, 361–397 (2004)
[35] Jenkins, J.C., Jacobs, D., Agnone, J.: Political opportunities and African-American protest, 1948–1997. American Journal of Sociology 109, 277–303 (2003)
[36] Krippendorff, K., et al.: On the reliability of unitizing textual continua: Further developments. Quality and Quantity 50(6), 2347–2364 (2016)
[37] Denis, P., Baldridge, J.: Global joint models for coreference resolution and named entity classification. Procesamiento del Lenguaje Natural 42, 87–96 (2009)
[38] Pradhan, S., et al.: CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In: Joint Conference on EMNLP and CoNLL - Shared Task (CoNLL '12), pp. 1–40. Association for Computational Linguistics, Stroudsburg (2012)
[39] van Gompel, M., Reynaert, M.: FoLiA: A practical XML format for linguistic annotation - a descriptive and comparative study. Computational Linguistics in the Netherlands Journal 3, 63–81 (2013)
[40] Devlin, J., et al.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186. Association for Computational Linguistics, Stroudsburg (2019)
[41] Akbik, A., Blythe, D., Vollgraf, R.: Contextual string embeddings for sequence labeling. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 1638–1649. Association for Computational Linguistics, Stroudsburg (2018)
[42] Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL. Association for Computational Linguistics, Stroudsburg (2003)
[43] Basile, A., Caselli, T.: ProTestA: Identifying and extracting protest events in news. Notebook for ProtestNews Lab at CLEF 2019. In: CLEF (Working Notes) (2019)
[44] Basar, E., Ekiz, S., van den Bosch, A.: A comparative study on generalizability of information extraction models on protest news. In: CLEF (Working Notes) (2019)
[45] Örs, F.K., Yeniterzi, S., Yeniterzi, R.: Event clustering within news articles. In: Proceedings of the Workshop on Automated Extraction of Socio-political Events from News 2020, pp. 63–68. European Language Resources Association (ELRA), Luxemburg (2020)
[46] Storkey, A.J., Sugiyama, M.: Mixture regression for covariate shift. In: Proceedings of the 19th International Conference on Neural Information Processing Systems (NIPS'06), pp. 1337–1344. MIT Press, Cambridge (2006)
[47] Li, D., et al.: Learning to generalize: Meta-learning for domain generalization. In: AAAI Conference on Artificial Intelligence 2018. Available at: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16067/16547. Accessed 5 March 2021
[48] He, R.D., et al.: Adaptive semi-supervised learning for cross-domain sentiment classification. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3467–3476. Association for Computational Linguistics, Stroudsburg (2018)
[49] Yörük, E., et al.: Random sampling in corpus design: Cross-context generalizability in automated cross-national protest event collection. To appear in American Behavioral Scientist (2021)
[50] Ruppenhofer, J., et al.: SemEval-2010 Task 10: Linking events and their participants in discourse. In: Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval'10), pp. 45–50. Association for Computational Linguistics, Stroudsburg (2010)
[51] Gabbard, R., Freedman, M., Weischedel, R.: Coreference for learning to extract relations: Yes Virginia, coreference matters. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 288–293. Association for Computational Linguistics, Stroudsburg (2011)
[52] Lu, J., Ng, V.: Event coreference resolution: A survey of two decades of research. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), pp. 5479–5486. International Joint Conferences on Artificial Intelligence Organization, Menlo Park (2018)
[53] Conneau, A., Lample, G.: Cross-lingual language model pretraining. In: Wallach, H., et al. (eds.) Advances in Neural Information Processing Systems 32, pp. 7057–7067. Available at: http://papers.nips.cc/paper/8928-cross-lingual-language-model-pretraining.pdf. Accessed 5 March 2021
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.