Author Mentions in Science News Reveal Widespread Disparities Across Name-inferred Ethnicities

Media outlets play a key role in spreading scientific knowledge to the general public and raising the profile of researchers among their peers. Yet, how journalists choose to present researchers in their stories is poorly understood. Using a comprehensive dataset of 223,587 news stories from 288 U.S. outlets reporting on 100,486 research papers across all areas of science, we investigate whether the authors' ethnicities, as inferred from names, are associated with whether journalists explicitly mention them by name. By focusing on research papers news outlets chose to cover, our analysis reduces concerns that differences in name mentions are driven by differences in research quality or newsworthiness. We find substantial disparities in name mention rates across ethnically distinctive names. Researchers with non-Anglo names, especially those with East Asian and African names, are significantly less likely to be mentioned in news stories covering their research, even when comparing stories from a particular news outlet reporting on publications in a particular scientific venue on a particular research topic. The disparities are not fully explained by authors' affiliation locations, suggesting that pragmatic factors such as difficulties in scheduling interviews play only a partial role. Furthermore, among U.S.-based authors, journalists more often use authors' institutions instead of names when referring to non-Anglo-named authors, suggesting that journalists' rhetorical choices are also key. Overall, this study finds evidence of ethnic disparities in how researchers are described in the media coverage of their research, likely affecting thousands of non-Anglo-named scholars in our data alone.


Introduction
Scientific breakthroughs often attract media attention, which serves as a key mechanism for public dissemination of new knowledge (1,2). Science reporting not only distills research insights but also puts a face on who was responsible for the research. The media coverage can then feed back into researchers' careers (3). Furthermore, science reporting may over time shift the public's perception of who a scientist is (4). Under-representing particular demographic groups can perpetuate the view that scientists are white males (5,6), and potentially weaken the pipeline of recruiting diverse students into academic careers (7)(8)(9).
Disparities in media attention to science can be separated into (i) the coverage amount, i.e., whose paper gets covered, and (ii) the coverage type, i.e., how it is covered, conditional on the paper already being covered (e.g., does the story also mention the author by name?). These two kinds of coverage entail different processes and are affected by different factors. For instance, the coverage amount can be affected by the newsworthiness of the paper, which, however, might have limited effect on how journalists engage with the author once the paper is deemed worthy of coverage. It is therefore important to distinguish the two kinds of disparities and examine them separately. Here we investigate the latter mechanism-the rate at which scientists are mentioned by name in the media coverage of their research. Thus our study focuses on how, rather than whether, reporters chose to cover a scientific paper.
In writing about specific scientific developments, journalists face choices over how much attention to devote to each relevant researcher, and whom to ignore altogether. Empirical and theoretical literature motivates the possibility that ethnic disparities exist in journalists' choices of whom to feature and the nature of the resulting coverage (10)(11)(12).
Theoretically, we hypothesize a number of mechanisms that may produce ethnic disparities. First, authors affiliated with institutions outside the U.S. may be harder for U.S.-based journalists to interview, for example due to time-zone differences. Second, even for authors located within the same geographical region (e.g., in the U.S.), certain authors may have limited proficiency in spoken English. Furthermore, journalists may rely on their professional networks to contact sources. Analyses of the media landscape in the U.S. (33,34) and other markets (35) show that the demographics of journalists and editors are highly unrepresentative of the broader populations. The demographics of journalists are likely to correlate with those of individuals in their professional networks (36), suggesting that the researchers journalists can reach most readily are also unrepresentative. To the extent that these pragmatic factors-interviewing difficulties and professional networks-correlate with the perceived ethnicities of names, certain researchers may be mentioned more or less often.
Third, while science journalists aim to write stories that appear credible to their audiences (37), they may lack direct information on the credibility of authors of the relevant research papers and may not have the time to acquire such information. Facing unfamiliar names and time constraints, journalists may rely on stereotypes, inferring for example that some researchers are less competent or authoritative on some topics than others, or expecting their audiences to harbor such perceptions. Prior research has found such stereotyping in the context of researcher gender and gender-typical research topics (38). Inferences of competence and authoritativeness can lead journalists to choose some names over others, which is a form of statistical discrimination (39,40).
Fourth, journalists may not be the relevant actors at all. Some news coverage originates from press releases created by in-house public relations staff at universities. News outlets often reprint these press releases in part or in full, and any disparities therein may thus be passed on directly to the outlets and their audiences.
Here, we present the first large-scale and science-wide analysis of ethnic disparities in author mentions in science news and the mechanisms producing them. We use a computational analysis of 223,587 news stories mentioning 100,486 published papers to test for disparities in the type of media coverage by examining whether the covered paper's authors are mentioned by name (see Data and Methods). For each paper, we focus on authors at the highest "risk" of being mentioned in the story: first author, last author, and any authors designated as "corresponding".
By focusing on papers that already were deemed newsworthy, our research design side-steps the question of whose research is covered in the news in the first place, choices which may themselves be associated with ethnicity.
We use mixed-effects regression models to control for a broad range of plausible confounding factors, including affiliation location, author prestige, authorship position, and corresponding author designation (see Data and Methods). The models also enable us to measure differential mentions within a particular news outlet covering a particular academic journal on a particular research topic, which helps ensure that we are comparing media mentions of researchers doing comparable work. Nevertheless, these models cannot provide conclusive causal evidence of ethnic discrimination by journalists or other actors.
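In stylized form, the approach amounts to a mixed-effects logistic regression. The following is only a sketch with illustrative symbols of our choosing; the exact specification (Model 5) and variable definitions are given in Data and Methods and SI Table S5:

```latex
\operatorname{logit}\,\Pr(\text{mentioned}_{spa}) =
    \beta_{\mathrm{eth}(a)} + \beta_{\mathrm{gen}(a)} + \gamma^{\top} x_{spa}
    + u_{\mathrm{outlet}(s)} + v_{\mathrm{venue}(p)} + w_{\mathrm{topic}(p)}
```

where indices $s$, $p$, $a$ denote the story, paper, and author of a triplet; $x_{spa}$ collects observed covariates (affiliation location, author prestige, authorship position, corresponding-author designation, mention year); and $u$, $v$, $w$ are random intercepts for outlet, venue, and topic that absorb outlet-, venue-, and topic-level variation in baseline mention rates.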
We use the term "ethnicity" rather than race or nationality for three reasons: (1) race is commonly perceived through physical appearance, which journalists cannot observe from a name; (2) a person's nationality is largely unknown and fluid, especially in the U.S.; (3) journalists typically have access only to author names upon reading the paper, and names reflect cultural heritage, which is more closely related to ethnicity and signals richer information than race.
Lacking information about authors' self-identities, we base our study on perceived ethnicity, a construct we distinguish from authors' true ethnicity. This research choice entails substantial trade-offs. Authors' self-identities may differ from their perceived ones, and some authors may self-identify with more than one ethnicity. In some cases, journalists know authors' self-identified ethnicity. In many cases, however, journalists will not know how authors self-identify and instead infer ethnicity from names. In these cases, using authors' self-identities would be problematic, as it would misrepresent the actual perceptions journalists form and possibly use when they write their stories.
We algorithmically inferred the perceived ethnicity from authors' names, which mirrors how a reader might perceive ethnicity based on regularities in where the name originates. This choice may introduce bias because algorithmic inference of ethnicity is not perfectly aligned with human perceptions (41), a limitation we return to in the discussion.
Our adoption of the perceived-ethnicity construct and its operationalization via names have three merits: (1) it enables us to measure disparities in the information environment journalists actually face, and is thus more likely to illuminate their decision processes; (2) the construct of perceived ethnicity inferred via names has been widely used for decades in audit studies that use names to signal ethnicity or race to evaluators (42)(43)(44)(45); (3) the literature suggests that self-identified and perceived ethnicity are highly correlated, and that humans can and do infer ethnicity from names fairly accurately (45)(46)(47)(48)(49).
Overall, we do not measure true ethnicity, but rely on perceived identity inferred from names. Our conclusions should therefore be interpreted as reflecting disparities among scientists with name-inferred ethnicities rather than self-identified ethnicities directly. However, we provide some evidence that results are consistent when coding author names with racial self-identities using U.S. census data (SI Fig. S4).

Who Gets Mentioned?
We find substantial and widespread disparities in author mentions across name-inferred ethnicities. These disparities are robust to the inclusion of increasingly stringent controls (Model 5 in SI Table S5). Specifically, compared to authors with British-origin names, most authors with minority-ethnicity names are significantly less likely to be mentioned, with European names disadvantaged the least and East Asian and African names the most.
In contrast to ethnicity, we find no disparity in author mentions across genders. However, when fixed effects for paper keywords are omitted, the author gender variable appears to have a significant effect (Model 3 in SI Table S5). As gender representation varies widely across academic disciplines (17,50), this result suggests that apparent gender differences in mention rates are likely explained by differences in mention rates across research fields with different gender compositions.
To quantify ethnic disparities in mentions, we calculated the average marginal effects for the author ethnicity and gender variables using the fullest model (Model 5 in Data and Methods). The results are shown in Fig. 1.
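For a nonlinear model such as a logit, the average marginal effect (AME) of a binary variable is the change in predicted probability when the variable is switched on versus off, averaged over all observations. The following minimal illustration of that computation uses toy coefficients and data of our own, not the paper's actual model:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def average_marginal_effect(rows, coefs, dummy):
    """AME of a binary indicator in a logistic model: for every observation,
    predict with the indicator switched on vs. off, then average the
    difference in predicted probabilities."""
    diffs = []
    for row in rows:
        on = dict(row, **{dummy: 1.0})
        off = dict(row, **{dummy: 0.0})
        p_on = sigmoid(sum(coefs[k] * v for k, v in on.items()))
        p_off = sigmoid(sum(coefs[k] * v for k, v in off.items()))
        diffs.append(p_on - p_off)
    return sum(diffs) / len(diffs)

# Toy model: intercept, one control, and a hypothetical ethnicity dummy.
coefs = {"const": -1.0, "control": 0.5, "east_asian": -0.4}
rows = [
    {"const": 1.0, "control": 0.0, "east_asian": 0.0},
    {"const": 1.0, "control": 1.0, "east_asian": 1.0},
]
ame = average_marginal_effect(rows, coefs, "east_asian")  # negative: the dummy lowers mention probability
```

A negative AME here would correspond to a percentage-point drop in the predicted mention rate, the unit reported in Fig. 1.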

U.S.-based Authors
In science reporting, journalists often directly seek out the authors by phone or email to contextualize and explain their results. If an author is at a non-U.S. institution, a journalist from a U.S.-based outlet could be less likely to reach out due to challenges in time-zone differences or lower expectations of fluency, potentially resulting in a lower rate of being mentioned or quoted.
Indeed, our previous result shown in Fig. 1 (based on Model 5) already controls for the author's affiliation location. The model indicates that international scientists are significantly less likely to be mentioned than their U.S.-based counterparts of the same ethnicity (see the affiliation location coefficient in SI Table S5), suggesting that affiliation location is one major factor influencing the mention probability.
However, the same regression also suggests that location drives only part of the mention disparities, as disparities between minority- and British-origin-named authors persist conditional on authors being in the same geographical location (whether inside or outside the U.S.). In other words, the chance of being mentioned is not entirely determined by whether the author is in the U.S.; if it were, we would see no ethnic disparity after controlling for location. Among U.S.-based authors, we further find lower quotation rates for authors with East Asian-associated and African-associated names. We note that this result suggests, but does not prove, that perceived fluency is a driving mechanism, as other mechanisms, such as the rhetorical value of names, may also produce this result.
To more directly test the rhetorical mechanism, we examine "institution-substitution", where the author is mentioned by their institution but not by name (see Data and Methods), e.g., a story that refers to "researchers at" a given university without naming them. We find that such substitution is more common for authors with East Asian- and African-associated names (Fig. 2c).

Disparities Across Outlet Types

Surprisingly, the ethnic disparities in mention rates remain consistent across all outlet types, as shown in Fig. 3, with authors of non-British-origin names being mentioned less frequently.
Larger disparities are found for ethnic categories that are more culturally distant from British-origin names (e.g., East Asian and African). Although the three outlet types show similar absolute disparities, they vary substantially in relative scale, as the average mention rates of Science & Technology outlets and General News outlets are 34.0%-61.9% lower than those of Press Releases outlets (SI Table S4).
The disparity in Press Releases outlets is particularly notable, as stories in these outlets typically reuse content from university press releases, suggesting that universities' press offices themselves, while less biased than other outlet types, still preferentially mention scholars with British-origin names. This result is unexpected because local press offices are expected to have greater direct familiarity with their researchers, to be less prone to stereotyping, and to bear more responsibility for representing minority researchers equitably.
The largest disparities are seen in General News outlets, e.g., The New York Times and The Washington Post, where scholars with African- and Chinese-associated names again show a 6.0-8.0 percentage-point drop in mention rates. General News outlets mention authors with a 24.2% chance on average (SI Table S4), so this drop cuts the mention rate of a large community of scientists to roughly two-thirds of that average. As General News outlets have well-trained editorial staff and science journalists dedicated to accurately reporting science, and tend to publish longer stories with room to mention and engage with authors, this result is alarming. Historically, these ethnic minorities have been underrepresented, stereotyped, or avoided altogether in U.S. media (27), a pattern that evidently persists in science reporting across all outlet types. The mechanisms of this variation deserve further investigation.
Is the Situation Getting More Equitable?
The longitudinally rich nature of our dataset allows us to examine how author mentions in science news have changed over the last decade. Mention rates are on average decreasing over time, as shown by the coefficient of the scaled mention-year variable in Model 5 (SI Table S5).
To examine the time trends across demographic categories, a separate Model 5 was trained to quantify the marginal change per year for each gender and ethnicity in our full data. Note that demographic attributes not under study were still included in each model; e.g., when examining the temporal changes in mention rates for male and female authors, ethnicity was still included as a factor, and vice versa.
As shown in Fig. 4, the mention year has a negative association with author mention rates for all gender and ethnic groups, and the larger decrease for British-origin names indicates that their overall advantage is shrinking. Indeed, authors with non-Chinese East Asian names, one of the most disadvantaged groups in this study, show the smallest decrease.
However, the estimated rates of change are relatively small for most ethnic groups, suggesting that the existing disparities are unlikely to disappear in the short term without intentional behavior change. Since the model assumes a nonlinear relationship between the mention year and an author's mention probability, we are unable to make broader predictions as to when mention equality will eventually be reached. We also refrain from adopting more sophisticated time-series models to forecast the trajectory of mention rates in the long run, because such extrapolation would be of little practical use, especially given that long-term changes in academia and media practices remain unforeseeable.

Discussion
Our analyses reveal that the attention researchers get in news mentions is strongly related to the ethnicities associated with their names. The effects are robust to a variety of plausible confounds, and even appear when controlling for the (1) particular news outlet, (2) particular scientific venue, and (3) particular research topic. Although we cannot claim that the reported effects are causal, this unusually strong observational evidence deserves further attention.

Ethnicity and Gender
Authors with most non-British-origin names are mentioned substantially less when their research is covered in science news. Mention rates are especially low for East Asian and African names; the disparity is less pronounced for European names, and even less pronounced for Indian and Middle Eastern names. As science becomes more global and is increasingly driven by authors of non-Western ethnicities, the way English-language media respond to non-British-named scholars will only grow in importance.
In contrast to ethnicity, we do not find gender disparities in mentions of scholars once research fields are controlled for. One possible reason is that fields vary both in their overall mention rates and in their gender representation (50); looking within fields masks gender disparities that may exist between them. We note that this result may not apply to Asian-named authors, as their gender is often classified as "Unknown" based on names.

Ruling in and out different mechanisms
Our analyses point to a multi-causal generation of ethnic disparities, in which both pragmatic difficulties of interviewing researchers (location and perceived fluency) and journalists' tastes regarding names' rhetorical values play key roles.
In support of pragmatic difficulties, we find that international locations (which tend to host scholars with more non-British-origin names) have a negative effect on mention rates (SI Table S5). However, location is not the only driving mechanism, as disparities still occur when controlling for location (Fig. 1). Additional evidence is that disparities persist among both international and U.S.-based authors, which would not be the case if location were the decisive factor (Table 1). In support of the perceived-fluency mechanism, we find that ethnic disparities appear in direct quotations among U.S.-based authors. These authors are unlikely to suffer from time-zone difficulties in scheduling interviews, but may differ in their perceived English fluency based on their names (Fig. 2b).
In addition to these pragmatic factors, journalists' rhetorical choices are key. In support of this, journalists are more likely to "substitute" a direct name mention with the researcher's institution for authors with East Asian and African names (Fig. 2c), suggesting that the context of the discovery is important, but that the institution serves the journalists' rhetorical goals better than the name. Additional evidence comes from outlet types: when journalists' role in the news articles is minimal-when the outlet simply republishes a university press release-the (relative) disparities are also minimal; when the news stories are written by journalists themselves, the (relative) disparities are the largest. However, we note that the disparities in Press Releases outlets also suggest that journalists are not the only actors behind the inequality.

Limitations
Although the scale and breadth of our dataset enable the use of unusually fine-grained controls, the analysis is not without limitations. First, the observational nature of the data precludes strong causal statements. Second, the analysis was conducted with perceived ethnicities, which do not accurately reflect self-identities, nor account for multi-ethnic identities. We hope our work stimulates the collection of such data where possible, to enable more accurate and fine-grained conclusions (52). Relatedly, a key limitation of our design, shared with the voluminous audit-study literature, must be acknowledged: such studies do not measure whether journalists actually form an inference of ethnicity when seeing names. We believe assuming that they form such inferences is reasonable and is supported by the large empirical disparities we observe here. More direct evidence on journalists' decision processes is a fruitful direction for future research. In addition, we inferred perceived ethnicity via a name-based classifier, Ethnea. Although journalists, like the classifier, may have no information about authors except their names, the inference will undoubtedly not match all actual human perceptions of the authors. Furthermore, the classifier is unable to identify key demographic groups, such as African American scholars. Nevertheless, as an exploratory test, we repeated our analysis using an additional classification of race defined in the U.S. census data (SI Fig. S4), which includes "Black" as one of the labels. The result does not show statistically significant underrepresentation of Black scholars relative to "White." Note that African-named authors (based on Ethnea) are not necessarily classified as "Black" based on the Census data (SI Tables S6-S7).
Third, some plausible covariates are unavailable for inclusion, such as the number of citations a paper had received at the time of being mentioned. However, we anticipate the effect of such covariates to be small given the current controls: SI Fig. S1 shows that the majority of papers were mentioned within one year of publication, which limits the citations a paper can accrue in such a short academic time span.
Fourth, we did not test other potential mechanisms. For instance, reporters often choose whom to interview based on who is listed as the corresponding author. Although our model controls for corresponding-author status (SI Table S5), which author of a paper is designated as corresponding-and whose contribution is seen as deserving of formal authorship at all-may itself be a product of structural discrimination with respect to authors' demographics. Thus disparities seen in the press may be partly driven by decades or centuries of decisions that are ingrained in publishing practices and institutions (53).
Fifth, our data contain too few examples of some ethnicities (e.g., Polynesian and Caribbean) to accurately estimate disparities; such ethnicities are regrettably omitted, though we recognize that these groups likely experience disparities stemming from their minority status as well.
Sixth, our study has focused solely on the reporting behavior of U.S.-based news outlets. Many of these outlets are global in reach, and mentions in them often serve as markers of prestige for scholars. However, the outlets' behavior may not be representative of broader media coverage practices. At present, only U.S.-based outlets are covered in the Altmetric data in sufficient quantity to control for potential confounds and alternative explanations (62% of all news mentions are from U.S.-based outlets), which is critical for our study's design. Nevertheless, such bias is likely not unique to one country, and additional global data are necessary to move beyond a U.S. focus and study country-specific and global journalistic practices.
Lastly, this research relies on large-scale datasets and algorithms that may themselves encode systemic social inequalities. For instance, which venues are considered "mainstream" and therefore worthy of tracking by Altmetric may be the outcome of racial inequities (54). Which groups the algorithms choose to identify as distinct groups are choices that may reflect long histories of racialization seen through a "white racial frame" (55,56). The availability of data also drove our focus on English-language science and media, thereby accumulating more activity around certain settings over others. We believe these limitations place substantial scope conditions on the findings.

Conclusions and Implications
Our work shows that science journalism is rife with disparities in which authors receive name attribution, with authors from certain ethnic groups receiving many more name mentions and quotations than their peers conducting comparable research. These ethnic disparities likely have direct negative consequences for the careers of unmentioned scientists, and skew the public perception of who a scientist is-a key factor in recruiting and training new scientists.
Our findings have two implications for science policy and science journalism. First, bringing attention to large-scale ethnic disparities in author mentions in science news, of which journalists may themselves have been unaware, can itself be an agent of change. Second, decision-makers at U.S. research institutions may take these ethnic disparities into account when making hiring or promotion decisions. More importantly, addressing this problem requires further research into the mechanisms producing it, which we hope this paper helps stimulate.

Data and Methods
To test for and quantify gender and ethnic disparities in author mentions, we constructed a massive dataset by combining news stories with metadata for the scientific papers they cover, and then inferring demographic attributes of the papers' authors based on their names. Journalists can mention several authors when covering a paper in a news story. Since the first author and the last author often contribute most to the work and are recognized as such in science journalism guidelines (57), we include them in our analysis by default. We also include any additional corresponding author of a paper. We treat each (story, paper, author) triplet as an observation in the regression, with 524,052 observations in total.

To control for the effects of journalists' ethnicity and gender, we first used the newspaper Python package (https://github.com/codelucas/newspaper) to extract journalists' names from the retrieved HTML news content. Since not all stories in each outlet contain journalist information, and the newspaper package does not work perfectly for every story that does, we focused on the top 100 outlets (ranked by story count). With manual inspection, we verified that the package can consistently and reliably identify journalists' names for 41 of the top 100 outlets. We excluded extracted names containing words signaling institutions and organizations (such as "University", "Hospital", "World", "Arxiv", "Team", "Staff", and "Editors"). We also cleaned names by removing prefix words such as "PhD.", "M.D.", and "Dr.". We eventually obtained the journalist's name for 100,163 news stories (18.1% of all cleaned stories) across 41 outlets. Note that we did not drop any data where the journalist's name is missing; when coding journalists' gender and ethnicity, we assigned "Unknown" to those missing names.
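The filtering and cleaning steps above can be sketched as follows. The word lists shown are only the examples named in the text, not the full filters used in the paper:

```python
# Words that signal an institution rather than a person (example subset
# from the text; the paper's actual filter list is longer).
ORG_WORDS = {"university", "hospital", "world", "arxiv", "team", "staff", "editors"}
# Prefix words stripped before demographic inference (again a subset).
PREFIXES = {"dr.", "dr", "phd.", "ph.d.", "m.d.", "md", "prof.", "prof"}

def clean_journalist_name(raw):
    """Return a cleaned personal name, or None if the extracted byline
    looks like an organization rather than a person."""
    tokens = raw.strip().split()
    # Reject bylines naming an institution instead of a journalist.
    if any(t.lower().strip(",") in ORG_WORDS for t in tokens):
        return None
    # Strip honorific prefix words such as "Dr." or "PhD.".
    while tokens and tokens[0].lower() in PREFIXES:
        tokens = tokens[1:]
    return " ".join(tokens) if tokens else None
```

Bylines rejected or emptied by this step would be coded as "Unknown" for the journalist's gender and ethnicity, matching the handling of missing names.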

Retrieving Paper Metadata
The Altmetric database does not contain detailed author information, and therefore an additional dataset is needed to identify the authors of mentioned papers. We used the Microsoft Academic Graph (MAG) data (58), matching papers from MAG based on DOIs (matching on lower-cased strings); the matched papers were mentioned by 472,762 stories from 288 outlets. MAG also provides rich metadata for papers, including author names, author rank, author affiliations, affiliation rank, publication year, publication venue, the paper abstract, and topical keywords. As all of this information is used in our regression models, we excluded papers with missing metadata, as well as story-paper-author triplets from rare ethnicity groups, leaving us with 100,486 papers in the final dataset.

Story-Paper-Author Triplets and Corresponding Authors
We further used the Web of Science database (2019 version) to retrieve, based on DOIs, the corresponding authors for 86.0% of the papers in the final dataset. The remaining papers are mainly from disciplines, such as computer science, that lack the norm of designating corresponding authors.
We focused on the authors whom journalists are most likely to mention by name when covering a paper in a news story: the first author, the last author, and any middle author designated as a corresponding author (note that the first and last authors can be corresponding as well). Some papers may have equally contributing first authors, but our data do not record this information; we estimate that such cases are rare.
For solo-author papers, we included the single author in the analyses. Papers in the few research fields that commonly use alphabetical authorship ordering are also included, as journalists may be unfamiliar with this norm. To examine whether a specific author is mentioned, we treated each (story, paper, author) triplet as an observation in the regression.
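The construction of the "at-risk" author set and the resulting regression observations can be sketched as follows (identifiers and data shapes are illustrative, not the paper's actual schema):

```python
def at_risk_authors(authors, corresponding):
    """Return the authors 'at risk' of being mentioned: the first author,
    the last author, and any corresponding authors, without duplicates.
    Solo-author papers yield a single entry."""
    if not authors:
        return []
    risk = [authors[0]]
    if authors[-1] != authors[0]:
        risk.append(authors[-1])
    # Add corresponding authors (including middle authors) not already listed.
    for a in authors:
        if a in corresponding and a not in risk:
            risk.append(a)
    return risk

def make_triplets(story_id, paper_id, authors, corresponding):
    """One regression observation per (story, paper, author) triplet."""
    return [(story_id, paper_id, a) for a in at_risk_authors(authors, corresponding)]
```

Applied over every (story, paper) mention pair, this expansion yields one row per at-risk author, matching the triplet counts reported above.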

Inferring Author and Journalist Gender and Ethnicity
As authors' gender and ethnicity are not directly available, we relied on the inferred demographic associations of their names. While such inferences could be inaccurate relative to how authors self-identify, self-identities are generally not available to journalists either. Instead, classifier-based predictions of gender and ethnicity reflect stereotypical norms of the expected demographics given a name-norms that journalists are likely to share and unconsciously use when first examining the author names of a paper and deciding whom to mention. Therefore, while imperfect, we based our study on these inferred attributes.
Gender and ethnicity were inferred using the Ethnea API (59), which is specifically designed for use in bibliometric settings like ours. We grouped the 24 individual ethnicities observed in Ethnea's output into higher-level categories, as described below. The library makes its predictions based on nearest-neighbor matches on authors' first and last names using the PubMed database of scholars' countries of origin, which offers superior performance over alternative approaches (60,61).
Author names in the MAG have varying degrees of completeness. While most have a first name and surname, special care was taken in three cases: (1) If the name has a single word (e.g., Curie), the ethnicity and the gender were both set to Unknown, as Ethnea requires at least an initial; single-word names occurred for 208 authorships in the final dataset. (2) If the name has an initial and a surname (e.g., M. Curie), we fed it directly into the API, which provides an ethnicity inference but returns Unknown for gender due to the inherent ambiguity. (3) If the name has three or more words, we took the first word as the given name and the last word as the surname; however, if the first word is an initial and the second word is not an initial, we took the second word as the given name (e.g., M. Salomea Curie would become Salomea Curie) to improve prediction accuracy and retrieve a gender inference.
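The three name-handling rules above can be expressed as a small helper. This is a sketch of the described logic only; the actual pipeline may differ in details such as how initials are detected:

```python
def prepare_name_for_ethnea(full_name):
    """Split an author name into (given, surname) following the three rules:
    single-word names are unusable (return None); 'initial + surname' is
    passed through; for names with 3+ words, use the first word as the given
    name unless it is an initial followed by a non-initial, in which case
    use the second word."""
    def is_initial(word):
        return len(word.rstrip(".")) == 1  # e.g. "M." or "M"

    words = full_name.split()
    if len(words) == 1:
        # e.g. "Curie": ethnicity and gender are both set to Unknown.
        return None
    if len(words) == 2:
        # e.g. "M. Curie": ethnicity is inferable, gender will be Unknown.
        return words[0], words[1]
    given = words[0]
    if is_initial(words[0]) and not is_initial(words[1]):
        given = words[1]  # e.g. "M. Salomea Curie" -> ("Salomea", "Curie")
    return given, words[-1]
```

Names reduced to a (given, surname) pair are then submitted to the Ethnea API; a `None` result corresponds to coding both attributes as Unknown.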
While Ethnea is trained on scholar names, we also applied it to infer the gender and ethnicity of journalists. Ethnea assigns fine-grained ethnic categories that lean toward country of origin. Here, we recognize that ethnicity, race, and nationality are three related concepts. Ethnicity categorizes people based on origin and cultural background, which is often reflected in names, whereas race is a social construct; nationality, in contrast, reflects country of affiliation and is more fluid due to immigration and migration. We thus use the term "ethnicity" because it is the most accurate and relevant concept in the study of names.
To test for macro-level trends across larger ethnic categories and to ensure sufficient samples to estimate the effects, we grouped the 24 observed individual ethnicities into higher-level categories based on geographical proximity and cultural distance (SI Table S1).
Note that due to sample size and our hypotheses, African, Chinese, Indian, and English (renamed "British-origin") were kept as separate high-level categories. Caribbean and Polynesian authors were excluded because they had fewer than 100 observations (triplets) in total. A few authors with organization names were also excluded. Examples of names classified into each ethnicity are provided in SI Table S8. Ethnea returns binary gender categories, Female and Male, though we recognize that researchers may identify with gender identities outside of these two categories. For gender and ethnicity separately, some names are classified as "Unknown" if Ethnea finds no discernible signal for the respective attribute.

Final Dataset and Statistics
The final dataset consists of 223,587 news stories referencing 100,486 research papers. As some stories mentioned more than one paper and some papers were mentioned in more than one story, we have 276,202 (story, paper) mention pairs. Since multiple authors are likely to be mentioned per paper, we have 524,052 (story, paper, author) triplets in total to test whether an author is mentioned in a story.
The distribution of the number of papers and news stories over time, and of attention per paper, is shown in SI Figs. S1a-b. The news story data is left-censored and primarily includes stories written after 2010, as Altmetric.com was only launched in 2012, which limits the collection of earlier news. As shown in SI Fig. S1c, news stories can mention papers published several decades earlier, highlighting the potential lasting value of scientific work. However, the majority of papers are mentioned within the same year as publication or just a few years after. SI Table S2 shows the number of authorships and triplets for authors in each broad ethnicity group, and SI Table S3 shows the number of triplets by journalists' inferred ethnicities.

News Outlets Categorization
To estimate differences across outlets, we grouped the 288 news outlets into three categories according to their publishing mechanisms (SI Table S9): (1) Press Releases, (2) Science & Technology, and (3) General News. The categorization is based on manual inspection of three random stories per outlet.
The Press Releases category is distinctive: many outlets in this group commonly, if not exclusively, republish university press releases as stories, making them reasonable proxies for estimating disparities in universities' own press offices. The Science & Technology category consists of magazines that focus on reporting science, such as "MIT Technology Review" and "Scientific American." These outlets typically construct a larger scientific narrative referencing several papers in their stories. The General News category includes mainstream news media, such as "The New York Times" and "CNN.com," that publish stories on a wide variety of topics. They have well-trained editorial staff and science journalists who are focused on accurately reporting science. SI Table S4 shows the number of (story, paper, author) triplets by outlet type. The average number of words per story for each outlet type is shown in SI Fig. S2.

Checking Author Attributions in Science News
Our dataset does not come with information on author mentions. We thus developed a computational approach to identifying author mentions and quotes (based on their last names) and institution mentions for each (story, paper, author) triplet.

Detecting Author Name Mentions
We normalized both the news content and the author names to ensure that this approach works for names with diacritics. For each (story, paper, author) triplet, the author's last name was searched for using a regular expression with word boundaries around the name, requiring the name's initial letter to be capitalized. While this process may introduce false positives for authors whose last names are common words (e.g., "White"), such cases are rare because (i) few authors in our dataset have common English words as their last names, and (ii) these words rarely appear at the beginning of a sentence in the story, where they would be capitalized. One notable exception is the two common Chinese last names "He" and "She," which can appear as third-person pronouns at the start of sentences. We thus imposed an additional constraint for these two names: they must be immediately preceded by one of the following titles to be counted as a name mention: "Professor", "Prof.", "Doctor", "Dr.", "Mr.", "Miss", "Ms.", "Mrs.". Occasionally, the author's name can occur within a reference to the paper at the end of the story, which should not count as a name mention. As authors are typically mentioned at the beginning or in the middle of a news story, we removed the last 10% of the story content when checking name mentions (we obtained similar results without this filtering). Ultimately, author names were found in 41.2% of all (story, paper, author) triplets.
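A minimal sketch of this detection rule, assuming a plain-text story and using the title list above; this is illustrative, not the study's exact implementation:

```python
import re

TITLES = r"(?:Professor|Prof\.|Doctor|Dr\.|Mr\.|Miss|Ms\.|Mrs\.)"
AMBIGUOUS = {"He", "She"}  # common surnames that double as English pronouns

def mentions_author(story_text, last_name):
    """Check whether the capitalized last name appears in the first 90% of a story."""
    body = story_text[: int(len(story_text) * 0.9)]  # skip trailing reference section
    cap = last_name[0].upper() + last_name[1:]       # require an initial capital
    if cap in AMBIGUOUS:
        # "He"/"She" must be immediately preceded by a title, e.g. "Dr. He"
        pattern = rf"\b{TITLES}\s+{cap}\b"
    else:
        pattern = rf"\b{re.escape(cap)}\b"
    return re.search(pattern, body) is not None      # matching is case-sensitive
```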

Author-Quote Detection
Authors can be mentioned by name in different forms, including quotation (e.g., "We are getting close to the truth," said Dr. Xu.), paraphrasing (e.g., Timnit says she is confident, however, that the process will soon be perfected.), and passing mention (e.g., A recent study conducted by Dr. Jha found that drinking coffee has no harmful effects on mental health.).
We used a rule-based matching method to detect explicit quotes for each (story, paper, author) triplet. We first parsed our news corpus using spaCy (https://spacy.io/). From the 50 most frequently used verbs in our news corpus, we identified 18 that are commonly used to integrate quoted material in news stories: "describe", "explain", "say", "tell", "note", "add", "acknowledge", "offer", "point", "caution", "advise", "emphasize", "see", "suggest", "comment", "continue", "confirm", and "accord". A sentence is determined to contain a quote from the author if two conditions are met: (i) both a quotation mark and the author's last name appear in the sentence, and (ii) any of the 18 quote-signaling verbs (or their inflected forms) appears within five tokens before or after the author's last name. A manual inspection of 100 extracted quotes revealed no false attributions. This conservative method underestimates the quote rate, as it may miss quotes due to unusual writing styles or article formatting; thus the advantage of British-origin named scholars in being quoted (Fig. 2) may be even larger.
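The two conditions can be sketched as follows. This simplified version uses regular-expression tokenization and crude suffix stripping in place of spaCy's parsing and lemmatization, so it only approximates the actual pipeline:

```python
import re

QUOTE_VERBS = {"describe", "explain", "say", "tell", "note", "add",
               "acknowledge", "offer", "point", "caution", "advise",
               "emphasize", "see", "suggest", "comment", "continue",
               "confirm", "accord"}
IRREGULAR = {"said": "say", "told": "tell", "saw": "see"}

def _lemma(tok):
    # crude inflection handling for the sketch; spaCy does this properly
    tok = IRREGULAR.get(tok, tok)
    if tok in QUOTE_VERBS:
        return tok
    for suf in ("ing", "ed", "es", "d", "s"):
        stem = tok[: -len(suf)]
        if tok.endswith(suf) and stem in QUOTE_VERBS:
            return stem
    return tok

def sentence_has_quote(sentence, last_name):
    """(i) a quotation mark and the last name co-occur in the sentence;
    (ii) a quote-signaling verb appears within five tokens of the name."""
    if '"' not in sentence or last_name not in sentence:
        return False
    tokens = re.findall(r"[\w']+", sentence)
    for i, tok in enumerate(tokens):
        if tok != last_name:
            continue
        window = tokens[max(0, i - 5): i + 6]  # five tokens either side
        if any(_lemma(t.lower()) in QUOTE_VERBS for t in window):
            return True
    return False
```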

Detecting Institution Mentions
We checked institution mentions based on exact string matching with authors' listed institution names in the MAG, i.e., for each (story, paper, author) triplet, we examined whether any of the author's full institution names appears in the news story. As with quote detection, this method may not identify every institution mention, due to noise in the MAG or stories using slightly different nomenclature, such as an institution's abbreviation. However, since a full list of alternative names for each institution is not available to us, we used this conservative method. For this reason, the tendency for minority scholars to be referred to by their institutions instead of by name (Fig. 2) is likely underestimated.

Mixed-Effects Regression Models
We adopted a mixed-effects logistic regression framework to examine the demographic disparity in author mentions in science reporting. In our regression framework, each (story, paper, author) triplet is an observation, with the dependent variable indicating whether the author is mentioned or not in the story. Many factors are known to influence name mentions that could confound the analysis of ethnicity and gender, such as author reputation, institutional prestige and location, publication topics and venues, outlets, and journalist demographics. Here, we provide details of these factors and present a series of five regression models that build upon one another by adding more rigorous control variables at each step. The increasing level of model complexity allows us to test the robustness of the effects of ethnicity and gender association, and also to examine potential factors at play in science coverage.

Model 1: Naive Disparity
The first model directly encodes our two variables of focus, gender and ethnicity association, as the sole categorical factors in the regression. Here and throughout the study, we treat the reference coding for ethnicity association as British-origin and for gender association as Male.
While overly simplistic in its modeling assumptions, Model 1 nevertheless tests whether authors of a particular demographic are systematically mentioned less frequently, and it serves as a baseline for layering on controls to explain such disparity.
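For illustration, Model 1's design matrix amounts to treatment (dummy) coding with British-origin and Male as the reference levels. The abridged category lists below are placeholders, not the full sets used in the study:

```python
def treatment_code(value, levels, reference):
    """Treatment (dummy) coding: the reference level maps to all zeros."""
    return [1.0 if value == lev else 0.0 for lev in levels if lev != reference]

ETHNICITIES = ["British-origin", "Chinese", "Indian", "African"]  # abridged
GENDERS = ["Male", "Female", "Unknown"]

def model1_row(ethnicity, gender):
    """Design-matrix row for Model 1: intercept plus ethnicity and gender
    dummies, with British-origin and Male as reference categories."""
    return ([1.0]  # intercept
            + treatment_code(ethnicity, ETHNICITIES, "British-origin")
            + treatment_code(gender, GENDERS, "Male"))
```

Under this coding, each non-reference coefficient estimates the log-odds difference in being mentioned relative to a British-origin-named male author.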

Model 2: Paper Author Controls
Many author-level attributes other than demographics could influence journalists' perceptions of authors and coverage of them. Model 2 introduces 13 additional factors to control for features of papers' authors.
Prestige Factors. The reputation of an author may also influence the chance of being named. High-status actors and institutions tend to receive preferential treatment within science (62)(63)(64), and we hypothesize that these prestige-based disparities may carry over to media coverage as well. To account for prestige effects, we include the author rank and institution rank provided by the MAG (65), taking the highest institution rank for authors with multiple affiliations. This ranking estimates the relative importance of authors and institutions using paper-level features derived from a heterogeneous citation network; while similar to the h-index, the method has been shown to produce more fine-grained and robust measurements of impact and prestige. Institution and author ranks are not necessarily directly related, as institutions may be home to authors of varying ranks (e.g., early- or late-career faculty), and the same author may appear with different affiliations on separate papers due to a career move. Note that because lower rank values denote higher standing, negative coefficients in the regression models would indicate that higher-ranked individuals and those from higher-ranked institutions are more likely to be mentioned.
We also add a variable indicating the author's institution location, with three categories: (1) domestic, (2) international, and (3) unknown. For authors with multiple affiliations, we assign "domestic" if at least one institution is in the U.S. This variable controls for geographical factors that may influence journalists' willingness to contact authors by phone or video call, and therefore whether they mention the author. We infer each institution's country from the latitude and longitude provided in the MAG.
Popular authors who receive extensive press coverage may be more likely to be mentioned. We add a factor indicating whether the author is among the 100 most popular scholars based on the number of their papers mentioned in the news in our final dataset.
In multi-author papers, the team often designates one or more corresponding authors, who are presumably more likely to be contacted and therefore mentioned by journalists. Our data includes the corresponding author information for most papers. We thus include a variable indicating whether the author is corresponding or not on the covered paper.
Last Name Factors. People are known to prefer both familiar and more easily pronounceable names (66,67), and this preference could affect which author a journalist mentions. We therefore introduce two proxy factors: (1) the number of characters in the last name, as a proxy for pronounceability, and (2) the log-normalized count of the last name per 100K Americans from the 2018 census data. As our stories are drawn from U.S.-based news outlets, the latter reflects potential familiarity.
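These two proxies are straightforward to compute; in the sketch below, the use of log1p for the log normalization is our assumption about how zero counts would be handled, not a detail stated in the text:

```python
import math

def name_factors(last_name, count_per_100k):
    """Two last-name proxies: character length (pronounceability) and
    log-normalized census frequency (familiarity); log1p handles zero counts."""
    return len(last_name), math.log1p(count_per_100k)
```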
Other Authors. Scientific knowledge is increasingly discovered by teams, as tackling complex problems often requires collaboration among experts with diverse specializations. When journalists examine a paper's authors, the team size may influence their understanding of how credit is distributed among authors, potentially reducing the chance of any single author being mentioned for papers with many authors. We thus include a variable for the number of authors.

Model 3: Paper and Story Content
The content of the paper and story, as well as journalist demographics, can also play a role in author mentions. We thus control for the following factors in Model 3.
Year of News Story (Mention Year). Disparities in science coverage may have temporal variations due to unpredictable factors that are directly or indirectly related to research. For instance, the available funding resources can affect the number of research outputs in a year, which would in turn influence the amount of time and space journalists devote to scientists in news articles. We thus control for the year of the news story, i.e., the mention year of the paper.
We treat it as a scalar variable (zero-centered).
Year Gap between Story and Paper. News stories often reference older scientific papers in their narratives, as shown in SI Fig. S1c. For older papers, the original authors may be unreachable at the time a story is published, or the story may be framed differently from recent science that is considered "fresh." Indeed, citing timely scientific evidence in a news report can increase perceptions of the story's credibility (37,71). We therefore include a variable quantifying the difference between the mention year and the publication year of the mentioned paper.
Number of Papers Mentioned in a Story. A story can mention several papers to help frame and construct its scientific narrative, potentially increasing its perceived credibility. However, referencing many papers may reduce the space and attention allocated to each paper, and therefore may decrease the chance of its authors being mentioned. We thus control for the number of papers mentioned in a story.
News Story Length. Longer articles provide more space for depicting the science being covered; we thus control for story length, measured as the total number of words.
Paper Readability. Given the tight timelines under which journalists work, quickly identifying and understanding insights is likely critical to what is said about a paper. A paper's readability may thus influence whether a journalist feels the need to reach out to the author, with more readable papers requiring less contact. Readability, in turn, may also be tied to authors' demographics such as gender (72), making it important to take readability into account. Due to licensing restrictions, the full text of the majority of papers is not freely available; we therefore compute readability over the paper abstract using three factors: (1) the Flesch-Kincaid readability score, which estimates the grade level needed to understand the passage; (2) the number of sentences per paragraph, a proxy for information content and density; and (3) the type-token ratio, a measure of lexical variety. Another reason we focus on the abstract is that journalists may not read the entire paper but very likely read the abstract.
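A rough sketch of two of the abstract-level readability features; the syllable counter is a crude vowel-group heuristic, and the exact formulas and tooling used in the study may differ. The sentences-per-paragraph factor would additionally require the abstract's paragraph breaks, so it is omitted here:

```python
import re

def _syllables(word):
    # crude heuristic: count contiguous vowel groups, minimum one
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def abstract_readability(text):
    """Flesch-Kincaid grade level and type-token ratio for an abstract."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(_syllables(w) for w in words)
    # standard Flesch-Kincaid grade-level formula
    fk = 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59
    ttr = len({w.lower() for w in words}) / len(words)  # type-token ratio
    return fk, ttr
```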
Journalist Demographics. It is ultimately the journalist's decision to mention authors when writing science reports. Motivated by the commonly observed homophily principle in social networks (36), we hypothesize that mentioning behavior in science reporting is associated with homophilous effects by ethnicity and gender. To model such effects, we include the journalists' demographics in the regressions. Due to insufficient instances of journalists identified in news stories (SI Table S3), we further coarsen the 9 broad ethnicity categories into four groups.

Model 4: Paper Domains and Topics
Some scientific domains and topics may be inherently more attention-getting than others. Some may be harder to understand without seeking additional explanation from authors. Furthermore, journalists' academic backgrounds may be unequally distributed across scientific fields, resulting in different propensities to reach out to authors.
We thus include factors capturing each paper's topics using data from the MAG, which includes a large set of keywords (665K) at different levels of specificity. A paper can have multiple keywords, each with a confidence score between 0 and 1. To capture high-level topical and methodological differences, we focus on the 199 most common keywords, those occurring in at least 500 papers in our final dataset. Each keyword is used as an independent variable in the regression, whose value is the keyword's confidence score for the paper.
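The keyword controls thus form a fixed-length vector per paper. A minimal sketch with an illustrative three-keyword vocabulary (the study uses the 199 most common keywords):

```python
def keyword_features(paper_keywords, vocab):
    """Map a paper's {keyword: confidence} dict onto a fixed keyword vocabulary.

    Keywords outside the vocabulary are ignored; absent keywords get 0.0.
    """
    return [paper_keywords.get(kw, 0.0) for kw in vocab]

vocab = ["machine learning", "genetics", "climate change"]   # illustrative subset
paper = {"genetics": 0.92, "gene expression": 0.40}          # hypothetical paper
```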

Model 5: News Outlets and Publication Venues
Individual news outlets may follow different standards of practice in how they describe science, creating a separate source of variability in who is mentioned. Publication venues likewise differ in impact and topical focus, potentially affecting the depth of journalistic attention to papers published in them. To model these sources of variation, we treat outlets and venues as random effects in Model 5. This mixed-effects regression implicitly captures a robust set of factors involved in science reporting, such as the tendency for specific journals (e.g., Nature, Science, or JAMA) to be mentioned more frequently and the focus of particular news outlets on topics covered by particular journals.
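In formula terms, the resulting model can be sketched as follows (our notation, not the paper's; $\mathbf{x}_{spa}$ collects the fixed-effect controls from Models 1-4):

```latex
\operatorname{logit}\Pr(\mathrm{mention}_{spa}=1)
  = \mathbf{x}_{spa}^{\top}\boldsymbol{\beta}
  + u_{\mathrm{outlet}(s)} + v_{\mathrm{venue}(p)},
\qquad
u_{o}\sim\mathcal{N}(0,\sigma_{u}^{2}),\quad
v_{m}\sim\mathcal{N}(0,\sigma_{v}^{2}),
```

where the subscript $spa$ indexes (story, paper, author) triplets, and $u$ and $v$ are the outlet and venue random intercepts.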

Additional Ethnicity Coding
Although Ethnea is specifically designed for inferring scholars' ethnicity in bibliographic records, it is not expected to be entirely error-free. As a robustness check, we replicated our analyses by inferring the ethnicity for the names of authors and journalists using two separate data sources to test whether the observed disparity persists.
Specifically, we used the ethnicolr (https://pypi.org/project/ethnicolr/) library to code ethnicity using data derived from either (i) the nationalities listed in Wikipedia infoboxes, to infer nationality-based ethnicity, or (ii) self-reported ethnicity data associated with last names in the 2010 U.S. census. While these two sources use definitions and granularities of ethnicity different from Ethnea's, they nonetheless provide approximately similar categories that allow us to validate our results.

Ethnicity based on Wikipedia
We used the Wikipedia infobox data to code ethnicity based on the first name and the last name (48,60). SI Fig. S4 shows the average marginal effects in mention rates for scholars whose names carry minority ethnicity (or race) associations, compared to British-origin (or White) named authors, allowing direct comparison with the Ethnea-based results. As neither tool infers gender, we report the gender results here using Ethnea's labels. As with Ethnea, we find strong evidence of disparities in author mentions for Asian-associated names in science news, highlighting the robustness of our findings in the main text.
Acknowledgements: We thank Altmetric.com for sharing the mention data used in this study.

Supplementary Materials
Tables S1 to S10; Figures S1 to S4.
Fig. S4. The average marginal effects in mention probability for the demographic associations of author names, using Wikipedia data for coding ethnicity (left) or U.S. Census data for coding race (right) based on author (or journalist) names. Note that gender is still inferred using Ethnea.

A. Associations of Control Variables with Author Mentions
Although our focus is on ethnicity and gender, we find that many control variables are strongly associated with author mention rates. Examining the influence of these factors can lead to a better understanding of the mechanisms at play in science reporting. Below we interpret their effects based on Model 5 (Table S5) along three themes: (1) prestige related inequality, (2) impact of co-authorship, and (3) story content effects.
Not surprisingly, being designated the corresponding author is positively associated with name mentions. Scholars who have a high professional rank or are affiliated with prestigious institutions receive outsized name mentions in science news when their research is covered. Popular authors whose research has received substantial press coverage are more likely to be mentioned by name. This result suggests that the benefits of status, the so-called "Matthew Effect" [1], persist even after publication.
Having more co-authors on a paper has a negative effect on an author being mentioned. Compared to the last author position, the first author is more likely to be mentioned by name, whereas middle authors are less likely to be named. The observed first-position effect might be due to the fact that, among multi-author papers with corresponding author information, 59.9% have the first author as corresponding and only 36.1% have the last author as corresponding. Solo-authored papers have been decreasing over time and are associated with lower impact on average [2,3]. However, our results highlight an underappreciated benefit: conditional on a paper being referenced in the news, a solo author is significantly more likely to be mentioned than authors of a multi-author paper. Although seemingly counter to previous studies, this has a natural explanation: there is only one person to mention if need be.
The coefficients for story features point to the multifaceted nature of science reporting. Although the volume of science reporting is increasing over time (Fig. S1a), journalists tend to mention authors less frequently in later years. Likewise, while older papers are still discussed in the media (Fig. S1c), journalists mention the authors of these older studies less often. When more papers are referenced in a story, their authors are less likely to be mentioned; we hypothesize that such stories cite multiple scientific papers to construct a larger narrative and thus mention those papers only in passing. Longer stories are more likely to mention author names, as they have more space to engage with the authors.

B. Does It Matter Who Is Reporting?
Understanding whether ethnic disparities are related to journalists' own identities may help uncover the mechanisms producing them. First, journalists of different ethnicities may differ in their overall tendencies to mention authors; if so, disparities may be driven by the composition of journalists. Our fullest model controls for journalists' name-inferred ethnicity and shows that journalists with minority-identity associated names are neither more nor less likely to mention authors compared with journalists with Male or British-origin names (Table S5, Model 5). We also note that, when dropping controls for outlets (Models 3-4), journalists' ethnicities become significant, suggesting that journalists' differential behavior might be explained by variation at the outlet level, i.e., certain news outlets mention authors more or less often, and certain groups of journalists are under- or over-represented in those outlets.
Second, there might exist interactive relationships between authors' and journalists' ethnic identities. One intuitive hypothesis, which we call "ethnic hierarchy," is that all journalists, regardless of their perceived ethnicity, prefer to mention British-origin named scholars over others. On the other hand, journalists may prefer to mention authors of the same ethnicity, which we call "ethnic homophily". Evidence for demographic homophily is pervasive [4]. For example, concordance of gender identities between actors has been found to