Credibility of scientific information on social media: Variation by platform, genre and presence of formal credibility cues

ABSTRACT
Responding to calls to take a more active role in communicating their research findings, scientists are increasingly using open online platforms, such as Twitter, to engage in science communication or to publicize their work. Given the ease with which misinformation spreads on these platforms, it is important for scientists to present their findings in a manner that appears credible. To examine the extent to which the online presentation of science information relates to its perceived credibility, we designed and conducted two surveys on Amazon’s Mechanical Turk. In the first survey, participants rated the credibility of science information on Twitter compared with the same information in other media, and in the second, participants rated the credibility of tweets with modified characteristics: presence of an image, text sentiment, and the number of likes/retweets. We find that similar information about scientific findings is perceived as less credible when presented on Twitter compared to other platforms, and that perceived credibility increases when presented with recognizable features of a scientific article. On a platform as widely distrusted as Twitter, use of these features may allow researchers who regularly use Twitter for research-related networking and communication to present their findings in the most credible formats.


INTRODUCTION
Scientific institutions and scientists themselves are increasingly making use of online communication platforms to disseminate scientific findings (Duggan, Ellison et al., 2015). Covering science in informal outlets, such as Twitter, alongside traditional science news and peer-reviewed journal articles has the potential to reach a cross-section of science enthusiasts (Büchi, 2016; Ranger & Bultitude, 2014). However, presented with this crowded online arena, readers employ various heuristics, conscious and unconscious, to assess the credibility of information (Flanagin & Metzger, 2007). Here, we define credibility as a perceived feature of trustworthiness that readers apply to information they encounter (Schmierbach & Oeldorf-Hirsch, 2012). If writers and publishers of online science information want the information they present to be seen as credible, they need to be cognizant of the methods users employ to make these judgements. In this study we focus on describing features that contribute to user perceptions of the credibility of science information presented online. How and where scientists present their work may enhance or undermine its credibility, and a stronger understanding of these features is important to upholding the credibility of reliable science information, particularly on platforms that struggle to filter out misinformation.
The reasons for science journalists to present their work online are self-evident, but there are also many reasons for scientists to present their findings online. In addition to science outreach, many researchers find general-use social media applications useful during many stages of research, including selecting research questions, collaborating, publicizing findings, and keeping up to date on recent publications (Collins, Shiffman, & Rock, 2016; Côté & Darling, 2018; Holmberg, Bowman et al., 2014; Rowlands, Nicholas et al., 2011; Tenopir, Volentine, & King, 2013). The utility of general-use social media emerges in part from their lower time demands and barriers to entry compared with more traditional forms of outreach (McClain, 2017), and Twitter in particular has been a major platform used by scientists for outreach (Côté & Darling, 2018; Mohammadi, Thelwall et al., 2018). The use of Twitter among scientists also appears to be increasing, as the majority of those users reported having had their accounts for less than two years (Collins et al., 2016). While the choice of whether or not to engage in informal online science communication has been the subject of some controversy (Collins et al., 2016; Yammine, Liu et al., 2018), there is less guidance about selecting among the myriad platforms for online engagement or about composing posts effectively. As many researchers take up social media as a professional task, the stakes are high for maintaining the credibility of the information they present online.
Credibility is understood as a subjective feature that is perceived by individual readers, not an innate feature of the information itself (Bian, 2012; Schmierbach & Oeldorf-Hirsch, 2012; Shariff, Zhang, & Sanderson, 2017; Tseng & Fogg, 1999). As such, online credibility evaluations are commonly understood to be influenced by formal cues: the features surrounding the presentation of information, such as website design, source attribution, and genre conventions (Flanagin & Metzger, 2007). Even within a single online platform, differences in writing style and narrative frame (Nadarevic, Reber et al., 2020; Shariff et al., 2017), cues indicating audience reception (Winter & Krämer, 2014), and the perception of both scientific and social consensus (Bode, Vraga, & Tully, 2020; Borah & Xiao, 2018; Kobayashi, 2018) can affect the credibility of scientific information. Complicating the influence of these formal features on credibility, specific media types often have strong reputations among users that affect their perceptions of the information in those settings (Lucassen & Schraagen, 2013; Winter & Krämer, 2014). While it would be challenging to disentangle a platform's reputation from the influence of formal cues, information on Twitter is often considered less credible than on other online news media (Schmierbach & Oeldorf-Hirsch, 2012). Furthermore, this reputation is often accepted as a premise in the strong body of recent literature devoted to assessing the spread of misinformation, hostility, or bot-like behavior on Twitter (Anderson & Huntington, 2017; Robinson-Garcia, Costas et al., 2017; Shao, Ciampaglia et al., 2018; Vosoughi, Roy, & Aral, 2018) and to containing and correcting misinformation (Bode et al., 2020; Smith & Seitz, 2019). However, as Twitter has recently come into wider usage in academic contexts (Collins et al., 2016), an updated study focusing on the credibility of scientific information on Twitter is needed.
Despite the recency of the movement, academic Twitter has already adopted several conventions in the composition of tweets concerning scientific findings. Tweets about science exhibit a high incidence of linked URLs, and tweets also frequently employ "mentioning" (including others' user IDs) as another way to refer outward to additional information or support (Büchi, 2016; Schmitt & Jäschke, 2017). Furthermore, because scientists frequently use Twitter to promote new publications (Holmberg et al., 2014), several features of the scientific paper that are readily transferable to tweets, such as the title, abstract, and figures, are formal cues that may contribute to credibility. Outside the fringes of discourse, the public sees scientists as trusted sources of reliable scientific information, such that trust in scientists may "spill over" into higher rates of trust for science-specific media (Brewer & Ley, 2013). For known scientists to personally endorse science information, even by briefly acknowledging their authorship or positive evaluation of the information, may similarly confer a boost to that information's credibility. The credibility of information on social media is strongly affected by any personal connection to the original poster (Turcotte, York et al., 2015), which has led some to suggest that, in the fight against misinformation, scientists should draw on their preexisting personal networks on social media to present credible scientific information online (McClain, 2017). Formal cues that are strongly and widely associated with academic research, such as scientific abstracts or figures, may likewise confer increased credibility on science information. Finally, social approval can strongly impact credibility evaluations (Kobayashi, 2018; Winter & Krämer, 2014).
Most social media platforms have some means for users to rate content (e.g., "Like," "Favorite," and "Share") and to view others' aggregated ratings; these indicators of social approval may impact perceptions of credibility.
In this study, we used online surveys through Amazon's Mechanical Turk (AMT) to examine the perceived credibility of scientific information appearing across several popular online media platforms, with a focus on Twitter. We explored both the relative credibility of Twitter compared with other online genres and the extent to which various formal and textual features within Twitter are related to the perceived credibility of scientific information. We focus on Twitter because it is widely used for the dissemination of science information (Mohammadi et al., 2018) and has higher rates of participation among scientists than other social media, even as the platform is beset with concerns over the credibility of its content (Shao et al., 2018; Vosoughi et al., 2018). Amid these concerns, this study offers a more concrete evaluation of Twitter's credibility compared with other platforms commonly used for disseminating science information. While the primary goal of this study is a descriptive understanding of the features that contribute to information's perceived credibility on Twitter, the results also inform recommendations for composing tweets about scientific papers that are more likely to be perceived as credible.

METHODS
We conducted a study in two parts. In Part One, we conducted a survey through AMT investigating user credibility evaluations of real-world instances of media coverage on Twitter compared with four other online media: online video, blogs, news articles, and scientific abstracts. The results of this survey were used to assess the extent to which the perceived credibility of information on Twitter differs from that of other common online media. In Part Two, we conducted a survey that asked respondents to rate the credibility of artificially constructed tweets. The results of this survey were used to determine the extent to which a tweet's perceived credibility was related to its formal cues.

Material Selection
Part One compared the degree of perceived credibility of tweets against other forms of media. We used Altmetric.com, a major altmetric data aggregator, to identify and sample journal articles that appeared at least once in a news site, a blog, on Twitter, and in an online video.
Out of these media types, only Twitter is a specific branded platform; however, because each type represents a specific genre with stylistic and formal conventions, we will refer to each source as a platform that is hosting and presenting the scientific information. We used purposive sampling to identify five articles across a range of topics. These studies and their abbreviated labels are provided in Table 1.
At the time of sampling, relatively few articles were covered on all four platforms; this is an admitted limitation of the present study and is further addressed in our discussion. In total, we had 25 unique topic/platform combinations. For each of the text examples (Twitter, abstract, news, and blog) a screenshot was taken with no modification. However, given the heterogeneity of the sources, the screenshots were cropped to remove varying contextual influences and identifying information as well as to enforce a similar size and proportion across all screenshots. We maintained all titles, bylines, and some pictures to keep the sources in their original form. Examples of screenshots for one study are provided in Figure 1.
Given that Twitter has been identified as a resource for paper sharing among academic scientists (Mohammadi et al., 2018), Part Two of the study focused exclusively on identifying features in tweets that were most strongly associated with high perceived credibility. Our sample was drawn from a single field-specific journal, the Annual Review of Clinical Psychology, to control for the field differences observed in Part One. The labels given to each of the studies included in Part Two are listed in Table S1.
We constructed a series of sample tweets based on each article, with varying combinations of the following characteristics:
• Use of first-person possessive language such as "our" to indicate authorial ownership of the paper's findings;
• Low, medium, or high tweet reception numbers in the form of likes and retweets;
• Use of the verbatim paper title, a neutral paraphrase, or a positive paraphrase; and
• The presence of a figure from the selected paper, a screenshot of its abstract, or no visual.
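As an illustration, a balanced subset of 12 variants per article can be constructed programmatically. The sketch below is hypothetical (the authors' actual assignment scheme is not specified here); the level names and the cycling rule are our own, chosen only so that every level of every feature appears equally often across the 12 variants, consistent with the description of the design.

```python
from itertools import product

# Full factorial space of the tweet features described above:
# 2 ownership x 3 reception x 3 title x 3 visual = 54 combinations.
ownership = ["no claim", "ownership claim"]
reception = ["low", "medium", "high"]
title = ["verbatim", "neutral paraphrase", "positive paraphrase"]
visual = ["none", "figure", "abstract"]

full_factorial = list(product(ownership, reception, title, visual))
assert len(full_factorial) == 54

# Hypothetical balanced selection of 12 variants per study: each level of
# each feature appears equally often (6x for the 2-level feature, 4x for
# each 3-level feature).
variants = [
    (ownership[i % 2], reception[i % 3], title[i // 4], visual[(i + i // 4) % 3])
    for i in range(12)
]
```

Any similarly balanced fraction of the 54-cell factorial would serve; the point is that 12 tweets per study suffice to represent every feature level equally without exhausting all combinations.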
Users of Twitter are expected to follow other tweeters whom they likely consider trustworthy. Because we cannot control for the familiarity between the user and the tweeter, we instead chose to simulate the experience of a user encountering information from an unknown source. Identifying details, such as the Twitter handle and profile picture, were removed by placing a grey box where this information would appear. Each tweet contained a URL to its corresponding journal article, as this is a common practice for science tweets (Büchi, 2016), but tweets were presented as images so the link was not clickable. All tweet screenshots are shown in Figure 3.

Table 1. Articles used in Part One and their abbreviated labels.
Caffeine: Post-study caffeine administration enhances memory consolidation in humans (Borota, Murray et al., 2014)
Dogs: Dogs are sensitive to small variations of the Earth's magnetic field (Hart, Nováková et al., 2013)
Sweeteners: Artificial sweeteners induce glucose intolerance by altering the gut microbiota (Suez, Korem et al., 2014)
Vitamins: Vitamin D and the risk of dementia and Alzheimer disease (Littlejohns, Henley et al., 2014)
Marriage: "A diamond is forever" and other fairy tales: The relationship between wedding expenses and marriage duration

Survey: Part One
We sought user credibility judgements using the AMT service, a crowd-sourcing application in which respondents are solicited to perform tasks in exchange for monetary payment. Drawing survey participants from AMT is a common research procedure and is accepted as a reputable means of recruiting a diverse set of survey participants in the current survey-weary climate (Cassese, Huddy et al., 2013; Sheehan & Pittman, 2016).
All responses to Part One of the survey were obtained on June 16 and 17, 2015. Five distinct Human Intelligence Tasks (HITs) were generated in AMT. Each HIT presented the five distinct topics, each from a different one of the five platforms, which ensured that every topic and every platform was shown to each participant. While the embedded video needed to be presented last due to technological constraints, the other platforms and topics were presented to respondents in several shuffled sequences. A pilot test was conducted with two participants and was used to refine the presentation of the HIT. The recruitment material is presented in Text SI.1. We asked those who had completed one of the previous News Credibility HITs not to complete another because, once exposed to the five topics, the respondent's impressions of credibility and familiarity would likely be compromised. Even though a participant could theoretically look up the authors or articles, they were not directly linked to the sources.

Figure 3. Example of tweet screenshots shown to survey respondents. Twelve tweet variants were constructed for each of the 14 studies used in the second survey. Each constructed tweet contained a combination of features relating to (A) whether there was a claim of ownership; (B) its audience reception (low reception was indicated with 0 retweets and 1 like, medium reception with 16 retweets and 23 likes, and high reception with 852 retweets and 1,300 likes); (C) the phrasing of its title; and (D) the presence of a visual. Username and profile images were removed. Because only 12 tweets were created for each study, not all possible tweet variants exist for every study; instead, each study contains 12 of the possible feature combinations, organized such that all features are equally represented.
Once the respondent accepted a HIT, they were presented with a series of screenshots followed by a video for the five topics. While the video was presented last due to the need to embed it, the order of the screenshots was randomized. In addition, a statement corresponding to each topic was presented. As with the topics themselves, these statements were constrained by the need to appear clearly in all five platforms for each topic. They were intended to be simple so that they clearly corresponded to the finding of each original article. The statements of fact corresponding to each article are listed in Table 2.
Following the statement, participants were asked the following in the survey interface:
1. Based on this source, how credible do you find the following statement on a scale of 1-7, with 1 being not at all credible and 7 being very credible?
2. Before viewing this HIT, how familiar were you with this topic on a scale of 1-7, with 1 being not at all familiar and 7 being very familiar?
After responding to the five topic/genre combinations, respondents were asked to provide basic demographic information (i.e., gender, age, location, education level). Respondents selected gender from "Male", "Female", or "Other". Age was entered freely as an integer. Location was selected as either within the United States or not. Respondents selected their highest education earned from a list of seven categories: "less than high school", "high school", "Associate", "Bachelors", "Masters", "Doctorate", or "Professional"; to simplify analysis, this variable was later collapsed into a binary variable indicating whether the respondent had obtained a university degree (Bachelors, Masters, Doctorate, or Professional). Each of these questions allowed for no answer to be provided, which would result in a missing value. Respondents were also offered a free-input box to detail any issues that they had with reading the textual excerpts or viewing the videos (which were hosted on YouTube or embedded in news websites).

Table 2. Statements of fact corresponding to each article.
Sweeteners: Artificial sweeteners may lead to diabetes.
Vitamins: Vitamin D deficiency causes an increased risk of dementia and Alzheimer disease.
Marriage: Marriage duration is inversely associated with spending on the engagement ring and wedding.
HITs were sent in batches over the course of the 2 days to recruit sufficient numbers of participants with no duplication. Respondents were initially paid $0.10 per HIT. However, given the length of time that it took to complete the HIT this was increased to $0.25 per HIT for the final batch. A validation question ("What is two plus two?") was added to detect participants who were answering questions randomly or otherwise not paying full attention to the task.
After removing the records of respondents who failed the validation question and duplicate respondents (in which case the respondent's first response was kept and all later responses excluded), the final data set consisted of 4,601 responses from 924 respondents. Most respondents for Part One, 56.0%, were male (n = 517), and 2.9% (n = 27) selected either "other" or provided no gender. The average age of a respondent was 33 years, whereas the median age was 30 years old. The majority, 58.7%, of respondents reported having a university degree or higher (n = 542). The majority of respondents, 90.8%, reported their location as within the United States (n = 839).
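The two exclusion rules just described (dropping respondents who failed the validation question, and keeping only a duplicate respondent's first submission) amount to a simple filter over the raw responses. The following is a hypothetical pure-Python illustration with invented record fields, not the authors' actual cleaning code.

```python
# Hypothetical raw records: (worker_id, submit_order, validation_answer, ratings)
records = [
    ("w1", 0, "4", [5, 4, 3, 6, 2]),
    ("w2", 0, "5", [7, 7, 7, 7, 7]),   # fails the validation question -> dropped
    ("w1", 1, "4", [1, 1, 1, 1, 1]),   # later duplicate of w1 -> dropped
    ("w3", 0, "four", [4, 4, 5, 3, 6]),
]

def clean(records):
    """Apply the two exclusion rules: validation check, then first-response-only."""
    kept, seen = [], set()
    for worker, order, answer, ratings in sorted(records, key=lambda r: r[1]):
        if answer.strip().lower() not in {"4", "four"}:  # validation: two plus two
            continue
        if worker in seen:  # keep only each worker's first response
            continue
        seen.add(worker)
        kept.append((worker, ratings))
    return kept
```

Applied to the sketch data above, `clean` retains only the first valid submission per worker.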

Survey: Part Two
To assess the relationship between tweet characteristics and perceptions of credibility, we performed a second survey with AMT using artificially constructed tweets (see the Material Selection section) on December 15, 2017. The description of the new HITs read as follows: "You will be shown a variety of tweets on a variety of topics. You will be asked to assess the credibility of these tweets." Participants were paid $0.25 for completing one HIT and were prohibited from responding to more than one HIT using the Unique Turker script (Ott, 2016).
Each participant was shown 14 tweets, one variant for each of the 14 distinct topics. While the order of the topics remained constant, from topic 1 to topic 14, the variations and combinations of tweet features were shuffled across respondents. For each tweet, participants were instructed to answer, "How credible do you find this tweet on a scale of 1-7, with 1 being 'not at all credible' and 7 being 'very credible'?" A series of demographic questions was posed at the end of the HIT and presented and processed in the same fashion as for Part One. A validation question (i.e., "What is 2 + 2?") was also asked, and a comment box was provided. The most frequent comments related to the size and/or resolution of certain tweets, which participants reported made it more difficult to ascertain their credibility.
After removing the records of respondents who failed the validation question, the final data set for Part Two consisted of 1,568 responses by 112 respondents. The demographics of Part Two survey respondents were similar to those of Part One, though with fewer U.S. respondents and more having a university degree or higher. The majority of respondents for Part Two, 55.4% (n = 62), identified themselves as male, and three either reported their gender as "other" or did not report a gender. The average age of a respondent was 35 years, whereas the median age was 32 years. The majority, 69.9%, of respondents reported having a university degree or higher (n = 78). The majority of respondents, 75.9%, reported their location as within the United States (n = 85).

Analysis
For Part One, we used the data from the first survey to investigate respondents' views of a statement's credibility and its relationship to the platform of presentation. The relationship between stated credibility and platform was assessed with a linear mixed effects model with the worker ID of the survey respondent as the random intercept to account for variability between respondents, implemented using lme4 (Bates, Mächler et al., 2015) in R and RStudio (R Core Team, 2019; RStudio Team, 2019). In the resulting regression, the stated credibility (an integer from one to seven) was used as the response variable and the platform as a predictor variable. The predictor platform was a categorical variable for which "abstract" was set as the reference level. The topic of study was included as a control variable and represented as a categorical variable for which "Caffeine" (see Table S2) was set as the reference level. We also controlled for the respondent's stated familiarity with the topic (an integer from one to seven). Other control variables included the gender, age (in decades), education level, and location of the respondent. Records with missing demographic information (most often gender or age) accounted for only about 3.4% of total responses and were excluded from the regression analysis. We followed this regression with a series of univariate analyses to investigate the extent to which the distribution and mean of credibility differed across combinations of the platform of presentation and the topic of study.
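To make the random-intercept specification concrete, the following is an illustrative Python analogue using statsmodels (the authors' analysis used lme4 in R). The data are synthetic and all column names are hypothetical; only the model structure mirrors the description above: fixed effects with "abstract" as the reference platform, plus a random intercept per respondent.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Synthetic stand-in for the survey data: one row per credibility rating.
n = 400
platforms = ["abstract", "news", "blog", "video", "tweet"]
df = pd.DataFrame({
    "worker_id": rng.integers(0, 80, n),      # grouping for the random intercept
    "platform": rng.choice(platforms, n),
    "familiarity": rng.integers(1, 8, n),
})

# Simulate lower ratings for tweets than abstracts, plus per-worker noise.
base = {"abstract": 5.1, "news": 4.7, "blog": 4.3, "video": 4.6, "tweet": 3.6}
worker_effect = rng.normal(0, 0.5, 80)
df["credibility"] = (
    df["platform"].map(base)
    + worker_effect[df["worker_id"]]
    + rng.normal(0, 1, n)
).clip(1, 7)

# Mixed effects model: fixed effects for platform (reference = "abstract")
# and familiarity; random intercept per survey respondent.
model = smf.mixedlm(
    "credibility ~ C(platform, Treatment('abstract')) + familiarity",
    data=df,
    groups=df["worker_id"],
)
result = model.fit()
print(result.summary())
```

With this setup, the fitted coefficient for the tweet level is negative relative to the abstract reference, mirroring the direction of the effect reported in the Results.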
For Part Two, using the data from the second survey, we conducted another linear mixed effects model with the worker ID as the random intercept to assess the extent to which the characteristics of tweets related to their perceived credibility. A regression analysis was used in which the stated credibility (an integer from one to seven) was the response variable and the characteristics of the tweet were the predictor variables. Four predictor variables were used to characterize each tweet:
• a binary variable indicating whether the tweet's author claimed ownership of the study, with no claim as the reference level;
• a categorical variable for the tweet's reception (low, medium, or high), with "low" as the reference level;
• a categorical variable for the phrasing of the title (verbatim title, neutral paraphrase, or positive paraphrase), with the verbatim title (no paraphrase) as the reference level; and
• a categorical variable for the visual accompanying the tweet (none, a chart from the paper, or a screenshot of the abstract), with no visual as the reference level.
We controlled for the topic of the study, which was represented as a categorical variable with 14 levels, of which the first was held as the reference (Table S4). Other control variables included respondent demographics, which were coded in the same way as in the regression for Part One. Records with missing demographic information, comprising three respondents who did not report a gender, were excluded from the second regression analysis.
All data preprocessing and analysis was carried out using the programming language R 3.5.1 and RStudio version 1.1.46. Code and anonymized data to replicate this analysis have been made available at https://github.com/murrayds/sci-online-credibility.

RESULTS
The object of this study is the relationship between the perceived credibility of science information and the platform on which it is presented, and the extent to which several linguistic and formal features on one salient platform, Twitter, contribute to perceived credibility. In this section, we present the results of the analyses of participant responses from the two surveys: the first on perceived credibility across five different online platforms, and the second on different presentations within Twitter, one of the salient platforms from Part One.
Using an analysis of the Part One survey, we addressed the extent to which science information is perceived as credible on Twitter compared to science information on other platforms. As shown in Figure 4, all topics were associated with lower credibility than the "Caffeine" study (see Table 1 for the terms that were used for each study topic), with the "Dogs" topic being the least credible. Male respondents were trivially associated with increased perceived credibility (ß = 0.0036, CI = [−0.051, 0.059]). Compared to those with associates degrees or lower ("Associates−"), respondents with a university degree or higher ("University+") were trivially associated with lower perceived credibility (ß = −0.004, CI = [−0.12, 0.094]). A respondent being from the United States was associated with lower perceived credibility (ß = −0.50, CI = [−0.71, −0.28]).

Figure 4. (A) Coefficients of the mixed effects linear regression using Credibility (1-7, treated as continuous) as the response variable. The x-axis corresponds to the coefficients for each predictor variable listed on the y-axis. Error bars correspond to 95% confidence intervals. The effects of the predictor variable of interest, Platform, have been highlighted. For Platform, the factor level "Abstract" was held as the reference level. For Topic, "Caffeine" was the reference level. For Gender, "Female" was the reference level. For Education, "Associate−" (associates degree or lower) was the reference level. More detailed output can be found in Table S1. (B) Boxplots of the distribution of credibility scores for each platform. The notch corresponds to the median value, whereas the red line corresponds to the mean. Dots represent each observation scattered around each possible Credibility value. Labels refer to the number of responses for each credibility score as a proportion of all responses for that platform. (C) The distribution of responses for each Platform/Topic combination. The black dashed line refers to the mean of each distribution, which is also stated in the top left of each plot.
Controlling for topic and respondent demographics, all platforms were associated with lower credibility than journal abstracts; the platform with credibility scores most similar to journal abstracts was news articles. An ANOVA (Table S3) revealed that the platform accounted for the largest proportion of total variance, followed by the stated familiarity and the topic of study. The demographics of respondents accounted for little of the total variance.
We also investigated the distribution of responses for each platform in Figure 4, which reinforced our findings. Abstracts were associated with the highest (median = 5, mean = 5.11) and tweets with the lowest (median = 4, mean = 3.61) credibility ratings, a difference of 1.5 points. Put another way, abstracts were associated with around 40% greater credibility than tweets. Tweets also received the largest proportion of one-point credibility ratings (17.5%) and had the lowest 25th percentile rating of two points.
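The "around 40%" figure follows directly from the reported means, as a quick check confirms:

```python
abstract_mean, tweet_mean = 5.11, 3.61

# Relative difference between mean credibility ratings of abstracts and tweets.
relative_difference = (abstract_mean - tweet_mean) / tweet_mean
print(f"{relative_difference:.1%}")  # 41.6%, i.e., "around 40%"
```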
While general patterns can be observed between platforms, there is also heterogeneity between topics (Figure 4). In all but one topic ("Caffeine") abstracts had the highest credibility on average. This is true even for the "Vitamin" topic, for which respondents complained that the screenshot of the abstract was too blurry to read. For all but one topic ("Sweeteners"), tweets had the lowest credibility ratings, with the lowest overall for "Dogs." For this topic, the distribution of ratings for tweets was nearly the inverse of the abstract: The former received a plurality of one-point ratings whereas abstracts received a plurality of seven-point ratings. It is more difficult to discern clear patterns among the remaining platforms, though taken together, Figures 4A, 4B, and 4C suggest that online videos and news articles received roughly similar credibility ratings, whereas blogs tended to receive lower ratings than news outlets but higher ratings than tweets.
Tweets were associated with the lowest credibility ratings of all the platforms. However, Twitter remains one of the most popular media platforms used by researchers (Piwowar, 2013; Rowlands et al., 2011; Tenopir et al., 2013; Van Noorden, 2014). While researchers have little control over the platform constraints of Twitter itself, there may be actions they can take to compose tweets in a way that maximizes their perceived credibility. In the second part of this study, we assessed the extent to which the formal cues of tweets related to their credibility: specifically, whether source cues indicating a closer association with the scientists or the scientific process were associated with higher perceived credibility, and whether indicators of reception correlated positively with perceived credibility.
In Figure 5, we present the results of a multiple linear regression applied to the responses for Survey Two, using the credibility rating as the response and tweet characteristics, topic, and respondent demographics as predictor variables. Topic 1 was held as the reference level, and compared to this reference, all topics were associated with higher credibility except for topic 2, which was trivially associated with lower credibility. The topic associated with the highest credibility was topic 14, which was associated with a 1.9 point higher score than topic 1 (ß = 1.9, CI = [2.6, 3.2]). The effects observed when controlling for other variables differed from the univariate distribution of credibility by topic (see Figure SI.2), for which topic 6 had the highest average credibility and topic 12 the lowest. Male respondents were trivially associated with higher perceived credibility ratings than female respondents (ß = 0.049, CI = [−0.33, 0.42]). Older participants were also only trivially associated with lower perceived credibility ratings (ß = −0.074, CI = [−0.24, 0.096]), and respondents with a university degree or higher ("University+") were trivially associated with greater credibility ratings (ß = 0.015, CI = [−0.42, 0.45]) compared to those with associates degrees or lower ("Associates−"). A respondent being from the United States was associated with lower credibility ratings (ß = −0.30, CI = [−0.78, −0.17]). All of these findings reinforce those found in Part One.

Figure 5. Charts and graphs associated with greater credibility. (A) Coefficients of the linear regression using Credibility (1-7, treated as continuous) as the response variable. The x-axis corresponds to the coefficients for each predictor variable listed on the y-axis. Error bars correspond to 95% confidence intervals. The effects of the predictor variables of interest, the tweet characteristics, have been highlighted.
For presence of visual, "no visual" was the reference; for title phrasing, "no paraphrase" was the reference; for claim of ownership, "no claim" was the reference; for reception, "low" was the reference. For Topic, "Caffeine" was the reference level. For Gender, "Female" was the reference level. For Education, "Associates−" (associates degree or lower) was the reference level. More detailed output can be found in Table S3.

The tweet features indicating the presence of a visual accounted for the greatest total variance, followed by the topic of the paper (see the ANOVA in Table S5). Controlling for topic and respondent demographics, tweets that contained a chart from the paper were associated with a full 2.30 points higher credibility rating than tweets that did not contain any visual, which was held as the reference (β = 2.30, CI = [1.0, 3.6]), though the confidence interval was quite wide. Similarly, tweets that featured a screenshot of the abstract were associated with 1.5 points higher credibility than those with no visual (β = 1.5, CI = [0.88, 2.2]). Paraphrasing, both positive and neutral, was associated with lower credibility than no paraphrasing. We further investigated the distribution of credibility responses by each tweet characteristic. Figure 5B demonstrates the extent to which tweets containing a visual, whether a chart (median = 5, mean = 5.14) or a photo of the abstract (median = 5, mean = 5.06), were rated as more credible, on average, than tweets with no visual (median = 4, mean = 4.01). Differences were less pronounced for title phrasing, and they ran counter to the effects observed in Figure 5A when controlling for topic and respondent demographics: Figure 5C shows that, considering only the raw distribution, the average credibility for both positive (median = 5, mean = 4.87) and neutral (median = 5, mean = 4.75) paraphrasing is higher than that of no paraphrasing (median = 5, mean = 4.63).
This may indicate that other factors, such as the title of the study itself, have more to do with the perceived credibility of the tweet than the phrasing. The opposite was true for claims of ownership: while claiming ownership was associated with increased credibility in the regression (Figure 5A), it was associated with a slightly lower raw mean (median = 5, mean = 4.68) than not claiming ownership (median = 5, mean = 4.84). For reception, there were no notable differences between the regression and the raw distribution; compared to tweets with low reception (median = 5, mean = 4.76), tweets with moderate reception were associated with slightly lower (median = 5, mean = 4.64) and tweets with high reception with slightly higher credibility (median = 5, mean = 4.87).
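The treatment-coded regression described above can be sketched as follows. This is an illustrative reconstruction, not the authors' analysis code: it uses synthetic ratings, only two of the four tweet characteristics (visual and reception, with "none" and "low" as the reference levels, as in the Figure 5 caption), and coefficient values loosely mimicking the reported effect sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # synthetic respondents-by-tweet observations

# Two categorical predictors; first listed level is the reference.
visual = rng.choice(["none", "chart", "abstract"], size=n)
reception = rng.choice(["low", "moderate", "high"], size=n)

# Treatment ("dummy") coding: one indicator column per non-reference level.
X = np.column_stack([
    np.ones(n),                  # intercept = all reference levels
    visual == "chart",
    visual == "abstract",
    reception == "moderate",
    reception == "high",
]).astype(float)

# Synthetic 1-7 style outcome; coefficients are illustrative assumptions.
true_beta = np.array([4.0, 2.3, 1.5, -0.1, 0.2])
y = X @ true_beta + rng.normal(0.0, 1.0, size=n)

# Ordinary least squares fit.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# 95% confidence intervals from the usual OLS variance estimate.
resid = y - X @ beta_hat
dof = n - X.shape[1]
sigma2 = resid @ resid / dof
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
ci = np.column_stack([beta_hat - 1.96 * se, beta_hat + 1.96 * se])

names = ["intercept", "visual=chart", "visual=abstract",
         "reception=moderate", "reception=high"]
for name, b, (lo, hi) in zip(names, beta_hat, ci):
    print(f"{name:>20}: beta = {b:5.2f}, 95% CI = [{lo:5.2f}, {hi:5.2f}]")
```

Under treatment coding, each coefficient is read exactly as in the text: the expected change in the credibility rating relative to the reference level of that characteristic, holding the other predictors fixed.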

DISCUSSION
Researchers have been encouraged to engage a wider audience with their research, and online platforms offer a prime avenue through which they can reach broad interested audiences. Exploring the traits associated with credibility for science information is increasingly important for understanding the interactions between conversations in the public and scientific spheres. Modern theories of science communication posit a permeable boundary between science and society, where rather than a passive and receptive audience, the public has agency to actively evaluate and engage with science information (Bucchi & Trench, 2016). By making science information sharable, news and social media platforms facilitate in-depth discussions between scientists, while also providing greater access and participation for other stakeholders in these formerly closed scientific conversations (Bucchi & Trench, 2016; Simis-Wilkinson, Madden et al., 2018). As internet audiences discuss science information across platforms and genres, they expand the discourse, introducing and emphasizing different terms than those supplied by the original scientific paper (Lyu & Costas, 2020). Public conversations about scientific findings are valuable not only for disseminating those findings but also for inviting stakeholders in the broader community to discuss social implications that may be underexplored within science.

Platform Credibility
However, online platforms are also crowded with both accurate information and disinformation. To cut through the noise, researchers and science communicators need a better understanding of how the platform and composition of their content relate to its perceived credibility. Research on credibility emphasizes the role of formal cues that users draw upon as they evaluate unfamiliar subject matter (Lucassen & Schraagen, 2013). Even within the limited space of tweets, the inclusion of particular formal cues is associated with higher credibility of science information. Consistent with past studies, we observed that science information is perceived as less credible on Twitter than on other platforms (Schmierbach & Oeldorf-Hirsch, 2012). Twitter was rated the least credible platform, even when controlling for the topic, the respondent's stated familiarity, and respondent demographics. While there was heterogeneity by study, Twitter was rated as least credible, on average, for four of the five study topics. Differences in platform have been shown to influence user perceptions of credibility to the point that such associations have been framed as bias in previous studies (Schmierbach & Oeldorf-Hirsch, 2012; Winter & Krämer, 2014). The particular formal cues of Twitter, such as character restrictions that leave little space for the inclusion of cues that could signal credibility, may affect the overall credibility of information on the platform.
In contrast to tweets, scientific abstracts were consistently rated as the most credible, associated with 0.77 points higher credibility ratings than the second most credible source, news articles, and 1.4 points higher ratings than tweets. Images of abstracts may impart some degree of credibility to a claim. This notion was supported by an accidental finding from our analysis: after receiving the survey results, we found that the screenshot of the abstract for the "Vitamin" study was blurry and mostly illegible (Figure S3A). Even though it was unreadable, the abstract was still rated as more credible than the other platforms. The association of the abstract with the institution of science may bolster its credibility, as theorized by Brewer and Ley (2013) in their study of public perceptions of climate science. The high credibility of the blurry abstract suggests that the format of the abstract and its close association with the institutions of formal research could be more important to credibility than the content itself.
Patterns for the remaining platforms were less clear. Generally, abstracts were rated as most credible, followed by news articles and online videos, which tended to have similar credibility ratings. Blogs were rated as one of the least credible platforms, second only to tweets. The confidence intervals calculated for these three platforms in the regression analysis all overlap, suggesting a lack of certainty in this ordering (see Figure 4A). However, tweets were rated as the least credible, with confidence intervals that overlapped with those of no other platform. This low credibility need not undermine Twitter's usefulness for scientists performing outreach and professional networking; these findings only sound a warning note that the deliberate use of appropriate formal and stylistic features to convey credibility is important when composing tweets.

Credibility and Twitter's Formal Cues
We observed only partial support for the idea that formal cues in tweets indicating a closer association with the scientists or the scientific process would be associated with higher perceived credibility. Specifically, we found that only one feature of a tweet, the presence of a visual, was significantly associated with increased ratings of credibility. The inclusion of a figure from the paper was associated with the highest credibility, and the inclusion of a screenshot of the abstract was associated with nearly as large an increase. The boost in credibility associated with scientific figures is reminiscent of Latour's discussion of scientific inscription devices; such signifiers can be persuasive devices for establishing the objective facts of science, both in and outside the lab (Latour, 1987). While this finding suggests that scientists can use figures to lend credibility to their work, it also warrants caution: figures and other scientific signifiers such as mathematical formulas, even when trivial, may confer credibility when it is not justified (Tal & Wansink, 2014). Furthermore, information density may vary between different figure types, which may strongly influence credibility, as evidenced by the variability of ratings and the wide confidence intervals. While the scientific figures shared a similar overall formatting style, as they all originated from the same journal, the figure type varied from study to study and included circle charts, schematic diagrams, bar graphs, and tables. As abstracts and scientific figures have strong associations with formal science, they may also appear especially credible. Future work should investigate the effects of including a generic image that is neither the abstract nor a figure, to determine whether scientific figures in particular contribute to credibility or any image confers a similar boost. The remaining features that we observed were at most weakly associated with credibility.
There was some evidence that paraphrasing the title and claiming ownership relate to a tweet's perceived credibility. However, because these effects disappear when other variables are controlled, it is possible that they are artefacts of the papers themselves rather than of the composition of the tweet.
We observed only weak evidence that indicators of reception positively correlate with perceived credibility. Compared to tweets with low reception (a low number of likes and retweets), tweets with moderate reception were trivially associated with lower credibility. However, tweets with high reception were associated with slightly higher credibility. While this effect was statistically significant, its confidence interval nearly included zero and its estimated effect was small, associated with only 0.19 points greater credibility than low reception (Figure 5A; Table S4). Our finding regarding reception disagrees somewhat with past studies, which found that for other online formats the presence of peer evaluations was strongly related to credibility (Kobayashi, 2018; Winter & Krämer, 2014). In this way, our findings provide further support to the idea that although scientific consensus itself appears to be convincing to social media users, indicators of audience reception on social media play little role in increasing the credibility of the message (Bode et al., 2020; Borah & Xiao, 2018). This discrepancy may result from differences between Twitter's formal cues and those of other platforms, if likes and retweets on Twitter are not interpreted by viewers as social approval, or if that social approval is not as important to evaluating the credibility of information on Twitter as it is in other contexts. Twitter's reputation for spreading misinformation (Shao et al., 2018; Vosoughi et al., 2018) may also lead users to become indifferent to reception.

Limitations
We note several limitations to this study. For Part One, we selected articles that appeared across several existing platforms, which was a high standard for inclusion; this meant that the five studies we selected were covered online due to their attention-grabbing qualities and may not be representative of scientific information as a whole. Variance in the presentation of topics across each platform is also a limitation; for example, the title of the "Dogs" news article was sensationalized as "Scientists observe dogs relieving themselves, discover something amazing," whereas the scientific article was more neutrally titled "Dogs are sensitive to small variations in the Earth's magnetic field." These differences in tone may have affected users' evaluations of credibility for these articles. There were also differences in information density across these sources; some formats, such as abstracts, contained more text and therefore more information, which may have given users greater opportunity to form a conclusive opinion on their credibility. There were issues with some screenshots that were discovered after the survey was conducted; for instance, the abstract for the study about Vitamin D reducing the risk of Alzheimer's was blurry. Furthermore, only the Francis-Tan and Mialon (2015) paper showed the authors' institutional affiliation, which could have affected the abstract results. More generally, the screenshots used to represent each platform varied in many factors. The use of real-world examples of online science information lent realism to the study, but it also introduced many potential confounding factors, such as differences in formatting, presence of images, and the presentation of authors.
In Part Two of our study, we mitigated this heterogeneity by selecting articles from a single psychology journal, but this may limit the generalizability of our findings if the factors relating to tweet credibility for the psychological sciences are not consistent across scientific fields. Furthermore, as the study topics of the tweets in Part Two were presented in a consistent order, it is inadvisable to draw any conclusions about the relative credibility of the topics due to order effects. The use of AMT, rather than more traditional survey methodologies, makes it difficult to assess the representativeness of our respondents and also limits the generalizability of our findings. In addition, real Twitter users are more likely to know the authors of tweets they encounter, whether personally or by reputation. A user's personal relationship with the poster or sharer of news has been shown to affect the credibility of the posted information (Turcotte et al., 2015), and a similar dynamic may also contribute to tweet credibility when the tweeter is known to the user, but we were unable to represent this within the survey. However, because many tweets that Twitter users are exposed to have been retweeted from users whom they do not follow, we believe that perceived credibility in the absence of a pre-existing relationship between users is important in itself. In the interim between the 2015 survey period and the present, public attitudes towards the credibility of information on Twitter may also have changed. Despite these limitations, we observed clear and consistent trends that lend insight into the factors influencing the online credibility of scientific information.

CONCLUSION
In this study, we conducted a survey to investigate the extent to which the perceived credibility of scientific information differed between five online platforms. We observed that journal abstracts were consistently deemed most credible, whereas tweets were consistently deemed the least credible, though we also noted high variance by topic. We then conducted a second survey in which we assessed the extent to which perceptions of the credibility of tweets related to the characteristics of their composition. We constructed a set of artificial tweets reporting science information, each with a unique combination of key characteristics: whether the tweet displayed an image relating to the paper, the paraphrasing of the paper's title, whether the author claimed ownership of the research, and the tweet's reception (number of likes and retweets). We observed that the presence of a chart or a screenshot from the paper was associated with greater credibility, but we found little to no evidence of meaningful differences based on the other factors.
As researchers continue to leverage these platforms, it becomes important that they compose scientific information in ways that make it most credible. Beyond helping researchers communicate their work, composing credible content may help the general public focus on the most credible scientific information rather than on the wealth of disinformation. During outreach efforts, it is important that researchers be cognizant of how context and format affect the credibility of the information they present. Some formal cues are more influential than others in users' evaluations of credibility. Our findings provide evidence of the importance of platform and topic to perceived credibility online. We also demonstrated the importance of visuals to the perceived credibility of tweets. More work is needed to understand how the credibility of scientific information differs by context and platform, and the ways in which the formal cues of information impact its credibility.

ACKNOWLEDGMENTS
We would like to thank the members of our research group who gave useful feedback on a presentation of this work, and for useful discussions during our weekly meetings.