A Data Collection on Secondary School Students' STEM Performance and Reading Practices in an Emerging Country

Abstract
Science, technology, engineering, and mathematics (STEM) education has become a critical factor in promoting sustainable development. Meanwhile, book reading is still an essential method for cognitive development and knowledge acquisition. In developing countries where STEM teaching and learning resources are limited, book reading is an important educational tool to promote STEM. Nevertheless, public data sets about STEM education and book reading behaviors in emerging countries are scarce. This article, therefore, presents a data set of 4,966 secondary school students from a school-based data collection in Vietnam. The data set comprises five major categories: 1) students' personal information (including STEM performance), 2) family-related information, 3) book reading preferences, 4) book reading frequency/habits, and 5) classroom activities. By introducing the design principles, the data collection method, and the variables in the data set, we aim to provide researchers, policymakers, and educators with well-validated resources and guidelines for conducting low-cost research and pedagogical programs in emerging countries.


INTRODUCTION
Globally, technology has become a crucial component of our cultural, social, and economic landscape. Scientific and technological innovation is an essential means to promote economic advancement and social justice. However, emerging countries, which lag economically, also lag markedly in research and technology development [1,2,3]. Therefore, it is suggested that a focus on science, technology, engineering, and mathematics (STEM) education may serve as a useful direction for these nations. It is also a crucial factor in achieving the United Nations Educational, Scientific and Cultural Organization (UNESCO)'s Sustainable Development Goal (SDG) 4: Quality Education [4]. In particular, STEM education can equip young people with the skills to adapt to the digital age and create life-changing opportunities for themselves and their families.
To improve STEM learning effectiveness, especially in developing contexts, reading is a critical element that needs addressing in the era of technology. Reading is a fundamental activity that helps enhance various cognitive capabilities, academic achievement, and educational and occupational attainment in young adulthood [5,6,7]. Children's technological problem-solving skills could be improved by developing a reading culture at home [7]. Furthermore, reading is also found to facilitate the creation of new ideas and inquiries for the problem-solving process involved in STEM learning [8]. Empirically, Braun et al. [9] and Fang and Wei [10] found that secondary school students in the United States who spend time reading additional books perform better in both reading and scientific subjects than those who do not. However, a majority of prior studies have been conducted in developed settings. In contrast, the topics regarding STEM performance and reading behaviors among children in developing countries remain understudied.
A lack of data may contribute to the research shortage in emerging countries. Searching multiple open, globally scoped data repositories (Mendeley Data, Zenodo, Harvard Dataverse, Figshare, etc.) and data journals (e.g., Scientific Data and Data in Brief), we found limited data sets on STEM education and reading behaviors among children. There is a series of national survey data concerning children's educational conditions and performance. Still, the majority of them were conducted in developed countries, specifically Australia and the United States [11,12].
Of those in developing contexts, a data set provided by Ranjeeth et al. [13] offers resources to investigate the relationship between Indian secondary school students' academic performance and outside-class habits. Nonetheless, it lacks a focus on STEM-related achievement, and its questions are not systematically structured. Another data set about Nigerian students' academic performance in STEM-related disciplines is also accessible in Data in Brief [14], but its contents were collected solely from undergraduate schools. Besides data sets made open in data repositories and journals, we have found a wide array of global data sets directly related to SDG 4 in the UNESCO Institute for Statistics' online database [15]. Those data sets are structured to track macro-level progress in achieving SDG 4's targets, whereas data sets concentrating on individual-level development are scarce.

Therefore, the current paper presents a highly organized and multifaceted data set regarding socioeconomic status, family conditions, reading habits and preferences, perceived classroom activities, and academic performance among junior high school students in Vietnam. Vietnam is a unique case of an emerging nation that has been highly regarded for K-12 students' STEM achievement. Notably, Vietnam's students have recorded high performance levels in the Programme for International Student Assessment (PISA) in each of the three years in which they participated [16,17,18], even though the country joined the assessment only recently. Set against Vietnamese households' comparatively limited socioeconomic status, this attainment is a case worth investigating.
The current data set is a valuable resource for providing evidence that helps policymakers and education leaders improve educational quality and equity. Given the low cost of a substantial data collection [19] and the shortage of earlier research on similar issues in emerging countries, the current data set and its design might be useful points of reference for other developing countries with contexts similar to Vietnam's.
Several findings regarding book reading and STEM performance have been obtained and published using only a portion of the current data set [20,21,22]. A typical example is a recent publication that analyzes the social gap and gender disparity in Vietnam and their association with STEM outcomes [20]. As the reproducibility crisis spans disciplines and hampers the accumulation of knowledge, Munafò et al. [23] encourage scientists to improve transparency through open science. Thus, open access to the current data set will enable other researchers to replicate and validate the published findings, strengthening the results' integrity and reliability amidst the movement to make science more transparent [24,25].

Data Collection Sample and Design
The current research's data [26] were collected through a school-based data collection of STEM performance, reading habits, and preferences of junior high school students (Grade 6 through Grade 9, which corresponds to ages 11 to 15) in Ninh Binh Province, Vietnam. The data collection was conducted in two periods. The first was between December 2017 and January 2018, while the second ran from February until July 2018. In total, 4,966 responses from adolescent students were acquired.
The questions in the questionnaire were initially designed by the Vuong and Associates office. Then, an officer in the provincial Department of Education and Training distributed the paper questionnaires to the academic administrators of 16 public junior high schools in the area, who were later in charge of gathering the questionnaire responses. All participants in the data collection process adhered to the ethical code of the institutions responsible for the research.
The research project can be structured into five phases: 1) questionnaire design, 2) questionnaire collection, 3) quality control for questionnaire answers, 4) data set generation, and 5) data analysis.

Data Validation
To ensure the quality and validity of the data, the data were rigorously checked several times in three phases of the data collection: 1) in-school phase, 2) gathering phase, and 3) computerizing phase (Figure 1). During the first phase, the homeroom teachers were in charge of collecting the questionnaires from students and initially checking the response accuracy and missing answers; then, the person in charge of the school's academic activities validated the collected questionnaires one more time. In the next phase, all the questionnaires from the 16 schools were transferred to the data collection administrator, an officer in the provincial Department of Education and Training, for the third quality and validity check. After the examination, the administrator handed the responses over to the current study's principal investigator for the computerizing phase. During this phase, research team members entered the data into an MS Excel file and cross-checked the file to ensure that the entered data precisely represented the answers on the questionnaires. When there was doubt about a response, the team members consulted with the principal investigator for resolution. When serious errors were found, the research team contacted the data collection administrator for validation.
The data set has been analyzed using both frequentist and Bayesian approaches in several publications, including [20,21,22]. These works report several notable results. Regarding socioeconomic backgrounds, gender had a negligible correlation with students' STEM results at school, and it was also found that female students can achieve better results than their male counterparts [20]. Regarding reading behavior, empirical results show that higher grades in STEM-related subjects are predicted by reading interest, with students who love reading books achieving higher scores than those who take no interest in books [21]. Scholarly culture at home and reading practices were suggested to be significant determinants of students' reading interest [22].
In the current data descriptor, we also conduct an example analysis of the current data set using Bayesian estimation and the Markov Chain Monte Carlo (MCMC) technique. The example is shown in Section 3 below.

Data Collection
Questionnaires were distributed to all classes in 16 public junior high schools by the academic administrators and then collected by the homeroom teachers. In total, 4,966 students' responses were received. Boys and girls each accounted for about half of the sample (49.36% and 49.62%, respectively), while the remaining 1.03% (51/4,966) of students did not report their biological sex. The four grades (Grade 6 to Grade 9) were represented in similar proportions: 24.90% (1,237/4,966) from the sixth grade, 24.04% (1,194/4,966) from the seventh grade, 23.86% (1,185/4,966) from the eighth grade, and 25.51% (1,267/4,966) from the ninth grade. A majority of the sample reported that they liked reading (88.40%), and 41.80% of students were the first child in their families. Students' average score of STEM-related subjects ("APS45") and favorite type of books ("Topic") are plotted in Figure 2.

Figure 1. Diagram of the data collection and validation.
Note: There were three primary data collection phases: 1) in-school phase, 2) gathering phase, and 3) computerizing phase. The questionnaires' responses were checked one or two times before proceeding to the next phase. Where there was doubt regarding the nature of an answer, the data validator would check with the previous validator in the process.

Response Coding
After receiving the hand-written questionnaires, the research team codified responses into 37 different variables, which could be classified into five main categories: 1) personal information (including STEM performance), 2) family-related information, 3) book reading preferences, 4) book reading frequency/habits, and 5) classroom activities. The variables' coding and a brief description of Categories 1 and 2 are shown in Table A1, while those of Categories 3 to 5 are displayed in Table A2. Below is a detailed explanation of each category's variables in the text.
1) Personal information. The personal information category includes "Sex" (male, female), "Grade", "School", "APS45", "APSVNEN", "FutureJob", and "Hobby". The variable "APS45" represents the average scores of the most recent 45-minute examinations of Mathematics, Physics, Chemistry, and Biology. In contrast, the variable "APSVNEN" indicates the average score of mid-term tests of Mathematics and other natural science subjects. As for "Hobby", students were given six main kinds of activities, namely: reading books, watching TV/listening to music, housework/farming, observing nature, interacting with friends/family members, and others. For better utility, these activities were coded as "a", "b", "c", "d", "e", and "f", respectively.

Figure 2. Visual representations of students' information.
Note: Figure 2A displays the distribution of the average score of the most recent 45-minute tests of Mathematics, Physics, Chemistry, and Biology by the sex of the students. Overall, male students had a higher mean score of STEM-related subjects than female students. Figure 2B shows the distribution of students by their favorite type of book, and within each favorite type of book the distribution of students by their grade. "a", "b", "c", "d", "e", and "f" correspond with mathematics/physics, literature, foreign language, natural science/chemistry/biology, history/geography, and information technology, respectively. 27.57% (1,369/4,966) of students were keen on literature-related books, and the distribution of students by grade within the literature type of book was generally balanced.

2) Family-related information. The second category, regarding family-related information, can be further classified into three sub-groups. In the first sub-group, "RankingF" and "NumberofChi" record the student's birth order in the family and the number of children in the student's family, respectively. The second sub-group regards parents' characteristics, such as father's/mother's education ("EduFat"/"EduMot"), father's/mother's age ("AgeFat"/"AgeMot"), and father's/mother's career ("CareerFat"/"CareerMot"). The educational level of the student's parents was captured in four levels: under high school ("UnderHi"), high school ("Hi"), undergraduate ("Uni"), and graduate school ("PostGrad"). The last sub-group focuses on the economic aspects of the family, including perceived economic status ("EcoStt"), awareness of the family's monthly income ("KnowledgeInc"), and estimated monthly income ("EstIncome"). The perceived economic status was coded as follows: "poor" = low-income family, "med" = medium-income family, and "rich" = wealthy family, whereas the variable "KnowledgeInc" only consists of two answers, "yes" or "no". The students were asked to report their families' estimated income ("EstIncome") in Vietnamese Dong.
3) Book reading preferences. The third category primarily concerns the reading preferences of students, which are captured by seven variables. The reading interest of a student was recorded as "Readbook" (yes, no). The favorite type of book was captured as "Topic" (mathematics/physics = "a", literature = "b", foreign language = "c", natural science/chemistry/biology = "d", history/geography = "e", and information technology = "f"). Besides the favorite type of book, the type of book the student would prioritize when being gifted was represented by the categorical variable "Typebook" (novel = "a", biography = "b", popular science = "c", arts = "d", vocational instruction = "e", and other = "f"). The reason for the prioritization was recorded as "Reason" (personal preferences = "a", recommended by parents = "b", recommended by teachers/friends = "c", and serendipity = "d"). "PrioAct" specifies the primary activity conducted by students when meeting a good book; it consists of "a" (sharing with friends/family), "b" (recording), "c" (applying the content to daily life), and "d" (reflecting and relating to personal knowledge). Students' prime activity after reading a good book was recorded as "AftAct" (finding more books on the exact issue = "a", finding more books on the related issue = "b", finding books on the new issue = "c", and reading the book again = "d"). Finally, the student was asked to provide two favorite books, which were recorded in the variable "Read_like".

4) Book reading frequency/habits. The fourth category comprises five variables concerning the reading habits of students. The first two variables, "TimeSci" and "TimeSoc", concern the length of time students spent daily reading science books and literature/social sciences-related books, respectively. Both variables were coded as follows: less than 30 minutes = 1, between 30 and 60 minutes = 2, and over one hour = 3. Two other variables in this category are habits of the students' parents: buying books ("Buybook") and reading stories ("Readstory") for their children. These two variables were constructed with binary responses, "yes" or "no". The last variable in this category was generated from the question regarding books' primary source ("Source"). To answer this question, students were asked to choose one alternative among buying books on their own or with parents' money ("buy"), borrowing from friends or libraries ("borrow"), or being gifted or rewarded ("gift").

5) Classroom activities. The classroom is an important place to build up students' reading behaviors, interests, and preferences. Therefore, we created a separate section with four questions in the questionnaire to examine students' perceptions of classroom activities. The first question, which corresponds to the binary variable "EncourAct" (yes or no), asks whether students were keen on activities that encouraged reading. Then, students were asked to name the activity that they were most interested in ("MostlikedAct") among the four following options: book exhibition = "a", storytelling competition = "b", story-writing competition = "c", and illustrating books' content by drawing = "d". Besides favorite activities that encouraged reading, we also examined students' perceptions of bookshelf conditions in the classroom ("Bookcase"). Students were required to assess the condition according to the following alternatives: diverse and interesting = "a", lack of good titles = "b", lack of books = "c", and no bookshelf = "d". The last variable ("Notread_like") was generated from an open-ended question in which students could freely provide three favorite titles that they would like to read but that were not available on the classroom bookshelf.
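The coding scheme described in the five categories above can be captured as a small lookup table. The following is a minimal Python sketch, assuming the .csv file stores the letter and number codes exactly as described in the text. The `CODEBOOK` dictionary and the `decode_row` helper are hypothetical names introduced here for illustration and are not part of the deposited data set; only a subset of the coded variables is shown.

```python
# Hypothetical lookup table built from the coding scheme described in the text.
TIME_CODES = {"1": "less than 30 minutes", "2": "30 to 60 minutes", "3": "over one hour"}

CODEBOOK = {
    "Hobby": {"a": "reading books", "b": "watching TV/listening to music",
              "c": "housework/farming", "d": "observing nature",
              "e": "interacting with friends/family members", "f": "others"},
    "Topic": {"a": "mathematics/physics", "b": "literature", "c": "foreign language",
              "d": "natural science/chemistry/biology", "e": "history/geography",
              "f": "information technology"},
    "Typebook": {"a": "novel", "b": "biography", "c": "popular science",
                 "d": "arts", "e": "vocational instruction", "f": "other"},
    "Reason": {"a": "personal preferences", "b": "recommended by parents",
               "c": "recommended by teachers/friends", "d": "serendipity"},
    "TimeSci": TIME_CODES,
    "TimeSoc": TIME_CODES,
    "MostlikedAct": {"a": "book exhibition", "b": "storytelling competition",
                     "c": "story-writing competition",
                     "d": "illustrating books' content by drawing"},
    "Bookcase": {"a": "diverse and interesting", "b": "lack of good titles",
                 "c": "lack of books", "d": "no bookshelf"},
}


def decode_row(row):
    """Return a copy of `row` with known codes replaced by readable labels.

    Variables without a codebook entry, unknown codes, and blank answers
    are passed through unchanged.
    """
    decoded = {}
    for variable, value in row.items():
        labels = CODEBOOK.get(variable, {})
        decoded[variable] = labels.get(str(value).strip(), value)
    return decoded


# Made-up example record, not taken from the data set.
example = {"Sex": "female", "Topic": "b", "TimeSci": 2, "Bookcase": "d", "Hobby": ""}
print(decode_row(example))
```

Keeping the labels in one dictionary makes it easy to apply the same decoding to every row read with, for example, `csv.DictReader`, while leaving missing answers visible as blanks.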

EXEMPLARY DATA ANALYSIS
The responses were initially entered into an MS Excel file and then converted to a comma-separated values (.csv) file. Bayesian linear regression with the Markov Chain Monte Carlo (MCMC) technique was employed to analyze the data, with "APS45" (students' average score of the most recent 45-minute tests of Mathematics, Physics, Chemistry, and Biology) as the dependent variable and "Readbook" (reading interest) and "Readstory" (whether parents read stories to their children) as the independent variables. Here, the Bayesian approach is preferred over the frequentist approach because of several replication issues associated with the P-value. Specifically, "the wide sample-to-sample variability" in the P-value and P-hacking practices are two major contributors to the reproducibility crisis [23,27,28].
The procedure was adapted from the Bayesian protocol for examining social data by Vuong et al. [29]. All the priors of the model's coefficients were set as uninformative before simulation. The bayesvl R package was utilized due to its user-friendly operation and graphical visualization power [30,31]. The Bayesian analytical model examines the associations between "APS45" and "Readbook" and "Readstory". The Bayesian modelling and MCMC simulation were run in R (version 4.0.2).
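Based on the description above, the model can be written as a linear regression with a normal likelihood. The notation below (the intercept β₀, the coefficients β_Readbook and β_Readstory, and the residual standard deviation σ) is assumed here for illustration and may differ from the authors' original specification. The flat priors stand in for the uninformative priors mentioned in the text.

```latex
\begin{aligned}
\text{APS45}_i &\sim \mathcal{N}(\mu_i, \sigma) \\
\mu_i &= \beta_0 + \beta_{\text{Readbook}} \, \text{Readbook}_i
        + \beta_{\text{Readstory}} \, \text{Readstory}_i \\
p(\beta_0),\; p(\beta_{\text{Readbook}}),\; p(\beta_{\text{Readstory}}) &\propto 1
\end{aligned}
```

Here i indexes students, and Readbook_i and Readstory_i take the value 1 for "yes" and 0 for "no".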

The posterior coefficients of the estimation of "APS45" against "Readbook" and "Readstory" are displayed in Table 1. The two independent variables were set as binary variables during the model fitting process, with "yes" = 1 and "no" = 0. The model was fitted with four Markov chains, 5,000 iterations, and 2,000 warm-up iterations. After running the MCMC simulation, two standard diagnostics indicated good convergence: all coefficients' Rhat values (Gelman shrink factor) were equal to one, and all values of n_eff (effective sample size) exceeded 1,000 (Table 1). The visual diagnostics of the Markov chains' convergence are also displayed in Figures 3 and 4. The "Readbook" and "Readstory" coefficients had positive means and acceptable standard deviations (μ_Readbook = 0.44 and σ_Readbook = 0.14; μ_Readstory = 0.14 and σ_Readstory = 0.09). Based on these results, we suggest that improving the interest in reading books and encouraging parents to read stories to their children might result in better STEM performance among students. The coefficients' posterior distributions are displayed in Figures 5A and 5B as a histogram and a two-dimensional density plot, respectively.

Note: Table 1 presents the posterior estimates acquired from the Bayesian linear regression model employing "APS45" as the dependent variable and "Readbook" and "Readstory" as independent variables.

Note: Figure 3 illustrates the visual diagnostic of the Markov property of the coefficients' posterior estimates by trace plots. The trace plots in Figure 3 were created from four Markov chains, each containing 5,000 iterations, of which 2,000 were warm-up iterations. All the trace plots exhibit two primary characteristics meeting the Markov property: stationarity and good mixing.

Note: Figure 5A illustrates the posterior distributions of the "Readbook" and "Readstory" coefficients on a line histogram, while Figure 5B displays the distribution of simulated posterior values on a two-dimensional surface.
In both Figures 5A and 5B, the coefficients' distributions lie mostly on the positive side of the axes, showing a positive association of "Readbook" and "Readstory" with students' STEM performance ("APS45"). The "Readbook" coefficient had a larger posterior mean than the "Readstory" coefficient, indicating a stronger association with "APS45".
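The fitting and convergence checks described in this section can be illustrated with a small self-contained sketch. It is not the authors' bayesvl code: it swaps in a plain random-walk Metropolis sampler, runs on synthetic data rather than the deposited file, uses flat priors as a stand-in for the uninformative priors, and computes the classic Gelman-Rubin Rhat, which may differ from the variant the package reports. The true coefficients 0.44 and 0.14 are simply borrowed from the posterior means reported above; the intercept, noise level, sample size, step size, and all function names are hypothetical.

```python
import math
import random
import statistics


def simulate_rows(n, rng):
    """Synthetic stand-in for the survey: (Readbook, Readstory, APS45), yes=1, no=0."""
    rows = []
    for _ in range(n):
        readbook = rng.randint(0, 1)
        readstory = rng.randint(0, 1)
        aps45 = 7.0 + 0.44 * readbook + 0.14 * readstory + rng.gauss(0.0, 0.5)
        rows.append((readbook, readstory, aps45))
    return rows


def log_posterior(theta, rows):
    """Normal likelihood with flat priors on the coefficients and on log(sigma)."""
    b0, b_book, b_story, log_sigma = theta
    ssr = sum((y - (b0 + b_book * x1 + b_story * x2)) ** 2 for x1, x2, y in rows)
    return -len(rows) * log_sigma - ssr / (2.0 * math.exp(2.0 * log_sigma))


def run_chain(rows, n_iter, n_warmup, step, rng):
    """Random-walk Metropolis; returns the post-warm-up draws of theta."""
    ys = [y for _, _, y in rows]
    # Start near the data, with jitter so that chains begin at different points.
    theta = [statistics.mean(ys) + rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3),
             rng.gauss(0.0, 0.3), math.log(statistics.stdev(ys))]
    current = log_posterior(theta, rows)
    draws = []
    for i in range(n_iter):
        proposal = [t + rng.gauss(0.0, step) for t in theta]
        candidate = log_posterior(proposal, rows)
        # Accept with probability min(1, posterior ratio); 1 - random() avoids log(0).
        if math.log(1.0 - rng.random()) < candidate - current:
            theta, current = proposal, candidate
        if i >= n_warmup:
            draws.append(theta)
    return draws


def rhat(chains):
    """Classic Gelman-Rubin statistic for one parameter across several chains."""
    n = len(chains[0])
    within = statistics.mean(statistics.variance(c) for c in chains)
    between_over_n = statistics.variance([statistics.mean(c) for c in chains])
    return math.sqrt(((n - 1) / n * within + between_over_n) / within)


rows = simulate_rows(300, random.Random(2021))
# Four chains, 5,000 iterations each, 2,000 of them warm-up, as in the text.
chains = [run_chain(rows, 5000, 2000, 0.03, random.Random(seed)) for seed in range(4)]
readbook_draws = [[theta[1] for theta in chain] for chain in chains]
readbook_mean = statistics.mean(statistics.mean(c) for c in readbook_draws)
readbook_rhat = rhat(readbook_draws)
print(f"Readbook: posterior mean = {readbook_mean:.2f}, Rhat = {readbook_rhat:.3f}")
```

An Rhat close to one indicates that the between-chain variance is small relative to the within-chain variance, which is the sense in which the four chains are said to have converged.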

USAGE NOTES AND CONCLUSION
The available data set offers several key features that can be used in future research. First, researchers can continue studying the relationship between children's reading habits and educational achievements. The effects of parental involvement, gender, and school activities on children's reading behaviors in an emerging context are also worth exploring. Evidence from these analyses will be beneficial to educators and policymakers, especially those in developing countries.
The deposition of the data set also allows researchers to replicate previous findings. Amidst the reproducibility crisis [23,27,28], replication is important for producing robust and precise evidence. Making the data set open also improves transparency and the integrity and reliability of published results. Thus, previous findings will be put under the scrutiny of post-publication review. Hopefully, this practice will support the cause of open access and help reduce irreproducibility.
Finally, while the data set is an important asset, the survey's design principles are also valuable. We designed the study on a low-cost basis due to limited funding. The data set and its subsequent analyses suggest that this method is cost-effective. Thus, researchers from developing countries and those working with limited resources can use these design principles. In a larger context, we hope this method will also reduce the cost of science [19] and contribute to scientific communities worldwide.
In conclusion, the current data set was systematically designed and collected. Hence, it provides researchers, policymakers, and educators with well-validated resources covering multiple aspects of secondary school students' lives for formulating pedagogical programs and strategies in Vietnam and other countries, especially emerging countries with similar contexts.

DATA AVAILABILITY STATEMENT
All the data were anonymized and stored in a .csv format file that is available in the Science Data Bank repository, https://doi.org/10.11922/sciencedb.j00104.00090, under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.