Using web content analysis to create innovation indicators—What do we really measure?

This study explores the use of web content analysis to build innovation indicators from the complete texts of 79 corporate websites of Canadian nanotechnology and advanced materials firms. Indicators of four core concepts (R&D, IP protection, collaboration, and external financing) of the innovation process were built using keywords frequency analysis. These web-based indicators were validated using several indicators built from a classic questionnaire-based survey with the following methods: correlation analysis, multitraits multimethods (MTMM) matrices, and confirmatory factor analysis (CFA). The results suggest that formative indices built with the questionnaire and web-based indicators measure the same concept, which is not the case when considering the items from the questionnaire separately. Web-based indicators can act either as complements to direct measures or as substitutes for broader measures, notably the importance of R&D and the importance of IP protection, which are normally measured using conventional methods, such as government administrative data or questionnaire-based surveys.


INTRODUCTION
The majority of researchers in innovation and technology management today still rely on public databases and questionnaire-based surveys to obtain most of their data to perform quantitative studies on industrial strategies and innovation activities. However, public databases are often incomplete or too general in nature. Although questionnaires remain precise instruments, the process of designing, testing, and administering questionnaire-based surveys can be especially time-consuming and costly for researchers. Furthermore, oversolicitation of respondents and of their time militates against questionnaire-based surveys, which suffer from increasingly lower response rates (less than 10%) and thus threaten the validity of studies performed using such methods, for instance because of the potential for significant nonresponse biases.
To complement questionnaire-based data, social scientists have often used secondary data. With the development of "Big Data" analytical tools, websites are increasingly recognized as additional information gold mines. Researchers in innovation and technology management are now investigating whether they can use the information that organizations provide on their websites to acquire valuable additional data for their research. Technology companies generally maintain websites that allow the media, as well as potential investors, customers, suppliers, and collaborators, to learn about the nature of their activities. The information provided on company websites is as rich as it is diversified, including products, services, business models, R&D activities, and more. As these corporate websites are freely available to anyone with internet access, researchers need to evaluate their value as data sources. Before fully validating their usage, a few questions need to be answered: Is it possible to extract the information contained on these websites and convert it into useful data for research purposes? Is the available information reliable and sufficient to provide an accurate picture of the firm's specific characteristics? In other words, can the contents of a corporate website be used to identify different business innovation attributes?
This research aims to study high technologies that should have emerged by now, nanotechnologies and advanced materials, but that are still difficult to assess because of their pervasiveness throughout the industry. Nanotechnologies and advanced materials, as so-called enabling technologies, have innumerable applications and are present in all major industrial sectors, such as food, agriculture, electronics, renewable energy, the environment, biomedical, and healthcare. The ability to develop and use nanotechnologies and advanced materials is expected to give impetus to innovation performance and economic growth (Hwang, 2010). The versatility of these advanced technologies' applications and the sectors affected by them make them particularly interesting for studies in innovation management because they allow an intersectoral bird's-eye view of the industry. However, owing to the pervasiveness of nanotechnologies and advanced materials throughout the economy, national statistics and other data repositories often fail to accurately measure their impact. A potential solution to alleviate this perceived lack of accurate data lies in the information contained on company websites.
The clear majority of companies working in high-technology fields, such as those producing or using nanotechnology and advanced materials, maintain updated websites. Although the online information is made available by the companies themselves, which suggests the possibility of a strong self-reporting bias, this source of information is deemed suitable for the study of the emerging technologies (Gök, Waterworth, & Shapira, 2015). For instance, Youtie, Hicks, et al. (2012) noted that small businesses tend to have less content-heavy websites, which facilitates the handling of data. A successful web mining analysis has several advantages over questionnaires, scientific publications, and patents. First, the population covered by searching the web (web crawling) is very large (Herrouz, Khentout, & Djoudi, 2013) compared to questionnaire-based studies, which generate few returns (low response rate). Contrary to government data, the frequency of updates is high, even daily, in most cases (Gök et al., 2015). One significant drawback, however, is that companies do not disclose all of their strategic and business data on their websites. The main disadvantage of such web-based data stems from problems inherent in organizing and interpreting highly unstructured information, as each site organizes various types of information differently. In addition, we do not yet know exactly what the significance is of what we are measuring, hence the necessity to concurrently use both traditional and webometrics methods and to compare them with one another.
In this study, we analyzed and compared four sets of measures of innovation and commercialization respecting nanotechnology and advanced materials firms in Canada stemming from two different data gathering techniques: word frequency analysis from web content and questionnaire-based surveys. Comparisons between the results from both methods were obtained via correlations. To ensure a convergent and discriminant validation of our results, a multitraits multimethods (MTMM) technique was then performed on the most significant measures. Formative indices were built to determine the best representation of the web-based indicators. One final MTMM matrix was used, along with an MTMM confirmatory factor analysis (CFA), as a post hoc analysis to ensure the robustness of our results.
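As an illustration of the comparison step described above, the following minimal sketch computes pairwise correlations between web-based and questionnaire-based indicators, which is also the raw material of an MTMM matrix. The choice of Pearson's coefficient, the function names, and the indicator labels are assumptions made for illustration; the text above does not specify the coefficient used.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(indicators):
    """Correlate every pair of indicators.

    `indicators` maps a hypothetical label such as "web_rd" or
    "survey_rd" to one value per firm, in the same firm order.
    """
    names = sorted(indicators)
    return {
        (a, b): pearson(indicators[a], indicators[b])
        for a in names
        for b in names
    }
```

In an MTMM reading of such a matrix, convergent validity is suggested when the same trait measured by the two methods (for example, a web-based and a questionnaire-based R&D indicator) correlates more strongly than different traits measured by the same method.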
The remainder of the article is organized as follows: Section 2 presents the theoretical development of the innovation and commercialization of high technologies, such as nanotechnology and advanced materials, along with web content analysis; section 3 describes the methodologies used to collect data and the MTMM method; section 4 analyzes the results surrounding propositions 1 and 2, and performs a post hoc analysis; section 5 discusses the results and conclusions; and lastly, section 6 addresses the research limitations and future research.

Using the Web as a Data Source for Social Sciences
A few years after the invention of the world wide web, its use as a means to get valuable data for social science studies was introduced (Almind & Ingwersen, 1997). With more and more online content from all aspects of society being produced, the web becomes a digital representation of the social construct that embodies our civilization. With such a rich repertoire of information, social scientists started to explore ways to leverage the web as a data source for research (Björneborn & Ingwersen, 2004). As a result, measuring different elements of the web, such as websites, webpages, parts of webpages, words in webpages, hyperlinks, and web search engine results, has been undertaken by a number of scholars, paving the way to a new field of research: webometrics, as an evolution from the classic bibliometric studies (Thelwall, 2009). Link analysis, blog searching, and web impact assessment were the main focus of late 2000s webometrics research. The first, link analysis, consists in analyzing the network of hyperlinks contained in web pages in the hope that they may reflect an organization's connections (Hyun Kim, 2012; Katz & Cothey, 2006; Vaughan, 2004). This method has proven useful to understand the structure of the collaborative network and the importance of a given actor within its network (Minguillo & Thelwall, 2012; Stuart & Thelwall, 2006). The interest in the second method, exploring the content of blogs, resides in their vast heterogeneous public-generated information at a given time. Their understanding provides information on the trends in public opinion on a given set of topics. Nowadays, blog searching is more prevalent in social media platforms such as Twitter (Thelwall, Buckley, & Paltoglou, 2011), and the importance of topics can be effectively monitored through Google Trends (Choi & Varian, 2012).
The third, web impact assessment, offers a method for measuring the direct online impact of specific topics or documents by counting their presence on the web. These could be a good proxy for offline impact due to the dominant penetration of the internet as a means of information and communication (Thelwall, 2009). Web impact assessment could help identify patterns of diffusion and relative impact of authors, terms, scientific theories, political candidates, books, journals, and so on. In the case of innovation studies, the web could be viewed as a way to obtain indirect indicators, such as the number of attendees at a given event, the media coverage for a given product launch, or the number of requests for an organization's literature or the number of times an organization's concepts or publications are mentioned (Thelwall, 2009).
To perform web analysis, three main areas of web mining are generally used: web structure mining for link analysis; both web usage mining and web content mining for web impact assessment; and blog searching (Miner, Elder, et al., 2012). Of particular interest for this paper is web content mining, which is a technique that turns unstructured information in the form of text contained in web pages into structured data useful for research purposes.
Content analysis is an objective and quantitative method used to find relationships between textual information and the context from which the information is sourced (Krippendorff, 1980). The benefits and challenges of using the web as a data source for content analysis were already discussed at the beginning of the millennium (Weare & Lin, 2000). The main opportunity from using the web as a data source stems from the explosion of possible sources of multimedia information (text, video, audio, and image) that can be used for content analysis, which was anticipated to reduce the cost of data acquisition significantly. However, the information contained on the web is highly unstructured, completely decentralized, and not standardized in any way, which makes the quality and validity of web content analysis particularly challenging to verify (Weare & Lin, 2000) and somewhat costly to process.
Despite the limitations, some web content analyses were performed in innovation studies. For example, Herrouz et al. (2013) used a web scraping method to obtain data from small and medium-sized high-technology graphene firm websites in the United States, United Kingdom, and China. More recently, Gök et al. (2015) proposed a web content analysis based on a keywords frequency analysis to assess the R&D activities of 296 UK-based green goods small and midsize enterprises (SMEs). They compared the results of their R&D web indicator with the results obtained through a questionnaire-based survey. The results of this study showed that the web-based indicator did not correlate significantly with the nonweb-based indicator, which suggests that these two indicators did not reflect the same concept. It is therefore possible that web-mining indicators provide new information that was not captured through classical methodologies and thus are shown to be complementary.

Construct Definition
Following in the footsteps of Gök et al. (2015), in this study we focus on parameters influencing innovation and commercialization in Canadian high-technology firms. We consider the four most important factors known to influence innovation and commercialization in high-tech firms, in our case, nanotechnologies and advanced materials: research & development (R&D) intensity, intellectual property (IP) protection, collaboration, and external financing (Lee, Lee, et al., 2013).
R&D intensity refers to the effort a firm makes to generate inventions (Griliches, 1990, 1994, 1998; Hausman, Hall, & Griliches, 1984; Hitt, Hoskisson, & Kim, 1997) as well as investment in its own absorptive capacity (Cohen & Levinthal, 1990). Greater R&D efforts are likely to yield better returns to innovation. Indeed, the positive relationship between R&D efforts and innovation performance has been demonstrated in several quantitative studies (Baysinger & Hoskisson, 1989; Deeds, 2001; Greve, 2003; Griliches, 1998; Hagedoorn & Cloodt, 2003; Hall, 1990; Parthasarthy & Hammond, 2002). Innovation and R&D efforts have been shown to positively influence firms' commercialization and financial performance (Geroski, Machin, & Van Reenen, 1993; Klette, Møen, & Griliches, 2000). R&D efforts should therefore give nanotechnology and advanced materials firms a technological superiority as compared to the market. R&D intensity is a measure of intention, and, as such, not a direct measurement of the performance of the R&D process as concerns the production of knowledge, inventions, and innovations (Adams, Bessant, & Phelps, 2006; Cebon, Newton, & Noble, 1999; Flor & Oltra, 2004; Kleinknecht, Van Montfort, & Brouwer, 2002). In addition, not all innovations are systematically derived from this internal R&D process (Michie, 1998), as highlighted by the rising popularity of open innovation practices (Chesbrough, 2003). Moreover, measures based solely on R&D are not the best suited for SMEs because their efforts are often neither formal (Kleinknecht et al., 2002) nor constant (Michie, 1998). For these reasons, researchers use R&D intensity less as a proxy for innovation, as suggested by Becheikh, Landry, and Amara (2006), and more as a means to achieve higher innovation performance.
IP protection confers a competitive advantage to companies by offering exclusivity for the commercialization of technologies, branding, copyright, industrial design, and so on (Teece, 1986). Patents enhance the return on investment of the amounts dedicated to technology development through temporary monopoly power, license granting, or the sale of patents (Arora, 1995; Chesbrough, 2003; Feldman & Florida, 1994; Fosfuri, 2006; Mazzoleni & Nelson, 1998; Merges, 1999; Rivette & Kline, 2000). Laursen and Salter (2006) observed a positive relationship between appropriability and innovation performance, but one that eventually leads to decreasing returns. A patent, as one of the outputs of the research process, is often considered as a measure of inventiveness (Coombs, Narandren, & Richards, 1996; Flor & Oltra, 2004; OECD & Statistical Office of the European Communities, 2005). Although patent statistics are frequently used as proxies for innovative activities (Pavitt, 1985) or innovation (Becheikh et al., 2006), they have also been the subject of considerable criticism (Archibugi, 1992; Cohen & Levin, 1989; Dosi, 1988; Griliches, 1998). The use of patents varies considerably from one sector to another (Archibugi & Sirilli, 2001; Armellini, Beaudry, & Kaminski, 2017; Armellini, Kaminski, & Beaudry, 2014; Michie, 1998) and according to firm size, because SMEs, which have fewer financial resources to protect their IP, are systematically disadvantaged.
In contrast to the first two types of indicators, collaboration has a much broader impact on the innovation process: from idea generation and research to commercialization. Collaboration thus has a positive impact on several innovation performance indicators, such as the number of patents, sales growth, and the return on innovation sales (Arvanitis, 2012; Belderbos, Carree, & Lokshin, 2004; Carboni, 2013), and has been shown to improve competitiveness (Hagedoorn, Link, & Vonortas, 2000; Roja & Nastase, 2013). A constant collaborative effort is essential for the development and deployment of emerging technologies, which are becoming increasingly complex, as well as for reducing R&D costs, sharing risk, and increasing performance (Johnson & Filippini, 2009; Parker, 2000). For example, the aerospace industry requires all actors to adopt a policy of interorganizational cooperation to take advantage of all the necessary knowledge and know-how available (Jordan & Lowe, 2004). In this field, collaboration is so important that it spans all types of partners: clients (Armellini et al., 2017), including suppliers (Bozdogan, Deyst, et al., 1998); competitors (Esposito, 2004; Frear & Metcalf, 1995); and universities and institutes (Armellini et al., 2014, 2017). McNeil, Lowe, et al. (2007) have shown that collaboration with universities or government institutes enables young high-tech firms to access particularly expensive tools. In addition, Kim, Lee, and Marschke (2014) highlighted the impact of university research on scientists associated with the training and development of skilled labor, patents, and innovation in industry.
Finally, as most nanotechnology and some advanced material projects are still in the early development stages, they rely heavily on private and/or public funding to reach the commercialization and marketing phases (Kalil, 2005; McNeil et al., 2007). In surveys on barriers to innovation, firms very often refer to the lack of external financing as a major obstacle to their innovation activities (Harhoff & Körting, 1998). Venture capital investment in young U.S. high-tech firms had a positive and significant effect on their R&D productivity in the 1990s (Brown, Fazzari, & Petersen, 2009). For more than a decade, substantial public investments have targeted nanotechnologies worldwide (Crawley, 2007); more recently, in 2018, C$1.2 billion were invested in the United States (National Nanotechnology Coordination Office, 2017). As the development of these emerging technologies requires public support, the Innovation, Science and Economic Development Canada (ISED, formerly known as Industry Canada) website proposes 13 programs dedicated to funding the development of nanotechnologies and advanced materials. These public funds often have a leverage effect: firms that receive such funding in grant form are more likely to also successfully raise private funding and receive better external financing offers (Meuleman & De Maeseneire, 2012).
These four pillars of the innovation process (R&D, IP, collaboration, and external financing) have been the focus of much literature and have been extensively measured by both researchers and national statistics offices. But are they reflected on corporate websites? A number of scholars have attempted to analyze web content from the information contained on companies' websites to build indicators and innovation metrics (Gök et al., 2015; Herrouz et al., 2013). Yet understanding the "real" meaning of measures created from data gathered with web mining techniques is not a trivial task. Gök et al. (2015) showed that their web-based indicator based on R&D-related keywords did not correlate significantly with nonweb-based indicators, which suggested that these two indicators did not reflect the same concept. Therefore, the web mining indicator could represent new information that was not captured through classical methodologies and could possibly be complementary. In this paper, two propositions will be tested to assess whether these web-based indicators can be used as substitutes for indicators obtained by classical questionnaire-based methods.
Within corporate websites, innovation words and factors are referred to using several synonyms and other related terms. Companies may use words on their website to provide insight into what they actually do. The more a company uses terms related to a specific factor, the stronger the signal is deemed for that factor, and therefore, the more likely the firm is liable to perform activities related to that specific factor. Thus, for each factor mentioned above, we suggest the following proposition: Proposition 1: The more words related to a factor, in our case R&D intensity, intellectual property, collaboration, and external financing, that are used on a firm's website, the more a firm is likely to perform activities related to that factor.
Connecting keywords found on websites with tangible actions taken by those companies is a bold proposition that may not be valid empirically. The content of corporate websites is written with the aim of signaling the company's "best" characteristics to all its internal and external stakeholders. This represents a rather modest but reasonable claim against the validity of the information to be found, which may suffer from a possible self-reporting bias. However, it is reasonable to assume that the information provided on companies' websites will be as true to the company's intentions as to the various concepts it wishes to share with the world. The information that stands out on a website can give a broad sense of the importance of a specific factor for the firm.
Proposition 2: The more words related to a factor, in our case R&D intensity, intellectual property, collaboration, and external financing, that are used on a firm's website, the more important a firm considers that factor.
The two propositions will be tested and compared to the results obtained with a classic questionnaire-based survey using the methodology explained in section 3.
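Both propositions rest on counting factor-related words in a firm's website text. A minimal sketch of such a keyword frequency count is given below. The keyword lists are hypothetical placeholders, not the dictionaries used in this study, and normalizing the counts per 1,000 words is likewise an assumption made so that short and long websites remain comparable.

```python
import re
from collections import Counter

# Hypothetical keyword lists for illustration only; the dictionaries
# actually used in the study are not reproduced here.
FACTOR_KEYWORDS = {
    "rd": ["research", "development", "laboratory"],
    "ip": ["patent", "trademark", "licence"],
    "collaboration": ["partner", "collaboration", "alliance"],
    "financing": ["investor", "funding", "grant"],
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def factor_frequencies(text, per=1000):
    """Return keyword hits per `per` words for each factor.

    Matching is exact: no stemming is applied, so "patents" does not
    count as a hit for "patent" in this simplified version.
    """
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    result = {}
    for factor, words in FACTOR_KEYWORDS.items():
        hits = sum(counts[word] for word in words)
        result[factor] = per * hits / total if total else 0.0
    return result
```

Under Proposition 1, a higher value for a factor would be read as a stronger signal that the firm performs the related activity; under Proposition 2, as a signal of the importance the firm attaches to that factor.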

Questionnaire-Based Data Collection
For a more general study of innovation in nanotechnologies and advanced materials, we first conducted a classic questionnaire-based survey, the core of which is based on the Oslo Manual (OECD & Statistical Office of the European Communities, 2005), to explore the following themes: innovation, commercialization, collaboration, and IP protection, among other topics. The questionnaire includes the importance of the sources of knowledge and the importance of innovation activities measured by a Likert scale from not important (1) to essential (7). Other questions cover the importance of commercialization actions, the proportion of financing for R&D and commercialization, the importance of market impacts, the number of exports, the importance of obstacles to commercialization, whether or not the firm collaborated to develop or commercialize the latest most significant product innovation, the importance of collaborators, the importance of the reasons for collaboration, IP mechanisms, patent management, and general questions such as the number of employees, revenue, and the firm's business sector. A sample of the questionnaire is provided in Appendix 1.
Due to the ubiquitous nature of nanotechnology and advanced materials, combined with the fact that they are adopted in a large number of industrial sectors, identifying companies that use or develop these technologies is not straightforward. Firms that either use or develop nanotechnology and advanced materials are not labeled as such, nor are they searchable in any obvious way. As a consequence, we used all possible means to build an exhaustive list of high-tech companies with a higher probability to be involved in nanotechnology or advanced materials with the help of several sources, including ISED, Nano-Québec (now Prima-Québec), Nano Ontario, Nanowerk, futuremarketsinc.com, and AGY Consulting, a Canadian firm specialized in emerging technologies such as nanotechnology, clean technology, and biotechnology. Combining all these data sources resulted in more than a thousand unique entries. After manually removing all obvious noneligible companies, a final list of 592 high-tech companies was then put together.
To maximize our response rate, we contracted Léger360 (now Léger), a well-known survey company, to administer the questionnaire and find new companies eligible for our study. Their professional approach is appreciated by firms that have responded to surveys in the past. Data collection began in September 2016 based on a convenience sampling method. To cover even more ground, we also used the snowball method. More precisely, respondents first had to answer the eligibility question: "Does your company develop or/and commercialize nanotechnologies and/or advanced materials?" Then, if the firms were eligible, the respondents were asked whether or not they were interested in participating in our study. Finally, they were asked whether they had any recommendations for any potential respondents who might be eligible for our investigation as an attempt to increase our sample size. Firms that agreed to participate in the survey but did not complete the questionnaire received two telephone reminders at 1-week intervals. After 332 solicitations, the incidence rate was 28%, which indicated that 93 companies were eligible.
To maximize our sample size, we added companies to the list by using the common characteristics of the eligible respondents. Companies that were eligible for the study (the 93 eligible companies listed above) were grouped and classified according to their North American Industry Classification System (NAICS) codes. Twenty-three six-digit NAICS codes corresponding to 67% of the eligible companies were identified. Léger360 acquired a list of 3,345 companies representative of the frequency distribution of the NAICS code obtained from InfoGroup.
We solicited 2,971 Canadian high-tech companies through the entire data acquisition process. Of these, 973 companies did not respond, 1,459 were not eligible, 380 refused to participate, and 168 eligible companies agreed to participate, of which 89 respondents completed the survey. Because of Léger360's code of ethics, we have no means of identifying whether respondents refused to participate before or after answering the eligibility questions. The unknown nature of the nonrespondents necessitates an approximation of the response rate. Assuming a uniform distribution, a 12% response rate was obtained by assigning to the nonrespondents, and to the respondents who refused to participate, the same characteristics as those of the respondents who answered the eligibility questions (see Table 1). As the population is unknown, we have a nonprobabilistic convenience sample for which the methodology may have induced a selection bias. In addition, we assume that the respondents were honest and answered the survey in good faith.
The resulting sample represents a diverse selection of Canadian high-tech firms that develop or use nanotechnologies or advanced materials. Moreover, 74% of the firms are considered nanotechnology or advanced materials intensive, which means that at least 80% of their revenues come from nanotechnology or advanced materials innovations. The different application domains are as follows: 54% in advanced materials, 21% in biotechnology and medicine, 24.4% in electronics, 23.3% in equipment and devices, 13.3% in photonics, and 33.3% in other domains. More than 50% of respondents are small businesses and 83.5% are SMEs. The $94 million revenue average decreases to $31 million when the three largest firms are excluded. Finally, 85% of the firms come from Quebec (54.5%) and Ontario (30.7%), and 12% are from British Columbia and Alberta.
To test several types of biases, such as self-reporting bias and nonrespondent bias, we selected 79 eligible enterprises that did not participate in the study as a control sample. The Canadian government's Directory of Canadian Companies¹ was used as an external source of data to rule out the possibility of nonresponse bias. The database of companies from different sectors comprises information provided by the companies themselves on a voluntary basis. Although the Canadian government does not guarantee the accuracy or the reliability of the content, we assumed that the companies that willingly update information in an official public database input accurate information. We therefore assume that this mitigates the self-reporting bias that might arise from this type of source. The database provides the number of employees for 37 firms and revenues for 30 firms from our main sample, as well as the number of employees for 29 firms and revenues for 26 firms from our control sample. Using a two-tailed Mann-Whitney U test, we compared the main sample with the control group for these two metrics. We did not find any significant difference (at the 5% level) between the two samples for either metric (p = 0.115; p = 0.166 for the number of employees and revenues respectively), which suggests the absence of a nonresponse bias: There are no significant differences between the characteristics of the respondents and the nonrespondents from our list and thus the sample of eligible companies is homogeneous.
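To make the nonresponse comparison concrete, the sketch below computes a Mann-Whitney U statistic for two independent samples, together with an approximate two-tailed p-value. The p-value relies on a normal approximation with a continuity correction and without a tie correction, which is a simplifying assumption; statistical packages typically offer exact or tie-corrected variants, and the variant used in this study is not specified. Function names are illustrative.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def mann_whitney_u(sample_a, sample_b):
    """Return (U, approximate two-tailed p) for two independent samples."""
    n1, n2 = len(sample_a), len(sample_b)
    ranks = average_ranks(list(sample_a) + list(sample_b))
    rank_sum_a = sum(ranks[:n1])
    u1 = rank_sum_a - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    u = min(u1, u2)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = max((abs(u - mean_u) - 0.5) / sd_u, 0.0)    # continuity correction
    p_value = math.erfc(z / math.sqrt(2))
    return u, p_value
```

With, say, the employee counts of the main sample and of the control sample as the two inputs, a p-value above 0.05 would be read, as in the text, as no significant difference between respondents and nonrespondents.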
We then compared the data obtained from our questionnaire-based survey with the data from the Directory of Canadian Companies using the same two metrics to verify whether an important self-reporting bias exists. For every firm for which we had data from both our questionnaire-based survey and from the Directory, we tested each data pair with a two-tailed Wilcoxon Signed Ranks Test. Once again, we did not find any significant difference (at the 5% level) between our questionnaire-based survey results and the data from the Directory (p = 0.058; p = 0.714), which suggests that the self-reporting bias issuing from the questionnaire-based survey is no different than that found in an official public database.
¹ https://www.canada.ca/en/services/business/research/directoriescanadiancompanies.html, accessed June 4, 2020.
To build the four factors described above, R&D, IP protection, collaboration, and external financing, we identified all the relevant questions from the questionnaire-based survey and transformed the answers to these questions into different types of variables. The 12 questions used are listed in Appendix 1. Moreover, we transformed every continuous variable from the survey that did not follow a normal distribution by applying a natural logarithm (ln) or an inverse function (inv). Some Likert-scale questions could not be normalized because they were skewed on one tail or the other. We thus transformed them into dummy variables by attributing the value of 1 to answers associated with important (5), very important (6), and essential (7) and the remaining were given the value of 0.
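The variable transformations described above can be sketched as follows. The function names are illustrative, and the sketch assumes strictly positive inputs for the logarithm and nonzero inputs for the inverse; how zero values were handled in the study is not stated.

```python
import math

def ln_transform(values):
    """Natural logarithm of each value (assumes strictly positive inputs)."""
    return [math.log(v) for v in values]

def inverse_transform(values):
    """Inverse (1/x) of each value (assumes nonzero inputs)."""
    return [1.0 / v for v in values]

def likert_to_dummy(scores, threshold=5):
    """Recode seven-point Likert answers into a dummy variable.

    Answers of important (5), very important (6), and essential (7)
    become 1; all remaining answers become 0.
    """
    return [1 if score >= threshold else 0 for score in scores]
```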
A principal component analysis (PCA) was then performed to validate reflective constructs. Only constructs with sufficiently high Kaiser-Meyer-Olkin (KMO) measures (KMO > 0.6) were deemed valid, and only dimensions with Cronbach's alpha above 0.7 (Hair et al., 1998) were considered reliable. We used PCA with a Varimax rotation on seven-point Likert-scale questions that described the concept of R&D (R&D questions 2, 3, 5, and 6 in Appendix 1) to explore whether any particular combination of variables could lead to relevant reflective dimensions corresponding to specific factors of the concept examined. Two factors were created, but neither the KMO measure nor Cronbach's alpha reached an acceptable level (KMO < 0.6, alpha < 0.7) to satisfy the validity and reliability of the constructs. In addition, these combined variables did not correlate with each other, which suggested using a formative construct. We thus proceeded to treat each item individually rather than as a composite indicator.
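The reliability criterion used above can be illustrated with a short computation of Cronbach's alpha from its standard definition, alpha = k/(k-1) * (1 - sum of item variances / variance of the summed scale). This sketch covers the reliability check only; the PCA and the KMO measure are not reproduced, and the function name and data layout are assumptions.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha for a set of items.

    `items` is a list of item columns, each holding one score per
    respondent in the same respondent order.
    """
    k = len(items)
    if k < 2:
        raise ValueError("at least two items are required")
    n = len(items[0])
    if any(len(column) != n for column in items):
        raise ValueError("all items need the same number of respondents")
    totals = [sum(column[i] for column in items) for i in range(n)]
    total_variance = variance(totals)
    if total_variance == 0:
        raise ValueError("alpha is undefined when the summed scale is constant")
    item_variance = sum(variance(column) for column in items)
    return k / (k - 1) * (1 - item_variance / total_variance)
```

A value below the 0.7 threshold, as obtained for the two R&D factors, indicates that the items do not vary together closely enough to be treated as a reliable reflective scale.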
In the end, we generated nine variables pertinent to R&D, one variable relating to collaboration, two variables corresponding to external financing, and two variables measuring IP. Table 2 describes the details of the questionnaire-based survey constructs. Descriptive statistics are presented in Appendix 2.

Web-Based Data Collection
While we were processing the questionnaire-based survey data, we initialized the indexing robot with Nutch², an open source Web scraper software. We provided the robot with an initial list of uniform resource locators (URLs) corresponding to the 89 enterprises that compose our questionnaire-based data set. This list is the first level of indexing traversal by the robot, which then browses the structure of the entire website to extract all the URLs. To begin the data collection process, the URLs linked to a company's website are written in a text file. The URLs are then injected into Nutch's database.
Once the database has been populated with the list of sites to be visited, the robot's route is generated by indicating the maximum number of links per page that we want to collect. The robot ranks the pages with a score that it calculates to prioritize the text collection. For instance, a page will have a high score if it is pointed to by many other pages. In addition, the higher the scores of pages pointing to another page, the higher the score of the page pointed to. Nutch then selects the links of the pages with the highest score to be collected first. When the links to browse are generated, Nutch will collect all the HTML code from the pages as well as the links to the other pages.
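The two scoring properties described above (a page scores higher when many pages point to it, and higher still when those pages themselves score highly) can be illustrated with a PageRank-style iteration. This is a stand-in chosen for illustration and is not Nutch's own scoring implementation; the function name, the damping value, and the toy link graph are all assumptions.

```python
def link_scores(links, damping=0.85, iterations=50):
    """PageRank-style scores for a small link graph.

    `links` maps each page to the list of pages it points to.
    Illustrative stand-in only; not Nutch's scoring algorithm.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    score = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_score = {page: (1 - damping) / n for page in pages}
        for source in pages:
            targets = links.get(source, [])
            if targets:
                # A page passes its score on, split evenly across its links.
                share = damping * score[source] / len(targets)
                for target in targets:
                    new_score[target] += share
            else:
                # A page without outgoing links spreads its score evenly.
                for page in pages:
                    new_score[page] += damping * score[source] / n
        score = new_score
    return score

# Hypothetical site structure.
site = {
    "home": ["about", "products"],
    "about": ["products"],
    "products": ["home"],
    "news": ["products"],
}
scores = link_scores(site)
crawl_order = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph, the page that receives the most links from well-scored pages is placed first in the crawl order.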
There are mainly three different types of links that structure the Internet network. Upstream links start from a page to one or more pages not necessarily on the same site. Downstream links are all links pointing to the same page. Finally, vertical links are all links within the same website (Van de Lei & Cunningham, 2006). The indexing robot scans all links without distinguishing between domain names and ranks them by score to first scan the links it considers most interesting. We have therefore limited the robot to the vertical links of the domain names present in the initial list. This is possible by modifying the Nutch configuration file limiting the robot to the initial list of domains.
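The restriction to links within the initial list of domains was implemented through Nutch's configuration, which is not reproduced here. The underlying rule can nevertheless be sketched in a few lines; the function name and the example domains below are hypothetical.

```python
from urllib.parse import urlparse

def is_internal(url, allowed_domains):
    """Return True if `url` belongs to one of the seed domains.

    A host matches when it equals an allowed domain or is a
    subdomain of it (e.g., www.example.com for example.com).
    """
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

# Hypothetical seed domain and discovered links.
seeds = {"example.com"}
discovered = [
    "https://www.example.com/about",
    "https://example.com/products/nano.pdf",
    "https://partner.example.org/",
    "https://notexample.com/",
]
to_crawl = [u for u in discovered if is_internal(u, seeds)]
```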
By default, Nutch indexes only the text of pages in hypertext markup language (HTML) format. However, relevant information was also stored in other digital formats. We used the Tika plug-in to browse and index the text of portable document format files (.pdf) and Microsoft Word documents (.doc and .docx).
A further difficulty stems from the fact that our sample included companies that use multiple languages on their websites. Most Canadian companies will have English or French content and some will have other languages. Therefore, we did take into account the fact that the information could be given in languages other than English. There were several different cases to deal with. If a website was entirely in English, which corresponds to a majority of Canadian company websites, we then kept all the data from the website. If a website was available in several languages, including English, then only pages written in English were taken into account. If the website was exclusively available in French, the text was translated into English. Finally, the last case was a page whose language was neither French nor English. Such a page was not considered eligible for our study and, therefore, was not kept for the rest of the study.
Language detection was performed by the Compact Language Detector software and the translation was performed with Google's programming interface, Goslate. Translation with Goslate was also used by Arora, Youtie, et al. (2013) to analyze the content of Chinese graphene-producing company websites.
We assigned to each recovered page a label of language, "French," "English," or "other" and translated the French pages into English for the relevant sites according to our assumptions.
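The page-level decision rules described above can be summarized in a short sketch. The function name is hypothetical, and one case is an assumption: the text covers websites available exclusively in French, so the treatment of French pages on a site that mixes French with a language other than English is not specified and is marked as such in the code.

```python
def route_page(page_language, site_languages):
    """Decide what to do with one page: 'keep', 'translate', or 'drop'.

    `page_language` is the label assigned to the page ("English",
    "French", or "other"); `site_languages` is the set of labels found
    on the whole website.
    """
    if page_language == "English":
        return "keep"
    if page_language == "French" and "English" not in site_languages:
        # Assumption: French pages are translated whenever the site has
        # no English content, not only when it is exclusively French.
        return "translate"
    # French pages on sites that also offer English, and pages in any
    # other language, are excluded.
    return "drop"
```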
Due to technical limitations, such as the structure of the websites, only 79 of these firms (88%) provided enough information to be included in our study. In the end, more than 9.7 million words from 27,000 pages were extracted from the complete texts of corporate websites. We then used a word frequency analysis on the text present in the websites. More specifically, for the 79 websites captured, we gathered information on the four innovation and commercialization factors mentioned above. For each factor, we listed all the relevant keywords that appear in corporate websites. Figure 1 shows the complete process to build the web-based innovation indicators to be compared with the questionnaire-based indicators described in the previous section.
Factors, keywords, and the web mining constructs are described in Table 3. R&D (Gök et al., 2015) and collaboration (Ramdani, 2014) keywords were selected from the literature, and IP and external financing were identified from our own investigations of the literature. The most relevant keywords of any paper, which are generally listed on its first page, particularly in the abstract³, serve as a basis for the list of keywords used for the construction of our factors. We used additional keywords related to specific public funds and programs. As mentioned earlier, Canada's public programs and funding opportunities for companies for the development of nanotechnology and advanced materials projects are listed on the ISED website. Specific information on these funds and programs was converted into keywords tied to the external funding factor for the individual firms for which we extracted website data. Keywords were manually tested to ensure they did not lead to false positive results.
Clustering using keyword frequency analysis with RapidMiner, a text mining program, enabled us to count the number of occurrences of each keyword for each factor. We transformed these clusters of occurrences into four continuous variables. Because the 79 companies differ in structure and size, and therefore present different quantities of information on their websites, we standardized each variable by dividing all occurrences by the total number of words appearing on their website and multiplied the resulting value by 1,000. For each continuous variable, we calculated the kurtosis and skewness measures to determine whether they followed a normal distribution. Given that none of the four variables did so, they were transformed by applying a natural logarithm (ln) or an inverse function (inv). In the case of external financing, because we did not reach normality with either transformation, this variable was treated using nonparametric tests that do not require normality.
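As an illustration of the standardization and the normality screening, the sketch below computes keyword occurrences per 1,000 words together with moment-based skewness and excess kurtosis. The counts were produced with RapidMiner in the study, not with this code; the function names are hypothetical, and the cut-off values used to judge normality are not stated in the text, so none are applied here.

```python
def per_thousand_words(keyword_hits, total_words):
    """Keyword occurrences per 1,000 words on a firm's website."""
    return 1000.0 * keyword_hits / total_words

def _central_moment(values, order):
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)

def skewness(values):
    """Population skewness: third central moment over variance to the 3/2."""
    return _central_moment(values, 3) / _central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Population excess kurtosis (0 for a normal distribution)."""
    return _central_moment(values, 4) / _central_moment(values, 2) ** 2 - 3.0

# Invented example: 25 keyword hits on a website of 100,000 words.
example_rate = per_thousand_words(25, 100_000)
```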
A possible selection bias arises from the fact that we selected only those companies that answered the survey. To verify whether this was the case, we ran our web mining program on the websites of the control sample and generated the same variables. We then used a two-tailed Student's t test to test the difference in means for the following variables: ln(WEB_R&D) (p = 0.130), inv(WEB_IP) (p = 0.083) and ln(WEB_Collab) (p = 0.144), and concluded that there is a nonsignificant (at 5% level) difference between the two samples for these three variables. Finally, using a two-tailed Mann-Whitney U test, we compared the means of the variable WEB_ExtFin (p = 0.008) from both our sample and the control sample and found a significant difference. For this particular variable, we cannot conclude that the means of the two samples are the same. A selection bias is therefore present for the variable WEB_ExtFin and will be included in the limitations of our research.

The Multitrait Multimethod (MTMM) Matrix Technique
The literature describes two types of biases that can threaten construct validation: (a) mono-method biases can occur when the by-products from the unique combination of a trait (in our case an innovation factor such as R&D, IP, collaboration, or external financing) and a measure introduce a systematic variance; and (b) by-products that are congeneric (i.e., inherent in the methods used) generate anomalies among the items forming the measures (Ortiz de Guinea, Titah, & Léger, 2013; Reinig, Briggs, & Nunamaker, 2007; Straub & Burton-Jones, 2007; Straub, Limayem, & Karahanna-Evaristo, 1995).
First introduced by Campbell and Fiske (1959), the MTMM is designed for the convergent and discriminant validation of a construct where a set of t traits (interchangeable with factors in our case) are measured with m different methods. It has been suggested that this method is an effective way to verify the ability of new measurement methods to successfully measure what they are supposed to, while testing for the presence of mono-method biases in social science studies, such as psychology (Bar-Anan & Vianello, 2018; Guo, Aveyard, et al., 2008), education (Campbell, Michel, et al., 2019; Gulek, 1999), organizational research (Bagozzi, Yi, & Phillips, 1991), marketing research (Lugtig, 2017), and information systems (Ortiz de Guinea et al., 2013). Even though its use for construct validity has been employed selectively since its introduction in 1959, the MTMM matrix is still one of the very few possible ways to attempt construct validation and this is why we have been witnessing a resurgence in its use in the past few years. To our knowledge, this validation method has not been used in innovation studies or to assess new innovation metrics compared to the traditional means of measuring innovation, as in the Oslo Manual (OECD & Statistical Office of the European Communities, 2005; OECD & Eurostat, 2019).
An example of an MTMM matrix with three traits (A, B, C) measured using three different methods is shown in Table 4. The matrix easily allows the visualization of four different means of assessing the quality of a measure: reliability, convergent validation, absence of either a combination of trait and method effects or a mono-method bias and, finally, discriminant validation (Campbell & Fiske, 1959).
The diagonal of the matrix (also called the reliability diagonal) comprises six Cronbach alpha values (α_{A1}, α_{B1}, α_{C1}, α_{A2}, α_{B2}, and α_{C2}), which measure the correlation of the traits with themselves and are thus an indication of the reliability of each measure for each trait (Campbell & Fiske, 1959). These mono-trait mono-method values indicate the reliability of a reflective measure (Campbell & Fiske, 1959), which is required to ensure that a measure will produce the same result under the same conditions. High Cronbach's alpha values on the matrix diagonal confirm the reliability of the traits; that is, that the variance of the measures behaves systematically (Churchill, 1979). In other words, any possible occurrence of a random error is minimized.
The diagonal at the cross-section of Method 1 and Method 2 (also called the mono-trait hetero-method, or validity diagonal) comprises three correlation values (r_{A1,A2}, r_{B1,B2}, and r_{C1,C2}) that compare one trait measured with two different methods. High and significant values confirm the convergent validity of the methods (Campbell & Fiske, 1959); that is, that two different methods measure the same traits (Churchill, 1979).
Two triangles, called hetero-trait mono-method triangles, represent the cross-sections of traits that belong to the same method within the matrix. Each triangle comprises three correlation values. For instance, the hetero-trait mono-method triangle of Method 1 comprises r_{A1,B1}, r_{A1,C1}, and r_{B1,C1}, and that of Method 2 comprises r_{A2,B2}, r_{A2,C2}, and r_{B2,C2} (Campbell & Fiske, 1959). High correlation values (higher than the validity diagonal) within a hetero-trait mono-method triangle question the validity of the construct because of either a combination of trait and method effects or a mono-method bias.
Finally, two other triangles that represent the cross-sections of traits that belong to different methods within the matrix (also called hetero-trait hetero-method triangles) comprise three correlation values each. The hetero-trait hetero-method triangle of Method 1 comprises r_{A1,B2}, r_{A1,C2}, and r_{B1,C2}, and that of Method 2 comprises r_{B1,A2}, r_{C1,A2}, and r_{C1,B2} (Campbell & Fiske, 1959). Again, high values (higher than the validity diagonal) question the discriminant validity of the methods. Discriminant validity assesses whether a method avoids measuring something it is not supposed to measure (Peter & Churchill, 1986).
Regarding the MTMM matrix itself, Bagozzi et al. (1991) have raised questions about the reasonability of the criteria needed, the lack of standard practice, the negligence of the amplitude of the differences between pairs of correlations in the analysis, and the lack of information about the nature of the variation in measures due to traits, method or random error. Fiske and Campbell (1992) listed the exceptional conditions required to successfully benefit from the method. Bagozzi et al. (1991) suggested CFA as an alternative to the MTMM for construct validation, which, although it does not share the same limitations, is unable to differentiate the random error from the variance measure and unable to verify interactions between traits and methods. Therefore, performing both methods will mitigate the limitations inherent in either method taken individually and ensure robustness of the results. In this paper, we adopt the methodology used by Ortiz de Guinea et al. (2013), who opted for an MTMM analysis followed by a post hoc analysis with a CFA to validate their results.

Keywords Analysis
More than 9.7 million words from 27,000 pages were extracted from the complete texts of 79 corporate websites. The details of the frequency analysis for each indicator are provided in Table 5. For the R&D factor, 1,974 keywords were counted, from which six keywords (r&d, scientist, laboratory, research and development, researcher, laboratorie) account for more than 85% of the total frequency distribution. For IP, 75% of the total frequency distribution (397) is embodied by the keyword "patent," hence suggesting that most of the signal provided by the IP indicator is actually related to patenting. This is hardly surprising considering our sample of nanotech firms. For collaboration, 1,769 keywords were counted, from which

Construct validation
Each pair of variables related to the same concept from the two methods (web mining and questionnaire-based survey) was compared, using a Pearson correlation analysis when the variables followed a normal distribution, and a Spearman correlation when they did not, to assess whether the variables stemming from the web mining technique could be used as a proxy for similar concepts measured by a survey. Details of the construct comparison of the web-based indicators and the questionnaire-based indicators are provided in Table 6.
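The difference between the two coefficients can be made concrete with a small, dependency-free sketch: Spearman's coefficient is Pearson's coefficient computed on ranks, with tied values receiving their average rank. The function names are hypothetical and the data are invented; significance testing, which the study performed with statistical software, is omitted.

```python
from statistics import mean

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented monotonic but nonlinear relation.
web_indicator = [1, 2, 3, 4, 5]
survey_item = [1, 4, 9, 16, 25]
```

For a monotonic but nonlinear relation such as the one above, the rank-based coefficient equals 1 while the Pearson coefficient is lower, which is why the rank-based version is preferred when normality cannot be assumed.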

Correlation results
All correlation results between the web-based indicators and the questionnaire-based indicators are detailed in Appendix 3. For this study, the correlations were tested with a two-tailed test at the 5% level of significance. Our results show a correlation of 0.306 (p < 0.01) between the R&D web-based indicator and whether a firm is likely or not to provide R&D services to third parties (from the questionnaire). Additionally, we find a correlation of 0.306 (p < 0.01) when we associate the R&D web-based indicator with whether or not a firm has a high percentage of employees working on R&D tasks. Moreover, a correlation of 0.284 (p < 0.05) is observed between the R&D web-based indicator and whether or not a firm is likely to contract R&D services from external providers. Finally, we find a nonsignificant correlation of 0.197 (p = 0.100) when we associate the R&D web-based indicator with whether or not a firm has a long R&D process. The fact that the variable related to the number of R&D projects is not correlated with our R&D web-based indicator (r = 0.002; p > 0.1) strongly suggests that the latter does not properly account for the size dimension of R&D activities.
The IP web-based indicator strongly correlates with the variables from the survey regarding the use of IP protection mechanisms (r = 0.368; p < 0.01) and patent-related activities (r = 0.396; p < 0.05). The web content analysis method seems to be able to appropriately capture the importance of the use of IP mechanisms using our sample.
Relations between the web-mining and questionnaire-based indicators for the other two factors are not as strong. For instance, the collaboration web-based indicator is weakly correlated and nonsignificant at 5% (r = 0.222; p < 0.1) with the questionnaire-based indicator for the firms that are confirmed as having collaborated. Finally, the external financing web-based indicator is also weakly correlated and nonsignificant at 5% with the extent of the use of external funding for commercialization purposes (r = 0.222; p < 0.1).
Consequently, we cannot reach a definite conclusion regarding proposition 1, especially in the case of our web-based indicators for collaboration and external financing (a less constraining significance level of 10% would be required to accept the correlations). In the following paragraphs, we will restrict our analyses to the R&D and IP factors to test propositions 1 and 2.

MTMM results
All the variables used to measure the R&D and IP factors from the questionnaire-based survey that correlated significantly with our web-based indicators were selected. As explained in section 3.3, the MTMM compares different combinations of methods that measure the same traits (factors). Thus, all possible combinations of traits and methods need to be tested. In the case of the web-based indicators, only one combination of trait and method is used. For the questionnaire-based indicators, three different variables are correlated with the R&D web-based indicator and two different variables are correlated with the IP web-based indicator. As such, six different combinations are possible and thus six different matrices are built. To ease the reader's comprehension of the interpretation of an MTMM matrix, Table 7 shows an example of how the results are displayed, along with a short analysis of all the different validity tests below the table. Furthermore, all traits are considered to be single items, which implies that we cannot calculate the reliability of the measure. Accordingly, the reliability diagonal will be neglected in our analysis.
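The count of six matrices follows from pairing each of the three R&D questionnaire variables with each of the two IP questionnaire variables. A minimal sketch of this enumeration is given below; the variable labels are paraphrased from the text and the list names are hypothetical.

```python
from itertools import product

# Questionnaire-based variables that correlate significantly with the
# corresponding web-based indicator (labels paraphrased from the text).
rd_variables = [
    "Level of importance of contracting R&D",
    "Level of importance of providing R&D",
    "Proportion of employees assigned primarily in R&D",
]
ip_variables = [
    "Number of IP mechanisms used",
    "Number of patents",
]

# One MTMM matrix per (R&D variable, IP variable) pair.
matrices = list(product(rd_variables, ip_variables))
```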
In all matrices, we can observe that the hetero-trait mono-method value is low and not significant for the web-based method (r_{RD1,IP1} = −0.180; p > 0.05) and lower than all corresponding mono-trait hetero-method diagonal values (r_{RD1,RD2} and r_{IP1,IP2}). This suggests that there is no combination of trait and method effects and that no method bias for the web-based method is present. The complete analysis of the results of all the matrices is shown in Tables 8-13.
Of the six MTMM matrices produced, four matrices yield acceptable results:
• Level of importance of contracting R&D with the number of patents (Table 9)
• Level of importance of providing R&D with the number of IP mechanisms used (Table 10)
• Level of importance of providing R&D with the number of patents (Table 11)
• Proportion of employees assigned primarily in R&D with the number of IP mechanisms used (although it is a weak construct) (Table 12)
The rejection of the validity of the construct pairing the level of importance of contracting R&D with the number of IP mechanisms used (Table 8) suggests the presence of either a combination of trait and method effects or a mono-method bias, as the correlation between the two items from the questionnaire is stronger than the correlation obtained with their corresponding web-based measure. Thus, we cannot effectively discriminate the two web-based indicators from the questionnaire-based indicators, which implies that they cannot be part of the same construct.
Construct validity without either a combination of trait and method effects or a mono-method bias: No, |r_{RD2,IP2}| = 0.433** >> r_{RD1,RD2} = 0.283* and r_{IP1,IP2} = 0.368**.
The rejection of the discriminant validity between the web-mining method and the questionnaire-based survey method for the construct pairing the proportion of employees assigned primarily in R&D with the number of patents (Table 13) stems from the fact that the correlation is higher with the IP web-based indicator than with the R&D web-based indicator. After all, it can be expected that a high proportion of employees dedicated to R&D activities should correlate strongly with the number of patents. Therefore, these indicators do not discriminate each other, and this construct is thus rejected.

Construction of formative indices and new validation construct
Given the vast number of words used to construct the web-based indicators, as we concluded in the previous section, treating each questionnaire-based variable individually may not be appropriate. To illustrate the large lexical field of possible words related to the factors studied, and to properly test proposition 2, it is conceptually sound to build one single measure: one formative index with all the questions related to R&D and IP. Because the PCA performed on all the items related to R&D and to IP presented at the end of section 3.1 did not produce any significant KMO and Cronbach's alpha measures, the use of a formative index comprising several subelements explaining our R&D factor in its broadest sense may be more appropriate.
Construct validity without either a combination of trait and method effects or a mono-method bias: Yes, r_{RD1,RD2} = 0.306** and r_{IP1,IP2} = 0.396* >> |r_{IP1,RD1}| = 0.18 and |r_{RD2,IP2}| = 0.02.
Partial least squares (PLS) regressions were estimated to determine whether it is possible to create valid formative indices for R&D and IP. PLS regressions require complete data sets (Nelson, Taylor, & MacGregor, 1996). Nonresponse is usually treated by either weight adjustment (i.e., deleting incomplete data entries and weighting the remaining respondents to compensate for the deletion) or imputation (i.e., adding artificial values based on class averages and editing methods [Särndal, Swensson, & Wretman, 1992] to replace the missing values) (Haziza & Beaumont, 2007). As our sample size for IP is already low for one of the items (39 for the number of patents), we could not afford to treat the missing data with a weight adjustment. Accordingly, we replaced the missing data with values from their imputation class, defined by control variables. We sorted the firms by sector, then by number of employees, and then by revenue. Depending on the situation, we used the mean of the class or the most conservative nearest-neighbor, a method commonly used in the literature (Haziza & Beaumont, 2007; Little, 1986; Thomsen, 1973).
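A minimal sketch of class-mean imputation is given below for illustration. It covers only the class-mean option (not the nearest-neighbor alternative), and the function name, field names, and class labels are hypothetical; the actual imputation classes in the study were built from sector, number of employees, and revenue.

```python
from collections import defaultdict
from statistics import mean

def impute_class_means(firms, field, class_keys):
    """Replace missing values of `field` with the mean of the imputation class.

    `firms` is a list of dicts; a missing value is represented by None.
    Firms whose class has no observed value are left unchanged.
    The input list is not modified.
    """
    observed = defaultdict(list)
    for firm in firms:
        if firm[field] is not None:
            observed[tuple(firm[k] for k in class_keys)].append(firm[field])

    completed = []
    for firm in firms:
        filled = dict(firm)
        cls = tuple(firm[k] for k in class_keys)
        if filled[field] is None and observed[cls]:
            filled[field] = mean(observed[cls])
        completed.append(filled)
    return completed

# Invented records for illustration.
sample = [
    {"sector": "nano", "size": "small", "patents": 2},
    {"sector": "nano", "size": "small", "patents": 4},
    {"sector": "nano", "size": "small", "patents": None},
    {"sector": "materials", "size": "large", "patents": None},
]
completed_sample = impute_class_means(sample, "patents", ("sector", "size"))
```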
Because not all the items shared the same scale, we transformed each variable into a Z-score. PLS regressions were then estimated using WarpPLS 5.0 software with the following settings: MODEL B BASIC Warp3 Stable 3 and MODEL B BASIC Linear Stable 3. The two different settings produced the same conclusions. The details of the construct comparing the web mining technique and the questionnaire-based survey are shown in Table 15 (the PLS regressions and the resulting weights used to build these indices are provided in Appendix 4).
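The construction of a formative index from standardized items can be sketched as a weighted sum of Z-scores. The weights below are placeholders chosen for illustration; the weights actually used were estimated with WarpPLS and are reported in Appendix 4, and the function names are hypothetical.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize a variable to mean 0 and (sample) standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def formative_index(item_columns, weights):
    """Weighted sum of standardized items, one index value per firm.

    `item_columns` holds one list of raw values per item (same firms,
    same order); `weights` holds one weight per item.
    """
    if len(item_columns) != len(weights):
        raise ValueError("one weight per item is required")
    standardized = [z_scores(column) for column in item_columns]
    n_firms = len(item_columns[0])
    return [
        sum(w * column[i] for w, column in zip(weights, standardized))
        for i in range(n_firms)
    ]

# Invented data: two items on different scales, placeholder weights.
index_values = formative_index([[1, 2, 3], [10, 20, 30]], [0.5, 0.5])
```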

Final MTMM analysis
This final MTMM matrix includes the two web-based indicators along with the R&D_INDEX and IP_INDEX (see Table 16). Once again, the reliability diagonal will be neglected in our analysis, as the measures are made with single items from the web method and with formative indices from our questionnaire. The mono-trait hetero-method diagonal shows high and significant correlations for R&D (r = 0.419; p < 0.01) and for IP (r = 0.520; p < 0.01), which hints at strong convergent validity. The hetero-trait mono-method value is low and not significant for the web content analysis method (r = 0.182; p > 0.05), although the questionnaire-based survey method value is high and significant (r = 0.320; p < 0.01). However, the mono-trait hetero-method values (r = 0.419 and r = 0.520) for R&D and IP respectively are much higher than the hetero-trait mono-method values (r = 0.320 and r = 0.182), which indicates no combination of trait and method effects and no mono-method biases. The first hetero-trait hetero-method value is low and not significant (r = −0.017; p > 0.05), and the other is moderate and significant (r = 0.294; p < 0.05). However, and more importantly, the correlations are much lower than the corresponding values found in the validity diagonal, which shows good discriminant validity. As all the conditions are satisfied under the original guidelines proposed by Campbell and Fiske (1959), no risk of potential biases is induced within the methods, the traits, or a combination of both. The results based on this methodology suggest that our web-based indicators reflect the importance given to innovation factors such as R&D and IP. It is also worth mentioning that the validity of this construct appears to be stronger than all the others attempted in section 4.2.
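The comparisons made in this paragraph can be restated as a small check on the reported correlations. The sketch simplifies the Campbell and Fiske (1959) guidelines by comparing every off-diagonal value with the smallest validity-diagonal value, which is stricter than comparing each value with its own row and column; significance levels are not tested. The function name and the dictionary keys are hypothetical.

```python
def campbell_fiske_check(validity, mono_method, hetero_method):
    """Compare off-diagonal correlations with the validity diagonal.

    Simplification: each off-diagonal value must be smaller in absolute
    value than the smallest validity-diagonal value.
    """
    floor = min(abs(r) for r in validity.values())
    return {
        "no_method_bias": all(abs(r) < floor for r in mono_method.values()),
        "discriminant_validity": all(abs(r) < floor for r in hetero_method.values()),
    }

# Correlations reported in the text for the final MTMM matrix.
result = campbell_fiske_check(
    validity={"R&D": 0.419, "IP": 0.520},                      # mono-trait hetero-method
    mono_method={"web": 0.182, "questionnaire": 0.320},        # hetero-trait mono-method
    hetero_method={"first": -0.017, "second": 0.294},          # hetero-trait hetero-method
)
```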
Notes: a All traits are measured by single items, no reliability statistic can be calculated. * p < 0.05; ** p < 0.01
ultimately affect the correlations found in the MTMM matrix performed in the previous step. Thus, if these method effects are significant, a congeneric bias is present. One way to verify the effect of each item individually is to extend the MTMM matrix with a CFA, which allows the estimation of latent factors within the MTMM matrix (Maas, Lensvelt-Mulders, & Hox, 2009). Using two PLS regressions allows traits and method factors to load on the items of the estimated constructs. Convergent validity is obtained when the trait loadings are higher than the method loadings. A possible method effect might thus be induced if a method load is higher than the related trait. The discriminant validity is obtained when correlations between the traits are low or moderate. The model tested for this analysis is presented in Figure 2.
MTMM CFA results are presented in Table 17. All the traits load on every item with strong values (all above 0.8). Most of the value loadings for the methods are strong, but generally lower than the trait loading, indicating convergent validity and no congeneric bias. Possible method effects are observed with three items (R&D2, R&D3, and IP1) which load higher with the method than with the trait. It is possible that our use of formative indices may have had an influence on the measure, thus creating a method bias flagged by the CFA results. However, 7/10 (70%) of our results respect the condition, which is considered a high enough proportion to suggest an overall absence of method bias in the construct (Ortiz de Guinea et al., 2013).
Finally, comparing both traits (as set out in Table 18) shows a moderate correlation between the two formative indices, which indicates good discriminant validity.
The results from this post hoc analysis confirm the results obtained using the MTMM matrix alone. Our web-based indicators seem to effectively reflect the importance that technological firms grant to R&D and IP.

DISCUSSION AND IMPLICATIONS FOR RESEARCH
We used web content analysis to build innovation indicators from the web text content of four factors that are consequential for the success of high-technology innovation: R&D, IP, collaboration, and external financing. We then validated the true nature of these indicators by determining whether they were valid substitutes for specific indicators that would otherwise have required a questionnaire-based survey to be obtained. To better understand the nature of the data extracted, two propositions were extensively tested.
Our first proposition stipulated that our web-based indicators could effectively measure specific actions of a company to perform activities related to a given factor. Although significant results were obtained for IP factors and some indicators of R&D, there were no significant correlations with either collaboration or external financing factors.
For the specific case of R&D, we observed that the web-based indicator seems to reflect the promotion needs of various firms in terms of R&D. Our R&D web-based indicator is most highly correlated with the survey-based indicator related to firms more likely to provide R&D services to third parties. Thus, companies appear to use their websites to promote their R&D service offerings. Furthermore, our web-based R&D indicator is highly correlated with the questionnaire-based indicator associated with firms that have a high percentage of employees allocated to R&D tasks. This can be explained by the willingness of firms to attract new R&D talent via their websites. Finally, our R&D web-based indicator correlated significantly with the questionnaire-based indicator tied to whether firms are more likely to contract R&D services from external providers. It is possible that the more important a firm considers R&D, the more likely it is to use both internal and external resources to achieve its R&D-related objectives, which can be described as open innovation practices (Chesbrough, 2003). However, we did not find any correlation between our R&D web-based indicator and an important R&D indicator, the number of R&D projects. This may at first seem counterintuitive, but it reflects a limitation of the indicator, which does not properly account for the size dimension of R&D activities. In contrast, our IP web-based indicator correlates with both the use of IP mechanisms and patent-related activities. Prior to the protection of an intellectual asset, that is, up to the R&D project level, an idea or future invention remains secret. Once an invention is revealed to the world via the disclosure that is generally associated with formal IP protection mechanisms, these IP protection mechanisms act as a signal for the firm, showing how innovative it is or what it wants the world to see. Given that firms' websites are used to publicly disclose such information, they appear to be a promising source of IP protection-related data.
However promising, these preliminary results required more rigorous testing, beyond the simple correlation test between our indicators. The goal was to identify whether specific variables stand out from the others in terms of validity. To do so, we used MTMM matrices to verify the validity of all the promising constructs to assess whether our questionnaire-based indicators can effectively discriminate each other when compared to the web-based indicators. Of the six MTMM matrices produced, four passed all the required validity tests, and two failed.
The fact that we obtained four valid constructs with four different combinations of R&D-IP pairs of different variables demonstrates that we cannot isolate specific actions from the questionnaire-based survey from web-based indicators at this point. Accordingly, at this stage we cannot use the web-based indicator built as a direct substitute for action-related indicators, as good convergent and discriminant validity were obtained for three different questionnaire-based indicators with the R&D web-based indicator and two different questionnaire-based indicators with the IP web-based indicator.
The keywords mentioned on websites may be used by companies that contract R&D services offered by third parties, by companies that provide R&D services to other companies, or by companies that have a higher proportion of employees allocated to R&D, or any combination of the three. We cannot separate one action from another.
The same applies to IP protection. Because acceptable results have been found for the number of patents that a firm owns and for the total number of mechanisms used to protect IP, it is impossible to know precisely what is being measured by our web-based indicator. It is worth mentioning that neither factor significantly correlates with the other (r = 0.144; p > 0.1), which suggests that both variables represent different traits. Therefore, it is not possible at this point to isolate precise IP-related actions.
The second proposition suggests that our web-based indicators could measure the degree of importance that a company grants to a given factor. To test it, we combined all the elements related to a given factor to create two formative indices from the questionnaire-based indicators: R&D_INDEX and IP_INDEX. We repeated the MTMM analysis using these two indices and the two web-based indicators and observed the convergent and discriminant validity of our construct. A post hoc analysis with a CFA confirmed our results. Therefore, our proposition 2 is accepted. In other words, we can use our web-based indicators as proxies for the relative importance that a company grants to a given factor (such as R&D and IP). It is possible to conclude that the more R&D-related keywords a firm uses, the more important it considers R&D-related activities to be, irrespective of the nature of the activities. The same also applies to the protection of IP. We know that the more IP protection-related terms are used, the more actions are taken in this direction. However, this methodology does not allow us to precisely determine the nature of the actions a firm takes, which echoes previous findings by Gök et al. (2015). We can only guess the nature of these actions, as the indicator gives positive answers in both cases because the use of keywords alone ignores the context in which these words are used.
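As an illustration of how questionnaire items might be combined into a single formative index, the sketch below standardizes each item and averages the standardized values per firm. This is only one possible aggregation rule and is an assumption, as are the item names and the values for four hypothetical firms; the exact rule used to build R&D_INDEX and IP_INDEX is not restated here.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize a list of values (population standard deviation)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

def formative_index(items):
    """Average the standardized items firm by firm.

    items maps an item name to a list with one value per firm.
    """
    standardized = [zscores(column) for column in items.values()]
    return [mean(firm_values) for firm_values in zip(*standardized)]

# Hypothetical R&D-related questionnaire items for four firms.
rd_items = {
    "contracts_out_rd":     [0, 1, 1, 0],
    "provides_rd_services": [0, 0, 1, 1],
    "share_rd_employees":   [0.05, 0.20, 0.60, 0.15],
}

rd_index = formative_index(rd_items)
print([round(v, 2) for v in rd_index])
```

Standardizing before averaging keeps a binary item and a proportion on a comparable scale, so no single item dominates the index merely because of its units.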
In a nutshell, our methodology can be used as a valid approach to provide data for future innovation and technology management studies on the relative importance given to factors such as R&D and IP, and to test the validity of the measures thus created. In most questionnaire-based surveys, this information is gathered using 7-point Likert-scale questions. If the goal of a study is to determine the degree of importance of core factors such as R&D or IP for a firm, the use of web-based indicators is reasonable. However, if the goal is to gather more precise information, such as the specific actions taken by a firm, these web-based indicators lack the necessary context to behave as expected. Although they may provide complementary information, they cannot be used as direct proxies.
The fact that we did not obtain significant correlations for collaboration and external funding suggests erring on the side of caution before using this method on a larger scale. The absence of convergent validity between the web-based indicators and those generated from the questionnaire-based survey can be attributed to the small sample size, the inherent characteristics of the firms in our sample, the questions used to capture that information within the questionnaire-based survey, and the keywords used to capture the information from the websites. Thus, other tests with different combinations of methods, keywords, and traits need to be explored to determine whether it is possible to obtain valid and relevant information from companies' websites.

LIMITATIONS AND FUTURE RESEARCH
Given more data, our research would be more robust, especially in verifying whether the concepts of collaboration and external financing, which are normally addressed by classical methods, can be appropriately measured using websites. For instance, we were unable to crawl data from all the companies in our survey due to technical limitations, which meant that only 79 out of 89 companies were used in this paper.
Websites are updated from time to time and the information provided changes accordingly, depending on what companies want to make public. It is thus important to note that a single web mining crawl might be insufficient to capture all relevant information, given that results are subject to change as websites are updated. Thus, a longitudinal study would be required to more clearly understand how time can influence the traits to be measured; that is, the content available on a website and the nature of the website itself. An analysis of the evolution of web-based indicators over time would provide further validity to our methodology.
The inability to measure specific actions constitutes a limitation of our web mining methodology. The main problem lies in the lack of context tied to the use of keywords alone, which may lead to multiple false positives. Machine learning and deep learning techniques, such as recurrent neural networks, natural language processing, or bag-of-words models, are promising avenues to explore the necessary context surrounding specific concepts and to improve the level of precision of web-based indicators.
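The false-positive problem can be made concrete with a small sketch that records the words surrounding each keyword hit. It does not implement any of the techniques named above; the page text, the keyword set, the window size, and the crude negation list are all hypothetical and serve only to show what a context-free count misses.

```python
import re

NEGATIONS = {"no", "not", "never", "without"}  # illustrative, far from complete

def keyword_contexts(text, keywords, window=4):
    """Return (keyword, words before, words after) for every keyword hit.

    Windows are taken over the flat token stream, so they may cross
    sentence boundaries.
    """
    tokens = re.findall(r"[a-z0-9&'-]+", text.lower())
    hits = []
    for i, token in enumerate(tokens):
        if token in keywords:
            before = tokens[max(0, i - window):i]
            after = tokens[i + 1:i + 1 + window]
            hits.append((token, before, after))
    return hits

page = ("We license patents from partners. "
        "Our lab does not conduct research in-house.")
keywords = {"patents", "research"}

hits = keyword_contexts(page, keywords)
raw_count = len(hits)
# Hits whose preceding window contains no negation word.
affirmed = [h for h in hits if not NEGATIONS & set(h[1])]

print(raw_count, len(affirmed))
```

A plain frequency count treats both hits as evidence of activity, whereas the preceding words show that one firm licenses patents held by others and the other explicitly does not conduct research in-house. Context-aware models would have to resolve exactly this kind of distinction.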
Furthermore, we began with theoretical factors for the conceptual framework, then identified the keywords related to these factors, and finally mined the websites for these specific keywords. An interesting alternative would be to do this the other way around; in other words, to start with the website content and identify the factors that emerge naturally via unsupervised machine learning algorithms. The term frequency-inverse document frequency (TF-IDF) technique could be used to provide insight into the importance of keywords in a document relative to the rest of the corpus. N-gram low-frequency words clustering could also be tested to better isolate specific areas of interest (Price & Thelwall, 2005).
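For readers unfamiliar with TF-IDF, a minimal sketch follows. It uses the plain, unsmoothed form (term frequency multiplied by the logarithm of the number of documents divided by the number of documents containing the term); library implementations often apply smoothing and normalization, so their values differ. The three tokenized "websites" are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / document length
    idf = log(number of documents / documents containing the term)
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical website texts, already tokenized.
sites = [
    "we develop nanomaterials and license patents".split(),
    "we develop coatings and sell coatings".split(),
    "we fund research and develop sensors".split(),
]

weights = tf_idf(sites)
top_term = max(weights[1], key=weights[1].get)
print(top_term)
```

Terms that appear on every site, such as "we" or "develop", receive a weight of zero, while a term repeated on a single site receives the highest weight there. This is the sense in which the technique highlights what distinguishes one website from the others.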
Company websites are purposely structured in a cooperative and agreeable manner for anyone seeking information about products, services, activities, and so on. The self-reporting bias induced by this methodology is inevitable. However, it is important to note that questionnaire-based surveys and most national official public directories are also subject to self-reporting biases. Fortunately, the bias induced by the web mining technique is as much a quality as it is a flaw, in that it provides insight into how a company wishes to be perceived. In fact, companies post what they care about, what is important to them, and who they are as an organization on their websites. This qualitative information represents the essence of the company. Future research is needed to determine whether such information could, for instance, be used as a proxy to understand a company's culture.
Other qualitative data analyses of the websites' content could be used to reduce the risk of false positives and to gather more accurate data. The use of indicators based on websites' text data will open the door to experimenting with other possible indicators derived from the same websites, such as the use of colors, pictures, and illustrations; audio and video content; choices of web design styles; use of modern web technologies; frequency of updates; possible interactions with visitors through all of the possible calls to action; and many other sources of data we may not have thought of as yet.
Finding a way to leverage the information from high-tech companies' websites will enable innovation management researchers to access a whole new source of data that is free to use, accessible at all times, and available in large quantities, which will ultimately facilitate studies in this field. Finally, the innovation research community is invited to build on this common ground to create a new systematic way to validate new constructs. If such new innovation indicators from web-based sources can be validated and clearly understood for what they truly measure, the burden on firms that are increasingly asked to answer questionnaires (from researchers, industry associations, or the government) would be considerably reduced. This paper shows that this is a promising avenue.