A Web Tool to Map Research Impacts Via Altmetrics

Governments and research institutes are currently concerned with assessing public awareness of scientific innovations, such as new food production technologies and the development of drugs against emerging and neglected diseases. There is an unmet demand for new ways to show the impact of scientific research on social media (the population's main communication vehicles) and to assess the social outreach of scientific output. This article presents a novel web tool to map research impacts via Altmetrics, which are alternative metrics based on the exchange of scientific knowledge on social media and online environments.


INTRODUCTION
According to the Organization for Economic Cooperation and Development (OECD) 2015 report, business-funded Research and Development (R&D) has increased, while government-funded R&D has declined, reflecting budget consolidation policies [32]. The interference of politicians in R&D budget decisions has led to shorter funding periods that are consistent with time frames for political mandates [6,10,16]. There is a growing commitment by governments to ensure that taxpayer-funded research is translated into benefits for the population [38]. Governments face numerous conflicting demands for public funding [26], including more immediate and direct societal benefits such as the control of emerging and neglected diseases.
Moreover, the lack of engagement of scientists with the public, allied to the spread of rumors and fake news, helps explain the current lack of interest of the population in scientific research [2]. Therefore, more direct communication between scientists and the population through social media is essential, rather than leaving the task of informing people exclusively to the conventional media. Social media tools are becoming increasingly important in the world of science, as institutional actors, such as lobbying agencies and grant committees, are increasingly pushing researchers to show the importance of their work on social media and online environments [31]. It is therefore necessary to develop tools and methods that aid scientific communication and the identification of specialists and important research that produce direct benefits for society.
Given this scenario, we have developed a web tool to map the impacts of R&D in order to understand the representativeness and recognition of researchers, considering the relationship between science and society. Our prototype aims to identify the academic and social reputation of researchers and their research from Altmetrics [33,36], which are alternative metrics based on the exchange of scientific knowledge on social media and online environments [34].
Related Work. In terms of altmetrics, there are some systems capable of tracking a wide range of "online" metrics, such as Altmetric (Altmetric.com) and Impactstory (Impactstory.org), which are tools that capture the online attention of articles and researchers from mentions in social media, Online Social Networks (OSN) and online environments (e.g. Mendeley or Wikipedia) [40]. Altmetric is a commercial service whose purpose is to track and analyze online activity with regard to academic literature, providing data feedback from millions of articles [1]. Impactstory is a tool that brings together the impact history of academic works based on altmetric information from multiple databases. It organizes the information into profiles, where the impact of an author's articles can be displayed [35]. These systems, although powerful tools to evaluate the social outreach of researchers and articles, are limited to a single kind of measurement: online mentions. Our tool, however, takes a broader view of the research, providing: (i) bibliometric indicators in a given field; (ii) academic reputation/influence of researchers in the scientific community; (iii) social outreach/acceptance of scientists and their research by lay people; and (iv) compilation of important research in a given field/topic. To do so, the tool relies on three kinds of measurements instead of a single one: productivity (Bibliometrics), academic impact (Social Network Analysis, centrality metrics) and social impact (online mentions, Altmetrics).

SYSTEM DESCRIPTION
The system is divided into four modules, as shown in Figure 1. It was implemented in PHP (native) with the library EasyRDF (easyrdf.org) version 0.9.0; JavaScript (native) with the libraries Cytoscape.js (js.cytoscape.org) version 3.2.9, jQuery version 2.1.4, and Google

Academic data collection and processing
This module is responsible for retrieving data from publications in indexing databases (e.g. PubMed, Web of Science, Scopus, etc.) to build Scientific Co-authorship Networks (SCN) [17,18,28,30] based on specific areas or topics of interest (e.g. neglected diseases such as Zika or Chikungunya). The module operates by extracting pieces of information from the publications, such as title, author names, affiliations, date of publication, article id, and others. Using this information, we identify the co-authorship network. The main operations performed by this module are: (i) association of two nodes (authors), based on the title of a publication, characterizing an edge; (ii) removal of edges without associated nodes; (iii) representation of the social network described in item (i) using a matrix; (iv) removal of duplicate items; (v) identification of edge weights, based on the frequency of common co-authorship; (vi) assignment of identifiers to each node and edge, allowing the reading and storage of SCN data in the database for later visualization of the co-authorship graph and extraction of academic impact metrics.
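The operations listed above can be illustrated with a minimal Python sketch. It is not the tool's implementation (which is written in PHP and JavaScript); the record keys `id` and `authors`, the function name `build_scn` and the sample records are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def build_scn(publications):
    """Build a weighted co-authorship network from publication records.

    `publications` is an iterable of dicts with the hypothetical keys
    "id" and "authors"; a real database export may be shaped differently.
    Returns (node_ids, edges), where edges are (id_a, id_b, weight).
    """
    seen_ids = set()
    nodes = set()
    edge_weights = Counter()
    for pub in publications:
        if pub["id"] in seen_ids:  # removal of duplicate items
            continue
        seen_ids.add(pub["id"])
        # Deduplicate names within one record and fix a stable order
        authors = sorted({a.strip() for a in pub["authors"] if a.strip()})
        nodes.update(authors)
        # Every pair of co-authors of the same publication forms an edge;
        # the weight counts how often the pair published together.
        for a, b in combinations(authors, 2):
            edge_weights[(a, b)] += 1
    # Assign an identifier to each node, then express edges with those ids
    node_ids = {name: i for i, name in enumerate(sorted(nodes))}
    edges = [
        (node_ids[a], node_ids[b], weight)
        for (a, b), weight in sorted(edge_weights.items())
    ]
    return node_ids, edges


# Made-up records, for illustration only
publications = [
    {"id": "p1", "authors": ["Author A", "Author B", "Author C"]},
    {"id": "p2", "authors": ["Author A", "Author B"]},
    {"id": "p2", "authors": ["Author A", "Author B"]},  # duplicate record
    {"id": "p3", "authors": ["Author C"]},
]
node_ids, edges = build_scn(publications)
```

In this sketch the edge between "Author A" and "Author B" receives weight 2, since the duplicate record is discarded before counting.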

Social media data collection and treatment
This module is responsible for the collection, preprocessing and triplification of data from social media publications such as online news, blog posts, discussion forums and OSN (e.g. Facebook, LinkedIn, Google+, etc.) to build a thematic database. The collection of these data is based on the Webhose API (webhose.io), which allows the monitoring of social media in real time and the automatic collection of posts on specific topics (e.g. Zika or Chikungunya) 24 hours a day. The collected data is in an unstructured format. It is then converted to a semi-structured format (JSON/XML), and the fields/terms of the posts, such as URI, title, text, author, country, domain, date, language and shares on OSN, are extracted. The next step is the description of the data using RDF triples [15,27,41], based on the RDF data model available at: (realm0.github.io/1). Finally, the triples are stored in the Apache Jena Fuseki triplestore (jena.apache.org) for later extraction of the altmetrics. The procedures described in 2.1 and 2.2 were performed with the aid of the Knime tool (knime.org). Knime was used to optimize the preprocessing of the large volume of text contained in the XML, JSON and CSV data files [5].
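As a rough illustration of the triplification step, the following Python sketch turns one semi-structured post into N-Triples lines. The namespaces `http://example.org/post/` and `http://example.org/vocab#` are placeholders and do not reproduce the tool's RDF data model; the function names and the sample post are likewise hypothetical.

```python
import json

BASE = "http://example.org/post/"    # hypothetical namespace for posts
VOCAB = "http://example.org/vocab#"  # hypothetical vocabulary, not the tool's model

FIELDS = ("title", "text", "author", "country", "domain", "date", "language")


def literal(value):
    """Serialize a value as an N-Triples plain literal with basic escaping."""
    text = str(value)
    text = text.replace("\\", "\\\\").replace('"', '\\"')
    text = text.replace("\n", "\\n").replace("\r", "\\r")
    return f'"{text}"'


def post_to_ntriples(post, post_id):
    """Describe one post as a list of N-Triples lines (one triple per field)."""
    subject = f"<{BASE}{post_id}>"
    return [
        f"{subject} <{VOCAB}{field}> {literal(post[field])} ."
        for field in FIELDS
        if post.get(field) is not None
    ]


# A made-up post in JSON form, for illustration only
post = json.loads('{"title": "Zika study", "country": "BR", "language": "english"}')
triples = post_to_ntriples(post, 1)
```

Each resulting line could then be loaded into a triplestore; how the tool itself uploads data to Apache Jena Fuseki is not shown here.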

Academic impact analysis
This module is responsible for mapping important clusters and individuals, extracting productivity and academic impact metrics of the SCN (built by the module described in 2.1) at three levels: (i) Global: maps the SCN as a whole from the global graph, which allows comparing publication and collaboration behavior among researchers from different areas. It uses as parameters: number of researchers, publications and components (subnets of connected nodes) in the SCN, sum of publications considering each researcher individually, sum of researchers considering each publication individually, average number of publications per researcher and average number of researchers per publication.
(ii) Local: maps the existing subnets, which helps to identify clusters of important researchers. At this level, the system identifies the clusters of researchers, names the subnets according to the number of components (e.g. it assigns the id 'subnet 1' to the cluster with the highest number of researchers) and associates the researchers with their respective subnets. It uses as parameter the subnet's number of nodes/elements.
(iii) Individual: maps the most influential researchers based on a researcher's number of publications and network centrality. At this level, the system takes as input the researchers in their respective subnets (in the case of disconnected graphs), using as parameters the number of publications and the number of elements ordered by the centrality metrics (these parameters are configurable).
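The Local and Individual levels can be sketched in Python as follows. The text does not name the centrality metrics used, so degree centrality is an assumption made here purely for illustration; the function names and the toy graph are also hypothetical.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Return the subnets (connected components), largest first, plus adjacency."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first traversal collects every node reachable from `start`
        stack, component = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    # 'subnet 1' is the component with the highest number of researchers
    components.sort(key=len, reverse=True)
    return components, adjacency


def rank_by_degree(component, adjacency):
    """Rank researchers of one subnet by degree centrality (an assumed metric)."""
    denominator = max(len(component) - 1, 1)
    scores = {n: len(adjacency[n]) / denominator for n in component}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy graph, for illustration only
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("a", "c"), ("d", "e")]
components, adjacency = connected_components(nodes, edges)
ranking = rank_by_degree(components[0], adjacency)
```

Here the largest subnet contains "a", "b" and "c", and "a" ranks first because it collaborates with both other members.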

Social impact analysis
This module is responsible for extracting social impact metrics, based on SPARQL queries performed on the triples database (section 2.2) from the system interface. These queries are necessary for the extraction of altmetric indexes that can measure: (Query 1) the reach of the research in primary (online news) and secondary (e.g. scientific blogs and forums) communication vehicles; (Query 2) its acceptance by the population, via its dissemination on OSN such as Facebook and Google+; and (Query 3) its visibility at the global level, identifying the country of origin of the publication. The queries use as parameters the mentions of a researcher in publications, the mentions/shares on OSN and mentions by country.
However, the name of a researcher can be cited in various ways on social media. To address this problem, we created a spellings dataset mapping the researchers to the different forms of their names cited in academic publications.
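A spellings dataset of this kind could be applied to post texts roughly as in the Python sketch below. The dataset layout (researcher id mapped to a list of name variants), the matching rules (case-insensitive, whole-name matches, one count per post) and the sample data are assumptions for illustration, not the tool's actual behavior.

```python
import re


def compile_spellings(spellings):
    """Compile one case-insensitive pattern per researcher from name variants."""
    patterns = {}
    for researcher, variants in spellings.items():
        cleaned = [v.strip() for v in variants if v.strip()]
        if not cleaned:
            continue
        # Try longer variants first so the fuller spelling wins
        cleaned.sort(key=len, reverse=True)
        alternation = "|".join(re.escape(v) for v in cleaned)
        # Lookarounds keep a variant from matching inside a longer word
        patterns[researcher] = re.compile(
            r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE
        )
    return patterns


def count_mentions(posts, spellings):
    """Count, per researcher, the posts whose text contains any known spelling."""
    patterns = compile_spellings(spellings)
    counts = {researcher: 0 for researcher in spellings}
    for post in posts:
        for researcher, pattern in patterns.items():
            if pattern.search(post["text"]):
                counts[researcher] += 1
    return counts


# Made-up spellings and posts, for illustration only
spellings = {"researcher_1": ["Jane Q. Doe", "J. Q. Doe", "Doe JQ"]}
posts = [
    {"text": "Study led by jane q. doe on Zika"},
    {"text": "Doe JQ et al. report new findings"},
    {"text": "Unrelated Doe"},
]
counts = count_mentions(posts, spellings)
```

The third post is not counted because the bare surname is not among the registered spellings.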
Thereafter it is possible to create the altmetric ranking, based on the mentions and shares returned by the first two queries, and generate the map that shows the geographical distribution of the mentions based on the results of the third query. The altmetric ranking is acquired by means of the altmetric score, which is calculated by summing the results obtained in Query 2 using the formula (1): where 'n' is the number of mentions on news, blogs, and discussion forums, 'fb' is the number of shares on Facebook and 'gp' is the number of shares on Google+. The smaller weight for the OSN shares is based on a similar criterion used by altmetric.com [3].
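A score of this general shape, a weighted sum in which OSN shares weigh less than mentions, can be sketched in Python as follows. The weights are deliberately left as caller-supplied parameters: the values used in the example (1.0 and 0.5) are placeholders chosen for illustration and are not the weights of formula (1).

```python
def altmetric_score(n, fb, gp, w_n, w_fb, w_gp):
    """Weighted sum of mentions (n) and shares on Facebook (fb) and Google+ (gp).

    The weights are supplied by the caller; this sketch does not
    reproduce the weights of the tool's formula (1).
    """
    return w_n * n + w_fb * fb + w_gp * gp


def rank(counts, w_n, w_fb, w_gp):
    """Order researchers by descending score, breaking ties by name."""
    scored = [
        (name, altmetric_score(c["n"], c["fb"], c["gp"], w_n, w_fb, w_gp))
        for name, c in counts.items()
    ]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Made-up counts and placeholder weights, for illustration only
counts = {
    "researcher_1": {"n": 10, "fb": 4, "gp": 2},
    "researcher_2": {"n": 3, "fb": 30, "gp": 0},
}
ranking = rank(counts, w_n=1.0, w_fb=0.5, w_gp=0.5)
```

With these placeholder weights, researcher_2 ranks first despite having fewer mentions, which shows how the choice of weights shapes the ranking.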
As the queries are carried out, the social impact metrics are extracted and saved in the triplestore. This enables the categorization of researchers based on their reputation, by correlating the academic impact with social impact. Four impact categories are possible: (i) High academic impact and high social impact: outstanding researchers in the scenario. They are names of significant influence in their field of work, belonging to networks of scientific collaboration with strong geopolitical/institutional references and strong online presence. (ii) High academic impact: researchers who gather and participate in well-defined research nuclei, but with little online presence. (iii) High social impact: researchers who do not have a well-defined collaboration network, but often prefer other ways to share results, such as fast-tracks, OSN (e.g. Facebook) and scientific blogs, making their dissemination more practical and faster. (iv) Low academic impact and low social impact: researchers of minor importance in the scenario and with irrelevant online presence.
Researchers are plotted in the graph, where the dimensions are "academic index" (X-axis, normalization of the academic score) and "social index" (Y-axis, normalization of the social score). Each quadrant of the diagram has a particular meaning: (i) Lower left quadrant: low academic impact and low social impact. (ii) Upper left quadrant: low academic impact and high social impact. (iii) Lower right quadrant: high academic impact and low social impact. (iv) Upper right quadrant: high academic impact and high social impact. The researchers' scores are normalized using formula (2):
$$Z_i = \frac{x_i - \min(x)}{\max(x) - \min(x)} \qquad (2)$$
where $x = (x_1, \ldots, x_n)$ represents the set of values, $Z_i$ is the normalized value of $x_i$ in the $i$-th iteration, $\min(x)$ is the smallest value in the set and $\max(x)$ is the largest value in the set.
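The variable definitions given for formula (2) correspond to min-max normalization, which the Python sketch below applies before assigning a quadrant. The quadrant boundary of 0.5 on each axis, the handling of a constant set of values and the function names are assumptions made for illustration; the text does not state where the quadrants are split.

```python
def min_max(values):
    """Min-max normalization: maps the smallest value to 0 and the largest to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumption: a constant set carries no ranking information
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


def quadrant(academic_index, social_index, threshold=0.5):
    """Name the quadrant of a researcher; the 0.5 split is an assumed boundary."""
    vertical = "upper" if social_index >= threshold else "lower"
    horizontal = "right" if academic_index >= threshold else "left"
    return f"{vertical} {horizontal}"


# Made-up scores, for illustration only
academic = min_max([2, 4, 6])
social = min_max([30, 10, 20])
labels = [quadrant(a, s) for a, s in zip(academic, social)]
```

In this example the first researcher lands in the upper left quadrant (low academic impact, high social impact) and the third in the upper right.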

Data storage
As the algorithms process the data of the modules 2.3 and 2.4, the academic and social impact information is also triplified, according to the RDF data model (realm0.github.io/2), and stored persistently in the Apache Jena Fuseki triplestore. Thereby, our tool allows the creation of sessions, which provide a way to save the calculated academic and social impact rankings. This also enables one to check the progress of a field or even make comparisons among different fields (e.g. how scientists collaborated to drive the major advances in Zika research and how the population reacted to the findings).

USAGE SCENARIO: NEGLECTED DISEASES
So far, we have used our tool in several studies to identify specialists and important research on two neglected diseases: Zika and Chikungunya. Although the tool can be applied to any area of science, topic of interest or scientific database, we chose these topics to meet the demand of Fiocruz (Oswaldo Cruz Foundation) and ZIKAlliance (international consortium) [43], that is, evaluating the triple arbovirosis outbreak caused by the Aedes aegypti mosquito (Zika-Dengue-Chikungunya) in Brazil [39] and worldwide, especially Zika [29]. Further discussion about these studies can be found in [19-25,37].
We demonstrate the tool (online and/or laptop) by feeding it with previously collected and cleaned datasets. As for the chosen datasets, we collected data about Zika and Chikungunya on PubMed, via its search mechanism (http://www.ncbi.nlm.nih.gov/pubmed). In this way we retrieved and preprocessed (as explained in 2.1) data from 1,932 Zika publications, prior to 12/21/2016, and 3,757 Chikungunya publications, prior to 02/13/2018, to build the SCNs. As search strings, we used the terms "zika" and "chikungunya" in the filters "title", "abstract" and "text". We also collected data on social media via the Webhose.io API, using as search strings the terms "zika" OR "zyka" OR "zikv" for Zika and "chikungunya" OR "chicungunha" OR "chikv" for Chikungunya. This collection took into account publications on news, blogs and discussion forums, as well as shares on Facebook and Google+, in 115 languages, from December 2014 to December 2018. In this process, 1,351,284 publications on Zika and 257,022 publications on Chikungunya were collected. We then used module 2.2 to build the Zika and Chikungunya thematic databases.
The next step is to analyze the academic and social impacts using the system interface, available at <http://www.realm.net.br>.
System Interface: Tutorial. In this subsection we briefly describe how the system interface works. The goal is to make our studies replicable by other researchers. For this purpose, we made the Zika and Chikungunya datasets, along with their respective spellings datasets, available at: <https://goo.gl/vsBeK3>. We also made a video tutorial explaining how the system interface works, using a smaller dataset (Zika, crawled period: Oct-Dec 2016, available at <https://goo.gl/pVFJmP>) as an example. The video tutorial is available at <https://youtu.be/NfcdG8o0PyE>.
Academic Impact. When accessing the academic impact module, three actions are possible: (i) load an SCN file; (ii) select an SCN from files already used in previously registered sessions; (iii) open saved sessions recorded on a specific topic. When uploading (option (i)) or selecting an existing file (option (ii)), the system prompts the selection of a topic for the session (e.g. Zika). After the selection and submission to the server, the session identification is saved to the database, along with the SCN file path. After saving the session, the system calls the functions of the analysis and ranking algorithms and performs the Global Analysis. The user must then enter the number of subnets to be analyzed (the system will analyze the N subnets with the highest number of elements). Next, the user must provide the parameters of the Individual Analysis (number of publications and number of elements), so that the analysis and ranking of researchers can be executed. Finally, the system displays a preview of the results of this analysis and asks whether the user wants to save it in the session. If the user agrees, the system saves the data in the triplestore, displays the results of the Global, Local and Individual Analyses (Figure 2a and Figure 2b), and displays a link to the SCN (Figure 2c), where it is possible to observe in detail (and interact with) the scientific collaborations in the field and the clusters formed.
Social Impact. When accessing the social impact module, two actions are possible: (i) open a social impact analysis already linked to a session; (ii) create a new social impact analysis for a registered session. When option (ii) is selected, the system requests the spellings dataset (explained in 2.4), which can be a new upload or one that has already been saved in other sessions of the same topic. After the upload, the system populates a form with all the spellings of the dataset. At this point, one can update the dataset by adding new spellings or researchers, or by editing its data. When the spellings are sent, the system updates the dataset and starts executing the queries, using all three datasets as parameters to search the text of the social media publications for the spellings of each researcher identified in the Individual Analysis. When a match occurs, the system counts the occurrence and applies formula (1) to assign the researchers' altmetric score. After this procedure, the data is stored and the system displays the social impact results. Thereafter it is also possible to see the featured researchers in the scenario (Figure 2d) and their 'Details' tab, a profile-like page that displays the scientists' most mentioned/shared research on social media and the map of these mentions (Figure 2e).
Figure 2: Academic and social impact displayed on the system's interface
System Evaluation. The tool was evaluated regarding its usefulness and correctness using the Technology Acceptance Model (TAM) [11-13]. To do so we conducted a study, relying on qualitative analysis, to evaluate it according to 3 constructs: Perceived Usefulness (PU), Ease of Use (EU), and Self-Predicted Future Use (SPFU). Fiocruz and ZIKAlliance specialists were invited via email and 7 volunteered for the study. The volunteer researchers performed 15 tasks individually, where the primary goal was to observe their interaction with the tool. During and after usage, the specialists commented on what they found easy or difficult in using it, as well as on issues related to the interface and its displayed results. The evaluations were recorded in audio and video (with the users' consent) and via a questionnaire with 14 items, of which 6 related to PU, 6 related to EU and 2 related to SPFU. To assess the users' perceptions we used a five-point Likert scale ranging from "strongly agree" to "strongly disagree".
Despite some difficulties in navigating the system, all participants succeeded in performing all 15 tasks on their computers. Overall, the tool achieved good results on the 3 constructs, especially regarding EU and SPFU, where 5 of the 7 specialists assigned it a high score (4 to 5 on all items). In other words, most participants found the tool useful and easy to use, and would prefer to use it in the future over other methods to find specialists. The experts also gave their opinions about the results presented by our tool, recognizing the ranked researchers, explaining their importance and pointing out missing names. According to the specialists, the results are consistent and reflect very evident realities, among them: (i) the mapping of the outbreak evolution in its most critical period; (ii) the mapping of the interactions between researchers, population and media; (iii) the impacts of the dissemination of scientific output on social media, allowing us to better understand how the population sees and interprets the findings made by the scientists. The TAM evaluation results are available at <https://realm0.github.io/3>.