Challenges and ambition
Challenge 1: Understanding Science evolution
Our first challenge is to build global semantic maps of the evolution of science over large scientific domains by applying appropriate scientometric models to large databases such as the Web of Science (about 30% of worldwide scientific production, interdisciplinary but biased toward the hard sciences), MedLine (the main biomedical archive), RePEc (the main archive for economics) or open archives like arXiv.org (the main pre-print archive for physics, mathematics and computer science). The ISC-PIF partner already has access to these datasets, including the WoS in raw format until 2015, with the appropriate licences to share them within a research project. These datasets will be a general driving force for the project and for the experimental validation of the solutions. As a first goal, the project intends to contribute to our overall understanding of the evolution of science, and will therefore confront some of its results with competing philosophical theories of how science is built and how it changes. The challenge is to pursue this task in the light of the main existing accounts of scientific evolution: traditional cumulative accounts of science, Darwinian accounts such as Hull (1988) or memetics, Popperian accounts (Popper 1963), Kuhnian accounts in terms of revolutions and paradigms (Kuhn 1970), Lakatosian accounts in terms of labile research programs (Lakatos 1978), and more recent Bayesian views of theory change. Most of these conceptions have been elaborated and tested on a small number of papers and books; we intend to assess their validity by confronting them with the patterns that we will uncover in databases covering a long time span and a very large number of scientific publications. The project will also consider the recent claim that current science is more “hypothesis driven” than “theory driven”, and will take as background the idea (Nowotny et al. 2001) that the novel regime of science focuses on problems rather than disciplines: the phylomemies obtained through quantitative analysis should show different patterns depending on whether this thesis is true or false. We intend to test this sociological insight about the production of scientific knowledge.
The project partners have pioneered the reconstruction of science dynamics by mining corpora at large scale [Chavalarias & Cointet 2013][Chavalarias et al. 2011]. They have shown that the different phases of the evolution of scientific fields can be characterized quantitatively, and that “phylomemetic” topic lattices representing this evolution (by analogy with the genealogical trees of natural species) can be built automatically. The reconstruction of phylomemetic lattices from scientific production is of utmost importance for a wide range of actors:
- philosophers and historians of science, who need to test their theories with data, in particular about the ways fields cross-fertilize and novelty emerges,
- scientists who want to position themselves in their field, understand the ins-and-outs and find domains with high discovery potential,
- policy makers who want to spot emergent fields, foster innovation and get key indicators to assist them in decision-making processes,
- industry, which has to find its way through the scientific production and evaluate the potential for innovation and technology transfer,
- librarians who need to propose classifications of documents which respect not only the hierarchy of scientific topics but also the evolution of ideas.
This reconstruction is now within reach since science has been one of the first domains of human activity to have digitized archives.
Science, and human activity in general, is largely built on the exchange and recording of knowledge in textual form. Well before the advent of the Web, the increasing amount of scholarly literature made it difficult for scientists to keep up to date in their field of interest, and a main task of scientific editors was, and still is, to collect and organize articles in different domains. Today, for example, collections like the Web of Science (WoS) or Scopus provide researchers, administrators, faculty, and students with quick, powerful access to the world’s leading scientific metadata: tens of millions of records, including hundreds of millions of cited references, with retrospective coverage in the sciences, social sciences, arts, and humanities going back to 1900 (and a qualitative and quantitative leap in metadata around 1990). Granted, this does not constitute science itself, but scientific papers are a privileged record of scientific activity, and those collections are partial but robust windows on this record.
EPIQUE will involve philosophers of science, who see phylomemetic lattice structures as a tool for testing general accounts of progress and change in science; they will also play a role in validating the lattices produced in the context of the project. This input will allow a better calibration of the protocol for reconstructing lattices, and will therefore provide feedback on the methodology itself. As users, the historians and philosophers of science involved in the project will design case studies (especially in biology, ecology and economics) of the evolution of a target concept across the subfields of a general field (e.g. the concept of “diversity” or of “function” in ecology); those case studies will in turn contribute to fine-tuning the algorithmic tools. Finally, within EPIQUE, philosophers of science will integrate research in philosophy of science with the computer science work on big data [NPS14][NPS11].
Challenge 2: Large-scale text topic detection and alignment
The goal of building a global map of the evolution of science is also challenging from a computer science point of view. It amounts to making sense of unstructured text through generic data processing tasks (graph clustering, similarity matching, indexing) which become complex when dealing with very large amounts of digitized text. The size of the digital archives and science repositories (MedLine, arXiv.org, WoS, etc.) calls for the development of new solutions exploiting recent parallel data processing frameworks like Hadoop, Spark, and Pregel. The project partners already have promising first results in using Spark [Guichard 2014][Lajus 2014] for building phylomemetic trees over such corpora. Our goal is to go further in defining appropriate scalable data structures and algorithms for generating and processing large phylomemetic structures from large document corpora. These structures and methods will exploit recent efficient parallelization, approximation and compression techniques to reduce processing and storage costs. In particular, we intend to reuse and extend existing text and graph mining techniques and to explore approximation techniques for accelerating computation.
Challenge 3: Dynamicity, Interactivity and Customization
The third and probably most ambitious challenge is to make the whole mining workflow more flexible and interactive. Consider for example a chronologically ordered list of digitized publications from which the user wants to extract a phylomemetic network of the research domains. Building the result involves a number of complex processing steps which make it difficult to handle dynamic information and customization. As new articles are published, new nodes (terms) are added to the graph and the weights of some existing links are modified to reflect the change in the number of articles sharing common terms. If the topic detection algorithm is restarted on the entire graph, scalability becomes a major issue since, for example, all the term co-occurrences are analyzed again. Another need for interactivity and incremental processing appears when users are only interested in a particular sub-domain or (thematic or temporal) sub-collection of documents, or when they want to change the time scale of the analysis (e.g. zoom from yearly statistics into monthly statistics), similar to roll-up/drill-down operations in more standard OLAP systems. Finally, results might also be erroneous or useless for the final user because certain parameters have to be changed. The challenge here is to allow for incremental processing, where the workflow is able to reuse already computed results (e.g. term co-occurrence scores and related proximity measures) to avoid global re-computation when new information arrives or parameters change. Our goal is to extend the workflow to enable partial and incremental computation and to follow content evolution online by incrementally updating the generated analytic data models.
We propose to study to what extent it is possible to decompose the workflow into tasks that can be executed independently and that can reuse data produced by previous workflow executions (for example, updating already computed scores to reflect data changes instead of recomputing all scores on the whole input data).
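As an illustration of the kind of reuse discussed above, the following minimal sketch keeps document-level term and term-pair counts that can be updated when new articles arrive, rather than recomputed from scratch. It is not the EPIQUE implementation: the class `CooccurrenceIndex`, its method names, and the choice of a Jaccard-style proximity are assumptions made for the example. That proximity is chosen here because it depends only on the counts of the two terms involved, so a new batch of documents can only change scores of pairs that involve a term from that batch.

```python
from collections import Counter
from itertools import combinations


class CooccurrenceIndex:
    """Hypothetical incremental store of term and term-pair document counts."""

    def __init__(self):
        self.n_docs = 0
        self.term_counts = Counter()   # term -> number of documents containing it
        self.pair_counts = Counter()   # (term_a, term_b) -> number of documents containing both

    def add_documents(self, docs):
        """Fold a batch of documents (iterables of terms) into the counts.

        Returns the set of terms whose counts changed; only proximity
        scores involving these terms may need to be refreshed.
        """
        touched = set()
        for doc in docs:
            terms = sorted(set(doc))          # count each term once per document
            self.n_docs += 1
            self.term_counts.update(terms)
            self.pair_counts.update(combinations(terms, 2))
            touched.update(terms)
        return touched

    def jaccard(self, a, b):
        """Jaccard proximity between the document sets of two terms."""
        both = self.pair_counts[tuple(sorted((a, b)))]
        either = self.term_counts[a] + self.term_counts[b] - both
        return both / either if either else 0.0
```

Adding a second batch to an existing index yields the same counts as building the index on all documents at once, which is what makes the incremental update safe. Measures that depend on the total number of documents (mutual information, for instance) would need a different refresh strategy.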
General project statement
EPIQUE is the first project in which the evolution of science will be studied at such a large scale (over entire datasets such as the WoS or MedLine). From the viewpoint of philosophy of science, it allows theories on the evolution and nature of science, which have so far been formulated by considering only a few canonical texts (the “great scientists of the past”, which introduces numerous biases), to be tested on a corpus that can reliably be seen as a plausible testimony of scientific activity. Preliminary results on a small part of the corpus already demonstrate that phylomemetic lattices reveal novel semantic insights about science evolution [CC13]. We are confident that, when the whole corpus is taken into account, the approach will not only carry over to other scientific fields but will also, more fundamentally, yield a deeper understanding of interdisciplinary evolution. Facing an ever-growing corpus, we do not consider scientometric workflows as sequences of independent tasks on a given dataset; instead we strive for a more integrated framework which allows end users to interact with and control the whole process through high-level languages and interfaces (e.g. for specifying the scientific field and time range of interest, or any criterion on the corpus such as the country of the authors). The architecture of the project relies strongly on feedback loops between the production of lattices and users such as philosophers of science, historians who have reconstructed small semantic networks, and other experts. This allows for the controlled production of phylomemies, whose results are assessed by comparison with expert knowledge in the field. We focus not only on the workflow itself, but also aim to deliver a system for producing, maintaining and adjusting phylomemetic lattices on demand.
This brings the double opportunity to manage complex data more efficiently and to optimize the text mining workflow, e.g. by computing only the required phylomemetic lattices, sharing (and saving) computation among users, and reusing workflow refinement strategies among users. The system, serving several users, will build on users’ experience to provide both unprecedented efficiency and new incentives for collaboration among users.
- A first direct outcome of EPIQUE will be the enrichment of the open source ISC-PIF software catalog with new innovative tools for the reconstruction and exploration of multi-scale dynamics in complete real-world scientific corpora and for obtaining new insights in the evolution of complex human generated knowledge and information. In particular, advances in phylomemies reconstruction will be implemented in the Gargantext platform that is used, among others, for the teaching of controversies analysis to students in several higher-education schools and universities.
- The second, more generic result, will be a uniform framework for specifying, implementing and integrating large-scale text and graph mining tasks which can be customized independently of the higher-level mining algorithms with respect to specific cost models and hardware constraints (memory, CPU).
- A third result will be the opportunity to revisit classical hypotheses concerning the evolution of scientific fields and contents and to test and improve these hypotheses in the light of the reconstructed phylomemies and of general patterns detectable within them; this outcome will be exploited in various academic publications coauthored by computer scientists and philosophers or historians of science participating in the project.
The main goal of the EPIQUE project is to define, implement and compose a set of tools for extracting customizable maps of the evolution of science from large and representative scientific corpora like Web-of-Science, Medline and arXiv.org, and to contribute to hypothesis testing in the domain of the history and philosophy of science, regarding science evolution. These tools are based on scalable text and graph mining algorithms which are combined within EPIQUE workflows. The EPIQUE main workflow is composed of three steps:
- The term extraction and proximity graph construction step transforms a collection of text documents into a set of weighted term graphs / matrices for different user-defined time slices. A node in this graph is a set of semantically equivalent n-grams. We consider different ways to compute the relation between nodes, in particular those computed from co-occurrence data such as mutual information, χ² distance, cosine distance, distributional measures, etc. This distinguishes our approach from studies like [Shahaf 2013], which also aim at scaling the process of dynamic topic detection but consider only monograms (single-word terms) and very simple proximity measures like mere term occurrences; both choices introduce large biases in the analysis (although they considerably simplify the scalability issue).
- The topic detection and alignment step consists first in the detection of topics (sets of strongly semantically related terms) within term graphs (each of these graphs corresponds to a different time interval). Second, it “matches” topics from different time intervals (for example by using Jaccard distance) to generate alignments (split, merge, equivalent) representing the temporal evolution of topics.
- The phylomemetic tree analysis and customization step allows experts to interact with the workflow by generating and visualizing phylomemetic trees and by interactively customizing the workflow, changing data (for example removing or adding a term in a topic) and parameters (for example the time interval for dividing the document collection).
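The first two steps above can be sketched on toy data as follows. This is only an illustration under stated assumptions, not the project's algorithms: the function names (`pmi_graph`, `detect_topics`, `align_topics`) and the default thresholds are hypothetical, pointwise mutual information stands in for the family of co-occurrence measures, and topic detection is reduced to a placeholder (connected components of the thresholded term graph) where a real workflow would use a proper graph clustering method. Matching uses Jaccard similarity, i.e. one minus the Jaccard distance mentioned above.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def pmi_graph(docs):
    """Step 1 (sketch): weighted term graph {(a, b): PMI} for one time slice."""
    n = len(docs)
    term_df, pair_df = Counter(), Counter()
    for doc in docs:
        terms = sorted(set(doc))
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))
    # PMI = log( p(a, b) / (p(a) * p(b)) ) with document-frequency estimates
    return {(a, b): math.log(c * n / (term_df[a] * term_df[b]))
            for (a, b), c in pair_df.items()}


def detect_topics(graph, min_weight=0.0):
    """Step 2a (placeholder): connected components over edges above min_weight."""
    neighbours = defaultdict(set)
    for (a, b), weight in graph.items():
        if weight > min_weight:
            neighbours[a].add(b)
            neighbours[b].add(a)
    topics, seen = [], set()
    for start in neighbours:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            term = stack.pop()
            if term not in component:
                component.add(term)
                stack.extend(neighbours[term] - component)
        seen |= component
        topics.append(frozenset(component))
    return topics


def jaccard(a, b):
    """Jaccard similarity of two non-empty term sets."""
    return len(a & b) / len(a | b)


def align_topics(earlier, later, threshold=0.3):
    """Step 2b (sketch): link topics of consecutive periods and label each link."""
    links = [(p, c) for p in earlier for c in later if jaccard(p, c) >= threshold]
    n_children = Counter(p for p, _ in links)
    n_parents = Counter(c for _, c in links)
    labelled = []
    for p, c in links:
        if n_children[p] > 1:
            kind = "split"
        elif n_parents[c] > 1:
            kind = "merge"
        else:
            kind = "equivalent"
        labelled.append((p, c, kind))
    return labelled
```

In such a sketch, the third step corresponds to re-running `detect_topics` or `align_topics` with different thresholds or time slices, or editing the topic sets by hand before alignment.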
Related work and scientific contributions
Epistemology and maps of science
A major issue in philosophy of science is to uncover the conceptual structure of scientific fields and of their dynamics over time. Several theories have been formulated in the field of science evolution [Popper 1963][Kuhn 1970][Lakatos 1980][Bonaccorsi 2008] and many (often conflicting) descriptions and explanations of scientific change and revision have been proposed. These theories diverge on the continuous or discontinuous character of science evolution, the relevance of Darwinian selection mechanisms (Hull 1988), the usefulness of Bayesian frameworks to capture theory change (Romeyn 2006), and the required ties between sociology and epistemology. Sociologists and historians do have good case studies on the building, structure and dissipation of some specific scientific areas (Weingart et al. 1976, Stichweh 1992, Galison 1997), yet generalising their scope is difficult because of the particular nature of each social setting, and the frequent abstraction of general scientific norms. On the other hand, scientometrics has a long tradition in science mapping and science visualization using text, citation links or co-authorship relations. Until recently, only a minority of these studies looked at the evolution of science. Most of the time, they targeted the contents of archives in a specific field (e.g. [Callon et al. 1986][Chavalarias and Cointet 2013] for embryology) or mainly relied on citation links [Cheng 2004][Besselaar, 1996][Rosvall and Bergstrom 2010][Herera et al. 2010]. To the best of our knowledge, qualitative philosophical and historical expertise on the statics and the dynamics of science has never been corroborated by quantitative scientometric approaches at a scale beyond a specific scientific area.
This is precisely the aim of the present project, with the perspective of providing reliable tools for scientists and science managers, allowing them to build or design collaborative networks and develop new directions of research based on a comprehensive knowledge of the scientific landscape.
Science and technology studies (STS) have argued that “science in action” is more than published results, and includes for instance “tacit knowledge” and often unpublished controversies. EPIQUE does not intend to account for all these dimensions of the dynamics of science. It is positioned at the level of published science, like many classical works in history and philosophy of science, and considers science evolution at the level of scientific archives and databases. This distinguishes the approach of EPIQUE both from STS and from more classical history of science. We assume that phylomemetic trees can reveal the signature of specific dynamics of science, for instance regarding the impact of natural selection, the rather gradual or discontinuous pattern of evolution, etc. EPIQUE will therefore provide a perspective on science likely to complement what we can learn through the study of the local social dynamics underlying the social construction of science.
Large-scale phylogenetic problems and methods in biology
In biology, the goal of phylogenetic analysis is to reconstruct phylogenetic trees [Baum & Smith 2013] representing ancestor relationships between species, genes, etc. from molecular data such as DNA sequences. There exist a large number of computational phylogenetic methods using different machine learning techniques (maximum likelihood, Markov Chain Monte Carlo [Li 1996], Bayesian inference) depending on a formal description of the observed species characters. There are some similarities to the problems we study in EPIQUE (usage of large-scale parallel data processing [Tyson et al. 2014], scientific workflows). However, most forms of molecular phylogenetics make extensive use of sequence alignment in constructing and refining phylogenetic trees (for detecting similarities), which makes these approaches very different from the text and graph mining techniques we intend to use and extend in EPIQUE. Moreover, the structure of science is a phylogenetic lattice, contrary to the phylogenies found in biology, which modifies the reconstruction strategies.
Narration and Topic Tracking
The idea of reconstructing the evolution of knowledge by bridging topic detection algorithms with time tracking is a recent and active topic in different communities: knowledge discovery [Shahaf 2013], scientometrics [Chavalarias et al. 2011][Chavalarias and Cointet 2013], social network analysis [Shahaf 2012], visual analytics [Liu 2013], news tracking [Allan 2002]. It is relevant to the study of science evolution, but also to the analysis of the evolution of debates within the blogosphere and the media, or to the tracking of emerging technological niches in patents. The goal of topic detection and tracking of news stories is to build clusters of related news on the fly. Other related works include latent topic models that have been adapted to cater for the temporal dimension [Wang, 2011; Wang, 2012]. The main differences with EPIQUE are that (1) there was no attempt to relate clusters to one another (the goal is to identify the different news “stories” or to define the major topics over time) and (2) the focus was not on efficiency and interaction in the case of large corpora.
Patent mining and visualisation
Intellectual property has become a major economic factor for many industrial companies, but also for scientific and technical organisations developing new technology. The number of patents increases every year: the U.S. Patent and Trademark Office (USPTO) registered 615 234 patent applications in 2014 and granted about half of them (326 033). Patent documents are rich technical documents which are difficult to analyse for non-experts, and a number of tools have been developed to assist patent engineers and decision makers. Most existing patent analysis systems such as Thomson Reuters’ Aureka, Google Patent or WikiPatent mainly focus on searching (“prior art search” [Oh et al. 2013]) and ranking [Po et al. 2012]; others like Patents and PatentLens provide more advanced analysis capabilities. These tools are mainly based on text mining techniques [Tseng et al. 2007]. More recent work [Tang et al. 2012] proposes to combine more advanced mining techniques like topic-driven modeling, heterogeneous network co-ranking and competitive analysis for building topic maps over patent collections. Although there are some similarities in the goal of building “topic maps” over patent collections, these tools are not suited to a temporal analysis of the evolution of a technology. As discussed in the impact section, this might also be an interesting future application of EPIQUE.