In 2017, we introduced the Multivac Data Science Lab (Multivac DSL), a set of tools, such as interactive notebooks, running on top of a dedicated cloud-based Hadoop cluster for Apache Spark jobs: machine learning, NLP, deep learning, iterative graph computation, ETL, and more.

After its tremendous success, we are thrilled to announce the release of Multivac Data Science Lab to all our partners!

Multivac DSL Success Stories

Over the past 14 months, we have used Multivac DSL at ISC-PIF for the following use cases:

  • Machine Learning
    • Wikipedia and Web of Science topic modeling by using LDA
    • Building a recommendation model based on 100 million Netflix ratings by using ALS
    • Outcome prediction by classification and regression (decision trees, random forests, logistic regression, and naive Bayes)
    • Clustering keywords and phrases by K-means and Gaussian mixture models (GMMs)
  • NLP
    • Implementing Stanford CoreNLP in Apache Spark for distributed NLP
    • Training Universal Dependencies ML models for multilingual part-of-speech tagging from millions of documents
    • Implementing distributed NLP pipelines for extracting keywords and phrases from large-scale English and French documents
  • Graph
    • Politoscope community detection (100 million tweets)
    • Distributed Louvain algorithm
    • Community detection for keywords and topics by using LPA and Strongly Connected Components
  • ETL
    • Daily downloading, cleaning, extracting, and transforming of 150-180 million Wikipedia page views (total: 94 billion)
    • Extracting and transforming Politoscope data for Apache Hive and Spark SQL
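
To give a flavor of the page-view ETL step listed above, here is a minimal sketch in plain Python rather than Spark. It assumes the input follows the public Wikimedia hourly pageviews dump layout (space-separated `domain_code page_title view_count response_size`); the function names and sample lines are hypothetical and only illustrate the cleaning and aggregation logic, not our actual pipeline.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def parse_pageview_line(line: str) -> Optional[Tuple[str, str, int]]:
    """Parse one 'domain_code page_title view_count response_size' line.

    Returns (domain_code, page_title, view_count), or None when the line
    is malformed. The four-field layout is an assumption (see lead-in).
    """
    parts = line.strip().split(" ")
    if len(parts) != 4:
        return None
    domain_code, page_title, raw_count, _response_size = parts
    try:
        views = int(raw_count)
    except ValueError:
        return None
    if views < 0:
        return None
    return domain_code, page_title, views


def aggregate_views(lines: Iterable[str]) -> Counter:
    """Sum view counts per (domain_code, page_title), skipping bad lines."""
    totals: Counter = Counter()
    for line in lines:
        record = parse_pageview_line(line)
        if record is None:
            continue  # cleaning step: drop malformed records
        domain_code, page_title, views = record
        totals[(domain_code, page_title)] += views
    return totals


# Hypothetical sample input, including one malformed and one empty line.
sample_lines = [
    "en Main_Page 42 0",
    "fr Paris 7 0",
    "en Main_Page 8 0",
    "not-a-valid-record",
    "",
]

totals = aggregate_views(sample_lines)
for (domain_code, page_title), views in sorted(totals.items()):
    print(domain_code, page_title, views)
```

On a cluster, the same parse-then-sum shape would typically map onto a distributed map step followed by a per-key reduction; the single-process version here is only meant to make the logic easy to read.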