Internship – Data Science and Machine Learning

Multivac Big Data Architecture

The Institute of Complex Systems based in Paris is looking for an intern specialised in data science and machine learning to work on our Multivac Platform.

Diploma required : Bac + 5 in a quantitative field (applied mathematics, statistics, computing…)
Internship starting date : Flexible
Duration : 2 to 6 months
Salary policy : the internship is paid according to the legal wage rates (approx. 500€/month)

About the Institute of Complex Systems (ISC-PIF)

An interface between disciplines, and between research organisations and higher education, the ISC-PIF is a place dedicated to the development of innovative and interdisciplinary research on complex systems. Since 2005, the institute has facilitated access to skills, training, work areas and pooled research resources based on high performance computing and big data. The institute is a unit of the National Center for Scientific Research (CNRS), one of the largest French research organisations.

Address :
Institut des Systèmes Complexes de Paris 113 rue Nationale 75013 Paris

About the internship

Description :
You will be working on the Multivac Platform developed at ISC-PIF. Multivac Platform is one of the biggest academic repositories, with over 14 billion documents hosted across 80 servers, both dedicated machines and cloud services. The datasets contain metadata from published scientific papers and social networks, covering a wide range of topics. Multivac Platform is meant as an interface between researchers and Big Data, especially in the domains of NLP and text mining. It offers services such as comprehensive dashboards that let scientists explore large-scale data and discover facts through visualisations. It also offers API access that allows researchers to use this architecture and its computing power without any prior technical knowledge. In addition, Multivac Data Science Lab offers interactive notebooks on an Apache Hadoop/Spark cluster in a private cloud.

Why Multivac Platform :

Multivac Platform is built with cutting-edge technologies such as:

  • Large-scale databases (MongoDB and Redis with over 9 billion documents)
  • Search engine clusters (Elasticsearch/Kibana with over 5 billion documents)
  • Distributed computations and real-time processing (RabbitMQ, NodeJS, etc.)
  • Cloudera Hadoop 2.0 (HDFS, YARN, Apache Spark, Apache Hive, Apache HBase, etc.)
  • Cloud services (OpenStack)

You will learn about these technologies and have access to the Multivac Data Science Lab. Multivac Platform hosts over 14 billion documents and receives over 50 million more every day.

Multivac Data Science Lab

Requirements : 

  • Master's degree in Statistics or Data Science
  • Experience implementing machine learning algorithms
  • Interest in NLP tasks and neural networks
  • Experience with Scala, Python or R
  • Experience with text mining in social networks
  • (Bonus) Experience with Twitter datasets and the Twitter APIs
  • (Bonus) Experience with Apache Spark or any other Hadoop components
  • (Bonus) Experience with a deep learning library (Theano, TensorFlow …)

Responsibilities (two or more) :

  • Work on unsupervised learning algorithms for topic detection
  • Work on supervised learning algorithms for classification and prediction
  • Develop and optimise our existing LDA implementations
  • Implement algorithms to perform NLP tasks such as clustering, topic detection, etc. (Stanford CoreNLP)
  • Implement algorithms to improve sentiment analysis and mention clustering
  • Implement methods for automatic detection of opinions in tweets
  • Implement methods for keyword extraction from scientific publications


How to Apply :
Please email your application (reference in subject line: Multivac Intern), including a cover letter, a resume and your availability dates, to maziyar dot panahi at iscpif dot fr.


Former Interns at Multivac: