We are looking for students who are interested in Data Science and Machine Learning! This is a great opportunity to work with cutting-edge technologies and billions of documents. You will be working on the Multivac Platform developed at ISCPIF : our platform is one of the biggest academic repositories, with over 15 billion documents hosted across 80 servers, on dedicated machines and cloud services.
Diploma required : Bac + 5 in a quantitative field (applied mathematics, statistics, computing…)
Internship starting date : Flexible
Duration : 2 to 6 months
Salary policy : the internship is paid according to the legal wage rates (approx. 560€/month)
About the Institute of Complex Systems (ISC-PIF)
An interface between disciplines, as well as between research organisations and higher education, the ISC-PIF is dedicated to the development of innovative and interdisciplinary research on complex systems. Since 2005, the institute has facilitated access to skills, training, workspaces and pooled research resources based on high-performance computing and big data. The institute is a unit of the National Center for Scientific Research (CNRS), one of the largest French research organisations.
Institut des Systèmes Complexes de Paris, 113 rue Nationale, 75013 Paris
About the internship
You will be working on the Multivac Platform developed at ISCPIF. Multivac Platform is one of the biggest academic repositories, with over 75 billion documents hosted across 100 servers, on dedicated machines and cloud services. The datasets contain metadata from published scientific papers and social networks, covering a wide range of topics. Multivac Platform is meant as an interface between researchers and Big Data, especially in the domains of NLP and text mining. It offers services such as comprehensive dashboards that let scientists explore large-scale data and discover facts through visualisations. It also offers API access that allows researchers to exploit this architecture and its computing power without any prior technical knowledge. In addition, the Multivac Data Science Lab offers interactive notebooks over an Apache Hadoop/Spark cluster in a private cloud.
Why Multivac Platform :
Multivac Platform is built with cutting-edge technologies such as:
- Large-scale databases (MongoDB and Redis with over 12 billion documents)
- Search engine clusters (Elasticsearch/Kibana with over 6 billion documents)
- Distributed computations and real-time processing (RabbitMQ, NodeJS, etc.)
- Cloudera Hadoop 2.0 with interactive Spark notebooks (HDFS, YARN, Apache Spark, Apache Hive, Apache HBase, Apache Zeppelin, Hue, etc.)
- Cloud services (OpenStack)
You will learn about these technologies and have access to the Multivac Data Science Lab. Multivac Platform hosts over 14 billion records, with over 50 million new records every day.
Requirements :
- Master's in Statistics or Data Science
- Basic knowledge of Machine Learning Algorithms
- Basic knowledge of Scala, Python or R
- Basic knowledge of text mining in social networks
- Interest in NLP tasks and Graph analytics
- (Bonus) Experience with Twitter datasets and other REST API services
- (Bonus) Experience with Apache Spark or any other Hadoop components
- (Bonus) Experience with a Deep Learning library (BigDL, TensorFlow …)
Responsibilities (two or more) :
- Work on unsupervised learning algorithms for topic detection
- Work on supervised learning algorithms for classification and prediction
- Develop and optimise our existing LDA implementations
- Implement algorithms to perform NLP tasks such as clustering, topic detection, etc. (StanfordCoreNLP)
- Implement algorithms to improve sentiment analysis and mention clustering
- Implement methods for the automatic detection of opinions in Tweets
- Implement methods for keyword extraction from scientific publications
How to Apply :
Please email your application (subject line: Multivac Intern), including a cover letter, a résumé and your availability date, to maziyar dot panahi at iscpif dot fr.