NOTE: This position is open for the year 2022/2023.
We are looking for students who are interested in Data Science and Machine Learning. This is a great opportunity to work with cutting-edge technologies and billions of documents. You will be working on the Multivac Platform developed at ISC-PIF: our platform is one of the biggest academic repositories, with over 15 billion documents hosted across 80 servers, both dedicated machines and cloud services.
Diploma required: Bac + 5 in a quantitative field (applied mathematics, statistics, computing…)
Internship starting date: Flexible
Duration: 2–6 months
Salary policy: the internship is paid according to the legal wage rates (approx. 560€/month)
About the Institute of Complex Systems (ISC-PIF)
Created in 2005, ISC-PIF is a CNRS service and research unit dedicated to the inter-institutional and inter-disciplinary development of research on complex systems. At once a research laboratory, project incubator, shared resource center, conference center, and academic co-working space, this scientific hub provides researchers with a dynamic research environment and innovative tools based on big data and high-performance computing.
Institut des Systèmes Complexes de Paris, 113 rue Nationale, 75013 Paris
About the internship
You will be working on the Multivac Platform developed at ISC-PIF. The Multivac Platform is one of the biggest academic repositories, with over 75 billion documents hosted across 100 servers, both dedicated machines and cloud services. The datasets contain metadata from published scientific papers and from social networks, covering a wide range of topics. The platform serves as an interface between researchers and Big Data, especially in the domains of NLP and text mining. It offers services such as comprehensive dashboards that let scientists explore large-scale data and discover facts through visualizations. It also offers API access that allows researchers to use this large architecture and its computing power without any prior technical knowledge. In addition, the Multivac Data Science Lab offers interactive notebooks on an Apache Hadoop/Spark cluster in a private cloud.
Why Multivac Platform:
The Multivac Platform is built with cutting-edge technologies such as:
- Large-scale databases (MongoDB and Redis with over 12 billion documents)
- Search engine clusters (Elasticsearch/Kibana with over 6 billion documents)
- Distributed computation and real-time processing (RabbitMQ, Node.js, etc.)
- Cloudera Hadoop 2.0 with interactive Spark notebooks (HDFS, YARN, Apache Spark, Apache Hive, Apache HBase, Apache Zeppelin, Hue, etc.)
- Cloud services (OpenStack)
You will get to learn about these technologies and have access to the Multivac Data Science Lab. The Multivac Platform hosts over 14 billion documents, with over 50 million new documents every day.
Requirements:
- Master's degree in Statistics or Data Science
- Basic knowledge of Machine Learning Algorithms
- Good knowledge of Scala, Python, or R
- Good knowledge of Deep Learning libraries (BigDL, TensorFlow, etc.)
- Strong knowledge of text mining in social networks
- Interest in NLP tasks and Graph analytics
- Experience with Twitter datasets and other REST API services (Bonus)
- Familiarity with Apache Spark or any other Hadoop components (Bonus)
Responsibilities (two or more):
- Work on unsupervised learning algorithms for topic detection
- Work on supervised learning algorithms for classification and prediction
- Develop and optimize our existing LDA implementations
- Develop algorithms to perform NLP tasks such as clustering, topic detection, etc. (Stanford CoreNLP)
- Implement algorithms to improve sentiment analysis and mention clustering
- Develop and implement methods for automatic detection of opinions in tweets
- Implement methods for keyword extraction from scientific publications
How to Apply:
Please email your job application (reference in the subject line: Multivac Intern) including a cover letter, a resume, and an indication of availability date to maziyar dot panahi at iscpif dot fr.