I’ve been working with ElasticSearch for a while now. I gotta say it’s a powerful open-source tool for building distributed, real-time search and analytics engines.
It also spreads shards and replicas across multiple machines to make your architecture reliable and scalable. It’s not just a simple full-text search engine, even though it does that job very well.
In my setup, ElasticSearch is connected to my MongoDB replica set and indexes tweets as they are streamed from the Twitter API. I track tweets for a few projects (health, news, UN GlobalPulse, etc.) and index some of those projects in ES on the fly.
I have indexed over 14.86 million documents in ElasticSearch and it has worked really smoothly so far. That’s not a huge number yet, but the fact that my primary database (MongoDB) and a fully functional search engine (ElasticSearch) are working together without compatibility issues is just fantastic. I have 20 to 30 updates per second and a lot of deletes (TTL documents in MongoDB) as well. Even those operations have worked without any problem for the past two weeks.
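Conceptually, keeping ES in sync with MongoDB comes down to translating each database change into an index or delete request. Here is a minimal Python sketch of that idea, not the actual code of any plugin. It assumes simplified oplog-style entries (`op` set to `i`, `u`, or `d`), assumes that updates carry a full replacement document, and uses hypothetical index and type names (`tweets`, `tweet`). The output follows the newline-delimited format of the ElasticSearch bulk API.

```python
import json


def oplog_to_bulk_lines(entry, index, doc_type):
    """Translate one simplified oplog-style entry into bulk API lines.

    Assumption: updates ("u") carry a full replacement document in "o";
    partial updates such as {"$set": ...} are rejected in this sketch.
    """
    op = entry["op"]
    if op in ("i", "u"):
        doc = dict(entry["o"])
        if any(key.startswith("$") for key in doc):
            raise ValueError("partial updates are not handled in this sketch")
        if op == "i":
            doc_id = str(doc.pop("_id"))
        else:
            # For updates, the oplog keeps the target _id in "o2".
            doc.pop("_id", None)
            doc_id = str(entry["o2"]["_id"])
        meta = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        return [json.dumps(meta), json.dumps(doc)]
    if op == "d":
        doc_id = str(entry["o"]["_id"])
        meta = {"delete": {"_index": index, "_type": doc_type, "_id": doc_id}}
        return [json.dumps(meta)]
    return []  # ignore anything else (no-ops, commands)


def build_bulk_body(entries, index, doc_type):
    """Join all action lines; the bulk API expects a trailing newline."""
    lines = []
    for entry in entries:
        lines.extend(oplog_to_bulk_lines(entry, index, doc_type))
    return "\n".join(lines) + "\n" if lines else ""


# Hypothetical sample entries: one new tweet, one expired (TTL) tweet.
sample_entries = [
    {"op": "i", "o": {"_id": 1, "text": "hello", "lang": "en"}},
    {"op": "d", "o": {"_id": 2}},
]
print(build_bulk_body(sample_entries, "tweets", "tweet"))
```

The resulting string could then be sent to the `_bulk` endpoint with any HTTP client. TTL deletes show up here as ordinary delete entries, which is why they need no special handling.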
Here is my architecture (it’s just for development and experimentation):
One large EC2 instance (m1.large, 7.5 GB memory) for ElasticSearch. CPU usage is around 15%-25%, and this instance also has a few Node.js apps up and running 24/7.
Two EC2 instances (m2.2xlarge, 34 GB memory) for the MongoDB replica set. The replica set is connected to ElasticSearch through a plugin named elasticsearch-river-mongodb. My ES version is 0.90.11 on Ubuntu 12.04 and my MongoDB version is 2.4.9.
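A river like this is typically registered by PUTting a small JSON document to `_river/<river name>/_meta` on the ES node. The Python sketch below only builds that document and URL. It follows the layout described in the elasticsearch-river-mongodb README as best I can tell, so treat the field names as an assumption to check against the plugin version in use. The database, collection, index, river, and host names are all hypothetical placeholders.

```python
import json


def build_river_meta(db, collection, index_name, index_type, servers=None):
    """Build the _meta document for a MongoDB river (assumed layout)."""
    meta = {
        "type": "mongodb",
        "mongodb": {"db": db, "collection": collection},
        "index": {"name": index_name, "type": index_type},
    }
    if servers:
        # Optional list of replica set members as (host, port) pairs.
        meta["mongodb"]["servers"] = [
            {"host": host, "port": port} for host, port in servers
        ]
    return meta


def river_meta_url(es_base_url, river_name):
    """URL that the _meta document would be PUT to."""
    return "{}/_river/{}/_meta".format(es_base_url.rstrip("/"), river_name)


# Hypothetical values, for illustration only.
river_meta = build_river_meta(
    db="twitter",
    collection="tweets",
    index_name="tweets",
    index_type="tweet",
    servers=[("mongo-1.example.internal", 27017),
             ("mongo-2.example.internal", 27017)],
)
print(river_meta_url("http://localhost:9200/", "tweets_river"))
print(json.dumps(river_meta, indent=2))
```

Sending this document with an HTTP PUT (curl or any client) is what starts the river. The sketch stops short of the request itself so that it runs without a live cluster.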
Btw, Marvel is a great dashboard for monitoring your ElasticSearch cluster. (The CPU usage is high because I just connected one of my MongoDB collections with 16 million documents and, as can be seen in the figure, it is indexing 300 to 400 documents per second.)
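To put that indexing rate in perspective, here is a quick back-of-the-envelope estimate in Python of how long the initial indexing of such a collection would take. It assumes the rate stays constant somewhere between 300 and 400 documents per second, which a real cluster will not do exactly, so read the result as a rough range.

```python
def backfill_hours(total_docs, docs_per_second):
    """Hours needed to index total_docs at a constant rate."""
    return total_docs / docs_per_second / 3600.0


TOTAL_DOCS = 16_000_000  # collection size mentioned above

slowest = backfill_hours(TOTAL_DOCS, 300)  # roughly 14.8 hours
fastest = backfill_hours(TOTAL_DOCS, 400)  # roughly 11.1 hours
print("Estimated backfill: {:.1f} to {:.1f} hours".format(fastest, slowest))
```

So at this pace, the 16 million documents would take roughly 11 to 15 hours to index.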