Elasticsearch is currently one of the most popular ways to implement free-text search and analytics in applications. It is highly scalable and can manage petabytes of data. It supports a variety of use cases, such as letting users search through a portal, collecting and analyzing log data, and building business intelligence dashboards to quickly analyze and visualize data.
This blog acts as an introduction to Elasticsearch and covers the basic concepts of clusters, nodes, indices, documents and shards.
What is Elasticsearch?
Elasticsearch (ES) is an open source, distributed, highly scalable data store built on top of Apache Lucene, a search engine library that supports extremely fast full-text search. It is a well-crafted piece of software that hides Lucene's internal complexity and exposes full-text search through simple REST APIs. Elasticsearch is written in Java. I should be clear that Elasticsearch is not like a traditional RDBMS. It is not suited to transactional workloads and hence, in my opinion, it should not be your primary data store. It is common practice to use a relational database as the primary data store and index only the required data into Elasticsearch. Elasticsearch is meant for fast text search, and several characteristics set it apart from an RDBMS: it stores data as denormalized JSON documents, and it does not support transactions, referential integrity, joins, or subqueries.
Elasticsearch works with structured, semi-structured and unstructured data as well. In the next section, let's walk through the various components in Elasticsearch.
One or more nodes (servers) together form a cluster, which holds your entire data and provides indexing and search capabilities. A cluster can be as small as a single node or can scale to hundreds or thousands of nodes. Each cluster is identified by a unique name.
A node is a single physical or virtual machine which holds all or part of your data and provides computing power for indexing and searching it. Every node is identified by a unique name. If a node identifier is not specified, a random UUID is assigned as the node identifier at startup. Every node configuration has the property `cluster.name`. At startup, a cluster is formed automatically from all the nodes that share the same `cluster.name`.
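To make the role of `cluster.name` concrete, here is a minimal sketch in Python that groups a few hypothetical nodes by that setting. It only illustrates the grouping idea; real cluster formation also depends on network discovery, and the node and cluster names below are made up for the example.

```python
from collections import defaultdict


def form_clusters(nodes):
    """Group node names by the value of their cluster.name setting."""
    clusters = defaultdict(list)
    for node in nodes:
        clusters[node["cluster.name"]].append(node["node.name"])
    return dict(clusters)


# Hypothetical node configurations (names are assumptions for illustration).
nodes = [
    {"node.name": "node-1", "cluster.name": "store-prod"},
    {"node.name": "node-2", "cluster.name": "store-prod"},
    {"node.name": "node-3", "cluster.name": "store-test"},
]

clusters = form_clusters(nodes)
print(clusters)
```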
A node has to accomplish several duties like:
- storing the data
- performing operations on data (indexing, searching, aggregation, etc.)
- maintaining the health of the cluster
Each node in a cluster is capable of doing all these operations. Elasticsearch provides the capability to split responsibilities across different nodes. This makes it easy to scale, optimize and maintain the cluster. Based on the responsibilities, following are the different types of nodes that are supported:
A data node is a node which has storage and computation capability. A data node stores part of the data in the form of shards (explained in a later section of the article). Data nodes also participate in CRUD, search and aggregation operations. These operations are resource intensive, so it is good practice to have dedicated data nodes that do not carry the additional load of cluster administration. By default, every node of the cluster is a data node.
Master nodes are reserved for administrative tasks. The master node tracks the availability and failure of the data nodes, and it is responsible for creating and deleting indices (indices are explained in a later section of the article).
This makes the master node a critical part of the Elasticsearch cluster. It has to be stable and healthy at all times. A single master node for a cluster is certainly a single point of failure, so Elasticsearch allows multiple master-eligible nodes. All the master-eligible nodes participate in an election to choose the master. It is recommended to have a minimum of three master-eligible nodes in the cluster so that a majority can still be established if one of them fails, which avoids a split-brain situation. By default, every node is both a data node and a master-eligible node. A node can be made a dedicated master-eligible node through explicit configuration.
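The reasoning behind the three-node recommendation can be sketched with simple majority arithmetic. The snippet below is an illustration, not Elasticsearch's election code: it assumes a master can only be elected by a side of a network partition that sees a strict majority of the master-eligible nodes, which is the idea older releases exposed through the `discovery.zen.minimum_master_nodes` setting. The function names are made up for this example.

```python
def quorum(master_eligible_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return master_eligible_nodes // 2 + 1


def can_elect_master(visible_nodes: int, master_eligible_nodes: int) -> bool:
    """A partition may elect a master only if it sees a majority."""
    return visible_nodes >= quorum(master_eligible_nodes)


# Three master-eligible nodes split by a network partition into 2 + 1:
# only the larger side can elect a master, so there is never a second one.
print(can_elect_master(2, 3))  # larger side
print(can_elect_master(1, 3))  # isolated node

# With two master-eligible nodes, losing either one blocks elections entirely.
print(can_elect_master(1, 2))
```

With three master-eligible nodes the cluster tolerates the loss of one and still cannot end up with two masters, which is why three is the usual minimum.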
Any node which is neither a master node nor a data node ends up serving as a coordinating node. Coordinating nodes act as smart load balancers. They receive the end user requests and route them appropriately to the data nodes and master nodes.
To take an example, a user's search request is sent to different data nodes. Each data node performs the search locally and sends its result back to the coordinating node. The coordinating node aggregates the results and returns the final result to the user.
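The scatter-gather flow described above can be sketched in a few lines of Python. This is a toy model under stated assumptions: each "data node" is just a list of documents, the score is a plain count of how often the term occurs, and the function names are invented for the example. Elasticsearch's real scoring and two-phase query/fetch process are more involved.

```python
import heapq


def search_locally(local_docs, term, size):
    """Each data node scores its own documents and returns its top hits."""
    hits = []
    for doc in local_docs:
        score = doc["text"].lower().split().count(term)
        if score > 0:
            hits.append((score, doc["id"]))
    return heapq.nlargest(size, hits)


def coordinate(data_nodes, term, size):
    """The coordinating node fans the query out and merges the partial results."""
    partial_results = []
    for local_docs in data_nodes:
        partial_results.extend(search_locally(local_docs, term, size))
    return [doc_id for _score, doc_id in heapq.nlargest(size, partial_results)]


data_nodes = [
    [{"id": "a", "text": "red phone red case"}, {"id": "b", "text": "blue phone"}],
    [{"id": "c", "text": "red red red cable"}, {"id": "d", "text": "red charger"}],
]

top_hits = coordinate(data_nodes, "red", size=2)
print(top_hits)  # ['c', 'a']
```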
There are a few concepts that are core to Elasticsearch. Understanding these concepts from the outset will tremendously help ease the learning process.
An index is a container for data, similar to a database in the relational world. An index contains a collection of documents that have similar characteristics or are logically related. If we take the example of an e-commerce website, there will be one index for products, one for customers and so on. An index is identified by a lowercase name. The index name is required to add, update or delete a document in it.
A type is a logical grouping of the documents within an index. In the previous example of the products index, we can further group documents into types like electronics, fashion, furniture, etc. Types are defined based on documents having similar properties. It is difficult to decide when to use a type over an index. Indices carry more overhead, so sometimes using different types in the same index yields better performance. There are a couple of restrictions on using types as well. For example, two fields with the same name in different types of the same index must have the same datatype (string, date, etc.). Note that this applies to older releases: indices created in Elasticsearch 6.x can have only one type, and later versions remove types altogether.
A document is the basic unit of information that can be indexed. A document is represented in JSON format. We can add as many documents as we want to an index. The following snippet shows how to create a document of type mobile in the index store. We will cover the individual fields of the document in more detail in the Mapping Type section.
HTTP POST <hostname:port>/store/mobile/
{
  "name": "Motorola G5",
  "features": "16 GB ROM | Expandable Upto 128 GB | 5.2 inch Full HD Display | 12MP Rear Camera | 5MP Front Camera | 3000 mAh Battery | Snapdragon 625 Processor"
}
We already know what a type is. To create different types in an index, we need mapping types (or simply mappings) to be specified at the time of index creation. A mapping can be thought of as a list of directives given to Elasticsearch about how the data is supposed to be stored and retrieved. It is important to provide mapping information at index creation time, based on how we want to retrieve our data later. In the context of relational databases, think of a mapping as a table schema.
A mapping describes how to treat each field of the JSON document, for example whether the field is a date, a geo-location or a person's name. Mappings also specify which fields participate in full-text search and which analyzers are used to transform the data before it is stored in the index. If no mapping is provided, Elasticsearch tries to infer the schema itself, which is known as Dynamic Mapping.
Each mapping type has Meta Fields and Properties. Below snippet shows the mapping of the type mobile.
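As a rough sketch of what a mapping for the mobile type could look like, the Python dictionary below serializes to a JSON body of the shape used by Elasticsearch versions that still support named types. The `name` and `features` fields come from the document above; `price` and `launch_date` are assumptions added purely for illustration.

```python
import json

# Index-creation body with a mapping for the "mobile" type.
# "price" and "launch_date" are hypothetical fields, not from the article.
mobile_mapping = {
    "mappings": {
        "mobile": {
            "properties": {
                "name": {"type": "keyword"},
                "features": {"type": "text"},
                "price": {"type": "float"},
                "launch_date": {"type": "date"},
            }
        }
    }
}

# This JSON would be sent as the body of: PUT <hostname:port>/store
print(json.dumps(mobile_mapping, indent=2))
```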
As the name indicates, meta fields store additional information about the document. Meta fields are mostly meant for internal use, and it is unlikely that an end user has to deal with them. Meta field names start with an underscore. There are around 10 meta fields in total. We will talk about some of the important ones.
The _index field stores the name of the index the document belongs to. It is used internally to store and search the document within an index.
The _type field stores the type of the document. To get better performance, it is often included in search queries.
The _id field is the unique id of the document. It is used to access a specific document directly over the HTTP GET API.
The _source field holds the original JSON document before any analyzers or transformations are applied. It is important to note that Elasticsearch can only query on fields which are indexed (that is, fields a mapping was provided for). The _source field is not indexed and hence cannot be queried on, but it can be included in the final search result.
Fields Or Properties
The list of fields specifies which JSON fields in the document should be included in a particular type. In the e-commerce website example, mobile can be a type. It will have fields like operating_system, camera_specification, ram_size, etc.
Fields also carry data type information. This directs Elasticsearch to treat specific fields in a particular way when storing and searching data. Data types are similar to what we see in any programming language. We will talk about a few of them here.
Simple Data Types
The text datatype is used to store full text, such as a product description. These fields participate in full-text search. They are analyzed while being stored, which makes it possible to search them by the individual words they contain. Text fields are not used in sorting and aggregation queries.
The keyword datatype is also used to store text data, but unlike text it is not analyzed and is stored as is. It is suitable for information like a user's mobile number, city, age, etc. These fields are used in filter, aggregation and sorting queries, for example to list all users from a particular city or filter them by age.
Elasticsearch supports a wide range of numeric types: long, integer, short, byte, double and float.
A few more data types are supported: date (to store dates in a wide range of formats), boolean (true / false, on / off, 1 / 0) and ip (to store IP addresses).
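The difference between analyzed and non-analyzed storage can be illustrated with a toy inverted index. This is a sketch, not how Lucene is implemented: the "analyzer" here just lowercases and splits on whitespace, whereas real analyzers do much more, and the sample documents and function names are made up for the example.

```python
def analyze(value):
    """Toy analyzer: lowercase the text and split it into words."""
    return value.lower().split()


def index_as_text(docs, field):
    """Analyzed field: every individual word points back to the document."""
    index = {}
    for doc_id, doc in docs.items():
        for token in analyze(doc[field]):
            index.setdefault(token, set()).add(doc_id)
    return index


def index_as_keyword(docs, field):
    """Non-analyzed field: only the exact, complete value is indexed."""
    index = {}
    for doc_id, doc in docs.items():
        index.setdefault(doc[field], set()).add(doc_id)
    return index


docs = {
    1: {"description": "Slim phone with a bright display", "city": "New Delhi"},
    2: {"description": "Rugged phone with a large battery", "city": "New York"},
}

text_index = index_as_text(docs, "description")
keyword_index = index_as_keyword(docs, "city")

print(text_index.get("phone", set()))         # both documents match one word
print(keyword_index.get("New Delhi", set()))  # exact value matches
print(keyword_index.get("New", set()))        # a partial value does not
```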
Special Data Types
The geo_point data type is used to store a geographical location. It accepts a latitude and longitude pair. For example, this data type can be used to arrange a user's photo library by geographical location, or to graphically display the locations which are trending on social media.
The geo_shape data type allows storing arbitrary geometric shapes like rectangles and polygons.
The completion datatype is used to provide an auto-completion feature over a specific field. As the user types, the completion suggester can guide the user towards particular results.
Complex Data Type
If you know JSON well, the Object datatype is not a new concept. Elasticsearch allows storing a nested JSON object structure as part of a document.
The Object data type is often less useful than it appears because of its underlying representation in the Lucene index. A Lucene index does not support inner JSON objects, so ES flattens the original JSON to make it compatible with Lucene. As a result, the fields of multiple inner objects get merged into one, which can lead to wrong search results. Most of the time, what you want to use is the Nested datatype rather than Object.
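The following sketch shows why flattening can produce wrong matches. It models the flattened form as one list of values per dotted field name and compares it with matching each inner object on its own, which is the behaviour the Nested datatype preserves. The `user` field, the sample names and the helper functions are assumptions for illustration only.

```python
def flatten(field, objects):
    """Merge an array of inner objects into one list of values per field."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


def matches_flattened(flat, conditions):
    """Each condition is checked independently against the merged values."""
    return all(value in flat.get(key, []) for key, value in conditions.items())


def matches_nested(objects, conditions):
    """All conditions must hold within the same inner object."""
    return any(
        all(obj.get(key) == value for key, value in conditions.items())
        for obj in objects
    )


users = [
    {"first": "John", "last": "Smith"},
    {"first": "Alice", "last": "White"},
]

flat = flatten("user", users)
print(flat)

# There is no user called "Alice Smith", yet the flattened form matches.
print(matches_flattened(flat, {"user.first": "Alice", "user.last": "Smith"}))  # True
print(matches_nested(users, {"first": "Alice", "last": "Smith"}))              # False
```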
Shards are what make Elasticsearch horizontally scalable. An index can store millions of documents and occupy terabytes of data, which on a single machine can cause problems with performance, scalability and maintenance. Let's see how shards help achieve scalability.
Indices are divided into multiple units called shards. A shard is a fully featured subset of an index. Shards of the same index can reside on the same or on different nodes of the cluster. The number of shards determines the degree of parallelism for search and indexing operations, and shards allow the cluster to grow horizontally. The number of shards per index can be specified at the time of index creation. By default, 5 shards are created (this was the default before Elasticsearch 7.0, which changed it to 1). Once the index is created, the number of shards cannot be changed. To change the number of shards, the data needs to be re-indexed.
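One way to see why the shard count is fixed is to look at how a document is assigned to a shard: the shard is derived from a hash of the document's routing value (its id by default) modulo the number of primary shards. The sketch below uses CRC32 as a stand-in hash, which is an assumption made only because it is deterministic and available in the standard library; Elasticsearch uses its own hash function.

```python
import zlib


def shard_for(routing_key: str, number_of_shards: int) -> int:
    """Pick a shard from a hash of the routing key (CRC32 is a stand-in)."""
    return zlib.crc32(routing_key.encode("utf-8")) % number_of_shards


doc_ids = [f"doc-{i}" for i in range(20)]

placement_5 = {doc_id: shard_for(doc_id, 5) for doc_id in doc_ids}
placement_6 = {doc_id: shard_for(doc_id, 6) for doc_id in doc_ids}

# Documents whose shard would differ if the index had 6 shards instead of 5.
moved = [doc_id for doc_id in doc_ids if placement_5[doc_id] != placement_6[doc_id]]
print(f"{len(moved)} of {len(doc_ids)} documents would map to a different shard")
```

Because the divisor is part of the formula, changing the shard count would send lookups for existing documents to the wrong shard, which is why the data has to be re-indexed.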
Hardware can fail at any time. To ensure fault tolerance and high availability, ES provides a feature to replicate the data. Shards can be replicated. A shard which is being copied is called a primary shard. The copy of the primary shard is called a replica shard, or simply a replica. Similar to the number of shards, the number of replicas can also be specified at the time of index creation. Replication serves two purposes:
- High Availability - A replica is never created on the same node where its primary shard is present. This ensures that even if a complete node fails, the data is still available through the replica shard.
- Performance - Replicas also contribute to search capacity. Search queries are executed in parallel across the replicas.
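The high-availability rule above can be sketched as a small allocation routine. This is a simplified round-robin model and not Elasticsearch's actual allocator: it places each copy of a shard on a different node and raises an error when there are too few nodes, whereas a real cluster would simply leave such replicas unassigned. Node names and the `allocate` helper are assumptions for the example.

```python
def allocate(num_primaries, num_replicas, nodes):
    """Place every copy of a shard on a distinct node (round-robin sketch)."""
    if num_replicas + 1 > len(nodes):
        raise ValueError("not enough nodes to keep copies on separate nodes")
    allocation = {node: [] for node in nodes}
    for shard in range(num_primaries):
        for copy in range(num_replicas + 1):
            node = nodes[(shard + copy) % len(nodes)]
            role = "P" if copy == 0 else "R"  # P = primary, R = replica
            allocation[node].append(f"{role}{shard}")
    return allocation


def shards_available(allocation, failed_node):
    """Shard numbers that still have at least one copy after a node fails."""
    return {
        int(copy[1:])
        for node, copies in allocation.items()
        if node != failed_node
        for copy in copies
    }


allocation = allocate(num_primaries=3, num_replicas=1, nodes=["node-1", "node-2", "node-3"])
print(allocation)

# Losing any single node still leaves a copy of every shard.
print(shards_available(allocation, failed_node="node-2"))
```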
To summarize, each index can be split into multiple shards. An index can also be replicated zero (meaning no replicas) or more times. Once replicated, each index will have primary shards (the original shards that were replicated from) and replica shards (the copies of the primary shards). The number of shards and replicas can be defined per index at the time the index is created. After the index is created, you may change the number of replicas dynamically anytime but you cannot change the number of shards.
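As a small illustration of the summary above, the sketch below builds the two request bodies involved: one that sets `number_of_shards` and `number_of_replicas` when the index is created, and one that changes only the replica count afterwards. The helper names and the index name `store` in the comments are assumptions; the bodies are shown as Python dictionaries and are not sent anywhere.

```python
import json


def create_index_body(shards: int, replicas: int) -> dict:
    """Settings supplied once, when the index is created."""
    return {"settings": {"number_of_shards": shards, "number_of_replicas": replicas}}


def update_replicas_body(replicas: int) -> dict:
    """Settings change that is allowed on an existing index."""
    return {"index": {"number_of_replicas": replicas}}


# Body for: PUT <hostname:port>/store
create_body = create_index_body(shards=5, replicas=1)
print(json.dumps(create_body))

# Body for: PUT <hostname:port>/store/_settings
update_body = update_replicas_body(replicas=2)
print(json.dumps(update_body))
```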
In this blog, we have covered the basic but important aspects of Elasticsearch. In the following posts, I will talk about how indexing and searching work in detail. Stay tuned!
For any questions, do use the comment sections below or email us at email@example.com