Elasticsearch - Basic and Advanced Concepts

Snehal Shinde

Data Engineering

Tags: Elasticsearch, Indexing, Analyzer, Query Performance

What is Elasticsearch?

As we saw in our previous blog, Elasticsearch is a highly scalable, open-source full-text search and analytics engine built on top of Apache Lucene. It allows you to store, search, and analyze huge volumes of data quickly and in near real-time.

Basic Concepts -

Index - A large collection of JSON documents, comparable to a database in relational databases. Every document must reside in an index.
Shards - Since there is no limit on the number of documents that reside in an index, indices are usually partitioned horizontally into shards that reside on nodes in the cluster.
Max documents allowed in a shard = 2,147,483,519 (as of now)
Type - A logical partition of an index, similar to a table in relational databases. Note that mapping types were deprecated in Elasticsearch 6.x and removed in later versions.
Fields - Similar to columns in relational databases.
Analyzers - Used while indexing and searching documents. They contain “tokenizers” that split text into tokens and “token filters” that filter or modify those tokens.
Mappings - Combination of fields and analyzers. A mapping defines how your fields are stored and indexed.

Inverted Index

ES uses inverted indexes under the hood. An inverted index maps each term to the documents that contain it.

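Let's say we have 3 documents:

1. Food is great
2. It is raining
3. Wind is strong

An inverted index for these documents (with terms lower-cased) can be constructed as -

food → [1]
great → [1]
is → [1, 2, 3]
it → [2]
raining → [2]
strong → [3]
wind → [3]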

The terms in the dictionary are stored in sorted order so that they can be found quickly.

Searching for multiple terms is done by looking up each term in the index. ES then performs either a UNION or an INTERSECTION on the resulting document lists and fetches the matching documents.
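To make the lookup concrete, here is a minimal Python sketch of an inverted index over the three example documents. It is only an illustration, not how Lucene stores its index: the function names `build_inverted_index` and `search` are made up for this example, and terms are produced by simply lower-casing and splitting on whitespace.

```python
from collections import defaultdict

docs = {
    1: "Food is great",
    2: "It is raining",
    3: "Wind is strong",
}

def build_inverted_index(documents):
    """Map each lower-cased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Keep the term dictionary in sorted order.
    return dict(sorted(index.items()))

def search(index, terms, mode="and"):
    """Look up every term, then intersect (and) or union (or) the postings."""
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return set()
    combine = set.intersection if mode == "and" else set.union
    return combine(*postings)

inverted = build_inverted_index(docs)
for term, ids in inverted.items():
    print(term, "->", sorted(ids))

print(search(inverted, ["is", "great"], mode="and"))  # documents containing both terms
print(search(inverted, ["food", "wind"], mode="or"))  # documents containing either term
```

An "and" search corresponds to the INTERSECTION of the posting lists and an "or" search to their UNION.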

An ES index spans multiple shards. While indexing, each document is routed to a shard based on a hash of its routing value, which defaults to the document's _id. We can customize which shard a document is routed to, and which shards search requests are sent to, by supplying a custom routing value.
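Hash-based routing picks a shard roughly as hash(routing value) modulo the number of primary shards. The sketch below illustrates the idea only: `pick_shard` is a hypothetical helper, and `zlib.crc32` is a stand-in for the Murmur3 hash that Elasticsearch actually uses, so the shard numbers it prints will not match a real cluster.

```python
import zlib

def pick_shard(routing_value: str, number_of_primary_shards: int) -> int:
    """Return a shard number for a routing value.

    Assumption: zlib.crc32 is only a stand-in hash for illustration;
    Elasticsearch itself hashes the routing value with Murmur3.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_primary_shards

shards = 5
for doc_id in ["user-1", "user-2", "user-3"]:
    print(doc_id, "-> shard", pick_shard(doc_id, shards))
```

Because the shard is derived from the routing value, the same value always lands on the same shard. This is also why the number of primary shards cannot simply be changed after the index is created.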

An ES index is made up of multiple Lucene indexes (one per shard), which in turn are made up of index segments. Segments are write-once, read-many: the index files Lucene writes are immutable (except for deletions).

Analyzers -

Analysis is the process of converting text into tokens or terms, which are added to the inverted index for searching. Analysis is performed by an analyzer, which can be either built-in or custom.

We can define a single analyzer for both indexing and searching, or a separate index analyzer and search analyzer for a mapping.

Building blocks of an analyzer -

Character filters - receive the original text as a stream of characters and can transform the stream by adding, removing, or changing characters.
Tokenizers - receive a stream of characters and break it up into individual tokens.
Token filters - receive the token stream and may add, remove, or change tokens.
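The three building blocks run in a fixed order: character filters, then one tokenizer, then token filters. The Python sketch below mimics that pipeline with plain functions. All names here (`strip_html`, `whitespace_tokenizer`, `lowercase_filter`, `analyze`) are hypothetical and are not Elasticsearch APIs; the sketch only shows how the stages compose.

```python
import re
from typing import Callable, List

def strip_html(text: str) -> str:
    """Character filter: remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)

def whitespace_tokenizer(text: str) -> List[str]:
    """Tokenizer: split the character stream on whitespace."""
    return text.split()

def lowercase_filter(tokens: List[str]) -> List[str]:
    """Token filter: lower-case every token."""
    return [token.lower() for token in tokens]

def analyze(
    text: str,
    char_filters: List[Callable[[str], str]],
    tokenizer: Callable[[str], List[str]],
    token_filters: List[Callable[[List[str]], List[str]]],
) -> List[str]:
    """Run character filters, then the tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

tokens = analyze(
    "<b>Food</b> is GREAT",
    char_filters=[strip_html],
    tokenizer=whitespace_tokenizer,
    token_filters=[lowercase_filter],
)
print(tokens)  # ['food', 'is', 'great']
```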

Some Commonly used built-in analyzers -

1. Standard -

Divides text into terms on word boundaries, lower-cases all terms, and removes punctuation. Stop words are removed only if configured (none by default).

Text: The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.

Output: [the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone]

2. Simple/Lowercase -

Divides text into terms whenever it encounters a non-letter character. Lower-cases all terms.

Text: The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.

Output: [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

3. Whitespace -

Divides text into terms whenever it encounters a white-space character.

Text: The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.

Output: [ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone.]

4. Stop -

Same as the simple analyzer, but also removes stop words (English stop words by default).

Text: The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.

Output: [ quick, brown, foxes, jumped, over, lazy, dog, s, bone]

5. Keyword / NOOP -

Returns the entire input string as a single token, unchanged.

Text: The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.

Output: [The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.]
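The outputs above can be reproduced with a rough Python emulation of four of these analyzers. This is a sketch under stated assumptions, not the Lucene implementation: the regular expression only handles ASCII letters, `STOP_WORDS` is a small illustrative subset of English stop words, and the standard analyzer is left out because its Unicode word-boundary rules are not easy to imitate in a few lines.

```python
import re

TEXT = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."

# Assumption: a small illustrative subset, not the full built-in English list.
STOP_WORDS = {"a", "an", "and", "is", "it", "of", "the", "to"}

def simple_analyzer(text):
    """Split on any non-letter character and lower-case (ASCII letters only here)."""
    return re.findall(r"[a-z]+", text.lower())

def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are preserved."""
    return text.split()

def stop_analyzer(text):
    """Simple analyzer followed by stop word removal."""
    return [token for token in simple_analyzer(text) if token not in STOP_WORDS]

def keyword_analyzer(text):
    """Emit the whole input as one token."""
    return [text]

for analyzer in (simple_analyzer, whitespace_analyzer, stop_analyzer, keyword_analyzer):
    print(analyzer.__name__, analyzer(TEXT))
```

On a real cluster, the `_analyze` API is the reliable way to check what a given analyzer produces for a piece of text.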

Some Commonly used built-in tokenizers -

1. Standard -

Divides text into terms on word boundaries, removes most punctuation.

2. Letter -

Divides text into terms whenever it encounters a non-letter character.

3. Lowercase -

Letter tokenizer which lowercases all tokens.

4. Whitespace -

Divides text into terms whenever it encounters any white-space character.

5. UAX-URL-EMAIL -

Standard tokenizer which recognizes URLs and email addresses as single tokens.

6. N-Gram -

Divides text into terms when it encounters anything from a list of specified characters (e.g. whitespace or punctuation), and returns n-grams of each word: a sliding window of contiguous characters. For example, with a minimum length of 2 and a maximum length of 5, quick → [qu, qui, quic, quick, ui, uic, uick, ic, ick, ck].

7. Edge-N-Gram -

Similar to the N-Gram tokenizer, but the n-grams are anchored to the start of the word (prefix-based n-grams), e.g. quick → [q, qu, qui, quic, quick].

8. Keyword -

Emits the exact input text as a single term.
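The difference between n-grams and edge n-grams is easiest to see in code. The two helpers below are a plain Python sketch of the idea (the names `ngrams` and `edge_ngrams` are made up for this example); they work on a single word and ignore the splitting on token characters that the real tokenizers also do.

```python
def ngrams(word, min_gram, max_gram):
    """All substrings of length min_gram..max_gram, from every start position."""
    return [
        word[start:start + size]
        for start in range(len(word))
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(word)
    ]

def edge_ngrams(word, min_gram, max_gram):
    """Only the substrings anchored at the start of the word (prefixes)."""
    return [word[:size] for size in range(min_gram, min(max_gram, len(word)) + 1)]

print(ngrams("quick", 2, 5))
# ['qu', 'qui', 'quic', 'quick', 'ui', 'uic', 'uick', 'ic', 'ick', 'ck']
print(edge_ngrams("quick", 1, 5))
# ['q', 'qu', 'qui', 'quic', 'quick']
```

N-grams grow the index much faster than edge n-grams, which is why edge n-grams are usually preferred when only prefix matching is needed.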

Make your mappings right -

Analyzers, if not set up correctly, can increase your search time significantly.

Avoid using regular expressions in queries as much as possible. Let your analyzers do that work at index time instead.

ES provides multiple tokenizers (standard, whitespace, ngram, edge-ngram, etc.) which can be used directly, or you can create your own tokenizer.

Consider a simple use-case where we had to search for users who have either “brad” in their name or “brad_pitt” in their email (a substring-based search). If no suitable analyzer is defined for this mapping, the obvious approach is to write a regex query.

CODE: https://gist.github.com/velotiotech/dc57ba572a0c882d060017f9271dca5b.js

This took 16s to fetch 1 lakh (100,000) documents out of 60 million.

Instead, we created an n-gram analyzer with a lower-case filter, which generates all relevant tokens while indexing.

The above regex query was then updated to -

CODE: https://gist.github.com/velotiotech/1285cc010dfa6999f6fc60208b62d94c.js

This took 109ms to fetch 1 lakh (100,000) documents out of 60 million.

Thus, search queries that previously took 10-25s now take less than 800-900ms to fetch the same set of records.
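For readers who cannot open the gists, the sketch below shows one plausible shape for an n-gram based setup, written as Python dictionaries that serialize to the JSON request bodies. It is not the configuration from the gists. It assumes a typeless mapping (Elasticsearch 7 or later), and the analyzer and tokenizer names, the gram sizes, and the choice of search analyzer are all illustrative assumptions.

```python
import json

MIN_GRAM, MAX_GRAM = 3, 10  # assumed sizes; "brad_pitt" (9 chars) must fit within MAX_GRAM

# Hypothetical index definition: n-grams are generated at index time only.
index_body = {
    "settings": {
        # Needed because the gap between max_gram and min_gram exceeds the default of 1.
        "index": {"max_ngram_diff": MAX_GRAM - MIN_GRAM},
        "analysis": {
            "tokenizer": {
                "substring_tokenizer": {
                    "type": "ngram",
                    "min_gram": MIN_GRAM,
                    "max_gram": MAX_GRAM,
                }
            },
            "analyzer": {
                "substring_analyzer": {
                    "type": "custom",
                    "tokenizer": "substring_tokenizer",
                    "filter": ["lowercase"],
                },
                # Search side: keep the query text as one lower-cased token.
                "lowercase_keyword": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"],
                },
            },
        },
    },
    "mappings": {
        "properties": {
            field: {
                "type": "text",
                "analyzer": "substring_analyzer",
                "search_analyzer": "lowercase_keyword",
            }
            for field in ("name", "email")
        }
    },
}

# The regex query becomes plain match queries against the pre-built tokens.
search_body = {
    "query": {
        "bool": {
            "should": [
                {"match": {"name": "brad"}},
                {"match": {"email": "brad_pitt"}},
            ],
            "minimum_should_match": 1,
        }
    }
}

print(json.dumps(index_body, indent=2))
print(json.dumps(search_body, indent=2))
```

With this layout, a search term shorter than the minimum gram size or longer than the maximum will not match, so the gram sizes need to be chosen with the expected search terms in mind.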

Had the use-case been to find results where the name starts with “brad” or the email starts with “brad_pitt” (a prefix-based search), an edge-n-gram analyzer or suggesters would be a better fit.

Performance Improvement with Filter Queries -

Use Filter queries whenever possible.

ES usually scores documents and returns them sorted by score. This can hurt performance when scoring is not relevant to the use-case. In such scenarios, use “filter” queries, which only decide whether a document matches or not (a boolean yes/no instead of a relevance score) and whose results ES can cache.

CODE: https://gist.github.com/velotiotech/1968a9c9f3195a8b03f83617b1eb43fa.js

The above query can now be written as -

CODE: https://gist.github.com/velotiotech/ca4b5a33981141998f4d7a9b6b92cde0.js

This will reduce query-time by a few milliseconds.
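As an illustration of the difference (not the queries from the gists), the sketch below builds a scoring `bool` query and converts it to filter context by moving its `must` clauses under `filter`. The field `status` and the helper `to_filter_context` are hypothetical.

```python
import json
from copy import deepcopy

# Scoring version: clauses under "must" contribute to the relevance score.
scoring_query = {
    "query": {
        "bool": {
            "must": [
                {"term": {"status": "active"}},  # hypothetical field
            ]
        }
    }
}

def to_filter_context(query):
    """Return a copy of a bool query with its `must` clauses moved to `filter`."""
    converted = deepcopy(query)
    bool_clause = converted["query"]["bool"]
    bool_clause.setdefault("filter", []).extend(bool_clause.pop("must", []))
    return converted

filter_query = to_filter_context(scoring_query)
print(json.dumps(filter_query, indent=2))
```

This conversion is only appropriate when relevance ranking does not matter for the clauses being moved, for example exact matches on status flags, ids, or date ranges.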

Re-indexing made faster -

Before creating any mappings, know your use-case well.

ES does not allow us to alter existing field mappings, unlike the “ALTER” command in relational databases, although we can keep adding mappings for new fields to the index.

The only way to change existing mappings is to create a new index, re-index the existing documents into it, and then point an alias with the required name at the new index, which allows the switch to happen with ZERO downtime on production. Note - This process can take days if you have millions of records to re-index.
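The re-index and alias switch can be sketched as two request bodies, one for the `_reindex` API and one for the `_aliases` API. The Python below only builds and prints the JSON; the index names `users_v1` and `users_v2`, the alias `users`, and the helper function names are hypothetical.

```python
import json

def reindex_body(source_index, dest_index):
    """Body for POST /_reindex: copy documents from the old index to the new one."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
    }

def alias_swap_body(alias, old_index, new_index):
    """Body for POST /_aliases: move the alias in a single atomic request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

# Hypothetical names: the application always queries the alias "users".
print(json.dumps(reindex_body("users_v1", "users_v2"), indent=2))
print(json.dumps(alias_swap_body("users", "users_v1", "users_v2"), indent=2))
```

Because both alias actions are applied in one request, clients that query the alias never see a moment where it points at no index, which is what makes the zero-downtime switch possible.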

To re-index faster, we can change a few settings -

1. Disable swapping - Since no requests will be directed to the new index until indexing is done, we can safely disable swap.
Command for Linux machines -

CODE: https://gist.github.com/velotiotech/deade6e5af7312eda1b97216baf1603c.js

2. Disable refresh_interval for ES - The default refresh_interval is 1s. It can safely be disabled (set to -1) while documents are being re-indexed and restored afterwards.

3. Change batch size while re-indexing - By default, ES re-indexes documents in batches of 1,000. Increasing this to roughly 5,000 to 10,000 is usually preferable, although we need to find the sweet spot that avoids overloading the current index.

4. Reset replica count to 0 - By default, ES creates 1 replica per primary shard. We can set this to 0 while indexing and reset it to the required value after indexing.
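Settings 2 to 4 can be sketched as request bodies for the index `_settings` API and the `_reindex` API. The values below (a batch size of 5,000 and restoring to 1s and 1 replica) are examples rather than recommendations, and the index names and helper name are hypothetical.

```python
import json

# Body for PUT /<new-index>/_settings before re-indexing starts.
bulk_load_settings = {
    "index": {
        "refresh_interval": "-1",   # disable periodic refresh
        "number_of_replicas": 0,    # no replicas while loading
    }
}

# Body for PUT /<new-index>/_settings once re-indexing has finished.
restored_settings = {
    "index": {
        "refresh_interval": "1s",   # back to the default
        "number_of_replicas": 1,    # or whatever the cluster requires
    }
}

def reindex_body(source_index, dest_index, batch_size=5000):
    """Body for POST /_reindex with a larger batch size than the default 1,000."""
    return {
        "source": {"index": source_index, "size": batch_size},
        "dest": {"index": dest_index},
    }

for body in (bulk_load_settings, reindex_body("users_v1", "users_v2"), restored_settings):
    print(json.dumps(body, indent=2))
```

Restoring the refresh interval and replica count afterwards matters: until then, newly indexed documents are not refreshed into search results on the usual schedule, and the index has no redundancy.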

Conclusion

Elasticsearch is a very powerful database for text-based searches. The Elastic ecosystem is widely used for reporting, alerting, machine learning, etc. This article gives only an overview of Elasticsearch mappings and how creating relevant mappings can improve your query performance and accuracy. Giving your Elasticsearch cluster the right mappings and the right resources can do wonders.
