ElasticSearch is an open-source search engine built on top of Apache Lucene, which handles the low-level indexing and searching. It stores data in the form of JSON documents, and you don't need to define a schema before storing your data.
Internally, however, ElasticSearch still supplies Lucene with a schema, called a mapping. The mapping defines how each field is indexed and what data type it has. A mapping can be explicit (defined by you) or implicit (inferred by ElasticSearch from the documents you index).
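To make the idea of an explicit mapping concrete, here is a minimal sketch of the kind of JSON body that could be sent when creating an index. The field names (title, views, published_on) are hypothetical and chosen only for illustration. The field types text, integer, and date are standard ElasticSearch types.

```python
import json

# Hypothetical explicit mapping for an index holding blog posts.
# The field names are illustrative assumptions, not from a real index.
explicit_mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},         # analyzed for full-text search
            "views": {"type": "integer"},      # numeric field
            "published_on": {"type": "date"},  # date field
        }
    }
}

# Print the JSON body that would accompany an index-creation request.
print(json.dumps(explicit_mapping, indent=2))
```

With an implicit mapping you would skip this step entirely and let ElasticSearch infer a type for each field from the first documents it sees.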
In this blog, we will learn how ElasticSearch is able to process data very rapidly.
The inverted index is a data structure built for fast full-text search: it maps each term to the documents that contain it. It is the main reason ElasticSearch can search so quickly.
How does it work? Let's understand it with a simple example.
Suppose we insert two documents:
Document 1: "It is a beautiful day"
Document 2: "What a beautiful flower"
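An inverted index of the above documents would look like this (terms are lowercased, and positions are listed in the same order as the documents):
Term       Documents  Positions  Frequency (per document)
it         1          1          1
is         1          2          1
a          1, 2       3, 2       1
beautiful  1, 2       4, 3       1
day        1          5          1
what       2          1          1
flower     2          4          1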
With this data structure, a search becomes a simple lookup: to find the documents that contain a term, ElasticSearch reads that term's entry instead of scanning every document.
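The example above can be reproduced with a small Python sketch. This is only an illustration of the idea, not how Lucene stores its index internally. The function names build_inverted_index and search are hypothetical, and the text is simply lowercased and split on whitespace.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased term to {doc_id: [1-based positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


def search(index, query):
    """Return the sorted ids of documents that contain every query term."""
    postings = [set(index.get(term, {})) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


documents = {
    1: "It is a beautiful day",
    2: "What a beautiful flower",
}
index = build_inverted_index(documents)

print(index["beautiful"])              # {1: [4], 2: [3]}
print(search(index, "beautiful"))      # [1, 2]
print(search(index, "beautiful day"))  # [1]
```

Note that a query never touches the document text: each term is a dictionary lookup, and a multi-term query is an intersection of the resulting document sets.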
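By default, ElasticSearch indexes the data in every field, and each indexed field uses a data structure optimized for its type. For example, text fields are stored in inverted indices, while numeric and geo fields are stored in BKD trees.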
An analyzer is the algorithm that determines how a text field is transformed into the terms stored in the inverted index. It first breaks the text into terms and then standardizes them. It is a three-step process:
Step 1: Character Filtering
This is a pre-processing step in which the stream of characters is transformed by adding, removing, or changing characters.
Step 2: Tokenization
In this step, the stream of characters is broken down into terms, also known as tokens. For example, a stream can be tokenized on whitespace to produce individual words.
Step 3: Token Filters
In this final step, the tokens are filtered and transformed according to the configured rules, for example by lowercasing them or removing stop words.
The result of the analysis process is then put in the inverted index.
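The three steps can be sketched in plain Python as follows. This is a simplified imitation of the pipeline, not ElasticSearch's actual implementation. The function names and the tiny STOP_WORDS set are assumptions made for illustration. Stripping punctuation in the token filter is also a simplification, since real tokenizers typically handle that themselves.

```python
import re

# Tiny illustrative stop-word list (an assumption, not ElasticSearch's list).
STOP_WORDS = {"a", "an", "is", "the"}


def char_filter(text):
    """Step 1: replace HTML-like tags in the raw character stream with spaces."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Step 2: split the character stream on whitespace into tokens."""
    return text.split()


def token_filter(tokens):
    """Step 3: lowercase, trim surrounding punctuation, drop stop words."""
    cleaned = (token.lower().strip(".,!?") for token in tokens)
    return [token for token in cleaned if token and token not in STOP_WORDS]


def analyze(text):
    """Run the three steps in order and return the resulting terms."""
    return token_filter(tokenize(char_filter(text)))


print(analyze("<p>It is a <b>beautiful</b> day!</p>"))
# ['it', 'beautiful', 'day']
```

The returned terms are what would be written to the inverted index. The same analysis is normally applied to the query text as well, so that query terms match the indexed terms.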
ElasticSearch analyzers provide great support for improving search accuracy.