Unlocking Full-Text Search: Elasticsearch
How full-text search works at platforms like LinkedIn, Amazon, etc., and why Elasticsearch is basically Lucene on steroids.

Turjoy • January 21, 2024
Core concepts
The core concepts that drive Elasticsearch are:
- Natural Language Processing
- Inverted Index Storage
- Logical and Physical Shards
- Master-Slave Architecture
Apache Lucene - NLP Steps
Apache Lucene tames textual chaos! It breaks down your content into byte-sized pieces, then builds a map so you can find things lightning fast.
Let's dive into the captivating journey of transforming text into a searchable index, step by step (a short code sketch follows the list):
1. Stop Word Removal:
We begin by gently removing common words like "the," "a," and "but."
These words, known as stop words, often act as signposts in language, but they offer little value for search.
By removing them, we make space for the true treasures to shine.
2. Tokenization:
Next, we carefully split the text into individual words or tokens, like skilled jewelers examining each precious gem.
This creates a clear inventory of all the valuable words within a document, ready for further exploration.
Example:
The document "The quick brown fox jumps over the lazy dog." becomes
["quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
3. Stemming:
To uncover even more hidden connections, we reduce words to their root forms.
For example, "running" and "ran" both transform into "run," expanding our search possibilities.
This process is like tracing a gem back to its original mine, revealing its shared origins with other precious stones.
4. Inverted Indexing:
Now, we construct the ultimate search tool: the inverted index.
This is a magical map that reveals where each word resides within the vast collection of documents.
It's as if we've created a constellation of interconnected gems, illuminating their relationships and making them effortless to find.
Example:
Say we have document 5:
["decent" (position 1), "product" (position 2), "wrote" (position 3), "money" (position 4)]
Inverted index entry for "decent": [(5, 1)]
// Indicates "decent" appears in document 5 at position 1
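To make steps 1–3 concrete, here is a minimal Python sketch of the analysis part of the pipeline. The stop-word list and the crude suffix-stripping stemmer are toy stand-ins assumed for illustration, not Lucene's actual analyzers.

```python
import re

# Toy stand-ins for Lucene's analysis chain: a tiny stop-word list and a
# crude suffix-stripping "stemmer". Real analyzers are far more elaborate.
STOP_WORDS = {"the", "a", "an", "and", "but"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    """Naive stemming: strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(analyze("The quick brown fox jumps over the lazy dog."))
# -> ['quick', 'brown', 'fox', 'jump', 'over', 'lazy', 'dog']
```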

Inverted Index
The inverted index is a data structure that allows efficient, full-text searches in the database.
Unlike traditional document-centric data structures, which focus on the content of each document, the inverted index is term-centric.
It organizes data based on individual terms, creating a map of where each term appears across the entire dataset.
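To illustrate the idea, here is a small sketch of a term-centric inverted index, assuming documents have already been analyzed into token lists. The structure and helper names are illustrative, not Elasticsearch's internal representation.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a postings list of (document id, position) pairs."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens, start=1):
            index[term].append((doc_id, position))
    return index

def search(index, terms):
    """Return the ids of documents containing every query term."""
    postings = [{doc_id for doc_id, _ in index.get(term, [])} for term in terms]
    return set.intersection(*postings) if postings else set()

docs = {5: ["decent", "product", "wrote", "money"]}
index = build_inverted_index(docs)
print(index["decent"])                      # -> [(5, 1)]  document 5, position 1
print(search(index, ["decent", "money"]))   # -> {5}
```

Because the map is keyed by term, a full-text query becomes a cheap dictionary lookup per term followed by merging the matching document ids, instead of scanning every document.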

How scalability is achieved
To organize this massive collection, Elasticsearch employs two harmonious concepts:
- Logical and physical shards
- Master-slave architecture
A logical shard is a conceptual chunk of a dataset, created to distribute data across multiple servers.
A physical shard is an actual database instance (server or node) that stores one or more logical shards.
The Master is the primary server responsible for handling write operations and managing the overall database system.
The Slaves are secondary servers that maintain copies of the master's data, primarily used for read operations and failover support.
Together, a master and its slaves make up one logical shard.
But the master and slaves of a logical shard are distributed across different physical shards.
When a document is to be stored in the inverted index, Elasticsearch uses consistent hashing to decide which logical shard it belongs to.
The document is stored and the inverted index updated on that shard's master server, and the update is then replicated to the slaves, which sit on different physical devices (shards).
When a query arrives to fetch related documents, a handful of physical shards are chosen so that, between them, they hold a master or slave replica of every logical shard's inverted index.
These shards are processed in parallel to deliver the results fast.
As the collection of documents grows, Elasticsearch effortlessly scales by adding more physical shards, seamlessly distributing the workload and maintaining performance harmony.
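The routing idea can be sketched as a small consistent-hash ring. The shard names, the virtual-point count, and the hash function below are assumptions made for illustration, not Elasticsearch's actual implementation.

```python
import hashlib
from bisect import bisect_right

class HashRing:
    """Toy consistent-hash ring that routes a document id to a logical shard."""

    def __init__(self, logical_shards, points_per_shard=100):
        # Scatter several virtual points per logical shard around the ring.
        self.ring = sorted(
            (self._hash(f"{shard}#{i}"), shard)
            for shard in logical_shards
            for i in range(points_per_shard)
        )
        self.keys = [position for position, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def route(self, doc_id):
        """Pick the first shard clockwise from the document's hash."""
        idx = bisect_right(self.keys, self._hash(doc_id)) % len(self.keys)
        return self.ring[idx][1]

ring = HashRing(["logical-0", "logical-1", "logical-2"])
print(ring.route("doc-42"))  # the same id always lands on the same logical shard
```

The appeal of this scheme is that adding a new logical shard only moves the documents whose ring positions change, which keeps rebalancing cheap as physical shards are added.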
Write flow in Elasticsearch
- Document arrives and gets a unique ID.
- A logical shard is selected for processing based on consistent hashing.
- Document is cleaned up:
  - Unnecessary characters are removed.
  - Words are broken down into individual pieces (tokens).
  - Words are reduced to their basic forms (stemming).
- The document and its transformed words (vector array) are sent to the next step.
- The inverted index of the logical shard's master server is populated.
- The inverted index update is replicated to the slaves of the same logical shard, located on different physical shards.
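Putting those bullets together, a rough, self-contained sketch of the write path might look like this. The node names, the two-shard layout, and the simplified analyze step are assumptions for illustration only.

```python
import hashlib
from collections import defaultdict

class Node:
    """A toy server holding an inverted index: term -> [(doc_id, position)]."""
    def __init__(self, name):
        self.name = name
        self.inverted_index = defaultdict(list)

    def index_document(self, doc_id, tokens):
        for position, term in enumerate(tokens, start=1):
            self.inverted_index[term].append((doc_id, position))

# Two logical shards; each has a master and slaves on different physical nodes.
logical_shards = [
    {"master": Node("node-A"), "slaves": [Node("node-B"), Node("node-C")]},
    {"master": Node("node-D"), "slaves": [Node("node-E"), Node("node-F")]},
]

def analyze(text):
    # Stand-in for the Lucene steps: clean-up, tokenization, stemming.
    return [w for w in text.lower().split() if w not in {"the", "a", "but"}]

def write(doc_id, text):
    # 1. Pick a logical shard for this id (hash-based routing).
    shard_no = int(hashlib.md5(doc_id.encode()).hexdigest(), 16) % len(logical_shards)
    shard = logical_shards[shard_no]
    # 2. Transform the document into tokens.
    tokens = analyze(text)
    # 3. Populate the master's inverted index first...
    shard["master"].index_document(doc_id, tokens)
    # 4. ...then replicate the update to the slaves on other physical shards.
    for slave in shard["slaves"]:
        slave.index_document(doc_id, tokens)

write("doc-1", "the quick brown fox")
```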

Read Flow
Here's a simple breakdown of the read flow:
- A query comes in.
- One physical shard takes charge and steps up to lead the search.
- The lead shard picks a few other shards. The chosen physical shards, when combined, must encompass a master or slave replica of every single logical shard within the system. This guarantees access to the full spectrum of knowledge, regardless of where it resides.
- The chosen shards all look for answers at the same time, in parallel, in their inverted indices.
- Putting it all together: once every shard has found what it can, the lead shard gathers all the results to give you the best possible answer.
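Here is a scatter-gather sketch of that read path. The shard layout, node names, and the placeholder search_shard function are illustrative assumptions; in a real cluster each node would run the query against its local inverted index.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# One replica set (master + slaves) per logical shard, spread over physical nodes.
logical_shards = {
    "logical-0": ["node-A", "node-B", "node-C"],
    "logical-1": ["node-D", "node-E", "node-F"],
}

def search_shard(node, query):
    # Placeholder: in reality this is an inverted-index lookup on that node.
    return [f"{node}: hit for '{query}'"]

def coordinated_search(query):
    # The lead shard picks one replica per logical shard, so every logical
    # shard is covered exactly once...
    chosen = [random.choice(replicas) for replicas in logical_shards.values()]
    # ...queries the chosen nodes in parallel...
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda node: search_shard(node, query), chosen)
    # ...and merges the partial results into a single answer.
    return [hit for partial in partials for hit in partial]

print(coordinated_search("brown fox"))
```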
