
AWS OpenSearch Full Documents Reindexing: When? Why? How?

We will explore AWS OpenSearch Service and how it was introduced, then take a deep dive into the challenge of document reindexing.

OpenSearch Brief History: The Forking of Elasticsearch

Elasticsearch is a search engine developed by Elastic NV and originally released under the open-source Apache 2.0 license. It became incredibly popular due to its scalability, distributed nature, and powerful search capabilities. However, in 2021, Elastic NV changed the licensing model of Elasticsearch from Apache 2.0 to a dual license: the Server Side Public License (SSPL) or the Elastic License. This move was made to prevent cloud providers from offering Elasticsearch as a managed service without contributing back to the open-source community.

In response, AWS forked the last Apache 2.0-licensed version of Elasticsearch (7.10.2) and created the OpenSearch project. Its managed offering, Amazon Elasticsearch Service, was subsequently renamed Amazon OpenSearch Service. This fork not only preserved the open-source nature of the software but also allowed AWS to continue offering a managed search service with full control over its development.

Licensing

After the license change, Elasticsearch is now under the SSPL, which is not recognized as an open-source license by the Open Source Initiative (OSI). The SSPL imposes restrictions on how the software can be used, particularly for cloud services.

Elastic offers its own distribution, Elastic Cloud, which can be deployed on the resources of any major public cloud provider or on-premises.

AWS OpenSearch: OpenSearch remains under the Apache 2.0 license, which is a fully open-source, OSI-approved license. This means anyone can use, modify, and distribute the software under its permissive terms, making it more attractive for users who prefer open-source solutions.

Understanding AWS Elasticsearch: Internal Implementation of Indices and Reindexing

Amazon Elasticsearch Service (Amazon ES), now known as Amazon OpenSearch Service, is a managed service that makes it easy to deploy, operate, and scale Elasticsearch clusters in the AWS Cloud. It’s widely used for real-time application monitoring, log analytics, full-text search, and more. One of the core components of Elasticsearch is its indices, which serve as the primary structure for storing and querying data. This article delves into the internal implementation of indices in AWS Elasticsearch and provides a guide on how to reindex a database within this environment.

What is an Index in Elasticsearch?

An index in Elasticsearch is akin to a database in traditional relational databases. It contains a collection of documents that are stored and managed together. Each document is a JSON object, and each field within the document is a data point that can be searched and analyzed.

Key Components of an Elasticsearch Index:

  • Shards: An index is divided into smaller pieces called shards. Each shard is a self-contained, fully functional instance of Lucene, the underlying search engine for Elasticsearch.
  • Replicas: For high availability and fault tolerance, Elasticsearch allows you to create replica shards. These are copies of the primary shards that can serve search requests in case the primary shard fails.
  • Mappings: Mappings define the structure of the documents within an index, including the data types of fields and how they should be indexed and stored.

Internal Implementation of Indices in AWS Elasticsearch

AWS Elasticsearch handles indices similarly to a standard Elasticsearch deployment but with additional layers of management, security, and scaling capabilities provided by AWS.

Key Aspects of AWS Elasticsearch Indices:

  • Managed Clusters: AWS Elasticsearch manages the underlying infrastructure, including node provisioning, shard allocation, and index replication.
  • Scaling: Indices can be scaled horizontally by adding more nodes and adjusting the number of shards, and vertically by increasing the instance size.
  • Security: AWS integrates Elasticsearch with other AWS services like AWS IAM, AWS KMS for encryption, and VPC for network isolation, ensuring that your indices are secure and accessible only to authorized users.
  • Snapshots: AWS Elasticsearch provides automated snapshots for indices, which are stored in Amazon S3 and can be used for backup and recovery purposes.

Why Reindexing is Important

Reindexing in Elasticsearch is the process of copying the data from one index to another. This is often necessary when you need to:

  • Change the structure of the index, such as altering mappings (changing types).
  • Improve performance by reconfiguring the number of shards or replicas (or by merging or splitting indices).
  • Upgrade Elasticsearch versions that might require data format changes.
  • Introduce new fields in documents and make them available for search.

Reindexing can be a resource-intensive operation, and AWS Elasticsearch provides tools and best practices to ensure that it’s done efficiently without disrupting service availability.

Steps to Reindex in AWS Elasticsearch

Here’s a step-by-step guide to reindexing an index in AWS Elasticsearch:

Step 1: Create the Target Index

Before you start reindexing, you need to create the target index with the desired mappings, settings, and shard configuration.

PUT /new-index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "field1": {
        "type": "text"
      },
      "field2": {
        "type": "date"
      }
    }
  }
}
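
The same request body can be assembled programmatically. The following is a minimal Python sketch, assuming hypothetical helper names (`build_index_body`, `total_shard_copies`) that are not part of any client library. It only builds the JSON shown above and computes how many shard copies the cluster will need to allocate; it does not contact a cluster.

```python
import json


def build_index_body(shards: int, replicas: int, properties: dict) -> dict:
    """Build the body for PUT /<index> (hypothetical helper, not a library API)."""
    if shards < 1 or replicas < 0:
        raise ValueError("shards must be >= 1 and replicas >= 0")
    return {
        "settings": {"number_of_shards": shards, "number_of_replicas": replicas},
        "mappings": {"properties": properties},
    }


def total_shard_copies(body: dict) -> int:
    """Every primary shard exists once, plus one copy per configured replica."""
    settings = body["settings"]
    return settings["number_of_shards"] * (1 + settings["number_of_replicas"])


body = build_index_body(
    shards=3,
    replicas=1,
    properties={"field1": {"type": "text"}, "field2": {"type": "date"}},
)
print(json.dumps(body, indent=2))
print(total_shard_copies(body))  # 3 primaries * (1 + 1 replica) = 6
```

Knowing the total number of shard copies up front helps when checking that the target index fits the cluster's node count and storage.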

Step 2: Use the Reindex API

Elasticsearch provides a _reindex API that allows you to copy data from the source index to the target index.

POST /_reindex
{
  "source": {
    "index": "old-index"
  },
  "dest": {
    "index": "new-index"
  }
}
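
For large indices it is often more practical to start the reindex as a background task and to throttle it. The sketch below, assuming a hypothetical helper named `build_reindex_request`, builds the method, path, and body for such a call using the `wait_for_completion` and `requests_per_second` query parameters of the _reindex API. The throttle value is illustrative only, and sending the request is left to whichever HTTP client you use.

```python
from typing import Optional, Tuple


def build_reindex_request(
    source: str,
    dest: str,
    requests_per_second: Optional[int] = None,
    wait_for_completion: bool = False,
) -> Tuple[str, str, dict]:
    """Return (method, path, body) for a _reindex call (hypothetical helper)."""
    if source == dest:
        raise ValueError("source and destination index must differ")
    # wait_for_completion=false makes the API return a task id immediately.
    params = [f"wait_for_completion={str(wait_for_completion).lower()}"]
    if requests_per_second is not None:
        # Throttles the copy so it does not starve regular traffic.
        params.append(f"requests_per_second={requests_per_second}")
    path = "/_reindex?" + "&".join(params)
    body = {"source": {"index": source}, "dest": {"index": dest}}
    return "POST", path, body


method, path, body = build_reindex_request(
    "old-index", "new-index", requests_per_second=500
)
print(method, path)
print(body)
```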

Step 3: Monitor the Reindexing Process

Reindexing can take time depending on the size of your data. You can monitor the progress using the Task API.

GET /_tasks?detailed=true&actions=*reindex
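
The response groups tasks by node, and each reindex task carries a status object with document counters. As a rough sketch, assuming a hypothetical helper `reindex_progress` and a trimmed sample response with made-up values, progress could be estimated like this:

```python
def reindex_progress(tasks_response: dict) -> dict:
    """Map task id -> fraction of documents processed so far (0.0 to 1.0)."""
    progress = {}
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            status = task.get("status", {})
            total = status.get("total", 0)
            done = sum(status.get(k, 0) for k in ("created", "updated", "deleted"))
            progress[task_id] = done / total if total else 0.0
    return progress


# Illustrative response, trimmed to the fields used above (values are made up).
sample = {
    "nodes": {
        "node-1": {
            "tasks": {
                "node-1:42": {
                    "action": "indices:data/write/reindex",
                    "status": {"total": 1000, "created": 250, "updated": 0, "deleted": 0},
                }
            }
        }
    }
}
print(reindex_progress(sample))  # {'node-1:42': 0.25}
```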

Step 4: Update Aliases (Optional)

Once reindexing is complete, you may want to switch an alias to point to the new index.

POST /_aliases
{
  "actions": [
    {
      "remove": {
        "index": "old-index",
        "alias": "my-alias"
      }
    },
    {
      "add": {
        "index": "new-index",
        "alias": "my-alias"
      }
    }
  ]
}
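
Because both actions travel in a single _aliases request, the swap is applied atomically, so clients querying the alias never see it missing. A small Python sketch, assuming a hypothetical helper `build_alias_swap`, that produces this body for any alias and index pair:

```python
def build_alias_swap(alias: str, old_index: str, new_index: str) -> dict:
    """Body for POST /_aliases that moves `alias` from old_index to new_index."""
    if old_index == new_index:
        raise ValueError("old and new index must differ")
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


swap = build_alias_swap("my-alias", "old-index", "new-index")
print(swap)
```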

Step 5: Delete the Old Index (Optional)

After verifying that the new index is functioning correctly, you can delete the old index to free up resources.

DELETE /old-index
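
One simple check before deleting is to compare document counts from `GET /old-index/_count` and `GET /new-index/_count`. The sketch below assumes a hypothetical helper `counts_match` and illustrative responses; it is only meaningful when the reindex copied documents one-to-one, with no filtering query and no script that drops documents.

```python
def counts_match(old_count_response: dict, new_count_response: dict) -> bool:
    """Compare the `count` field of two _count API responses."""
    return old_count_response["count"] == new_count_response["count"]


# Illustrative _count responses, trimmed to the field used above.
old_resp = {"count": 1000}
new_resp = {"count": 1000}

if counts_match(old_resp, new_resp):
    print("counts match; old index can be deleted after further checks")
else:
    print("counts differ; keep the old index and investigate")
```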

Also, during reindexing it is possible to modify documents on the fly by adding a script section to the request:

POST _reindex
{
   "source":{
      "index":"source"
   },
   "dest":{
      "index":"destination"
   },
   "script":{
      "lang":"painless",
      "source":"ctx._source.account.number++"
   }
}

Updating in place without index migration (suitable for smaller clusters)

If the data volume is not high, a full reindex can be avoided. Once the index template is updated, all new indices are created from the new template (with the new fields available for search). For all existing indices, we need to update the index mapping so that any documents added to or updated in these indices follow the latest schema.

For older documents, we can run a scripted update with a query predicate that selects the documents to patch and sets default values for the new fields:

Step 1: Update the index template

PUT _template/{index}
{
  ...full_body with new mappings
}

Step 2: Add the new mapping to each existing index

PUT /{index}/_mappings
{
  "properties": {
    "new_field": {
      "null_value": false,
      "type": "boolean"
    }
  }
}

Step 3: Patch existing documents with a scripted update

POST /{index}/_update_by_query
{
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "new_field"
        }
      }
    }
  },
  "script": {
    "source": "ctx._source.new_field = false",
    "lang": "painless"
  }
}
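
When several fields need a default, the request body can be generated rather than hand-written. This is a minimal Python sketch, assuming a hypothetical helper `build_backfill_request` and a top-level (non-nested) field name. It passes the field and default value as script params instead of embedding them in the script source, and it selects only documents where the field is missing, so the update is safe to re-run.

```python
def build_backfill_request(field: str, default) -> dict:
    """Body for POST /<index>/_update_by_query that fills in a missing field."""
    return {
        # Only touch documents that do not have the field yet.
        "query": {"bool": {"must_not": {"exists": {"field": field}}}},
        "script": {
            "lang": "painless",
            # _source is a map, so the field name can come from params.
            "source": "ctx._source[params.field] = params.value",
            "params": {"field": field, "value": default},
        },
    }


backfill = build_backfill_request("new_field", False)
print(backfill)
```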

Best Practices for Reindexing in AWS Elasticsearch

  • Snapshot Before Reindexing: Always take a snapshot of your data before starting the reindexing process to safeguard against data loss.
  • Monitor Cluster Health: Keep an eye on the cluster’s health during reindexing to avoid overwhelming the system.
  • Use Aliases: Aliases can help minimize downtime by allowing you to switch indices without changing your application code.
  • Test in Staging: Before reindexing in production, test the process in a staging environment to catch any potential issues.

Conclusion

Reindexing in AWS Elasticsearch is a crucial operation for maintaining and optimizing your search infrastructure. By understanding the internal implementation of indices and following best practices for reindexing, you can ensure that your Elasticsearch environment remains robust, scalable, and ready to meet the demands of your applications. AWS Elasticsearch, with its managed capabilities and tight integration with other AWS services, provides a powerful platform for deploying and managing your search workloads.

This post is licensed under CC BY 4.0 by the author.