Amazon CloudSearch | Best Practices Flashcards
Why is my domain in the “Processing” state?
Best Practices
Amazon CloudSearch | Analytics
A domain can be in one of three different states: “processing,” “active,” or “reindexing.” Normally, your domain will be in the “active” state, which indicates that no changes are currently being made, that the domain can be queried and updated, and that all previous changes are currently visible in the search results.
When a domain needs to be re-indexed, Amazon CloudSearch must rebuild the index entirely. However, the domain does not enter the “processing” state until you initiate indexing. While the domain is in the “processing” state, it can still be queried and updated, but the configuration changes won’t be visible in search results until indexing is complete and the domain’s status changes back to “active.”
You can also continue to upload document batches to your domain. However, if you submit a large volume of updates while your domain is in the “processing” state, it can increase the amount of time it takes for the updates to be applied to your search index. If this becomes an issue, slow down your update rate until the domain returns to the “active” state.
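As a rough illustration of slowing the update rate, the helper below picks a pause between batch uploads from a domain status record. It is a sketch under assumptions: the status is taken to be a dict shaped like one entry of the DescribeDomains response with a boolean "Processing" field, and the function name and delay values are hypothetical.

```python
def batch_delay(domain_status, normal=0.0, slowed=5.0):
    """Return the number of seconds to pause between batch uploads.

    domain_status: dict assumed to look like one DescribeDomains entry,
    carrying a boolean "Processing" field.
    normal / slowed: illustrative delays, not recommended values.
    """
    # Back off while the domain is processing; otherwise upload at full rate.
    return slowed if domain_status.get("Processing", False) else normal
```

An upload loop could call this before each batch and sleep for the returned number of seconds.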
What are the best practices for bootstrapping data into CloudSearch?
After you’ve launched your domain, the next step is loading your data into Amazon CloudSearch. You’ll likely need to upload a single large dataset, and then make smaller updates or additions as new data comes in. The following guidelines will help make bootstrapping your initial data into CloudSearch quick and easy.
- Use curl with the -v option when preparing your script
During the upload of a dataset, the script you’ve written reads your data and uses it to create JSON or XML documents. We recommend preparing this script in advance and using curl or another simple command line tool to check that you can upload the documents the script creates. The “-v” option in curl often provides more detailed information about syntax problems than the AWS SDK or Boto, which both suppress errors for production purposes, and that extra detail helps identify the source of any issue.
- Use the UTF-8 character encoding
Make sure that all data is encoded as UTF-8 and that any invalid Unicode characters have been removed before uploading to CloudSearch. Illegal characters will cause the document upload to fail.
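One way to clean text before it goes into a batch is sketched below. It assumes that "illegal" means bytes that are not valid UTF-8 plus characters outside the XML 1.0 character range; check that assumption against the CloudSearch documentation. The function name clean_text is hypothetical.

```python
import re

# Anything outside the XML 1.0 "Char" production: tab, LF, CR,
# U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
# Assumption: these are the characters the upload accepts.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def clean_text(raw: bytes) -> str:
    """Decode bytes as UTF-8, dropping undecodable bytes and XML-invalid characters."""
    text = raw.decode("utf-8", errors="ignore")
    return _INVALID_XML_CHARS.sub("", text)
```

Silently dropping characters is a design choice made here for brevity; a bootstrapping script may prefer to log or count what was removed.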
- Batch your documents
Batching your documents is perhaps the most important step in data bootstrapping. Submitting documents to CloudSearch individually is not only inefficient, but also leads to preventable errors.
A document batch is simply a collection of add and delete operations that represent the documents you want to add to, update in, or delete from your domain. Batches are described in either JSON or XML, and when you upload them to a domain, the data is indexed automatically according to the domain’s indexing options. Since you’re billed for the total number of document batches uploaded to your search domain, it’s more cost-effective to make each batch as close as possible to 5 MB, the maximum allowed per upload. You can also upload batches in parallel to reduce the amount of time it takes to upload your data.
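The following sketch shows one way to pack documents into JSON batches that stay under the 5 MB limit. The function name make_batches and the (doc_id, fields) input shape are illustrative choices, and only add operations are generated.

```python
import json

MAX_BATCH_BYTES = 5 * 1024 * 1024  # the 5 MB per-upload limit stated above

def make_batches(docs, max_bytes=MAX_BATCH_BYTES):
    """Yield JSON-encoded batches (bytes), each no larger than max_bytes.

    docs: iterable of (doc_id, fields) pairs; this input shape is hypothetical.
    """
    ops, size = [], 2  # 2 bytes for the enclosing "[" and "]"
    for doc_id, fields in docs:
        op = json.dumps({"type": "add", "id": doc_id, "fields": fields}).encode("utf-8")
        if len(op) + 2 > max_bytes:
            raise ValueError(f"document {doc_id!r} exceeds the batch size limit")
        extra = len(op) + (1 if ops else 0)  # 1 byte for the separating comma
        if size + extra > max_bytes:
            # Current batch is full: emit it and start a new one.
            yield b"[" + b",".join(ops) + b"]"
            ops, size = [], 2
            extra = len(op)
        ops.append(op)
        size += extra
    if ops:
        yield b"[" + b",".join(ops) + b"]"
```

Each yielded value is a complete JSON array that can be written to a file for a curl test or passed to whatever upload client the script uses.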
- Pre-scale
It’s also important to pre-scale your domain before uploading data to CloudSearch. Pre-scaling involves selecting the appropriate instance type for the amount of data you wish to upload.
Choosing an instance with enough capacity to handle the size of your upload can help prevent errors and a high replication count. Although replication can help decrease search response time, it doesn’t increase the size of the data pipe or address core problems in data uploads.
CloudSearch will automatically scale up to larger instances as you send more data. Still, pre-selecting the appropriate instance type saves time later in the bootstrapping process, as scaling from one instance type to another tends to be slow. Below is a sample script to pre-scale the domain for bootstrapping and to restore the instance type after the data is loaded.
Pre-scale before bootstrapping:
aws cloudsearch update-scaling-parameters --domain-name foo --scaling-parameters DesiredInstanceType=search.m3.2xlarge
aws cloudsearch index-documents --domain-name foo
Restore after data loading:
aws cloudsearch update-scaling-parameters --domain-name foo --scaling-parameters DesiredInstanceType=search.m1.small
aws cloudsearch index-documents --domain-name foo
What are some ways to avoid 504 errors?
If you’re seeing 504 errors or high replication counts, try moving to a larger instance type. For example, if you’re having problems with m3.large, move up to m3.xlarge. If you continue to get 504 errors even after pre-scaling, start batching the data and increase the delay between retries.
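Increasing the delay between retries can be sketched as an exponential backoff loop. Here send is a hypothetical callable that uploads one batch and returns the HTTP status code; it stands in for whatever client the script actually uses. The attempt count and base delay are illustrative.

```python
import random
import time

def upload_with_retry(send, batch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send(batch) until it returns a status other than 504.

    send: hypothetical callable returning an HTTP status code.
    sleep: injectable so the loop can be tested without real waiting.
    """
    for attempt in range(max_attempts):
        status = send(batch)
        if status != 504:
            return status
        if attempt < max_attempts - 1:
            # Double the delay after each 504 and add jitter so that
            # parallel uploaders do not all retry at the same moment.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    return 504
```

If the function still returns 504 after the last attempt, that is a signal to lower the upload rate or move to a larger instance type rather than to keep retrying.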