DynamoDB — Parallel Scan vs Sequential Scan

Vaishnavi Abirami
4 min readMay 30, 2021

--

When and Why we need to use a Scan Operation?

DynamoDB Scans are generally used when we need to search item(s) whose field is neither a primary key nor a secondary key.

But in General, we should always prefer Query method over Scan. This is because Scan literally reads every item in a table. When it’s not possible (for example, when you’re looking for piece of data with a key that is unknown to you), and if it’s a frequently used pattern, consider adding a GSI (Global Secondary Index) to index that attribute and enable Query.

Sequential Scan:

By default, Scan operations proceed sequentially. Even though DynamoDB distributes items across multiple physical partitions, a Scan operation can only read one partition at a time. For this reason, the throughput of a Scan is constrained by the maximum throughput of a single partition.

A Scan operation performs eventually consistent reads by default, and it can return up to 1 MB (one page) of data. Therefore, a single Scan request can consume

(1 MB page size / 4 KB item size) / 2 (eventually consistent reads) = 128 read operations.

If the total number of scanned items exceeds the maximum data set size limit of 1 MB, the scan stops and results are returned to the user as a LastEvaluatedKey value to continue the scan in a subsequent operation.

A sequential Scan might not always be able to fully utilize the provisioned read throughput capacity. So parallel scan is needed there.

A single Scan operation reads up to the maximum of 1 MB of data. If LastEvaluatedKey is present in the response, you need to paginate the result set. In the following diagram, there are multiple workers or processes spawned to perform Scan where the LastEvaluatedKey is checked in every result set and if the LastEvaluatedKey is present then that is being used as the ExclusiveStartKey for the next subsequent Scan request.

Following is the code snippet written in NodeJS to accomplish Sequential Scan:

Parallel Scan:

A parallel scan can be the right choice if the following conditions are met:

  • The table size is 20 GB or larger.
  • The table’s provisioned read throughput is not being fully used to make sure that other applications that need to access the table is not throttled
  • Sequential Scan operations are too slow.

However the DynamoDB Scanis not suitable for all query patterns, and for more information on why scans are less efficient than queries please read about the performance implications of Scan in the AWS documentation.

How Parallel Scan Works?

Parallel Scans are used to maximize the use of table-level provisioning. It logically divides a table (or secondary index) into multiple logical segments, and use multiple application workers to scan these logical segments in parallel. Each application worker can be a thread in programming languages that support multithreading or an operating system process.

Source:Google

To perform a parallel scan, each worker issues its own Scan request with the following parameters:

  • Segment — A segment to be scanned by a particular worker. Each worker should use a different value for Segment.
  • TotalSegments — The total number of segments for the parallel scan. This value must be the same as the number of workers that your application will use.

How TotalSegments is calculated in general?

const totalItems = DynamoDB.describeTable({TableName:<table_name>})

const totalSegments = Math.max(Math.floor(totalItems / 1000), 1);

TotalSegments is also calculated based on the client resources and the amount of read throughput being utilised.

Increase the value for TotalSegments if you don't consume all of your provisioned throughput but still experience throttling in your Scan requests. Reduce the value for TotalSegments if the Scan requests consume more provisioned throughput than you want to use.

Following is the code snippet written in NodeJS to accomplish Parallel Scan:

Some of the improvements that could be made for a Scan:

ProjectionExpression for a Scan:

Amazon DynamoDB returns all the item attributes by default. To get only some, rather than all of the attributes, use a projection expression.

FilterExpression for a Scan:

A filter expression determines which items within the Scan results should be returned to you. All of the other results are discarded. A filter expression is applied after a Scan finishes but before the results are returned. Therefore, a Scan consumes the same amount of read capacity, regardless of whether a filter expression is present.

Reduce page size:

The Scan operation provides a Limit parameter that you can use to set the page size for your request. Each Scan request that has a smaller page size uses fewer read operations and creates a "pause" between each request. For example, suppose that each item is 4 KB and you set the page size to 40 items. A Scan request would then consume only 20 eventually consistent read operations or 40 strongly consistent read operations. A larger number of smaller Scan operations would allow your other critical requests to succeed without throttling.

--

--

No responses yet