DynamoDB Table Design Principles - Filtering
Abstract
We will continue exploring different technics that allow to extend Access Patterns to a Single Table.
Target DynamoDB Table Structure
NoSQL DynamoDB Table Structure is generated using MultiCloud Diagrams Framework
Access Patterns to implement
In the previous post DynamoDB Table Design Principles - One-To-Many Relationships in NoSQL , when we implemented One-To-Many relationship using SingleTable Design Pattern, and we had completed first 3 Data access patterns:
ID | Access Pattern | Access BY | Key Condition |
---|---|---|---|
A1 | Get user profile | username | PK = USER#alex AND BEGINS_WITH(SK, ‘PROFILE’)” |
A2 | Get orders for user | BY username | PK = USER#alex AND BEGINS_WITH(SK, ‘ORDER#’)” |
A3 | Get single order and order items | GSI: orderid | SK = ORDER#orderid |
Now we will continue Table design to follow all other requested Access Patterns:
ID | Access Pattern |
---|---|
A4 | Get orders for user by status |
A5 | Get open orders |
A5 | Get orders for user, shipped before 01.01.2023 |
A6 | Get orders for user, shipped Between 01.01.2023 - 01.02.2023 |
RDBMS-way for filtering
All this Access Pattern are analogue to WHERE operation in RDBMS, if to write in native-SQL this will look like:
1
SELECT * FROM ORDERS WHERE USERNAME = 'bob' AND ORDERDATE > GETDATE() - 14
Technic #4: Filter Expression
Both Query and Scan operations, support FilterExpressions:
1
2
3
4
5
dynamodb = boto3.client('dynamodb')
response = dynamodb.query(
TableName='UsersAndOrdersTable',
FilterExpression="Status = SHIPPED"
)
Problems:
- to use the query we need always provide the PK, we can not query entire dataset.
- more than that, because the Orders with same status are in different partitions we can not query them.
Scan FilterExpression
1
2
3
4
5
dynamodb = boto3.client('dynamodb')
response = dynamodb.scan(
TableName='UsersAndOrdersTable',
FilterExpression="Status = SHIPPED"
)
Scan with FilterExpression evaluation logic under the hood
Following diagram describes how FilterExpression
is evaluated in DynamoDB:
stateDiagram
[*] --> s1
state "Read all items" as s1
note left of s1
LIMIT in size of 1Mb
notes.
end note
state "Filter Items that do not match" as s2
state "Return results DataSet" as s3
note left of s3
Pros: less network transfer
Cons: All items are read
end note
s1 --> s2
s2 --> s3
s3 --> [*]
Detailed Problems with FilterExpression
in Query and Scan
If there is a 1GB of data in the Table, and predicate that is included to FilterExpression narrows the DataSet to only 1KB of Data, since there is a HARD limit of 1Mb per single read operation - a scan will perform 1000 requests/responses network rounds to a table to read all 1GB of data and filter it to only 1KB.
This is the worst case scenario when you can experience long response time or even timeouts.
Filtered Scan will save bandwidth on a wire, will allow less aggregation on application side. But it will not save the performance and will not add effective data access.
Technic #5: Composite Sort Key
In our Table we have Order Status
and Order CreatedAt
as a separated Attributes in the table:
Instread, we will add new attribute OrderStatusDate
that combines values of these both 2 Attributes (Status & CreatedAt) - A Composite Sort Key.
To provide A4 access pattern, we will introduce a new GSI with PartitionKey
of PK and SortKey
with OrderStatusDate:
This is how A4 query will look like:
ID | Access Pattern | Access BY | Key Condition |
---|---|---|---|
A4 | Get orders for user by status | GSI: | PK = USER#alex AND BEGINS_WITH(OrderStatusDate, ‘SHIPPED#’) |
More than that, Composite Sort Key allows to use both composite attributes when querying, reusing same GSI for more access patterns:
ID | Access Pattern | Access BY | Key Condition |
---|---|---|---|
A5 | Get orders for user, shipped before 01.01.2023 | GSI: | PK = USER#alex AND BEFORE(OrderStatusDate, ‘SHIPPED#01.01.2023’) |
A6 | Get orders for user, shipped Between 01.01.2023 - 01.02.2023 | GSI: | PK = USER#alex AND BETWEEN(OrderStatusDate, ‘SHIPPED#01.01.2023’, ‘SHIPPED#01.02.2023’) |
Technic #6: Sparse Index
This is more a DataWareHouse DB access pattern - query all Database and get all Orders that are with ‘PLACED’ status only.
1
SELECT * FROM ORDERS WHERE STATUS = 'PLACED'
Orders are partitions based on PK USER
- so there are multiple Orders with PLACED
status, but they are stored within different partitions. We need to provide a global Filter for entire Table not a single partition.
Let’s add one more attribute PlacedId
- it will be a marker attribute for Items that are in PLACED
status. The value can be UUID, Date, PK, etc.
Next we will introduce new GSI index view, to effectively query Order with PLACED status:
There are 4 advantages when applying sparce_index
:
- GSI index in size is limited to the number of Items that have SparceIndex key criteria
- Entities set in index is limited to ORDER - there is no need to store relations for other types (more efficient storage)
- once the item has changed its status its SparceIndex attribute should be dropped - after that it will not be returned in queries
- Projection can be used to limit number of fields (attributes) that are needed when requesting (more effective storage)
Conclusions
Key takeaways whe applying Filtering to DynamoDB:
- Avoid using FilterExpression in general (with assumption that it can be applied only on small DataSets).
- Use Composite Sort Key - that will allow to run optimal queries for both Attributes of the Key.
- If Composite Sort Key is sharing the same Partition Key as Table, include it as LSI on table planning time, otherwise introduce GSI.