Query-Level Timeouts for OpenSearch
This post looks at adding query-level timeouts to an OpenSearch production environment, and at why I kept running into socket-level timeouts anyway. I will show how the cancel_after_time_interval parameter finally gave me a working query-level timeout.
The Problem: Socket-Level Timeouts
OpenSearch, by default, sets no response timeout. A running operation waits indefinitely for a response from the shards, so a single query that takes unusually long can become a bottleneck in production.
In my setup, despite my attempts to establish a query-level timeout, I was still hitting socket-level timeouts, and my connections to the OpenSearch load balancer were not being freed up as needed.
Solution: Query-Level Timeouts
The obvious way forward was a query-level timeout, set with the timeout parameter in the request body. This was based on the official OpenSearch documentation, which states that a timeout field can be added to the body with a custom time to wait for a response; the default is no timeout.
Additionally, the operation can define a timeout for how long it waits for a response from active shards, which defaults to 1 minute. I implemented these changes and rolled them out to production.
Surprise: Still Socket Timeouts
Contrary to expectations, the socket timeouts kept occurring. To confirm, I ran a test in the OpenSearch Dev Tools console using the query below.
GET test-idx/_search?timeout=1micros
{
  "size": 1
}
I also tried setting the timeout directly in the request body, like so:
GET test-idx/_search
{
  "size": 100,
  "timeout": "1micros"
}
Regrettably, neither timeout had any effect on the response. The plot had indeed thickened.
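To compare the two variants side by side, here is a minimal Python sketch that builds both requests and checks whether a timeout was actually reported. The helper names are hypothetical and only for illustration; the check relies on the timed_out flag that search responses carry, and it makes no network calls.

```python
def timeout_in_url(index, timeout, size):
    """Variant 1: timeout passed as a URL query parameter."""
    return f"/{index}/_search?timeout={timeout}", {"size": size}


def timeout_in_body(index, timeout, size):
    """Variant 2: timeout passed as a field in the request body."""
    return f"/{index}/_search", {"size": size, "timeout": timeout}


def timeout_reported(response):
    """True if the search response says the timeout kicked in."""
    return bool(response.get("timed_out", False))


path, body = timeout_in_url("test-idx", "1micros", 1)
print(path, body)
path, body = timeout_in_body("test-idx", "1micros", 100)
print(path, body)
```

If timeout_reported returns False for a response even with a tiny timeout, that matches what I saw: the request completed as if no timeout had been set.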
The Findings: Debugging Query-Level Timeouts
Digging deeper, I found https://github.com/opensearch-project/OpenSearch/issues/817, which explains that the optional timeout parameter in the SearchRequest applies only to the individual child shard-level search requests, not to the parent search request. The search requests to a node are sent in batches based on the maxConcurrentRequestsPerNode parameter, so the request as a whole can run well past the configured timeout. The timeout is also honored only in the query phase, not in the fetch phase.
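To make the batching point concrete, here is a deliberately simplified Python model. It assumes the shard requests on a node run in strictly sequential batches and that every shard uses its full timeout; the function and parameter names are hypothetical, and real scheduling inside OpenSearch is more nuanced than this.

```python
import math


def worst_case_query_phase_seconds(shards_on_node, max_concurrent_per_node, shard_timeout_s):
    """Rough upper bound on query-phase time under the toy model.

    Each batch may take up to shard_timeout_s, and batches are assumed
    to run one after another. The fetch phase is not included, since
    the timeout is not honored there at all.
    """
    if max_concurrent_per_node < 1:
        raise ValueError("max_concurrent_per_node must be at least 1")
    batches = math.ceil(shards_on_node / max_concurrent_per_node)
    return batches * shard_timeout_s


# 20 shards, 5 at a time, 2 s timeout each: 4 batches, so up to 8.0 s.
print(worst_case_query_phase_seconds(20, 5, 2.0))
```

Under these assumptions a 2-second shard-level timeout can still let the query phase alone run for 8 seconds, which is how a client-side socket timeout can fire first.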
This is where cancel_after_time_interval comes in to save the day. OpenSearch added this parameter for exactly this situation: the time value you provide specifies when the search request will be cancelled. Interestingly, the request-level cancel_after_time_interval parameter takes precedence over the corresponding cluster setting.
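The precedence rule can be sketched in a few lines of Python. This is only a model of the behaviour described above, not OpenSearch code: the helper names are hypothetical, and the cluster setting name search.cancel_after_time_interval is my understanding of the cluster-level counterpart, so verify it against the documentation for your version.

```python
# Assumption: name of the cluster-level counterpart; check your version's docs.
CLUSTER_SETTING = "search.cancel_after_time_interval"


def cluster_settings_body(interval):
    """Body for a PUT _cluster/settings call that sets a cluster-wide default."""
    return {"persistent": {CLUSTER_SETTING: interval}}


def effective_interval(request_interval, cluster_interval):
    """Model of the precedence rule: a request-level value wins when present."""
    if request_interval is not None:
        return request_interval
    return cluster_interval


print(cluster_settings_body("30s"))
print(effective_interval("5s", "30s"))   # request-level value is used
print(effective_interval(None, "30s"))   # falls back to the cluster value
```

A cluster-wide default acts as a safety net, while individual requests can still ask for a tighter or looser limit.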
I tried using cancel_after_time_interval within a search query, and it worked!
GET test-idx/_search?cancel_after_time_interval=1micros
{
  "size": 100
}
On sending the request, I got a task_cancelled_exception, which confirmed that the query-level timeout was working.
{
  "error": {
    "root_cause": [
      {
        "type": "task_cancelled_exception",
        "reason": "Cancellation timeout of 1ms is expired"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "test-idx",
        "node": "3qgR865ZQaGVA5i-fXAiEg",
        "reason": {
          "type": "task_cancelled_exception",
          "reason": "Cancellation timeout of 1ms is expired"
        }
      }
    ]
  },
  "status": 500
}
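For application code, a rough Python sketch of the same idea follows, using only the standard library. The function names, the base URL, and the default durations are assumptions for illustration. It sends the search as a POST (the _search endpoint accepts both GET and POST), keeps the client socket timeout longer than the server-side cancel interval, and turns a task_cancelled_exception like the one above into a Python TimeoutError.

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def build_search_url(base_url, index, cancel_after):
    """URL for a search with a request-level cancel_after_time_interval."""
    query = urllib.parse.urlencode({"cancel_after_time_interval": cancel_after})
    return f"{base_url.rstrip('/')}/{index}/_search?{query}"


def was_cancelled(error_body):
    """True if the error payload's root cause is a task_cancelled_exception."""
    causes = error_body.get("error", {}).get("root_cause", [])
    return any(c.get("type") == "task_cancelled_exception" for c in causes)


def search(base_url, index, body, cancel_after="5s", socket_timeout_s=10.0):
    """Run a search; raise TimeoutError if the server cancelled it.

    socket_timeout_s should exceed cancel_after so the server-side
    cancellation fires before the client gives up on the socket.
    """
    request = urllib.request.Request(
        build_search_url(base_url, index, cancel_after),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=socket_timeout_s) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        payload = json.load(err)  # assumes the error body is JSON
        if was_cancelled(payload):
            raise TimeoutError(f"search cancelled after {cancel_after}") from err
        raise
```

A call such as search("http://localhost:9200", "test-idx", {"size": 100}, cancel_after="2s") would then either return the parsed response or raise TimeoutError, with the connection released in both cases.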
Debugged by me, written by ChatGPT :D