Query-Level Timeouts for OpenSearch

Savan Nahar
Mar 8, 2024


This post looks at query-level timeouts in a production OpenSearch environment and explains why I kept hitting socket-level timeouts even after setting them. I will show how I used the cancel_after_time_interval parameter to finally get working query-level timeouts.


The Problem: Socket-Level Timeouts

By default, OpenSearch sets no timeout on a search response. A running operation waits indefinitely for the shards to respond, so a single unusually slow query can become a bottleneck in production.

In my setup, despite my attempts to set a query-level timeout, I was still hitting socket-level timeouts, and my connections to the OpenSearch load balancer were not being freed up as needed.

Solution: Query-Level Timeouts

The obvious way forward was a query-level timeout, set with the timeout parameter in the request body. This was based on the official OpenSearch documentation, which stated that a timeout field with a custom wait time could be added to the body; the default is no timeout.

The documentation also described a separate timeout for how long an operation should wait for a response from active shards, which defaulted to 1 minute. I made these changes and rolled them out to production.

Surprise: Still Socket Timeouts

Contrary to expectations, socket timeouts still occurred. To confirm, I ran the query below in OpenSearch Dev Tools.

GET test-idx/_search?timeout=1micros
{
  "size": 1
}

I also tried setting the timeout directly in the request body, like so:

GET test-idx/_search
{
  "size": 100,
  "timeout": "1micros"
}

Regrettably, neither timeout had any effect on the response. The plot had indeed thickened.

The Findings: Debugging Query-Level Timeouts

Digging deeper, I found the explanation in https://github.com/opensearch-project/OpenSearch/issues/817: the optional timeout parameter in the SearchRequest applies only to the individual child shard-level search requests, not to the parent search request. Search requests to a node are sent in batches based on the maxConcurrentRequestsPerNode parameter. The timeout is also honored only in the query phase, not in the fetch phase.
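To see why a shard-level timeout does not bound the whole request, here is a rough back-of-the-envelope sketch. It assumes a deliberately simplified model (shard requests on one node run in strictly sequential batches, and each batch may use the full timeout), and the function name and numbers are hypothetical. It is an illustration of the reasoning, not a description of how OpenSearch schedules work internally.

```python
import math


def worst_case_query_phase_seconds(shards_on_node: int,
                                   max_concurrent_per_node: int,
                                   shard_timeout_seconds: float) -> float:
    """Upper bound under a simplified model: shard requests run in batches,
    and every batch may take up to the full shard-level timeout."""
    batches = math.ceil(shards_on_node / max_concurrent_per_node)
    return batches * shard_timeout_seconds


# Hypothetical numbers: 20 shards on one node, 5 concurrent requests, 2 s timeout.
print(worst_case_query_phase_seconds(20, 5, 2.0))  # 4 batches * 2 s = 8.0
```

Even in this toy model the request can run several times longer than the configured timeout, and the fetch phase is not counted at all.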

Up comes cancel_after_time_interval to save the day. OpenSearch added this parameter to address exactly this problem: the time value you provide specifies when the search request will be cancelled. Interestingly, the request-level cancel_after_time_interval parameter takes precedence over the corresponding cluster setting.
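The precedence rule can be sketched in a few lines. This is only an illustration: the cluster setting key search.cancel_after_time_interval is my assumption of the setting's name and should be checked against the documentation for your version, and both helper functions are hypothetical.

```python
import json
from typing import Optional

# Assumed cluster setting name; verify it against the docs for your version.
CLUSTER_SETTING = "search.cancel_after_time_interval"


def cluster_settings_body(interval: str) -> str:
    """JSON body for PUT _cluster/settings that sets a cluster-wide default."""
    return json.dumps({"persistent": {CLUSTER_SETTING: interval}})


def effective_interval(request_value: Optional[str],
                       cluster_value: Optional[str]) -> Optional[str]:
    """The request-level value wins; otherwise fall back to the cluster setting."""
    return request_value if request_value is not None else cluster_value


print(cluster_settings_body("30s"))
print(effective_interval("5s", "30s"))   # request-level value is used
print(effective_interval(None, "30s"))   # cluster default is used
```

A cluster-wide default acts as a safety net, while individual requests can still ask for a tighter or looser limit.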

I tried using cancel_after_time_interval within a search query, and it worked!

GET test-idx/_search?cancel_after_time_interval=1micros
{
  "size": 100
}
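From application code, the same request could be built roughly like this. It is a minimal sketch using only the Python standard library; the base URL, the helper name build_search_request and the 1s interval are placeholders of mine, and the request is constructed but not sent.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_search_request(base_url: str, index: str, body: dict,
                         cancel_after: str) -> Request:
    """Build (but do not send) a _search request with a cancellation interval."""
    query = urlencode({"cancel_after_time_interval": cancel_after})
    url = f"{base_url.rstrip('/')}/{quote(index)}/_search?{query}"
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # _search accepts POST as well as GET
    )


# Placeholder host and interval, for illustration only.
req = build_search_request("http://localhost:9200", "test-idx", {"size": 100}, "1s")
print(req.full_url)
```

When sending it, for example with urllib.request.urlopen(req, timeout=...), it probably makes sense to keep the client's socket timeout somewhat longer than the cancellation interval, so that the server-side cancellation fires first and the connection is released cleanly.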

On sending the request, I got a task cancellation exception, which confirmed that the query-level timeout was working.

{
  "error": {
    "root_cause": [
      {
        "type": "task_cancelled_exception",
        "reason": "Cancellation timeout of 1ms is expired"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "test-idx",
        "node": "3qgR865ZQaGVA5i-fXAiEg",
        "reason": {
          "type": "task_cancelled_exception",
          "reason": "Cancellation timeout of 1ms is expired"
        }
      }
    ]
  },
  "status": 500
}
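Because the cancellation comes back as a generic 500, an application may want to tell it apart from other failures. Here is a small sketch that checks the root_cause entries of a parsed response shaped like the one above; the function name is hypothetical.

```python
def is_cancelled_search(response: dict) -> bool:
    """Return True if any root cause of the error is a task cancellation."""
    error = response.get("error")
    if not isinstance(error, dict):
        return False
    causes = error.get("root_cause", [])
    return any(cause.get("type") == "task_cancelled_exception" for cause in causes)


# Trimmed copy of the error response shown above.
sample = {
    "error": {
        "root_cause": [
            {
                "type": "task_cancelled_exception",
                "reason": "Cancellation timeout of 1ms is expired",
            }
        ],
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
    },
    "status": 500,
}

print(is_cancelled_search(sample))               # True
print(is_cancelled_search({"hits": {"hits": []}}))  # False
```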

Api Doc Ref:

Debugged by me, written by ChatGPT :D
