Query-Level Timeouts for OpenSearch

Mar 8, 2024

This post looks at how to enforce timeouts at the query level in an OpenSearch production environment, and why the documented timeout settings did not prevent the socket-level timeouts I was seeing. I will show how the cancel_after_time_interval parameter gave me working query-level timeouts.

The Problem: Socket-Level Timeouts

By default, OpenSearch sets no response timeout. A running operation waits indefinitely for a response from the shards, so a single unusually slow query can hold resources and become a bottleneck in production.

In my setup, despite my attempts to set a query-level timeout, I was still hitting socket-level timeouts, and my connections to the OpenSearch load balancer were not being freed up as needed.

Solution: Query-Level Timeouts

The obvious first step was the timeout parameter in the request body. According to the official OpenSearch documentation, a timeout field in the body sets how long to wait for a response, and the default is no timeout.

The same documentation also describes a timeout for how long the operation should wait for a response from active shards, which defaults to 1 minute. I set both timeouts and rolled the change out to production.

Surprise: Still Socket Timeouts

Contrary to expectations, the socket timeouts continued. To confirm, I ran the following query in OpenSearch Dev Tools:

GET test-idx/_search?timeout=1micros
{
  "size": 1
}

I also tried setting the timeout directly in the request body:

GET test-idx/_search
{
  "size": 100,
  "timeout": "1micros"
}

Neither timeout had any effect on the response. The plot had indeed thickened.

The Findings: Debugging Query-Level Timeouts

Digging deeper, I found the explanation in https://github.com/opensearch-project/OpenSearch/issues/817: the optional timeout parameter in the SearchRequest applies only to the individual child shard-level search requests, not to the parent search request. Search requests to a node are sent in batches based on the maxConcurrentRequestsPerNode parameter, so the request as a whole can run longer than the configured timeout. The timeout is also honored only in the query phase, not in the fetch phase.
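To make the batching point concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch under simplifying assumptions of my own: batches run strictly one after another, every shard request uses its full timeout, and the fetch phase and coordination overhead are ignored. The function name and the numbers are hypothetical and only illustrate why a per-shard timeout does not bound the whole request.

```python
import math


def worst_case_query_phase_seconds(shards_on_node, max_concurrent_per_node, shard_timeout_s):
    """Rough upper bound on query-phase time under a simplified batching model.

    Assumes batches run sequentially and each shard request may take its
    full timeout. This is an illustration, not OpenSearch's actual scheduler.
    """
    batches = math.ceil(shards_on_node / max_concurrent_per_node)
    return batches * shard_timeout_s


# 20 shards on one node, 5 concurrent requests per node, 2 s shard-level timeout
print(worst_case_query_phase_seconds(20, 5, 2.0))  # 8.0
```

Under this model, a 2-second shard-level timeout can still let the query phase run for about 8 seconds, before the fetch phase (which ignores the timeout) even starts.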

This is where cancel_after_time_interval comes in. OpenSearch added this parameter for exactly this situation: the time value specifies how long a search request may run before it is cancelled. The request-level cancel_after_time_interval parameter also takes precedence over the corresponding cluster setting.

I tried cancel_after_time_interval in a search query, and it worked!

GET test-idx/_search?cancel_after_time_interval=1micros
{
  "size": 100
}

Sending this request returned a task_cancelled_exception, which confirmed that the query-level timeout was working.

{
  "error": {
    "root_cause": [
      {
        "type": "task_cancelled_exception",
        "reason": "Cancellation timeout of 1ms is expired"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "test-idx",
        "node": "3qgR865ZQaGVA5i-fXAiEg",
        "reason": {
          "type": "task_cancelled_exception",
          "reason": "Cancellation timeout of 1ms is expired"
        }
      }
    ]
  },
  "status": 500
}
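For application code, the same idea might look like the Python sketch below, using only the standard library. The base URL http://localhost:9200, the lack of authentication, and the helper names build_search_request and is_search_cancelled are assumptions for illustration, not part of my production setup. The sketch builds the request with cancel_after_time_interval in the URL and checks an error body for the task_cancelled_exception shown above.

```python
import json
import urllib.parse
import urllib.request


def build_search_request(base_url, index, cancel_after, body):
    """Build a _search request that carries cancel_after_time_interval."""
    query = urllib.parse.urlencode({"cancel_after_time_interval": cancel_after})
    return urllib.request.Request(
        f"{base_url}/{index}/_search?{query}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # _search accepts POST as well as GET
    )


def is_search_cancelled(response):
    """Return True if an error response reports a cancelled search task."""
    error = response.get("error")
    if not isinstance(error, dict):
        return False
    return any(
        cause.get("type") == "task_cancelled_exception"
        for cause in error.get("root_cause", [])
    )


request = build_search_request(
    "http://localhost:9200",  # assumed local endpoint, adjust for your cluster
    "test-idx",
    "1micros",
    {"size": 100},
)
print(request.full_url)

# Trimmed copy of the error response shown above.
sample_error = {
    "error": {
        "root_cause": [
            {
                "type": "task_cancelled_exception",
                "reason": "Cancellation timeout of 1ms is expired",
            }
        ],
        "type": "search_phase_execution_exception",
    },
    "status": 500,
}
print(is_search_cancelled(sample_error))  # True
```

To actually send the request, pass it to urllib.request.urlopen(request, timeout=...). That timeout is the client-side socket timeout, so it makes sense to keep it longer than the cancel interval; otherwise the socket gives up before OpenSearch cancels the search. A 500 response raises urllib.error.HTTPError, whose body can be parsed as JSON and handed to is_search_cancelled.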

API doc reference: Search (introduced in 1.0), opensearch.org

Debugged by me, written by ChatGPT :D