Concurrent Map Write Issue in Golang: A Deep-Dive

Savan Nahar
3 min read · Jan 19, 2024


In today’s post, we’ll delve into a particular concurrency issue we faced in our Go program: a concurrent map write. The resulting fatal error crashed the container, dropping connections for every client of our application.

Fatal error which caused containers to crash

What makes this interesting is that the issue arose from a combination of two factors: the use of an RWMutex inside a struct, and a value receiver on our methods.

Description of the problem

We have a service called RuleEngineService that evaluates rules. The key components within this struct are an RWMutex and a knowledgeLibraries map. Here's a simplified version of our RuleEngineService:

type RuleEngineService struct {
	mx                 sync.RWMutex
	knowledgeLibraries map[string]*KnowledgeLibrary // pointer values, so a missing entry reads as nil
}

func (r RuleEngineService) Evaluate(ctx context.Context, rule RuleConfig, source DataSource) (interface{}, error) {
	// Implementation ...
}

The mutex is intended to ensure that no two goroutines write to the `knowledgeLibraries` map simultaneously. This concurrency control is essential to prevent race conditions and concurrent writes that lead to unexpected behavior or crashes.

// Multiple goroutines may read the map concurrently under the read lock.
r.mx.RLock()
knowledgeLibrary := r.knowledgeLibraries[ruleHash]
r.mx.RUnlock()

if knowledgeLibrary == nil {
	// Not found: acquire the write lock to populate the map.
	r.mx.Lock()
	// Check again to ensure no one wrote the entry while we
	// were waiting for the lock.
	knowledgeLibrary = r.knowledgeLibraries[ruleHash]
	if knowledgeLibrary == nil {
		knowledgeLibrary = buildKnowledgeLibrary(ctx, rule)
		r.knowledgeLibraries[ruleHash] = knowledgeLibrary
	}
	r.mx.Unlock() // Unlock once we are done writing.
}
knowledgeBase := knowledgeLibrary.NewKnowledgeBaseInstance(rule.RuleConfigName(), ruleEngineVersion)

However, our client code also helps spread this concurrency issue: it receives a copy of the `RuleEngineService` from the `rule_engine.New()` function and stores a pointer to that copy in a container, like so:

client := rule_engine.New()
c.Put(RuleClient, &client)

Why is this a Problem?

The locking mechanism might seem flawless at first glance. However, the problem lies in how the Evaluate method receives the RuleEngineService struct. The method is defined with a value receiver, meaning it makes a copy of the struct and all of its fields and operates on that copy.

Due to the value receiver, each goroutine works with its own copy of RuleEngineService, and therefore its own mx mutex, defeating our goal of a single mutex guarding the knowledgeLibraries map. Note the subtlety that makes this crash at all: a Go map is a reference type, so every copied struct still points at the same underlying map data. The writes still collide; they are just no longer serialized, because each goroutine locks a private mutex.
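The copying behavior is easy to demonstrate with a small, illustrative stand-in (type and method names here are invented, not the real service). A value receiver sees a fresh copy on every call, while a pointer receiver sees the caller's struct; and a copied struct's map field still aliases the same underlying data. A sync.RWMutex field would be copied exactly the same way, which is why go vet's copylocks check warns about value receivers on lock-holding structs.

```go
package main

import "fmt"

// engine is a toy stand-in for RuleEngineService.
type engine struct {
	libs map[string]string
}

// addr uses a value receiver: e is a fresh copy per call, so &e
// never equals the caller's address. A mutex field would be
// duplicated in the same way.
func (e engine) addr() *engine { return &e }

// addrPtr uses a pointer receiver: e is the caller's struct.
func (e *engine) addrPtr() *engine { return e }

func main() {
	eng := engine{libs: map[string]string{}}
	fmt.Println(eng.addr() == &eng)    // false: the method saw a copy
	fmt.Println(eng.addrPtr() == &eng) // true: the method saw the original

	// The copy's map header still points at the same underlying data,
	// so writes through copies collide on the shared map.
	eng.addr().libs["k"] = "v"
	fmt.Println(eng.libs["k"]) // "v"
}
```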

The Path to Resolution

The solution follows almost immediately from our discovery of the problem — leverage a pointer receiver in the Evaluate method:

func (r *RuleEngineService) Evaluate(ctx context.Context, rule RuleConfig, source DataSource) (interface{}, error) {
	// Implementation ...
}

This revised method now operates on the address of the original RuleEngineService struct directly. As a result, the mutex and the map belong to the same, single instance of RuleEngineService, averting duplication and ensuring correct synchronization. With such a change, the global instance of RuleEngineService is locked appropriately when calling Evaluate, enforcing the correct order of read-write operations and thus preventing concurrent map write errors.

Another contributing factor to the issue was the client code’s usage of RuleEngineService.

The key is to ensure that the client code, and by extension the rest of the application, interacts with the same RuleEngineService instance and the mutex within it. We need to modify the rule_engine.New() function so that it returns a pointer to the original RuleEngineService. The changes in the rule_engine package could look like this:
type RuleEngine struct {
	// Existing code...
}

// Note that New() now returns a pointer to RuleEngine.
func New() *RuleEngine {
	return &RuleEngine{
		// Existing code...
	}
}

Now, the original RuleEngineService can be retrieved directly, avoiding the creation of a copy:

client := rule_engine.New()
c.Put(RuleClient, client) // No need for & before client now, as client is already a pointer

Now the client code operates on the original RuleEngineService rather than on copies. The mutex inside it is the one shared by every caller, which restores correct synchronization and eliminates the concurrent map write crashes.
