# Concurrent Map Write Issue in Golang: A Deep-Dive
In today’s post, we’ll delve into a concurrency issue we faced in our Go program: a concurrent map write. The resulting fatal error crashed the container, dropping the connections of every client of our application.

What makes this interesting is that the issue arose from a combination of two factors: the use of an `RWMutex` in a struct, and value receivers on the struct’s methods.
## Description of the problem
We have a service called `RuleEngineService` that evaluates rules. The key components of this struct are an `RWMutex` and a `knowledgeLibraries` map. Here's a simplified version of our `RuleEngineService` (note that the map holds pointers, which is what makes the `nil` checks below possible):

```go
type RuleEngineService struct {
	mx                 sync.RWMutex
	knowledgeLibraries map[string]*KnowledgeLibrary
}

func (r RuleEngineService) Evaluate(ctx context.Context, rule RuleConfig, source DataSource) (interface{}, error) {
	// Implementation ...
}
```
The mutex is intended to ensure that no two goroutines write to the `knowledgeLibraries` map simultaneously. This concurrency control is essential to prevent race conditions and concurrent writes that lead to unexpected behavior or crashes. Inside `Evaluate`, the map is accessed with a classic double-checked locking pattern:
```go
// Allow multiple goroutines to read the map concurrently, in a thread-safe way.
r.mx.RLock()
knowledgeLibrary := r.knowledgeLibraries[ruleHash]
r.mx.RUnlock()

if knowledgeLibrary == nil {
	// If nil, acquire the writer lock so we can write into the map.
	r.mx.Lock()
	// Check again: another goroutine may have written the entry
	// while we were waiting for the lock.
	knowledgeLibrary = r.knowledgeLibraries[ruleHash]
	if knowledgeLibrary == nil {
		knowledgeLibrary = buildKnowledgeLibrary(ctx, rule)
		r.knowledgeLibraries[ruleHash] = knowledgeLibrary
	}
	r.mx.Unlock() // Unlock once we are done writing.
}

knowledgeBase := knowledgeLibrary.NewKnowledgeBaseInstance(rule.RuleConfigName(), ruleEngineVersion)
```
However, our client code also participates in spreading this concurrency issue: it takes the `RuleEngineService` value returned by the `rule_engine.New()` function (a copy) and stores a pointer to that copy in a container, like so:

```go
client := rule_engine.New()
c.Put(RuleClient, &client)
```
## Why is this a Problem?
The locking mechanism might seem flawless at first glance. However, the problem lies in how the `Evaluate` method receives the `RuleEngineService` struct. The method is defined with a value receiver, meaning every call copies the struct and all of its fields, and operates on that copy.

Because of the value receiver, each goroutine works with its own copy of our `RuleEngineService`, and therefore with its own `mx` mutex. This is a complete violation of our initial goal of having a single mutex protect concurrent access to the `knowledgeLibraries` map. Worse, copying the struct copies only the map *header*: all the copies still point at the same underlying hash table. So each goroutine dutifully locks its own private (and therefore useless) mutex, then writes into the one shared map, and the runtime aborts with the fatal `concurrent map writes` error.
## The Path to Resolution
The solution follows almost immediately from the diagnosis: use a pointer receiver in the `Evaluate` method:

```go
func (r *RuleEngineService) Evaluate(ctx context.Context, rule RuleConfig, source DataSource) (interface{}, error) {
	// Implementation ...
}
```
The revised method now operates on the original `RuleEngineService` struct directly, through its address. As a result, the mutex and the map belong to the same, single instance of `RuleEngineService`, averting duplication and ensuring correct synchronization. With this change, the shared instance of `RuleEngineService` is locked appropriately whenever `Evaluate` is called, enforcing the correct ordering of reads and writes and thus preventing concurrent map write errors.
The other contributing factor was the client code's usage of `RuleEngineService`. The key is to ensure that the client code, and by extension the rest of the application, interacts with the same `RuleEngineService` instance and the mutex within it. We need to modify the `rule_engine.New()` function so that it returns a pointer to the original `RuleEngineService`. The changes in the `rule_engine` package could look like this:
```go
type RuleEngine struct {
	// Existing code...
}

// Note that New() now returns a pointer to RuleEngine.
func New() *RuleEngine {
	return &RuleEngine{
		// Existing code...
	}
}
```
Now the original `RuleEngineService` can be referenced directly, avoiding the creation of a copy:

```go
client := rule_engine.New()
c.Put(RuleClient, client) // No & needed: client is already a pointer.
```

The client code now operates on the single original `RuleEngineService` rather than on copies. This means the mutex inside `RuleEngineService`, wherever the client code references it, is the one and only mutex, which upholds the synchronization and eradicates the concurrent map write issue.