As a follow-up to my post on writing the Splunk metric proxy in Go, this blog post looks at how we go about optimizing Go code, using a real-life example from our production ingest service. This is a generalizable pattern we follow when optimizing various parts of our codebase.
Splunk Infrastructure Monitoring is an advanced monitoring platform for modern applications, built on streaming analytics and alerting for time series data such as application metrics (“How many requests did my application get?”) and system-level metrics (“How much network traffic is my Linux machine using?”). The data we consume from our users is both high-volume and high-resolution, passing through our ingest service before reaching other parts of Splunk.
What is pprof?
Pprof is the standard profiling tool built into Go. Via HTTP, the pprof package can expose both memory and CPU usage across your application, as well as the running command line and contention information.
How to enable pprof:
You can enable pprof by adding the line import _ "net/http/pprof" to your application and starting the default HTTP server. Another option, which is what we do at Splunk so that we explicitly control where the handler is attached, is to enable specific pprof endpoints in your code with something like the following.
import "github.com/gorilla/mux" import "net/http/pprof" var handler *mux.Router // ... handler.PathPrefix("/debug/pprof/profile").HandlerFunc(pprof.Profile) handler.PathPrefix("/debug/pprof/heap").HandlerFunc(pprof.Heap)
What to execute:
curl http://ingest58:6060/debug/pprof/profile > /tmp/ingest.profile
go tool pprof ingest /tmp/ingest.profile
(pprof) top7
What this does:
Go includes a local pprof tool to visualize the output of the profiler. The endpoint /debug/pprof/profile will collect a 30-second CPU profile. What I do above is first save that output to a local file, then run pprof on the saved output. Note that the last parameter of pprof can be a URL directly, rather than a local file. The command top7 will show the 7 functions that consume the most CPU.
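For example, pprof can fetch the profile straight from the endpoint, skipping the intermediate file (the seconds parameter is optional and defaults to 30):

go tool pprof http://ingest58:6060/debug/pprof/profile?seconds=30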
Result:
12910ms of 24020ms total (53.75%)
Dropped 481 nodes (cum <= 120.10ms)
Showing top 30 nodes out of 275 (cum >= 160ms)
      flat  flat%   sum%        cum   cum%
    1110ms  4.62%  4.62%     2360ms  9.83%  runtime.mallocgc
     940ms  3.91%  8.53%     1450ms  6.04%  runtime.scanobject
     830ms  3.46% 11.99%      830ms  3.46%  runtime.futex
     800ms  3.33% 15.32%      800ms  3.33%  runtime.mSpan_Sweep.func1
     750ms  3.12% 18.44%      750ms  3.12%  runtime.cmpbody
     720ms  3.00% 21.44%      720ms  3.00%  runtime.xchg
     580ms  2.41% 23.86%      580ms  2.41%  runtime._ExternalCode
What this means:
None of these functions are directly my own code. However, mallocgc, scanobject, and mSpan_Sweep all point to CPU usage during garbage collection. I could dig deeper into each of these functions, but rather than optimize Go’s garbage collector itself, it’s a better use of time to optimize my use of it; in this case, by creating fewer objects on the heap.
What to execute:
curl http://ingest58:6060/debug/pprof/heap > /tmp/heap.profile
go tool pprof -alloc_objects /tmp/ingest /tmp/heap.profile
(pprof) top3
What this does:
Note that the endpoint I download from is similar to the profiler URL above, but ends in /heap instead. This will give me a summary of heap usage on the machine. I again save it to a file for later comparison. The parameter -alloc_objects will visualize the number of objects allocated across the lifetime of the application.
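For comparison, the same heap profile can also be viewed by objects currently live rather than lifetime totals; both flags below are part of the pprof tool:

go tool pprof -inuse_objects /tmp/ingest /tmp/heap.profile   # objects live right now
go tool pprof -alloc_objects /tmp/ingest /tmp/heap.profile   # objects allocated since process start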
Result:
4964437929 of 7534904879 total (65.89%)
Dropped 541 nodes (cum <= 37674524)
Showing top 10 nodes out of 133 (cum >= 321426216)
      flat  flat%   sum%        cum   cum%
 853721355 11.33% 11.33%  859078341 11.40%  github.com/signalfuse/sfxgo/ingest/tsidcache/tsiddiskcache.(*DiskKey).EncodeOld
 702927011  9.33% 20.66%  702927011  9.33%  reflect.unsafe_New
 624715067  8.29% 28.95%  624715067  8.29%  github.com/signalfuse/sfxgo/ingest/bus/rawbus.(*Partitioner).Partition
What this means:
This means 11.33% of my object allocations happened inside the function EncodeOld of the DiskKey type, which was expected behavior. However, it was very unexpected that Partition() would account for 8.29% of my total allocations, since it should just be doing basic math, so I will focus my search there.
What to execute:
(pprof) list Partitioner.*Partition
What this does:
This prints the source lines of the function I’m interested in, along with counts of which statements inside the function caused heap allocations. This is one of many commands you can execute inside pprof. Another very useful one is peek, which shows the callers and callees of functions. You can get a full list by typing help and experimenting with what you see.
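For instance, peek accepts the same kind of regex that list does:

(pprof) peek Partitioner.*Partition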
Result:
Total: 11323262665
ROUTINE ======================== github.com/signalfuse/sfxgo/ingest/bus/rawbus.(*Partitioner).Partition in /opt/jenkins/workspace/ingest/gopath/src/github.com/signalfuse/sfxgo/ingest/bus/rawbus/partitioner.go
 927405893  927405893 (flat, cum)  8.19% of Total
         .          .     64:	if ringSize == 0 {
         .          .     65:		return 0, ErrUnsetRingSize
         .          .     66:	}
         .          .     67:	var b [8]byte
         .          .     68:	binary.LittleEndian.PutUint64(b[:], uint64(message.Key.(*partitionPickingKey).tsid))
 239971917  239971917     69:	logherd.Debug2(log, "key", message.Key, "numP", numPartitions, "Partitioning")
         .          .     70:	murmHash := murmur3.Sum32(b[:])
         .          .     71:
         .          .     72:	// 34026 => 66
         .          .     73:	setBits := uint(16)
         .          .     74:	setSize := uint32(1 << setBits)
         .          .     75:	shortHash := murmHash & (setSize - 1)
         .          .     76:	smallIndex := int32(shortHash) * int32(k.ringSize) / int32(setSize)
 687433976  687433976     77:	logherd.Debug3(log, "smallIndex", smallIndex, "murmHash", murmHash, "shortHash", shortHash, "Sending to partition")
         .          .     78:	return smallIndex, nil
         .          .     79:}
         .          .     80:
What this means:
This shows that the debug logging calls are causing variables to escape from the stack to the heap. Since this debug logging isn’t strictly needed, I could just remove those lines. But first, let’s confirm this hypothesis. The function logherd.Debug2 is a wrapper that looks like the following, and exists to remove the heap overhead of logrus WithField objects when debug logging isn’t even enabled.
// Debug2 logs 2 key/value pairs and a message. Intended to save the mem alloc that WithField creates
func Debug2(l *logrus.Logger, key string, val interface{}, key2 string, val2 interface{}, msg string) {
    if l.Level >= logrus.DebugLevel {
        l.WithField(key, val).WithField(key2, val2).Debug(msg)
    }
}
It appears from this pprof inspection that passing the integers into Debug2 is what’s causing the allocations. The compiler’s escape analysis can confirm it.
What to execute:
go build -gcflags='-m' . 2>&1 | grep partitioner.go
What this does:
The -m parameter to build prints to stderr the optimization decisions made by the compiler, including whether it is able to allocate a variable on the stack or must move it to the heap. Go will move a variable to the heap if the compiler can’t guarantee that a pointer to it won’t live somewhere else.
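As a toy illustration (the function names here are made up, not from the ingest codebase), converting a value into an interface{} is a common reason a variable escapes, because the compiler can no longer prove where the boxed value will end up:

package main

// sink takes an interface{}, so the compiler must assume the boxed value
// may outlive the call; building with `go build -gcflags='-m'` should
// report that the argument escapes to the heap.
func sink(v interface{}) { _ = v }

func escapes() {
    x := 42
    sink(x) // x is boxed into an interface{} and escapes to the heap
}

func staysOnStack() int {
    y := 42 // y never leaves this function, so it can stay on the stack
    return y
}

func main() {
    escapes()
    _ = staysOnStack()
}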
Result:
./partitioner.go:63: &k.ringSize escapes to heap
./partitioner.go:62: leaking param: k
./partitioner.go:70: message.Key escapes to heap
./partitioner.go:62: leaking param content: message
./partitioner.go:70: numPartitions escapes to heap
./partitioner.go:77: smallIndex escapes to heap
./partitioner.go:77: murmHash escapes to heap
./partitioner.go:77: shortHash escapes to heap
./partitioner.go:68: (*Partitioner).Partition b does not escape
./partitioner.go:71: (*Partitioner).Partition b does not escape
What this means:
The important parts here are the messages for line 77, showing that smallIndex, murmHash, and shortHash all escape to the heap. Because the compiler is creating short-lived heap allocations for them, I’m putting more objects on the heap than I really need.
What to write:
func BenchmarkPartition(b *testing.B) {
    r := rand.New(rand.NewSource(0))
    k := partitionPickingKey{}
    msg := sarama.ProducerMessage{
        Key: &k,
    }
    p := Partitioner{
        ringSize: 1024,
        ringName: "quantizer.ring",
    }
    numPartitions := int32(1024)
    for i := 0; i < b.N; i++ {
        k.tsid = r.Int63()
        part, err := p.Partition(&msg, numPartitions)
        if err != nil {
            panic("Error benchmarking")
        }
        if part < 0 || part >= numPartitions {
            panic("Bench failure")
        }
    }
}
What this does:
This benchmark simply calls Partition b.N times and does a basic check on the return value to make sure the call isn’t simply optimized away. Writing a benchmark is a recommended “before” step when programmers go on optimization hunts, to ensure you’re progressing in the correct direction.
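As a side note, a benchmark can also opt in to allocation reporting itself rather than relying on the -benchmem flag. Here is a minimal self-contained sketch (the boxed-int workload is a hypothetical stand-in, chosen because it mirrors the interface{} boxing that the Debug2 and Debug3 call sites trigger):

package rawbus_test

import "testing"

// box forces a heap allocation by converting an int into an interface{}.
func box(v int) interface{} { return v }

func BenchmarkBoxAllocs(b *testing.B) {
    b.ReportAllocs() // same effect as the -benchmem flag, but always on
    var sink interface{}
    for i := 0; i < b.N; i++ {
        sink = box(i)
    }
    _ = sink
}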
What to execute:
go test -v -bench . -run=_NONE_ -benchmem BenchmarkPartition
What this does:
The -bench argument to go test runs benchmarks matching the regex “.”, and -benchmem makes benchmarks also track average heap usage per iteration. By passing _NONE_ to -run, I save myself a bit of time by telling test to only run unit tests whose names contain the string _NONE_. In other words, run none of the unit tests but all of the benchmarks.
Result:
PASS
BenchmarkPartition-8    10000000           202 ns/op          64 B/op          4 allocs/op
What this means:
Each loop of my benchmark took on average 202ns and, more importantly, allocated 4 objects per operation.
What to write:
@@ -66,7 +65,6 @@ func (k *Partitioner) Partition(message *sarama.ProducerMessage, numPartitions i
 	}
 	var b [8]byte
 	binary.LittleEndian.PutUint64(b[:], uint64(message.Key.(*partitionPickingKey).tsid))
-	logherd.Debug2(log, "key", message.Key, "numP", numPartitions, "Partitioning")
 	murmHash := murmur3.Sum32(b[:])
 
 	// 34026 => 66
@@ -74,7 +72,6 @@ func (k *Partitioner) Partition(message *sarama.ProducerMessage, numPartitions i
 	setSize := uint32(1 << setBits)
 	shortHash := murmHash & (setSize - 1)
 	smallIndex := int32(shortHash) * int32(k.ringSize) / int32(setSize)
-	logherd.Debug3(log, "smallIndex", smallIndex, "murmHash", murmHash, "shortHash", shortHash, "Sending to partition")
 	return smallIndex, nil
 }
What this does:
My fix here was to remove the log lines. I had added them during testing and debugging, and they never got removed. In this case, just removing the log lines is a simple enough solution.
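If the log lines had still been needed, one alternative (a sketch, assuming the logger exposes its Level the way logrus does) is to check the level at the call site, so the interface{} boxing only happens when debug logging is actually on:

// Boxing smallIndex, murmHash, and shortHash into interface{} values now
// only happens inside the branch, so nothing is allocated when debug is off.
if log.Level >= logrus.DebugLevel {
    logherd.Debug3(log, "smallIndex", smallIndex, "murmHash", murmHash, "shortHash", shortHash, "Sending to partition")
}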
What to execute:
go build -gcflags='-m' . 2>&1 | grep partitioner.go
Result:
./partitioner.go:62: &k.ringSize escapes to heap
./partitioner.go:61: leaking param: k
./partitioner.go:61: (*Partitioner).Partition message does not escape
./partitioner.go:67: (*Partitioner).Partition b does not escape
./partitioner.go:68: (*Partitioner).Partition b does not escape
What this means:
Notice that smallIndex, murmHash, and shortHash no longer give messages about escaping to the heap.
What to execute:
go test -v -bench . -run=_NONE_ -benchmem BenchmarkPartition
Result:
PASS
BenchmarkPartition-8    30000000          40.5 ns/op           0 B/op          0 allocs/op
ok      github.com/signalfuse/sfxgo/ingest/bus/rawbus   1.267s
What this means:
Notice that each operation now takes only 40.5ns but, more importantly, there are no allocations per operation. Since it was heap usage I was trying to optimize, this is the important part for me.
Pprof is a very useful tool to dissect performance issues in Go code. By combining this exploration with Go’s built-in benchmarks, you can get real numbers about how much your code changes help or hurt. Unfortunately, performance regressions tend to creep in over time. This makes a good next step, left as an exercise for the reader, to store benchmark results in a database so you can view performance per commit over time.
(Jack presented a talk on this topic at the GoSF meetup.)
Thanks,
Jack Lindamood