How do you come to grips with all of the code engineers are committing, pushing, merging, and deploying within your organization? Have you even started looking at that data? If not, you’re missing out on a crucial source of productivity, security, and Software Development Life Cycle (SDLC) data. But how can you get a handle on all of that code-related activity? By leveraging a data model, all of the code-related data produced by any number of code versioning systems like GitHub, GitLab, and others can be unified and streamlined into a single source of data!
This post is part of a series of posts devoted to modeling DevOps data into a common set of mappings in pursuit of recommendations by the NSA for securing CI/CD systems regardless of tooling. As shown by the popularity of common information models for security, a data model for a particular domain can be highly effective in reducing complexity and increasing visibility into the murky waters of a “data lake.” Specifically, this post is about the commonalities in the “Code” section of the SDLC – so let's dive right in and take a look.
Collaborating on code within a software project requires more than writing code; it also requires a system to manage versioning. Tools like GitHub, GitLab, and many others serve this purpose all over the software industry. And while there is some variety in tooling, code versioning systems share a great deal of similarity. Because all of these tools rely on the git concepts of code commits, code pushes, and pull or merge requests, the commonalities among those objects can provide the basis for a code data model.
These commonalities hold for GitHub, GitLab, Bitbucket, or really any code versioning system built on top of git concepts. Data models for this kind of data can use these commonalities as fields to map data from disparate sources into a more unified and usable data source. For example, if GitHub calls its commit hash `commit{}.id` and GitLab calls it `commits{}.id`, both of those fields can simply be mapped to `commit_hash` within the data model. Similarly, because every push contains a list of the commits included in it, that list can be unified under the field name `commits_list`, regardless of whether it is named differently in events from different vendors.
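To make that mapping concrete, here is a minimal sketch in Python. The payload shapes, the `normalize_push` helper, and fields like `pushed_by` and `branch` are illustrative assumptions for this post, not any vendor's actual webhook schema:

```python
# A minimal sketch of mapping vendor-specific push events onto the unified
# code data model fields described above. Payload shapes are simplified
# illustrations, not complete GitHub/GitLab webhook schemas.

def normalize_push(event: dict, vendor: str) -> dict:
    """Map a vendor push event onto the shared data model fields."""
    if vendor == "github":
        commits = event.get("commit", [])      # GitHub: commit{}.id
    elif vendor == "gitlab":
        commits = event.get("commits", [])     # GitLab: commits{}.id
    else:
        raise ValueError(f"unsupported vendor: {vendor}")

    commit_hashes = [c["id"] for c in commits]
    return {
        "vendor": vendor,
        "commit_hash": commit_hashes[-1] if commit_hashes else None,  # head commit of the push
        "commits_list": commit_hashes,                                # every commit in the push
        "pushed_by": event.get("pusher") or event.get("user_username"),
        "branch": event.get("ref"),
    }


# Example: two differently shaped events end up with identical field names.
github_push = {"ref": "refs/heads/main", "pusher": "shiftyEyesJones",
               "commit": [{"id": "a1b2c3"}, {"id": "d4e5f6"}]}
gitlab_push = {"ref": "refs/heads/main", "user_username": "svc-deploy",
               "commits": [{"id": "0f9e8d"}]}

print(normalize_push(github_push, "github")["commits_list"])  # ['a1b2c3', 'd4e5f6']
print(normalize_push(gitlab_push, "gitlab")["commit_hash"])   # 0f9e8d
```

Once events from every vendor land in this shape, downstream searches and dashboards only ever need to know the unified field names.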
But why is this useful, you may ask? Would you want to know if someone, especially an unrecognized account, was pushing code directly to your main branches? That seems like a pretty important question to be able to answer, right? Now, even if you’re using multiple vendors, you can easily find that sort of data using the unified data model fields!
Figure 1-1. A simple Splunk search shows you who’s pushing code to main branches. It looks like those service accounts are fine… But someone should check in on what that shiftyEyesJones account has been doing!
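Outside of a Splunk search, the same question can be asked of any store of events that have been normalized to the shared field names. A small illustrative sketch, where the allow-list of service accounts and the record shape are hypothetical:

```python
# Flag direct pushes to main branches from accounts we don't recognize.
# The pushes below are assumed to already be normalized to the unified
# field names (branch, pushed_by, commits_list); the allow-list is made up.
KNOWN_SERVICE_ACCOUNTS = {"svc-deploy", "svc-release-bot"}

def suspicious_main_pushes(pushes):
    for push in pushes:
        if push["branch"].endswith("/main") and push["pushed_by"] not in KNOWN_SERVICE_ACCOUNTS:
            yield push

pushes = [
    {"branch": "refs/heads/main", "pushed_by": "svc-deploy", "commits_list": ["0f9e8d"]},
    {"branch": "refs/heads/main", "pushed_by": "shiftyEyesJones", "commits_list": ["a1b2c3", "d4e5f6"]},
]
for push in suspicious_main_pushes(pushes):
    print(f"{push['pushed_by']} pushed {len(push['commits_list'])} commit(s) directly to main")
# shiftyEyesJones pushed 2 commit(s) directly to main
```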
This code-level data model can even be tied together with the work planning data model (see Part 1 of this series). Code versioning tools can reference issue or ticket numbers in their commit messages and pull/merge requests fairly easily. Some tools, like GitLab, GitHub, and Bitbucket, provide specific features for referencing planning objects. But even if your tool does not, the link can still be established by explicitly including the work item’s number in a commit message or pull/merge request description and parsing it out.
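For example, a lightweight way to infer that link is to parse the ticket reference out of the text itself. The regex and example ticket formats in this sketch are assumptions, since your issue numbering convention may differ:

```python
import re

# A hedged sketch of inferring the work planning link by parsing an issue
# reference (e.g. "PROJ-123" or "#456") out of a commit message or PR
# description, so it can populate the data model's issueNumber field.
ISSUE_PATTERN = re.compile(r"(?:#(\d+)|\b([A-Z][A-Z0-9]+-\d+)\b)")

def extract_issue_number(text: str):
    match = ISSUE_PATTERN.search(text or "")
    if not match:
        return None
    return match.group(1) or match.group(2)

print(extract_issue_number("Fix login timeout, closes #456"))     # 456
print(extract_issue_number("PROJ-123: harden token validation"))  # PROJ-123
print(extract_issue_number("refactor helpers"))                   # None (unplanned work?)
```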
Having alignment between work planning and code data using data models provides all sorts of value in terms of knowing what is being worked on, when, and where. With just a quick search it’s easy to graph, over time, how many pull or merge requests have been opened for planned work tickets. Digging even further can help establish how much planned vs. unplanned work was done and how much bug fixing or feature development is happening!
Figure 1-2. Teams can easily reference work planning tickets (issueNumber) in their pull/merge requests, allowing even deeper insight into code. This chart shows how many work tickets/issues were referenced in PRs for each of the listed repositories.
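The chart in Figure 1-2 boils down to a simple aggregation over the unified fields. A quick sketch, assuming hypothetical normalized PR records with `repo` and `issueNumber` fields:

```python
from collections import Counter

# A small sketch of the idea behind Figure 1-2: count how many PRs in each
# repository reference a work planning ticket. `prs` is a hypothetical list
# of normalized pull/merge request records.
def planned_work_by_repo(prs):
    return Counter(pr["repo"] for pr in prs if pr.get("issueNumber"))

prs = [
    {"repo": "payments-api", "issueNumber": "PROJ-101"},
    {"repo": "payments-api", "issueNumber": None},          # unplanned work
    {"repo": "web-frontend", "issueNumber": "PROJ-207"},
]
print(planned_work_by_repo(prs))  # Counter({'payments-api': 1, 'web-frontend': 1})
```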
By aligning the fields and data from your code versioning tools you’ll start to see immediate value. Questions of all sorts immediately bubble to mind around productivity, developer experience, security, and even customer interactions, and they can be answered quickly by leveraging the data model for code. The value is especially large when this code data model is paired with the work planning data model, linking work planning to code and on into future parts of the data model like code releases and pipeline runs (coming in future installments of this blog series).
As I’ve stated before, the linking of the various SDLC components (planning to code, test, and release in this case) is perhaps the greatest value of a data model for DevOps. In this blog we’ve demonstrated how linking common fields like `issueNumber` from work planning to `commits_list` from code pushes and PRs creates a steel thread that can unlock previously unexplored relationships in the data. That commit hash data even provides a way to connect code through the build, test, deployment, and monitoring phases of the Software Development Life Cycle. This sort of connection works backwards as well as forwards. For example, by knowing the latest commit hash of a build running in production (by including it as a log or metric attribute, for example), operations and development teams have a precise definition of what code is being monitored, what PR it came from, and who committed it. Security or DevSecOps teams can also use this information to know whether the currently deployed code contains the commits that fixed an important vulnerability.
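As a hedged illustration of that backwards connection, here is a small Python sketch that stamps a commit hash onto every log line. The `GIT_COMMIT` environment variable and the JSON log shape are assumptions about how your build pipeline might inject this information, not a prescribed Splunk integration:

```python
import json
import logging
import os

# Stamp every log line with the commit hash of the running build so that
# monitoring data can be traced back to a specific commit, PR, and author.
# GIT_COMMIT is an assumed environment variable set at build/deploy time.
GIT_COMMIT = os.environ.get("GIT_COMMIT", "unknown")

class CommitHashFilter(logging.Filter):
    def filter(self, record):
        record.commit_hash = GIT_COMMIT
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    json.dumps({"level": "%(levelname)s", "message": "%(message)s",
                "commit_hash": "%(commit_hash)s"})))
logger = logging.getLogger("app")
logger.addFilter(CommitHashFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment service started")
# {"level": "INFO", "message": "payment service started", "commit_hash": "<GIT_COMMIT or unknown>"}
```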
Want to hear more? Interested in the next steps of a DevOps data model that helps unify the data between Jenkins, GitHub, GitLab and other deployment pipeline tools? Excited about DevSecOps and how DevOps code, build, and pipeline data can enhance your security posture? Tune in next time to find out more about a data model for DevOps pipelines and how it can help enhance your business resiliency, security, and observability.
Interested in how Splunk does Data Models? You can sign up to start a free trial of Splunk Cloud Platform and start Splunking today!
This blog post was authored by Jeremy Hicks, Staff Observability Field Innovation Engineer at Splunk with special thanks to: Doug Erkkila, David Connett, Chad Tripod, and Todd DeCapua for collaborating, brainstorming, and helping to develop the concept of a DevOps data model.