How do you come to grips with all of the code engineers are committing, pushing, merging, and deploying within your organization? Have you even started looking at that data? If not, you’re missing out on a crucial source of productivity, security, and Software Development Life Cycle (SDLC) data. But how can you get a handle on all of that code-related activity? By leveraging a data model, all of the code-related data produced by any number of code versioning systems like GitHub, GitLab, and others can be unified and streamlined into a single source of data!
This post is part of a series of posts devoted to modeling DevOps data into a common set of mappings in pursuit of recommendations by the NSA for securing CI/CD systems regardless of tooling. As shown by the popularity of common information models for security, a data model for a particular domain can be highly effective in reducing complexity and increasing visibility into the murky waters of a “data lake.” Specifically, this post is about the commonalities in the “Code” section of the SDLC – so let's dive right in and take a look.
Collaborating on code within a software project requires more than writing code; it also requires a system to manage versioning. Tools like GitHub, GitLab, and many others serve this purpose all over the software industry. And while there is some variety in tooling, code versioning systems share a great deal of similarity. Because all of these tools rely on the git concepts of code commits, code pushes, and pull or merge requests, the commonalities among those objects can provide the basis for a code data model.
These commonalities hold for GitHub, GitLab, Bitbucket, or really any code versioning system built on top of git concepts. Data models for this kind of data can use these commonalities as fields to map data from disparate sources into a more unified and usable data source. For example, if GitHub calls its commit hash `commit{}.id` and GitLab calls it `commits{}.id`, both of those fields can simply be mapped to `commit_hash` within the data model. Similarly, because every push contains a list of the commits included in it, that list can be unified under the field name `commits_list`, regardless of whether it is named differently in events from different vendors.
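To make that mapping concrete, here is a minimal sketch in Python. The payload shapes, the `normalize_push` helper, and fields like `pushed_by` and `branch` are illustrative assumptions for this post, not any vendor's actual webhook schema:

```python
# A minimal sketch of mapping vendor-specific push events onto the unified
# code data model fields described above. Payload shapes are simplified
# illustrations, not complete GitHub/GitLab webhook schemas.

def normalize_push(event: dict, vendor: str) -> dict:
    """Map a vendor push event onto the shared data model fields."""
    if vendor == "github":
        commits = event.get("commit", [])      # GitHub: commit{}.id
    elif vendor == "gitlab":
        commits = event.get("commits", [])     # GitLab: commits{}.id
    else:
        raise ValueError(f"unsupported vendor: {vendor}")

    commit_hashes = [c["id"] for c in commits]
    return {
        "vendor": vendor,
        "commit_hash": commit_hashes[-1] if commit_hashes else None,  # head commit of the push
        "commits_list": commit_hashes,                                # every commit in the push
        "pushed_by": event.get("pusher") or event.get("user_username"),
        "branch": event.get("ref"),
    }


# Example: two differently shaped events end up with identical field names.
github_push = {"ref": "refs/heads/main", "pusher": "shiftyEyesJones",
               "commit": [{"id": "a1b2c3"}, {"id": "d4e5f6"}]}
gitlab_push = {"ref": "refs/heads/main", "user_username": "svc-deploy",
               "commits": [{"id": "0f9e8d"}]}

print(normalize_push(github_push, "github")["commits_list"])  # ['a1b2c3', 'd4e5f6']
print(normalize_push(gitlab_push, "gitlab")["commit_hash"])   # 0f9e8d
```

Once events from every vendor land in this shape, downstream searches and dashboards only ever need to know the unified field names.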
But why is this useful, you may ask? Would you want to know if someone, especially an unrecognized account, was pushing code directly to your main branches? That seems like a pretty important question to be able to answer, right? Now, even if you’re using multiple vendors, you can easily find that sort of data using the unified data model fields!
Figure 1-1. A simple Splunk search shows you who’s pushing code to main branches. It looks like those service accounts are fine… But someone should check in on what that shiftyEyesJones account has been doing!
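Outside of a Splunk search, the same question can be asked of any store of events that have been normalized to the shared field names. A small illustrative sketch, where the allow-list of service accounts and the record shape are hypothetical:

```python
# Flag direct pushes to main branches from accounts we don't recognize.
# The pushes below are assumed to already be normalized to the unified
# field names (branch, pushed_by, commits_list); the allow-list is made up.
KNOWN_SERVICE_ACCOUNTS = {"svc-deploy", "svc-release-bot"}

def suspicious_main_pushes(pushes):
    for push in pushes:
        if push["branch"].endswith("/main") and push["pushed_by"] not in KNOWN_SERVICE_ACCOUNTS:
            yield push

pushes = [
    {"branch": "refs/heads/main", "pushed_by": "svc-deploy", "commits_list": ["0f9e8d"]},
    {"branch": "refs/heads/main", "pushed_by": "shiftyEyesJones", "commits_list": ["a1b2c3", "d4e5f6"]},
]
for push in suspicious_main_pushes(pushes):
    print(f"{push['pushed_by']} pushed {len(push['commits_list'])} commit(s) directly to main")
# shiftyEyesJones pushed 2 commit(s) directly to main
```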
This code-level data model can even be tied together with the work planning data model (see Part 1 of this series). Code versioning tools can reference issue or ticket numbers in their commit messages and pull/merge requests fairly easily. Some tools, like GitLab, GitHub, and Bitbucket, provide specific features for referencing planning objects. But even if your tool does not, the link can still be established by explicitly including the work item’s number in a commit message or pull/merge request description and parsing it out.
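For example, a lightweight way to infer that link is to parse the ticket reference out of the text itself. The regex and example ticket formats in this sketch are assumptions, since your issue numbering convention may differ:

```python
import re

# A hedged sketch of inferring the work planning link by parsing an issue
# reference (e.g. "PROJ-123" or "#456") out of a commit message or PR
# description, so it can populate the data model's issueNumber field.
ISSUE_PATTERN = re.compile(r"(?:#(\d+)|\b([A-Z][A-Z0-9]+-\d+)\b)")

def extract_issue_number(text: str):
    match = ISSUE_PATTERN.search(text or "")
    if not match:
        return None
    return match.group(1) or match.group(2)

print(extract_issue_number("Fix login timeout, closes #456"))     # 456
print(extract_issue_number("PROJ-123: harden token validation"))  # PROJ-123
print(extract_issue_number("refactor helpers"))                   # None (unplanned work?)
```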
Having alignment between work planning and code data using data models provides all sorts of value in terms of knowing what is being worked on, when, and where. With just a quick search it’s easy to graph, over time, how many pull or merge requests have been opened for planned work tickets. Digging even further can help establish how much planned vs. unplanned work was done and how much bug fixing or feature development is happening!
Figure 1-2. Teams can easily reference work planning tickets (issueNumber) in their pull/merge requests, allowing even deeper insight into code. This chart shows how many work tickets/issues were referenced in PRs for each of the listed repositories.
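The chart in Figure 1-2 boils down to a simple aggregation over the unified fields. A quick sketch, assuming hypothetical normalized PR records with `repo` and `issueNumber` fields:

```python
from collections import Counter

# A small sketch of the idea behind Figure 1-2: count how many PRs in each
# repository reference a work planning ticket. `prs` is a hypothetical list
# of normalized pull/merge request records.
def planned_work_by_repo(prs):
    return Counter(pr["repo"] for pr in prs if pr.get("issueNumber"))

prs = [
    {"repo": "payments-api", "issueNumber": "PROJ-101"},
    {"repo": "payments-api", "issueNumber": None},          # unplanned work
    {"repo": "web-frontend", "issueNumber": "PROJ-207"},
]
print(planned_work_by_repo(prs))  # Counter({'payments-api': 1, 'web-frontend': 1})
```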
By aligning the fields and data from your code versioning tools you’ll start to see immediate value. Questions of all sorts immediately bubble to mind around productivity, developer experience, security, and even customer interactions, and they can be answered quickly by leveraging the data model for code. The value is especially large when this code data model is paired with the work planning data model, linking work planning to code and on into future parts of the data model like code releases and pipeline runs (coming in future installments of this blog series).
As I’ve stated before, the linking of the various SDLC components (planning to code, test, and release in this case) is perhaps the greatest value of a data model for DevOps. In this blog we’ve demonstrated how linking common fields like `issueNumber` from work planning to `commits_list` from code pushes and PRs creates a steel thread that can unlock previously unexplored relationships in the data. That commit hash data even provides a way to connect code through the build, test, deployment, and monitoring phases of the Software Development Life Cycle. This sort of connection works backwards as well as forwards. For example, by knowing the latest commit hash of a build running in production (by including it as a log or metric attribute, for example), operations and development teams have a precise definition of what code is being monitored, what PR it came from, and who committed it. Security or DevSecOps teams can also use this information to know whether the currently deployed code contains the commits that fixed an important vulnerability.
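As a hedged illustration of that backwards connection, here is a small Python sketch that stamps a commit hash onto every log line. The `GIT_COMMIT` environment variable and the JSON log shape are assumptions about how your build pipeline might inject this information, not a prescribed Splunk integration:

```python
import json
import logging
import os

# Stamp every log line with the commit hash of the running build so that
# monitoring data can be traced back to a specific commit, PR, and author.
# GIT_COMMIT is an assumed environment variable set at build/deploy time.
GIT_COMMIT = os.environ.get("GIT_COMMIT", "unknown")

class CommitHashFilter(logging.Filter):
    def filter(self, record):
        record.commit_hash = GIT_COMMIT
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    json.dumps({"level": "%(levelname)s", "message": "%(message)s",
                "commit_hash": "%(commit_hash)s"})))
logger = logging.getLogger("app")
logger.addFilter(CommitHashFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment service started")
# {"level": "INFO", "message": "payment service started", "commit_hash": "<GIT_COMMIT or unknown>"}
```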
Want to hear more? Interested in the next steps of a DevOps data model that helps unify the data between Jenkins, GitHub, GitLab and other deployment pipeline tools? Excited about DevSecOps and how DevOps code, build, and pipeline data can enhance your security posture? Tune in next time to find out more about a data model for DevOps pipelines and how it can help enhance your business resiliency, security, and observability.
Interested in how Splunk does Data Models? You can sign up to start a free trial of Splunk Cloud Platform and start Splunking today!
This blog post was authored by Jeremy Hicks, Staff Observability Field Innovation Engineer at Splunk with special thanks to: Doug Erkkila, David Connett, Chad Tripod, and Todd DeCapua for collaborating, brainstorming, and helping to develop the concept of a DevOps data model.