“How can we turn our DevOps data into useful DevSecOps data? There is so much of it! It can come from anywhere! It’s in all sorts of different formats!”
While these statements are all true, the different parts of the DevOps lifecycle share enough similarities to make sense of and unify all of that data. How can we bring order to this data chaos? The same way scientists study complex phenomena: by building a conceptual model of the data. Modeling the concepts shared across the segments of the DevOps lifecycle, from planning through release, will increase the observability of your DevOps processes and enable DevSecOps practices.
Being able to make use of all of our DevOps data is imperative to any work in the field of DevSecOps. The ability to understand and utilize data from any tool or vendor is so important that it is called out in the first paragraph of the NSA’s Defending Continuous Integration/Continuous Delivery (CI/CD) Environments document. That document focuses on integrating “security best practices into typical software development and operations (DevOps) Continuous Integration/Continuous Delivery (CI/CD) environments, without regard for the specific tools being adapted”, which is a prime concern for DevSecOps practices.
This series of blog posts on modeling DevOps data will establish common concepts and links across the stages of the SDLC that can be leveraged in a DevOps data model. That data model can then be used to unify DevOps data as a whole and make it more straightforward to use alongside more traditional security-focused data models (e.g., the Splunk Common Information Model, or CIM). So let’s start by breaking down some elements and commonalities in the stages of the DevOps lifecycle, beginning with Work Planning here and moving through Code, Build, and Release in follow-up blog posts.
Planning work for software projects is usually concerned with Projects, Epics, Issues, Tickets, and so on. Regardless of the tool, Agile work planning platforms all share some common concepts due to the nature of Agile methodologies and their use in software organizations. Epics, Issues, Tasks, Tickets, and the like are the objects that form the basis of most Agile workflows, be they Scrum, Kanban, or somewhere in between. These objects allow us to identify commonalities within the data regardless of that data’s source.
These commonalities hold true for Jira, GitHub Projects, Trello, or what have you. Data models for this kind of data can use these commonalities as fields to map data from disparate sources into a more unified and usable data source. For example, if Jira calls the object identifier an “IssueID”, GitHub calls it an “IssueNumber”, and Trello calls it a “CardNumber”, all three of those fields can simply be mapped to “issueNumber” within the data model. Objects from each of those tools can then be found with searches on the same “issueNumber” field. Adding a unified repository identification field such as “repository_name” provides the flexibility to easily see all of the work objects for a given repository during a specified period of time.
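As a rough sketch of that mapping in Python (the record shapes and FIELD_MAP entries below are invented for illustration and are not the actual Jira, GitHub, or Trello export formats):

```python
# A minimal, hypothetical sketch of mapping tool-specific fields onto a
# unified data model. The raw record shapes and FIELD_MAP entries are
# illustrative assumptions, not real Jira/GitHub/Trello export formats.
FIELD_MAP = {
    "jira":   {"IssueID": "issueNumber", "Project": "repository_name"},
    "github": {"IssueNumber": "issueNumber", "Repo": "repository_name"},
    "trello": {"CardNumber": "issueNumber", "Board": "repository_name"},
}

def normalize(record: dict, source: str) -> dict:
    """Rename tool-specific keys to the unified data model field names."""
    mapping = FIELD_MAP[source]
    return {mapping.get(key, key): value for key, value in record.items()}

raw = [
    ({"IssueID": "PROJ-42", "Project": "payments"}, "jira"),
    ({"IssueNumber": 7, "Repo": "payments"}, "github"),
    ({"CardNumber": 19, "Board": "payments"}, "trello"),
]

unified = [normalize(record, source) for record, source in raw]
# Every record is now queryable by the same "issueNumber" and
# "repository_name" fields, regardless of which tool produced it.
print(unified)
```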
Figure 1-1. A simple Splunk search can show the work done in the past year in various repos using various tools. Additions to the search could include using a unified field like status_current to see only tickets that were closed in the last year, or that are currently open.
What’s more, when planned work relates to code repositories, adding the repository name, branch, and other details to your planning objects becomes incredibly useful! Some tools provide built-in ways to reference planning objects from code; for those that do not, the link can be inferred by parsing the work object’s number out of a commit message or pull request description.
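For illustration, a minimal Python sketch of that parsing might look like the following; the “PROJ-123” and “#123” patterns are common Jira-key and GitHub-reference conventions, but your tools’ conventions may differ:

```python
import re

# A hedged sketch: infer planning-object references by parsing commit
# messages. The patterns below cover common Jira-key ("PROJ-123") and
# GitHub-reference ("#123") conventions; adjust them to your tools.
ISSUE_PATTERN = re.compile(r"\b([A-Z][A-Z0-9]+-\d+)\b|(?:^|\s)#(\d+)\b")

def extract_issue_refs(message: str) -> list:
    """Return all planning-object identifiers referenced in a message."""
    refs = []
    for jira_key, gh_number in ISSUE_PATTERN.findall(message):
        refs.append(jira_key or gh_number)
    return refs

print(extract_issue_refs("PROJ-42: fix rounding error, closes #7"))
# -> ['PROJ-42', '7']
```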
With that repository data referencing planning objects in code-level events like commit messages and pull requests, a line can be traced from work to code, from inception to delivery, and from idea to reality. Even further, the commonalities of code data can be harnessed to trace the Software Development Life Cycle (SDLC) through build and release. Repository names, issue IDs, commit hashes, branches, merge events, and other code-specific data provide the linkages required to know what code is being built, released, and even running in your software environment.
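A small, hypothetical sketch of that linkage: once planning events and commit events both carry the unified issueNumber field, tracing work to code reduces to a simple join (the in-memory lists below stand in for real indexed data):

```python
# Hypothetical unified events; in practice these would come from your
# planning tool and your source control system, already normalized.
planning_events = [
    {"issueNumber": "PROJ-42", "repository_name": "payments",
     "status_current": "closed"},
]
commit_events = [
    {"issueNumber": "PROJ-42", "repository_name": "payments",
     "commit_hash": "9fceb02", "branch": "fix/rounding"},
]

# Join the two sources on the unified issueNumber field.
by_issue = {event["issueNumber"]: event for event in planning_events}
for commit in commit_events:
    work_item = by_issue.get(commit["issueNumber"])
    if work_item:
        # The line from idea to reality: planning object -> commit hash.
        print(work_item["issueNumber"], "->", commit["commit_hash"],
              "on branch", commit["branch"])
```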
By creating these sorts of data model mappings for just the work planning data listed above, tracking the Who, What, Where, When, and Why of a piece of work becomes more straightforward and completely tool-agnostic. Regardless of whether Team A uses Jira and Team B uses GitHub, the data is now unified and usable! This is immediate value provided with just planning data, and it can answer questions like:

- Who worked on a given piece of work, and when?
- What work was done in a given repository over a given period of time?
- When was a ticket opened, and when was it closed?
- Why was a change made, and which work object motivated it?
These are all fairly straightforward questions to ask and answer once this sort of DevOps planning data is in your organization’s toolbox.
Adding data from other stages of the DevOps toolchain can paint an even more complete picture of how your software organization functions. By utilizing the commonalities in the data, it becomes possible to trace and derive metrics from each stage of the software lifecycle, including details such as time to review, time to test, and total time from idea to released code.
Figure 1-2. A unified model for thinking about DevOps opens the door to detailed analysis of the Software Development Lifecycle from work planning to released code across arbitrary tools or platforms.
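As a rough sketch of that kind of metric derivation (the stage names and timestamps below are invented for illustration), unified identifiers make lifecycle durations simple subtractions:

```python
from datetime import datetime

# Hypothetical stage timestamps for one work item, keyed by the unified
# issueNumber. Real events would come from planning, code review, CI,
# and release tools, joined on the unified linking fields.
stage_times = {
    "PROJ-42": {
        "planned":  datetime(2023, 5, 1, 9, 0),
        "reviewed": datetime(2023, 5, 3, 15, 30),
        "tested":   datetime(2023, 5, 4, 11, 0),
        "released": datetime(2023, 5, 5, 17, 45),
    },
}

for issue, stages in stage_times.items():
    time_to_review = stages["reviewed"] - stages["planned"]
    time_to_test = stages["tested"] - stages["reviewed"]
    idea_to_release = stages["released"] - stages["planned"]
    print(f"{issue}: review={time_to_review}, test={time_to_test}, "
          f"idea-to-release={idea_to_release}")
```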
The linking of the various SDLC components (planning to code, test, and release in this case) is perhaps the greatest value of a data model for DevOps. Using commonalities like issue number, repository name, branch, commit hash, and merge hash as linking fields across the entire lifecycle of software development, from work planning through operation and monitoring, provides a deeper understanding of what is going on in our DevOps processes. It also feeds directly into DevSecOps practices, which require this sort of unified data to thrive and to protect the organization’s software.
Want to hear more? Want to know what sort of data GitLab, GitHub, Bitbucket, and other code-related tools provide, and what that data has in common? Excited about DevSecOps and how DevOps code data meets security? Tune in to the next blog post to learn more about the commonalities of code-related DevOps data and how unifying that data can build on top of Agile planning data to enhance your business resiliency, security, and observability.
Interested in how Splunk does Data Models? You can sign up to start a free trial of Splunk Cloud Platform and see it for yourself today!
This blog post was authored by Jeremy Hicks, Staff Observability Field Innovation Solutions Engineer at Splunk, with special thanks to Doug Erkkila, David Connett, Chad Tripod, and Todd DeCapua for collaborating, brainstorming, and helping to develop the concept of a DevOps data model.