Site reliability engineering continues to gain traction in software development and IT. SRE is at the crossroads of software development and IT operations. In Ben Treynor’s words, SRE is “what happens when you ask a software engineer to design an operations function.”
Site reliability engineering is a way for developers to actively build services and functions to improve the resilience of people, processes and technical systems. SRE lives somewhat in the shadows – contributing greatly to the team’s overall productivity and the reliability of the team’s applications and infrastructure. If constantly improving the efficiency and resilience of the software delivery lifecycle appeals to you, then you should look at working in SRE.
We’ve put together this SRE interview guide, useful for both candidates and hiring managers, to help you prepare for your next SRE interview.
A site reliability engineer is essentially a mix of a software developer and a traditional IT operations engineer.
SRE inherently feeds into a forward-thinking, efficient DevOps culture. By taking the time to identify reliability concerns and building a team dedicated to addressing them, you’ve already started to shift reliability and testing further left into the development lifecycle. Additionally, SRE helps feed IT concerns and information back into the development teams – leading to faster, more resilient software development.
SRE helps break the stereotype that developers don’t take accountability for the services they build. Along with DevOps methodologies, SRE helps bridge the gap between IT and developers. And, even if your team still believes in the “throw-it-over-the-wall” mentality between traditional IT and development, SRE teams can still retroactively add value to your systems. By running tests in production and continuously adding new functionality dedicated to resilience, SRE teams constantly find new ways to make people, processes and technology better.
The first question you need to ask yourself is, “Do I want to work as an SRE?” To answer that question, you need to know what you’re getting into. Even before you start interviewing for that next SRE role, you should understand the common responsibilities of a site reliability engineer, including these:
While every engineering and IT organization is built differently, there are a few common questions you can expect during an SRE interview. These questions and explanations will help you prepare when heading into an SRE interview.
The answer to this question will vary from team to team. Generally, this is an opportunity for you to highlight:
Some organizations will have dedicated DevOps teams, while others will simply follow DevOps methodologies. You’ll satisfy the interviewer as long as you’re thoughtful about the way you’ve used SRE in the past and how you see it contributing to overall reliability and efficiency in IT and software development in the future.
(Read more in our DevOps vs. Platform Engineering vs. SRE comparison.)
Like most other job interviews, it’s important to show why you’re excited about the role. SRE isn’t always viewed as the most glamorous role, and many developers shy away from it. So, speak to why you’re excited about building services that improve system reliability and lead to greater customer and employee happiness.
Being part of an SRE team should excite you because you’ll be able to make a large impact that affects everyone from product managers to end users.
At first, this seems like a simple question — but beware: it’s a loaded one. The interviewer wants to determine your ability to analyze your deployment pipeline and make intelligent decisions for changing it. SRE teams are crucial for:
Being able to determine where your team can make the biggest improvements to resilience without drastically affecting employee productivity or process will show that you’re able to problem-solve at a high level.
This is an excellent technical question to determine how you’ve set up monitoring and alerting tools and how you’ve helped define the “healthy” state of a system in the past.
If you want to join an SRE team, you’ll need to understand how you can leverage both internal and external outputs to determine overall system health. Then, you should be able to translate that information into insights and action for IT and engineering teams.
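To make that idea concrete in an interview, it can help to sketch how internal and external signals might combine into a single health verdict. The example below is only an illustration under stated assumptions: the `Signals` fields, the threshold values and the `evaluate_health` function are all hypothetical, and real thresholds would come from your team’s own reliability targets.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical snapshot of one service over a time window."""
    error_rate: float          # internal: fraction of failed requests (0.0 to 1.0)
    p99_latency_ms: float      # internal: 99th percentile latency in milliseconds
    probe_success_rate: float  # external: fraction of synthetic checks that passed


# Illustrative thresholds only (assumptions, not recommendations).
MAX_ERROR_RATE = 0.01
MAX_P99_LATENCY_MS = 500.0
MIN_PROBE_SUCCESS = 0.99


def evaluate_health(s: Signals) -> tuple[str, list[str]]:
    """Return a status string and the list of reasons behind it."""
    reasons: list[str] = []
    external_failing = s.probe_success_rate < MIN_PROBE_SUCCESS

    if s.error_rate > MAX_ERROR_RATE:
        reasons.append(f"error rate {s.error_rate:.2%} above {MAX_ERROR_RATE:.2%}")
    if s.p99_latency_ms > MAX_P99_LATENCY_MS:
        reasons.append(f"p99 latency {s.p99_latency_ms:.0f} ms above {MAX_P99_LATENCY_MS:.0f} ms")
    if external_failing:
        reasons.append(f"probe success {s.probe_success_rate:.2%} below {MIN_PROBE_SUCCESS:.2%}")

    if not reasons:
        return "healthy", reasons
    # A failing external probe means users are likely affected, so it is
    # treated as worse than an internal-only threshold breach.
    return ("unhealthy" if external_failing else "degraded"), reasons
```

The list of reasons is what turns a raw status into something an IT or engineering team can act on. In this sketch, treating an external probe failure as more severe than an internal breach is a design choice, not a rule.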
This is a quick, straightforward question. The interviewer wants to know whether you’re familiar with the languages and technical systems you’ll need to do your job.
Because of SRE’s involvement in so many aspects of the engineering organization and business, it’s important that you can identify human bottlenecks in productivity. With this question, the interviewer is trying to determine how you would go about solving issues between cross-functional teams. Most of the time, it’s as simple as finding ways to improve the communication and visibility across different departments – helping people find the information they need when they need it.
Being a steward for on-call efficiency and quality of life will likely be a core responsibility for any site reliability engineer. So, for any SRE interview, it’s likely you’ll need to show how you would go about setting up a humane on-call experience. What can you do to improve the on-call experience?
Make sure you address this question from the viewpoint that on-call isn’t simply about processes and tooling — but that people need to be a core focus when setting up your on-call rotations and alert rules.
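One way to keep people at the center of that answer is to show how you would measure the load on-call actually places on each engineer. The sketch below is a minimal example under assumptions: the page records are a hypothetical list of `(engineer, timestamp)` pairs, and the definition of off-hours (before 08:00, from 20:00 onward, or weekends) and the `max_off_hours` limit are placeholders you would tune for your own team.

```python
from collections import Counter
from datetime import datetime


def is_off_hours(ts: datetime) -> bool:
    """Assumed definition: nights (before 08:00 or from 20:00) and weekends."""
    return ts.hour < 8 or ts.hour >= 20 or ts.weekday() >= 5


def summarize_pages(pages):
    """Count total and off-hours pages per engineer.

    `pages` is an iterable of (engineer_name, datetime) tuples.
    """
    total = Counter()
    off = Counter()
    for engineer, ts in pages:
        total[engineer] += 1
        if is_off_hours(ts):
            off[engineer] += 1
    return {e: {"total": total[e], "off_hours": off[e]} for e in total}


def overloaded(summary, max_off_hours=2):
    """Return engineers whose off-hours pages exceed the assumed limit."""
    return sorted(e for e, c in summary.items() if c["off_hours"] > max_off_hours)
```

A report like this will not fix on-call by itself, but it gives you evidence for rebalancing rotations or tightening noisy alert rules before people burn out.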
Being an SRE can be one of the most fulfilling roles you’ll ever have on an engineering team. You should have the autonomy to make organizational changes and run experiments that lead to greater reliability in the system. And, many times, you’ll find yourself in a position where you can make the lives of customers and colleagues much better.
You can also expect to learn more in a number of IT and software development disciplines, improving your knowledge of the entire software delivery lifecycle and making you a better developer.
This posting does not necessarily represent Splunk's position, strategies or opinion.