Edtech.com's Summary

WGU is hiring a Senior Manager, Site Reliability Engineering to lead the SRE function that keeps critical systems and services reliable, scalable, and resilient. This role directs SRE teams in infrastructure design and operations, drives incident management and automation, and collaborates with engineering and product teams to enhance system reliability and performance.

Highlights

Lead and mentor SRE teams to foster ownership and continuous improvement.
Define and implement service reliability standards including SLOs, SLIs, and SLAs.
Manage incident response processes and conduct root cause analysis.
Drive automation with infrastructure as code and CI/CD pipelines.
Oversee monitoring and observability platforms such as Prometheus, Grafana, and Datadog.
Requires expertise with cloud platforms (AWS, GCP, Azure) and container orchestration (Kubernetes).
Requires 8+ years of software engineering experience and 3+ years leading technical teams.
Bachelor's or Master's degree in Computer Science, Engineering, or equivalent experience.
Salary range between $170,400 and $281,200, with additional benefits including bonuses, insurance, retirement plan, and paid leave.
Collaborate across engineering, product, and security teams to embed reliability throughout the development lifecycle.

Senior Manager, Site Reliability Engineering Full Description

Senior Manager, Site Reliability Engineering

Salt Lake City Office

Full time

If you’re passionate about building a better future for individuals, communities, and our country—and you’re committed to working hard to play your part in building that future—consider WGU as the next step in your career.

Driven by a mission to expand access to higher education through online, competency-based degree programs, WGU is also committed to being a great place to work for a diverse workforce of student-focused professionals. The university has pioneered a new way to learn in the 21st century, one that has received praise from academic, industry, government, and media leaders. Whatever your role, working for WGU gives you a part to play in helping students graduate, creating a better tomorrow for themselves and their families.

The salary range for this position takes into account the wide range of factors that are considered in making compensation decisions including but not limited to skill sets; experience and training; licensure and certifications; and other business and organizational needs.

At WGU, it is not typical for an individual to be hired at or near the top of the range for their position, and compensation decisions are dependent on the facts and circumstances of each case. A reasonable estimate of the current range is:

Grade: Management Technical 715

Pay Range: $170,400.00 - $281,200.00

Job Description

The Senior Manager of Site Reliability Engineering (SRE) leads the function responsible for ensuring that critical systems and services are reliable, scalable, and resilient. The role combines technical leadership with organizational management, directing SRE teams in designing, implementing, and operating infrastructure that supports business needs. This position defines service reliability standards, drives incident response practices, oversees automation initiatives, and partners with other engineering and product teams to balance reliability with delivery velocity. This position’s main objective is to improve reliability, performance, and operational efficiency to ensure our students and faculty are delighted with the fully online educational experience.

Primary Responsibilities

Leads and mentors SRE teams, creating an environment that encourages ownership, collaboration, and continuous improvement.
Establishes the SRE vision, goals, and operational strategies in alignment with organizational objectives.
Defines reliability roadmaps and communicates priorities to engineering and executive stakeholders.
Develops, drives, and supports Service Level Objectives (SLOs), Indicators (SLIs), and Agreements (SLAs) across systems.
Directs incident management processes, including response coordination, root cause analysis, and follow-up actions.
Implements practices that reduce downtime and ensure systems meet availability, scalability, and performance expectations.
Drives adoption of infrastructure as code, CI/CD pipelines, and automated testing to improve operational efficiency.
Oversees monitoring, alerting, and observability systems that provide insight into service health.
Evaluates and implements emerging tools that enhance service reliability and reduce manual toil.
Collects and evaluates system and application data to proactively improve the performance and reliability of the environment.
Partners with software engineering, security, and product teams to integrate reliability into all development lifecycle phases.
Provides senior leadership and other stakeholders with transparent reporting on reliability trends, risks, and improvement initiatives.
Fosters a culture of blameless postmortems and shared accountability for uptime and performance.
Promotes best practices for resilience, scalability, and disaster recovery.
Regularly assesses and improves reliability processes and team workflows.
Stays informed of evolving technologies and practices in SRE, DevOps, AI, Machine Learning, and cloud infrastructure.
Performs other related duties as assigned.

This job description includes a general representation of job requirements rather than a comprehensive inventory of all required responsibilities or work activities. The contents of this document or related job requirements may change at any time with or without notice.

Qualifications

Knowledge, Skills, and Abilities

Strong understanding of distributed systems, cloud-native architectures, and infrastructure design.
Deep familiarity with cloud service providers (AWS, GCP, Azure) and their reliability and security best practices.
Knowledge of software development lifecycles, DevOps principles, and SRE practices such as SLOs, SLIs, and error budgets.
Understanding of networking, storage, and systems performance concepts.
Knowledge of compliance, data security, and regulatory requirements relevant to system reliability and operations.

Skills

Technical proficiency in infrastructure as code, automation frameworks, and modern programming/scripting languages (Python, Go, Bash, etc.).
Expertise in monitoring, logging, and observability platforms (Prometheus, Grafana, Datadog, Splunk, etc.).
Skilled in incident management, root cause analysis, and postmortem processes.
Strong leadership and people management skills, with experience developing and scaling technical teams.
Effective communication skills, including the ability to explain technical concepts to both engineers and executives.
Strong problem-solving, prioritization, and decision-making skills under pressure.

Abilities

Ability to balance short-term operational needs with long-term reliability and scalability goals.
Ability to foster a culture of reliability, accountability, and continuous improvement within technical teams.
Ability to collaborate across engineering, product, and business teams to align reliability efforts with strategic goals.
Ability to anticipate system weaknesses and proactively design resilience into infrastructure and applications.
Ability to lead through influence, driving adoption of SRE practices across the organization.
Ability to adapt to evolving technologies, industry practices, and organizational needs.

Education

Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field, or equivalent professional experience.

Experience

8+ years of experience in Software Engineering/Development with some knowledge of SRE.
3+ years of experience managing or leading technical teams, preferably in a reliability or infrastructure-focused capacity.
Proven track record of delivering reliable, scalable systems in complex environments.
Strong expertise with cloud platforms such as AWS, GCP, or Azure.
Hands-on experience with Kubernetes, container orchestration, and microservices architectures.
Proficiency with infrastructure as code and automation tools (Terraform, Ansible, Pulumi, etc.).
Solid programming or scripting ability in Python, Go, Java, JavaScript, and/or Bash.
Deep understanding of monitoring, logging, and observability systems (e.g., New Relic, Grafana, Datadog, Splunk, Dynatrace).
Experience implementing and managing SLOs, SLIs, and SLAs to measure and improve service reliability.

Leadership Qualifications

Demonstrated ability to build, mentor, and lead high-performing engineering teams.
Strong communication skills with the ability to engage technical teams and executive leadership.
Ability to balance immediate operational demands with long-term reliability strategy.
Experience fostering a blameless culture of incident management and continuous improvement.
Strategic mindset with the ability to align technical priorities to business goals.

*At WGU, it is not typical for an individual to be hired at or near the top of the range for their position, and compensation decisions are dependent on the facts and circumstances of each case. A reasonable estimate of the current range is:

*Pay Range: $170,400.00 - $281,200.00

Experience in lieu of education

An equivalent combination of training, experience, credentials, or accomplishments demonstrating the ability to perform the essential functions of this job may substitute for educational degree requirements.

Position & Application Details

Full-Time Regular Positions: This is a full-time, regular position (classified for 40 standard weekly hours) that is eligible for bonuses; medical, dental, vision, telehealth and mental healthcare; health savings account and flexible spending account; basic and voluntary life insurance; disability coverage; accident, critical illness and hospital indemnity supplemental coverages; legal and identity theft coverage; retirement savings plan; wellbeing program; discounted WGU tuition; flexible paid time off for rest and relaxation with no need for accrual; flexible paid sick time with no need for accrual; 11 paid holidays; and other paid leaves, including up to 12 weeks of parental leave.

How to Apply: Interested candidates must submit an application online. Internal WGU employees must apply through the internal job board in Workday.

Additional Information

Disclaimer: The job posting highlights the most critical responsibilities and requirements of the job. It’s not all-inclusive.

Accommodations: Applicants with disabilities who require assistance or accommodation during the application or interview process should contact our Talent Acquisition team at recruiting@wgu.edu.

Equal Employment Opportunity: All qualified applicants will receive consideration for employment without regard to any protected characteristic as required by law.
