Lead Site Reliability Engineer

Edtech.com's Summary

McGraw Hill LLC. is hiring a Lead Site Reliability Engineer. The role involves leading a team to build and maintain reliable, high-capacity, and high-performing core infrastructure services, focusing on cloud engineering, observability, DevSecOps, and resiliency engineering to support millions of learners worldwide. The position emphasizes automation, cloud expertise, and team leadership in a DevOps environment.

Highlights

Lead a team to build and support scalable and reliable infrastructure services.
Collaborate with product development teams using a DevOps model to improve automation, reliability, and performance.
Design and maintain infrastructure as code using Terraform, CloudFormation, or AWS CDK.
Utilize extensive AWS cloud services (e.g., ECS, EKS, Lambda, VPC, S3, CloudWatch).
Experience with container orchestration, Kubernetes (EKS), Helm, and service mesh technologies (Istio/Linkerd).
Implement and manage CI/CD pipelines and GitOps tools like ArgoCD or FluxCD, GitHub Actions, or GitLab CI.
Lead observability efforts including telemetry with NewRelic, DataDog, and CloudWatch.
Participate in on-call rotations and lead incident triage when necessary.
Required skills include software engineering experience, infrastructure automation, container orchestration, troubleshooting diverse technologies, and strong communication.
Qualifications: Bachelor's degree in Computer Science or related field, or equivalent experience; proven expertise in large-scale cloud infrastructure and Zero Downtime deployments.

Lead Site Reliability Engineer Full Description

Overview

Lead Site Reliability Engineer

Impact the Moment

Are you looking for a job that makes a positive difference in your life and in the lives of learners and educators across the globe? McGraw-Hill Education believes in inspiring educators and unlocking the potential in every student.

We are hiring a Lead Site Reliability Engineer who will lead a team to build and support reliable, high-capacity, and high-performing core infrastructure services in support of our mission to reimagine learning for millions of students and learners worldwide.

This position is full-time. If you love to build developer tools and automation, know AWS services inside out, have experience with complex and distributed systems, like engineering software solutions to solve cloud-related problems, and understand the human side of managing people, then you will thrive in this position.

About you:

You are a strong communicator, both verbally and in writing.
You have proven experience determining priorities by asking questions that drive toward clear acceptance criteria.
You take ownership for the quality of your individual work but also take pride in what you deliver as a team; your focus is on a successful outcome for all.
You automate first and by default.
You have proven experience in building and managing large-scale systems and tools in AWS in repeatable and maintainable ways (Zero Downtime).
You enjoy writing production software in at least one typed programming language and have built infrastructure using Terraform or CloudFormation. You code like a Software Engineer, but you think like a Product Manager.

Our cloud stack includes:

Cloud: AWS (CloudFront, S3, EC2, ECS, SES, SQS, SNS, Load Balancing, VPC, Config, Systems Manager, Lambda, API Gateway, DB services and many more).
Perform OS hardening, patching, upgrades, and lifecycle management
Cloud: OCI know-how a plus (ExaCS, OCI Compute, Load Balancers, Networking, VCN, Object Storage)
Infrastructure as Code: Terraform
Programming: Python, Golang, Bash, Ansible
Containers: AWS ECS, EKS, OKE
Kubernetes: Must have Kubernetes experience, either with EKS or managing your own Kubernetes clusters
Orchestration: Deep EKS expertise, Helm, Karpenter, CoreDNS, and Service Mesh (Istio/Linkerd).
Infrastructure: Terraform, Terragrunt, or AWS CDK; VPC networking and IAM (IRSA).
CI/CD & GitOps: ArgoCD or FluxCD; GitHub Actions or GitLab CI.
Security: Rapid7, WAF, OPA/Gatekeeper
Web: Apache httpd, Apache Tomcat, Angular
Config Management and provisioning: Ansible, Packer
Telemetry: NewRelic, CloudWatch, DataDog
DevSecOps: Artifactory, Jenkins, CircleCI, SonarQube, JFrog Xray, Control Tower, GitHub Enterprise and more

Your contributions:

Cloud Engineering

Collaborate with product development teams in a DevOps model, designing, deploying, and managing automation tools to enhance predictability and accelerate time to market
Identify the highest-impact opportunities to optimize existing systems, ensuring "right-sized" solutions in consideration of technical and business constraints
Drive initiatives to enhance system reliability and performance
Ensure repeatability, traceability, and transparency of our infrastructure automation (infrastructure-as-code, monitoring-as-code)
Participate in continual learning of the AWS ecosystem, game day scenarios, and professional conferences
Actively monitor AWS costs, using optimization tools to maximize ROI while meeting Service Level Objectives.

Observability Engineering

Own reliability, uptime, system security, cost, operations, capacity, resiliency, and performance, including analysis thereof
Lead initiatives to improve the reliability and stability of applications and platforms, using data-driven analytics to improve service levels
Ensure that the architecture and deployment models are adequately designed to meet SLA commitments
Serve as the primary point of contact during major incidents for your application, and demonstrate the ability to identify and resolve issues that trigger on-call alarms.
Maintain and enhance telemetry systems to improve visibility into application performance and business metrics, ensuring operational workloads are effectively managed
Develop, communicate, collaborate, and monitor standard processes to promote the long-term health and sustainability of operational development tasks

DevSecOps

Support healthy software development practices, including complying with agile software development methodology, building standards for code reviews, work packaging, and continuous delivery
Partner with CyberSecurity and develop plans and automation to respond to new risks and vulnerabilities

Resiliency Engineering

Collaborate with dev teams to identify failure points and blast radius of systems
Validate the effectiveness of monitoring and observability configurations
Coordinate failure injection testing
Observe and document steady-state production levels and growth patterns
Plan and forecast for seasonal growth, communicate trend lines to leadership, and enhance infrastructure scaling plans to accommodate 2x planned load
Coordinate improvements of existing software and infrastructure to meet resiliency goals

Mentor and nurture engineers across varying levels of experience; foster growth by setting high-reaching goals, and providing support to achieve them.

Ability to expand and collaborate across different levels and stakeholder groups.

Document and share knowledge within the organization via internal forums and communities of practice.

Good to have: Kubernetes experience, either with EKS or managing your own Kubernetes clusters

Must have used Terraform to create infrastructure within AWS. Must bring an automation-first mindset to the team.
On-call participation required. You will lead triage bridges when necessary.
Will be expected to monitor customer experience, application metrics such as golden signals/KPIs, and infrastructure health.
Needs to work proactively across team boundaries on a daily basis.

Qualifications

Experience as a software engineer, with practical experience developing, debugging, and deploying enterprise applications
Experience with infrastructure automation technologies, preferably Terraform
Experience in container/container-fleet-orchestration technologies, preferably EKS or ECS
Versatility with troubleshooting diverse sets of hosting technologies: web server platforms, application platforms, operating systems, network components, virtualization technologies, storage, and database platforms.
Experience with continuous-deployment based software development lifecycles (e.g. CI/CD)
Experience with application caching strategies and high concurrency workloads
Strong communication, problem solving, root cause analysis and systems engineering skills
Ability to design and manage escalation response plans that span monitoring, reaction, response, remediation, and retrospection in culturally aligned (proactive, customer-focused, collaborative, data-driven) ways.
Demonstrated expertise building and managing highly scaled production infrastructure in the cloud
BS Degree in Computer Science (or related technical field and/or equivalent industry experience)

Your contribution to the team includes:

Lead cross-functional CloudOps and Engineering teams supporting foundational infrastructure services, including Config Management, Continuous Integration, Continuous Delivery, DevSecOps tools, AMI management (Linux/Windows), User Management, Networking, and Security services, delivering meaningful impact to the business.
Lead initiatives involving system design and provisioning, reliability, observability and monitoring, self-service tool development, cost optimization, incident response and chaos engineering.
Ownership of reliability, telemetry, security, cost, operations, and performance analysis thereof.
Plan and implement future technology roadmap of core services
Help implement effective engineering practices and processes.
Mentor and nurture engineers across varying levels of experience; foster growth by setting high-reaching goals, providing support as needed to achieve them.
Identify the highest-impact opportunities to optimize existing systems, ensuring "right-sized" solutions in consideration of technical and business constraints

Why work for us?
At McGraw Hill, we believe in creating a workplace where employees feel valued, supported, and empowered to make a difference. As part of our team, you'll enjoy opportunities to work on cutting-edge technologies and impactful projects, a collaborative and inclusive work environment, access to professional development and growth opportunities, and competitive compensation and benefits packages.

McGraw Hill recruiters always use an "@mheducation.com" or "@careers.mheducation.com" email address and/or contact you through our Applicant Tracking System, iCIMS. Any variation of these email domains should be considered suspicious. Additionally, McGraw Hill recruiters and authorized representatives will never request sensitive information by email.

50382

McGraw Hill uses an automated employment decision tool (AEDT) to assist in the screening process by recommending candidates with "like skills" based on resume and job data. To request an alternative screening process, please select "Opt-Out" when asked to "Consent to use of Automated Employment Decision Tools" during the application.
