Senior Site Reliability Engineer

Edtech.com's Summary

Ellucian is hiring a Senior Site Reliability Engineer to ensure the reliability, performance, and cost-efficiency of their production systems. The role focuses on leveraging DataDog for observability, managing DevOps practices, conducting incident management and root cause analysis, and optimizing costs across cloud infrastructure and services.

Highlights

Own and improve system reliability, availability, and performance for production environments
Design, implement, and manage monitoring, alerting, and observability using DataDog
Lead incident response efforts including troubleshooting, mitigation, and post-incident reviews
Perform detailed root cause analysis (RCA) and drive permanent resolutions
Partner with engineering and DevOps teams to build scalable, resilient infrastructure
Automate operational processes for efficiency and risk reduction
Analyze and optimize infrastructure and application costs
Define and manage SLIs/SLOs to meet reliability targets
Continuously improve deployment, monitoring, and operational practices
Required skills include 5+ years of experience in Site Reliability Engineering or DevOps, strong hands-on expertise with DataDog, experience with cloud platforms (AWS, Azure, GCP), proficiency in CI/CD and Infrastructure as Code (Terraform), experience with Docker and Kubernetes, scripting ability (Python, Bash), and a proven record of cloud cost optimization
Preferred qualifications include experience with cost management tools, cloud security and compliance, supporting high-availability customer-facing systems, and strong collaboration and communication skills
Benefits include comprehensive health coverage (medical, dental, vision), flexible time off, Thrive Flex Lifestyle Account, 401k with match & BrightPlan, parental leave, charitable days, telemedicine, wellness programs, diversity and inclusion initiatives, employee referral bonuses, and professional development opportunities

Senior Site Reliability Engineer Full Description

About Ellucian

Ellucian powers innovation for higher education, partnering with approximately 3,000 customers across 50 countries, serving more than 21 million students. Ellucian's AI-powered platform, trained on the richest dataset available in higher education, drives efficiency, personalized experiences, and strengthened engagement for all students, faculty and staff. Fueled by decades of experience with a singular focus on the unique needs of learning institutions, the Ellucian platform features best-in-class SaaS capabilities and delivers insights needed now and into the future. These solutions and services span the entire student lifecycle, from data-rich tools for student recruitment, enrollment, and retention to workforce analytics, fundraising, and alumni engagement. Ellucian's innovative solutions, vast ecosystem of partners, and user community of more than 45,000 provide best practices that lead to greater institutional success and better student outcomes.

About the Opportunity

We are seeking a Senior Site Reliability Engineer (SRE) to ensure the reliability, performance, and cost-efficiency of our production systems. This role requires deep expertise in DataDog for observability and will focus on DevOps practices, incident management, root cause analysis, and cost optimization across cloud infrastructure and services.

Where You Will Make an Impact

Own and improve system reliability, availability, and performance for production environments
Design, implement, and manage monitoring, alerting, and observability using DataDog (required)
Lead incident response efforts, including troubleshooting, mitigation, and post-incident reviews
Perform detailed root cause analysis (RCA) and drive permanent resolutions
Partner with engineering and DevOps teams to build scalable, resilient infrastructure
Automate operational processes to improve efficiency and reduce risk
Analyze and optimize infrastructure and application costs
Define and manage SLIs/SLOs to meet reliability targets
Continuously improve deployment, monitoring, and operational practices

What You Will Bring

5+ years of experience in Site Reliability Engineering, DevOps, or similar roles
Mandatory: Strong, hands-on expertise with DataDog (APM, logs, metrics, dashboards, alerting)
Experience with cloud platforms (AWS, Azure, or GCP)
Proficiency in DevOps practices and tools (CI/CD, Infrastructure as Code such as Terraform)
Strong troubleshooting skills and experience conducting root cause analysis in distributed systems
Experience with containers and orchestration (Docker, Kubernetes)
Scripting or programming experience (Python, Bash, or similar)
Proven ability to analyze and optimize cloud costs

Preferred Qualifications

Experience with cost management tools (e.g., AWS Cost Explorer, Azure Cost Management)
Familiarity with cloud security and compliance best practices
Experience supporting high-availability, customer-facing systems
Strong collaboration and communication skills

What Success Looks Like

Improved system reliability and reduced incident frequency
Faster incident detection and resolution (MTTR)
Effective, actionable observability driven by DataDog
Measurable cost savings and optimized infrastructure usage

What makes #Ellucianlife

Comprehensive health coverage: medical, dental, and vision
Flexible time off
Thrive Flex Lifestyle Account (LSA) that allows you to contribute towards your health, financial or learning interests
401k w/ match & BrightPlan - to help you save for the future
Parental Leave
5 charitable days to support the community that supports us
Telemedicine
Wellness: Headspace Care (mental health), Wellbeats (virtual fitness classes), RethinkCare & Wellthy (caregiver support)
Diversity and inclusion programs which provide access to internal employee resource groups
Employee referral bonuses to encourage the addition of great new people to the team
We foster a learning culture with: Education Assistance Program, professional development opportunities

#LI-RB1
#LI-Remote
