Frontline Education

Sr. Reliability and Resilience Engineer

🇺🇸 Remote - US

🕑 Full-Time

💰 $125K - $160K

💻 Data Science

🗓️ December 8th, 2025

MongoDB MySQL PostgreSQL

Edtech.com's Summary

Frontline Education is hiring a Sr. Reliability and Resilience Engineer (Data Management) responsible for reducing operational noise and improving system resilience by analyzing operational data and driving long-term corrective actions. This role bridges Data Management Operations and Engineering to enhance availability and stability of data platforms through proactive reliability initiatives.

Highlights
  • Analyze on-call alerts and operational tickets to identify systemic issues and opportunities for optimization.
  • Lead efforts to reduce alert fatigue and improve monitoring signal quality.
  • Collaborate with Operations and Engineering teams to translate problems into project work within an Agile Scrum-Ban framework.
  • Build dashboards and reports summarizing alert activity and incident trends.
  • Drive long-term preventative solutions by managing the root cause analysis lifecycle and incident response.
  • Required technical expertise includes MS SQL Server, MySQL, PostgreSQL (AWS Aurora), MongoDB, Linux, PowerShell, Snowflake, and cloud technologies.
  • Bachelor’s degree in a technical field or equivalent experience, with 5-8+ years in reliability engineering, SRE, or related roles.
  • Strong skills in incident response, root cause analysis, monitoring best practices, and project management across complex initiatives.
  • Salary range: $125,000 to $160,000 plus performance-based incentives.
  • Supports AI-driven organizational goals with opportunities for continuous learning and career growth.

Sr. Reliability and Resilience Engineer Full Description

Sr. Reliability and Resilience Engineer

Location: United States

Description

Sr. Data Management Reliability & Resilience Engineer  
Location: Remote; Hybrid to Wayne, PA; Hybrid to Naperville, IL
 
How You’ll Contribute to Our Mission
 
Frontline has a dynamic career growth opportunity for a Sr. Reliability & Resilience Engineer (Data Mgmt.) to help reduce operational noise, strengthen system resilience, and improve the long-term stability of our diverse data environments. Acting as a strategic bridge between Data Management Operations and Data Management Engineering, this individual will focus on identifying recurring operational issues, such as alert fatigue, systemic support requests, and systemic incidents, and on providing sustainable long-term solutions.
 
This role is responsible for analyzing operational data, identifying systemic reliability risks, and driving long-term corrective actions using our Engineering agile processes. The Sr. Reliability & Resilience Engineer (Data Mgmt.) will play a key role in improving availability, reducing incident frequency, and continuously advancing the reliability of Frontline’s data platforms and services. 
 
How You’ll Drive Success 
  • Conduct deep analysis of on-call alerts to identify noise patterns and opportunities for optimization. 
  • Lead initiatives to reduce alert fatigue and improve monitoring signal quality. 
  • Analyze repeated incidents and operational tickets to uncover systemic problems. 
  • Partner with Operations to translate pain points into Engineering project work. 
  • Drive long-term corrective actions emerging from RCAs. 
  • Build dashboards and reports summarizing alert activity and incident trends. 
  • Identify opportunities to automate high-volume operational tasks. 
  • Collaborate with Engineering Services to design and implement preventative solutions. 
  • Work within an Agile Scrum-Ban framework to deliver resiliency initiatives. 
  • Contribute to cross-team incident and problem reviews. 
Critical Performance Objectives: 
Within the first 3 months of hire you will: 
  • Develop a deep understanding of Frontline’s Data Management workflows. 
  • Complete a baseline alert fatigue analysis. 
  • Complete a baseline incident analysis for the Data Management team.
  • Build initial dashboards summarizing alert and incident trends. 
  • Identify repeat operational issues suitable for long-term remediation. 
  • Establish strong relationships with Operations and Engineering teams. 
After 3 months you should: 
  • Lead data-driven problem identification and improvement initiatives. 
  • Manage the Data Management team's long-term corrective action (LTCA) lifecycle from RCA to deployment.
  • Participate in the incident management response for the Data Management team.
  • Support incident reviews and propose stability-focused solutions. 
  • Participate fully in the Agile Scrum-Ban process. 
  • Recommend alert tuning, automation, and monitoring improvements. 
Long-Term Objectives:
  • Reduce alert fatigue by implementing sustained monitoring improvements. 
  • Drive down recurring incidents through long-term preventative engineering. 
  • Implement proactive reliability programs strengthening platform resilience. 
  • Maintain documentation for alert tuning and reliability improvements. 
  • Lead multi-pillar reliability initiatives across Frontline environments. 
  • Participate in the on-call rotation.

What You Bring to Help Us Grow
  • Strong understanding of database systems, cloud technologies, and operational monitoring. 
  • Deep knowledge of monitoring and alerting best practices. 
  • Experience with incident response and root cause analysis. 
  • Strong analytical and problem-solving skills with focus on preventative actions. 
  • Fast learner who is detail-oriented, decisive, and capable in fast-paced environments.
  • Self-directed with urgency, focus, and discipline. 
  • Excellent communication skills; able to influence cross-functional teams. 
  • Strong project management ability across multiple complex initiatives. 
  • While deep expertise in every technology listed is not expected, the individual must demonstrate strong expertise in several of them and be willing to develop a solid working knowledge across the broader technology stack over time: MS SQL Server, Microsoft operating systems, MySQL, PostgreSQL (AWS Aurora), MongoDB (Atlas), Linux operating systems, Active Directory, SSIS, OLTP data environments, data archiving, Snowflake, wait statistics, SolarWinds DPA/DPM, PowerShell, DB2, Agile methodologies, Amazon cloud database technologies, VMware, and cloud automation.

Education and Experience: 
  • Bachelor’s degree in a technical discipline or equivalent experience. 
  • 5–8+ years of experience in operations, reliability engineering, SRE, database administration, or similar roles. 
  • Proven experience using operational data to drive long-term improvements. 
  • Experience supporting or designing highly available, reliable systems. 

Our Mission, Our People, Our Purpose
At Frontline Education, we’re reimagining what’s possible by becoming an AI-first organization, transforming how we think, work, and serve the educators who shape our schools every day. By using AI in thoughtful, practical ways, we’re creating tools that help educators save time, gain insights, and focus more on what matters most — their students.
 
As part of our team, you’ll be expected and empowered to build and apply AI skillsets that grow with you, because at Frontline Education, technology amplifies what matters most: the human drive to learn, improve, and make a difference.
 
How We Support Growth, Balance, and Well-Being
  • Personalized Time Off: Take time when it’s needed most — whether that’s a family vacation, a reset day, or simply time to rest and refocus.
  • Paid Sick Time: Separate, dedicated sick leave to care for yourself or loved ones.
  • Volunteer Time Off: Paid time to give back and support causes that matter to you.
  • Ten Paid Holidays: Enjoy meaningful moments and traditions throughout the year.
  • Our Philosophy: We believe time away from work helps you bring your best self to it.
Continuous Learning and Growth
  • World-Class Learning Access: Explore thousands of on-demand courses through platforms like LinkedIn Learning.
  • Leadership & Technical Skill Building: Develop new capabilities and chart your own professional path.
  • AI Empowerment: Use OpenAI tools to build fluency with emerging technology and harness AI as a creative partner for innovation and problem-solving.
  • Tuition Reimbursement: Invest in formal education to advance your skills and career.
  • Ongoing Learning Culture: Participate in company-led webinars on AI, inclusion, and industry trends—designed to inspire curiosity and continuous improvement.
Health, Happiness, and Purpose
  • Wellness Initiatives: Company-sponsored programs that support physical, mental, and emotional well-being.
  • Employee Assistance Program (EAP): Confidential support for you and your family’s needs.
  • Comprehensive Benefits: Health and financial benefits that support your happiness and future.
  • A Culture That Cares: At Frontline Education, we want every team member to learn, grow, and thrive personally, professionally, and purposefully.

Compensation & Benefits
The salary range for this position is $125K-$160K and is commensurate with your experience, skills, and internal equity. In addition to base salary, you will be eligible for performance-based incentives aligned to individual, team, and company results.
 
You’ll also have access to a comprehensive benefits package designed to support your well-being and future, including healthcare coverage, retirement savings with company match, employee stock purchase opportunities where applicable, and the time-off, wellness, and learning programs outlined above. Specific details will be shared during the interview process.
 
Inclusion, Belonging & Equal Opportunity
Frontline Education is an equal opportunity/affirmative action employer. We aspire to have an inclusive workplace and strongly encourage suitably qualified applicants from a wide range of backgrounds to apply and join our team.
 
Interview Process & Data Privacy
As part of our interview process, Frontline uses video conferencing tools that include photo capture and may include automated transcription features. A screenshot or photo will be taken at the start of the interview for internal identification and record-keeping purposes only, and transcription may be used to support notetaking and evaluation consistency. These materials are used solely by our recruiting and hiring teams, stored securely, and not shared outside the hiring process. Candidates may opt out of the transcription at any time by notifying their recruiter in advance. Frontline processes this information in accordance with applicable data privacy laws and only for legitimate business purposes related to recruitment and hiring.
 
Our Privacy Policy: Your privacy is important to us. See our general Privacy Statement and our Applicant Privacy Statement.