Cambium Learning Group

Senior Manager, Site Reliability

🇺🇸 Remote - US

🕑 Full-Time

💰 TBD

💻 Information Technology

🗓️ April 16th, 2024

AWS, Infrastructure, SaaS
Overview:

As the Senior Manager of Site Reliability, you will play a crucial role in ensuring the stability, performance, and security of our SaaS applications. You will lead a team of skilled professionals responsible for maintaining and enhancing the reliability of our systems through robust observability, monitoring, threat detection, and mitigation strategies. The ideal candidate will bring extensive experience in managing complex SaaS environments and a deep understanding of best practices in site reliability engineering.

Job Responsibilities

Team Leadership:

  • Lead and mentor a team of site reliability engineers to ensure a high level of expertise and efficiency.
  • Drive initiatives to enhance the technical skills and efficiency of the team.
  • Foster a culture of collaboration, innovation, and continuous improvement.

Hands-On Technical Leadership:

  • Actively contribute to the design, implementation, and maintenance of observability, monitoring, and security systems.
  • Lead by example, working hands-on to troubleshoot issues and optimize system performance.

Observability and Monitoring:

  • Develop and implement comprehensive observability and monitoring strategies to proactively identify and address potential issues before they impact system performance.
  • Collaborate with development leadership to improve the performance and scalability of services by providing relevant, actionable metrics in the early stages of development.
  • Utilize industry-leading tools and practices to maintain visibility into the health and performance of our systems.

Threat Detection and Mitigation:

  • Design and implement robust security measures to detect and mitigate potential threats to our SaaS infrastructure.
  • Stay informed about the latest cybersecurity threats and trends, and implement proactive measures to safeguard our systems.

Incident Response:

  • Actively participate in incident response activities, leading the team to quickly resolve and learn from incidents.
  • Develop and maintain incident response plans to ensure a rapid and effective response to any service interruptions or security incidents.
  • Conduct post-incident analyses to identify root causes and implement preventive measures.

Infrastructure Optimization:

  • Collaborate with cross-functional teams to optimize the performance and scalability of our infrastructure.
  • Implement automation and efficiency improvements to enhance overall system reliability.

Job Requirements

  • Bachelor's degree in Computer Science, Information Technology, or a related field.
  • 5+ years of proven hands-on experience in site reliability engineering or a similar role.
  • 3+ years of leadership experience with a focus on technical mentorship and skill development.
  • In-depth knowledge of observability tools, monitoring systems, and security best practices.
  • Proven leadership and team management skills.
  • Excellent problem-solving and communication abilities.
  • In-depth experience with AWS.

To learn more about our organization and the exciting work we do, visit https://www.lexialearning.com/

An Equal Opportunity Employer

We are dedicated to fostering a culture that celebrates unique backgrounds, ideas, and experiences. All qualified applicants will receive consideration for employment without discrimination on the basis of race, color, age, religion, sex, gender, gender identity/expression, sexual orientation, national origin, protected veteran status, or disability.