Edtech.com's Summary

Chess.com is hiring an Engineering Lead, Systems Operations. This role involves leading and mentoring a systems operations team, defining strategic infrastructure initiatives, and driving the hybrid cloud migration roadmap while ensuring scalable, reliable, and secure system performance. The position includes responsibilities for capacity planning, automation, incident response, and cross-team collaboration to support millions of users globally.

Highlights

Lead and mentor a team of 5-8 system operations engineers, focusing on technical guidance and career development.
Define and execute multi-year SysOps strategy including multi-regional infrastructure supporting millions of concurrent sessions.
Manage the hybrid cloud migration integrating bare-metal datacenters with cloud services to optimize performance and cost.
Establish on-call policies and incident response procedures prioritizing team health and high-availability SLAs.
Implement monitoring, observability, and alerting systems to proactively address performance issues.
Partner on infrastructure-as-code and CI/CD pipeline implementation to enhance deployment quality and speed.
Oversee capacity planning, load testing, resource allocation, and infrastructure budget optimization.
Champion security protocols and risk assessments ensuring compliance with industry standards.
Required skills include UNIX/Linux expertise, cloud platforms (GCP, AWS, Azure), Terraform or CloudFormation, configuration management tools (Ansible, Chef, Puppet), Kubernetes/Docker, monitoring tools (Datadog, Prometheus), and database performance knowledge.
Experience managing technical teams, strong communication skills, networking fundamentals, and a proven ability to build scalable, reliable systems are essential.

Engineering Lead, Systems Operations Full Description

Engineering Lead, Systems Operations
Engineering

Remote

About Us

Chess.com is one of the largest gaming sites in the world and the #1 platform for playing, learning, and enjoying chess.

We are a team of 600+ fully remote people in 60+ countries working hard to serve the global chess community. We are here to support 200M+ chess players worldwide with the best possible product, content, and tools to serve the community!

We are a tech company. A gaming company. A content company. And we do it all with passion and commitment to the game. Above all we prize our mission-driven, flat, life-celebrating, no-corporate culture, and we look forward to meeting you and learning more about what you can bring to the team.

About You

You are passionate about building and managing infrastructure. It brings you joy to learn new technologies and use them to help reach challenging product goals. You have solid experience deep diving into Linux internals, as well as the future-oriented skills of managing Cloud/Kubernetes ecosystems. You are humble, with a sense of humor, and eager to be part of a like-minded team of people. You have worked in, or dreamed of working in, the gaming industry and are ready to turn your talents towards chess!

What You’ll Do

Lead and mentor a team of 5-8 system operations engineers, providing technical guidance, career development, and performance management while demonstrating adaptive leadership styles and fostering a teachable culture of continuous learning
Define and execute the multi-year SysOps strategy with clear prioritization of critical initiatives, including multi-regional infrastructure architecture capable of handling millions of concurrent sessions across global data centers
Own the hybrid cloud migration roadmap, partnering with leadership to integrate bare-metal datacenter resources with cloud services for optimal performance and cost efficiency, delivering value through time-to-market optimization
Establish on-call rotation policies and incident response procedures with a strong focus on work-life balance, ensuring rapid resolution of critical system issues while maintaining team health and high-availability SLAs
Drive the implementation of monitoring, observability, and alerting systems that reach the right people at the right time, proactively identifying and resolving performance bottlenecks before they impact users and preventing organizational surprises
Partner with engineering leadership to implement infrastructure-as-code practices and establish deployment pipelines that support continuous integration and delivery, emphasizing quality with high first-time-right rates and low rework
Oversee capacity planning, load testing, and resource allocation strategies across distributed computing environments, demonstrating excellent time management and execution velocity while managing infrastructure budget and cost optimization
Champion security protocols and risk assessment procedures for infrastructure components and data protection with unwavering integrity, ensuring compliance with industry standards and earning trust across the organization
Collaborate with product and engineering leaders to design scalable solutions for high-traffic applications, valuing others' time by simplifying cross-team workflows and ensuring clear presentation of technical concepts to varied audiences
Lead automation initiatives that deliver measurable value to both internal and external customers, reducing manual operational overhead and improving system reliability through scripting and configuration management
Build authentic relationships with cross-functional teams and stakeholders, ensuring transparent communication of system health and aligning SysOps priorities with business objectives through excellent listening and presentation skills
Recruit, retain, and develop top engineering talent by understanding individual motivations and aligning team goals with personal drivers, fostering an inclusive culture where growth mindset principles guide decision-making and risk-taking
Demonstrate focus on commitments by managing distractions effectively, maintaining a strong track record of successful execution, and accumulating wins that build credibility and trust across the organization

Required Qualifications

5+ years of experience in system operations, DevOps, or infrastructure engineering roles with demonstrated excellence in execution and velocity
2+ years of experience managing technical teams, including hiring, performance management, and career development with proven ability to identify and adapt leadership styles
Strong proficiency with UNIX/Linux operating systems and command-line administration
Deep experience with cloud platforms (GCP, AWS, or Azure) and infrastructure-as-code tools (Terraform, CloudFormation, or similar)
Hands-on experience with configuration management systems (Ansible, Chef, Puppet, or similar)
Solid understanding of networking fundamentals, protocols (TCP/IP, HTTP/HTTPS, DNS), and network troubleshooting
Experience with containerization and orchestration technologies (Docker, Kubernetes, or similar)
Proficiency with monitoring and observability tools (Datadog, Prometheus, Grafana, ELK stack, or similar)
Experience with relational and NoSQL databases, including performance optimization and scaling strategies
Excellent communication skills with proven ability to reach the right stakeholders, present complex technical concepts clearly, and listen effectively to understand diverse perspectives
Strong prioritization and time management skills, with ability to distinguish critical work from nice-to-have initiatives
Demonstrated integrity in decision-making, earning respect and trust from peers, direct reports, and senior leadership
Proven track record of building and scaling reliable systems and high-performing teams with high-quality outcomes and low maintenance costs
Growth mindset with ability to share ideals and risks positively, avoid fixed mindset behaviors, and remain teachable in all situations
Ability to understand what motivates individuals and teams, aligning work with intrinsic drivers to maximize engagement

Preferred Skills

Experience managing bare-metal server infrastructure and datacenter operations at scale
Strong background in server-side automation and scripting languages (Python, Go, Bash, or similar)
Experience designing high-availability architectures and disaster recovery strategies with focus on delivering customer value
Experience with game server infrastructure or real-time application hosting at scale
Knowledge of database administration and optimization for high-concurrency applications
Experience building and optimizing CI/CD pipelines and deployment automation that balance velocity with quality
Proven success with capacity planning, performance testing, and infrastructure cost optimization
Experience managing remote, distributed teams across multiple time zones while valuing team members' time and work-life balance
Track record of fostering inclusive team cultures, developing engineering talent, and mentoring others on leadership approaches
Demonstrated commitment to continuous learning with awareness of when to teach and when to learn from others
History of making technical decisions without compromising personal or organizational values
Ability to simplify complex infrastructure challenges and make them easier for other teams to understand and engage with
Continuous learning mindset with interest in emerging infrastructure technologies and willingness to share knowledge across the organization
Strong collaboration and communication skills working in a fully distributed team
Sense of ownership and responsibility

About the Opportunity

This is a full-time position
We are 100% remote (always have been, always will be!)
