Senior SRE - Distributed Systems & Cloud Infrastructure
About Us
Chess.com is one of the largest gaming sites in the world and the #1 platform for playing, learning, and enjoying chess.
We are a team of 600+ fully remote people in 60+ countries working hard to serve the global chess community. We support 200M+ chess players worldwide with the best possible product, content, and tools!
We are a tech company. A gaming company. A content company. And we do it all with passion and commitment to the game. Above all, we prize our mission-driven, flat, life-celebrating, no-corporate culture, and we look forward to meeting you and learning more about what you can bring to the team.
About You
- You’re a passionate member of the Chess.com community, with an acute understanding of our users and their needs.
- You have advanced expertise in distributed systems and several years of experience integrating and optimizing cloud-native services using Kubernetes, Golang, and TypeScript at scale.
- You excel at deep-diving into both application code and core system internals to optimize performance and architect robust solutions.
- You thrive in globally distributed teams, are humble and humorous, and take strong ownership of your work.
- You’re enthusiastic about tackling the complexities of high-traffic, data-intensive environments and are eager to push the limits of infrastructure reliability and scalability for Chess.com.
What You'll Do
Architect & Optimize Infrastructure:
- Lead the design and optimization of cloud-native services using Kubernetes, Terraform, and GitOps tools like ArgoCD.
- Develop high-performance integration patterns and manage scalable, distributed systems handling extensive data volumes.
Deep Performance Tuning:
- Dive into Golang and TypeScript codebases to identify and resolve performance bottlenecks at scale.
- Optimize infrastructure and application code to achieve aggressive performance and reliability targets, with a focus on chess programming at the bit level.
Collaboration & Best Practices:
- Work closely with development teams to refine cloud service integration architectures and implement best practices.
- Monitor and enhance system reliability and performance through effective collaboration and innovative solutions.
Incident Response & Operational Excellence:
- Participate in incident response for critical infrastructure issues, ensuring rapid resolution and minimal downtime.
- Drive improvements in infrastructure reliability, scalability, and operational efficiency.
Infrastructure & Automation:
- Utilize Terraform and Kubernetes to manage and scale our cloud infrastructure, ensuring robust, automated deployment processes.
Required Skills
High-Scale Cloud Operations:
- 5+ years of experience managing and scaling large-scale, cloud-native distributed systems.
- Deep understanding of Kubernetes, Terraform, and GitOps practices.
- Expertise in observability practices and the ability to support incident response and on-call rotations.
Advanced Development in Golang:
- Extensive experience in high-performance service development with Golang.
- Proven ability to profile and optimize applications for high throughput and reliable operation.
Distributed Systems Expertise:
- Strong knowledge of distributed systems design, failure modes, and robust architectural principles.
- Experience with data modeling and indexing strategies to support efficient service operations.
Performance Optimization:
- Demonstrated experience improving system reliability and performance through deep code-level and architectural analysis.
Communication & Collaboration:
- Excellent written and verbal communication skills.
- Experience working in globally distributed teams.
Preferred Skills
Chess Programming:
- Experience in chess programming, including bit-level manipulations and optimizations.
- Experience with C/C++.
Observability & Cloud Practices:
- Familiarity with modern observability tools and practices.
- Hands-on experience with Kubernetes and cloud-native workflows.
About the Opportunity
- This is a full-time opportunity.
- We are 100% remote (work from anywhere!).