Chess.com

Senior SRE - Distributed Systems & Cloud Infrastructure

🇺🇸 Remote - US

🕑 Full-Time

💰 TBD

💻 Information Technology

🗓️ October 7th, 2025

Golang, Kubernetes, Terraform

Edtech.com's Summary

Chess.com is hiring a Senior SRE - Distributed Systems & Cloud Infrastructure. The role involves architecting and optimizing cloud-native infrastructure, enhancing system performance through deep code and architectural analysis, and collaborating with development teams to improve reliability and scalability for a large-scale, data-intensive environment.

Highlights
  • Lead design and optimization of Kubernetes-based cloud infrastructure using Terraform and GitOps tools like ArgoCD.
  • Deeply tune and optimize Golang and TypeScript codebases for high performance and reliability.
  • Participate in incident response and drive operational excellence to minimize downtime.
  • Manage scalable distributed systems handling extensive data volumes.
  • Required skills include 5+ years of experience with cloud-native distributed systems, Kubernetes, Terraform, GitOps practices, and Golang development.
  • Strong expertise in distributed systems design, failure modes, and performance optimization.
  • Experience working in globally distributed teams with excellent communication abilities.
  • Preferred experience with chess programming, including bit-level optimizations and C/C++.
  • Familiarity with observability tools and modern cloud workflows is advantageous.
  • This is a full-time remote position supporting Chess.com's mission to serve 200M+ chess players worldwide.

Senior SRE - Distributed Systems & Cloud Infrastructure Full Description

About Us
Chess.com is one of the largest gaming sites in the world and the #1 platform for playing, learning, and enjoying chess.

We are a team of 600+ fully remote people in 60+ countries working hard to serve the global chess community. We support 200M+ chess players worldwide with the best possible product, content, and tools!

We are a tech company. A gaming company. A content company. And we do it all with passion and commitment to the game. Above all, we prize our mission-driven, flat, life-celebrating, no-corporate culture, and we look forward to meeting you and learning more about what you can bring to the team.

About You
  • You’re a passionate member of the Chess.com community, with an acute understanding of our users and their needs.
  • You have advanced expertise in distributed systems and several years of experience integrating and optimizing cloud-native services using Kubernetes, Golang, and TypeScript at scale.
  • You excel at deep-diving into both application code and core system internals to optimize performance and architect robust solutions.
  • You thrive in globally distributed teams, are humble, humorous, and take strong ownership of your work.
  • You’re enthusiastic about tackling the complexities of high-traffic, data-intensive environments and are eager to push the limits of infrastructure reliability and scalability for Chess.com.

What you'll do

Architect & Optimize Infrastructure:
  • Lead the design and optimization of cloud-native services using Kubernetes, Terraform, and GitOps tools like ArgoCD.
  • Develop high-performance integration patterns and manage scalable, distributed systems handling extensive data volumes.
Deep Performance Tuning:
  • Dive into Golang and TypeScript codebases to identify and resolve performance bottlenecks at scale.
  • Optimize infrastructure and application code to achieve aggressive performance and reliability targets, with a focus on chess programming at the bit level.
Collaboration & Best Practices:
  • Work closely with development teams to refine cloud service integration architectures and implement best practices.
  • Monitor and enhance system reliability and performance through effective collaboration and innovative solutions.
Incident Response & Operational Excellence:
  • Participate in incident response for critical infrastructure issues, ensuring rapid resolution and minimal downtime.
  • Drive improvements in infrastructure reliability, scalability, and operational efficiency.
Infrastructure & Automation:
  • Utilize Terraform and Kubernetes to manage and scale our cloud infrastructure, ensuring robust, automated deployment processes.

Required Skills

High-Scale Cloud Operations:
  • 5+ years of experience managing and scaling large-scale, cloud-native distributed systems.
  • Deep understanding of Kubernetes, Terraform, and GitOps practices.
  • Expertise in observability practices and the ability to support incident response and on-call.
Advanced Development in Golang:
  • Extensive experience in high-performance service development with Golang.
  • Proven ability to profile and optimize applications for high throughput and reliable operation.
Distributed Systems Expertise:
  • Strong knowledge of distributed systems design, failure modes, and robust architectural principles.
  • Experience with data modeling and indexing strategies to support efficient service operations.
Performance Optimization:
  • Demonstrated experience improving system reliability and performance through deep code-level and architectural analysis.
Communication & Collaboration:
  • Excellent written and verbal communication skills.
  • Experience working in globally distributed teams.

Preferred Skills

Chess Programming:
  • Experience in chess programming, including bit-level manipulations and optimizations.
  • Experience with C/C++.
Observability & Cloud Practices:
  • Familiarity with modern observability tools and practices.
  • Hands-on experience with Kubernetes and cloud-native workflows.

About the Opportunity
  • This is a full-time opportunity.
  • We are 100% remote (work from anywhere!).
