Software Engineer/Site Reliability Engineer
Job details
We are a team to design, develop, maintain, and improve software for various ventures projects, i.e., projects that are adjacent to our core businesses and are bootstrapped fast with a lean team. You will be actively involved in the design of various components behind scalable applications, from frontend UI to backend infrastructure. What you'll be doing
- Ensure entire stack is healthy: hardware, software, application and network are operating at optimal performance
- Perform deep dives into both systemic and latent reliability issues; partnering with other software and DevOps engineers across the organization to design, implement and roll out fixes
- Perform and run blameless RCAs on incidents and outages aggressively looking for answers that will prevent the incident from ever happening again
- Continuously improve availability, reliability, and observability and reduce the burden of human toil with tooling and automation
- Define SLA/SLOs for different services partnering with product engineers
- Represent the SRE team in system design reviews and operational readiness exercises for new and existing services
- Experience coding in Ruby and/or Go
- Familiar with GitOps principles and tools (Github Actions, Docker, Kubernetes)
- Experience in designing, analyzing, and troubleshooting large-scale distributed systems
- Curiosity about finding root causes in incidents and outages
- Ability to develop alignment to cultivate relationships and driving impact
- Mindset in designing fault tolerance system architecture
- Comfort with being uncomfortable in ambiguous situations
- Involvement with incident management and response
- Desire to grow expertise, inform, and educate others
- Capable to pick up various technologies, a fast learner and have a get things done" mentality
- Humble to embrace better ideas from others, eager to make things better, open to challenges and possibilities
- Familiar with cloud platforms and micro-service based architecture (AWS is big plus)
- Familiar with monitoring tools (e.g. NewRelic, Datadog, and/or OpenTelemetry)
- Familiar with IaC tools (e.g. Terraform, Spacelift)
- Experience in designing resilient system architecture
- Experience in optimizing performance of large-scale production system
- Experience in promoting site reliability engineering practices
Apply safely
To stay safe in your job search, information on common scams and to get free expert advice, we recommend that you visit SAFERjobs, a non-profit, joint industry and law enforcement organization working to combat job scams.