Site Reliability Engineer
Job details
Who are Tyk, and what do we do?
The Tyk API Management platform is helping to drive the connected world and power new products and services. We’re changing the way that organisations connect any number of their systems and services. Whether those systems are internal, external, public or highly encrypted, Tyk helps businesses drive value across the retail, finance, telecoms, healthcare and media industries (to name just a few!). If you’ve banked online, used an app to check the news, or perhaps even driven a connected car, APIs, and by extension Tyk, make that possible.
Founded in 2015 with offices in London – UK, London – Ontario, Atlanta and Singapore, we have many thousands of users of our B2B platform across the globe. Brands using Tyk range from Lotte, Bell, Dominos and Starbucks to RBS and Societe Generale. We have a varied user base hailing from every continent – even Antarctica.
Our Mission
Tyk is on a mission to connect every system in the world. We’ve started by building an API Management platform.
Total flexibility, default remote, radical responsibility
We offer unlimited paid holidays and remote working from anywhere in the world, for everyone. Why? Tyk was founded on the principle of offering flexibility and autonomy to our employees, and we believe this allows them to achieve their best results. It also means we can build the best possible team, because location and working hours are no barrier. If this sounds like an environment that could work for you, read on to find out more.
The role:
At Tyk, we’re obsessed with building software that solves problems. We count on our Site Reliability Engineers (SREs) to empower users with a rich feature set, high availability, and stellar performance so they can pursue their missions. Our customer base is growing, so we’re seeking an experienced SRE to optimise, automate, and improve our performance, using insights from massive-scale data in real time.
We want an original thinker, a challenger, a technical legend, an opinionated collaborator who wants to make things better. Here’s what you’ll be getting up to:
- Proactive Monitoring: Ensure our production Cloud environment operates within defined SLAs through vigilant monitoring and proactive issue resolution.
- Alerting and Monitoring: Collaborate with senior SREs to identify opportunities for building proactive alerting and monitoring systems, and implement solutions to enhance system reliability.
- Performance Metrics: Contribute to defining key performance metrics for Cloud services, enabling performance improvements and success measurement.
- Solutions Development: Propose and develop solutions to maintain and enhance key performance indicators (KPIs) across our Cloud infrastructure.
- Data Analysis: Gather and analyse metrics from operating systems and applications to optimise system performance and expedite fault resolution.
- Innovation: Drive innovation by optimising system and infrastructure performance, anticipating customer needs, and proactively addressing scaling demands.
- Scalability: Work closely with commercial functions to optimise our platform for scalability and meet growing customer demands.
- Cloud Infrastructure: Analyse and ensure the automation, scalability, and efficient management of our Cloud infrastructure.
- Automation: Execute automation for known cloud operations tasks and create new automation solutions to streamline processes.
- Software Development: Design, write, and deliver software and automation solutions to enhance the availability, scalability, latency, and efficiency of our PaaS services.
- Root Cause Analysis: Participate in blame-free root cause analysis meetings to promote learning and continuous system improvement in the event of production system incidents.
- Documentation: Create and contribute to policies and runbooks to ensure that operational processes are well-documented and consistently followed.
- On-call Support: Provide on-call support, ensuring our Cloud services follow a 24/7 model by promptly responding to alerts, meeting SLAs, and automating root cause analysis.
- Upgrades and Migrations: Plan and execute software upgrades, including Kubernetes versions. Manage and communicate migrations from Classic Cloud to the new Cloud platform.
Here’s the experience we’re looking for:
- Strong collaboration skills
- Launching and operating production Kubernetes clusters
- Designing and operating infrastructure on AWS and other providers
- Operating MongoDB (or other document database) clusters
- Operating Redis (or other key-value storage) clusters
- Administering Linux servers
- Maintaining distributed software
- Operating Prometheus and Grafana
- Operating log collection and analysis systems
Here are the skills you’ll need:
- Kubernetes & containers (proficient)
- Go and/or Python (advanced)
- AWS (proficient)
- Linux (proficient)
- Terraform and IaC in general (proficient)
- Helm (familiar)
- MongoDB (or similar)
- Redis (or similar)
- Monitoring & logging
- Grasp of networking concepts (subnets, routing, peering, load balancing, NAT, etc.)
- Common networking protocols (DNS, TCP/IP, TLS, UDP)
Here’s what we offer:
- Everyone has unlimited paid holidays.
- We have total flexibility in hours, as we believe creativity flows better when our people are given the freedom to decide when they are most productive. Everyone is unique, after all.
- Employee share scheme
- Generous maternity and paternity leave
- Volunteering Days
- Company retreats
- Employee Wellbeing platform
Our values:
- It’s ok to screw up!
- The only stupid idea is the untested one!
- Trust starts with you – make it count!
- Assume best intent!
- Make things better!
Apply safely
To stay safe in your job search, find information on common scams, and get free expert advice, we recommend that you visit SAFERjobs, a non-profit, joint industry and law enforcement organisation working to combat job scams.