Job details
About the role: We are seeking a skilled and passionate Engineer to join our team to build and operate a Public Sector runtime platform. As a Reliability Engineer, you will design, deploy, and manage the Kubernetes-based infrastructure and solutions that power the platform, ensuring its stability, scalability, and performance.
Job Responsibilities:
- Develop automation and processes to enable and continuously improve the deployment and management of the runtime at scale (whether namespaces or Kubernetes clusters).
- Monitor and troubleshoot Kubernetes clusters, identifying and resolving performance bottlenecks, security vulnerabilities, and other operational issues.
- Stay updated with the latest Kubernetes developments, best practices, and industry trends, and recommend relevant improvements to our platform.
- Collaborate with development teams to containerize applications and deploy them on Kubernetes, ensuring best practices for scalability, availability, and performance.
- Develop automation and processes to enable and continuously improve the deployment and management of applications on the runtime platform.
- Participate in on-call rotations and respond to incidents in a timely manner, conducting post-incident reviews and implementing preventive measures.
- Monitor services to identify bottlenecks, forecast system behaviour and scale infrastructure as needed.
- Implement comprehensive monitoring solutions to provide real-time insights into application and infrastructure health
- Efficiently manage incidents and outages, minimizing MTTR
- Build automation around system health assessment and self-remediation
Job Requirements:
- Bachelor's degree or Diploma in Computer Science, Engineering, or a related field (or equivalent experience).
- Proven experience as a Reliability Engineer or in a similar role, with a strong background in containerization, orchestration, and cloud-native technologies.
- In-depth understanding of Kubernetes architecture, components, and operational best practices.
- Hands-on experience with Kubernetes (especially AWS EKS) and Helm.
- Proficiency in scripting and automation using languages such as Bash or Python.
- Solid understanding of networking, security, and storage concepts in Kubernetes.
- Ability to troubleshoot and resolve complex technical issues related to Kubernetes and containerized applications.
- Experience integrating Kubernetes with AWS services such as Secrets Manager and load balancers.
- Strong communication and collaboration skills, with the ability to work effectively in cross-functional teams.
- Experience with CI/CD tools (Jenkins, GitLab CI/CD, ArgoCD) and version control systems (Git).
- Experience using error budgets to balance reliability with the pace of innovation
- Familiarity with other cloud platforms (GCP, Azure) and infrastructure-as-code tools (Terraform) is advantageous
- Certifications such as Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD) are a plus
- Experience with observability and monitoring tools (Prometheus, Grafana, ELK Stack) is a plus
- Experience with pager apps is a plus
- Experience with automated testing tools (Testkube, Ginkgo) is a plus
- Experience implementing and maintaining Kubernetes operators in Go is a plus
- Experience with service mesh technologies is a plus
- Experience with Chaos Engineering is a plus
- Excellent problem-solving mindset and strong analytical abilities
- Clear and effective communication skills
- Adaptability and continuous learning mindset