Architect - Site Reliability Engineering [T500-13216]
Job details
The Architect – Site Reliability Engineering provides technical leadership in support of Inspire’s cloud computing initiatives, with a focus on improving efficiency, reducing toil, and increasing the uptime and availability of Inspire’s cloud platforms. This individual will collaborate with peers to shape cloud application and infrastructure design, mature production readiness reviews, enhance build/test/release automation, mature observability practices, and enhance platform resiliency, scalability, and recovery capabilities. The successful candidate is comfortable engaging a wide variety of technical partners and stakeholders, takes a data-driven and analytical approach to resolving problems and identifying areas of opportunity, is self-driven, and has a passion for continuous improvement.
Primary Responsibilities and Essential Functions:
- Engage in and strengthen the application and cloud services development lifecycle, from inception through design, deployment, and operation to refinement. Work closely with application and platform teams to ensure software releases are well designed, planned, implemented, released, and monitored.
- Design, motivate, guide, and support the creation of software, systems, and processes to increase product reliability and organizational efficiency while optimizing resource use and cloud spend.
- Champion and support reliability practices across the software development lifecycle through activities like architecture reviews, code reviews, creating platforms and frameworks, and capacity planning.
- Work with senior engineering and testing team members to build tools and recommend testing strategies, including chaos testing, for problem prevention and detection.
- Mature SRE practices through activities such as establishing error budgets, guiding the refinement of SRE dashboards, and enhancing capabilities to proactively detect anomalies.
- Provide design guidance and recommendations for platform improvements based on production incident analysis and root cause investigation outputs.
- Improve service reliability through blameless post-incident reviews and the use of code, automation, or AI to respond to problems or prevent their recurrence.
- Recognize automation opportunities, provide designs, and support the implementation of tools that automate routine, time-consuming, or manual jobs and processes.
- Periodically assess current SRE practices and tools and recommend improvements.
- Train, guide, and mentor teammates on SRE practices and principles.
- Design and execute strategies that ensure the scalability and elasticity of the infrastructure.
- Perform code-level debugging of issues escalated to the team.
Qualifications:
- Minimum 8 years of experience as a platform architect with advanced knowledge of the following key areas: containers, deployment architecture, benchmarking, design, and network engineering.
- Minimum 4 years of combined experience in DevOps, SRE, systems, and/or software development roles.
- Hands-on experience establishing and maturing SRE practices, programs, and roadmaps.
- Extensive experience with public cloud technologies and cloud-native architectures and solutions (Azure highly preferred).
- Experience with Infrastructure-as-Code (IaC), DevOps, and CI/CD practices and toolchains (Terraform, GitLab, Argo CD, Jenkins).
- Experience with configuration management tools (Ansible, Chef, Packer).
- Experience with container technology and orchestration (Kubernetes, Docker).
- Experience with observability and monitoring practices and tools (OpenTelemetry, New Relic, OpsRamp, Prometheus, Grafana, Elastic Stack, Splunk, Dynatrace).
- Deep understanding of microservice architectures, application servers, networks, and databases.
- Excellent understanding of scalability processes and techniques.
- Hands-on experience designing and administering high availability and high-performance environments, as well as managing large-scale deployments of traffic-heavy applications.
- Ability to understand and support multiple complex systems, and willingness to take on the challenge of improving them.
- Comfortable with technical refactoring and creating technical designs to accommodate architectural evolution over time.
- Willingness to try new technologies and integrate them with existing systems to improve operations overall.
- Excellent communication and collaboration skills.