Architect - Site Reliability Engineering

Full time at a Laimoon Verified Company in India

Posted on April 22, 2024

Job details

The Architect, Site Reliability Engineering provides technical leadership in support of Inspire's initiatives in cloud computing, with a focus on improving efficiency, reducing toil, and increasing uptime and availability of Inspire's cloud platforms. This individual will collaborate with peers to shape cloud application and infrastructure design, mature production readiness reviews, enhance build/test/release automation, mature observability practices and approach, and enhance platform resiliency, scalability, and recovery capabilities. The successful candidate is comfortable engaging a wide variety of technical partners and stakeholders, takes a data-driven and analytical approach to resolving problems and identifying areas of opportunity, is self-driven, and has a passion for continuous improvement.

Primary Responsibilities and Essential Functions:

- Engage in and strengthen the application and cloud services development lifecycle, from inception and design through deployment, operation, and refinement.
- Work closely with application and platform teams to ensure software releases are well designed, planned, implemented, released, and monitored.
- Design, motivate, guide, and support the creation of software, systems, and processes to increase product reliability and organizational efficiency while optimizing resource use and cloud spend.
- Champion and support reliability practices across the software development lifecycle through activities like architecture reviews, code reviews, creating platforms and frameworks, and capacity planning.
- Work with senior engineering and testing team members to build tools and recommend testing strategies for problem prevention, detection, and chaos testing.
- Mature SRE practices through activities such as establishing error budgets, providing guidance and refinement to SRE dashboards, and enhancing capabilities to proactively detect anomalies.
- Provide design guidance and recommendations for platform improvements based on production incident analysis and root cause investigation outputs.
- Improve service reliability through blameless post-incident reviews and use of code, automation, or AI to respond to problems or prevent their recurrence.
- Recognize automation opportunities, provide design, and support the implementation and development of tools to automate routine, time-consuming, or manual jobs and processes.
- Periodically assess current SRE practices and tools and provide recommendations for enhancements and improvements.
- Train, guide, and mentor teammates on SRE practices and principles.
- Design and execute strategies that ensure the scalability and elasticity of the infrastructure.
- Perform code-level debugging on issues escalated to the team.

Minimum Experience:

- Minimum 8 years of experience as a platform architect with advanced knowledge in the following key areas: containers, deployment architecture, benchmarking, design, and network engineering.
- Minimum 4 years of combined experience serving in a DevOps, SRE, systems, and/or software development role.
- Hands-on experience in establishing and maturing SRE practices, program, and roadmap.
- Extensive experience with public cloud technologies and cloud-native architectures and solutions (Azure highly preferred).
- Experience with Infrastructure-as-Code (IaC), DevOps, and CI/CD practices and tool chains (Terraform, GitLab, ArgoCD, Jenkins).
- Experience with configuration management tools (Ansible, Chef, and Packer).
- Experience with container technology and orchestration (Kubernetes, Docker).
- Experience with observability and monitoring practices and tools (OpenTelemetry, New Relic, OpsRamp, Prometheus, Grafana, Elastic Stack, Splunk, Dynatrace).
- Deep understanding of microservice architectures, application servers, networks, and databases.
- Excellent understanding of scalability processes and techniques.
- Hands-on experience designing and administering high-availability and high-performance environments, as well as managing large-scale deployments of traffic-heavy applications.
- Ability to understand and support multiple complex systems, and to not shy away from the challenge of improving them.
- Comfortable with technical refactoring and creating technical designs to accommodate architectural evolution over time.
- Willingness to try new technologies and make them harmonize with existing systems to achieve better operations overall.
- Excellent communication and collaboration skills.

Apply safely

To stay safe in your job search, learn about common scams, and get free expert advice, we recommend that you visit SAFERjobs, a non-profit, joint industry and law enforcement organization working to combat job scams.

Hiring company

Confidential