Site Reliability Engineering Principal

Full time at a Laimoon Verified Company in India
Published on April 23, 2024

Job details

Cloud Platforms: Advanced proficiency in one or more cloud platforms such as AWS, Azure, or Google Cloud Platform (GCP), including expertise in services such as EC2, S3, RDS, and VPC networking.
Container Orchestration: Strong experience with container orchestration platforms such as Kubernetes, including deployment, scaling, and management of containerized applications.
Configuration Management and Automation: Proficiency in configuration management tools such as Ansible, Puppet, or Chef, with a strong emphasis on automation and infrastructure as code (IaC) practices.
Monitoring and Observability: Hands-on experience with monitoring and observability tools such as Splunk, Prometheus, Grafana, the ELK stack (Elasticsearch, Logstash, Kibana), or similar solutions for real-time system monitoring, logging, tracing, and alerting.
Continuous Integration/Continuous Deployment (CI/CD): Experience with CI/CD pipelines and tools such as Jenkins, GitLab CI/CD, CircleCI, or Travis CI, including automated testing, deployment, and rollback strategies.
Infrastructure as Code (IaC): Proficiency in IaC tools such as Terraform or CloudFormation for provisioning and managing infrastructure resources declaratively.
Scripting and Automation: Strong scripting skills in languages such as Python, Shell, or Go for automating repetitive tasks, managing configurations, and orchestrating deployments.
Databases and Datastores: Experience with relational databases (e.g., PostgreSQL, MySQL), NoSQL databases (e.g., MongoDB, Cassandra), and time series databases, including performance tuning, replication, and high availability configurations.
Security Best Practices: Familiarity with security best practices for cloud environments, including identity and access management (IAM), encryption, network security, and compliance standards such as PCI-DSS and GDPR.
Version Control Systems: Proficiency in version control systems such as Git, including branching strategies, code reviews, and collaboration workflows.
Synthetic Monitoring: Experience with synthetic monitoring tools such as New Relic Synthetics, Datadog Synthetics, or Selenium for simulating user interactions and monitoring application performance from external locations.
Network Understanding: Strong understanding of networking, distributed systems, microservices architecture, and other relevant architectural concepts.
Analytical Skills: Excellent problem-solving skills and the ability to troubleshoot complex issues in production environments.

Responsibilities:

Efficient Lifecycle Management: Enhance application and cloud service lifecycles.
Reliable Software Improvement: Boost software dependability for organizational efficiency.
Expert Guidance in Reliability: Provide expert direction on reliability practices.
Robust Testing Development: Develop effective testing strategies and tools.
Adaptable SRE Solutions Implementation: Implement flexible solutions to enhance system stability.
Dashboard Development Leadership: Lead comprehensive SRE dashboard creation.
Optimized Performance Testing Deployment: Deploy specialized tests for peak system performance.
Swift Incident Resolution: Resolve production incidents promptly to minimize disruptions.
Continuous Service Enhancement: Enhance service reliability through proactive measures.
Proactive Anomaly Management: Identify and address anomalies before they impact operations.
Automated Dashboard Setup: Streamline dashboard provisioning for efficient operations.
Precise Code Debugging: Investigate and resolve issues at the code level efficiently.
Seamless Release Integration: Integrate SRE practices seamlessly into the release cycle.
Efficient Process Automation: Automate repetitive tasks to save time and resources.
Dynamic SRE Solutions Enhancement: Assess and enhance SRE solutions for optimal performance.
Collaborative SRE Implementation: Work with teams to implement and refine SRE practices.
Proactive System Enhancement: Improve system resilience through proactive initiatives.
Effective SRE Training Delivery: Deliver training sessions for widespread SRE knowledge.
Scalability Strategy Planning: Design strategies for scalable infrastructure growth.
Proactive Improvements: Spend at least 50% of your time on proactive improvements to system reliability and resilience.
Training: Conduct SRE training sessions.

Nice to have:

Previous FedEx experience
Master's degree
Domain knowledge in logistics, finance, or supply chain
