Site Reliability Engineer
About us:
Intuitive.Cloud is one of the fastest-growing (INC 5000, CRN) Cloud & SDx solution and services companies supporting enterprise customers on a global scale. Intuitive is an "Engineering Company" delivering measurable value and key business outcomes.

Intuitive Superpowers:
- DataOps & AI/ML
- Cloud Native, AppSecOps, DevSecOps
- Cloud Migration & Transformation
- Cloud FinOps
- Cybersecurity (App/Data/Infra) & GRC
- SDx & Digital Workspace

We are proud to partner with some of the world's leading enterprises and serve 200+ customers across different industry verticals. We have achieved many milestones along the way, including being recognized by CRN as a top-10 Fast Growth 150 IT company in the Americas in 2022 and being named one of America's fastest-growing private companies by INC 5000 in 2022. CIO Review also named us the Most Promising Cloud Migration Company and Artificial Intelligence Solutions Provider in 2022.

About the job:
Title: Site Reliability Engineer
Start date: Immediate
Position Type: Full Time
Work Timing: US (Eastern Time Zone)
Location: Remote across India

Job Description:
We are seeking an experienced Site Reliability Engineer (SRE) to enhance operational efficiency, reliability, and observability across infrastructure and application landscapes. This role focuses on integrating advanced monitoring platforms, defining key performance metrics, and establishing comprehensive monitoring solutions to ensure system health and performance.
The SRE will work closely with cross-functional teams to implement alerting mechanisms, improve scalability, and drive the adoption of best practices in observability and reliability engineering.

Roles and Responsibilities:

Observability Platform Integration:
- Lead the transition to modern monitoring platforms, ensuring seamless integration with existing systems.
- Define and implement observability strategies to enhance visibility into infrastructure and applications.
- Collaborate with stakeholders to identify critical workloads and performance metrics.

Monitoring and Alerting:
- Develop and implement monitoring solutions for applications, databases, and infrastructure, capturing metrics such as availability, performance, and resource utilization.
- Establish alerting frameworks to detect anomalies, performance bottlenecks, and security incidents.
- Integrate monitoring and alerting with ITSM tools for streamlined incident management.

Performance Metrics and SLAs:
- Define and track Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) for key business systems.
- Work with stakeholders to align metrics with business expectations and operational goals.

Automation and Scalability:
- Leverage scripting and automation tools to streamline deployment of monitoring agents and configuration updates.
- Optimize monitoring platforms for scalability and efficiency, ensuring they can accommodate evolving business needs.

Dashboard Development:
- Design and maintain dashboards to provide real-time insights into system performance and health.
- Ensure dashboards are intuitive and actionable, enabling teams to monitor critical metrics effectively.

Cloud Infrastructure Performance:
- Deep understanding of cloud infrastructure and services.
- Diagnose, troubleshoot, and optimize performance issues in cloud services, including compute, storage, and networking components.
- Implement monitoring and tuning practices specific to cloud-native environments to ensure reliability and scalability.

Documentation and Training:
- Develop comprehensive documentation for monitoring tools, configurations, and processes.
- Conduct training sessions to ensure teams are proficient in utilizing observability platforms and interpreting metrics.

Continuous Improvement:
- Continuously evaluate and enhance monitoring and observability solutions to meet changing organizational needs.
- Incorporate feedback from stakeholders to refine alerting thresholds, dashboards, and metrics.

Mandatory Skills:

Performance Monitoring:
- Expertise with modern observability platforms (Sumo Logic).
- Experience with Azure native monitoring solutions and practices.
- Deep understanding of Azure infrastructure and services, including diagnosing and tuning performance issues with such services.
- Strong knowledge of monitoring methodologies for infrastructure, applications, and databases.
- Experience in monitoring and integrating observability platforms with Active Directory Domain Controllers, PeopleSoft Applications, and Order Entry Systems (KPS).
- Experience with log management, metric collection, and alerting configuration.
- Ability to define and track SLAs, SLOs, and SLIs for business-critical systems.
- Experience in monitoring network, application, and database performance metrics.
- Strong understanding of network and security device monitoring, including SNMP, syslog, and NetFlow.
- Hands-on experience in application performance monitoring for enterprise platforms like ERP or custom applications.
- Familiarity with containerized environments and Kubernetes monitoring.

Automation Skills:
- Experience with scripting languages (e.g., Python, Bash, PowerShell) to automate monitoring setup and management.
- Familiarity with infrastructure automation tools like Ansible and Terraform.

Communication and Collaboration:
- Strong collaboration skills to work with cross-functional teams and stakeholders.
- Ability to communicate technical concepts to both technical and non-technical audiences.

Incident Management:
- Familiarity with ITSM tools (e.g., ServiceNow) for incident and problem management.
- Proven experience in integrating alerting mechanisms with incident management workflows.