Principal Lead - Observability and Incident Management

Company: First Abu Dhabi Bank
Location: Abu Dhabi
Employment type: Full-time
Salary: USD 60,000 - 120,000

Description

Overall objectives

- Ensure proactive detection, diagnosis and resolution of service health issues across all IT environments.
- Establish a modern observability function that delivers full visibility into critical services, applications and infrastructure layers.
- Own and lead the major incident management process, ensuring rapid containment, clear communication and structured resolution.
- Drive actionable insights through metrics, traces and logs (MTL), and ensure system health telemetry is used to improve availability, performance and user experience.
- Support operational risk reduction and continuous improvement through RCA, trend reporting and resilience engineering.

Job scope

Role-specific responsibilities

- Monitoring and observability engineering
- Alerting, noise reduction and event correlation
- Incident management
- Post-incident review and RCA
- Dashboarding and health visibility
- Service reliability metrics

General functional responsibilities

- Define the observability architecture strategy, ensuring scalability, data security and cost optimisation.
- Collaborate with application, infrastructure and security teams to ensure instrumentation coverage and logging compliance.
- Maintain operational documentation, runbooks, escalation matrices and incident playbooks.
- Drive a blameless culture of improvement and incident learning.
- Align monitoring practices with regulatory and compliance obligations.
- Represent the observability and incident management function at governance forums.
- Engage with vendors, SaaS providers and cloud platforms to ensure integration with internal monitoring and incident workflows.
- Coach and mentor monitoring and incident managers to raise maturity across people, processes and tooling.

Qualifications

Core competencies required

- Deep expertise in monitoring platforms (e.g. ELK, AppDynamics, Grafana, Elastic, Datadog), APM, synthetic monitoring and log aggregation.
- Solid understanding of distributed systems, microservices and hybrid cloud environments.
- Strong command of SRE, telemetry pipelines, SLI/SLO and alerting strategies.
- Experience running 24/7 incident command processes: leading war rooms, managing communications to executives and driving post-mortems.
- Ability to align observability practices to business-critical services and customer impact, not just infrastructure health.
- Mastery of ITIL event management and incident management with ITSM platforms like ServiceNow.
- Calm, decisive leadership in high-pressure scenarios, with excellent cross-functional coordination and communication skills.
- Overall, 15 years of technology experience is desirable.

Remote work: No
Employment type: Full-time

Posted: 25th August 2025 3.05 pm

Application Deadline: N/A
