Lead / Senior Site Reliability Engineer

Full time at Appspace in Malaysia

Posted on May 8, 2024

Job details

Your Role as a Lead / Senior Site Reliability Engineer:

Our Cloud Operations team seeks a Lead / Senior Site Reliability Engineer who is passionate about problem-solving, automating, and maintaining Appspace’s Cloud Platform to support the needs of our Engineering and Customer Care teams. The ideal candidate will have considerable experience in site reliability engineering and/or AIOps, and in evolving operations to leverage more automation to scale cloud-native platforms. You will work closely with a global team of cloud, engineering, product, and service professionals to improve our platform’s resiliency and scalability, which directly improves our customers’ experience with Appspace.

With this role, you can grow your capabilities as a Lead / Senior Site Reliability Engineer: our cloud platform is large in scale and our Cloud Operations team is small, which means you will have opportunities to work on all Cloud Infrastructure, end-to-end.

This is a mission-critical role for Appspace. While we offer flex time, it should be scheduled ahead of time; otherwise, shift engagement is mandatory outside lunch and break times. On-call coverage will be required weekly during a limited window of US daytime hours over the weekend. This role highly prefers candidates who can attend our Kuala Lumpur office at least 2 days per week.

This is your opportunity to be part of an awesome company that is rapidly growing and defining the modern workplace experience market!

A Day in the Life of a Lead / Senior Site Reliability Engineer:

In this position, you will play a key role in maintaining our cloud platform, which includes an assortment of Kubernetes, microservices, MongoDB, RabbitMQ, MySQL, Windows Server VM infrastructure, orchestration engines, CI/CD, and monitoring platforms. Your day will consist of:

Executing projects that roll out new platform maintenance features, automate tasks, or make other big-picture changes to improve our customers’ experience on our Cloud Platform.
Deploying new features and releases of our software into Kubernetes via Helm, so strong experience in Kubernetes and Helm is a must.
Troubleshooting performance issues or errors thrown by the cloud platform or application, and either resolving the underlying cause or forwarding your research to Engineering to address in the product.
Mentoring others towards technical and procedural success and providing some daily operational support to Kuala Lumpur-based Cloud Operations and IT team members.
Actioning Request Tickets from other teams in support of their needs to enable and prepare for upcoming releases.
Monitoring and maintaining our Platform’s uptime, resiliency, and performance, looking for improvement opportunities, and proactively taking action to solve any negative trends before they become issues.
Leading, participating in, or executing within the incident management process when alerts fire, quickly ascertaining root cause, resolving the issue, and finding new and creative solutions to prevent recurrence.
Configuring, monitoring, researching, and evaluating workload performance on both Google Cloud Platform and Microsoft Azure.
Collaborating with our Development and Quality Assurance teams to address issues in the product and platform, particularly around recurring problems.
Documenting new processes and procedures, or updating existing ones, to share knowledge and improve on standardized approaches to solutions.

What You’ll Need:

Must have a passion for life-long learning.
Must communicate well and adapt to working well with others across different countries and cultures.
A strong background in containers, Kubernetes, Helm, Linux, and Python coding, plus some experience with Windows Server OS and macOS, is a must.
Experience with Google Cloud Platform and Microsoft Azure required.
Expert-level troubleshooting experience and the ability to reason through a process workflow to identify a fault or odd behavior (e.g., spending time following log trails).
Experience with administering MySQL & MongoDB preferred.
Experience with administering message brokering systems like RabbitMQ preferred.
Must be flexible on occasionally attending “off-hour” meetings (we’re a global team supporting a global customer base!).
No travel required for this role.

Nice to Haves:

Experience with Build pipeline tools and the Atlassian suite (JIRA, Confluence, Bitbucket/Git, Bamboo, Octopus).
Experience with monitoring and alerting platforms, especially StackDriver.
Experience with HashiCorp Terraform.
Experience with IIS.

The Perks of Working for Appspace:

For all our KL-based team members, we offer a variety of benefits, including competitive salaries; medical, dental, and vision coverage; mental health resources; a 14-week maternity leave program; and a transport/parking allowance. Additional perks include:

20 Days PTO
Flexible work schedules
Remote work opportunities
Paid company holidays
1/2 Day Fridays
Appspace Quiet Fridays (No non-essential internal meetings scheduled)
A casual dress work environment
