HPC Linux Support Engineer
Job Details
Role: HPC Linux Systems Engineer
Location: Remote; work from where you currently reside.
Salary: Open for discussion
Type: Permanent, full-time (work hours for India will initially be 9:30pm to 8am)

Our client is a global IT solutions and managed services provider that focuses on helping organizations digitally transform their operations. With expertise in cloud computing, data center technologies, networking, and security, our client partners with leading technology vendors to offer innovative and customized IT solutions. They serve a wide range of industries, including healthcare, finance, and education, providing services such as IT consulting, system integration, and ongoing support.

We are looking for a motivated HPC/Linux Systems Engineer to join their new HPC, AI & Quantum business unit.

Role & Responsibilities:

If you do not have these skills, there is potential to be trained, but I will determine your viability based on your foundation, personality, and situation. I am nothing like other employers. I believe in autonomy and in creating a fun work environment with everyone I choose to bring into my circle of fortune; we will be doing a lot of fun stuff.

Assessing Customer Needs: Collaborate with customers to understand their HPC system requirements and challenges.
HPC Cluster Implementation: Implement and deploy HPC solutions, manage the full stack, and ensure smooth handover.
HPC Stack Support: Resolve technical issues escalated through the support desk, troubleshoot complex systems, and ensure customer satisfaction.
Documentation and Training: Create comprehensive training materials and guides for effective HPC system usage.
Monitoring & Reporting: Conduct ongoing monitoring and generate regular system health, utilization, and status reports.
Test & Demonstration: Support the development and operation of HPC system test and demonstration infrastructure.

Qualifications & Experience:
Ubuntu: 5 years minimum. Debian-based OS skills are preferred; enterprise OSes are not used in HPC.
Slurm: 1 year
InfiniBand: 1 year
Parallel file systems: Lustre, FSx, etc.

Root cause analysis is the main skill. If you have worked in a production environment, you should know what to do if a server has gone down. What are the logical steps when a server is down? Servers do not go down in production without a cause. Before you reboot, there are steps, and you should know them. After the reboot, you know it is going to happen again, so what are you going to do to catch it? This should be basic knowledge before you apply. Don't apply if you don't have root cause analysis under your belt. I will give you standard production-type issues that a seasoned engineer will know how to handle. In HPC support, that is what you will be doing all day.

Practical HPC technology experience.
Proficient in Linux, HPC scheduling with Slurm, GPUs, and configuration management and provisioning tools such as Ansible and Terraform, plus root cause analysis regardless of what the problem is.
Please do not apply without strong Ubuntu Linux skills. I will test you.
PRB