Information Technology

Senior HPC Systems Engineer (Remote)

Remote
Work Type: Full Time

RedLine Performance Solutions (RedLine) has been in the HPC solutions engineering services business for 26 years and consistently keeps the "bar of excellence" high for new hires. This enables RedLine to accomplish what other firms cannot and promotes a high level of staff retention. We offer services ranging from full life cycle HPC systems engineering to remote managed services to HPC program analysis.


We are seeking a Senior HPC Systems Engineer to join our NASA NACS High Performance Computing (HPC) team at NASA's Ames Research Center in Mountain View, CA. The role is primarily supercomputing batch scheduler development, with secondary support for supercomputing systems administration, on our NASA NACS HPC contract.


U.S. citizenship and the ability to obtain a Public Trust security clearance are mandatory requirements for this position. This position can be remote, but work hours follow Pacific time zone business hours. Travel to the customer site will be required 2-3 times a year.

 

An individual at this skill level should have demonstrated extensive experience working with common HPC batch schedulers (e.g., PBS, Slurm, or Moab/Torque) while helping users of HPC resources resolve issues that keep their applications from running efficiently. This individual should also have demonstrated experience installing, maintaining, and upgrading HPC systems. The individual, along with the entire HPC team, will be engaged in the day-to-day operations and support of the HPC resources. Activities may include system patching, OS upgrades, deploying new systems, writing scripts, and troubleshooting system issues on the HPC system. The ability to interact with users to determine symptoms and then reproduce their issues to isolate the root cause of failure is a critical skill for this position. There will also be activities in testing, benchmarking, user tool scripting, and analyzing trouble tickets to find patterns indicating system or user education issues.

 

Duties and Responsibilities:

  • Oversee and directly contribute to significant ongoing HPC integrations into the environment
  • Write and shepherd scalable feature designs through the entire software development process, from requirements and use cases to release
  • Design and develop enhancements to the PBSPro batch scheduler based on customer-driven requirements
    • Work extensively with the PBS vendor, Altair, on bug fixes and feature releases
  • Apply best practices in software engineering, delivering projects on time, on budget, and with excellent quality
  • Provide support to staff and end users to resolve batch scheduler issues
  • Modify existing software to correct errors and/or improve performance
  • Mentor junior staff and cross-train peers
  • Provide after-hours and weekend support as required
  • Contribute, in a secondary capacity, to supercomputing systems administration, including:
    • Day-to-day operations of the Linux HPC clusters and storage systems
    • Proactive monitoring, analysis, and correction of system issues
    • Development of scripts to automate repetitive tasks or tools to enhance support of the HPC systems
    • System performance analysis and tuning
    • Building, installing, and supporting user-requested software
    • Supporting evaluation and assessment of new HPC technology
    • Resolving user-reported issues and managing support ticket requests in Remedy

Requirements:

  • Bachelor of Science degree in Computer Science or a related field
  • Strong computer science background with in-depth systems-level knowledge in operating systems and networking
  • Solid understanding of the software development process, including requirements, use cases, design, coding, documentation and testing of scalable, distributed applications in a Linux environment
  • A minimum of 10 years of experience with integration development of HPC systems and scheduling software (PBS, Slurm, or Moab/Torque)
  • A minimum of 10 years of experience developing system software in heterogeneous, multi-platform HPC environments
  • Strong ability to analyze, debug and maintain the integrity of an existing code base
  • Demonstrated equivalent of 10 years of Linux/UNIX user support experience and hands-on experience with administration of Linux systems
  • Experience working with HPC applications and proficiency in at least one of C, C++, or Fortran
  • Superior scripting skills and excellent attention to detail; proficiency in at least one of Python, Perl, or Bash
  • Strong ability to interact with customers to understand needs, elicit requirements, and get feedback on prototype solutions
  • Excellent communication and people skills; excellent time management and organizational skills
  • Experience with system configuration management tools (e.g., Puppet, Chef, Ansible)
  • Experience with revision control software (e.g., CVS, SVN, Git)
  • Track record of delivering commercial-quality software on schedule through multiple release cycles
  • Proficiency in technical writing

Preferred Skills:

  • Strong analysis and problem-solving skills for debugging and optimizing applications
  • Familiarity or proficiency with OpenMP and Message Passing Interface (MPI) programming
  • Experience with Lustre and InfiniBand
  • Experience with cloud technologies (AWS, Azure, GCP), OpenStack or Kubernetes is a plus

To learn more about RedLine, please visit our website at www.RedLinePerf.com

