An individual at this skill level should have demonstrated extensive experience working through all aspects of large-scale HPC environments: designing, installing, maintaining, and upgrading HPC systems. Specific desired areas of expertise include common HPC batch schedulers (e.g., PBS, Slurm, or Moab/Torque) and InfiniBand troubleshooting and optimization.
The prospective candidate, along with the entire HPC team, will be expected to engage in the day-to-day operations and support of the HPC resources. Activities may include system patching, OS upgrades, deploying new systems, writing scripts, and troubleshooting system issues on the HPC system, while also supporting the scientific users of HPC resources with the issues they encounter in getting applications to run efficiently. The ability to interact with users to determine symptoms, and then reproduce their issues to isolate the causes, is a critical skill for this work. There will also be activities in testing, benchmarking, user tool scripting, and analyzing trouble tickets to find patterns indicating system or user education issues.
location: Mountain View, California
job type: Contract
salary: $90 - 100 per hour
work hours: 8am to 5pm
education: Bachelor's
responsibilities:
- Designs, deploys, and maintains production HPC clusters with 2,000+ nodes, InfiniBand interconnect, and 100+ petabytes of data storage.
- Shepherd and/or contribute to scalable feature designs through the entire software development process, from requirements and use cases to release
- Designs and develops scripts for system administration, monitoring and usage reporting.
- Modify existing software to correct errors and/or improve performance
- Designs and develops scripts for system regression and performance testing (file systems (Lustre), scheduler (PBS), interconnect (HDR/NDR, Slingshot), high availability, etc.).
- Troubleshoots, isolates and resolves application, system and other technical problems (hardware, software, and network).
- Understands research use cases, researches and deploys new technologies, defining cost, performance and other trade-offs.
- Manages and maintains tools for provisioning, configuration management (HPCM, Ansible, and Git), resource management, scheduling, and all necessary aspects of HPC in accordance with best practices.
- Researches, deploys and manages networking and security infrastructure, including development of policies and procedures.
- Assists in developing and writing proposals and publications.
- Creates and provides clear documentation.
- Mentors junior staff and cross-trains peers
- After hours/weekend support as required
- Moderate Supercomputing System Administration that contributes to:
  - Day-to-day operations of the Linux HPC clusters and storage systems
  - Proactively monitoring, analyzing, and correcting system issues
  - Development of scripts to automate repetitive tasks or tools to enhance support of the HPC systems
  - System performance analysis and tuning
  - Building, installing, and supporting user-requested software
  - Supporting evaluation and assessment of new HPC technology
  - Resolving user-reported issues and managing support ticket requests in Remedy
qualifications:
- Proficiency with analysis and problem-solving skills for debugging and optimization of applications
- Familiarity/proficiency with OpenMP and Message Passing Interface (MPI) programming
- Experience with Lustre and InfiniBand
- Experience with cloud technologies (AWS, Azure, GCP), OpenStack or Kubernetes is a plus
skills:
- Bachelor's degree in computer science or related field
- Strong computer science background with in-depth systems-level knowledge in operating systems and networking
- A minimum of 10 years of experience in the administration of HPC systems and scheduling software (PBS, Slurm, or Moab/Torque)
- A minimum of 10 years of experience in systems programming in heterogeneous, multi-platform HPC environments
- Strong ability to analyze, debug and maintain the integrity of an existing code base
- Demonstrated equivalent of 5 years of Linux/UNIX user support experience and hands-on experience with administration of Linux systems
- Experience working with HPC applications and proficiency in at least C, C++, or Fortran
- Superior scripting skills and excellent attention to detail; proficiency in at least Python, Perl, or Bash
- Strong ability to interact with customers to understand needs, elicit requirements, and get feedback on prototype solutions
- Excellent communication and people skills; excellent time management and organizational skills
- Experience with system configuration management tools (e.g., Puppet, Chef, Ansible)
- Experience with revision control software (e.g., CVS, SVN, Git)
- Track record of delivering commercial-quality software on schedule through multiple release cycles
- Proficiency at documentation and technical writing
Equal Opportunity Employer: Race, Color, Religion, Sex, Sexual Orientation, Gender Identity, National Origin, Age, Genetic Information, Disability, Protected Veteran Status, or any other legally protected group status.
At Randstad Digital, we welcome people of all abilities and want to ensure that our hiring and interview process meets the needs of all applicants. If you require a reasonable accommodation to make your application or interview experience a great one, please contact HRsupport@randstadusa.com.
Pay offered to a successful candidate will be based on several factors including the candidate's education, work experience, work location, specific job duties, certifications, etc. In addition, Randstad Digital offers a comprehensive benefits package, including health, an incentive and recognition program, and 401K contribution (all benefits are based on eligibility).
This posting is open for thirty (30) days.
Qualified applicants in San Francisco with criminal histories will be considered for employment in accordance with the San Francisco Fair Chance Ordinance.
Qualified applicants in the unincorporated areas of Los Angeles County with criminal histories will be considered for employment in accordance with the Los Angeles County Fair Chance Ordinance for Employers.
We will consider for employment all qualified Applicants, including those with criminal histories, in a manner consistent with the requirements of applicable state and local laws, including the City of Los Angeles' Fair Chance Initiative for Hiring Ordinance.