Posted 5 days ago

Senior Distributed Systems Engineer

Institute of Foundation Models

Sunnyvale · Full-time · Mid Level · On-site

Job Description

About the Institute of Foundation Models

The Institute of Foundation Models (IFM) designs and operates ultra-scale GPU supercomputing systems to train next-generation foundation models. We believe performance, fault tolerance, and scalability must be co-designed across model architecture, communication systems, runtime, and hardware topology. This role sits at the core of that effort, driving communication performance, distributed reliability, and cross-layer optimization for large-scale training workloads.

The Mission

We are looking for a deeply technical engineer to co-design and optimize the communication stack for large-scale distributed training, including hybrid parallelism and Mixture-of-Experts (MoE) workloads. This is not a network operations role; it is a systems-level engineering position focused on performance engineering, distributed debugging, and communication-runtime co-design.

· Design and optimize expert-parallel and hybrid-parallel communication patterns
· Drive high-performance hierarchical collectives for MoE workloads
· Co-design runtime orchestration with communication topology awareness
· Reduce tail latency and improve determinism across thousands of GPUs
· Architect fault-tolerant distributed execution under real-world cluster failures

Core Technical Scope

· Communication-compute overlap and topology-aware collective optimization
· Deep debugging of NCCL, RDMA, and custom communication layers
· Hybrid expert-parallel strategies in modern large-scale MoE systems
· Elastic and resilient distributed job orchestration
· Congestion analysis and routing optimization across InfiniBand/RoCE fabrics
· Microbenchmarking and performance modeling for communication-heavy workloads

Expected Technical Depth

· Hybrid expert-parallel communication for Mixture-of-Experts training
· Scaling behavior under network pressure
· Distributed orchestration for elastic, large-scale training
· Fault detection and recovery in distributed GPU workloads
· Cross-layer bottlenecks: GPU ↔ NIC ↔ PCIe ↔ NVSwitch ↔ Fabric ↔ Scheduler

Required Background

· Experience optimizing distributed training at 1,000+ GPU scale (or equivalent depth)
· Hands-on expertise with RDMA, InfiniBand, RoCE, and GPUDirect RDMA
· Deep familiarity with NCCL and/or UCX internals
· Strong systems programming ability (C/C++, Rust, or Go)
· Strong familiarity with modern model training frameworks such as PyTorch
· Ability to troubleshoot and profile communication-bound training performance issues
· Ability to translate research ideas into production-grade optimizations
· Experience debugging distributed hangs, desynchronization, and performance regressions

What We Mean by "Hardcore"

· You can explain why a communication pattern degrades at scale and how to fix it
· You have improved real cluster throughput through communication redesign
· You can trace a distributed hang across ranks and identify the root cause
· You are comfortable working at the boundary between hardware and runtime

Application Requirements

· Include a link to your GitHub (required)
· Provide links to relevant distributed systems, HPC, or large-scale training projects
· Include a list of publications and/or public technical reports (if applicable)
· Describe the hardest distributed debugging problem you have solved
· Include measurable performance improvements you have delivered

Academic Qualifications

Master's degree, or Bachelor's degree plus at least one year of relevant experience.
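The performance-modeling work mentioned above typically starts from a simple alpha-beta (latency-bandwidth) cost model for collectives. A minimal sketch for ring all-reduce follows; the latency and bandwidth constants are illustrative assumptions for this example, not IFM cluster numbers:

```python
def ring_allreduce_time(n_bytes: int, p: int, alpha: float, beta: float) -> float:
    """Alpha-beta cost model for ring all-reduce over p ranks.

    2(p-1) communication steps each pay the per-message latency alpha,
    and 2(p-1)/p of the buffer crosses the wire at beta seconds/byte.
    """
    return 2 * (p - 1) * alpha + (2 * (p - 1) / p) * n_bytes * beta

if __name__ == "__main__":
    alpha = 5e-6       # assumed per-step latency, seconds
    beta = 1 / 25e9    # assumed per-byte time at 200 Gb/s, seconds/byte
    small = ring_allreduce_time(4 * 1024, 16, alpha, beta)      # 4 KiB, 16 ranks
    large = ring_allreduce_time(4 * 1024**3, 16, alpha, beta)   # 4 GiB, 16 ranks
    print(f"4 KiB all-reduce: {small * 1e6:.1f} us (latency-bound)")
    print(f"4 GiB all-reduce: {large:.3f} s (bandwidth-bound)")
```

Even this toy model captures why small MoE all-to-all-style messages are dominated by the latency term while large gradient buckets are bandwidth-bound, which is what motivates hierarchical collectives and communication-compute overlap.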

Visa Sponsorship

This position is eligible for visa sponsorship.

Benefits Include

* Comprehensive medical, dental, and vision benefits
* Bonus
* 401(k) Plan
* Generous paid time off, sick leave, and holidays
* Paid Parental Leave
* Employee Assistance Program
* Life insurance and disability

