Job Details

HPC Site Reliability Engineer

  2025-06-23     Trust In Soda     Santa Rosa, CA
Description:

Love solving gnarly problems in AI infrastructure?


Our client is building the AI Native GPU Cloud, and we need a senior HPC Site Reliability Engineer to keep it humming.


You'll own the reliability and performance of our cutting-edge NVIDIA-based HPC systems. Think DGX clusters, RoCE topologies, and automation pipelines built in Ansible and Terraform. If that lights you up, read on.


This role is remote (US-based) and offers the chance to shape our infrastructure from the ground up. Expect high-impact work, loads of autonomy, and collaboration with smart folks across architecture, engineering, and ops.


You'll:

  • Set up and optimize HPC clusters and networks (think DGX, HGX, GPUDirect)
  • Debug low-level networking issues on Cisco, Juniper, and other vendor gear
  • Automate configs with Ansible + Terraform
  • Monitor everything with Grafana, UFM, ELK, NetQ
  • Own 24/7 reliability, on-call, and root cause analysis

This role is perfect if you:

  • Have 6+ years in HPC or networking-heavy roles
  • Know BGP, EVPN, VXLAN, and RDMA inside and out
  • Have SRE experience in high-stakes environments
  • Love solving infra puzzles at scale


Bonus points for CCIE/JNCIS, InfiniBand, or cloud/HPC interconnect experience.


Sound like your kind of challenge? Hit apply and let's talk.

