Love solving gnarly problems in AI infrastructure?
Our client is building the AI Native GPU Cloud—and we need a senior HPC Site Reliability Engineer to keep it humming.
You'll own the reliability and performance of our cutting-edge Nvidia-based HPC systems. Think DGX clusters, RoCE topologies, and automation pipelines built in Ansible and Terraform. If that lights you up, read on.
This role is remote (US-based) and offers the chance to shape our infrastructure from the ground up. Expect high-impact work, loads of autonomy, and collaboration with smart folks across architecture, engineering, and ops.
You'll:
This role is perfect if you:
Bonus points for CCIE/JNCIS, InfiniBand, or cloud/HPC interconnect experience.
Sound like your kind of challenge? Hit apply and let's talk.