Job Details

HPC Site Reliability Engineer

2025-06-23 | Trust In Soda | Santa Rosa, CA

Description:

Love solving gnarly problems in AI infrastructure?

Our client is building the AI Native GPU Cloud, and we need a senior HPC Site Reliability Engineer to keep it humming.

You'll own the reliability and performance of our cutting-edge NVIDIA-based HPC systems. Think DGX clusters, RoCE topologies, and automation pipelines built in Ansible and Terraform. If that lights you up, read on.

This role is remote (US-based) and offers the chance to shape our infrastructure from the ground up. Expect high-impact work, loads of autonomy, and collaboration with smart folks across architecture, engineering, and ops.

You'll:

Set up and optimize HPC clusters and networks (think DGX, HGX, GPUDirect)
Debug low-level networking issues with Cisco, Juniper, and more
Automate configs with Ansible + Terraform
Monitor everything with Grafana, UFM, ELK, NetQ
Own 24/7 reliability, on-call, and root cause analysis

This role is perfect if you:

Have 6+ years in HPC or networking-heavy roles
Know BGP, EVPN, VXLAN, and RDMA inside and out
Have SRE experience in high-stakes environments
Love solving infra puzzles at scale

Bonus points for CCIE/JNCIS, InfiniBand, or cloud/HPC interconnect experience.

Sound like your kind of challenge? Hit apply and let's talk.
