Staff Engineer, Distributed Storage and HPC & AI Infrastructure
Company: Together AI
Location: San Francisco
Posted on: April 1, 2026
Job Description:
About the Role

In this role, you will design and deliver multi-petabyte storage systems purpose-built for the world’s largest AI training and inference workloads. You’ll architect high-performance parallel filesystems and object stores, evaluate and integrate cutting-edge technologies such as WekaFS, Ceph, and Lustre, and drive aggressive cost optimization, routinely achieving 30-50% savings through intelligent tiering, lifecycle policies, capacity forecasting, and right-sizing. You will also build Kubernetes-native storage operators and self-service platforms that provide automated provisioning, strict multi-tenancy, performance isolation, and quota enforcement at cluster scale. Day-to-day, you’ll optimize end-to-end data paths for 10-50 GB/s per node, design multi-tier caching architectures, implement intelligent prefetching and model-weight distribution, and tune parallel filesystems for AI workloads.
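To make the tiering and lifecycle levers behind savings of that magnitude concrete, here is a minimal sketch in Go (the posting's primary language). The tier names, access-age thresholds, and per-GB prices are hypothetical illustrations, not Together AI's actual policy:

```go
package main

import (
	"fmt"
	"time"
)

// Tier is a hypothetical storage tier with an illustrative $/GB-month price.
type Tier struct {
	Name         string
	PricePerGBMo float64
}

// Hypothetical tiers and prices; real numbers depend on provider and contract.
var (
	hotNVMe = Tier{"hot-nvme", 0.20}
	warmObj = Tier{"warm-object", 0.02}
	archive = Tier{"archive", 0.004}
)

// placeTier implements a toy lifecycle policy: data touched in the last week
// stays on NVMe, colder data moves to object storage, and anything untouched
// for 90+ days is archived. Thresholds are made up for illustration.
func placeTier(lastAccess, now time.Time) Tier {
	age := now.Sub(lastAccess)
	switch {
	case age < 7*24*time.Hour:
		return hotNVMe
	case age < 90*24*time.Hour:
		return warmObj
	default:
		return archive
	}
}

func main() {
	now := time.Now()
	for _, days := range []int{1, 30, 365} {
		last := now.Add(-time.Duration(days) * 24 * time.Hour)
		t := placeTier(last, now)
		fmt.Printf("last accessed %3dd ago -> %-11s ($%.3f/GB-mo)\n",
			days, t.Name, t.PricePerGBMo)
	}
}
```

Most of the savings in a policy like this comes from the fact that cold data usually dominates capacity, so demoting it off NVMe moves the bulk of stored bytes to tiers that cost an order of magnitude less.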
Responsibilities

- Design multi-petabyte AI/ML storage systems; integrate WekaFS, Ceph, etc.; lead capacity planning and cost optimization (30-50% savings via tiering, lifecycle policies, right-sizing).
- Design/optimize RDMA, InfiniBand, 400GbE networks; tune for max throughput/min latency; implement NVMe-oF/iSCSI; troubleshoot bottlenecks; optimize TCP/IP for storage.
- Build Kubernetes storage operators/controllers; enable automated provisioning, self-service abstractions, multi-tenant isolation, quotas; create reusable Helm/Terraform patterns.
- Deliver 10-50 GB/s per GPU node; optimize caching (weights/datasets/checkpoints), parallel filesystems, and data paths; troubleshoot with profiling tools; scale to thousands of nodes.
- Build multi-tier caches (local NVMe, distributed, object); optimize data locality and model-weight distribution; implement smart prefetching/eviction (see the sketch after this list).
- Implement monitoring, alerting, SLOs; design DR/backups with runbooks; run chaos engineering; ensure 99.9% uptime via proactive/automated remediation.
- Partner with ML/SRE teams; mentor on storage best practices; contribute to open-source; write docs, postmortems, and public learnings.
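As referenced in the caching item above, the read path of a multi-tier cache can be sketched compactly. This is a minimal illustration under assumed interfaces (in-memory maps stand in for the local NVMe, distributed, and object tiers); a real implementation adds eviction, concurrency control, and prefetch scheduling:

```go
package main

import (
	"errors"
	"fmt"
)

var errMiss = errors.New("miss")

// tier is a stand-in for one cache level; maps substitute for local NVMe,
// a distributed cache, and the backing object store in this sketch.
type tier struct {
	name string
	data map[string][]byte
}

func (t *tier) get(key string) ([]byte, error) {
	if v, ok := t.data[key]; ok {
		return v, nil
	}
	return nil, errMiss
}

// tieredCache reads through tiers fastest-first and backfills faster tiers
// on a hit in a slower one, so hot model weights end up local to the node.
type tieredCache struct {
	tiers []*tier
}

func (c *tieredCache) get(key string) ([]byte, string, error) {
	for i, t := range c.tiers {
		v, err := t.get(key)
		if err != nil {
			continue
		}
		for j := 0; j < i; j++ { // promote into all faster tiers
			c.tiers[j].data[key] = v
		}
		return v, t.name, nil
	}
	return nil, "", errMiss
}

func main() {
	c := &tieredCache{tiers: []*tier{
		{"nvme", map[string][]byte{}},
		{"distributed", map[string][]byte{}},
		{"object", map[string][]byte{"model.bin": []byte("weights")}},
	}}
	for i := 0; i < 2; i++ {
		_, hit, _ := c.get("model.bin")
		fmt.Printf("read %d served from %s tier\n", i+1, hit)
	}
}
```

The same promote-on-read shape generalizes to prefetching: instead of waiting for a miss, a scheduler that knows a training job's data manifest can warm the fast tiers before the first read.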
Requirements

- 8 years in storage engineering, with 3 years managing distributed storage at multi-petabyte scale
- Proven track record deploying and operating high-performance storage for GPU/HPC clusters
- Deep Kubernetes and cloud-native storage experience in production environments
- Strong coding skills in Go and Python, with demonstrated ability to build production-grade tools
- BS/MS in Computer Science, Engineering, or equivalent practical experience
- History of technical leadership: designing systems that significantly improved performance (>3x), reliability (99.9% uptime), or cost efficiency
- Distributed Storage Systems: deep expertise in WekaFS, Lustre, GPFS, BeeGFS, or similar parallel filesystems at multi-petabyte scale
- Object Storage: production experience with S3, MinIO, Ceph, or R2, including performance optimization and cost management
- Kubernetes Storage: CSI drivers, StatefulSets, PersistentVolumes, storage operators, and custom controllers
- Storage optimization for GPU workloads, RDMA/InfiniBand networking, parallel filesystem optimization (100 GB/s aggregate cluster throughput)
- Programming: Go and Python for automation, operators, and tooling
- Infrastructure as Code: Terraform, Ansible, Helm, GitOps (ArgoCD)
- Linux Storage Stack: advanced knowledge of filesystems (ext4, xfs), LVM, NVMe optimization, RAID configurations
- Observability: Prometheus, Grafana, Thanos architecture and operations (see the sketch after this list)
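On the Observability requirement above: a storage metrics exporter in Go is a small amount of code. This sketch uses the real github.com/prometheus/client_golang library; the metric name, labels, sample value, and port are hypothetical placeholders. A real exporter would sample counters from the filesystem client or /proc and compute rates:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical per-node throughput gauge, broken out by cache tier.
var readBytesPerSec = promauto.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "storage_read_bytes_per_second",
		Help: "Sustained read throughput observed on a node, by cache tier.",
	},
	[]string{"node", "tier"},
)

func main() {
	// Placeholder sample: 42 GB/s on a hypothetical GPU node's NVMe tier.
	readBytesPerSec.WithLabelValues("gpu-node-17", "nvme").Set(42e9)

	// Expose /metrics for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9100", nil))
}
```

Grafana dashboards, alerting on SLO burn, and Thanos for long-term multi-cluster retention then sit on top of metrics like this one.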
Nice to Have Skills

- GPUDirect Storage (GDS), NVMe-oF, storage networking (100GbE/400GbE)
- ML/AI storage patterns (model weights, checkpointing, dataset caching)
- Kubernetes operator development (controller-runtime, kubebuilder) (see the sketch after this list)
- Storage snapshots, cloning, and thin provisioning
- Backup and disaster recovery (Velero, Restic, cross-region replication)
- Storage encryption (at-rest and in-transit), security and compliance
- Storage benchmarking and profiling tools (fio, iperf3, iostat, blktrace)
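On the operator-development item above: a minimal controller built with the real sigs.k8s.io/controller-runtime library looks like the following. The tenant-label key and the policy itself are hypothetical placeholders, not an actual Together AI convention:

```go
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// PVCReconciler watches PersistentVolumeClaims; a production operator would
// typically manage a CRD for tenant quotas instead of a bare label check.
type PVCReconciler struct {
	client.Client
}

func (r *PVCReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var pvc corev1.PersistentVolumeClaim
	if err := r.Get(ctx, req.NamespacedName, &pvc); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Hypothetical policy: flag claims missing a tenant label so capacity
	// can be attributed and quota enforced by a later automation step.
	if _, ok := pvc.Labels["storage.example.com/tenant"]; !ok {
		log.Printf("pvc %s/%s has no tenant label", pvc.Namespace, pvc.Name)
	}
	return ctrl.Result{}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		log.Fatal(err)
	}
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.PersistentVolumeClaim{}).
		Complete(&PVCReconciler{Client: mgr.GetClient()}); err != nil {
		log.Fatal(err)
	}
	log.Fatal(mgr.Start(ctrl.SetupSignalHandler()))
}
```

kubebuilder scaffolds exactly this shape, plus the CRD types and RBAC manifests a production operator needs.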
About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next generation of AI infrastructure.

Compensation

We offer
competitive compensation, startup equity, health insurance, and
other benefits, as well as flexibility in terms of remote work. The
US base salary range for this full-time position is $160,000 - $260,000, plus equity and benefits. Our salary ranges are determined by
location, level and role. Individual compensation will be
determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is
proud to offer equal employment opportunity to everyone regardless
of race, color, ancestry, religion, sex, national origin, sexual
orientation, age, citizenship, marital status, disability, gender
identity, veteran status, and more. Please see our privacy policy
at https://www.together.ai/privacy