Published

July 24, 2026

Location

Dallas, TX - Hybrid 3 days/week In-Office

Description


U.S. citizens and green card holders preferred

Relocation Assistance Available!

In this role, you will design, implement, and optimize GPU-accelerated container platforms at scale, enabling high-performance workloads (AI/ML, HPC, LLM training) across hybrid or on-prem environments.

You will have deep expertise in both the NVIDIA and Kubernetes ecosystems, including GPU scheduling, device plugins and custom operators.

Key responsibilities of the role include:

Architecting and operating Kubernetes clusters optimized for GPU workloads, leveraging NVIDIA GPU Operator, Network Operator and DCGM
Developing, deploying and maintaining custom Kubernetes operators and controllers to automate infrastructure services
Integrating NVIDIA device plugins, Multi-Instance GPU (MIG) and GPU sharing features into the scheduling layer
Optimizing GPU utilization and job placement through scheduler extensions and batch schedulers, such as kube-scheduler plugins, Volcano and Slurm
Collaborating with HPC, ML and DevOps teams to ensure multi-tenant, high-throughput cluster performance
Driving observability and telemetry integrations using Prometheus, Grafana, DCGM Exporter and OpenTelemetry
Implementing secure multi-user and multi-namespace GPU isolation, with RBAC and policy enforcement, such as OPA or Gatekeeper
Maintaining CI/CD pipelines for Kubernetes infrastructure using GitOps tools such as Argo CD and Flux
Contributing to infrastructure-as-code, using Terraform, Helm, and Kustomize
Participating in performance tuning, incident response and production readiness reviews

Requirements

Extensive experience running Kubernetes in production-grade environments and working with the NVIDIA stack on Kubernetes, including GPU Operator, device plugin, NVML, MIG and DCGM
Proficiency in Go or Python for operator development and Kubernetes controller logic
Deep understanding of Kubernetes internals, including CRDs, RBAC, custom controllers and scheduler extensions
Experience with GPU-intensive workloads, such as LLMs, training pipelines and scientific computing
Hands-on experience with Helm, Kustomize and GitOps workflows
Familiarity with CNI plugins, especially NVIDIA CNI and Multus
Experience with monitoring GPU metrics and cluster health using Prometheus and DCGM Exporter
