Career Techniques Inc
Description
Hybrid - 3 days/week in-office
US Citizens (USC) and Green Card (GC) Holders Preferred
Relocation Assistance Available!
In this role, you will design, implement, and optimize GPU-accelerated container platforms at scale, enabling high-performance workloads (AI/ML, HPC, LLM training) across hybrid or on-prem environments.
You will have deep expertise in both the NVIDIA and Kubernetes ecosystems, including GPU scheduling, device plugins and custom operators.
Key responsibilities of the role include:
- Architecting and operating Kubernetes clusters optimized for GPU workloads, leveraging NVIDIA GPU Operator, Network Operator and DCGM
- Developing, deploying and maintaining custom Kubernetes operators and controllers to automate infrastructure services
- Integrating NVIDIA device plugins, Multi-Instance GPU (MIG) and GPU sharing features into the scheduling layer
- Optimizing GPU utilization and job placement through scheduler extensions and batch schedulers, such as kube-scheduler plugins, Volcano and Slurm
- Collaborating with HPC, ML and DevOps teams to ensure multi-tenant, high-throughput cluster performance
- Driving observability and telemetry integrations using Prometheus, Grafana, DCGM Exporter and OpenTelemetry
- Implementing secure multi-user and multi-namespace GPU isolation, with RBAC and policy enforcement tools such as OPA or Gatekeeper
- Maintaining CI/CD pipelines for Kubernetes infrastructure, following GitOps practices with ArgoCD and FluxCD
- Contributing to infrastructure-as-code using Terraform, Helm and Kustomize
- Participating in performance tuning, incident response and production readiness reviews
