Published

May 6, 2026

Location

Dallas, TX - Hybrid - 3 days/week in-office

Description

You'll help design and scale observability platforms that handle telemetry from industry-leading GPU clusters and large-scale distributed systems. You'll work closely with experienced engineers to develop metrics pipelines, logging systems and tracing solutions that improve reliability and visibility across our services.

Must Haves

Experience with modern observability tools and frameworks, such as Prometheus, Grafana or OpenTelemetry (OTEL)
Exposure to cloud platforms, such as AWS, Azure, or Google Cloud
Familiarity with microservices architectures and containerized environments, such as Kubernetes and Docker
Interest in system reliability, performance engineering and platform-scale infrastructure
Good communication and collaboration skills

Nice to Haves

Exposure to enterprise observability platforms, such as Datadog or Dynatrace
Experience working with telemetry data (metrics, logs, traces) in large environments
Proficiency in scripting or programming languages (e.g. Python, Go)
Familiarity with Infrastructure-as-Code tools or deployment automation
