Career Techniques Inc
Description
You'll help design and scale observability platforms that handle telemetry from industry-leading GPU clusters and large-scale distributed systems. You'll work closely with experienced engineers to develop metrics pipelines, logging systems and tracing solutions that improve reliability and visibility across our services.
Must Haves:
- Experience with modern observability tools and frameworks, such as Prometheus, Grafana or OpenTelemetry (OTEL)
- Exposure to cloud platforms, such as AWS, Azure, or Google Cloud
- Familiarity with microservices architectures and containerized environments, such as Kubernetes and Docker
- Interest in system reliability, performance engineering and platform-scale infrastructure
- Good communication and collaboration skills
Nice to Haves:
- Exposure to enterprise observability platforms, such as Datadog or Dynatrace
- Experience working with telemetry data (metrics, logs, traces) in large environments
- Proficiency in scripting or programming languages (e.g. Python, Go)
- Familiarity with Infrastructure-as-Code tools or deployment automation
