Principal Infrastructure Performance Engineer

Upgrade, Inc.

full-time

Posted on: 8/30/2025

Location: 🇺🇸 United States

Lead

AWS, Cloud, Go, Grafana, Java, Kubernetes, Linux, Microservices, Prometheus, Python, SQL, Terraform

About the role

Build a resilient, secure, and efficient cloud-based observability platform.
Monitor and troubleshoot platform issues, and find solutions that reduce known issues.
Build and scale the observability infrastructure to meet rapidly increasing demand.
Develop and improve operational practices and procedures.
Sample projects:
Improve database monitoring: develop custom Prometheus exporters in Go for use cases that go beyond what is possible with SQL exporter. Create Grafana dashboards and alerts for these new metrics.
MCP servers for observability: deploy an MCP server to integrate our observability stack with our LLM tools.
Our Tech Stack:
Monitoring: VictoriaMetrics, Grafana, Prometheus, OpenTelemetry, Honeycomb, Sumo Logic.
Infrastructure as code: Terraform.
CD: GitOps, Argo CD, Argo Rollouts.
CI: Tekton.
Scripting: Bash.
Programming: Golang (preferred).
AWS: EKS, CloudWatch, S3, DynamoDB, RDS, SNS, SQS, Lambda.

8+ years of relevant production-level experience.
Experience with VictoriaMetrics.
Experience with Sumo Logic.
Experience with tracing tools (e.g. OpenTelemetry, Honeycomb, Tempo).
Experience with profiling tools (e.g. Pyroscope).
Knowledge of cloud monitoring, logging and cost management tools.
Programming/scripting knowledge (Go, Java, or Python) and understanding of JVM concepts.
In-depth knowledge of AWS services and hands-on experience provisioning AWS resources with Terraform.
Experience with containerized applications and Kubernetes/EKS, including creating and maintaining Helm charts.
Understanding of microservices architecture and debugging/investigation techniques.
Strong understanding of systems, networking and troubleshooting techniques.
Experience with automated build pipelines, continuous integration, and continuous deployment.
Ability to operate in an agile, entrepreneurial start-up environment.
Experience running Linux in production.