Tech Stack
AWS, Cloud, Grafana, Kubernetes, Prometheus, Terraform
About the role
- Serve as first responder for production incidents during U.S. operating hours (±2h EST).
- Lead triage during outages, analyzing logs, metrics, and traces to identify root causes.
- Drive incident postmortems and follow-ups to prevent recurrence.
- Communicate clearly and quickly during incidents to internal stakeholders.
- Own reliability outcomes across all OpenFX systems, with a focus on uptime, latency, and error budgets.
- Enhance observability through logging, metrics, alerting, and dashboards.
- Optimize on-call processes and ensure smooth handoffs across IST, EST, and PST coverage.
- Partner with DevOps and engineering pods to implement fixes or approve production changes.
- Proactively identify systemic reliability risks and propose improvements.
- Contribute automation and tooling to reduce manual incident handling.
- Champion best practices in reliability engineering and operational excellence.
Requirements
- 5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering.
- Proven experience leading incident response, running postmortems, and communicating during outages.
- Strong background with cloud infrastructure (AWS preferred), container orchestration (Kubernetes, ECS), and Infrastructure-as-Code (Terraform, CloudFormation).
- Familiarity with observability stacks (e.g., Prometheus, Grafana, Datadog, ELK, OpenTelemetry).
- Ability to triage errors at both the infrastructure and application level, and escalate effectively when deeper intervention is required.
- Ownership mindset with strong communication skills in high-pressure situations.
What we offer
- Competitive salary and benefits package.
- Equity in a rapidly growing company.
- Opportunity to work on mission-critical infrastructure in fintech.
- A collaborative team culture with a bias toward ownership and outcomes.
- The chance to make a direct impact on the resilience of global financial infrastructure.
Applicant Tracking System Keywords
Hard skills
Site Reliability Engineering, DevOps, Infrastructure Engineering, incident response, postmortems, cloud infrastructure, container orchestration, Infrastructure-as-Code, observability stacks, error triage
Soft skills
communication, ownership mindset, high-pressure situation management