Abacus Insights

Principal Sustaining Engineer – Forward Deployed

Full-time

Location Type: Remote

Location: United States

About the role

  • Act as a senior technical escalation point during production incidents
  • Lead real-time incident triage, mitigation, and recovery efforts
  • Drive root cause analysis (RCA) with a focus on systemic, long-term fixes
  • Identify recurring failure patterns and push for architectural or operational improvements
  • Partner with Customer Success and Engineering to manage customer impact during incidents
  • Own post-launch reliability, stability, and operational quality of core systems
  • Investigate and resolve complex field issues and production defects
  • Ensure fixes developed during incidents or customer escalations are upstreamed into the core product
  • Improve operational readiness of services through better runbooks, monitoring, and alerting
  • Reduce operational toil by converting repeated manual work into automation
  • Engage directly with strategic customers to solve real-world, production-grade technical challenges
  • Support complex deployments, integrations, and escalations in customer environments
  • Act as a trusted technical partner to customers during high-impact issues
  • Translate customer learnings into concrete product, platform, and operational improvements
  • Contribute to reusable tools, playbooks, and best practices that accelerate future deployments
  • Serve as a subject matter expert for AWS-hosted production systems
  • Troubleshoot and resolve issues across:
      • AWS compute, storage, networking, IAM, and security
      • Databricks jobs, clusters, and Spark-based data pipelines
  • Debug performance degradation, scalability issues, job failures, and data correctness problems
  • Partner with platform and data teams to harden systems for reliability, scale, and operability
  • Write production-quality code to:
      • Automate operational workflows
      • Improve reliability and observability
      • Eliminate manual intervention and reduce incident frequency
  • Contribute primarily in Python, with exposure to JVM-based systems as needed
  • Review code with a strong emphasis on operability, resiliency, and maintainability
  • Provide technical leadership without formal authority, influencing design and operational decisions
  • Mentor engineers through pairing, reviews, and incident leadership
  • Collaborate closely with Product, Engineering, Data, and Customer teams
  • Operate effectively in high-pressure, ambiguous environments, especially during customer-impacting incidents

Requirements

  • 10+ years of experience in software engineering, SRE, sustaining engineering, or production operations
  • Deep hands-on experience operating production systems in AWS
  • Strong experience troubleshooting Databricks and large-scale data platforms
  • Proficiency in Python and experience building production services or tooling
  • Strong understanding of:
      • Distributed systems
      • Incident management and RCA practices
      • Monitoring, alerting, and observability
      • CI/CD pipelines that leverage Infrastructure as Code
  • Proven ability to own problems end-to-end, from detection to permanent resolution
  • Excellent communication skills, especially during incidents and customer escalations
  • Ability to work backward from customer impact to root cause across systems and codebases, delivering fixes in environments with minimal documentation
  • Strong instinct for operational risk, with the ability to proactively identify failure modes and harden systems before they impact customers

Benefits

  • Unlimited paid time off – recharge when you need it
  • Work from anywhere – flexibility to fit your life
  • Comprehensive health coverage – multiple plan options to choose from
  • Equity for every employee – share in our success
  • Growth-focused environment – your development matters here
  • Home office setup allowance – one-time support to get you started
  • Monthly cell phone allowance – stay connected with ease

Applicant Tracking System Keywords

Hard Skills & Tools
Python, AWS, Databricks, distributed systems, incident management, root cause analysis, monitoring, alerting, CI/CD, Infrastructure as Code
Soft Skills
communication, technical leadership, mentoring, problem-solving, collaboration, operational risk management, adaptability, influencing, customer focus, working under pressure