
Principal Sustaining Engineer – Forward Deployed
Abacus Insights
full-time
Location Type: Remote
Location: United States
About the role
- Act as a senior technical escalation point during production incidents
- Lead real-time incident triage, mitigation, and recovery efforts
- Drive root cause analysis (RCA) with a focus on systemic, long-term fixes
- Identify recurring failure patterns and push for architectural or operational improvements
- Partner with Customer Success and Engineering to manage customer impact during incidents
- Own post-launch reliability, stability, and operational quality of core systems
- Investigate and resolve complex field issues and production defects
- Ensure fixes developed during incidents or customer escalations are upstreamed into the core product
- Improve operational readiness of services through better runbooks, monitoring, and alerting
- Reduce operational toil by converting repeated manual work into automation
- Engage directly with strategic customers to solve real-world, production-grade technical challenges
- Support complex deployments, integrations, and escalations in customer environments
- Act as a trusted technical partner to customers during high-impact issues
- Translate customer learnings into concrete product, platform, and operational improvements
- Contribute to reusable tools, playbooks, and best practices that accelerate future deployments
- Serve as a subject matter expert for AWS-hosted production systems
- Troubleshoot and resolve issues across:
  - AWS compute, storage, networking, IAM, and security
  - Databricks jobs, clusters, and Spark-based data pipelines
- Debug performance degradation, scalability issues, job failures, and data correctness problems
- Partner with platform and data teams to harden systems for reliability, scale, and operability
- Write production-quality code to:
  - Automate operational workflows
  - Improve reliability and observability
  - Eliminate manual intervention and reduce incident frequency
- Contribute primarily in Python, with exposure to JVM-based systems as needed
- Review code with a strong emphasis on operability, resiliency, and maintainability
- Provide technical leadership without formal authority, influencing design and operational decisions
- Mentor engineers through pairing, reviews, and incident leadership
- Collaborate closely with Product, Engineering, Data, and Customer teams
- Operate effectively in high-pressure, ambiguous environments, especially during customer-impacting incidents
Requirements
- 10+ years of experience in software engineering, SRE, sustaining engineering, or production operations
- Deep hands-on experience operating production systems in AWS
- Strong experience troubleshooting Databricks and large-scale data platforms
- Proficiency in Python and experience building production services or tooling
- Strong understanding of:
  - Distributed systems
  - Incident management and RCA practices
  - Monitoring, alerting, and observability
  - CI/CD pipelines that leverage Infrastructure as Code
- Proven ability to own problems end-to-end, from detection to permanent resolution
- Excellent communication skills, especially during incidents and customer escalations
- Ability to work backward from customer impact to root cause across systems and codebases, delivering fixes in environments with minimal documentation
- Strong instinct for operational risk, with the ability to proactively identify failure modes and harden systems before they impact customers
Benefits
- Unlimited paid time off – recharge when you need it
- Work from anywhere – flexibility to fit your life
- Comprehensive health coverage – multiple plan options to choose from
- Equity for every employee – share in our success
- Growth-focused environment – your development matters here
- Home office setup allowance – one-time support to get you started
- Monthly cell phone allowance – stay connected with ease