Tech Stack
Apache, AWS, Cloud, Google Cloud Platform, JavaScript, Kafka, Python, SaltStack, Shell Scripting, Spark
About the role
- Serve as the first line of defense for production incidents, ensuring rapid triage, root cause analysis, and resolution.
- Monitor the health and performance of deployed APIs and the applications that integrate with them.
- Track and investigate issues related to latency, failures, or broken integrations, escalating to the API engineering group where appropriate.
- Collaborate with API engineering, Developer Services, Product Management, platform, and governance teams to ensure stability and performance.
- Collaborate with platform engineers to implement observability, logging, and alerting best practices for API services.
- Build diagnostic tools, runbooks, and automated workflows to improve incident response time and reduce manual intervention.
- Maintain knowledge bases and playbooks for repeatable troubleshooting and knowledge transfer.
- Partner with governance and compliance teams to ensure incidents are documented and remediated in line with internal policy.
- Contribute to retrospectives and continuous improvement efforts to harden production systems.
Requirements
- 3+ years of experience in production support, site reliability engineering (SRE), or DevOps, preferably supporting Apigee APIs
- Strong understanding of cloud infrastructure (AWS, GCP) and observability tools
- Proficiency in Python or shell scripting for automation and troubleshooting
- Proficiency in programming languages such as Python and JavaScript
- Strong analytical, communication, and incident management skills
- Familiarity with big data technologies (Apache Spark, Kafka)
- Experience with CI/CD tools and with alerting and monitoring automation
- Familiarity with API integrations
- Bachelor’s degree in Computer Science, Engineering, or a related field (preferred)
- Ability to work proactively with a high level of initiative and accuracy
- Ability to manage multiple assignments effectively and meet established deadlines
- Strong interpersonal skills to interact professionally with staff and stakeholders
- Excellent organizational skills and attention to detail
- Critical thinking ability for moderately to highly complex tasks
- Flexibility in adapting to changing business needs and priorities
- Ability to work creatively and independently with minimal supervision
- Ability to utilize experience and judgment in accomplishing goals
- Experience in navigating organizational structures and collaborating across teams
- Travel Required: 2%