
Software Development Engineer III – SRE
Expedia Group
full-time
Location Type: Office
Location: Gurgaon • India
About the role
- Engage with domain owners on Business Continuity plans and run simulated Disaster Recovery scenarios to improve application resilience.
- Implement and support monitoring and alerting strategies to ensure health, availability, capacity, and performance standards are met.
- Monitor systems and proactively identify errors and opportunities to improve the customer experience.
- Share domain and industry knowledge between cross-functional teams.
- Facilitate collaboration among stakeholders with varied perspectives to develop effective solutions for Disaster Recovery processes.
- Build reporting capabilities to showcase operational health and quality.
- Provide technical support by identifying, troubleshooting, and resolving issues and their impacts.
- Drive a culture of root cause analysis and continuous improvement.
- Operationally support applications and services across multiple environments.
Requirements
- Bachelor’s or Master’s degree in a technical field with 6+ years of experience, or equivalent related professional experience
- Excellent problem-solving and analytical skills with strong attention to detail
- Experience in System Design and Architecture
- Strong written and verbal communication skills
- Expert in AWS and EKS, with in-depth knowledge of infrastructure setup and multi-region environments
- In-depth knowledge of Reliability Concepts such as SLOs, SLIs, Error Budgets, and Disaster Recovery processes
- Exposure to Python scripting (preferred)
- Knowledge of AI/ML concepts (preferred)
- Knowledge of monitoring tools such as Splunk, Datadog, and Catchpoint (preferred)
Benefits
- exciting travel perks
- generous time-off
- parental leave
- flexible work model
- career development resources
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
System Design, Architecture, AWS, EKS, Python Scripting, Reliability Concepts, SLOs, SLIs, Error Budgets, Disaster Recovery
Soft skills
problem-solving, analytical skills, attention to detail, written communication, verbal communication, collaboration, stakeholder engagement, root cause analysis, continuous improvement