Serve as the primary contact responsible for overall application health, performance, and capacity (Operational Readiness Architect)
Support services before they go live via system design consulting, capacity planning and launch reviews
Partner with development and product teams to establish monitoring and alerting strategy and frameworks for zero downtime deployments
Perform operability and resilience design; implement and maintain highly reliable and scalable infrastructure (Site Reliability Engineering)
Perform root cause analysis of incidents and collaborate with development teams to resolve issues
Participate in on-call rotations and respond to critical incidents; take complete end-to-end run ownership of the product
Practice sustainable incident response and blameless post-mortems; automate data-driven alerts and work with teams to establish SLOs
Tackle complex development, automation, and business process problems and improve the full service lifecycle
Support and lead CI/CD pipeline operations, validation, operational gating, and DevOps automation best practices
Design and implement solutions for capacity planning and performance optimization and increase automation to reduce toil
Analyze ITSM activities and provide feedback to development teams on operational gaps or resiliency concerns
Collaborate with cross-functional teams, mentor others, and drive continuous improvement
Requirements
BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics), or equivalent practical experience
Ability to read, write, and understand code in at least one programming language
Strong understanding of DevOps principles and practices, along with configuration management
Experience in operability and resilience design, and in building and operating large-scale distributed systems
Appetite for change and pushing the boundaries of what can be done with automation
Experience with algorithms, data structures, scripting, pipeline management, and software design
Systematic, analytical problem-solving approach, coupled with strong communication skills and a sense of ownership and drive
Interest in designing, analysing, and troubleshooting large-scale distributed systems
Strong leadership and mentoring skills
A passion for observability, automation and continuous improvement
Willingness and ability to learn, take on challenging opportunities, and work as a member of a matrix-based, diverse, and geographically distributed project team
Ability to balance doing things right with fixing things quickly
Comfortable collaborating with cross-functional teams to ensure that expected system behaviour is understood and monitoring exists to detect anomalies
Expert coding experience in one or more of: C++, Java, Spring Framework, Python, Go, Spark, Big Data, gRPC
Familiarity with cloud platforms like AWS, Azure, or GCP
Experience with Message Queue technologies like RabbitMQ, Event Broker, Kafka, or ActiveMQ
Background in cloud-native tooling and orchestration technologies (Kubernetes preferred)
Experience with observability tools such as Splunk, Dynatrace, Prometheus, Datadog, and Grafana, and with Monitoring as Code
Experience in production support environments and ITIL processes
Experience with CI/CD tools like Git/Bitbucket, Jenkins, Maven, Artifactory, Groovy, and Chef
Understanding of client-server relationships, network concepts (Layer 1 to Layer 3), stack trace analysis, load balancers, application firewalls, operating system navigation, logging and monitoring methods, high availability and business continuity planning, caching concepts, and configuration management
(Great to have) Hands-on experience with Kubernetes, Docker, Azure Container Registry, public cloud strategy and Azure DevOps/AZ-400/AZ-203 certifications, DevSecOps tools, certificate management, mutual TLS, SSL, SSH keys, and encryption
Benefits
insurance (including medical, prescription drug, dental, vision, disability, life insurance)
flexible spending account and health savings account
paid leaves (including 16 weeks of new parent leave and up to 20 days of bereavement leave)
80 hours of Paid Sick and Safe Time
25 days of vacation time and 5 personal days (pro-rated based on date of hire)
10 annual paid U.S. observed holidays
401k with a best-in-class company match
deferred compensation for eligible roles
fitness reimbursement or on-site fitness facilities
eligibility for tuition reimbursement
competitive base salary, with possible eligibility for an annual bonus or commissions depending on the role
ATS Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
strong communication skills, leadership, mentoring, analytical problem-solving, sense of ownership, drive for continuous improvement, collaboration, adaptability, systematic approach, appetite for change