Senior Site Reliability Engineer, AI Infrastructure
PointClickCare
Senior Site Reliability Engineer at PointClickCare focused on AI platform reliability and operational excellence. Collaborates with cross-functional teams to ensure secure and efficient service delivery.
Tech Stack
Tools & technologies: Azure, Cloud, Kubernetes, Terraform
About the role
Key responsibilities & impact
- Own service level objectives, error budgets, and reliability targets for the infrastructure underpinning cloud-based platforms, ensuring infrastructure observability (metrics, logs, traces), alert quality, and telemetry completeness across platform components and serving endpoints
- Design, build, and maintain infrastructure-as-code, operational automation, and change control workflows for AI/ML platforms — with a focus on repeatability, consistency, and toil reduction
- Implement and maintain platform security controls — including network segmentation, secrets management, encryption, and data protection safeguards — aligned to compliance requirements and partnering with security teams to respond to emerging risks
- Lead incident response and blameless postmortems; validate backup/restore and disaster recovery processes; conduct game days and resiliency testing to harden platform and infrastructure reliability
- Mentor engineers, influence design reviews, and collaborate across engineering teams to improve platform resiliency, cost efficiency, capacity planning, and operational standards
Requirements
What you’ll need
- Minimum:
  - 5+ years in SRE, platform engineering, or infrastructure roles supporting production cloud environments and mission-critical applications
  - Strong proficiency with observability (metrics, logging, distributed tracing, SLI/SLO frameworks) and production ownership, including incident response, blameless postmortems, and on-call operations
  - Strong proficiency with Infrastructure as Code (Terraform), GitOps practices, and CI/CD for infrastructure and platform changes
  - Working proficiency with cloud platform administration (compute, networking, storage) and with operating managed data or AI/ML platform services in production (e.g., Databricks, Azure ML, or Kubernetes-hosted infrastructure)
  - Working proficiency with platform security: network segmentation, secrets management, encryption at rest and in transit, and key management
  - Strong programming skills for automation, operational tooling, and infrastructure management
  - Strong communication and documentation skills: able to write runbooks, lead postmortems, influence operational standards across teams, and translate technical complexity for diverse audiences
- Preferred:
  - Experience with disaster recovery planning, multi-region patterns, and capacity or cost optimization (FinOps)
  - Working knowledge of container orchestration (Kubernetes), progressive delivery patterns (blue/green, canary), and data lineage tooling
  - Experience in healthcare, life sciences, or other highly regulated industries with data privacy requirements
Benefits
Comp & perks
- Benefits starting from Day 1!
- Retirement Plan Matching
- Flexible Paid Time Off
- Wellness Support Programs and Resources
- Parental & Caregiver Leaves
- Fertility & Adoption Support
- Continuous Development Support Program
- Employee Assistance Program
- Allyship and Inclusion Communities
- Employee Recognition … and more!