Parallel Domain

Principal Site Reliability Engineer

full-time

Location Type: Remote

Location: Canada

About the role

  • Own and evolve our AWS-based infrastructure, improving platform performance and availability today, and building toward deployable configurations that support enterprise customer environments tomorrow.
  • Own EKS cluster operations across production regions: node pool strategy, AMI lifecycle, autoscaling, and Kubernetes workload health.
  • Support the GitOps deployment pipeline - define, deploy, and manage applications across clusters using infrastructure-as-code.
  • Manage complex networking: VPC design, cross-region connectivity, DNS, and load balancing.
  • Lead infrastructure deprecation and migration efforts with minimal disruption.
  • Own SLO measurement infrastructure; enable proactive triage of emerging issues before they impact customers.
  • Lead incident investigation, root cause analysis, and postmortems, driving systemic fixes rather than one-off patches.
  • Design and improve automated remediation systems to reduce MTTR.
  • Review and provide security-conscious feedback on platform architecture decisions.
  • Own cloud IAM governance - roles, policies, and access boundaries across accounts and services.
  • Lead compliance-adjacent work including audit-readiness, partner certification requirements, and supporting responses to customer security questionnaires.
  • Partner with application development teams to build an inherently secure platform and drive next-generation deployment architecture.
  • Partner with customer teams to ensure availability for expected utilization.
  • Partner with Finance on cloud cost optimization - lifecycle policies, right-sizing, and spend visibility.
  • Support GPU and batch workloads in collaboration with simulation and ML engineering teams.
  • Improve CI/CD pipelines and automated infrastructure validation.
  • Support engineering teams with infra-side debugging, log analysis, and environment configuration.

Requirements

  • 5+ years in SRE, DevOps, or infrastructure engineering roles.
  • Infrastructure-as-code proficiency - Terraform modules, state management, and multi-environment patterns.
  • Deep AWS experience - EKS, EC2, IAM, S3, Storage Gateway, VPC networking, Transit Gateway, CloudFront, KMS, and IRSA.
  • Kubernetes expertise - cluster operations, node pools, probes, cordoning, pod scheduling, RBAC, Helm, node autoscaling (Karpenter experience a plus); solid understanding of containerization and AMI lifecycle management.
  • CI/CD - experience with GitOps workflows and pipeline tooling (ArgoCD, GitHub Actions, Jenkins).
  • Solid networking fundamentals - CIDR design, security groups, DNS, load balancing, VPN, cross-region connectivity.
  • Experience with monitoring and observability tooling - Prometheus, Grafana, Elasticsearch.
  • Comfort with Python and Bash for tooling and automation.
  • Experience working across Linux and Windows environments. Operational familiarity with Windows Server is a meaningful advantage.
  • You communicate clearly across engineering, product, and customer-facing teams, flagging issues with urgency proportional to customer impact.
  • You advocate for SRE best practices and can put an informed, principled view of security into practice.
  • You take end-to-end ownership of complex, multi-team efforts - from planning through execution and post-change verification.
  • You know when to push for a clean solution vs. when to accept a pragmatic one, and you communicate that tradeoff clearly.

Benefits

  • Health insurance
  • Retirement plans
  • Paid time off
  • Flexible work arrangements
  • Professional development opportunities

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
AWS, EKS, EC2, IAM, S3, Terraform, Kubernetes, CI/CD, Python, Bash
Soft Skills
communication, advocacy for SRE best practices, ownership, urgency, collaboration, problem-solving, leadership, incident investigation, root cause analysis, feedback