Senior Site Reliability Engineer, SRE

Mogul

full-time

Posted on: 10/23/2025

Location Type: Remote

Location: Remote • 🇨🇦 Canada

Job Level

Senior

Tech Stack

Ansible, Apache, AWS, Cloud, Docker, DynamoDB, EC2, ElasticSearch, Java, JavaScript, Jenkins, Kubernetes, Microservices, Node.js, Oracle, Postgres, Python, SQL, Terraform, Zookeeper

About the role

Own and manage our AWS cloud-based technology stack, using native AWS services and top-tier SRE tools to support multiple client environments with Java-based applications and microservices architecture.
Define SRE strategy, vision, and goals aligned to Vitech’s overall objectives. Establish roadmaps and plans for improving system reliability, scalability, and efficiency.
Collaborate with architecture review boards and Solution Architects, and engage in reviews and implementations of viable solutions.
Design, refine, and implement SLIs and SLOs that cover a broad spectrum of SRE concerns: availability, performance, and error budgeting.
Design, deploy, and manage AWS Aurora PostgreSQL clusters for high availability and scalability. Optimize SQL queries, indexes, and database parameters for performance tuning.
Automate database operations using Terraform, Ansible, AWS Lambda, and AWS CLI. Manage Aurora’s read replicas, auto-scaling, and failover mechanisms.
Enhance infrastructure as code (IaC) patterns using technologies like Terraform, CloudFormation, Ansible, Python, and SDKs. Collaborate with DevOps teams to integrate Aurora with CI/CD pipelines.
Provide full-stack support, as per assigned schedule, on applications across technologies such as Oracle WebLogic, AWS Aurora PostgreSQL, Oracle Database, Apache Tomcat, AWS Elastic Beanstalk, Docker/ECS, EC2, S3, etc.
Troubleshoot database incidents, perform root cause analysis, and implement preventive measures. Document database architecture, configurations, and operational procedures.
Ensure high availability, scalability, and performance of PostgreSQL databases on AWS Aurora. Monitor database health, troubleshoot issues, and perform root cause analysis for incidents.
Embrace SRE principles such as chaos engineering, reliability, and toil reduction.

Requirements

Proven hands-on experience as an SRE for critical, client-facing applications, with the ability to dive deep into daily SRE tasks, manage incidents, and oversee operational tools.
4+ years of experience developing and/or administering software in the AWS public cloud, and deep experience hosting applications in AWS (EC2, EBS, ECS/EKS, Elastic Beanstalk, RDS, CloudWatch).
3+ years of experience managing relational databases (Oracle and/or PostgreSQL) in both cloud and on-prem environments, including SRE tasks like backup/restore, performance issues, and replication.
Demonstrable cross-functional, full-stack knowledge of compute, storage, networking, security, and databases.
Strong understanding of AWS networking concepts (VPC, VPN/DX/Endpoints, Route53, CloudFront, Load Balancers, WAF).
Experience with containerized applications (Docker, Kubernetes, ECS). Leverage AWS Aurora features (e.g., read replicas, auto-scaling, multi-region deployments) to enhance database performance and reliability.
Familiarity with data lake architecture, Elasticsearch, ZooKeeper, and DynamoDB is a plus.
Familiarity with tools like pgAdmin, psql, or other database management utilities. Automate routine database maintenance tasks (e.g., vacuuming, reindexing, patching). Knowledge of backup and recovery strategies (e.g., pg_dump, PITR).
Set up and maintain monitoring and alerting systems for database performance and availability (e.g., CloudWatch, Honeycomb, New Relic, Dynatrace, etc.).
Work closely with development teams to optimize database schemas, queries, and application performance. Provide database support during application deployments and migrations.
Hands-on experience with web/application layers (Oracle WebLogic, Apache Tomcat, AWS Elastic Beanstalk, SSL certificates, S3 buckets).
Automation experience with Infrastructure as Code (Terraform, CloudFormation, Python, Jenkins, GitHub/Actions). Knowledge of multi-region Aurora Global Databases for disaster recovery.
Scripting experience in Python, Bash, Java, JavaScript, Node.js.
Oversee and streamline change management procedures, efficiently handling daily production change requests to ensure seamless operations.
Excellent written and verbal communication and critical thinking skills.

Benefits

At Vitech, we believe in empowering our teams to drive innovation through technology. If you thrive in a dynamic environment and are eager to drive innovation in SRE practices, we want to hear from you! You’ll be part of a forward-thinking team that values collaboration, innovation, and continuous improvement. We provide a supportive and inclusive environment where you can grow as a leader while helping shape the future of our organization.
We offer a competitive compensation package along with comprehensive benefits that support your health, well-being, and financial security.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills

AWS, Java, microservices architecture, PostgreSQL, Terraform, Ansible, Python, Docker, SQL, Chaos Engineering

Soft skills

critical thinking, communication, collaboration, incident management, problem-solving, cross-functional knowledge, change management, operational oversight, strategic planning, performance optimization