
Senior Site Reliability Engineer, SRE
Mogul
full-time
Location Type: Remote
Location: Remote • 🇨🇦 Canada
Job Level
Senior
Tech Stack
Ansible, Apache, AWS, Cloud, Docker, DynamoDB, EC2, ElasticSearch, Java, JavaScript, Jenkins, Kubernetes, Microservices, Node.js, Oracle, Postgres, Python, SQL, Terraform, Zookeeper
About the role
- Own and manage our AWS cloud-based technology stack, using native AWS services and top-tier SRE tools to support multiple client environments with Java-based applications and microservices architecture.
- Define SRE strategy, vision, and goals aligned to Vitech’s overall objectives. Establish roadmaps and plans for improving system reliability, scalability, and efficiency.
- Collaborate with architecture review boards and Solution Architects, and engage in reviews and implementations of viable solutions.
- Design, refine, and implement SLIs and SLOs that cover the broad spectrum of SRE: availability, performance, and error budgeting.
- Design, deploy, and manage AWS Aurora PostgreSQL clusters for high availability and scalability. Optimize SQL queries, indexes, and database parameters for performance tuning.
- Automate database operations using Terraform, Ansible, AWS Lambda, and AWS CLI. Manage Aurora’s read replicas, auto-scaling, and failover mechanisms.
- Enhance infrastructure as code (IaC) patterns using technologies like Terraform, CloudFormation, Ansible, Python, and SDKs. Collaborate with DevOps teams to integrate Aurora with CI/CD pipelines.
- Provide full-stack support, as per the assigned schedule, for applications across technologies such as Oracle WebLogic, AWS Aurora PostgreSQL, Oracle Database, Apache Tomcat, AWS Elastic Beanstalk, Docker/ECS, EC2, S3, etc.
- Troubleshoot database incidents, perform root cause analysis, and implement preventive measures. Document database architecture, configurations, and operational procedures.
- Ensure high availability, scalability, and performance of PostgreSQL databases on AWS Aurora. Monitor database health, troubleshoot issues, and perform root cause analysis for incidents.
- Embrace SRE principles such as chaos engineering, reliability, and reducing toil.
Requirements
- Proven hands-on experience as an SRE for critical, client-facing applications, with the ability to dive deep into daily SRE tasks, manage incidents, and oversee operational tools.
- 4+ years of experience developing and/or administering software in the AWS public cloud, and deep experience hosting applications in AWS (EC2, EBS, ECS/EKS, Elastic Beanstalk, RDS, CloudWatch).
- 3+ years of experience managing relational databases (Oracle and/or PostgreSQL) in both cloud and on-prem environments, including SRE tasks like backup/restore, performance troubleshooting, and replication.
- Demonstrable cross-functional, full-stack knowledge of compute, storage, networking, security, and databases.
- Strong understanding of AWS networking concepts (VPC, VPN/DX/Endpoints, Route53, CloudFront, Load Balancers, WAF).
- Experience with containerized applications (Docker, Kubernetes, ECS). Leverage AWS Aurora features (e.g., read replicas, auto-scaling, multi-region deployments) to enhance database performance and reliability.
- Familiarity with data lake architecture, Elasticsearch, ZooKeeper, and DynamoDB is a plus.
- Familiarity with tools like pgAdmin, psql, or other database management utilities. Automate routine database maintenance tasks (e.g., vacuuming, reindexing, patching). Knowledge of backup and recovery strategies (e.g., pg_dump, PITR).
- Set up and maintain monitoring and alerting systems for database performance and availability (e.g., CloudWatch, Honeycomb, New Relic, Dynatrace).
- Work closely with development teams to optimize database schemas, queries, and application performance. Provide database support during application deployments and migrations.
- Hands-on experience with web/application layers (Oracle WebLogic, Apache Tomcat, AWS Elastic Beanstalk, SSL certificates, S3 buckets).
- Automation experience with Infrastructure as Code (Terraform, CloudFormation, Python, Jenkins, GitHub/Actions). Knowledge of multi-region Aurora Global Databases for disaster recovery.
- Scripting experience in Python, Bash, Java, JavaScript, Node.js.
- Oversee and streamline change management procedures, efficiently handling daily production change requests to ensure seamless operations.
- Excellent written and verbal communication and critical thinking.
Benefits
- At Vitech, we believe in empowering our teams to drive innovation through technology. **If you thrive in a dynamic environment and are eager to drive innovation in SRE practices, we want to hear from you!** You’ll be part of a forward-thinking team that values collaboration, innovation, and continuous improvement. We provide a supportive and inclusive environment where you can grow as a leader while helping shape the future of our organization.
- We offer a competitive compensation package along with comprehensive benefits that support your health, well-being, and financial security.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
AWS, Java, microservices architecture, PostgreSQL, Terraform, Ansible, Python, Docker, SQL, Chaos Engineering
Soft skills
critical thinking, communication, collaboration, incident management, problem-solving, cross-functional knowledge, change management, operational oversight, strategic planning, performance optimization