Tech Stack
AWS, Azure, Cloud, Docker, Go, Google Cloud Platform, Grafana, Java, Kubernetes, Prometheus, Python, Scala, Terraform
About the role
- Lead the design, implementation, and maintenance of WRITER, Inc.’s cloud infrastructure to ensure high availability and performance
- Design and implement scalable cloud automation to support seamless deployment for our largest enterprise customers
- Automate infrastructure provisioning and management using Terraform and Python
- Collaborate with development teams to optimize cloud resources and enhance system reliability
- Develop and maintain monitoring and alerting systems to proactively identify and resolve issues affecting the reliability of our writing solutions
- Conduct post-mortem analyses of system failures to identify root causes and implement preventive measures
- Optimize and scale our cloud infrastructure to support growing user demand and ensure cost efficiency
- Ensure the security and compliance of our systems, adhering to industry standards and regulations
- Provide mentorship and technical guidance to junior engineers, fostering a culture of reliability and continuous improvement
- Stay current with emerging technologies and industry trends to continuously improve our site reliability practices
Requirements
- Proven expertise in Site Reliability Engineering with a minimum of 7 years of hands-on experience
- Deep understanding of system architecture and infrastructure design to ensure high availability and performance
- Bachelor’s degree in Computer Science, Engineering, or a related technical field
- Strong proficiency in programming languages such as Python, Java, or Go for automation and monitoring
- Experience with cloud platforms like AWS, Azure, or GCP, and their respective services for scalable and resilient systems
- Expertise in containerization and orchestration technologies (e.g., Docker, Kubernetes)
- Knowledge of monitoring and logging tools (e.g., Prometheus, Grafana, ELK Stack) to maintain system health and performance
- Ability to lead and mentor junior engineers in best practices for reliability and system optimization
- Excellent communication skills to collaborate effectively with cross-functional teams and stakeholders
- Proactive approach to identifying and mitigating potential system failures and performance bottlenecks
- Software engineering expertise (preferred)
- Terraform (preferred)
- Python (preferred)
- Kubernetes (preferred)
- Scala (preferred)
- AWS/GCP (preferred)
- Applicants must answer work authorization questions on the application form: whether they are legally authorized to work in the country where the job is located, and whether they will require visa sponsorship
- Applicants must confirm they are at least 18 years of age and able to attend in-person collaborative sessions in the office 2-3 days per week