Tech Stack
Ansible, Chef, Cloud, Google Cloud Platform, Grafana, Hadoop, Jenkins, Kubernetes, Linux, MySQL, Postgres, Prometheus, Puppet, Python, Spark, Splunk, SQL, Terraform
About the role
- Deploy and manage cloud resources (VMs, MySQL and Postgres databases, networking) aligned with business requirements
- Monitor cloud systems for performance, availability, and security using Prometheus, Grafana, and Splunk
- Manage and maintain services with zero downtime; automate monitoring of GKE clusters, Dataproc clusters, and Cloud SQL
- Implement Infrastructure as Code (IaC) using Terraform or Kubernetes YAML
- Respond to and resolve technical issues including connectivity, performance, and security incidents
- Conduct root cause analysis, lead remediation efforts, and maintain incident reports and documentation
- Collaborate with cross-functional teams to resolve complex issues
- Implement access controls and identity management; ensure compliance with legal, infosec, and privacy standards
- Conduct audits, manage vulnerabilities, and coordinate remediation with teams and vendors
- Automate infrastructure provisioning and operational processes; support CI/CD pipelines
- Design and maintain monitoring dashboards, set up alerting, analyze logs and metrics, and generate system health reports
- Maintain comprehensive documentation, build/update knowledge base, and conduct workshops/webinars to share best practices
Requirements
- 4+ years of experience in cloud environments (GCP DevOps mandatory)
- Good Linux system administration experience
- Hands-on experience with Terraform, Jenkins, Ansible
- Proficiency in monitoring and debugging tools (Splunk, Prometheus, Grafana, kubectl)
- Strong scripting skills (e.g., Bash, Python)
- Experience with DevOps tools: Jenkins, CI/CD pipelines, Git, Kubernetes, Helm
- Exposure to Big Data technologies: Hadoop ecosystem, Hive, Spark, Jupyter Notebook
- Solid understanding of networking, virtualization, and DevOps practices
- Familiarity with ITIL framework and patch management
- Experience with Confluence, Jira ticketing, and alerting mechanisms
- Participation in a rotational on-call schedule is required
- Nice-to-Have: Database query proficiency for troubleshooting
- Nice-to-Have: Configuration management tools (Chef, Puppet, Ansible)