Tech Stack
Ansible, AWS, Azure, Cassandra, Cloud, Distributed Systems, DNS, Docker, Elasticsearch, Firewalls, Go, Grafana, Jenkins, Kafka, Kubernetes, Linux, NoSQL, Perl, Prometheus, Python, Ruby, SaltStack, Splunk, SQL, TCP/IP, Terraform, Unix, VMware
About the role
- Design and implement a cloud platform to support OXIO backend services
- Automate technical operations: deployments, scaling, recovery, etc.
- Monitor and maintain mission-critical production infrastructure to ensure maximum uptime
- Participate in an on-call rotation and contribute to a culture of continuous improvement through blameless postmortems
- Enable the Engineering/Telecom/Data Engineering teams by providing them with the tools to operate the services they build
Requirements
- Understanding of Linux/Unix systems (most systems are Linux-based)
- Familiarity with Linux/Unix system internals like process management, filesystems, memory management, and networking
- Proficiency in at least one programming language (Python, Go, or Ruby)
- Strong skills in scripting (Bash, Perl)
- Experience with infrastructure provisioning tools such as Terraform, CloudFormation, or Ansible
- Familiarity with containerization (Docker) and orchestration tools (Kubernetes)
- Familiarity with monitoring tools like Prometheus, Grafana, or Datadog
- Knowledge of setting up alerts, analyzing logs, and creating dashboards for observability
- Familiarity with incident management practices (e.g., runbooks, postmortems)
- Experience participating in an on-call rotation and handling incidents
- Experience setting up and maintaining Continuous Integration/Continuous Delivery pipelines (Jenkins, GitLab CI, CircleCI, etc.)
- Hands-on experience with cloud providers (AWS, Google Cloud, Azure)
- Knowledge of virtualization technologies (VMware, KVM) and cloud-native architecture
- Understanding of TCP/IP, DNS, HTTP/HTTPS, load balancing, and firewalls
- Strong understanding of deployment strategies (canary releases, blue-green deployments, etc.) [nice to have]
- Familiarity with high availability and failover mechanisms [nice to have]
- Familiarity with IAM (Identity and Access Management) and zero trust principles [nice to have]
- Experience working with distributed systems (e.g., Kafka, Cassandra, Elasticsearch) [nice to have]
- Experience building custom monitoring tools or writing complex automation scripts [nice to have]
- Functional knowledge of database management (SQL and NoSQL) [nice to have]
- Familiarity with distributed tracing (Jaeger, OpenTelemetry) and advanced log aggregation strategies (ELK stack, Splunk) [nice to have]
- Familiarity with performance profiling tools and optimizing application performance under heavy load [nice to have]
- Familiarity with load testing and identifying bottlenecks [nice to have]
- Familiarity with configuration management using SaltStack to maintain server configurations [nice to have]