Tech Stack
Ansible, AWS, Azure, Cloud, Distributed Systems, Firewalls, Go, Google Cloud Platform, Grafana, Kubernetes, Linux, Microservices, Open Source, Postgres, Prometheus, Python, Redis, SDLC, Terraform, Vault
About the role
- Infrastructure Management & Operations:
  - Implement and manage container orchestration platforms (Kubernetes, Nomad, Talos) for the Inco testnet and mainnet deployments
  - Ensure infrastructure scalability, reliability, and security through best practices
  - Manage hybrid cloud and on-premises infrastructure with a focus on performance optimization
  - Deploy and maintain blockchain nodes (Ethereum, Solana, etc.), ensuring high availability and performance
- Monitoring & Observability:
  - Build comprehensive monitoring and alerting systems using Prometheus, Grafana, and Loki
  - Design and implement distributed tracing and logging infrastructure
  - Create custom dashboards and metrics for network performance monitoring
  - Establish SLIs/SLOs and implement proactive alerting strategies
- Automation & Infrastructure as Code:
  - Develop and maintain infrastructure as code using Terraform and Ansible
  - Implement GitOps workflows using ArgoCD or similar tools for continuous deployment
  - Build secure CI/CD pipelines with automated security scanning and compliance checks
- Security & Compliance:
  - Implement defense-in-depth security strategies, including network segmentation, secrets management (Vault), and vulnerability scanning
  - Manage security baselines, conduct regular audits, and coordinate penetration testing
  - Prepare and maintain infrastructure for compliance audits (SOC2, ISO 27001)
  - Configure VPNs, firewalls, and IDS/IPS, and implement zero-trust architecture
- System Administration & Platform Engineering:
  - Configure and optimize distributed systems for high-performance computing workloads
  - Perform network engineering, including routing, load balancing, firewall configuration, and network segmentation
  - Manage distributed systems, bare metal servers, and virtualization platforms
- Documentation & Collaboration:
  - Create comprehensive documentation for infrastructure, security procedures, and incident response playbooks
  - Collaborate with protocol engineering and development teams on secure SDLC practices
  - Contribute to security awareness and training programs
Requirements
- 5+ years of DevOps/SRE experience managing production infrastructure
- Cloud expertise: Hands-on experience with at least one major cloud provider (AWS, GCP, or Azure)
- Container orchestration: Production experience with Kubernetes, Nomad, or Talos
- IaC proficiency: Hands-on experience with at least one of Terraform or Ansible, plus GitOps (ArgoCD) implementation
- Monitoring stack: Production experience with Prometheus, Grafana, and Loki
- Linux administration: Advanced skills in system administration and security hardening
- Security implementation: Experience with vulnerability scanning, secrets management, and security automation in CI/CD
- Networking: Strong understanding of VPNs, firewalls, load balancers, and network security
- Scripting: Proficiency in Python, Go, or Bash for automation
- Compliance: Experience with security audits and implementing compliance controls