
Software Development Engineer III – Infrastructure
HighLevel
full-time
Location Type: Remote
Location: India
About the role
- Participate in 24/7 on-call rotations for core infrastructure systems
- Execute incident response during production events, including triage, mitigation, and recovery
- Maintain and improve runbooks, operational procedures, and escalation paths
- Help reduce MTTR and prevent repeat incidents through engineering solutions
- Improve reliability of core infrastructure components, including Kubernetes (GKE) clusters, cloud networking and load balancing, and edge services (Cloudflare)
- Identify systemic reliability issues and drive corrective actions
- Support capacity planning, scaling, and resilience testing
- Execute security remediations across cloud and Kubernetes environments
- Support enforcement of IAM least-privilege access, network security controls, and runtime security policies
- Partner with Platform Security on vulnerability management and remediation
- Support security incident response and post-incident reviews
- Automate repetitive operational and security tasks
- Build tooling to improve incident response speed, operational visibility, and security posture enforcement
- Reduce manual toil through scripts, tooling, and process improvements
- Support safe execution of infrastructure and configuration changes
- Ensure changes follow defined change management and audit requirements
- Contribute to incident reviews, postmortems, and continuous improvement initiatives
- Work closely with Cloud Infrastructure, SRE, Platform, Data, and Security teams
- Contribute to shared documentation and operational standards
- Mentor junior engineers and lead small reliability or security initiatives
Requirements
- 4+ years of experience operating large-scale systems
- Experience leading incident response or reliability initiatives
- Ability to identify systemic issues and propose long-term fixes
- Comfortable mentoring junior engineers and influencing peers
- Experience supporting or operating production systems
- Strong understanding of reliability, security, and operational best practices
- Comfortable working in on-call and incident response environments
- Strong troubleshooting and communication skills
- Experience with GCP or other public cloud platforms (Nice to have)
- Experience with Kubernetes (GKE) in production (Nice to have)
- Familiarity with Cloudflare, networking, or edge security (Nice to have)
- Exposure to security tooling or vulnerability management (Nice to have)
- Scripting or automation experience (Python, Go, Bash, etc.) (Nice to have)
- Experience in compliance- or audit-driven environments (SOC 2, ISO) (Nice to have)
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Kubernetes, GKE, Cloud networking, Load balancing, Cloudflare, Incident response, Scripting, Automation, Security remediations, Capacity planning
Soft skills
Mentoring, Communication, Troubleshooting, Leadership, Collaboration, Problem-solving, Influencing, Continuous improvement, Operational best practices, Incident management