Head of Engineering – Cloud & Platform
RecargaPay
Head of Cloud & Platform at RecargaPay, building a world-class cloud ecosystem. Lead multi-disciplinary teams to ensure reliability and efficiency in a regulated fintech environment.
Tech Stack
Tools & technologies: AWS, Cloud, Distributed Systems, Grafana, Java, Kafka, Kubernetes, Microservices, Node.js, Prometheus, Python, Spring, Spring Boot, Terraform, Vault
About the role
Key responsibilities & impact
- Define and execute the Cloud and Platform strategy, ensuring alignment with corporate objectives, regulatory frameworks, and cost-efficiency goals.
- Lead a multi-disciplinary organization covering Cloud Infrastructure, SRE, Platform Engineering, and DevSecOps, fostering collaboration and shared accountability for uptime, security, and performance.
- Drive modernization of infrastructure and delivery pipelines, enabling a unified, automated, and compliant cloud environment.
- Partner with executive leadership to define scalable operating models, balancing autonomy for product squads with standardized guardrails and golden paths.
- Establish a long-term architectural vision for cloud services, platform frameworks, and developer enablement tools.
- Sponsor AI-assisted engineering adoption to enhance developer productivity, reduce toil, and accelerate delivery (e.g., Copilot, Cursor, LLM-based agents).
- Serve as the ultimate technical and strategic authority for AWS, Kubernetes, IaC, Observability, and Reliability practices across the organization.
- Oversee the design, scalability, and governance of the AWS multi-account organization, enforcing security, compliance, and cost policies (Control Tower, SCPs, Service Catalog).
- Lead the definition and implementation of multi-region, multi-environment architectures ensuring reliability, latency optimization, and disaster recovery readiness (RPO/RTO).
- Institutionalize well-architected principles (Security, Reliability, Performance, Cost, Sustainability) and drive continuous improvement programs based on regular audits.
- Evolve network and connectivity architectures (VPC, Transit Gateway, PrivateLink, Global Accelerator) to meet scaling, compliance, and availability goals.
- Own identity, access, and secrets management lifecycle (IAM least privilege, mTLS, KMS/HSM key rotation, Vault integration).
- Oversee monitoring and observability frameworks, implementing standards and unified dashboards across all services.
- Ensure SLO-driven operations, with well-defined SLIs, error budgets, and automated incident management loops.
- Lead resilience and reliability engineering practices, including chaos engineering, failover drills, dependency fallback design, and proactive degradation handling.
- Build and scale the company’s Internal Developer Platform (IDP), empowering teams with self-service capabilities for environment provisioning, deployments, and observability.
- Define golden paths, opinionated tooling, and reusable infrastructure modules, enabling consistent, secure, and fast software delivery across squads.
- Ensure trunk-based development, progressive delivery (canary, blue/green), automated rollback, and health/SLO-gated deployments are embedded into CI/CD flows.
- Drive GitOps adoption to achieve deterministic deployments, auditability, and drift detection.
- Expand event-driven and streaming platforms (e.g., Kafka), defining keying, partitioning, and schema evolution strategies to support scalability and data integrity.
- Partner with Security and Compliance to embed DevSecOps and Policy-as-Code practices into CI/CD and Kubernetes admission controllers.
- Establish and lead a FinOps program, optimizing compute, storage, and data transfer costs while ensuring transparency through chargeback/showback models.
- Define cost-to-serve models per service and implement automated guardrails for budgeting and right-sizing.
- Integrate cost and performance telemetry into platform dashboards to drive data-informed decision-making.
- Partner with Finance to align cloud spend forecasts and track savings initiatives tied to architecture decisions.
- Lead and mentor senior engineering managers and principal engineers, building high-performance, high-accountability teams.
- Promote a culture of reliability, automation, and continuous improvement through transparent metrics and post-incident learning loops.
- Establish governance rhythms such as architecture councils, platform guilds, and reliability reviews to align technical direction and eliminate systemic friction.
- Collaborate closely with Risk, Compliance, and Security to uphold standards like PCI-DSS, SOC2, ISO27001, LGPD, and GDPR within cloud and platform operations.
Requirements
What you’ll need
- Academic background in Computer Science, Engineering, or Software Development.
- Deep expertise in AWS cloud architecture, including multi-account management, VPC design, EKS, ECS, Lambda, and networking topologies.
- Proven experience with Infrastructure as Code (Terraform, Pulumi) and GitOps automation at scale.
- Strong understanding of Kubernetes internals, workload orchestration, and cost/performance optimization.
- Experience implementing SRE and reliability frameworks: SLOs, error budgets, chaos testing, and automated incident remediation.
- Mastery of observability and monitoring (CloudWatch, Grafana, Datadog, New Relic) with trace/metric/log correlation.
- Proficiency in security and compliance engineering: IAM, KMS, encryption, secrets lifecycle, policy enforcement (OPA/Rego), and regulatory controls (PCI, LGPD, GDPR).
- Experience defining and governing API and event-driven architectures (OpenAPI/AsyncAPI, Kafka schema registries).
- Deep knowledge of progressive delivery, service mesh (e.g., Istio), and DevSecOps pipelines.
- Strong FinOps acumen: right-sizing, egress optimization, reserved instance and savings plan strategy, and service-level cost attribution.
- Experience integrating AI-assisted workflows (GitHub Copilot Enterprise, LLM-based linters, and similar tools) into development and CI pipelines, with measurable productivity impact.
- Extensive hands-on experience in software engineering roles, with solid proficiency in Java (Spring Boot) and working knowledge of Python and asynchronous programming.
- Strong foundation in Object-Oriented Programming and relational database systems.
- Solid understanding of web and mobile application architectures, including security, session management, and development best practices.
- Expertise in Domain-Driven Design and microservices architecture, with proven ability to design high-performance, scalable, and reliable distributed systems.
- Demonstrated experience defining and executing architectural roadmaps aligned with business and developer-experience goals.
- Deep knowledge of networking in AWS.
- Advanced experience architecting VPC topologies, including Transit Gateway, private/public subnet design, NAT Gateway cost optimization, and egress control for regulated environments.
- Hands-on experience implementing observability pipelines at scale, integrating New Relic, CloudWatch, Prometheus, Grafana, and Datadog.
- Familiarity with EKS internals: node group management, autoscaling, and Kubernetes cost/latency optimization.
- Proven experience managing multi-region and multi-environment deployments.
- Expertise in AWS security hardening and compliance controls, including IAM least-privilege modeling, KMS envelope encryption, CloudTrail auditing, GuardDuty detections, and automatic remediation with Lambda/Step Functions.
- Deep understanding of container security, image signing, ECR scanning, and OPA/Rego policy design for admission controllers.
- Advanced experience with Infrastructure as Code using Terraform (modules, workspaces, policy enforcement) and Pulumi (multi-language stacks, secrets providers, CI integration).
- Proven ability to implement GitOps workflows, ensuring deterministic deployments and drift detection.
- Strong policy-as-code practice to codify security/SRE guardrails across CI/CD and Kubernetes admission controllers.
- Expertise automating application stack provisioning (app resources, service accounts, IAM bindings, egress controls) through reusable IaC modules and pipelines.
- Deep understanding of progressive delivery (canary, blue/green, shadow traffic, automated rollback) and service mesh (Istio/Linkerd/App Mesh) for safe deployment strategies.
- Mastery of resilience and reliability patterns: timeouts, bounded retries with jitter, circuit breakers, bulkheads, back-pressure, outbox/saga orchestration, and graceful degradation.
- Deep knowledge of event-driven and streaming architectures (Kafka and others), including partitioning strategies, compaction/retention policies, rebalancing, ordering guarantees, exactly-once semantics, and schema evolution via registries.
- Strong background in data performance engineering: caching (read-through/write-behind), connection pool tuning, pagination/cursoring, latency budgeting, and throughput modeling.
- Experience with SLO-driven reliability: defining SLIs, error budgets, and reducing alert fatigue via multi-signal correlation.
- Proficiency with production monitoring tools (New Relic, Grafana, Datadog, CloudWatch) and advanced observability instrumentation.
- Proven experience building self-service developer platforms (Backstage, Internal Developer Portals) that expose golden paths for application scaffolding, environment provisioning, and secure deployments.
- Experience implementing event-driven DevEx tooling (e.g., ephemeral environments, automated CI insights, preview deployments).
- Strong knowledge of API lifecycle management and governance (OpenAPI/AsyncAPI, contract testing, versioning, idempotency, error modeling).
- Expertise in CI/CD automation and DevSecOps (GitHub Actions, CodeBuild/CodePipeline, artifact provenance, environment promotion, changelog automation).
- Practical compliance-by-design experience translating PCI-DSS, KYC/AML, GDPR, and LGPD controls into technical patterns (tokenization, segmentation, audit trails, retention/erasure).
- Experience leading AWS Well-Architected Framework reviews across all pillars (Security, Reliability, Performance, Cost, Operational Excellence, Sustainability).
- Experience designing cost-aware architectures, balancing performance, resilience, and financial efficiency.
- Exposure to edge computing and CDN optimization (Lambda@Edge, CloudFront Functions, custom caching policies).
- Fluent in English and Spanish (Portuguese a plus).
Benefits
Comp & perks
- Competitive and market-aligned salary.
- Annual Award Program.
- Remote work — wherever you are, you’re part of the team!
- Home office allowance through a monthly deposit in the RecargaPay app.
- Health and dental plans with no co-pay.
- Life insurance.
- Flexible meal allowance (via Flash).
- TotalPass membership to take care of your health.
- Language classes.
ATS (Applicant Tracking System) Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
AWS, Kubernetes, Infrastructure as Code, Terraform, Pulumi, DevSecOps, Observability, Java, Python, Event-driven architecture
Soft Skills
leadership, collaboration, strategic authority, mentoring, communication, continuous improvement, organizational skills, problem-solving, accountability, adaptability