Principal Cloud Platform Engineer

RCH Solutions

RCH Solutions is seeking a Principal Cloud Platform Engineer with expertise in Kubernetes-based infrastructure. Join the Cloud Engineering team to design and operate scalable AI platforms for the life sciences.

Posted 5/12/2026 • Full-time • Remote • New York • 🇺🇸 United States • Lead

Tech Stack

Tools & technologies

BigQuery, Cloud, Distributed Systems, Flux, Google Cloud Platform, Grafana, Kubernetes, Node.js, Prometheus, Terraform

About the role

Key responsibilities & impact

Design, operate, and continuously improve production-grade K8s clusters at the platform level.
Lead complex cluster lifecycle management, including version upgrades and dependency coordination, failure recovery and incident resolution, and non-trivial maintenance and system evolution.
Build and maintain highly reliable, scalable, multi-tenant infrastructure.
Build and maintain end-to-end observability for LLM-based systems using Grafana, LangFuse, and LangSmith — covering performance, latency, token usage, and alerting.
Architect and operate shared infrastructure across multiple teams and use cases.
Implement and enforce RBAC and access control models, tenant isolation and security boundaries, and resource management and fairness at scale.
Ensure platform stability under diverse and competing workloads.
Operate and optimize vector database systems (Weaviate preferred) in production environments.
Support and scale Retrieval-Augmented Generation (RAG) systems.
Drive improvements in query performance and latency, cluster tuning and resource efficiency, and the operational stability of retrieval pipelines.
Take technical ownership of production systems over time.
Build and maintain strong practices in observability (metrics, logs, tracing), incident response and root cause analysis, and long-term system health and resilience.
Proactively identify and resolve reliability risks.
Work closely with backend and GenAI engineers to ensure seamless integration with the platform.

Requirements

What you’ll need

5+ years of hands-on experience in high-scale platform engineering (internal platforms, PaaS, or shared infrastructure)
Deep Kubernetes platform expertise
Hands-on experience with GKE: cluster upgrades, node pool management, autoscaling
Managing failures, disruptions, and complex maintenance scenarios
RBAC, namespaces, network policies
GCP IAM, Workload Identity, Secret Manager
GCP Storage: BigQuery, GCS, Firestore
Terraform and IaC (Infrastructure as Code) experience with GitOps workflows (Argo CD, Flux, or equivalent)
Strong observability practices using Google Cloud Operations Suite (formerly Stackdriver) and Prometheus / Grafana
Hands-on experience operating vector databases in production, ideally Weaviate: query performance tuning, cluster stability, and scaling behavior
Solid understanding of distributed systems design and failure modes
Multi-zone / regional architectures
Google Cloud Load Balancing

Benefits

Comp & perks

A competitive salary and bonus package based on experience
Comprehensive health and wellness benefits, including Medical, Dental, and Vision Insurance
Company-provided Life and Long-Term Disability Insurance
Company-sponsored 401(k) Plan
Company-provided continuing education benefit
Team-focused culture and unlimited opportunity for advancement

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Kubernetes, GKE, RBAC, Terraform, IaC, Google Cloud Operations Suite, Prometheus, Grafana, Weaviate, distributed systems design

Soft Skills

leadership, incident resolution, problem-solving, technical ownership, collaboration