Principal Cloud Platform Engineer
RCH Solutions
Principal Cloud Platform Engineer with expertise in Kubernetes-based infrastructure for RCH Solutions. Join the Cloud Engineering team to design and operate scalable AI platforms in life sciences.
Tech Stack
Tools & technologies
BigQuery, Cloud, Distributed Systems, Flux, Google Cloud Platform, Grafana, Kubernetes, Node.js, Prometheus, Terraform
About the role
Key responsibilities & impact
- Design, operate, and continuously improve production-grade K8s clusters at the platform level.
- Lead complex cluster lifecycle management, including:
  - Version upgrades and dependency coordination
  - Failure recovery and incident resolution
  - Non-trivial maintenance and system evolution
- Build and maintain highly reliable, scalable, multi-tenant infrastructure.
- Build and maintain end-to-end observability for LLM-based systems using Grafana, LangFuse, and LangSmith — covering performance, latency, token usage, and alerting.
- Architect and operate shared infrastructure across multiple teams and use cases.
- Implement and enforce RBAC and access control models, tenant isolation and security boundaries, and resource management and fairness at scale.
- Ensure platform stability under diverse and competing workloads.
- Operate and optimize vector database systems (Weaviate preferred) in production environments.
- Support and scale Retrieval-Augmented Generation (RAG) systems.
- Drive improvements in query performance and latency, cluster tuning and resource efficiency, and the operational stability of retrieval pipelines.
- Take technical ownership of production systems over time.
- Build and maintain strong practices in observability (metrics, logs, tracing), incident response and root cause analysis, and long-term system health and resilience.
- Proactively identify and resolve reliability risks.
- Work closely with backend and GenAI engineers to ensure seamless integration with the platform.
Requirements
What you’ll need
- 5+ years of hands-on experience in high-scale platform engineering (internal platforms, PaaS, or shared infrastructure)
- Deep Kubernetes Platform Expertise
- Hands-on experience with GKE: cluster upgrades, node pool management, and autoscaling
- Managing failures, disruptions, and complex maintenance scenarios
- RBAC, namespaces, network policies
- GCP IAM, Workload Identity, Secret Manager
- GCP Storage: BigQuery, GCS, Firestore
- Terraform and IaC experience with GitOps workflows (ArgoCD, Flux, or equivalent)
- Strong observability practices using Google Cloud Operations Suite (Stackdriver) and Prometheus / Grafana
- Hands-on experience operating vector databases in production, ideally Weaviate: query performance tuning, cluster stability, and scaling behavior
- Solid understanding of distributed systems design and failure modes
- Multi-zone / regional architectures
- Google Cloud Load Balancing
Benefits
Comp & perks
- A competitive salary and bonus package based on experience
- Comprehensive health and wellness benefits, including Medical, Dental, and Vision Insurance
- Company-provided Life and Long-Term Disability Insurance
- Company-sponsored 401(k) Plan
- Company-provided continuing education benefit
- Team-focused culture and unlimited opportunity for advancement
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Kubernetes, GKE, RBAC, Terraform, IaC, Google Cloud Operations Suite, Prometheus, Grafana, Weaviate, distributed systems design
Soft Skills
leadership, incident resolution, problem-solving, technical ownership, collaboration