Salary
💰 $130,000 - $225,000 per year
Tech Stack
Airflow, AWS, Cloud, DynamoDB, IoT, Postgres, Python, Redis, Terraform
About the role
- Design, provision, and maintain AWS infrastructure using IaC tools such as AWS CDK or Terraform.
- Build CI/CD and testing for apps, infra, and ML pipelines using GitHub Actions, CodeBuild, and CodePipeline.
- Operate secure networking with VPCs, PrivateLink, and VPC endpoints. Manage IAM, KMS, Secrets Manager, and audit logging.
- Stand up and operate model endpoints using AWS Bedrock and/or SageMaker; evaluate when to use ECS/EKS, Lambda, or Batch for inference jobs.
- Build and maintain application services that call LLMs through clean APIs, with streaming, batching, and backoff strategies (a minimal backoff sketch follows this list).
- Implement prompt and tool execution flows with LangChain or similar, including agent tools and function calling.
- Design chunking and embedding pipelines for documents, time series, and multimedia. Orchestrate with Step Functions or Airflow.
- Operate vector search using OpenSearch Serverless, Aurora PostgreSQL with pgvector, or Pinecone (an illustrative pgvector query follows this list). Tune recall, latency, and cost.
- Build and maintain knowledge bases and data syncs from S3, Aurora, DynamoDB, and external sources.
- Create offline and online eval harnesses for prompts, retrievers, and chains. Track quality, latency, and regression risk.
- Instrument model and app telemetry with CloudWatch and OpenTelemetry. Build token usage and cost dashboards with budgets and alerts.
- Add guardrails, rate limits, fallbacks, and provider routing for resilience.
- Implement PII detection and redaction, access controls, content filters, and human-in-the-loop review where needed.
- Use Bedrock Guardrails or policy services to enforce safety standards. Maintain audit trails for regulated environments.
- Build ingestion and processing pipelines for structured, unstructured, and multimedia data. Ensure integrity, lineage, and cataloging with Glue and Lake Formation.
- Optimize bulk data movement and storage across S3, Glacier, and tiered storage classes. Use Athena for ad-hoc analysis.
- Manage infrastructure that deploys to and communicates with edge devices. Support secure messaging, identity, and over-the-air updates.
- Partner with product and application teams to integrate retrieval services, embeddings, and LLM chains into user-facing features.
- Provide expert troubleshooting for cloud and ML services with an emphasis on uptime and performance.
- Tune retrieval quality, context window use, and caching with Redis or Bedrock Knowledge Bases.
- Optimize inference with model selection, quantization where applicable, GPU/CPU instance choices, and autoscaling strategies.
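Several of the responsibilities above center on calling hosted models reliably. The following is a minimal sketch of an LLM call with exponential backoff against Bedrock via boto3; the model ID, prompt format, retry limits, and region are illustrative assumptions, not details from this posting.

```python
# Minimal sketch of an LLM call with exponential backoff, assuming boto3 and
# access to an Anthropic model on Bedrock; the model ID, prompt format, and
# retry limits are illustrative placeholders.
import json
import random
import time

import boto3
from botocore.exceptions import ClientError

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def invoke_with_backoff(prompt: str, max_retries: int = 5) -> str:
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{"role": "user", "content": prompt}],
    })
    for attempt in range(max_retries):
        try:
            resp = bedrock.invoke_model(
                modelId="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder model ID
                body=body,
            )
            payload = json.loads(resp["body"].read())
            return payload["content"][0]["text"]
        except ClientError as err:
            code = err.response["Error"]["Code"]
            if code not in ("ThrottlingException", "ModelTimeoutException"):
                raise
            # Exponential backoff with jitter before retrying a throttled call.
            time.sleep(min(2 ** attempt + random.random(), 30))
    raise RuntimeError("LLM call failed after retries")
```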
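For the vector search responsibility, here is an illustrative similarity query against Aurora PostgreSQL with pgvector; the doc_chunks table, column names, and connection string are hypothetical.

```python
# Illustrative pgvector similarity lookup against Aurora PostgreSQL; the
# doc_chunks table, column names, and DSN are hypothetical examples.
import psycopg2

def top_k_chunks(query_embedding: list[float], k: int = 5):
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with psycopg2.connect("dbname=rag user=app host=aurora.example.internal") as conn:
        with conn.cursor() as cur:
            # Cosine-distance search over precomputed chunk embeddings.
            cur.execute(
                """
                SELECT id, content, embedding <=> %s::vector AS distance
                FROM doc_chunks
                ORDER BY embedding <=> %s::vector
                LIMIT %s
                """,
                (vector_literal, vector_literal, k),
            )
            return cur.fetchall()
```

Cosine distance (<=>) is used here; switching to L2 distance (<->) or adding an IVFFlat or HNSW index changes the recall, latency, and cost tradeoffs this role is expected to tune.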
Requirements
- End-to-End Ownership: Drives work from design through production, including on-call and continuous improvement.
- LLM Systems Experience: Shipped or operated LLM-powered applications in production. Familiar with RAG design, prompt versioning, and chain orchestration using LangChain or similar.
- AWS Depth: Strong with core AWS services such as VPC, IAM, KMS, CloudWatch, S3, ECS/EKS, Lambda, Step Functions, Bedrock, and SageMaker.
- Data Engineering Skills: Comfortable building ingestion and transformation pipelines in Python. Familiar with Glue, Athena, and event-driven patterns using EventBridge and SQS.
- Security Mindset: Applies least privilege, secrets management, network isolation, and compliance practices appropriate to sensitive data.
- Evaluation and Metrics: Uses quantitative evals, A/B testing, and live metrics to guide improvements.
- Clear Communication: Explains tradeoffs and aligns partners across product, security, and application engineering.