AI Systems & Platform Internals – Technical Architect

Accellor

Technical Architect for AI Systems & Platform Internals at Accellor, designing large-scale AI systems that support ChatGPT and the OpenAI API. The role collaborates with diverse teams to ensure operational excellence and production reliability.

Posted 6/22/2026 • Full-time • San Francisco • California • 🇺🇸 United States • Senior, Lead

Tech Stack

Tools & technologies

Apache, Assembly, Cloud, Distributed Systems, Docker, Go, Grafana, Java, Kubernetes, Linux, Microservices, Prometheus, Python, PyTorch, Ray, Rust, TensorFlow, Terraform, TypeScript

About the role

Key responsibilities & impact

Design and evolve large-scale AI systems that support ChatGPT, OpenAI API, Codex, agentic workflows, multimodal models, and research workloads.
Define architecture across inference runtime, model serving, request routing, batching, KV-cache handling, GPU scheduling, distributed execution, observability, release gates, and production rollout.
Own technical trade-offs across latency, throughput, reliability, correctness, safety, scalability, cost, and infrastructure efficiency.
Architect high-throughput, low-latency inference systems across large-scale GPU clusters.
Work across inference engines, serving layers, scheduling systems, caching, streaming, deployment pipelines, and runtime optimization.
Partner with engineering teams to improve model-serving efficiency, tail latency, GPU utilization, memory efficiency, correctness under load, and cost per request.
Guide architecture decisions involving PyTorch, JAX, Triton, vLLM-style serving, CUDA/Triton kernels, distributed inference, tensor parallelism, pipeline parallelism, model sharding, and long-context serving.
Analyze and improve performance across GPU kernels, memory movement, collective communication, orchestration, and runtime scheduling.
Guide engineering decisions involving CUDA, Triton, NCCL/RCCL, GPU profiling, memory pressure, compute utilization, tensor layouts, interconnect behavior, and distributed execution.
Identify system-level bottlenecks across compute, memory, networking, scheduling, model execution, and data movement.
Design and guide context engineering frameworks that determine what information should be passed to the model, how it should be structured, how much context should be used, and how context quality should be measured.
Own architecture patterns for prompt structure, dynamic context assembly, retrieval-augmented generation, long-context management, conversation memory, tool context, agent state, multimodal context, source grounding, permission-aware retrieval, context compression, and context auditability.
Ensure AI systems use the right context, from the right source, with the right permissions, at the right cost, and with measurable quality.
Design and build cost optimization frameworks for large-scale LLM and GenAI workloads.
Create architecture patterns that reduce unnecessary token usage, redundant retrieval, repeated model calls, inefficient inference paths, and avoidable infrastructure spend.
Drive model routing, token budgeting, prompt compression, context pruning, semantic caching, response caching, batch inference, async execution, fallback strategies, and cost telemetry across AI workflows.
Ensure cost optimization does not compromise quality, safety, grounding, reliability, or user experience.
Collaborate with research and training infrastructure teams to support large-scale model training and post-training workflows.
Contribute to architecture around distributed training, checkpointing, orchestration, fault tolerance, observability, data movement, evaluation infrastructure, and experiment velocity.
Support frontier model workflows across pre-training, post-training, reinforcement learning, agent training, evaluation harnesses, and large-scale experiment execution.
Architect validation and release systems that ensure model updates, inference engine changes, runtime images, prompt changes, context changes, and platform releases are correct, safe, performant, and regression-free.
Define release gates across correctness, numerical stability, latency, throughput, token usage, cost regression, context quality, retrieval quality, safety behavior, reliability, and model output quality.
Ensure platform optimizations do not reduce safety, grounding, quality, or user trust.
Design systems that make AI infrastructure observable, debuggable, reliable, and operationally safe.
Define telemetry, tracing, dashboards, alerts, logs, profiling views, runbooks, SLOs, and post-incident learning loops.
Provide visibility into prompts, context payloads, retrieved sources, token consumption, model selection, cache behavior, inference latency, GPU utilization, evaluation scores, safety events, cost, and failures.
Turn production issues into stronger platform abstractions, safer rollout mechanisms, better automation, and more reliable infrastructure.
Support architecture for AI agents, tool use, memory, function calling, multimodal interaction, long-running workflows, and internal or external agent deployment.
Work across agent harnesses, evaluation pipelines, workflow orchestration, safety controls, state management, tool execution, memory systems, and product-facing runtime constraints.
Ensure agentic and multimodal systems are reliable, observable, secure, cost-aware, and safe under real workloads.
Work closely with Research, Inference, Runtime, Infrastructure, Product, Safety, Security, Technical Success, and Deployment teams.
Act as a senior technical authority who can cut across layers, resolve ambiguity, identify systemic risks, and drive architecture decisions.
Mentor engineers and technical leads on distributed systems, performance engineering, context engineering, cost optimization, production readiness, AI platform design, and architecture trade-offs.
Represent architecture decisions through design docs, RFCs, diagrams, technical reviews, operational plans, and leadership-level summaries.
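
To make the release-gate responsibility above more concrete, here is a minimal sketch in Python of how such a gate might compare a candidate build against a baseline and block rollout on regression. The metric names, thresholds, and measurements are hypothetical and chosen only for illustration; nothing here describes an actual Accellor system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    """One release gate: a metric and the worst relative change allowed."""
    metric: str
    max_regression: float  # e.g. 0.05 allows a 5% worsening
    higher_is_better: bool = False


def evaluate_gates(baseline, candidate, gates):
    """Return human-readable failures; an empty list means the release may proceed."""
    failures = []
    for gate in gates:
        base, cand = baseline[gate.metric], candidate[gate.metric]
        # Relative change, signed so that a positive value always means "worse".
        change = (base - cand) / base if gate.higher_is_better else (cand - base) / base
        if change > gate.max_regression:
            failures.append(f"{gate.metric}: {change:.1%} worse than baseline")
    return failures


# Hypothetical gates and measurements, for illustration only.
GATES = [
    Gate("p99_latency_ms", max_regression=0.05),
    Gate("cost_per_1k_requests_usd", max_regression=0.02),
    Gate("eval_accuracy", max_regression=0.01, higher_is_better=True),
]

baseline = {"p99_latency_ms": 800.0, "cost_per_1k_requests_usd": 1.20, "eval_accuracy": 0.90}
candidate = {"p99_latency_ms": 900.0, "cost_per_1k_requests_usd": 1.21, "eval_accuracy": 0.90}

failures = evaluate_gates(baseline, candidate, GATES)
print(failures)  # ['p99_latency_ms: 12.5% worse than baseline']
```

In this sketch only the latency gate fails, so the rollout would be blocked even though cost and quality are within tolerance.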

Requirements

What you’ll need

10–12 years of experience in software engineering, systems architecture, ML infrastructure, distributed systems, platform engineering, inference systems, cloud infrastructure, or large-scale backend engineering.
Strong hands-on engineering experience with **Python** and at least one systems/backend language such as **C++, Go, Rust, Java, or TypeScript**.
Deep understanding of distributed systems, production infrastructure, reliability engineering, scalability, observability, and fault-tolerant architecture.
Experience designing or operating large-scale systems involving APIs, microservices, distributed compute, orchestration, job scheduling, caching, high-availability infrastructure, and production monitoring.
Strong understanding of AI/ML systems, especially model serving, inference workflows, context engineering, retrieval systems, evaluation pipelines, and production model deployment.
Practical understanding of GPU systems, accelerator-based workloads, CUDA/Triton-style programming, distributed inference, GPU profiling, memory optimization, and communication libraries such as NCCL or RCCL.
Experience with ML frameworks and serving stacks such as PyTorch, JAX, TensorFlow, Triton, vLLM-style serving, Ray, Kubernetes-based serving, or internal model-serving systems.
Ability to debug complex problems across model behavior, runtime systems, distributed infrastructure, networking, GPU execution, context quality, retrieval quality, evaluation harnesses, and production services.
Strong communication skills with the ability to write clear architecture documents, evaluate trade-offs, review implementation quality, and align teams around technically sound decisions.
**Preferred Qualifications:**
Experience working on LLM inference, multimodal inference, agent infrastructure, AI assistants, coding agents, or frontier-model serving platforms.
Experience with tensor parallelism, pipeline parallelism, model sharding, KV-cache optimization, batching, speculative decoding, streaming inference, and long-context serving.
Experience designing context engineering platforms, prompt/version management systems, model-routing frameworks, semantic caching layers, token-budgeting systems, or LLM cost dashboards.
Experience profiling GPU workloads using Nsight Systems, Nsight Compute, rocprof, perf, Prometheus, Grafana, OpenTelemetry, or custom profiling systems.
Experience with large-scale distributed training, RL infrastructure, checkpointing, ML compiler optimizations, model graph transformations, or training runtime systems.
Experience designing release gates, regression detection systems, canary systems, CI/CD validation frameworks, and production safety controls for performance-sensitive infrastructure.
Experience with evals, model quality measurement, hallucination detection, grounding evaluation, safety testing, and model behavior monitoring.
**Technical Skill Areas:**
**AI Systems:** LLM serving, inference runtime, training infrastructure, post-training workflows, agent systems, multimodal models
**Inference:** batching, routing, KV-cache, streaming, latency optimization, model serving, tensor parallelism, pipeline parallelism
**Performance Engineering:** CUDA, Triton, GPU profiling, kernel optimization, memory bandwidth, communication libraries, distributed execution
**Context Engineering:** prompt architecture, dynamic context assembly, RAG, memory, context compression, context ranking, source grounding, permission-aware retrieval
**Cost Optimization:** token budgeting, caching, model routing, fallback strategies, cost telemetry, batching, async workflows, cost-quality trade-offs
**Distributed Systems:** scheduling, orchestration, reliability, fault tolerance, observability, scalability, service design
**ML Frameworks:** PyTorch, JAX, TensorFlow, Triton, vLLM-style serving, Ray
**Infrastructure:** Kubernetes, Docker, Terraform, CI/CD, cloud platforms, Linux systems, networking, storage
**Safety & Validation:** evals, release gates, canaries, regression testing, model behavior validation, rollout safety
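
As a rough illustration of the token budgeting and context pruning named under Context Engineering and Cost Optimization, the following Python sketch greedily keeps the highest-scoring retrieved chunks that fit within a token budget. The function name `assemble_context`, the relevance scores, and the whitespace-based token count are all assumptions made for the example; a real system would use the model's own tokenizer and a measured relevance signal.

```python
def assemble_context(chunks, budget_tokens, count_tokens=lambda text: len(text.split())):
    """Greedily pick the highest-scoring chunks that fit within a token budget.

    chunks: iterable of (score, text) pairs; a higher score means more relevant.
    count_tokens: placeholder tokenizer (whitespace split), assumed for this sketch.
    """
    selected, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        if used + cost <= budget_tokens:  # skip chunks that would exceed the budget
            selected.append(text)
            used += cost
    return selected, used


# Hypothetical retrieved chunks with made-up relevance scores.
chunks = [
    (0.9, "refund policy allows returns within 30 days"),
    (0.4, "company was founded in 2010"),
    (0.7, "refunds are issued to the original payment method"),
]
context, used = assemble_context(chunks, budget_tokens=16)
print(context, used)
```

With a budget of 16 tokens, the two refund-related chunks are kept (15 tokens) and the lowest-scoring chunk is dropped, which is the cost-versus-quality trade-off the skill areas describe.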
**Candidate Profile:**
The ideal candidate is a senior hands-on architect who can operate across the full AI systems stack.
They should be able to discuss GPU memory bottlenecks, distributed inference, model-serving reliability, context quality, cost optimization, release validation, eval pipelines, observability, and production rollout with engineering teams, while also explaining architecture decisions clearly to senior leadership.
The candidate should not be limited to architecture diagrams. They must be capable of reviewing implementation quality, identifying bottlenecks, debugging production issues, challenging weak assumptions, and converting repeated failures into stronger platform abstractions.
This role requires the judgment of a senior architect, the debugging mindset of a systems engineer, and the ownership mindset required for production AI infrastructure.

About Accellor

Accellor is a company offering AI-driven solutions across various industries, focusing on enhancing efficiency and engagement through advanced applications and data strategies. Their services include leveraging artificial intelligence for enterprise applications, product engineering, and cloud services to transform industries such as healthcare, manufacturing, financial services, real estate, retail, travel, and hospitality. Accellor partners with technology leaders like Salesforce and Microsoft Dynamics 365 to deliver personalized, intelligent business applications. Committed to responsible AI practices, Accellor helps organizations harness the potential of data and AI to drive strategic decisions, automate operations, and provide superior experiences.

Company size: 201–500 employees. Sectors: Enterprise, SaaS, AI. Work arrangement: San Francisco (hybrid), full time. H1B visa sponsor.

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Python, C++, Go, Rust, Java, TypeScript, CUDA, Triton, PyTorch, JAX

Soft Skills

communication, leadership, mentoring, problem-solving, collaboration, technical authority, decision-making, architecture documentation, trade-off evaluation, team alignment