AI/ML Data Engineer

The College Board

full-time

Location Type: Hybrid

Location: United States

Salary

💰 $137,000 - $148,000 per year

About the role

  • Design, build, and own batch and streaming ETL (e.g., Kinesis/Kafka → Spark/Glue → Step Functions/Airflow) for training, evaluation, and inference use cases; a streaming sketch follows this list
  • Stand up and maintain offline/online feature stores and embedding pipelines (e.g., S3/Parquet/Iceberg + vector index) with reproducible backfills
  • Implement data contracts & validation (e.g., Great Expectations/Deequ), schema evolution, and metadata/lineage capture (e.g., OpenLineage/DataHub/Amundsen); a contract-check sketch follows this list
  • Optimize lakehouse/warehouse layouts and partitioning (e.g., Redshift/Athena/Iceberg) for scalable ML and analytics
  • Productionize training and evaluation datasets with versioning (e.g., DVC/LakeFS) and experiment tracking (e.g., MLflow); a tracking sketch follows this list
  • Build RAG foundations: document ingestion, chunking, embeddings, retrieval indexing, and quality evaluation (precision@k, faithfulness, latency, and cost); a precision@k sketch follows this list
  • Collaborate with DS to ship models to serving (e.g., SageMaker/EKS/ECS), automate feature backfills, and capture inference data for continuous improvement
  • Define SLOs and instrument observability across data and model services (freshness, drift/skew, lineage, cost, and performance); a drift-check sketch follows this list
  • Embed security & privacy by design (PII minimization/redaction, secrets management, access controls), aligning with College Board standards and FERPA; a redaction sketch follows this list
  • Build CI/CD for data and models with automated testing, quality gates, and safe rollouts (shadow/canary)
  • Maintain docs-as-code for pipelines, contracts, and runbooks; create internal guides and tech talks
  • Mentor peers through design reviews, pair/mob sessions, and post-incident learning.
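
The streaming leg of the first responsibility can start as a single Structured Streaming job. Below is a minimal sketch, assuming Spark with the Kafka connector on the classpath; the broker, topic, event schema, and S3 paths are hypothetical placeholders, not College Board systems.

```python
# Minimal sketch: consume JSON events from a Kafka topic and land them as
# partitioned Parquet for downstream training and inference.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("event-ingest").getOrCreate()

# Expected shape of each message value (hypothetical event schema).
event_schema = StructType([
    StructField("student_id", StringType()),
    StructField("event_type", StringType()),
    StructField("score", DoubleType()),
    StructField("event_ts", TimestampType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "assessment-events")           # placeholder topic
    .load()
)

events = (
    raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .withColumn("event_date", col("event_ts").cast("date"))
)

# Checkpointing lets the stream resume exactly where it left off after restarts.
query = (
    events.writeStream.format("parquet")
    .option("path", "s3://example-bucket/events/")              # placeholder path
    .option("checkpointLocation", "s3://example-bucket/_chk/")  # placeholder path
    .partitionBy("event_date")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```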
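
The data-contract bullet is easy to prototype even before wiring in Great Expectations or Deequ. The sketch below hand-rolls the core idea with hypothetical column names and thresholds; real pipelines would express the same expectations in whichever framework the team standardizes on.

```python
# Hand-rolled illustration of a data contract: declare expectations once,
# fail the batch loudly when any are violated. Columns and thresholds are
# hypothetical.
import pandas as pd

CONTRACT = {
    "required_columns": {"student_id": "object", "score": "float64", "event_ts": "datetime64[ns]"},
    "non_nullable": ["student_id", "event_ts"],
    "ranges": {"score": (0.0, 100.0)},
}

def validate_batch(df: pd.DataFrame, contract: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the batch passes."""
    violations = []
    for name, dtype in contract["required_columns"].items():
        if name not in df.columns:
            violations.append(f"missing column: {name}")
        elif str(df[name].dtype) != dtype:
            violations.append(f"{name}: expected {dtype}, got {df[name].dtype}")
    for name in contract["non_nullable"]:
        if name in df.columns and df[name].isna().any():
            violations.append(f"{name}: contains nulls")
    for name, (lo, hi) in contract["ranges"].items():
        if name in df.columns and not df[name].dropna().between(lo, hi).all():
            violations.append(f"{name}: values outside [{lo}, {hi}]")
    return violations

batch = pd.DataFrame({
    "student_id": ["a1", "a2"],
    "score": [88.0, 101.5],  # 101.5 intentionally violates the declared range
    "event_ts": pd.to_datetime(["2024-01-01", "2024-01-02"]),
})
problems = validate_batch(batch, CONTRACT)
if problems:
    raise ValueError(f"data contract violated: {problems}")
```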
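
For the versioning and experiment-tracking bullet, the MLflow side can be sketched as below, assuming a reachable tracking server; the URIs, dataset versions, and metric values are hypothetical placeholders.

```python
# Minimal sketch: tie an evaluation run to the exact dataset snapshots that
# produced it, so results stay reproducible. All values are placeholders.
import mlflow

mlflow.set_tracking_uri("http://mlflow.internal:5000")  # placeholder server
mlflow.set_experiment("scoring-model-eval")             # placeholder experiment

with mlflow.start_run(run_name="eval-2024-06-01"):
    # Dataset versions could be DVC/LakeFS commits or immutable S3 prefixes.
    mlflow.log_param("train_dataset_version", "s3://example-bucket/train/v=2024-06-01/")
    mlflow.log_param("eval_dataset_version", "s3://example-bucket/eval/v=2024-06-01/")
    mlflow.log_param("code_commit", "abc1234")

    # Metrics emitted by the offline evaluation job (placeholder numbers).
    mlflow.log_metric("precision_at_5", 0.82)
    mlflow.log_metric("faithfulness", 0.91)
    mlflow.log_metric("p95_latency_ms", 240.0)
```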
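
One of the RAG quality metrics named above, precision@k, is small enough to show in full. The document IDs and relevance labels below are hypothetical.

```python
# Precision@k: what fraction of the top-k retrieved chunks are actually relevant?
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)

# One labeled evaluation query: the retriever returned five chunks,
# and human labels mark two documents as truly relevant.
retrieved = ["doc_7", "doc_2", "doc_9", "doc_4", "doc_1"]
relevant = {"doc_2", "doc_4"}
print(precision_at_k(retrieved, relevant, k=5))  # 2 relevant in the top 5 -> 0.4
```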
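
For the drift/skew part of the observability bullet, one common signal is the population stability index between a training baseline and recent inference traffic. The sketch below uses synthetic data and conventional bin counts; thresholds would be tuned per feature rather than taken as policy.

```python
# Population stability index (PSI) over quantile bins of the training baseline;
# ~0.1 is often read as mild drift and ~0.25 as severe, as rules of thumb only.
import numpy as np

def population_stability_index(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    base_counts, _ = np.histogram(baseline, bins=edges)
    curr_counts, _ = np.histogram(np.clip(current, edges[0], edges[-1]), bins=edges)
    base_pct = np.clip(base_counts / base_counts.sum(), 1e-6, None)
    curr_pct = np.clip(curr_counts / curr_counts.sum(), 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(70, 10, 10_000)  # feature distribution at training time
current = rng.normal(74, 12, 1_000)    # recent inference traffic, slightly shifted
print(population_stability_index(baseline, current))
```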
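
The privacy bullet is mostly about process and access controls, but the PII-minimization piece can be sketched: redact obvious identifiers before text reaches logs or an LLM prompt. The patterns below are illustrative only, not a complete or compliance-reviewed rule set.

```python
# Replace obvious identifiers with typed placeholders before logging or prompting.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact_pii(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text

print(redact_pii("Contact jane.doe@example.org or 555-867-5309 about the score report."))
# -> Contact [REDACTED_EMAIL] or [REDACTED_PHONE] about the score report.
```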

Requirements

  • 4+ years in data engineering (or 3+ with substantial ML productionization)
  • Strong Python and distributed compute (Spark/Glue/Dask) skills
  • Proven experience shipping ML data systems (training/eval datasets, feature or embedding pipelines, artifact/version management, experiment tracking)
  • MLOps / LLMOps: orchestration (Airflow/Step Functions), containerization (Docker), and deployment (SageMaker/EKS/ECS); CI/CD for data & models (an orchestration sketch follows this list)
  • Expert SQL and data modeling for lakehouse/warehouse (Redshift/Athena/Iceberg), with performance tuning for large datasets
  • Data quality & contracts (Great Expectations/Deequ), lineage/metadata (OpenLineage/DataHub/Amundsen), and drift/skew monitoring
  • Cloud experience, preferably with AWS services such as S3, Glue, Lambda, Athena, Bedrock, OpenSearch, API Gateway, DynamoDB, SageMaker, Step Functions, Redshift, and Kinesis
  • BI tools such as Tableau, QuickSight, or Looker for real-time analytics and dashboards
  • Security and privacy mindset; ability to design compliant pipelines handling sensitive student data
  • Ability to judiciously evaluate the feasibility, fairness, and effectiveness of AI solutions, and to articulate the considerations and concerns of applying models to specific business applications
  • Excellent communication, collaboration, and documentation habits.
  • Preferred: RAG & vector search experience (OpenSearch KNN/pgvector/FAISS) and prompt/eval frameworks (a vector-search sketch follows this list)
  • Real-time feature engineering (Kinesis/Kafka) and low-latency stores for online inference
  • Testing strategies for ML systems (unit/contract tests, data fuzzing, offline/online parity checks); a parity-check sketch follows this list
  • Experience in higher-ed/assessments data domains.
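
For the orchestration requirement, a minimal Airflow sketch is below, assuming Airflow 2.x; the DAG id, schedule, and task bodies are hypothetical stand-ins for real pipeline steps.

```python
# Minimal DAG: extract features, validate the data contract, publish the dataset.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_features():
    """Placeholder: pull the latest raw events and compute features."""
    ...

def validate_contract():
    """Placeholder: run data-contract checks and fail the run on violations."""
    ...

def publish_training_set():
    """Placeholder: write the versioned training dataset and register it."""
    ...

with DAG(
    dag_id="daily_training_dataset",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_features", python_callable=extract_features)
    validate = PythonOperator(task_id="validate_contract", python_callable=validate_contract)
    publish = PythonOperator(task_id="publish_training_set", python_callable=publish_training_set)

    extract >> validate >> publish
```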
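
For the preferred vector-search experience, the core retrieval loop is compact. The sketch below assumes the faiss-cpu package and uses random vectors where real document-chunk embeddings would go.

```python
# Exact inner-product search over L2-normalized vectors (i.e., cosine similarity).
import faiss
import numpy as np

dim = 384                      # hypothetical embedding size
rng = np.random.default_rng(0)

chunks = rng.standard_normal((1_000, dim)).astype("float32")  # stand-in corpus embeddings
faiss.normalize_L2(chunks)

index = faiss.IndexFlatIP(dim)
index.add(chunks)

query = rng.standard_normal((1, dim)).astype("float32")       # stand-in query embedding
faiss.normalize_L2(query)

scores, ids = index.search(query, 5)  # top-5 nearest chunks
print(ids[0], scores[0])
```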
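
The offline/online parity item can be expressed as an ordinary test: recompute features offline for a sample of recent requests and compare against what the online store served. The feature names, values, and lookup functions below are hypothetical.

```python
# Flag any feature whose online value disagrees with an offline recomputation.
import math

TOLERANCE = 1e-6

def fetch_online_features(entity_id: str) -> dict[str, float]:
    """Placeholder for a low-latency online-store lookup."""
    return {"avg_score_30d": 81.2, "sessions_7d": 4.0}

def recompute_offline_features(entity_id: str) -> dict[str, float]:
    """Placeholder for recomputing the same features from the offline store."""
    return {"avg_score_30d": 81.2, "sessions_7d": 4.0}

def check_parity(entity_id: str) -> list[str]:
    online = fetch_online_features(entity_id)
    offline = recompute_offline_features(entity_id)
    mismatches = []
    for name, offline_value in offline.items():
        if name not in online:
            mismatches.append(f"{name}: missing from online store")
        elif not math.isclose(online[name], offline_value, abs_tol=TOLERANCE):
            mismatches.append(f"{name}: online={online[name]} offline={offline_value}")
    return mismatches

assert check_parity("student-123") == [], "offline/online feature skew detected"
```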

Benefits

  • Annual bonuses and opportunities for merit-based raises and promotions
  • A mission-driven workplace where your impact matters
  • A team that invests in your development and success

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
Python, Spark, Glue, Dask, SQL, MLOps, LLMOps, Data modeling, Real-time feature engineering, Testing strategies
Soft skills
Communication, Collaboration, Documentation, Mentoring, Judicious evaluation