
AI/ML Data Engineer
The College Board
full-time
Location Type: Hybrid
Location: United States
Salary
$137,000 - $148,000 per year
About the role
- Design, build, and own batch and streaming ETL (e.g., Kinesis/Kafka → Spark/Glue → Step Functions/Airflow) for training, evaluation, and inference use cases
- Stand up and maintain offline/online feature stores and embedding pipelines (e.g., S3/Parquet/Iceberg + vector index) with reproducible backfills
- Implement data contracts & validation (e.g., Great Expectations/Deequ), schema evolution, and metadata/lineage capture (e.g., OpenLineage/DataHub/Amundsen)
- Optimize lakehouse/warehouse layouts and partitioning (e.g., Redshift/Athena/Iceberg) for scalable ML and analytics
- Productionize training and evaluation datasets with versioning (e.g., DVC/LakeFS) and experiment tracking (e.g., MLflow)
- Build RAG foundations: document ingestion, chunking, embeddings, retrieval indexing, and quality evaluation (precision@k, faithfulness, latency, and cost); see the precision@k sketch after this list
- Collaborate with DS to ship models to serving (e.g., SageMaker/EKS/ECS), automate feature backfills, and capture inference data for continuous improvement
- Define SLOs and instrument observability across data and model services (freshness, drift/skew, lineage, cost, and performance)
- Embed security & privacy by design (PII minimization/redaction, secrets management, access controls), aligning with College Board standards and FERPA
- Build CI/CD for data and models with automated testing, quality gates, and safe rollouts (shadow/canary)
- Maintain docs-as-code for pipelines, contracts, and runbooks; create internal guides and tech talks
- Mentor peers through design reviews, pair/mob sessions, and post-incident learning
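For context on the retrieval-quality metric named in the RAG bullet above: precision@k is the fraction of the top-k retrieved documents that are actually relevant to the query. A minimal Python sketch, with hypothetical document IDs and relevance labels:

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

# Hypothetical retrieval result: 3 of the top 5 chunks are relevant -> 0.6
print(precision_at_k(["d1", "d7", "d3", "d9", "d2"], {"d1", "d3", "d2"}))
```

In a real evaluation pipeline, the relevance set would come from a labeled test set, and the score would be averaged over many queries alongside faithfulness, latency, and cost.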
Requirements
- 4+ years in data engineering (or 3+ with substantial ML productionization)
- Strong Python and distributed compute (Spark/Glue/Dask) skills
- Proven experience shipping ML data systems (training/eval datasets, feature or embedding pipelines, artifact/version management, experiment tracking)
- MLOps / LLMOps: orchestration (Airflow/Step Functions), containerization (Docker), and deployment (SageMaker/EKS/ECS); CI/CD for data & models
- Expert SQL and data modeling for lakehouse/warehouse (Redshift/Athena/Iceberg), with performance tuning for large datasets
- Data quality & contracts (Great Expectations/Deequ), lineage/metadata (OpenLineage/DataHub/Amundsen), and drift/skew monitoring; see the drift-check sketch after this list
- Cloud experience, preferably with AWS services such as S3, Glue, Lambda, Athena, Bedrock, OpenSearch, API Gateway, DynamoDB, SageMaker, Step Functions, Redshift, and Kinesis
- Experience with BI tools such as Tableau, QuickSight, or Looker for real-time analytics and dashboards
- Security and privacy mindset; ability to design compliant pipelines handling sensitive student data
- Ability to judiciously evaluate the feasibility, fairness, and effectiveness of AI solutions, and to articulate the considerations and risks of deploying models in specific business contexts
- Excellent communication, collaboration, and documentation habits
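For the drift/skew monitoring point above: one common, library-free check is the population stability index (PSI) between a training-time (offline) and serving-time (online) feature distribution. A minimal sketch with synthetic data; the ~0.2 alert threshold is a widely used rule of thumb, not a College Board standard:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a training (expected) and serving (actual) distribution."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the bin fractions to avoid division by zero and log(0)
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)  # offline/training feature values
serve = rng.normal(0.3, 1.0, 10_000)  # shifted online feature values
print(round(population_stability_index(train, serve), 3))
```

A PSI near 0 means the serving distribution matches training; values above roughly 0.2 are commonly treated as actionable drift worth alerting on.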
Preferred qualifications
- RAG & vector search experience (OpenSearch k-NN/pgvector/FAISS) and prompt/eval frameworks
- Real-time feature engineering (Kinesis/Kafka) and low-latency stores for online inference
- Testing strategies for ML systems (unit/contract tests, data fuzzing, offline/online parity checks)
- Experience in higher-ed/assessments data domains
Benefits
- Annual bonuses and opportunities for merit-based raises and promotions
- A mission-driven workplace where your impact matters
- A team that invests in your development and success
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Python, Spark, Glue, Dask, SQL, MLOps, LLMOps, Data modeling, Real-time feature engineering, Testing strategies
Soft skills
Communication, Collaboration, Documentation, Mentoring, Judicious evaluation