
Lead Engineer, Enterprise Incident & Change Management
The College Board
full-time
Location Type: Remote
Location: United States
Salary
💰 $168,000 - $183,000 per year
About the role
- Design and Implementation (60%)
- Evaluate incident and change management frameworks using data-driven insights to identify opportunities for improvement that will provide value to the EIM team and engineering teams.
- Design and implement automation solutions for incident response and management, change management, and observability, drawing on input and feedback from domain SMEs and end users.
- Develop and maintain scripts, tools, and integrations to reduce manual processes and operational overhead.
- Define key performance indicators (KPIs) and metrics to measure the success of automation and improvement efforts, and develop and enhance dashboards and reporting mechanisms that track those KPIs as well as incident and change management performance.
- Ensure compliance with governance, risk, and change control policies while promoting agility and innovation.
- Lead cross-functional initiatives and partner with domain SMEs to analyze, design, and deliver powerful features, capabilities, and automation strategies that align with engineering best practices.
- Serve as a subject matter expert (SME) for cloud operations, infrastructure automation, and CI/CD pipelines.
- Strategy, Operations Support, and Communication (25%)
- Collaborate with the EIM team’s director and other technology leaders to understand business objectives and team goals and to align solutions and process improvement efforts with those goals.
- Contribute to the long-term technology strategy by researching emerging trends, evaluating new tools (especially AI-driven tools that support observability), and recommending technologies or automations that improve cost-effectiveness, the delivery of performance metrics, and system and process efficiency.
- Participate in weekly on-call and incident response rotations, which involve monitoring alerts to identify potential issues, ensuring timely triage and escalation of incidents, collaborating with impacted teams, and supporting assessment, response, and communication until the incident is resolved.
- Play an active role in agile scrum ceremonies while contributing to high-quality team deliverables.
- Team Coordination (15%)
- Provide technical direction and guidance to team members, ensuring alignment with architectural standards, best practices, and organizational objectives.
- Review designs, automation scripts, and implementation plans, offering constructive feedback to improve quality, efficiency, and maintainability.
- Foster a culture of continuous learning and collaboration by mentoring engineers in modern automation, cloud infrastructure, and operational excellence.
Requirements
- 7+ years of software development experience with Infrastructure as Code (IaC), CI/CD frameworks, immutable infrastructure, automation, orchestration, and other modern DevOps patterns.
- Strong proficiency in IaC tools (e.g., Terraform, CloudFormation, Ansible); experience with CI/CD pipeline design and automation using platforms such as Jenkins, GitLab CI, or GitHub Actions is a plus.
- Strong knowledge of and experience with distributed cloud infrastructure, including AWS resources such as Lambda, SNS, SQS, S3, Step Functions, EC2, ECS, VPC, IAM, CloudWatch, and DynamoDB.
- Experience building event-driven cloud-based serverless applications, with technical knowledge of cloud computing, DevOps, and microservices.
- Strong coding and scripting experience for automation and integration tasks, using languages and frameworks such as JavaScript, TypeScript, React.js, and Node.js, along with proficiency in scripting languages (e.g., Python, Bash, PowerShell).
- Familiarity with AI tools used for observability (e.g., AWS Resilience Hub).
- Familiarity with incident and change management systems (e.g., Jira Service Management).
- Deep understanding of ITIL frameworks, especially incident, change, and problem management.
- Experience integrating monitoring and alerting tools (e.g., Datadog, Prometheus, CloudWatch, Grafana).
- Strong troubleshooting, analytical, and problem-solving skills.
- Proven ability to lead technical initiatives, influence cross-functional teams, and prioritize and execute tasks in a high-pressure environment.
- Excellent communication skills, with the ability to translate technical details into business outcomes.
- Ability to take a week-long on-call shift about once every month and a half.
- Authorization to work in the U.S.
Benefits
- Annual bonuses and opportunities for merit-based raises and promotions
- A mission-driven workplace where your impact matters
- A team that invests in your development and success