Work alongside the development and operations teams to ensure speedy and reliable software deployments, monitor systems, and improve overall reliability of the platform
Discover and document system bugs, and take the initiative to fix them yourself
Develop features utilizing the AI coding tool and repository of scripts to automate, scale, test, and secure the cloud infrastructure and the pipelines
Enhance performance monitoring of the various systems via Splunk or other dashboard reporting tools
Identify performance bottlenecks and optimize the performance of cloud infrastructure
Contribute to continuing our SRE journey by suggesting ways to improve engineering build, maintenance, automation and reliability across the platform with SRE/DevOps tools and Infrastructure-as-Code
Develop and code high-quality pipeline automation workflows, both inside and outside the cloud platform, that align with business and technology strategies
Develop and execute test strategies that simulate real-world failure scenarios, including network disruptions, hardware failures, and system overloads
Create, script, and run performance tests to measure system behavior under varying levels of load and traffic
Identify bottlenecks, performance degradation, and areas for optimization
Design, implement, and maintain automated test suites for infrastructure and application components
Ensure that testing is integrated into the CI/CD pipeline to validate system reliability with every release
Build automated systems for continuous performance testing, stress testing, and load testing
Work closely with SREs, developers, and operations teams to define reliability goals and develop appropriate testing strategies to validate those goals
Ensure that new services and features undergo thorough testing for performance, reliability, and failure recovery before deployment to production
Validate that monitoring, logging, and alerting mechanisms are functioning correctly by testing systems under failure conditions
Ensure that Service Level Indicators (SLIs) and Service Level Objectives (SLOs) are accurately measured and tracked through automated testing frameworks
Resolve most conflicts between timeline, budget, and scope independently, but use sound judgment to escalate complex or consequential issues to senior management

Requirements

Typically requires a Bachelor’s degree; however, 4–8 years of prior relevant experience may be considered in lieu of a degree
Must currently possess, and be able to maintain, an active DoD Secret security clearance
Minimum of DoD 8570.01 IAT Level II Certification required prior to onboarding and must maintain certification while supporting the SMIT Contract
5+ years’ experience configuring Cisco routers, switches, and network appliances
5+ years’ experience with routing protocols (e.g., OSPF, EIGRP, BGP)
5+ years’ experience with L2 switching (e.g., VLANs, spanning tree, VTP)
5+ years’ experience troubleshooting complex routing and switching issues
Experience with multiple vendor routing, switching or wireless product lines
Strong understanding and in-depth knowledge of TCP/IP network/subnet addressing
Ability to work independently or in a team environment to resolve technical issues in a dynamic environment
Experience designing, coding, debugging, and maintaining automation scripts (using Bash, Python, etc.) preferred
Experience with CI/CD toolsets (e.g., Jenkins, GitLab)
Experience with Containerization (Docker) and Container Orchestration (Kubernetes)
Good command of Linux/Unix and the command line
Experience in application administration, configuration, and integration
Familiarity with agile development methodologies
Skill and discipline to work effectively with a distributed team
Ability to work in a highly collaborative, forward thinking, and innovation-driven environment
Knowledge of Agile and DevSecOps/SRE concepts and best practices, with a desire to grow that knowledge
Hands-on experience with Atlassian products (Jira, Confluence, Bitbucket, etc.)
Experience creating Jira and/or Azure DevOps workflows, projects, and custom configurations
Experience administering and maintaining an SRE platform via Ansible playbooks (e.g., upgrading Jenkins)
Experience automating tasks with scripting languages like PowerShell or Python
Experience integrating with and maintaining various third-party CI/CD tools like Jenkins and GitLab
Experience with PaaS using Red Hat OpenShift/Kubernetes and Docker containers
Experience with commercial cloud infrastructure deployment environments such as AWS and Azure
Experience with automated provisioning and configuration tools like Terraform, CloudFormation, Chef, Puppet, Ansible, or similar technologies
Working knowledge of the Risk Management Framework (RMF) and DISA STIGs

Benefits

Health and Wellness programs
Income Protection
Paid Leave
Retirement

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

AI coding tool, pipeline automation, performance testing, automated test suites, CI/CD, Cisco routers, routing protocols, L2 switching, scripting (bash, python), containerization (Docker)

Soft Skills

independent work, team collaboration, problem-solving, communication, innovation-driven mindset, dynamic environment adaptability, conflict resolution, forward thinking, attention to detail, motivation

Certifications

Bachelor's degree, DoD Secret security clearance, DoD 8570.01 IAT Level II Certification