
Site Reliability Engineer
LMI
Full-time
Location Type: Remote
Location: Remote • 🇺🇸 United States
Salary
💰 $140,000 - $170,000 per year
Job Level
Mid-Level, Senior
Tech Stack
Ansible, AWS, Azure, Cloud, Cyber Security, Grafana, Prometheus, Python, Splunk, Terraform
About the role
- Monitor the health, performance, and availability of H2FMS applications, services, APIs, and data services in Army GovCloud.
- Troubleshoot system issues across application, data, and infrastructure layers.
- Implement reliability patterns such as redundancy, graceful degradation, and failover strategies.
- Support performance optimization activities based on monitoring metrics and trends.
- Manage user access controls, role-based permissions, and environment access configurations.
- Maintain, monitor, and archive system logs, audit logs, and access logs to support RMF and cATO requirements.
- Support ISSO and Cybersecurity teams in log retrieval, incident investigations, and audit preparation.
- Develop and maintain automation scripts to improve environment stability, operational workflows, and deployment reliability.
- Collaborate with DevSecOps engineers to integrate automated runtime checks, monitoring, and health checks within CI/CD pipelines.
- Assist in implementing automated scaling, alerting, and self-healing mechanisms.
- Participate in incident response activities, including detection, diagnosis, escalation, mitigation, and documentation.
- Coordinate with cybersecurity teams during security events or anomalies.
- Conduct root-cause analysis and contribute to long-term corrective actions.
- Maintain environment configuration inventories related to access, logging, monitoring, and deployment parameters.
- Support configuration management, patch activities, and version control for infrastructure and application components.
- Collaborate with the Cloud Architect on environment design updates and capacity planning.
- Document system configurations, access processes, log retention procedures, and environment health dashboards.
- Support the ISSM and ISSO teams in continuous monitoring package updates and RMF documentation.
- Maintain audit-ready artifacts related to reliability operations and environment management.
Requirements
- Bachelor’s degree in Information Technology, Computer Science, Engineering, Cybersecurity, or a related field.
- 3–6 years of experience in cloud operations, SRE, DevOps, or system administration roles.
- Hands-on experience with cloud monitoring, logging, and performance management tools (AWS CloudWatch, Azure Monitor, ELK/Splunk, Prometheus/Grafana, etc.).
- Experience with automation tools (Python, Bash, Terraform, Ansible, etc.).
- Familiarity with RMF, Zero Trust, and DoW cloud security requirements.
- Understanding of CI/CD pipelines and deployment processes.
- Ability to obtain and maintain a DoD Secret clearance.
Benefits
- Health insurance
- Work-Life Wellness
- Career Development
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
cloud operations, site reliability engineering, DevOps, system administration, automation scripting, performance optimization, root-cause analysis, configuration management, incident response, environment design
Soft skills
troubleshooting, collaboration, communication, problem-solving, documentation
Certifications
DoD Secret clearance