
Site Reliability Engineer
TWG Global
Full-time
Location Type: Hybrid
Location: Jacksonville • Florida • United States
Salary
💰 $120,000 - $190,000 per year
About the role
- Build and maintain infrastructure to support real-time and batch ML workloads
- Implement observability tools (logging, monitoring, alerting) for model performance and system uptime
- Design and manage CI/CD pipelines for ML and data applications
- Ensure high availability, disaster recovery, and rollback capabilities for production environments
- Manage access controls, secrets, and security policies in collaboration with compliance and IT
- Troubleshoot incidents, lead postmortems, and drive root-cause resolution
- Work with U.S. and international teams to provide 24/7 coverage across time zones
Requirements
- 3–6 years of experience in DevOps, SRE, or backend engineering roles
- Proficiency with tools such as Docker, Kubernetes, Terraform, GitLab/GitHub Actions, and Airflow
- Strong scripting in Python or Bash and familiarity with Linux environments
- Experience deploying and monitoring ML models or data pipelines in production
- Knowledge of observability stacks (e.g., Prometheus, Grafana, ELK, Datadog)
- Familiarity with cloud platforms (e.g., AWS, GCP, or Azure)
- Strong documentation, problem-solving, and incident response skills
Preferred Qualifications
- Experience supporting ML/AI workflows using Palantir Foundry
- Exposure to compliance frameworks like SOC 2, ISO 27001, or financial regulations
- Knowledge of MLOps frameworks (e.g., MLflow, Kubeflow, SageMaker Pipelines)
- Ability to automate deployments, testing, and monitoring at scale
Benefits
- Work on real-world AI applications with high-impact clients
- Collaborate with world-class data scientists, engineers, and product leaders
- Flat org structure, high trust, high autonomy
- Competitive salary + performance-based incentives
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
DevOps, SRE, backend engineering, scripting in Python, scripting in Bash, Linux environments, MLOps frameworks, observability stacks, CI/CD pipelines, monitoring ML models
Soft Skills
problem-solving, incident response, documentation, collaboration
Certifications
SOC 2, ISO 27001