
Principal Data Scientist
Walmart
full-time
Location Type: Office
Location: Bangalore • 🇮🇳 India
Job Level
Lead
Tech Stack
Distributed Systems, Kubernetes, NumPy, Pandas, Prometheus, Python, Scikit-learn, SQL
About the role
- Build, train, and deploy time-series models for smart and predictive autoscaling of Kubernetes workloads.
- Forecast traffic and resource demand.
- Detect seasonality (daily, weekly, and annual patterns).
- Detect anomalies in metrics, logs, and traces.
- Perform deep exploratory data analysis (EDA) on large-scale telemetry data (CPU, memory, latency, errors, throughput).
- Select, implement, and tune statistical and ML techniques (ARIMA, Prophet, tree-based models, deep learning as appropriate).
- Continuously evaluate models using offline metrics and live production feedback.
- Write production-grade Python code for model training, inference, and evaluation.
- Integrate ML outputs directly into SRE workflows, including Kubernetes HPA/VPA and custom autoscaling controllers, alerting and incident detection pipelines, and capacity planning and cost optimization tools.
- Define safeguards, fallback logic, and confidence thresholds to ensure safe autonomous actions.
- Debug model and data issues using real production incidents and postmortems.
- Build and maintain feature pipelines from observability data sources (Prometheus, OpenTelemetry, logs, traces).
- Work with streaming and batch data pipelines to process high-cardinality, high-volume time-series data.
- Ensure data quality, freshness, and correctness for real-time decision systems.
- Design schemas and feature stores optimized for time-series ML workloads.
- Own models end to end: development → deployment → monitoring → retraining.
- Implement monitoring for model accuracy and drift, data drift and pipeline failures, and the impact of models on system reliability and scaling behavior.
- Automate retraining and validation pipelines where appropriate.
- Act as the go-to expert for applied ML in SRE contexts.
- Review and improve ML and data science code written by other team members.
- Partner closely with SREs to translate reliability problems into concrete modeling tasks.
- Drive adoption of ML solutions by proving value through metrics and outcomes.
Requirements
- 12+ years of experience in data science or applied machine learning.
- 5+ years deploying ML models in production, not just experimentation.
- Strong experience working with time-series data at scale.
- Proven track record of owning systems end to end in high-availability environments.
- Expert-level Python (NumPy, Pandas, SciPy, Scikit-learn).
- Strong experience with time-series forecasting and anomaly detection techniques.
- Practical understanding of Kubernetes autoscaling (HPA/VPA, custom metrics).
- Experience working with metrics, logs, and traces from distributed systems.
- Comfortable querying and analyzing large datasets using SQL and time-series databases.
- Strong understanding of distributed systems behavior (latency, load, failures, cascading effects).
Benefits
- Beyond our great compensation package, you can receive incentive awards for your performance.
- Other great perks include a host of best-in-class benefits: maternity and parental leave, PTO, health benefits, and much more.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
time-series models, traffic forecasting, resource demand forecasting, anomaly detection, exploratory data analysis, statistical techniques, machine learning techniques, Python, Kubernetes, SQL
Soft skills
problem-solving, collaboration, communication, leadership, critical thinking, adaptability, mentoring, ownership, attention to detail, proactive