TCGplayer (an eBay company)

Frontline Engineer

Full-time

Posted on:

Location Type: Remote

Location: Montana, United States

Salary

💰 $103,200 - $178,400 per year

About the role

  • Serve as Incident Commander, leading real-time response efforts, managing communication across teams, triaging issues, and driving resolution of high-priority incidents to minimize downtime and business disruption.
  • Execute documented runbooks for troubleshooting and resolving production incidents involving AWS services (EC2, CloudWatch, IAM) and Kubernetes clusters (pods, deployments, scaling).
  • Collaborate closely with engineering teams post-incident, performing root cause analysis, documenting lessons learned, and driving the implementation of durable solutions.
  • Drive operational excellence by measuring and analyzing critical metrics (e.g., MTTR, SLA adherence) to identify improvement opportunities and implement impactful solutions.
  • Continuously refine and update operational runbooks and procedures, ensuring alignment with evolving technologies and business needs.
  • Proactively contribute to long-term strategic initiatives to improve incident management practices.

Requirements

  • A Bachelor’s degree in a technical field or equivalent experience (5+ years) in system administration, infrastructure engineering, or related roles; relevant certifications are a plus.
  • Direct experience as an incident commander, including managing live incident calls, coordinating triage efforts, and driving communications during high-pressure situations.
  • Strong communication skills with the ability to clearly articulate technical details and strategies to both technical and non-technical stakeholders.
  • Excellent problem-solving capabilities, able to stay composed and decisive under pressure during high-impact incidents.
  • Hands-on operational experience with AWS in a production environment, specifically executing runbooks, restarting EC2 instances, checking alarms, and pulling logs from CloudWatch.
  • Proficiency with Kubernetes, including troubleshooting containerized workloads, understanding pod health, managing deployments, and interacting directly with Kubernetes clusters.
  • Experience with scripting (Python, PowerShell, or Bash) to automate operational tasks or assist in incident resolution workflows.

Benefits

  • Full range of medical, financial, and/or other benefits
  • 401(k) eligibility
  • Various paid time off benefits, such as PTO and parental leave

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

AWS, Kubernetes, Python, PowerShell, Bash, incident management, troubleshooting, root cause analysis, runbooks, operational metrics

Soft Skills

communication, problem-solving, leadership, decisiveness, collaboration, strategic thinking, composure under pressure, documentation, analytical skills, team coordination

Certifications

Bachelor's degree in a technical field, relevant certifications