Detect and document defects, bugs, and errors for the assigned component or module, and conduct analysis to determine their sources under guidance.
Troubleshoot performance and availability bottlenecks for assigned application under guidance.
Utilize established criteria (for example, probability of failure and frequency of failure) to measure site reliability.
Monitor site reliability conditions and new reliability requirements.
Assist in the design and development of a reliability program plan for a specific site environment.
Apply appropriate tools, services, or applications for reliability prediction and other site improvements.
Research and assess various reliability models for different site environments.
Assist in the creation of simple, modular, extensible, and functional designs for the product or solution in adherence to the requirements.
Evaluate tradeoffs while designing across multiple components in a system based on the business requirements.
Convert the high-level design (HLD) into a detailed design for specific modules or components of a product or system.
Understand nuances of designing for disaster recovery.
Undertake infrastructure coding and automation.
Create and configure minimalistic, less complex, highly robust, and high-quality code for a component or module under guidance.
Maintain records by documenting program development and revisions.
Stay updated on the prevalent coding languages and frameworks in the industry outside the immediate scope of delivery.
Identify repetitive and routine tasks in Continuous Integration/Continuous Delivery (CI/CD), testing, or any other process that can be automated.
Implement telemetry features as required under guidance.
Apply security policy requirements to the component or module during code development and configuration.
Work with business partners to identify and document critical applications.
Interpret and follow procedures in contingency plans.
Explain the contingency and disaster recovery plans for the assigned environment.
Execute established procedures necessary to continue operations in an emergency.
Participate in the design of a minimum operating environment for a computer-based facility.
Suggest metrics to monitor software or system performance.
Monitor current performance data to ensure compliance with defined SLOs for multiple applications or systems.
Determine thresholds for monitoring metrics and trigger alerts based on those thresholds.
Supervise specific procedures to proactively check the health of applications and infrastructure, including a variety of operating systems, hardware, and software.

Requirements

Master’s degree or equivalent in computer science, computer engineering, computer information systems, software engineering, or related area and 1 year of experience in site reliability engineering, site and system administration, infrastructure management, or related area; OR Bachelor’s degree or equivalent in computer science, computer engineering, computer information systems, software engineering, or related area and 3 years of experience in site reliability engineering, site and system administration, infrastructure management, or related area.
Experience designing and implementing performance test strategies for complex web, mobile, API, and backend systems for Jira and Confluence Data Center instances.
Experience building and maintaining automated performance test scripts using tools including JMeter, Gatling, LoadRunner, and k6.
Experience performing root cause analysis of performance issues in production and test environments for Jira and Confluence Data Center instances, identifying CPU, memory, database, thread, and network bottlenecks.
Experience monitoring system health, performance, and usage using tools including Grafana, Splunk, and Dynatrace, and ensuring compliance with internal SLAs.
Experience designing and implementing observability (monitoring, logging, alerting) and ensuring SLAs and SLOs are met.
Experience designing, implementing, and supporting large-scale Jira Software, Jira Service Management, and Confluence instances.
Experience performing upgrades, patching, plugin management, and performance tuning for Atlassian platforms.
Experience integrating enterprise platforms with CI/CD pipelines and observability tools to automate workflows, improve incident response, and enhance system reliability.
Experience managing infrastructure components including Linux servers, databases, and storage supporting Atlassian tools in both on-prem and cloud environments.
Experience working with scripting languages including Groovy, Bash, and PowerShell to automate tasks on Linux and Windows.
Experience implementing and maintaining backup, recovery, and disaster recovery plans for Atlassian tools.

Benefits

Health benefits include medical, vision, and dental coverage.
Financial benefits include 401(k), stock purchase, and company-paid life insurance.
Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting.
Other benefits include short-term and long-term disability, education assistance with 100% company-paid college degrees, company discounts, military service pay, adoption expense reimbursement, and more.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

site reliability engineering, performance testing, root cause analysis, infrastructure management, disaster recovery, scripting languages, automation, observability, monitoring, logging

Soft Skills

troubleshooting, analytical skills, communication, collaboration, problem-solving, attention to detail, organizational skills, adaptability, critical thinking, time management