Job Description Summary
The Site Reliability Engineering team is responsible for the availability and reliability of our worldwide cloud-based applications and platform. We obsess over availability: we build tools, engineer new systems to automate our platform and apps, and have the freedom to cut across all organizations to identify availability impediments and drive them to closure. We are software engineers with full visibility and influence across the entire technical stack. We strive to ensure that some of the biggest companies and industries in the world always have reliable access to the software and solutions that power their businesses.
Job Description
Roles and Responsibilities
In this role, you will:
• Develop automated solutions to predict and address potential problems before they result in a service interruption
• Utilize monitoring and alerting systems to construct the foundations for self-healing systems
• Participate as part of an On-Call team to address production outages
• Identify potential process improvements, and drive them to resolution, across the entire NOC, Support, and Engineering organizations
• Define and drive architectural enhancements into the system to mitigate potential failure points
• Investigate the root causes of severe and systemic outages and identify corrective actions
• Establish performance baselines and capacity thresholds, correlate events, and define monitoring/alerting criteria
• Collaborate with different business units worldwide, providing a bastion of technical expertise
• Provide technical coaching and direction to more junior teammates
Eligibility Requirements: Legal authorization to work in the U.S. is required. We will not sponsor individuals for employment visas, now or in the future, for this job.
Education Qualification
Bachelor's Degree in Computer Science, Information Management, or a “STEM” major (Science, Technology, Engineering, and Math) with a minimum of 2 years of experience
Desired Characteristics
Technical Expertise:
• Experience configuring, customizing, and extending monitoring tools (AppD, Splunk, New Relic, Sensu, Graphite, etc.)
• Demonstrated ability to script around repeatable tasks (Go, Ruby, Python, Bash)
• Excellent knowledge of UNIX system internals
• Strong analytical and problem-solving skills
• Experience with all stages of an agile software development lifecycle (CI/CD)
• Experience with developing cloud-native applications (High Availability)
• Able to dive into any level of a modern internet service (schedulers, containers, Linux kernel, caching, object storage, distributed filesystems, RDBMS, NoSQL, etc.)
• Comfortable with network troubleshooting (tcpdump, routing, proxies, firewalls, load balancers, etc.)
• Able to troubleshoot and debug applications (C, Java, Go)
• Proficient in configuration management systems (Chef, Terraform, Ansible, Puppet, Salt)
• Experience deploying and managing infrastructure on public clouds (AWS, GCP, or Azure)
• Comfortable using Git on the command line
Leadership:
• Influences through others; builds direct and "behind the scenes" support for ideas.
• Preemptively sees downstream consequences and effectively tailors influencing strategy to support a positive outcome.
• Able to verbalize what is behind decisions and downstream implications.
• Continuously reflects on successes and failures to improve performance and decision-making.
• Understands and encourages change when needed.
• Proactively identifies and removes project obstacles or barriers on behalf of the team.
• Able to navigate accountability in a matrixed organization.
• Self-starter; communicates and demonstrates a shared sense of purpose. Learns from failure.
Personal Attributes:
• Critical thinker; able to quickly adapt to changing environments
• A hacker or tinkerer at heart
• Risk taker, not afraid to think outside the box or challenge the status quo
• Emotionally intelligent; able to influence up and out and to work independently
• Must be a team player with a strong desire to win
• Passionate about continuously learning
• Highly organized and efficient; able to balance competing priorities and execute accordingly
• Strong oral and written communication skills