Manager Site Reliability Engineering

Sandy Springs, GA · Information Technology

This is an exciting opportunity for a Manager in the Consumer Site Reliability Engineering (SRE) Team at IMT. IMT is a division of our client, which operates numerous financial and commodity marketplaces and exchanges, including the New York Stock Exchange (NYSE).
This position is for a hands-on technical manager to lead a team of SREs focused on providing resilient, secure, scalable and supportable services for mortgage borrowers and lenders. You will contribute to the team's strategy and delivery and manage its day-to-day workload. This role requires building close relationships with our customer support, operations, engineering, database and product organizations.
You will be involved in designing resilient systems, defining and monitoring SLIs/SLOs, creating proactive, actionable alerts, and driving production incidents to resolution. We operate in a hybrid, multi-cloud environment supporting Windows, Linux and container-based applications.

Responsibilities

Provide thought leadership and set the technical direction for the SRE Team
Define and manage projects to meet Team objectives
Set individual goals and manage the professional growth of team members
Manage and troubleshoot a diverse set of SaaS Applications and internal services
Serve as the face of a team responsible for the overall health, performance, and capacity of our business applications
Develop sustainable SRE practices around simplification and standardization
Drive the cultural standard for SRE, including defining ways of working, runbooks and accountability across people, processes and technology
Lead Incident Response and Root Cause Analysis
Partner with other SRE teams and lead by example

Knowledge and Experience

3+ years of managing high-performance teams in
10+ years of Application/Systems engineering in 24x7 Production Services environments
BS in Computer Science, Computer Engineering, Math, or equivalent professional experience
Experience in designing, deploying and operating SaaS applications and cloud infrastructure (AWS or equivalent, and on-premises virtualized environments)
Excellent troubleshooting skills spanning systems, networks and code, using a systematic problem-solving approach
Proven track record of decreasing MTTR (Mean Time To Recovery), increasing MTTF (Mean Time To Failure), and improving overall service quality
Demonstrated ability to lead Incident Response and Root Cause Analysis (RCA)
Fluency with one or more current-generation scripting languages used by SRE/DevOps professionals (PowerShell, Python, Perl, PHP, Ruby), plus Java/.NET development
Strong communication skills
