Manager, Site Reliability Engineering

Sandy Springs, GA · Information Technology
This is an exciting opportunity for a Manager on the Consumer Site Reliability Engineering (SRE) team at IMT. IMT is a division of our client, which operates numerous financial and commodity marketplaces and exchanges, including the New York Stock Exchange (NYSE).
This position is for a hands-on technical manager to lead a team of site reliability engineers focused on providing resilient, secure, scalable and supportable services for mortgage borrowers and lenders. You will contribute to the team's strategy and delivery and manage its day-to-day workload. This role requires building close relationships with our customer support, operations, engineering, database and product organizations.
You will be involved in designing resilient systems, defining and monitoring SLIs/SLOs, creating proactive, actionable alerts, and driving the response to production incidents. We operate in a hybrid, multi-cloud environment supporting Windows, Linux and container-based applications.
  • Provide thought leadership and set the technical direction for the SRE team
  • Define and manage projects to meet team objectives
  • Set individual goals and manage the professional growth of team members
  • Manage and troubleshoot a diverse set of SaaS Applications and internal services
  • Serve as the face of a team responsible for the overall health, performance, and capacity of our business applications
  • Develop sustainable SRE practices around simplification and standardization
  • Drive the cultural standard for SRE, including defining ways of working, runbooks and accountability across people, processes and technology
  • Lead incident response and root cause analysis
  • Partner with other SRE teams and lead by example
Knowledge and Experience
  • 3+ years of managing high-performance teams in
  • 10+ years of Application/Systems engineering in 24x7 Production Services environments
  • BS in Computer Science, Computer Engineering, Math, or equivalent professional experience
  • Experience designing, deploying and operating SaaS applications and cloud infrastructure (AWS or equivalent, and on-premises virtualized environments)
  • Excellent troubleshooting skills spanning systems, networks and code, using a systematic problem-solving approach
  • Proven track record of decreasing MTTR (Mean Time To Recovery), increasing MTTF (Mean Time To Failure), and improving overall service quality
  • Demonstrated ability to lead incident response and root cause analysis (RCA)
  • Fluency with one or more current-generation scripting languages used by SRE/DevOps professionals (PowerShell, Python, Perl, PHP, Ruby), plus Java/.NET development
  • Strong communication skills

