The Chief Technology Office (CTO) of this major investment bank is leading technology transformation in a multi-cloud technology ecosystem, delivering products and innovative solutions at scale that are secure and always on. We are hiring a Senior Site Reliability Engineering (SRE) Manager to build and grow the CTO SRE teams and practice at Company. We believe that "Hope is not a Strategy" and that operational issues should be solved through code.
The Senior SRE Manager will launch the SRE Practice across Company's Technology, working with CIOs and operate leaders to build embedded SRE teams within the lines of business (LOBs). The person in this role will build and manage a team of 15+ highly capable full-stack Software Engineers (SWEs) and Site Reliability Engineers (SREs) to ensure Company's products and services are always on. This manager will ensure applications onboarded to SRE are instrumented for full-stack observability and continuous testing, introduce continuous improvement, integrate into IT Service Operations, and share support responsibilities for critical customer journeys, business flows, and applications. They will also forge the strategy for AIOps through AI/ML and NoOps, delivering strategic innovation to improve availability, stability, and resiliency.
This leader will be responsible for the following:
· Launch enterprise SRE practice to drive transformation of traditional IT Service Operations organizations
· Establish SRE practices and principles, and foster a culture of curiosity, credibly challenge assumptions, and influence change
· Manage, mentor, and inspire a team of highly talented software and site reliability engineers in CTO, and 60 embedded SREs in LOBs
· Gather requirements and onboard customer journeys to new enterprise observability, CI/CD, and continuous testing platforms
· Onboard 50+ customer journeys and apps to SRE in 2020, with growth to 300 applications over 3 years
· Lead the team to conduct SRE assessments, identify SLOs/SLIs, instrument applications for observability and testing, drive continuous improvement through automation and integration, and share support responsibilities for critical customer journeys and applications
· Establish and measure, through data, key metrics for customer impact, critical business flow availability, and customer satisfaction
· Launch error budgets and blameless postmortems, and drive remediation through Agile methodology
· Innovate, develop, build, buy, POC, and pilot the latest technology to improve stability, availability, and resiliency
· Establish an AIOps/NoOps practice leveraging AI/ML for operate data, building self-healing (autonomic healing) capabilities, and automating cognitive processes for resolving systemic issues
Proven Technical Expertise with one or more of the following:
· Software Development: Java, Go, C/C++, Angular, R, Scala
· OS and Platform: AWS, Lambda, EMR, PCF, Kubernetes, OpenShift, Linux, Azure, Windows, VMware
· CI/CD and Automation: Jenkins, GitLab, SonarQube, Artifactory, Ansible, Puppet, Apigee
· Observability and AIOps: Datadog, Grafana, Prometheus, ELK, Elastic, Kibana, Kafka, CloudWatch, Jaeger, Zipkin, Kinesis, Apache Airflow, AppDynamics
Experience in one or more of the following areas is desired:
· AIOps: Moogsoft, BigPanda, Robotic Process Automation (RPA), UiPath, Artificial Intelligence (AI) and Machine Learning (ML) Frameworks
· Operations Tools: ServiceNow, PagerDuty, Microsoft Teams, Symphony/Slack, Remedy, IBM Netcool
· Data/Data Structures: Oracle, SQL, MongoDB, Hadoop, Cloudera, Spark
· Testing: Gremlin, Chaos Monkey, Selenium, JMeter, BlazeMeter, Performance Center, Quality Center/ALM, DevTest
As a Team Member Manager, you are expected to achieve success by leading yourself, your team, and the business. Specifically, you will:
· Lead your team with integrity and create an environment where your team members feel included, valued, and supported to do work that energizes them.
· Accomplish management responsibilities which include sourcing and hiring talented team members, providing ongoing coaching and feedback, recognizing and developing team members, identifying and managing risks, and completing daily management tasks.
Locations: This position must be based in one of the locations listed on the job posting.
Relocation assistance is available for this position.
Required Qualifications
· 3+ years of Incident Management System experience
· 5+ years of development experience with languages such as Python, Java, Scala, or R
· 12+ years of application development and implementation experience
· 5+ years of management experience in technology
· Excellent verbal, written, and interpersonal communication skills
· Knowledge and understanding of data management and infrastructure production engineering leadership, including delivering reliable and responsive systems and the discipline to continually root out issues at the core
· 2+ years of Configuration Management Tools experience
· Experience with Agile Scrum (Daily Standup, Sprint Planning and Sprint Retrospective meetings) and Kanban
· 3+ years of experience with Cloud technologies
· Ability to interact with all levels of an organization
· Management experience including large/multiple application development efforts within a small-to-medium size line of business
Other Desired Qualifications
· Experience with system administration across multiple platforms
· Experience with Observability/Monitoring technologies: Datadog, Elastic Stack/ELK, Grafana, Prometheus, Kafka, CloudWatch
· Experience with Container technologies: Kubernetes, Docker, PKS
· Experience with one or more Technology Platforms (Cloud, OS, etc.): Pivotal Cloud Foundry (PCF), AWS, Azure, Linux, VMware
· Experience with one or more CI/CD and Automation tools: Jenkins, GitLab, SonarQube, Artifactory, Ansible, Puppet, Apigee
· Experience with one or more Testing tools/concepts: Gremlin, Chaos Monkey, Selenium, JMeter, BlazeMeter, Performance Center, Quality Center/ALM, DevTest