Sr Lead/Architect - Site Reliability Engineer

Mid / Senior

In Office

Analyse existing Service Level Objectives (SLOs), and create and maintain new ones.
Troubleshoot, evaluate, and resolve operational challenges contributing to defined SLOs.
Identify architectural bottlenecks in applications as they are observed in the landscape, and help define and implement improvements.
Work with other engineering stakeholders on resolving larger architectural bottlenecks.
Work in close collaboration with software development teams to consult on scaling concerns.
Contribute to the future roadmap of software development teams and establish strong operational readiness across teams.
Scale systems through automation, improving change velocity and reliability.
Leverage technical skills to partner with team members and be comfortable diving into a problem as needed.
Work to enable other teams to scale through automation, knowledge-sharing, and self-service activities.
Automate every operational task; this is a core requirement for the role. Examples include package updates, configuration changes across all environments, and tools for automatic provisioning of user-facing services.
Respond to platform emergencies, alerts, and escalations from Customer Support.
Ensure systems exist to manage software life cycles (e.g. Operating Systems) with a minimum of manual effort.
Develop a fully automated multi-environment observability stack based on the tool sets available in the landscape, and extend it to predict capacity needs based on usage patterns.
Plan for new service rollouts, expansion and capacity management of existing services, and work with users to optimise their resource consumption.
Establish clear ongoing cloud efficiency metrics, highlighting both how we should measure success and how to achieve improved results.
Implement tools, practices, and processes to enable other teams to contribute to efficiency in their areas.
Plan and implement needed changes in cloud environments to drive better observability of usage and improved efficiency.

Desired Profile:

Configuration management: use Chef and Ansible to effectively manage our infrastructure.
Infrastructure as code: use Terraform and Azure DevOps CI/CD for automation, containerize our environments (Kubernetes), and leverage cloud technologies to meet our goals.
Systems: manage, configure, and troubleshoot operating system issues, storage (block and object), networking, security, load balancers, Azure Defender, and Application Gateway.
Monitoring and instrumentation: implement metrics in Prometheus and Grafana, log management and related systems, and Slack/PagerDuty integrations.
Engineering practices: availability, reliability, and scalability, as well as disaster recovery
Work in a variety of languages: Shell, Ruby, GoLang, Python
Advanced knowledge of cloud services
Kubernetes: cluster provisioning, onboarding new services, and troubleshooting
Prometheus, Thanos, and Grafana: service catalog metrics and recording rules for alerts.
Log shipping pipelines and incident debugging visualizations
Operating system (Linux) configuration, package management, startup, and troubleshooting
Block and object storage configuration and debugging
Terraform syntax and Azure DevOps CI/CD configuration (pipelines and jobs).

Company: Tiger Analytics
Job type: Full Time
Location: Chennai / Bengaluru / Hyderabad
Experience: 8 to 12 years
Education: Baccalaureate Degree
Number of openings: 1
