Sr Lead/Architect - Site Reliability Engineer

Mid / Senior


In Office

Meytier Premier Employer

Meytier Partner

  • Analyse existing Service Level Objectives (SLOs), and create and maintain new ones.
  • Troubleshoot, evaluate, and resolve operational challenges that affect the defined SLOs.
  • Identify architectural bottlenecks in applications as observed in the landscape, and define and help implement improvements.
  • Work with other engineering stakeholders on resolving larger architectural bottlenecks.
  • Work in close collaboration with software development teams to consult on scaling concerns.
  • Contribute to the future roadmap of software development teams and establish strong operational readiness across teams.
  • Scale systems through automation, improving change velocity and reliability.
  • Leverage technical skills to partner with team members and be comfortable diving into a problem as needed.
  • Work to enable other teams to scale through automation, knowledge-sharing, and self-service activities.
  • Automate every operational task; this is a core requirement for the role. Examples include package updates, configuration changes across all environments, and tools for automatic provisioning of user-facing services.
  • Respond to platform emergencies, alerts, and escalations from Customer Support.
  • Ensure systems exist to manage software life cycles (e.g. operating systems) with a minimum of manual effort.
  • Develop a fully automated multi-environment observability stack based on the tool sets available in the landscape, and extend it to predict capacity needs from usage patterns.
  • Plan for new service rollouts and for the expansion and capacity management of existing services, and work with users to optimise their resource consumption.
  • Establish clear, ongoing cloud efficiency metrics, defining both how we should measure success and which methods will achieve the improved results.
  • Implement tools, practices, and processes that enable other teams to contribute to efficiency in their areas.
  • Plan and implement the changes needed in cloud environments to drive better observability of usage and improved efficiency.

Desired Profile:

  • Configuration management: use Chef and Ansible to effectively manage our infrastructure.
  • Infrastructure as code: use Terraform and Azure DevOps CI/CD for automation, containerize our environments (Kubernetes), and leverage cloud technologies to meet our goals.
  • Systems: manage, configure, and troubleshoot operating system issues, storage (block and object), networking, security, load balancers, Azure Defender, and Application Gateway.
  • Monitoring and instrumentation: implement metrics in Prometheus and Grafana, log management and related systems, and Slack/PagerDuty integrations.
  • Engineering practices: availability, reliability, and scalability, as well as disaster recovery.
  • Work in a variety of languages: Shell, Ruby, Go, Python.
  • Advanced knowledge of cloud services.
  • Kubernetes: cluster provisioning, new services, and troubleshooting.
  • Prometheus, Thanos, and Grafana: service catalog metrics and recording rules for alerts.
  • Log shipping pipelines and incident debugging visualizations.
  • Operating system (Linux) configuration, package management, startup, and troubleshooting.
  • Block and object storage configuration and debugging.
  • Terraform syntax and Azure DevOps CI/CD configuration, pipelines, and jobs.

© 2024 Meytier - All Rights Reserved.