Analyse existing Service Level Objectives (SLOs), and create and maintain new ones.
Troubleshoot, evaluate, and resolve operational challenges that affect defined SLOs.
Identify architectural bottlenecks in applications as observed in the landscape, and help adapt the architecture to remove them.
Work with other engineering stakeholders on resolving larger architectural bottlenecks.
Work in close collaboration with software development teams to consult on scaling concerns.
Contribute to the future roadmap of software development teams and establish strong operational readiness across teams.
Scale systems through automation, improving change velocity and reliability.
Leverage technical skills to partner with team members and be comfortable diving into a problem as needed.
Work to enable other teams to scale through automation, knowledge-sharing, and self-service activities.
Automate every operational task, which is a core requirement for this role: for example, package updates, configuration changes across all environments, and tools for automatic provisioning of user-facing services.
Respond to platform emergencies, alerts, and escalations from Customer Support.
Ensure systems exist to manage software life cycles (e.g. operating systems) with a minimum of manual effort.
Develop a fully automated multi-environment observability stack based on the tool sets available in the landscape, and extend it to predict capacity needs from usage patterns.
Plan for new service rollouts, expansion and capacity management of existing services, and work with users to optimise their resource consumption.
Establish clear, ongoing cloud efficiency metrics, defining both how we should measure success and which methods will achieve the improved results.
Implement tools, practices, and processes to enable other teams to contribute to efficiency in their areas.
Plan and implement needed changes in cloud environments to drive better observability of usage and improved efficiency.
Configuration management: use Chef and Ansible to effectively manage our infrastructure.
Infrastructure as code: use Terraform and Azure DevOps CI/CD for automation, containerize our environments (Kubernetes), and leverage cloud technologies to meet our goals.
Systems: manage, configure, and troubleshoot operating systems, storage (block and object), networking, security, load balancers, Azure Defender, and Application Gateway.
Monitoring and instrumentation: implement metrics in Prometheus and Grafana, log management and related systems, and Slack/PagerDuty integrations.
Engineering practices: availability, reliability, scalability, and disaster recovery.
Work in a variety of languages: Shell, Ruby, Go, and Python.
Advanced knowledge of cloud services
Kubernetes: provisioning of clusters and new services, and troubleshooting
Prometheus, Thanos, and Grafana: service catalog metrics and recording rules for alerts
Log shipping pipelines and incident debugging visualizations
Operating system (Linux) configuration, package management, startup, and troubleshooting
Block and object storage configuration and debugging
Terraform syntax and Azure DevOps CI/CD configuration, pipelines, and jobs