Sr Support Engineer (Site Reliability Engineering (SRE) E3)

Job Location:

Toronto - Canada

Monthly Salary: Not Disclosed

Posted on: 17 hours ago

Vacancies: 1 Vacancy

Job Summary

Sr Support Engineer (Site Reliability Engineering (SRE) E3)
Toronto, ON, Hybrid - 2-3 days WFO
6-12 months

Key Responsibilities
Observability, SRE, or DevOps roles with proven expertise across infrastructure and application-level reliability. Dynatrace, ELK, Splunk, and PagerDuty; SLI/SLO frameworks. Azure Kubernetes Service, Terraform, and Azure managed services.

What will you do
Design and implement observability-as-code solutions using Terraform to deploy monitoring pipelines, dashboards, and alerting strategies across distributed [...]
[...] observability improvements leveraging industry-leading tools (Dynatrace, ELK, Splunk, PagerDuty) to achieve real-time performance insights and comprehensive system [...]
[...] applications for end-to-end observability, implementing distributed tracing, metrics collection, and log aggregation across microservices and event-driven [...]
[...] complex incidents in production environments, diagnosing root causes across multiple service layers, databases, caches, and APIs under load using SLI/SLO and [...]
[...] and resolve Azure Kubernetes Service (AKS) infrastructure, ensuring reliability and scalability of containerized workloads with deep proficiency in Terraform and Azure managed services (SQL MI, Redis, Functions, Event Grid).
Translate business requirements into observable, resilient systems that meet defined SLIs/SLOs and drive reliability [...]
[...] operational tasks to reduce toil and improve system resilience through infrastructure-as-code and CI/CD best practices.
[...] incident response and remediation for mission-critical systems, conducting blameless postmortems and building resilience through chaos engineering and tabletop exercises.
[...] cross-functionally with development, platform, and business teams to improve service availability, scalability, and operational [...]

What do you need to succeed
Must-have
8 years of hands-on experience in observability, SRE, or DevOps roles with proven expertise across infrastructure and application-level reliability.
[...] expertise in observability tooling: Dynatrace, ELK, Splunk, and PagerDuty; demonstrated understanding of observability principles (instrumentation, correlation IDs, SLI/SLO frameworks).
Advanced proficiency with Azure Kubernetes Service (AKS), Terraform, and Azure managed services (SQL MI, Redis, Functions, Event Grid); proven ability to design and implement infrastructure-as-code [...]
[...] hands-on experience instrumenting applications for comprehensive observability: distributed tracing, metrics collection, and log aggregation across applications in microservices and event-driven [...]
[...] troubleshooting expertise in distributed systems, diagnosing root causes across multiple service layers, databases, caches, and APIs in production environments.
[...] incident management skills; hands-on experience with PagerDuty and ServiceNow; ability to resolve high-severity incidents rapidly and conduct effective root cause analysis.
[...] of incident, problem, and change management processes, including SRE principles, blameless postmortems, and chaos engineering [...]
[...] communication and leadership skills to coordinate across business and IT teams; ability to lead remote