Site Reliability Engineer

NOV

Job Location:

Kochi - India

Monthly Salary: Not Disclosed

Posted on: 14 hours ago

Vacancies: 1 Vacancy

Department:

Engineering

Job Summary

Description

We are looking for an experienced SRE to lead production reliability, performance tuning, and operational excellence across our platform. You'll work at the intersection of software engineering and systems engineering, driving improvements that directly impact product uptime, velocity, and user satisfaction.

If you are passionate about reliability, automation, and scalability, and have the technical depth to back it up, this role is for you.

Responsibilities:

As a Site Reliability Engineer, you will be responsible for Operational Excellence & Incident Management.
Maintain and monitor production systems for availability, latency, and performance.
Lead incident response efforts, including communication, resolution, and postmortem documentation.
Design and implement health checks, alerting systems, and automated remediation workflows.
Drive root cause analysis and implement permanent resolutions for recurring issues.
Set up and maintain full observability stacks (logging, metrics, tracing) using tools like Prometheus, Grafana, Datadog, OpenTelemetry, or ELK.
Analyze telemetry and logs to identify trends, anomalies, and opportunities for improvement.
Conduct post-incident reviews and use insights to inform future engineering investments.
Tune and optimize distributed systems, including actors, for performance and resource efficiency.
Work with developers to evolve architecture and improve system throughput, latency, and stability.
Optimize PostgreSQL performance, queries, and maintenance strategies.
Design and maintain modern CI/CD pipelines using GitHub Actions, Azure Pipelines, or GitLab CI.
Automate deployment, testing, and rollback processes to reduce friction and increase deployment frequency.
Standardize infrastructure-as-code practices across environments.

Requirements:

5 to 10 years of experience in SRE, DevOps, or Infrastructure Engineering roles.
Expertise in Kubernetes and container orchestration at scale.
Strong experience with actor-based frameworks.
Proficiency with scripting and automation (Bash, PowerShell, Python).
Experience with observability tools (Phobos, Datadog, Prometheus, Grafana, OpenTelemetry, ELK).
Hands-on experience with cloud platforms (AWS, Azure, or GCP).
Strong PostgreSQL knowledge: performance tuning, query optimization, maintenance.
Proven ability to lead incident management and drive postmortem processes.
A builder's mindset with high standards for operational excellence and technical ownership.
Preferred Tools & Ecosystem Experience:
CI/CD: GitHub Actions, Azure Pipelines, GitLab CI.
Infrastructure: Kubernetes, Docker, Terraform.
Monitoring: Phobos, Datadog, Prometheus.
Source Control: GitHub, GitLab, Azure DevOps.
Programming: C#, Python, Bash, PowerShell.

About Company

NOV

Every day, the oil and gas industry’s best minds put more than 150 years of experience to work to help our customers achieve lasting success. We Power the Industry that Powers the World. Throughout every region in the world and across every area of drilling and production, our family of ...
