Senior Cloud Engineer

Job Location:

San Francisco, CA - USA

Monthly Salary: Not Disclosed

Posted on: 9 hours ago

Vacancies: 1 Vacancy

Job Summary

About Gridware

Gridware is a San Francisco-based technology company dedicated to protecting and enhancing the electrical grid. We pioneered a groundbreaking new class of grid management called active grid response (AGR), focused on monitoring the electrical, physical, and environmental aspects of the grid that affect reliability and safety. Gridware's advanced Active Grid Response platform uses high-precision sensors to detect potential issues early, enabling proactive maintenance and fault mitigation. This comprehensive approach helps improve safety, reduce outages, and ensure the grid operates efficiently. The company is backed by climate-tech and Silicon Valley investors. For more information, please visit.

Role Description

We're scaling the deployment of critical infrastructure monitoring devices to detect real-world fault events that lead to wildfires. The platform you'll build and operate ingests millions of events per day from devices in the field, powers customer-facing dashboards and alerting, and supports the data science work that turns raw signals into grid intelligence.

You will own AWS infrastructure, Kubernetes (EKS), CI/CD, and observability end-to-end, partnering with our Cloud Security team to keep the platform safe and compliant, and with backend, firmware, and data teams to keep them shipping fast. As an early member of the DevOps team, you'll have a direct hand in shaping how Gridware builds, deploys, and runs production systems for years to come.

Responsibilities

Design, build, and operate scalable, secure, and highly available cloud infrastructure across AWS.
Own and evolve our Kubernetes platform, enabling reliable application deployment and operations through GitOps best practices.
Build and maintain CI/CD systems that improve developer velocity, release quality, and operational reliability.
Manage and optimize event-driven infrastructure powering high-volume telemetry and device data pipelines.
Define and maintain Infrastructure as Code standards, ensuring consistency, repeatability, and scalability across environments.
Develop and enhance observability, monitoring, and incident response capabilities to support reliable production operations.
Partner closely with Security and Engineering teams to strengthen platform security, access management, and operational resilience.
Troubleshoot complex production issues, drive root cause analysis, and turn lessons learned into automation, tooling, and operational improvements.

Required Skills

5 years of experience in DevOps, SRE, or Platform Engineering, operating production AWS environments
Deep expertise with Kubernetes (EKS preferred), GitOps workflows (Argo CD/Flux), and Infrastructure as Code (Terraform)
Strong experience building and maintaining CI/CD pipelines, ideally with GitHub Actions
Hands-on experience operating distributed systems and cloud-native platforms (e.g., Kafka/MSK)
Solid understanding of networking, DNS, TLS, identity/access management, and cloud security best practices
Experience with observability, monitoring, and logging tools such as Grafana, Prometheus, Loki, or similar
Strong Linux, scripting, and troubleshooting skills, with the ability to debug complex production issues end-to-end

Bonus Skills

Experience operating Apollo Router / GraphQL federation gateways in production.
Experience operating Argo Workflows or similar Kubernetes-native job / pipeline runners in production.
Familiarity with Databricks or ML Ops pipelines for data and model deployment.
Experience designing, operating, and exercising Disaster Recovery (DR) environments, including cross-region replication, backups, and tested failover runbooks.
Experience with Tailscale or other zero-trust networking tools.
Experience supporting IoT / embedded fleets at scale, including secure device-to-cloud connectivity.
Experience in high-growth startup environments where you must wear many hats.

$190,000 - $215,000 a year

This describes the ideal candidate; many of us have picked up this expertise along the way. Even if you meet only part of this list, we encourage you to apply!

Benefits

Health, Dental & Vision (Gold and Platinum, with some providers' plans fully covered)

Paid parental leave

Alternating day off (every other Monday)

Off the Grid, a two-week-per-year paid break for all employees

Commuter allowance

Company-paid training

Required Experience:

Senior IC
