Senior SRE | Platform Engineering | AI Systems

I engineer calm inside complex systems. Now I am applying that reliability mindset to the next generation of AI platforms.

Over the last 13+ years, I have designed cloud platforms, observability systems, disaster recovery programs, and automation layers for high-scale production environments. My edge is turning operational complexity into products that teams can trust.

AWS + Azure | Kubernetes + GitOps | Observability + SLOs | AI-ready operations
13+ years designing resilient systems
100+ production apps modernized or migrated
40% faster MTTR through observability design
25% cloud cost reduction through platform discipline
Saroj Priyadarshi
Current role: Senior Site Reliability Engineer at OPENLANE
Operating Focus

From reliability engineering to AI-native infrastructure

I build the layers that make teams faster without making production riskier: self-service platform paths, deep telemetry, incident guardrails, and increasingly, AI-assisted operational workflows.

Platform systems
Operational intelligence
Human-centered automation
AI-native operations
Based in Indianapolis, Indiana
Leading reliability and platform work across AWS (ECS and EKS) and Azure
Designing golden paths for GitOps, observability, and compliance by default
Applying AI patterns to incident triage, runbook intelligence, and platform automation
Story

My career started in traditional infrastructure, matured in cloud-native reliability, and is now bending toward AI systems that need the same rigor as any critical platform.

I grew my operational instincts in global enterprise environments where uptime, security, and execution discipline were non-negotiable. That early work sharpened the habits I still rely on today: simplify failure paths, automate the repeatable, and make systems legible under pressure.

At OPENLANE, I have spent years modernizing and operating a large-scale marketplace platform across AWS and Azure. The work spans migrations, Kubernetes, GitOps, observability, on-call systems, cost optimization, and resilience engineering for more than one hundred production applications.

What excites me now is bringing that foundation into AI. I am especially interested in inference platforms, AIOps, evaluation pipelines, and the operational controls that make model-driven products reliable, explainable, and safe to evolve.

2012 — 2017

Enterprise foundations

Built operational depth across manufacturing and banking environments where reliability was measured in business continuity, not just dashboards.

Managed critical infrastructure programs at Infosys and Wells Fargo.
Learned to treat runbooks, escalation, and recovery as design problems.
2017 — today

Cloud-native scale

Evolved into platform and reliability leadership at OPENLANE, spanning migrations, multi-cloud architecture, GitOps, SLOs, and chaos engineering.

Modernized more than 100 applications with zero-downtime migration patterns.
Reduced MTTR by 40% while improving telemetry quality and incident response.
Current vector

AI with operational discipline

Applying SRE thinking to AI workflows: guarded automation, trustworthy observability, resilient inference, and human-in-the-loop control planes.

AWS Certified AI Practitioner with a reliability-first lens on AI systems.
Focused on infrastructure roles where platform engineering and AI intersect.
Signals
AWS AI Practitioner (2025)
Certified Kubernetes Administrator (2023)
MS, IT Management, Indiana University Kelley School of Business
Innovation Awards Jury, Business Intelligence Group (2025)
Capabilities

Engineering depth presented as systems, not as a flat list of tools.

My toolkit matters because of what it lets teams achieve: calmer operations, cleaner delivery paths, and more confidence in the systems they are responsible for.

Reliability Systems

Designing platforms that stay understandable under stress

I treat observability, incident response, SLOs, and failure testing as one connected operating system rather than separate tools.

Outcome
40% MTTR reduction
SLO programs, chaos drills, error budgets, incident design
Experience signal

A quick visual cue for where this capability has the most depth. It is not a scored metric.

OpenTelemetry
Prometheus
Grafana
Splunk
AppDynamics
Datadog
Platform Engineering

Creating golden paths that scale across teams and services

I build reusable delivery systems so teams inherit guardrails, compliance, and velocity without needing a ticket for every decision.

Outcome
100+ apps modernized
Terraform modules, GitOps, cluster lifecycle, self-service delivery

Terraform
ArgoCD
FluxCD
Helm
Kustomize
GitHub Actions
Cloud Economics

Operating with performance and cost in the same frame

Reliability is stronger when capacity, autoscaling, and FinOps are designed together instead of traded off after the fact.

Outcome
25% cloud cost reduction
Capacity planning, autoscaling, right-sizing, migration economics

AWS
Azure
EKS
ECS
RDS
CloudFront
AI Transition

Bringing infrastructure-grade rigor into AI operations

My current direction is building AI-assisted operational workflows and the platform controls needed to run model-powered systems responsibly.

Outcome
AI-ready operational workflows
Runbook retrieval, guardrails, evaluation loops, inference resilience

Python
Go
Vector Search
Prompt Guardrails
Telemetry
Automation
Tools & Technology

The operating stack behind the systems I build, scale, and keep dependable.

These are the platforms, runtimes, delivery systems, and telemetry tools I have used across cloud migration, Kubernetes, GitOps, observability, and automation-heavy reliability engineering.

35 tools in active use
6 stack categories
13+ years across infra and SRE

I use tools as part of an operating system, not as a collection of disconnected badges. The goal is always the same: delivery paths that are faster, safer, more observable, and easier for teams to trust.

Current Focus
Multi-cloud platforms | Kubernetes operations | GitOps delivery | Observability at scale | Infrastructure automation | AI-ready operations
Cloud Platforms

4 tools

Stack

Multi-cloud platforms and edge services I have used to modernize, scale, and harden production environments.

AWS
Microsoft Azure
GCP
Cloudflare
Container Orchestration

7 tools

Stack

Container orchestration and service delivery tooling used for cluster operations, packaging, and traffic management.

Kubernetes (EKS/AKS)
Docker
Docker Swarm
OpenShift
Helm
Kustomize
Linkerd
Infrastructure as Code

7 tools

Stack

Declarative provisioning and configuration systems that make infrastructure repeatable, reviewable, and auditable.

Terraform
Pulumi
Ansible
Puppet
Chef
CloudFormation
CDK
CI/CD & GitOps

5 tools

Stack

Delivery pipelines and GitOps tooling that turn deployment workflows into reliable operating paths.

GitHub Actions
Azure DevOps
Jenkins
ArgoCD
FluxCD
Observability

6 tools

Stack

Telemetry, tracing, and monitoring tools that make large systems diagnosable under pressure.

Splunk
AppDynamics
Prometheus
Grafana
Datadog
OpenTelemetry
Scripting & Automation

6 tools

Stack

Languages and operating environments I use to automate toil, build internal tooling, and debug complex systems.

Python
Go
Bash
Ruby
Linux
PowerShell
Selected Work

Four case studies that show how I think about scale, failure, and the future of AI-assisted operations.

Each project is framed around the operating problem, the architectural response, and the outcomes that mattered to the business and the teams shipping inside the system.

Orbit Control Plane
Platform orchestration
Golden paths, GitOps lanes, and multi-cloud control at product scale.
Role + context
Senior SRE / Platform Architect
OPENLANE | 2022-2025
Stack
AWS | Azure | Terraform | EKS | AKS | ArgoCD | FluxCD | Helm
Problem

Application teams were moving at different speeds, infrastructure standards were inconsistent, and provisioning still depended on manual handoffs. The result was slow onboarding, uneven security posture, and too much operational drift.

Approach

I designed a Git-centric control plane built on reusable Terraform modules, GitOps deployment flows, and cluster abstractions that encoded the preferred path. Teams could request environments through versioned templates while platform policies enforced consistency behind the scenes.

Architecture
Reusable landing-zone and service modules for AWS, Azure, EKS, and AKS.
GitHub Actions and Azure DevOps pipelines feeding ArgoCD and FluxCD deployment lanes.
Helm and Kustomize overlays that standardized application, secrets, and observability wiring.
Impact
Cut environment setup time from multiple days to less than 30 minutes.
Standardized deployment patterns across more than 100 applications.
Created a cleaner path for security, compliance, and cost guardrails by default.
Experience

A career arc built on high-consequence systems, now aimed at AI and platform leverage.

My track record combines operational sharpness, enterprise credibility, and the product thinking needed to build systems other engineers actually want to use.

2017 — Present

OPENLANE

Indianapolis, Indiana

Featured chapter

Reliability and platform leadership for a large-scale digital marketplace.

Led SRE initiatives across AWS and Azure for a cloud-native platform serving North America. The work spans multi-cloud operations, Kubernetes, GitOps, observability, migrations, cost discipline, on-call systems, and resilience engineering.

Directed zero-downtime migration efforts for more than 100 applications.
Reduced MTTR by 40% and service downtime by 20% through better observability and incident design.
Automated Kubernetes cluster lifecycle, GitOps delivery, and internal tooling in Python and Go.
Mentored a team of six engineers while improving system reliability and operational maturity.
AWS | Azure | Kubernetes | Terraform | ArgoCD | OpenTelemetry | Python | Go
2015 — 2017

Wells Fargo

Charlotte, North Carolina

Mission-critical infrastructure in a high-consequence financial environment.

Managed banking infrastructure where uptime, control, and execution quality were tightly coupled to customer trust and regulatory rigor.

Improved system performance by 20% and reliability by 26% through infrastructure modernization.
Built operational discipline around deployments, resilience, and cross-team coordination.
Linux | Python | Bash | Jenkins
2012 — 2015

Infosys

Bangalore, India

Global infrastructure programs across manufacturing and enterprise systems.

Built my early systems engineering instincts on transformation programs for BMW and Baker Hughes, learning how to improve reliability in complex, multi-team environments.

Improved resilience by 30% on large infrastructure transformation efforts.
Recognized for both technical delivery and high-trust client execution.
Linux | Windows Server | Shell Scripting | Operations
MS in IT Management
Indiana University Kelley School of Business
AWS AI Practitioner
Certified in 2025
CKA + Terraform + PagerDuty
Operational depth across cloud-native delivery
Industry recognition
BIG Innovation Awards jury member and Indian Achiever Award recipient
Insights

A few principles that guide how I design platform, reliability, and AI systems.

These are the ideas I keep coming back to when I am shaping architecture, reviewing tradeoffs, or helping teams move from manual heroics to resilient delivery.

01

Reliability is a product surface

The strongest platform teams do not treat reliability as a background activity. They design it into onboarding, deployment, observability, and recovery so that engineers feel quality through the product itself.

02

AI needs the discipline SRE already learned the hard way

Inference systems, evaluation loops, and agent workflows still need guardrails, traceability, rollback paths, and failure budgets. AI becomes more trustworthy when its operating model is engineered with the same seriousness as production infrastructure.

03

Runbooks should evolve into software

Every repeated operational decision is an opportunity to move knowledge out of chat history and into tooling. That is the bridge between manual heroics and calm, scalable engineering systems.

Build With Intention

Building premium infrastructure for teams that cannot afford fragile systems.

If you are hiring for platform engineering, reliability, or AI infrastructure roles, I bring a rare mix of operational depth, architectural judgment, and strong product instincts.

Staff / Principal SRE | Platform Engineering | AI Infrastructure | Cloud Architecture
Contact Me
Start the conversation

Share the role, team, or problem space and I'll reply with the best next step.

I usually respond within 1-2 business days. If your note is urgent, feel free to email me directly instead.