Engineer, Site Reliability
TMUS Global Solutions
Job Description
About T-Mobile: T-Mobile US, Inc. (NASDAQ: TMUS), headquartered in Bellevue, Washington, is America's supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.

TMUS Global Solutions: TMUS Global Solutions is a world-class technology powerhouse accelerating the company's global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.

About the Role: As an Engineer, Site Reliability (Gateway & AI Infrastructure), you will own the reliability, scalability, and operational excellence of T-Mobile's service mesh and gateway platform, built on Gloo Gateway and Istio, as well as the emerging AI infrastructure layer that supports AI agents, Model Context Protocol (MCP) servers, and large language model (LLM) gateways. You will bridge traditional SRE disciplines (automation, observability, incident response) with the fast-moving demands of AI-native workloads, ensuring that traffic routing, policy enforcement, and AI service reliability operate at carrier grade. Success is measured by gateway uptime, reduction in manual interventions, latency SLOs for AI workloads, and rapid recovery from incidents.

We pride ourselves on encouraging a culture of innovation, agile ways of working, and transparency in all we do. Join us in embodying the spirit of the Un-carrier and make a tangible impact!

What You'll Do:
- Deploy, configure, and operate Gloo Gateway and Istio service mesh across Kubernetes clusters; manage VirtualServices, AuthPolicies, RateLimitConfigs, and RouteOptions to enforce traffic policies at scale.
- Engineer and maintain API and AI gateway infrastructure, including LLM gateways, MCP server routing, and AI agent ingress, ensuring secure, low-latency, and policy-compliant connectivity between consumers and AI backends.
- Build and maintain observability solutions (dashboards, distributed tracing with Envoy/Jaeger/OpenTelemetry, log aggregation) tailored to service mesh and AI traffic patterns, to surface issues before they impact customers.
- Automate gateway lifecycle management (certificate rotation, Helm/GitOps upgrades, canary rollouts) to reduce toil and improve change safety across environments.
- Define and track SLIs, SLOs, and error budgets for API gateway and AI gateway workloads; drive data-informed reliability decisions and balance feature velocity against stability.
- Manage incident response (triage, mitigation, and recovery) for gateway-layer and AI service outages; lead blameless postmortems and implement corrective and preventive actions.
- Develop internal tooling and scripts (Python, Go, or Bash) to automate gateway configuration validation, AI traffic policy auditing, and runbook execution.
- Collaborate with platform, security, and AI engineering teams to design gateway ingress/egress patterns for AI agent-to-agent communication, MCP tool call routing, and LLM provider failover.
- Build and improve CI/CD pipelines for gateway configuration promotion across dev, staging, and production environments using GitOps workflows (Argo CD, Flux).
- Apply AI coding tools, prompt engineering, and agentic patterns as a core part of SRE workflows (runbook automation, incident summarization, gateway config generation, and agentic remediation) in production-grade implementations beyond prototypes.
- Adapt to and adopt emerging gateway and AI infrastructure standards; work in Agile ways with cross-functional engineering teams across the portfolio.

What You'll Bring:
- Bachelor's degree in Computer Science, Engineering, or a related field; or equivalent demonstrated experience.
- Minimum 3 years of related SRE, platform, or infrastructure engineering experience (or an advanced degree with 1 year of related experience).
- 2-4 years of hands-on experience with Istio or Envoy-based service meshes; familiarity with Gloo Gateway (Solo.io) is strongly preferred.
- Working knowledge of Kubernetes networking (Ingress, Gateway API, mTLS, network policies) and experience operating workloads on AWS, Azure, or GCP.
- Hands-on experience with, or strong understanding of, AI gateway concepts: LLM proxy routing, MCP server connectivity, AI agent traffic patterns, token-rate limiting, and semantic caching.
- Proficiency in Python, Go, or Bash for automation, tooling, and configuration management.
- Skilled in building and interpreting observability stacks: metrics (Prometheus/Grafana), tracing (Jaeger, OpenTelemetry), and log aggregation (ELK, Loki).
- 2-4 years of experience developing and maintaining CI/CD pipelines; GitOps experience with Argo CD or Flux is a plus.
- Strong incident management skills: triage, mitigation, RCA, and on-call readiness for gateway and AI services.
- Demonstrated proficiency using AI coding tools, prompt engineering, and agentic patterns in real engineering workflows, applied in the SDLC beyond prototypes.
- Understanding of Agile methodologies; self-directed, adaptable, and motivated in fast-moving environments.
- Preferred certifications: Certified Kubernetes Administrator (CKA), Istio Certified Associate, AWS Certified DevOps Engineer, or SRE Foundation Certification.

Must Have Skills:
- Gloo Gateway and/or Istio service mesh operations.
- Kubernetes networking, Gateway API, and Envoy proxy configuration.
- AI gateway, MCP server routing, and LLM gateway operations.
- Monitoring, alerting, and observability tooling (Prometheus, Grafana, OpenTelemetry).
- Python, Go, or Bash scripting and process automation.
- CI/CD pipeline development and GitOps-based deployment workflows.
- Incident response management and root cause analysis for gateway and AI services.
- AI coding tools, prompt engineering, and agentic skill application in the SDLC (production-grade, beyond prototypes).