Site Reliability Engineer
HDFC securities
Job Description
As a Site Reliability Engineer - Application Support, you will:

- Ensure System Reliability & Availability: Monitor, troubleshoot, and maintain critical backend applications and infrastructure to meet SLA/SLO targets and ensure high availability of trading platforms
- Implement SRE Best Practices: Design and implement monitoring, alerting, and observability solutions using tools like Grafana, Dynatrace, and Elasticsearch to proactively identify and resolve issues
- Automate Operations: Develop automation scripts and tools using Linux shell scripting and Python to reduce manual intervention, improve system efficiency, and eliminate toil
- Manage Cloud Infrastructure: Work with AWS services and Terraform to provision, manage, and optimize cloud infrastructure while ensuring cost efficiency and security
- Container Orchestration: Manage and troubleshoot Kubernetes clusters and deployments, ensuring optimal performance and resource utilization
- Incident Response & Management: Participate in on-call rotations, lead incident response efforts, perform root cause analysis, and implement preventive measures to reduce recurrence
- Performance Optimization: Conduct performance testing, capacity planning, and load testing to ensure systems can handle peak trading hours and scale effectively
- CI/CD Pipeline Understanding: Work with CI/CD tools like GitLab Runner and Argo CD to ensure smooth and reliable deployment processes
- Database Support: Troubleshoot and optimize Redis caching layers and Oracle databases, including writing and debugging PL/SQL queries for performance tuning
- Collaboration & Documentation: Work closely with development teams to improve application reliability, create runbooks and SOPs, and maintain comprehensive technical documentation
- Continuous Improvement: Analyze system metrics, identify bottlenecks, and propose architectural improvements to enhance reliability and performance

We are looking for someone with:

- 5-7 years of hands-on experience in SRE, DevOps, or Application Support roles, preferably in high-availability production environments
- Linux Administration: Strong experience with Linux systems, proficiency in shell scripting for automation, system monitoring, and troubleshooting
- Kubernetes: Hands-on experience managing Kubernetes clusters, troubleshooting pod issues, analyzing logs, configuring deployments, and understanding networking concepts
- AWS Cloud Services: Working knowledge of AWS services (EC2, S3, RDS, Lambda, CloudWatch, ECS, etc.) with experience in troubleshooting and optimizing cloud infrastructure
- Infrastructure as Code: Experience with Terraform or similar tools for provisioning and managing cloud resources
- Monitoring & Observability: Practical experience with APM tools (Dynatrace or similar), Grafana for dashboard creation, and log analysis using Elasticsearch/Kibana
- Database Management: Experience with Redis for caching solutions and Oracle databases, including basic PL/SQL querying and performance troubleshooting
- CI/CD Tools: Familiarity with GitLab, Jenkins, Argo CD, or similar CI/CD platforms for deployment automation
- Scripting & Programming: Proficiency in shell scripting; knowledge of Python or other scripting languages is a plus
- Incident Management: Experience with ServiceNow or similar ITSM tools, and an understanding of the ITIL framework for incident, problem, and change management
- SRE Principles: Understanding of SLIs, SLOs, SLAs, error budgets, and capacity planning concepts
- Problem-Solving Skills: Strong analytical and troubleshooting abilities with attention to detail
- Communication Skills: Ability to collaborate effectively with cross-functional teams and document technical processes clearly
- Education: Bachelor's degree in Computer Science, Information Technology, or equivalent practical experience

The following would be a plus:

- Prior experience in FinTech, Banking, or Financial Services industries, with an understanding of regulatory compliance requirements
- Experience with containerization technologies (Docker, Podman) and container security best practices
- Knowledge of API Gateway technologies (Kong, AWS API Gateway, etc.) for managing microservices communication
- Familiarity with chaos engineering and failure injection practices
- Experience with configuration management tools (Ansible, Chef, Puppet)
- Understanding of networking concepts, load balancers, and CDN technologies
- ITIL Foundation certification or strong working knowledge of ITIL processes
- Experience with security scanning tools and implementing security best practices in DevOps pipelines
- Contributions to open-source projects or active participation in technical communities
- Experience with disaster recovery planning and business continuity processes