As a Site Reliability Engineer, you will play a critical role in supporting application developers by providing expert guidance on application and infrastructure best practices from a reliability perspective. Your role covers the entire life cycle of a product/application. Your primary focus will be automation, observability, reliability, and release management with CI/CD, with an emphasis on solving operational issues.

You should have at least 3 years of SRE experience in large programs, with a focus on release engineering, observability, and reliability. You must have a good understanding of Site Reliability Engineering (SRE) and release management processes, and should possess strong analytical and troubleshooting skills. You should be a strong team player who enjoys collaborating with different people and profiles, shares knowledge, and strives for continuous development and learning. Excellent communication and leadership skills are required.

- Improve the reliability, quality, and time-to-market of our suite of products/applications
- Define suitable system metrics with SLOs/SLIs and set up observability mechanisms to track them
- Define error budgets based on the SLOs
- Define the strategy for, and set up, high-availability and load-balanced architecture
- Drive a metrics-driven culture and software delivery process, using data to measure overall system quality and reliability
- Balance feature development speed and reliability with well-defined service level objectives
- Provide primary operational support and engineering for products/applications
- Partner with solution architects and development teams to improve service reliability
- Participate in system design, infrastructure management, and capacity planning
- Participate in optimizing code, automating operational tasks, and reducing toil
- Provide solutions for performance management, disaster recovery, monitoring, and observability
- Work with business users to understand issues, develop root cause analyses, and work with the development team on enhancements/fixes
- Work with distributed traces to visualize the entire workflow and analyze the cause of problems/incidents
- Improve the security and performance of infrastructure and applications
- Support, improve, and implement infrastructure as code
- Define, evangelize, and maintain SRE best practices
- Design and implement DevSecOps best practices
- Improve automation, including the system’s self-healing capability

- Ability to develop value-creating strategies and models that enable clients to innovate, drive growth, and increase their business profitability
- Good knowledge of software configuration management systems
- Awareness of the latest technologies and industry trends
- Logical thinking and problem-solving skills along with an ability to collaborate
- Understanding of the financial processes for various types of projects and the various pricing models available
- Ability to assess current processes, identify areas for improvement, and suggest technology solutions
- Knowledge of one or two industry domains
- Client interfacing skills
- Project and team management