Site Reliability Engineer

at

nClouds

Bangalore, India
Full Time
3y ago

Company Description

nClouds is a certified, award-winning provider of AWS and DevOps consulting and implementation services, and an AWS Premier Consulting Partner. We are an integrated team of skilled engineers, architects, developers, project managers, and sales and marketing professionals who are passionate about software excellence, innovation, and client success. We work with organizations of all sizes, in all industries, including some of the coolest startups and growth companies in Silicon Valley.

Job Description

The SRE team is responsible for availability, reliability, performance, monitoring, change management, and emergency response for infrastructure and applications, and for reducing manual work by implementing SRE principles and practices. The team works directly with Dev/DevOps, Operations, Product, and other teams to deploy new features and changes and to maintain infrastructure, operations, CI/CD, and IaC, so that availability and reliability targets are met and SLOs and SLAs are protected. We use a variety of DevOps automation tools such as Ansible, Docker, Kubernetes, Terraform, and Jenkins, along with AWS-specific services such as ECS, CloudFormation, EKS, OpsWorks, and Elastic Beanstalk. The Site Reliability Engineer is expected to implement observability, SLIs, SLOs, SLAs, and disaster recovery and backup plans in cloud environments, mainly AWS.

  • Ensure the availability and reliability of distributed systems.
  • Help the L1 team resolve clients’ infrastructure issues, escalations, alerts, tickets, and queries.
  • Act as a bridge between DevOps and other teams to build and maintain resilient systems.
  • Conduct, coordinate, and oversee post-incident root cause analyses and reviews.
  • Build and maintain documentation for all assigned clients and projects.
  • Leverage DevOps, Agile methodology, and standards in day-to-day work. 
  • Adopt and propose automation of repetitive tasks to reduce/eliminate toil.
  • Implement observability and troubleshoot issues using tools such as Datadog, New Relic, Splunk, and CloudWatch.
  • Adopt SRE practices and ensure the team follows them.
  • Maintain AWS managed resources, CI/CD, and IaC.
  • Plan and implement disaster recovery and backup plans for AWS cloud platforms.
  • Proactively work on efficiency and capacity planning.
  • Reduce toil from repetitive tasks and proactively spot problems, areas for improvement, and performance bottlenecks.
  • Liaise and work closely with L1 on-call support, DevOps, and Operations teams.
  • Drive availability and reliability by defining and implementing SLIs, SLOs, error budgets, observability, disaster recovery, and backups to detect and mitigate issues.

Qualifications

  • Bachelor’s degree in computer science (preferred) or an equivalent management, technical, or scientific discipline.
  • Ability to program (structured and OO) in one or more high-level languages, such as Python, Java, C/C++, Ruby, or JavaScript.
  • A clear understanding of SRE principles and practices and Agile and DevOps methodologies.
  • Experience applying the AWS Well-Architected Framework to implement scalable and reliable infrastructure.
  • Knowledge of Kafka and event-driven architectures is a plus
  • Multi-cloud experience is a plus
  • Great team player with the flexibility to work in 24/7 rotating shifts.
  • Excellent written/verbal communication and leadership skills.

What to Expect:

First Week

  • Start the onboarding process that brings you into the SRE team.
  • Set up all your access and security policies.
  • Learn about nClouds practices, values, and solutions.
  • Meet the Lead and get familiar with the nClouds SRE offering and the current team structure.
  • Meet the team and get familiar with the team’s schedule.
  • Complete onboarding process.

First Month

  • Complete all assigned trainings.
  • Get assigned to projects and have the required access arranged.
  • Attend knowledge transfer sessions with the Team Lead and other team members.
  • Start joining customer calls.

First 3 Months

  • Become fully integrated with the L1 Support Team and help them resolve clients’ infrastructure and application issues, escalations, tickets, and queries.
  • Assist with and oversee the creation and maintenance of runbooks, post-incident root cause analyses (RCAs), and process documentation.
  • Build a close liaison with the client’s Product and Operations teams.
  • Develop a clear understanding of the client’s requirements, implement SLIs in line with the client’s SLOs, and ensure that they conform to the client’s SLAs.
  • Coordinate with the support team to implement comprehensive monitoring of the client’s applications and infrastructure, ensuring strict monitoring of SLIs.
  • Actively participate in the development and implementation of CI/CD, disaster recovery and backup plans, and other relevant processes to achieve the client’s Service Level Objectives (SLOs).

First Six Months

  • Take ownership of the SRE team’s practices and procedures and actively participate in their improvement.
  • Based on customer feedback, provide recommendations to improve nClouds service offerings.
  • In conjunction with the L-1 team, propose and implement automation of repetitive tasks to reduce/eliminate toil.
  • Collaborate closely with the team in implementing, tracking, and achieving OKR goals.
  • Validate your skills by earning relevant certifications.
  • Actively participate in nClouds Friday Demos and regularly contribute to initiatives like the nCode library.

Additional Information

Please apply only if you have relevant experience.

We require a Site Reliability Engineer who can work remotely and in 24/7 rotating shifts.

By: Onkar