SRE Team Lead
Shield is a global startup with offices in Tel Aviv, New York, London, and Lisbon.
We’re growing and looking for another important piece of the puzzle.
Is it you?
Let’s get down to business:
What you will do
Key Responsibilities:
- Establish and nurture a culture of excellence within the SRE team, promoting best practices, effective work processes, and methodologies. Lead by example and mentor the team to foster a collaborative and high-performing environment.
- Set clear team goals and priorities in alignment with organizational objectives. Ensure resources are available and allocated efficiently to meet project timelines and deliverables.
- Recruit, train, and develop team members, providing guidance and support to enhance their skills and career progression. Encourage continuous learning and adaptability to new technologies and methodologies.
- Design, implement, and maintain scalable and reliable infrastructure solutions.
- Develop and deploy monitoring, alerting, and logging systems to proactively identify and mitigate operational issues.
- Review and refine existing alerts, working closely with developers to automate responses and enable self-healing.
- Develop and maintain monitoring dashboards that provide clear and actionable insights into application reliability and system performance.
- Conduct capacity planning and performance tuning to optimize system performance and resource utilization.
- Automate repetitive tasks and processes to streamline operations and improve efficiency.
- Lead incident response and resolution, including rapid troubleshooting, coordinating cross-functional teams, root cause analysis, and post-mortem reviews.
- Develop and maintain incident response procedures and runbooks to ensure efficient and effective handling of incidents.
- Communicate effectively with stakeholders during incidents, providing timely updates and managing expectations.
- Continuously evaluate and adopt new technologies and methodologies to enhance our infrastructure and operations.
- Oversee and optimize our cloud infrastructure on AWS, ensuring scalability, reliability, and cost-effectiveness.
- Regularly analyze cloud service usage and expenses, implementing strategies to optimize costs.
Minimum Qualifications:
- Bachelor’s degree in Computer Science, Information Technology, or a related field.
- 6+ years of experience as a site reliability or platform engineer, preferably in a scaling environment.
- At least 2 years in a leadership role, demonstrating effective team management, mentorship, and strategic planning.
- Hands-on experience with Terraform and Terragrunt.
- Extensive knowledge of Kubernetes and containerization technologies.
- Hands-on experience with the Prometheus stack.
- Ability to design and develop code using Python or Go.
- Strong inclination toward automating manual tasks and processes to improve operational efficiency.
- Excellent troubleshooting abilities with a methodical approach to diagnosing and resolving issues.
- In-depth knowledge of cloud services, particularly AWS, including best practices in security and compliance.
- Excellent communication abilities to coordinate effectively with both technical and non-technical stakeholders.