Site Reliability Engineer (SRE) - AWS
Senior DevOps Engineer (AWS)
This is a growing company in the finance space, building a fintech platform that helps wealth management and asset management organisations provide solutions and strategies for their clients.
The Site Reliability Engineering team plays a crucial role in ensuring platform stability and delivering a seamless experience for clients. In this leadership position, you’ll bridge the gap between software development and operations, applying engineering best practices to infrastructure challenges.
What are we looking for?
- 5+ years of experience in Site Reliability Engineering or a related field, with at least 3 years working in AWS environments.
- Expertise in container orchestration, particularly Kubernetes, along with related ecosystem tools.
- Familiarity with databases such as MongoDB, PostgreSQL, and DynamoDB.
- Strong understanding of reliability engineering principles and distributed system behavior.
- Experience defining and implementing SLOs/SLIs to enhance system performance and reliability.
- Proven ability to design observability solutions that generate meaningful insights while minimizing unnecessary alerts.
- Proficiency in at least one Infrastructure-as-Code tool (Terraform preferred) and one programming language such as Python, Ruby, or Java, with a focus on writing maintainable, testable code.
- Deep understanding of modern monitoring and observability tools, including Prometheus, Grafana, Splunk, New Relic, CloudWatch, and ELK in cloud environments.
- Strong incident response skills, including leading post-incident reviews and implementing long-term improvements.
- Excellent troubleshooting skills, with experience debugging distributed systems.
- A track record of automating workflows to reduce operational overhead.
- Strong communication skills, with the ability to convey complex technical concepts to different audiences.
- Passion for collaboration, mentorship, and continuous learning.
What will you do?
- Define, implement, and manage service level objectives (SLOs) that align with business priorities and user expectations.
- Develop observability strategies, leveraging key performance metrics to drive actionable improvements.
- Design and deploy scalable infrastructure solutions using cloud-native technologies and Infrastructure-as-Code principles.
- Lead automation efforts to reduce manual workload and enhance system reliability.
- Advocate for reliability best practices, providing guidance and tools to development teams.
- Oversee the design and operation of Kubernetes environments for efficient container orchestration.
- Manage incident response processes, conduct in-depth postmortems, and drive continuous system improvements.
- Participate in on-call rotations with a focus on proactive service enhancements.
What's on offer?
- Competitive Salary
- Personal Performance Bonus
- Equity
- Pension Fund
- Health Insurance (for you and immediate family)
- Food Allowance
Interested?
Hit Apply!
Job details
Company: UMATR
Location: Lisboa, Lisboa, Portugal
Published: 15 March 2025