Principal Cluster Engineer, Training Infrastructure
Imagine a future where everyone has instant, low-cost access to intelligence. We’re building a fully featured European AI cloud - with everything one needs to train, experiment with, and deploy AI models. In addition, our GPUs run on 100% renewable energy.
We’re ambitious, curious, and gutsy doers. We keep hierarchy low across the company and morale high in our teams. We’ve already achieved a lot, yet we’re only getting started. Now it’s your chance to join the ride. We offer more than just a job - we offer a career-defining opportunity to be part of building something big!
As a cherry on top, we’ve recently raised a $64M Series A round and are ready to reach new heights.
About the role
We’re looking for an Infrastructure & SRE Manager to lead the reliability, scalability, and day-to-day resilience of Verda's global cloud platform.
In this role, you’ll sit at the intersection of engineering, operations, and the business - connecting teams across automation, networking, Linux, hardware, and data-centre operations to ensure our infrastructure is world-class.
You’ll own the infrastructure reliability roadmap, shaping how we minimise downtime, speed up deployments, and turn business priorities into technical execution. This is a high-impact leadership position where your decisions directly influence platform performance, customer experience, and our ability to scale at pace.
Why Verda
Generous cash + equity compensation, along with various fringe benefits (healthcare, lunch, wellbeing, etc.)
Flexible working hours and hybrid way of working
Profitable operations alongside fast growth
A role offering the opportunity to make a business-critical impact and grow within the team
A small yet mighty team of 61, challenging the status quo to positively impact the lives of many
27 nationalities in total, with 6 represented in the management team
A community-like atmosphere - we often enjoy each other’s company after work (e.g., BBQs by Ruben, our CEO by day and master chef by night)
Practicalities
Location: Helsinki, Finland
Start Date: As soon as possible
Contract Type: Full-time
Working Language: English
Your responsibilities
Lead and refine the end-to-end deployment lifecycle, ensuring smooth, predictable releases across all environments
Establish and uphold deployment standards, release criteria, and readiness checks that raise the bar for quality
Partner closely with automation, networking, Linux, hardware, and DC ops teams to deliver meaningful infrastructure improvements
Eliminate friction in the delivery pipeline through process optimisation and targeted automation
Measure and improve key engineering metrics - especially lead time for changes and release reliability
Own the availability and reliability roadmap, defining and maintaining SLIs/SLOs that guide engineering priorities
Oversee and evolve incident response, ensuring structured follow-ups and long-term fixes
Maintain an infrastructure risk register, surfacing and mitigating reliability risks before they impact customers
Drive platform resilience through better failover strategies, redundancy, capacity planning, and forward-looking architecture decisions
Your key competencies
Deep understanding of infrastructure, cloud platforms, and distributed systems
Demonstrated experience managing technical products or large-scale platforms in reliability-driven environments
Ability to translate business needs into clear infrastructure priorities and execution plans
Excellent communicator - able to simplify complex topics for both technical and non-technical audiences
Strong, data-led approach to measuring, tracking, and improving system performance
Highly organised, proactive, and comfortable owning outcomes end-to-end
Collaborative and adaptable - thrives in fast-moving, cross-functional teams
Naturally curious about new technology and motivated by building scalable, future-proof systems
Required Experience
Background in infrastructure, SRE, DevOps, observability, or related engineering domains
Strong technical foundation in Linux, networking, data-centre operations, distributed systems, and core cloud concepts (containers, orchestration, etc.)
Hands-on experience with observability tooling - using metrics, logs, and traces to drive diagnostics, insights, and performance improvements
Strong project and programme management capabilities - able to prioritise effectively, coordinate across teams, and influence outcomes without formal authority
Preferred Experience
Experience working in high-availability production environments or platform reliability teams
Understanding of infrastructure hardware lifecycles - including networking, compute, storage, and data-centre capacity planning
Strong bias toward automation, reliability, and continuous improvement
Comfortable operating in mission-critical, always-on environments
Department: Research & Development
Role: Principal Engineer
Locations: Helsinki
Remote status: Hybrid
About Verda
Verda (formerly DataCrunch) is a technology company building the next generation of cloud infrastructure for AI – compute that's instant, on-demand and at scale. Headquartered in Helsinki, the company operates globally across Europe, the US and Asia. Verda employs over 100 people from nearly 30 nationalities and has raised over $200M in total funding from investors including Lifeline Ventures, byFounders, J12 Ventures, Skaala, Varma and Tesi, alongside leading financial institutions.