About Verda

Verda is reimagining cloud infrastructure for AI workloads. We are a full-stack AI cloud company, meaning we install, operate, and optimize our compute for training and inference of AI models.

Join Verda while it’s still being built, not once it’s finished!

Your responsibilities

In this role, you will focus on improving the networking and communication layer behind large-scale LLM training workloads. You will optimize collective communication performance across distributed GPU clusters, helping improve throughput, utilization, and reliability for communication-bound workloads.

You will debug and analyze bottlenecks across the networking stack, and build tooling and infrastructure for benchmarking, profiling, and regression testing of distributed training performance.

You will work closely with training, infrastructure, hardware, and networking teams to improve how workloads scale across clusters, contributing to both system reliability and overall training efficiency.

This role is highly collaborative and research-adjacent, requiring curiosity, initiative, and willingness to go deep into low-level communication systems and distributed training infrastructure.

Your key competencies

Experience with distributed systems, networking, or large-scale ML training infrastructure
Experience with communication libraries such as NCCL, MPI, NVSHMEM, or similar technologies
Experience with profiling and debugging tools such as Nsight Systems, NCCL logs, PyTorch Profiler, or perf
Strong systems thinking and ability to analyze performance bottlenecks across distributed environments
Self-starter mindset with ability to independently define and drive technical projects
Strong curiosity about low-level systems, networking, and large-scale AI infrastructure

Representative projects

Build tools to identify NCCL bottlenecks, slow ranks, and communication tail latency
Build dashboards and regression infrastructure for training network health and performance
Implement fault-tolerance mechanisms to reduce cluster idle time and improve training efficiency

Practicalities

Location: Helsinki, Finland or London, UK

Hybrid mode: Three days a week in our Helsinki or London office

Employment type: Full-time and permanent

What's next

We’re building fast and this role needs the right person behind it. There’s no artificial deadline, but when we find who we’re looking for, we move.

If this sounds like your next move, apply now.

Please submit your application through our Careers page. We don’t accept applications sent by email.

Member of Technical Staff, AI Infrastructure Team
