Software Engineer – Evals Team (visa sponsorship available)

UK
September 2, 2025

Job Description

About Apollo Research

AI systems are advancing rapidly, creating enormous opportunities but also serious risks—such as deliberate misuse or the deployment of misaligned models. At Apollo Research, our core focus is on deceptive alignment: when a model appears aligned but is actually misaligned and able to evade oversight.

We study this problem through behavioral evaluations of language model (LM) agents, applied interpretability, and controlled experiments with model organisms. Our work aims to make cutting-edge AI systems safer and more trustworthy.

At Apollo, we foster a culture built on truth-seeking, goal orientation, constructive feedback, and collaboration.

The Opportunity

We’re hiring a Software Engineer to strengthen our ability to evaluate Large Language Models (LLMs). You’ll design and build the internal tools and libraries that power our research workflows, directly shaping how our team evaluates AI safety risks.

Your work will help us detect subtle changes in LLM behavior, ensure the reliability of experimental results, and accelerate groundbreaking research on frontier AI systems.

What You’ll Achieve in Your First Year

Design and implement tools and libraries that enable faster, more reliable LLM evaluations.

Build systems that track subtle behavioral shifts in models to ensure experimental integrity.

Shape the vision and architecture of our internal software platform.

Improve team productivity by unblocking researchers with technical guidance and support.

Gain deep expertise in working with state-of-the-art AI models and the unique challenges of AI safety research.

Key Responsibilities

Rapidly prototype and iterate on internal tools for running and scaling LLM evaluations.

Lead the development of major features from concept to production.

Define and prioritize the software roadmap alongside researchers.

Promote strong software design practices and codebase health.

Create reusable patterns for building new evaluation types.

Develop LLM agents to automate internal engineering and research workflows.

Collaborate closely with researchers, providing debugging help and implementation support.

Communicate clearly about technical trade-offs and decisions.

What We’re Looking For

Essential:

Strong Python skills and experience writing production-quality code.

Proven track record of ownership—examples might include:

Leading development of a software tool/product for 1+ years.

Building the tech stack at a startup.

Progressing through a large organization to influence major parts of the codebase.

Creating or maintaining a widely used open-source project.

5+ years professional software engineering experience.

Nice-to-have:

Experience with LLM agents or evaluation frameworks.

Background in information security or cybersecurity.

Experience working with AWS.

Interest in AI safety and alignment research.

Don’t meet every requirement? We still encourage you to apply. We know that great candidates come from diverse backgrounds.

Example Projects You Might Work On

Building an internal job orchestration system for running evaluations on remote machines.

Developing a queryable database for storing all evaluation results.

Designing tools to interact with transcripts from eval runs.

Creating LLM agents that automate internal workflows.

Establishing conventions for new types of evaluations in the Inspect framework.

Optimizing CI pipelines to speed up testing and eliminate flaky runs.

About the Team

You’ll join the Evals Team, currently led by Marius Hobbhahn (manager & advisor) and including engineers and researchers such as Rusheb Shah, Andrei Matveiakin, Mikita Balesni, Jérémy Scheurer, and others.

The team focuses on:

Running evaluations with Inspect (our primary evals framework).

Research on scheming-related properties (e.g., situational awareness, deceptive reasoning).

Publishing results on frontier models (e.g., OpenAI’s o1-preview system card).

Exploring model organisms to study deceptive alignment.

Designing and testing future AI control protocols (planned for 2025).

Logistics

Start date: 2–3 months after first interview

Location: London office (shared with the London Initiative for Safe AI). Primarily in-person.

Visa sponsorship: Available for UK work visas.

Benefits

Competitive UK-based salary

Flexible working hours

Unlimited vacation & sick leave

Daily lunch, dinner, and snacks provided

Paid work trips, including staff retreats, conferences, and business travel

Annual $1,000 professional development budget

Equality Statement

Apollo Research is an Equal Opportunity Employer. We value diversity and welcome applications from all backgrounds, regardless of age, disability, gender, race, religion, or sexual orientation.

How to Apply

Submit your CV via the application form. A cover letter is optional. You may also include links to relevant projects or open-source contributions.

Interview process:

Screening interview

Take-home test (~2 hours)

Three technical interviews (focused on real tasks, not leetcode-style)

Final interview with Marius (CEO)

We recommend exploring Inspect and experimenting with LLM evals if you’d like to prepare in advance.
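If you want something concrete to try, a minimal Inspect task is a reasonable starting point. The sketch below is illustrative only and is not part of Apollo's codebase: it assumes the open-source inspect_ai package is installed and a model provider API key is configured, and the task name and dataset are hypothetical.

```python
# Minimal sketch of an Inspect eval task (assumes: pip install inspect-ai
# and a configured model provider key, e.g. OPENAI_API_KEY).
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import exact
from inspect_ai.solver import generate

@task
def arithmetic_check():  # hypothetical example task
    return Task(
        # Tiny in-memory dataset: one prompt paired with its expected answer.
        dataset=[Sample(input="What is 12 * 12?", target="144")],
        # Single generation step, no tools or multi-turn scaffolding.
        solver=generate(),
        # Score by exact match against the target string.
        scorer=exact(),
    )
```

You could then run it from the command line with something like `inspect eval arithmetic_check.py --model openai/gpt-4o` (the model name here is only an example); the Inspect documentation covers datasets, solvers, and scorers in more depth.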
