The Challenge

Jane Street uses OCaml as its primary development platform. For those unfamiliar, OCaml is a powerful functional language—but also an incredibly obscure one. Originally built in France, it's most commonly used for theorem proving, formal verification, and writing programming languages.

At Jane Street, OCaml is used for everything. Web applications? Written in OCaml and transpiled to JavaScript. Vim plugins? OCaml transpiled to Vimscript. Even FPGA code is written using an OCaml library rather than Verilog.

This creates a problem: off-the-shelf AI tools don't work well for us. The reasons are straightforward:

  1. Models aren't trained on OCaml. There's a good chance Jane Street has more OCaml code internally than exists in the entire outside world combined.
  2. We've built custom everything. Our own build systems, distributed build environment, and code review system (called Iron). We develop on a giant monorepo stored in Mercurial, not Git. And 67% of the firm uses Emacs.
  3. We want flexibility. We want to apply LLMs across our development workflow—resolving merge conflicts, writing feature descriptions, suggesting code reviewers—without being limited by boundaries between systems.

Training Custom Models

At first glance, training custom models seems like overkill. It's expensive, time-consuming, and easy to get wrong. But a paper from Meta about their "CodeCompose" project convinced us otherwise. They fine-tuned a model specifically for Hack, a language that, like OCaml, is used primarily at one company.

We were naive at first. We thought we could just show a model our code and get back something that understood our libraries and idioms. That's not how it works.

To get good results, the model needs to see examples that match the shape of questions you'll actually ask. We needed a clear goal: generate diffs given a prompt. A developer writes a description of what they want, and the model suggests a potentially multi-file diff—up to 100 lines—that applies cleanly and has a good chance of type-checking.
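To make that target concrete, here is a minimal OCaml sketch of the shape of a suggestion. The type names, fields, and the line-counting helper are illustrative assumptions, not our actual internal schema.

```ocaml
(* Illustrative sketch only: a prompt plus a potentially multi-file diff,
   with a helper that enforces the ~100-changed-lines budget. *)

type file_diff = {
  path : string;  (* file the hunk applies to *)
  hunk : string;  (* unified-diff-style text for that file *)
}

type suggestion = {
  prompt : string;        (* what the developer typed in the editor *)
  diff : file_diff list;  (* potentially multi-file change *)
}

(* Count changed lines (additions and deletions) across the whole diff,
   skipping the "+++"/"---" file headers. *)
let changed_lines (s : suggestion) : int =
  List.fold_left
    (fun acc { hunk; _ } ->
      String.split_on_char '\n' hunk
      |> List.filter (fun line ->
             String.length line > 0
             && (line.[0] = '+' || line.[0] = '-')
             && not
                  (String.length line >= 3
                  && (String.sub line 0 3 = "+++" || String.sub line 0 3 = "---")))
      |> List.length
      |> ( + ) acc)
    0 s.diff

(* The size target from the talk: suggestions of up to roughly 100 lines. *)
let within_size_budget (s : suggestion) : bool = changed_lines s <= 100
```

A real pipeline would also need to check that the diff applies cleanly against the developer's current workspace and has a reasonable chance of type-checking, which this sketch leaves out.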

Collecting Training Data

We needed examples in the form of context, prompt, and diff. Where do you get these?

Features (pull requests) seem promising—they have descriptions and code changes. But feature descriptions are written differently than what you'd type in an editor. Instead of paragraphs, developers just want to say "fix that error." Plus, features are often hundreds of lines, too large for training.

Commits are smaller but have the same problems. At Jane Street, commits are mostly checkpoints without meaningful descriptions.

Workspace snapshotting is what actually worked. Every 20 seconds, we capture snapshots of developer workstations along with their build status. We look for patterns: a green → red → green sequence often indicates an isolated change, and the red → green portion is where the developer hit an error and fixed it. We capture the build error at the red state plus the diff that restores green; that pair is our training data.
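As a rough illustration, mining those transitions could look something like the OCaml below. The snapshot type, its field names, and the assumption that each snapshot carries a diff against the previous one are made up for the sketch; a real pipeline would also collapse runs of identical build status and take more care computing the red-to-green diff.

```ocaml
type build_status = Green | Red

type snapshot = {
  status : build_status;
  build_error : string option;  (* compiler output while the build is red *)
  workspace_diff : string;      (* diff against the previous snapshot's files *)
}

type example = {
  error : string;     (* the error the developer was looking at *)
  fix_diff : string;  (* the change that took the build back to green *)
}

(* Scan the snapshots in order and emit an example for every
   green -> red -> green window. *)
let mine_examples (snaps : snapshot list) : example list =
  let rec go acc = function
    | { status = Green; _ }
      :: { status = Red; build_error = Some error; _ }
      :: ({ status = Green; workspace_diff; _ } :: _ as rest) ->
        (* Keep the closing green as the start of the next window. *)
        go ({ error; fix_diff = workspace_diff } :: acc) rest
    | _ :: rest -> go acc rest
    | [] -> List.rev acc
  in
  go [] snaps
```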

For the prompt side, we used an LLM to write a detailed description of each change, then filtered those descriptions down to something closer to what a human would actually type.
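One possible way to implement that two-step process is sketched below, assuming a hypothetical query_llm helper; the talk doesn't specify how the generation or the filtering is actually done.

```ocaml
(* Illustrative sketch: [query_llm : prompt:string -> string] is a
   hypothetical stand-in, not a real API. *)

(* Ask the model for a detailed description of the change, then compress
   it into the kind of short, terse instruction a developer would type. *)
let prompt_for_diff ~query_llm ~(diff : string) : string =
  let detailed =
    query_llm
      ~prompt:("Describe, in detail, what the following change does:\n" ^ diff)
  in
  query_llm
    ~prompt:
      ("Rewrite this description as the short instruction a developer \
        would type to ask for the change:\n" ^ detailed)
```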

Reinforcement Learning and Evaluation

Training data is only half the picture. Reinforcement learning is where models gain real power—aligning their output to what humans consider good code.

What is good code? It parses. In OCaml, it type-checks. Ideally, it compiles and passes tests.
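As a sketch of how that hierarchy might be turned into a reward signal, here is a tiered scoring function in OCaml. The checks record and the specific weights are assumptions for illustration, not the actual reward used in training.

```ocaml
(* Hypothetical results of running the parser, type-checker, build, and
   test suite on a candidate diff. *)
type checks = {
  parses : bool;
  type_checks : bool;
  compiles : bool;
  tests_pass : bool;
}

(* Each successive bar is worth more; a diff that doesn't even parse
   earns nothing. The weights here are made up. *)
let reward (c : checks) : float =
  if not c.parses then 0.0
  else if not c.type_checks then 0.25
  else if not c.compiles then 0.5
  else if not c.tests_pass then 0.75
  else 1.0
```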