By John Kozi, AI Assistant Team at Jane Street

Source: https://www.youtube.com/watch?v=0ML7ZLMdcl4 (Transcription and Writing using AI)

How Jane Street Builds AI Developer Tools for an Obscure Programming Language (TEXT ONLY VERSION)

The Challenge

Jane Street uses OCaml as its primary development platform. For those unfamiliar, OCaml is a powerful functional language—but also an incredibly obscure one. Originally built in France, it's most commonly used for theorem proving, formal verification, and writing programming languages.

At Jane Street, OCaml is used for everything. Web applications? Written in OCaml and transpiled to JavaScript. Vim plugins? OCaml transpiled to Vimscript. Even FPGA code is written using an OCaml library rather than Verilog.

This creates a problem: off-the-shelf AI tools don't work well for us. The reasons are straightforward:

  1. Models aren't trained on OCaml. There's a good chance Jane Street has more OCaml code internally than exists in the entire outside world combined.
  2. We've built custom everything. Our own build systems, distributed build environment, and code review system (called Iron). We develop in a giant monorepo stored in Mercurial, not Git. And 67% of the firm uses Emacs.
  3. We want flexibility. We want to apply LLMs across our development workflow—resolving merge conflicts, writing feature descriptions, suggesting code reviewers—without being limited by boundaries between systems.


Training Custom Models

At first glance, training custom models seems like overkill. It's expensive, time-consuming, and easy to get wrong. But a paper from Meta about their CodeCompose project convinced us otherwise. They fine-tuned a model specifically for Hack, a language that, like OCaml, is used primarily at a single company.


We were naive at first. We thought we could just show a model our code and get back something that understood our libraries and idioms. That's not how it works.

To get good results, the model needs to see examples that match the shape of questions you'll actually ask. We needed a clear goal: generate diffs given a prompt. A developer writes a description of what they want, and the model suggests a potentially multi-file diff—up to 100 lines—that applies cleanly and has a good chance of type-checking.
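To make the target shape concrete, here is a minimal sketch of what one such training example might look like: a natural-language prompt paired with a multi-file-capable unified diff, plus a check against the ~100-line budget mentioned above. The field names, the helper functions, and the limit constant are all illustrative assumptions, not Jane Street's actual format.

```python
# Hypothetical training-example shape: prompt + unified diff.
# MAX_DIFF_LINES mirrors the ~100-line budget from the text (assumed name).
MAX_DIFF_LINES = 100

def changed_line_count(diff: str) -> int:
    """Count added/removed lines in a unified diff, ignoring file headers."""
    return sum(
        1
        for line in diff.splitlines()
        if (line.startswith("+") or line.startswith("-"))
        and not line.startswith(("+++", "---"))
    )

def within_budget(diff: str) -> bool:
    """True if the diff stays within the assumed line budget."""
    return changed_line_count(diff) <= MAX_DIFF_LINES

# One illustrative example pairing a request with a small OCaml diff.
example = {
    "prompt": "Add a helper that squares an int",
    "diff": (
        "--- a/lib/util.ml\n"
        "+++ b/lib/util.ml\n"
        "@@ -1,2 +1,3 @@\n"
        " let double x = 2 * x\n"
        "+let square x = x * x\n"
    ),
}

print(changed_line_count(example["diff"]))  # 1
print(within_budget(example["diff"]))       # True
```

Checking that a diff "applies cleanly" and type-checks requires the repository and compiler in the loop, so this sketch only captures the cheap, repo-independent part of the filter.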


Collecting Training Data