GOOD RESEARCH REQUIRES GREAT DATA
Most researchers still collect and label their own datasets. It’s slow, messy, and often done by overworked grad students. We think you deserve better.
At Revelo, we already build high-quality code and reasoning datasets for frontier labs.

Now we’re opening that same pipeline to the research community — for free — exclusively for code-related research.
WHY WE'RE DOING THIS
Innovation shouldn’t require a billion‑dollar data budget. If your research pushes the frontier — code generation, reasoning, alignment, SWE‑bench‑style evaluation — we’ll help you design and build the dataset you need.
All we ask?
If your paper gets published, cite us for dataset support. That’s it. No contracts. No fine print. Just science and good manners.
WHY RESEARCHERS TRUST REVELO
We are not another labeling vendor — we are a technical data partner built by engineers.
Built by Practitioners
Curated by engineers who’ve shipped production‑grade AI code — with build logs, reasoning traces, validation harnesses, and error metadata.
Precision > Volume
Rubric‑based annotation and automated diff validation ensure each example improves reasoning fidelity — not just dataset size.
Reproducibility at Scale
Versioning with hashing, prompt templates, and deterministic sampling scripts — your ablations survive peer review.
Quality Metrics, Not Vibes
Inter‑annotator agreement, error distributions, and benchmark validation reports — cite empirical quality, not anecdotes.
Full Ownership
We don’t reuse your data or mix it into commercial training sets. Your dataset remains yours — from schema to sample.
Engineer‑Grade Delivery
Datasets come in JSONL / Parquet / HF‑ready formats with tests and version control.
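For the curious, the reproducibility ideas above (content hashing and seeded, deterministic sampling) can be sketched in a few lines of Python. This is an illustrative toy, not our actual pipeline; the record fields and function names are made up:

```python
import hashlib
import json
import random

def content_hash(records):
    """Hash the canonical JSON of every record, so any change to the
    dataset changes its version fingerprint."""
    h = hashlib.sha256()
    for rec in records:
        h.update(json.dumps(rec, sort_keys=True).encode("utf-8"))
    return h.hexdigest()

def deterministic_sample(records, k, seed=0):
    """Seeded sampling, so an ablation split is identical run to run."""
    rng = random.Random(seed)
    return rng.sample(records, k)

# Toy records with a hypothetical schema.
records = [{"prompt": f"task {i}", "completion": f"answer {i}"} for i in range(10)]
version = content_hash(records)
split = deterministic_sample(records, k=3, seed=42)
```

Pinning a dataset to its content hash, rather than a filename, is what lets a reviewer confirm they ran the exact bytes you did.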
WHAT YOU'LL GET
A custom dataset built by engineers, fully tested and version-controlled. It’s clean, reproducible, and delivered in standard formats like JSONL, Parquet, or Hugging Face Datasets.
We tailor it to your research goals—fine-tuning, evaluation, or ablations—and you own it completely. All we ask is a citation or acknowledgment when you publish.
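If you haven’t worked with JSONL before, a delivered dataset is just one JSON record per line, which makes it trivial to validate. A minimal sketch, using a hypothetical record schema (the field names are illustrative, not our actual delivery format):

```python
import json

# Hypothetical schema for a delivered JSONL dataset.
REQUIRED_FIELDS = {"prompt", "completion", "metadata"}

def validate_jsonl(lines):
    """Parse JSONL lines and check each record has the required fields."""
    records = []
    for lineno, line in enumerate(lines, start=1):
        rec = json.loads(line)
        missing = REQUIRED_FIELDS - rec.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing fields {sorted(missing)}")
        records.append(rec)
    return records

sample = [
    json.dumps({"prompt": "fix the bug", "completion": "patch",
                "metadata": {"lang": "py"}}),
]
records = validate_jsonl(sample)
```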
WHO IT'S FOR
Anyone pushing the frontier without an Anthropic‑sized GPU cluster
ML researchers working on code or reasoning tasks
Grad students writing their first (or fifth) paper
Independent researchers with big ideas and small budgets
HOW IT WORKS
You tell us your idea in detail
If it’s ethical, feasible, and interesting, we’ll reach out
We help you design and deliver the dataset
You publish your work and cite us
ABOUT THIS PROGRAM (FAQ)
-
Wait, so it’s actually free?
Yep. Totally free.
We believe good research deserves good data — and not everyone has the budget of a frontier lab.
All we ask is that you cite Revelo in your publication or acknowledgments section.
That helps us justify doing more of this.
-
What kind of projects do you accept?
We focus on:
- Code generation and reasoning
- SWE-bench–style evaluations
- DPO/RLHF preference datasets
- Front-end or IaC (Infrastructure-as-Code) reasoning tasks
Basically, if it helps models reason about real code, we’re interested.
-
Who owns the data?
You do.
You get full ownership and control of your dataset.
We don’t reuse it, sell it, or train on it.
It’s yours — we just help you make it solid enough to publish.
-
How should I cite Revelo?
Something like this works:
“This research was supported by Revelo’s Human Data team, who provided dataset design and annotation assistance.”
(You can phrase it your way, as long as it gives credit where it’s due.)
-
Why do you review ideas first?
Because we want to focus on projects that make an impact.
We look for:
- Research value (new, not redundant)
- Technical feasibility
- Ethical use of data
- Potential for publication
If it checks those boxes, we’ll help you build it.
-
Why are you doing this?
Because we build datasets for big labs every day — this is our way of helping the open research community too.
If great papers come out of it and we get cited? Everybody wins.