Welcome to Pico LM, a research initiative dedicated to demystifying language model learning.
We create two complementary frameworks (pico-train and pico-analyze) for training and analyzing small to mid-scale language models (1M–1B parameters). Our mission is to provide a transparent, research-oriented workflow that illuminates how these models learn.
For full documentation and code, visit our two main repositories:
This HuggingFace organization hosts our pre-trained models and datasets, while the GitHub repository provides the code to train and analyze your own model suites from scratch.
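As a quick sanity check, a checkpoint from this organization can be pulled with the standard `transformers` Auto classes. This is a minimal sketch only: the repository id below is a placeholder, and `trust_remote_code=True` is included in case the model class ships as custom code; see the individual model cards for the exact loading instructions.

```python
# Minimal sketch: load a pre-trained Pico decoder from the Hub.
# "pico-lm/pico-decoder-medium" is a hypothetical repository id -- substitute
# one of the models actually listed in this organization.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "pico-lm/pico-decoder-medium"  # hypothetical repository id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

inputs = tokenizer("Language models learn by", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```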
All code and artifacts are released under the permissive Apache 2.0 license.
Pro Tip: To learn more about these libraries and explore detailed tutorials, visit our official website, picolm.io, and get fully acquainted with the Pico ecosystem.
Our complete suite of models, ranging from 11M to 570M parameters, all trained with Pico:
🚧 Disclaimer: These models are still under construction. The models released in this repository have been trained for 125,000 steps (corresponding to ~250B tokens); training will conclude at 200,000 steps.
🚧 Coming Soon: pico-decoder-xl (1B+ parameters). Watch this space or star our GitHub repository for updates!
All models are trained on the pretokenized-dolma dataset. They all see the same training data at each training step, use the same optimization process, and share the same model architecture; the only difference between models is the size of their hidden dimension.
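For reference, here is a minimal sketch of streaming that training corpus from the Hub with the `datasets` library. The dataset id `pico-lm/pretokenized-dolma` and the `input_ids` column name are assumptions inferred from the dataset's name; check the dataset card for the actual schema.

```python
# Sketch: stream the pre-tokenized training corpus from the Hub.
from datasets import load_dataset

# Assumed dataset id; streaming avoids downloading the full corpus.
dataset = load_dataset("pico-lm/pretokenized-dolma", split="train", streaming=True)

example = next(iter(dataset))
print(list(example.keys()))        # inspect the available columns
print(len(example["input_ids"]))   # assumed column of pre-tokenized ids
```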
In each model repository, we version-control a checkpoint every 1,000 steps; each checkpoint contains:
We visualize the learning process in our Weights & Biases (Wandb) dashboard.
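Because checkpoints are version-controlled inside each model repository, intermediate training states can be loaded by revision. The sketch below assumes the checkpoints are exposed as branches named by step count (e.g. `step-100000`); the actual naming scheme may differ, so consult a model repository's branch list for the real revision names.

```python
# Sketch: load an intermediate checkpoint to study learning dynamics over time.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "pico-lm/pico-decoder-medium",   # hypothetical repository id
    revision="step-100000",          # assumed checkpoint branch name
    trust_remote_code=True,
)
```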
Model Details:
| Aspect | Details |
|---|---|
| Architecture | Llama-style transformer (decoder-only)<br>RMSNorm normalization<br>RoPE (Rotary Positional Embeddings)<br>Multi-head attention with KV-cache<br>SwiGLU activation function |
| Sequence Length | 2048 |
| Batch Size | 1024 |
| Optimizer | AdamW |
| Learning Rate | 3e-4 (one-cycle warmup) |
| Gradient Clipping | 1.0 |
| Precision | Mixed precision training |
| Vocabulary Size | 50,280 |
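As an illustration of the optimization recipe in the table above, here is a hedged PyTorch sketch using a stand-in model. pico-train's actual training loop is wired up differently, and PyTorch's `OneCycleLR` is only one way to realize a one-cycle warmup schedule, so treat this purely as an illustration of the listed hyperparameters.

```python
# Illustrative sketch of the table's optimization settings in plain PyTorch.
import torch

# Stand-in for the actual Llama-style decoder.
model = torch.nn.Linear(768, 50280)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=3e-4, total_steps=200_000  # 3e-4 peak LR, one-cycle schedule
)

def training_step(batch: torch.Tensor) -> None:
    optimizer.zero_grad()
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):  # mixed precision
        loss = model(batch).float().mean()                          # placeholder loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)         # gradient clipping at 1.0
    optimizer.step()
    scheduler.step()

training_step(torch.randn(1024, 768))  # batch size 1024, as in the table
```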
All datasets are tokenized using the OLMo tokenizer.
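To tokenize your own text the same way, the OLMo tokenizer can be loaded through `transformers`. The repository id below is an assumption (one publicly available copy of the tokenizer), not necessarily the exact source used by pico-train.

```python
# Sketch: load the OLMo tokenizer from an assumed public repository.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-hf")  # assumed repo id

# The vocabulary should roughly match the 50,280 entries listed above.
print(len(tokenizer))
print(tokenizer("Pico makes learning dynamics transparent.")["input_ids"])
```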
If you use Pico in academic or professional work, please cite it:
@software{pico2025,
author = {Diehl Martinez, Richard},
title = {Pico: A Lightweight Framework for Studying Language Model Learning Dynamics},
  year = {2025},
url = {https://github.com/pico-lm}
}
Thanks for checking out Pico!
Star our GitHub repositories or join our community discussions to stay updated. If you find a bug or have questions, open an issue; contributions are welcome!