IDI
institutional.org
@institutional.org
A research center at Harvard working to strengthen society’s connection to knowledge by advancing our access to and understanding of the data that shapes AI.
We hope Institutional Books will be the beginning of a process that makes millions more books accessible to the public for a variety of uses.

We welcome feedback as we continue to expand this dataset, refine its contents, and sharpen our process.
www.institutionaldatainitiative.org/institutiona...
Institutional Books | Institutional Data Initiative
Institutional Books 1.0 is our first release of public domain books. This set was originally digitized through Harvard Library’s participation in the Google Books project.
www.institutionaldatainitiative.org
June 12, 2025 at 9:12 PM
We look forward to growing Institutional Books through community. We welcome collaboration from researchers and model makers as we:
- Evaluate the dataset’s impact on model outputs
- Continue to refine our OCR pipelines

View the dataset on Hugging Face: huggingface.co/datasets/ins...
institutional/institutional-books-1.0 · Datasets at Hugging Face
huggingface.co
June 12, 2025 at 9:12 PM
As part of our refinement work, we supplemented the original OCR-extracted text with a post-processed version that uses line detection to reassemble the text according to each line's type.
June 12, 2025 at 9:12 PM
We included extensive volume-level metadata with both original and generated components, such as results from text-level language detection.
June 12, 2025 at 9:12 PM
We analyzed the dataset’s coverage across time, topic, and language and found:
- 40% English text + a long tail of 254 languages
- 20 clear topical tranches
- Largely published in the 19th and 20th centuries

Technical report here: arxiv.org/abs/2506.08300
June 12, 2025 at 9:12 PM