src-d/datasets
source{d} datasets ("big code") for source code analysis and machine learning on source code
🤗 The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools
Synthetic data curation for post-training and structured data extraction
Deeplake is an AI Data Runtime for Agents. It provides serverless Postgres with a multimodal data lake, enabling scalable retrieval and training.
Open-source low-code data preparation library in Python. Collect, clean, and visualize your data with a few lines of code.
Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
The official Python client for the Hugging Face Hub.