LLVM-IR Dataset Utilities Documentation
LLVM-IR Dataset Utilities is a set of utilities for the construction of large LLVM IR-based datasets from multiple sources for the development of LLVM-focussed machine learning approaches. It is specifically designed to build corpora of bitcode out of language package indices. Built versions of the dataset are available from the LLVM-ML HuggingFace Organization.
Features
Scalable build infrastructure that uses Ray to compile thousands of code bases in parallel across entire CPU clusters.
Extensive support for building from a variety of sources including C, C++, Rust, Swift, Julia, and more.
Enables cross-language statistical analysis across programming languages built on LLVM infrastructure, covering their primitive usage patterns, pass mutations, and beyond.
Can intercept the compilation process at every point where the IR is mutated, enabling in-depth analysis of the compilation process and the construction of datasets based on IR compilation stages.
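To make the idea of a compilation-stage dataset concrete, here is a minimal sketch of splitting a textual pass-by-pass IR dump into one record per stage. It is not this project's implementation. The function name `split_ir_stages`, the `SAMPLE` text, and the banner pattern are illustrative assumptions; the banner shape follows what LLVM tools commonly print with `-print-after-all`, and it varies between LLVM versions.

```python
import re

# ASSUMPTION: banner lines look like
#   ; *** IR Dump After InstCombinePass on add ***
# The exact wording depends on the LLVM version, so adjust the pattern as needed.
BANNER = re.compile(
    r"^;?\s*\*\*\* IR Dump After (?P<name>.+?)(?: on (?P<target>.+?))? \*\*\*\s*$"
)


def split_ir_stages(dump_text):
    """Split a pass-by-pass dump into (pass_name, target, ir_text) tuples."""
    stages = []
    current = None  # [pass_name, target, list of IR lines] for the open stage
    for line in dump_text.splitlines():
        match = BANNER.match(line)
        if match:
            # A new banner closes the previous stage, if there was one.
            if current is not None:
                stages.append((current[0], current[1], "\n".join(current[2]).strip()))
            current = [match.group("name"), match.group("target"), []]
        elif current is not None:
            current[2].append(line)
    # Flush the final stage once the input is exhausted.
    if current is not None:
        stages.append((current[0], current[1], "\n".join(current[2]).strip()))
    return stages


# Hypothetical two-stage dump, used only to exercise the parser.
SAMPLE = """\
; *** IR Dump After SROAPass on add ***
define i32 @add(i32 %a, i32 %b) {
  %sum = add i32 %a, %b
  ret i32 %sum
}
; *** IR Dump After InstCombinePass on add ***
define i32 @add(i32 %a, i32 %b) {
  %sum = add i32 %a, %b
  ret i32 %sum
}
"""

if __name__ == "__main__":
    for pass_name, target, ir_text in split_ir_stages(SAMPLE):
        print(pass_name, target, len(ir_text.splitlines()))
```

In practice the input text would come from a compiler or `opt` run with IR printing enabled. Each resulting tuple could then be stored as one row keyed by pass name and target.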