llvm_ir_dataset_utils.tools
Tool for aggregating and providing statistics on bitcode size.
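A minimal sketch of what this aggregation can look like, assuming the corpus is a directory tree of .bc files grouped under top-level project directories (the layout is an assumption, not the tool's actual interface):

```python
import os
from collections import defaultdict

def aggregate_bitcode_sizes(corpus_dir):
    """Sum .bc file sizes per top-level project directory (hypothetical layout)."""
    sizes = defaultdict(int)
    for root, _, files in os.walk(corpus_dir):
        for name in files:
            if name.endswith(".bc"):
                path = os.path.join(root, name)
                # Attribute the file to the directory directly under corpus_dir.
                project = os.path.relpath(path, corpus_dir).split(os.sep)[0]
                sizes[project] += os.path.getsize(path)
    return dict(sizes)
```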
Tool for collecting license information on all projects and writing it into a JSON file.
A script for collecting a large amount of textual IR into a single file, aimed primarily at training basic BPE tokenizers.
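As a rough sketch of the idea, assuming the textual IR lives in .ll files under a corpus directory (the real script's inputs and record separator may differ):

```python
from pathlib import Path

def concatenate_ir(corpus_dir: str, output_path: str) -> None:
    """Concatenate all textual IR files into one file for tokenizer training."""
    with open(output_path, "w", encoding="utf-8") as out:
        for ll_file in Path(corpus_dir).rglob("*.ll"):
            out.write(ll_file.read_text(encoding="utf-8", errors="replace"))
            out.write("\n")  # keep modules separated
```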
A tool for counting quantities such as tokens from gathered statistics CSV files.
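A sketch of the kind of counting involved, assuming each statistics CSV has a per-module `tokens` column (the column name here is hypothetical):

```python
import csv
import glob

def count_column(csv_glob: str, column: str = "tokens") -> int:
    """Sum an integer column across all matching statistics CSV files."""
    total = 0
    for csv_path in glob.glob(csv_glob):
        with open(csv_path, newline="") as f:
            for row in csv.DictReader(f):
                total += int(row[column])
    return total
```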
Tool for deleting large numbers of inodes in parallel.
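Deleting millions of small files is usually metadata-bound rather than bandwidth-bound, so parallelising the unlink calls helps; a minimal sketch using a thread pool (the actual tool's strategy may differ):

```python
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def delete_tree_parallel(root: str, workers: int = 32) -> None:
    """Unlink every file under root in parallel, then remove the empty directories."""
    paths = [p for p in Path(root).rglob("*") if p.is_file() or p.is_symlink()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pool.map(os.unlink, paths)
    # Remove the now-empty directories bottom-up (children sort after parents).
    for d in sorted((p for p in Path(root).rglob("*") if p.is_dir()), reverse=True):
        d.rmdir()
    os.rmdir(root)
```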
Tool to get build failure logs and copy them into a folder.
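A sketch under the assumption that each failed build leaves a *.log file next to its target (both the layout and the naming are assumptions):

```python
import shutil
from pathlib import Path

def collect_failure_logs(corpus_dir: str, dest_dir: str) -> None:
    """Copy every build log found under the corpus into one flat folder."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for log in Path(corpus_dir).rglob("*.log"):
        # Prefix with the parent directory name to avoid filename collisions.
        shutil.copy(log, dest / f"{log.parent.name}.{log.name}")
```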
Tool for extracting basic blocks from the corpus.
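One simple way to approximate this on textual IR is to split each function body at block labels; a sketch (the actual tool may work at a different level, e.g. directly on bitcode):

```python
import re

_LABEL = re.compile(r'^[\w.$"-]+:')  # lines such as "entry:" or "bb3:" open a block

def extract_basic_blocks(ir_text: str) -> list[str]:
    """Split every function body in a textual IR module into basic blocks."""
    blocks, current, in_function = [], [], False
    for line in ir_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("define"):
            in_function, current = True, []
        elif in_function and stripped == "}":
            if current:
                blocks.append("\n".join(current))
            in_function = False
        elif in_function:
            if _LABEL.match(stripped) and current:
                blocks.append("\n".join(current))
                current = []
            current.append(line)
    return blocks
```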
Tool to find, within a corpus directory, all the logs for targets that failed to build.
Tool for getting common tokenizer constants from bitcode modules.
Tool for getting all Spack packages that are usable for producing LLVM bitcode.
Tool for searching all the source files within a corpus.
Tool for running llvm-link over all bitcode files in a corpus.
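llvm-link takes a list of modules and writes one combined module, so the per-corpus step can look roughly like this (file discovery is simplified, and very large corpora may need to be linked in chunks to stay under command-line length limits):

```python
import subprocess
from pathlib import Path

def link_corpus(corpus_dir: str, output_bc: str) -> None:
    """Run llvm-link over every .bc file found under corpus_dir."""
    modules = [str(p) for p in Path(corpus_dir).rglob("*.bc")]
    subprocess.run(["llvm-link", *modules, "-o", output_bc], check=True)
```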
Tool for getting statistics on bitcode modules.
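As an illustration of one cheap statistic, a module can be disassembled with llvm-dis and its textual IR scanned; the real tool gathers richer statistics than this sketch:

```python
import subprocess

def count_defined_functions(bc_path: str) -> int:
    """Disassemble a bitcode module and count its function definitions."""
    result = subprocess.run(
        ["llvm-dis", bc_path, "-o", "-"],  # "-o -" writes the textual IR to stdout
        capture_output=True, text=True, check=True)
    return sum(1 for line in result.stdout.splitlines()
               if line.startswith("define "))
```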
A tool for downloading and parsing the crates.io database to extract repositories and corpus descriptions.
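crates.io publishes a nightly database dump at https://static.crates.io/db-dump.tar.gz; a sketch of pulling repository URLs out of it (the crates.csv layout inside the archive is an assumption worth verifying against the dump's README):

```python
import csv
import io
import tarfile

def crate_repositories(dump_path: str) -> list[str]:
    """Extract non-empty repository URLs from the crates.csv table in the dump."""
    repos = []
    with tarfile.open(dump_path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.name.endswith("data/crates.csv"):
                reader = csv.DictReader(
                    io.TextIOWrapper(tar.extractfile(member), encoding="utf-8"))
                repos = [row["repository"] for row in reader if row.get("repository")]
    return repos
```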
A script that converts a deduplicated dataset into a parquet dataset for distribution.
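A minimal sketch of one conversion step using pyarrow, with a hypothetical two-column schema (the real schema is whatever the script defines):

```python
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

def write_parquet_shard(module_paths: list[str], out_path: str) -> None:
    """Pack a batch of bitcode modules into a single parquet file."""
    table = pa.table({
        "content": [Path(p).read_bytes() for p in module_paths],  # raw bitcode
        "module_path": [str(p) for p in module_paths],
    })
    pq.write_table(table, out_path)
```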
Tool for searching for strings in the .bc files that will be included in the dataset distribution.
A tool for finding Spack build failures that break the largest number of dependent packages.
A tool for squashing the Hugging Face (HF) repository history.
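huggingface_hub exposes history squashing directly; a sketch with a hypothetical repo id (note the operation is destructive and needs write access):

```python
from huggingface_hub import HfApi

api = HfApi()
# Collapse all commits on the default branch into a single commit.
api.super_squash_history(repo_id="some-org/some-dataset",  # hypothetical repo id
                         repo_type="dataset")
```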
Tool for getting the top x constants from a constant frequency histogram.
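Given a histogram mapping constants to frequencies, the top-x selection is a one-liner; a sketch:

```python
from collections import Counter

def top_constants(histogram: dict[str, int], x: int) -> list[tuple[str, int]]:
    """Return the x most frequent constants with their counts."""
    return Counter(histogram).most_common(x)

print(top_constants({"i32 0": 120, "i32 1": 85, "i64 -1": 7}, 2))
```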
A script for uploading a dataset, in the form of a folder of parquet files, to Hugging Face.
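huggingface_hub's upload_folder covers the upload itself; a sketch with a hypothetical repo id and folder name:

```python
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="output_parquet/",      # hypothetical folder of .parquet shards
    repo_id="some-org/some-dataset",    # hypothetical repo id
    repo_type="dataset",
    allow_patterns="*.parquet",         # skip anything that is not a shard
)
```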
A script that loads a folder of parquet files produced by the process_to_parquet.py script and validates some of the fields.
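A sketch of the kind of checks involved, assuming a required non-null `content` column (the actual fields checked are defined by the validation script):

```python
from pathlib import Path

import pyarrow.parquet as pq

def validate_shards(parquet_dir: str, required_columns=("content",)) -> None:
    """Check that every shard has the required columns and that they contain no nulls."""
    for shard in Path(parquet_dir).glob("*.parquet"):
        table = pq.read_table(shard)
        for column in required_columns:
            assert column in table.column_names, f"{shard}: missing column {column}"
            assert table.column(column).null_count == 0, f"{shard}: nulls in {column}"
```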