llvm_ir_dataset_utils.tools

llvm_ir_dataset_utils.tools.aggregate_build_sizes

Tool for aggregating and providing statistics on bitcode size.

llvm_ir_dataset_utils.tools.collect_license_information

Tool for collecting license information for all projects and writing it to a JSON file.

llvm_ir_dataset_utils.tools.collect_textual_ir

A script for collecting a large amount of textual IR into a single file, aimed primarily at training basic BPE tokenizers.

llvm_ir_dataset_utils.tools.count_values

A tool for counting various quantities, such as tokens, from gathered statistics CSV files.
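As a rough illustration of this kind of counting, the sketch below sums one numeric column across several CSV inputs. It is not the tool's actual implementation: the function name `count_column` and the column name `token_count` are hypothetical, and the real statistics files may use a different layout.

```python
import csv
import io


def count_column(csv_files, column):
    """Sum an integer column across several open CSV files (hypothetical helper)."""
    total = 0
    for csv_file in csv_files:
        # DictReader uses the first row of each file as the header.
        for row in csv.DictReader(csv_file):
            total += int(row[column])
    return total


# In-memory stand-ins for statistics CSV files; the column name is assumed.
first = io.StringIO("name,token_count\nmodule_a,10\nmodule_b,25\n")
second = io.StringIO("name,token_count\nmodule_c,7\n")
print(count_column([first, second], "token_count"))
```

Passing open file objects rather than paths keeps the sketch independent of any particular corpus directory layout.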

llvm_ir_dataset_utils.tools.delete_folder

Tool for deleting a lot of inodes in parallel.
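One way to delete many inodes in parallel is to split the work by top-level directory entry and remove each entry in its own worker. The sketch below shows that idea using only the standard library. It is an assumption about the general approach, not the tool's actual code, and `delete_folder_parallel` is a hypothetical name.

```python
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor


def delete_folder_parallel(root, max_workers=8):
    """Delete root by removing its top-level entries concurrently (hypothetical helper)."""

    def remove(path):
        # Remove real directories recursively; unlink files and symlinks.
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
        else:
            os.remove(path)

    entries = [entry.path for entry in os.scandir(root)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for all workers and re-raises any exception they hit.
        list(pool.map(remove, entries))
    # The root is empty at this point, so a plain rmdir is enough.
    os.rmdir(root)


# Build a small throwaway tree and delete it.
root = tempfile.mkdtemp()
for index in range(4):
    subdir = os.path.join(root, f"dir_{index}")
    os.mkdir(subdir)
    with open(os.path.join(subdir, "file.txt"), "w") as handle:
        handle.write("data")
delete_folder_parallel(root)
```

Threads are used here because deletion is dominated by file system calls; whether that gives a speedup depends on the underlying file system.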

llvm_ir_dataset_utils.tools.extract_build_failure_logs

Tool to get build failure logs and copy them into a folder.

llvm_ir_dataset_utils.tools.get_bbs

Tool for extracting basic blocks from the corpus.

llvm_ir_dataset_utils.tools.get_build_failure_logs

Tool to find all the logs for targets that failed to build from a corpus directory.

llvm_ir_dataset_utils.tools.get_common_constants

Tool for getting common tokenizer constants from bitcode modules.

llvm_ir_dataset_utils.tools.get_spack_package_list

Tool for getting all Spack packages that are usable for producing LLVM bitcode.

llvm_ir_dataset_utils.tools.grep_source

Tool for searching all the source files within a corpus.

llvm_ir_dataset_utils.tools.link_files

Tool for running llvm-link over all bitcode files in a corpus.

llvm_ir_dataset_utils.tools.module_statistics

Tool for getting statistics on bitcode modules.

llvm_ir_dataset_utils.tools.parse_crates_database

A tool for downloading and parsing the crates.io database to extract repositories and corpus descriptions.

llvm_ir_dataset_utils.tools.process_to_parquet

A script for converting a deduplicated dataset into a Parquet dataset for distribution.

llvm_ir_dataset_utils.tools.search_strings

Tool for searching for strings in the bitcode (.bc) files that will be included in the dataset distribution.

llvm_ir_dataset_utils.tools.spack_analyze_failures

A tool for finding Spack build failures that break the most dependent packages.

llvm_ir_dataset_utils.tools.squash_hf_history

A tool for squashing the Hugging Face (HF) history.

llvm_ir_dataset_utils.tools.top_x_constants

Tool for getting the top x constants from a constant frequency histogram.
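Selecting the top x entries from a frequency histogram can be sketched with `collections.Counter`. The example assumes the histogram is a mapping from constant to count; the tool's real input format is not described here, and the function name below is chosen only for illustration.

```python
from collections import Counter


def top_x_constants(histogram, x):
    """Return the x most frequent constants as (constant, count) pairs."""
    return Counter(histogram).most_common(x)


# Assumed histogram shape: constant (as a string) mapped to its frequency.
histogram = {"0": 120, "1": 95, "-1": 40, "255": 12}
print(top_x_constants(histogram, 2))
```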

llvm_ir_dataset_utils.tools.upload_dataset_hf

A script for uploading a dataset, in the form of a folder of Parquet files, to Hugging Face.

llvm_ir_dataset_utils.tools.validate_parquet_db

This script loads a folder of Parquet files produced by the process_to_parquet.py script and validates some of the fields.