How to use mathlib without creating a lean project? #72
Comments
What I'm also confused is why
|
related: #73 |
You can use Mathlib itself as a Lean project. Mathlib has to exist somewhere on your computer.
This is faithful to how Lean itself works. After you build Mathlib, any file reliant on Mathlib's components will have to explicitly import Mathlib. You can also choose to partially import Mathlib. |
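A minimal sketch of what this could look like when the theorems arrive as plain strings: prepend the import line (all of Mathlib, or only some modules) and check the result from inside an existing Mathlib checkout. The helper names `build_lean_source` and `check_in_mathlib` are hypothetical, and the sketch assumes Mathlib has already been cloned and built at `mathlib_dir` and that `lake` is on the `PATH`. It calls `lake env lean` directly rather than going through Pantograph.

```python
import subprocess
import tempfile
from pathlib import Path


def build_lean_source(theorem: str, imports=("Mathlib",)) -> str:
    """Prepend explicit import lines to a theorem given as a string.

    Pass e.g. imports=("Mathlib.Tactic.Ring",) to import only part of Mathlib.
    """
    header = "\n".join(f"import {module}" for module in imports)
    return f"{header}\n\n{theorem}\n"


def check_in_mathlib(theorem: str, mathlib_dir: str, imports=("Mathlib",)):
    """Run Lean on the theorem, using the Mathlib checkout as the Lean project.

    Assumption: `mathlib_dir` is a built Mathlib checkout, so `lake env`
    can set up the search path for the `import` lines.
    """
    source = build_lean_source(theorem, imports)
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        # `lake env lean <file>` elaborates the file inside the project environment.
        return subprocess.run(
            ["lake", "env", "lean", path],
            cwd=mathlib_dir,
            capture_output=True,
            text=True,
        )
    finally:
        Path(path).unlink(missing_ok=True)
```

A return code of 0 from `check_in_mathlib` would indicate that Lean accepted the file; any error messages would be in the captured `stdout`/`stderr`.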
no worries, but I rarely have to do this many times so I forget. I assume many ML people experience the same thing, not sure though |
This is just too much work for users; most people don't care about the specifics of how Lean (or Python) handles installs, package management, etc. Install instructions here: copy-pasting in case the URL ever goes down:

# Data Selection for Language Models via Compression
[License: MIT](https://opensource.org/licenses/MIT)
[arXiv:2410.18194](https://arxiv.org/abs/2410.18194)
This repository hosts the [ZIP-FIT](https://arxiv.org/abs/2410.18194) data selection framework, designed to effectively and efficiently select relevant training data for language models from any data source based on a specified target dataset.
ZIP-FIT is optimized for:
- Rapid, large-scale data selection from extensive raw text datasets.
- Identifying data that closely aligns with the distribution of a given target dataset (e.g., domain-specific data, HumanEval, etc.).
Compute needed:
- 1 CPU node
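
To make the idea of compression-based selection concrete, here is a rough sketch that ranks source texts by their average normalized compression distance (NCD) to a set of target texts, using gzip. It is an illustration under assumptions, not the `zip_fit` implementation: the function names `compressed_size`, `ncd`, and `top_k` are hypothetical, and scoring by mean NCD is an assumed choice.

```python
import gzip


def compressed_size(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of `text`."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance: lower means the texts share more structure."""
    ca, cb = compressed_size(a), compressed_size(b)
    cab = compressed_size(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def top_k(source: list[str], target: list[str], k: int) -> list[str]:
    """Return the k source texts with the smallest mean NCD to the target texts."""
    scored = [
        (sum(ncd(s, t) for t in target) / len(target), s)
        for s in source
    ]
    scored.sort(key=lambda pair: pair[0])
    return [s for _, s in scored[:k]]
```

Because each source text is scored independently, this kind of ranking only needs CPU time and can be spread across cores.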

## Quickstart
Install with pip:

`pip install zip-fit`
Executing this process will generate a jsonl file named `top_k_sequences.jsonl`, containing 10,000 documents. For optimal performance, use uncompressed jsonl files stored on local file storage for all data paths, and use as many CPU cores as possible. You can provide custom functions for reading the data paths and extracting the text field from each example using the `{source,target}_load_dataset_fn` and `{source,target}_parse_example_fn` parameters in the constructor.

### Examples

HuggingFace datasets can also be used in either

```python
from zip_fit import ZIPFIT
from datasets import load_dataset

source_dataset = '/path/to/source.jsonl'
target_dataset = 'openai/openai_humaneval'


# Define the function to load the target dataset
def target_load_dataset_fn(dataset):
    ds = load_dataset(dataset, split='test', trust_remote_code=True)
    return ds


# Define the function to parse examples from the target dataset
def target_parse_example_fn(ex):
    text = f"Problem description: {ex['prompt']} \nCanonical solution: {ex['canonical_solution']}"
    return text


# Create an instance of ZIPFIT
zip_fit_instance = ZIPFIT(
    source_dataset=source_dataset,
    target_dataset=target_dataset,
    target_load_fn=target_load_dataset_fn,
    target_parse_fn=target_parse_example_fn,
    k=100000,
    output_file="top_k_sequences.jsonl",
    compression_algorithm='gzip'  # Change to 'lz4' if desired
)

# Run the ZIPFIT process
zip_fit_instance.run()
```

You can specify different compression algorithms. The ZIP-FIT paper uses gzip; however, other compression algorithms such as lz4 are faster.

## Dev Install: ZIP-FIT + PyPantograph + Mathlib4 + Lean Setup

Below are comprehensive instructions for setting up everything in a conda environment named
1. Create & Activate the
|
I want Pantograph to work out of the box with the standard Lean libraries. I'm only feeding it theorems as strings. But the docs point me to Lean docs that suggest adding Mathlib to my Lean project. I have no Lean project; I just have arbitrary strings being generated by LLMs on the fly. What is the suggested fix?
Error message:
code line: