A dataset and benchmark for file-to-file parser construction with LLMs that write code.
For each filetype there is one directory with multiple subdirectories:
myfiletype
    meta.json
    implementations
        1
            meta.json
            parser.py
    inputs
        1.your_extension
        2.your_extension
    outputs
        1.json
        2.json
See data/zeopp-sa for an example.
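The example file names suggest that each input is paired with the output of the same number. A minimal sketch of that pairing follows. It assumes inputs and outputs are matched by file stem, which is inferred from the names above and not a documented rule. `pair_inputs_with_outputs` is a hypothetical helper and is not part of parserbench.

```python
from pathlib import Path


def pair_inputs_with_outputs(filetype_dir):
    """Return (input_path, expected_output_path) pairs matched by file stem.

    Assumption: inputs/1.your_extension corresponds to outputs/1.json.
    """
    root = Path(filetype_dir)
    pairs = []
    for input_path in sorted((root / "inputs").iterdir()):
        expected = root / "outputs" / f"{input_path.stem}.json"
        # Skip inputs that have no expected output yet.
        if expected.is_file():
            pairs.append((input_path, expected))
    return pairs
```

Calling `pair_inputs_with_outputs("data/zeopp-sa")` would list the input/output pairs of the example entry, provided it follows the layout shown above.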
Besides helping to advance science, meaningful contributions (i.e., a merged PR that adds an entry) will qualify for co-authorship on a paper that might come out of this work.
Please focus on implementation examples in
- Python (preferred)
- JavaScript
- TypeScript
as our current infrastructure can only test code in these languages.
In the example implementations, please only use the standard libraries and the following additional dependencies:
Python:
JavaScript/TypeScript:
Please provide the implementation as a function that accepts the file as a string and returns the parsed JSON string.
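As an illustration of that interface, here is a minimal sketch of a parser function using only the standard library. The file format (`key: value` lines with `#` comments) and the function name `parse` are hypothetical; the required function name and output schema are not specified here.

```python
import json


def parse(file_content: str) -> str:
    """Parse 'key: value' lines (a hypothetical format) into a JSON string."""
    parsed = {}
    for line in file_content.splitlines():
        line = line.strip()
        # Ignore blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        parsed[key.strip()] = value.strip()
    return json.dumps(parsed)
```

The function takes a string and returns a string, so it can be tested without touching the file system: read the input file once, pass its text in, and compare the result with the expected JSON.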
Install the package
pip install -e .
Then run the validation
parserbench.validate_dirs data/
- chemical-files-registry: started as a registry for filetypes that are commonly used in chemistry
- metadata_extractors_registry: started as part of the MARDA extractors working group