This is in beta. It will work fine for about 60 percent of cards, overhighlight about 20 percent, underhighlight or highlight randomly about 15 percent, and freeze or crash Word on the rest. It works better at highlighting shorter, manually pre-underlined cards. It also tends to freeze or take a long time on very long cards.
You may get better results with more data, or by training a higher-quality model (which costs more).
- Install Python for Mac. Version 3.10 or lower is recommended.
- Install conda.
- Clone this git repository. Keep it in a consistent location on your computer so inference continues to work.
- Install dependencies (from the base conda environment):

```
pip install -r requirements.txt
```
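If you want to double-check the environment before going further, here is a quick sanity check from Python (this assumes the `openai` package is listed in `requirements.txt`):

```python
# Optional sanity check: confirm the interpreter version and that the OpenAI
# client installed from requirements.txt imports cleanly.
import sys
from importlib.metadata import version

assert sys.version_info < (3, 11), "Python 3.10 or lower is recommended"

import openai  # noqa: F401  (import check only)

print("Python", sys.version.split()[0], "| openai", version("openai"))
```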
- First, generate the training data from a collection of cards. This was tested on roughly 500 samples of consistent formatting and style; 200-1000 samples is the recommended range (more is better). Put your cards in documents in the `data` folder.
  - Optionally, use the pre-generated .jsonl files in the `data` folder, in which case you can skip directly to the fine-tuning stage.
  - The model will automatically split up long cards for you (just as it does during inference).
- Run the following commands to generate a data file for each model:

```
python parse.py data -o underlines.jsonl --field underlines
python parse.py data -o highlights.jsonl --field highlights --input_field underlines
python parse.py data -o emphasis.jsonl --field emphasis --input_field underlines
```
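If you would rather regenerate all three files in one step, a small wrapper along these lines should work (it simply shells out to the exact commands shown above):

```python
# Run all three data-generation passes in sequence. The flags mirror the
# commands above; adjust them if parse.py's interface changes.
import subprocess

JOBS = [
    ("underlines.jsonl", ["--field", "underlines"]),
    ("highlights.jsonl", ["--field", "highlights", "--input_field", "underlines"]),
    ("emphasis.jsonl", ["--field", "emphasis", "--input_field", "underlines"]),
]

for outfile, extra_args in JOBS:
    subprocess.run(["python", "parse.py", "data", "-o", outfile, *extra_args], check=True)
```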
- Normalize the files with the `openai` tool:

```
openai tools fine_tunes.prepare_data -f emphasis.jsonl
openai tools fine_tunes.prepare_data -f highlights.jsonl
openai tools fine_tunes.prepare_data -f underlines.jsonl
```
Open them and make sure everything looks reasonable. Use these newly prepared files for the fine-tuning stage.
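Here is a quick way to spot-check the prepared files from Python; it assumes the usual `*_prepared.jsonl` naming that `prepare_data` suggests, so adjust the filename if yours differs:

```python
# Print the first few prompt/completion pairs from a prepared training file so
# you can eyeball the formatting before uploading it.
import json

with open("underlines_prepared.jsonl") as f:
    for i, line in enumerate(f):
        example = json.loads(line)
        print(repr(example["prompt"][:80]), "->", repr(example["completion"][:80]))
        if i == 2:
            break
```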
- Put your OpenAI API key into a file called `.env` in the root directory. It should look like this:

```
OPENAI_API_KEY=sk-your-key-here
```
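For reference, this is the pattern the scripts presumably use to pick up the key (assuming `python-dotenv` is among the installed dependencies); running it is a quick way to confirm the `.env` file is readable:

```python
# Load OPENAI_API_KEY from .env and confirm it is visible to the process.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
key = os.getenv("OPENAI_API_KEY")
print("API key loaded" if key else "API key missing - check your .env file")
```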
- Upload each of your files to OpenAI and start training. Repeat for each model, passing `emphasis`, `highlight`, or `underline` to `-m`:

```
python finetune.py file -f emphasis.jsonl
python finetune.py tune -f YOUR_FILE_ID -m emphasis/highlight/underline
```
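For context, these subcommands presumably wrap the pre-1.0 `openai` SDK's file-upload and fine-tune calls; here is a rough sketch of the equivalent direct calls (the filename and base model are placeholders):

```python
# Roughly what the helper script wraps (assuming the legacy, pre-1.0 openai SDK
# that also provides the fine_tunes CLI used above).
import os

import openai
from dotenv import load_dotenv

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

# Upload the prepared training file.
upload = openai.File.create(file=open("emphasis_prepared.jsonl", "rb"), purpose="fine-tune")
print("file id:", upload["id"])

# Start a fine-tune on the Babbage base model.
job = openai.FineTune.create(training_file=upload["id"], model="babbage")
print("fine-tune id:", job["id"])
```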
- You can monitor the progress of your fine-tuning or retrieve file IDs using the `list` command:

```
python finetune.py list -l files
python finetune.py list -l finetunes --finetune_id YOUR_FINETUNE_ID
```
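The same status check can be done directly against the API, again assuming the legacy `openai` SDK:

```python
# Look up a fine-tune job and print its status and, once finished, the ID of
# the resulting model.
import os

import openai
from dotenv import load_dotenv

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

job = openai.FineTune.retrieve(id="YOUR_FINETUNE_ID")
print(job["status"], job.get("fine_tuned_model"))
```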
- Replace the model IDs in `constants.py` with the IDs of your newly trained models.
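The variable names below are illustrative guesses rather than the repository's actual ones; the point is just that each of the three models gets its own ID entry in `constants.py`:

```python
# constants.py (illustrative layout only - keep whatever names the real file
# uses). Paste the fine-tuned model IDs that OpenAI returns for each job.
UNDERLINE_MODEL = "babbage:ft-your-org-2023-01-01-00-00-00"
HIGHLIGHT_MODEL = "babbage:ft-your-org-2023-01-01-00-00-01"
EMPHASIS_MODEL = "babbage:ft-your-org-2023-01-01-00-00-02"
```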
You can use the utility below to calculate the cost of fine-tuning each of the OpenAI base models. The default (and recommended) model is Babbage, which should cost less than $1 per model for a reasonably sized dataset. More advanced models may be more accurate, but will cost more.

```
python finetune.py cost -f emphasis.jsonl
```
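If you want to sanity-check the estimate, the underlying arithmetic is just training tokens multiplied by a per-token rate; here is a rough sketch (the price constant is a placeholder, and `tiktoken` is assumed to be available):

```python
# Estimate fine-tuning cost as total training tokens times a per-1K-token rate.
# Check OpenAI's current pricing; the figure below is only a placeholder.
import json

import tiktoken

PRICE_PER_1K_TOKENS = 0.0006  # placeholder - substitute the current Babbage training rate
enc = tiktoken.get_encoding("r50k_base")  # tokenizer used by the GPT-3 base models

total_tokens = 0
with open("emphasis.jsonl") as f:
    for line in f:
        example = json.loads(line)
        total_tokens += len(enc.encode(example["prompt"])) + len(enc.encode(example["completion"]))

print(f"{total_tokens} training tokens, about ${total_tokens / 1000 * PRICE_PER_1K_TOKENS:.2f} per epoch")
```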
- Move (or copy) the `openaipythoninterface.scpt` file to `~/Library/Application Scripts/com.microsoft.Word/`.
- Open it in Script Editor and replace the path on line 16 with the path to `completion.py` in this directory.
- Install the macros in `macros.bas` to your desired location in Word. This can be the Verbatim template, your Normal template, or individual documents for testing.
  - You can also associate keyboard shortcuts with each macro for ease of use.
- To use a macro, place your cursor in the tag of a card and run it. A few notes:
- It may be relatively slow on long cards, particularly for underlining. You will not be able to use Word while the macro is running.
- The formatting macros do better on consistently formatted and cleanly cited cards. In particular, they do not handle two-line cites or paragraphs of text between the tag and the cite well.
- Though the emphasis/highlighting algorithms do their best to highlight only previously underlined words, earlier matches may sometimes be highlighted instead. This is particularly evident with acronyms: highlighting "ai" in "artificial intelligence" will often highlight the first "i" inside "artificial" rather than the first letter of "intelligence" (see the sketch after these notes).
- Save frequently – random crashes are possible.
- This will not work on the very last card in a document.
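Here is a minimal illustration of why the acronym issue happens; this is not the repository's actual matcher, just a sketch of the letter-by-letter matching behavior described above:

```python
# Matching acronym letters one at a time grabs the next occurrence of each
# letter, so "ai" lands on the "a" and the first "i" of "artificial" rather
# than on the initials of both words.
text = "artificial intelligence"
position = 0
for letter in "ai":
    position = text.index(letter, position)
    print(f"{letter!r} matched at index {position}")
    position += 1
```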
You can also test the models from the command line:

```
python finetune.py test -m underline
```
Known issues and planned improvements:

- Write installation instructions and requirements.txt
- Fix highlight bug where un-underlined text found earlier is highlighted
- Make it so you can only highlight the first letter individually.
- Deprecate substring_locations
- Remove path specific log files
- Weird Unicode encoding bug messes up parsing sometimes
- Generate highlight/underline/emphasis with one script