Build embeddings with the load data script. #4402

Open
wants to merge 3 commits into
base: master
Changes from 2 commits
46 changes: 39 additions & 7 deletions custom_dc/load_data.sh
@@ -81,10 +81,6 @@ function setup_python {
"https://download.pytorch.org/whl/cpu"
echo_log "Installing Python requirements from $embeddings_req"
run_cmd pip3 install -r "$embeddings_req"
# TODO: remove install once embeddings doesn't need nl_server/requirements.txt
nlserver_req="$WEBSITE_DIR/nl_server/requirements.txt"
echo_log "Installing Python requirements from $nlserver_req"
run_cmd pip3 install -r "$nlserver_req"
fi
fi
}
@@ -223,9 +219,45 @@ function generate_embeddings {
echo_log "Building embeddings for sentences in $NL_DIR"
local cwd="$PWD"
cd "$WEBSITE_DIR"
# TODO: Enable with new build_embeddings.py
# run_cmd python -m tools.nl.embeddings.build_custom_dc_embeddings \
# --input_file_path="$NL_DIR/sentences.csv" --output_dir="$NL_DIR"

NL_EMBEDDINGS_DIR="$NL_DIR/embeddings"
EMBEDDINGS_PATH="$NL_EMBEDDINGS_DIR/embeddings.csv"
CUSTOM_EMBEDDING_INDEX="user_all_minilm_mem"
CUSTOM_MODEL="ft-final-v20230717230459-all-MiniLM-L6-v2"
CUSTOM_MODEL_PATH="gs://datcom-nl-models/ft_final_v20230717230459.all-MiniLM-L6-v2"
CUSTOM_CATALOG_DICT=$(cat <<EOF
{
"version": "1",
"indexes": {
"$CUSTOM_EMBEDDING_INDEX": {
"store_type": "MEMORY",
"source_path": "$NL_DIR",
"embeddings_path": "$EMBEDDINGS_PATH",
"model": "$CUSTOM_MODEL"
}
},
"models": {
"$CUSTOM_MODEL": {
"type": "LOCAL",
"usage": "EMBEDDINGS",
"gcs_folder": "$CUSTOM_MODEL_PATH",
"score_threshold": 0.5
}
}
}
EOF
)
local start_ts=$(date +%s)
set -x
python -m tools.nl.embeddings.build_embeddings \
Contributor Author

@ajaits - The run_cmd function did not work for argument values containing spaces, so I am calling the script directly here. Let me know if there's a better way, or whether we can go with this.

Contributor

@ajaits ajaits Jun 26, 2024

If there are no spaces within the JSON values in CUSTOM_CATALOG_DICT, you can try:

cmd=(python -m tools.nl.embeddings.build_embeddings)
cmd+=(--embeddings_name "$CUSTOM_EMBEDDING_INDEX")
cmd+=(--output_dir "$NL_EMBEDDINGS_DIR")
cmd+=(--catalog "$(echo $CUSTOM_CATALOG_DICT | sed -e 's/ //g')")
run_cmd "${cmd[@]}"
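
Whether an array like the one suggested above reaches the Python script intact depends on how it is expanded and on how run_cmd forwards its arguments. The sketch below uses a hypothetical stand-in, run_cmd_sketch (the real run_cmd is not shown in this diff, so its behavior here is an assumption), and a hypothetical show_args helper to compare quoted and unquoted array expansion for a value that contains spaces.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for run_cmd; the real helper in load_data.sh may differ.
run_cmd_sketch() {
  # Log the command to stderr, then run it.
  echo "Running: $*" >&2
  # "$@" expands to one word per original argument, so embedded spaces survive.
  "$@"
}

# Hypothetical helper: prints the number of arguments, then one argument per line.
show_args() {
  echo "$#"
  printf '%s\n' "$@"
}

# A value with spaces, similar in shape to a JSON catalog string.
catalog='{"version": "1", "indexes": {}}'
cmd=(show_args --catalog "$catalog")

# Quoted expansion keeps the JSON string as a single argument.
quoted_count=$(run_cmd_sketch "${cmd[@]}" 2>/dev/null | head -n 1)
# Unquoted expansion lets the shell split the JSON string on spaces.
unquoted_count=$(run_cmd_sketch ${cmd[@]} 2>/dev/null | head -n 1)

echo "quoted: $quoted_count args, unquoted: $unquoted_count args"
```

If the real run_cmd already forwards "$@" unchanged, quoting the array at the call site may be enough to avoid stripping spaces from the catalog with sed.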

--embeddings_name "$CUSTOM_EMBEDDING_INDEX" \
--output_dir "$NL_EMBEDDINGS_DIR" \
--catalog "$CUSTOM_CATALOG_DICT" >> $LOG 2>&1
set +x
status=$?
Contributor

Move this before set +x, right after the python command, so that status holds the exit code of python (not of set).

Contributor Author

Thanks - done.
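
The ordering issue raised in this thread can be reproduced in isolation. The following is a minimal sketch, assuming a hypothetical failing_step function in place of the python build command; it reads $? once after set +x and once directly after the command.

```shell
#!/usr/bin/env bash
# This demo runs a command that fails on purpose, so keep errexit off.
set +e

# Hypothetical stand-in for the python build step; always exits with code 3.
failing_step() {
  return 3
}

# Order shown in the diff: $? is read after "set +x", so it reports "set".
set -x
failing_step
set +x
status_after_set=$?

# Suggested order: capture $? right after the command, then disable tracing.
set -x
failing_step
status_direct=$?
set +x

echo "after set: $status_after_set, direct: $status_direct"
```

With the first ordering the failure is masked, because set +x itself succeeds and overwrites $? with 0, so a check such as [[ "$status" == "0" ]] would pass even when the build failed.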

local duration=$(( $(date +%s) - $start_ts))
[[ "$status" == "0" ]] || echo_fatal "Failed to build embeddings"
echo_log "Completed building embeddings with status:$status in $duration secs"
cd "$cwd"
}

7 changes: 6 additions & 1 deletion tools/nl/embeddings/requirements.txt
@@ -6,4 +6,9 @@ google-cloud-storage==2.15.0
lancedb==0.6.8
parameterized==0.8.1
sentence-transformers==2.2.2
torchvision==0.17.2
torchvision==0.17.2
# Downloading the named-entity recognition (NER) library spacy and the large EN model
# using the guidelines here: https://spacy.io/usage/models#production
# TODO: try using the large model
-f https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl
Contributor Author

@shifucun - Needed to add this to the embeddings requirements to eliminate installing the nl_server requirements. Let me know if this is ok.

Contributor

I was even thinking of getting rid of the lib here and making a shared requirements.txt. Duplicating the lib makes the versions diverge very easily. I saw a bug caused by this before.
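
Until a shared file exists, divergence between duplicated pins could be caught with a small check. The sketch below is only an illustration: diverging_pins is a hypothetical helper, the two sample files are made up for the demo (they are not the real embeddings or nl_server requirements), and only plain name==version lines are compared.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print packages pinned to different versions in two
# requirements files. Only simple "name==version" lines are compared;
# comments, blank lines and pip options such as "-f <url>" are skipped.
diverging_pins() {
  awk -F'==' '
    /^[ \t]*$/ || /^[ \t]*[#-]/ { next }  # blank lines, comments, pip options
    NF != 2 { next }                      # anything that is not name==version
    FNR == NR { first[$1] = $2; next }    # remember pins from the first file
    ($1 in first) && first[$1] != $2 { print $1 ": " first[$1] " vs " $2 }
  ' "$1" "$2"
}

# Illustrative sample files only; the versions below are invented for the demo.
tmp_dir=$(mktemp -d)
cat > "$tmp_dir/embeddings.txt" <<'EOF'
# sample pins
sentence-transformers==2.2.2
en_core_web_sm==3.7.1
EOF
cat > "$tmp_dir/nl_server.txt" <<'EOF'
sentence-transformers==2.2.2
en_core_web_sm==3.6.0
EOF

report=$(diverging_pins "$tmp_dir/embeddings.txt" "$tmp_dir/nl_server.txt")
echo "$report"
rm -rf "$tmp_dir"
```

A shared file would remove the need for such a check altogether: pip requirements files can pull in another file with a -r line, so both requirements.txt files could reference one common list of pins.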

en_core_web_sm==3.7.1