cluster by embedding #70
9ed42e9 to 3febab8
elif args.encoding == "embedding":
    model = SentenceTransformer(args.model)

# load the dataset
Remove this comment.
# load the dataset


def get_orca_dataset():
Can you create one function, get_dataset(dataset_name)?
def get_flan_dataset():

    flan = FlanModule(
For this you don't need to use the module; just load_dataset(dataset_name) should work.
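A rough sketch of what that single function could look like, assuming the Hugging Face `datasets` package is installed. The `split` parameter and the injectable `loader` argument are additions for illustration only (the latter just so the function can be exercised without downloading anything); neither is in the PR.

```python
from typing import Any, Callable, Optional


def get_dataset(
    dataset_name: str,
    split: str = "train",
    loader: Optional[Callable[..., Any]] = None,
) -> Any:
    """Load a dataset by name instead of having one helper per dataset.

    `loader` is a hypothetical hook; when omitted, the Hugging Face
    `datasets.load_dataset` function is used.
    """
    if loader is None:
        # Imported lazily so the module can be imported without `datasets`.
        from datasets import load_dataset

        loader = load_dataset
    return loader(dataset_name, split=split)
```

With something like this, `get_orca_dataset()` and `get_flan_dataset()` could collapse into `get_dataset(dataset_name)` calls with the appropriate dataset name.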
    enumerate(train_dataloader), total=len(train_dataloader), desc="dataset"
):
    if "source" in batch:
        embedding = get_text_encode(batch["source"], model)
Can you add a command-line argument that specifies which field of the dataset to use?
--text_column_name "source"
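A minimal sketch of how the flag could be wired in with `argparse`. The default of `"source"` mirrors the current hard-coded field; the helper names `build_parser` and `get_batch_text` are hypothetical and only there to keep the example self-contained.

```python
import argparse
from typing import Any, Mapping


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="cluster by embedding")
    parser.add_argument(
        "--text_column_name",
        type=str,
        default="source",  # assumed default: matches the field used today
        help="Name of the dataset field whose text is encoded.",
    )
    return parser


def get_batch_text(batch: Mapping[str, Any], text_column_name: str) -> Any:
    """Return the selected field of a batch, failing loudly if it is missing."""
    if text_column_name not in batch:
        raise KeyError(f"Column {text_column_name!r} not found in batch")
    return batch[text_column_name]
```

Inside the loop, the lookup would then become `get_text_encode(get_batch_text(batch, args.text_column_name), model)` instead of reading `batch["source"]` directly.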
Add instruction clustering using sentence-transformer encoding.
https://huggingface.co/blog/mteb