cluster by embedding #70

Closed
wants to merge 2 commits into from

shuishen112
Collaborator

Add instruction clustering using sentence-transformer encodings.

https://huggingface.co/blog/mteb

matheper added a commit that referenced this pull request Jul 31, 2024
shuishen112 force-pushed the get_cluster_by_embedding branch from 9ed42e9 to 3febab8 on August 4, 2024 14:14
elif args.encoding == "embedding":
    model = SentenceTransformer(args.model)

# load the dataset
Member

Remove this comment.

# load the dataset


def get_orca_dataset():
Member

Can you create a single function, get_dataset(dataset_name)?


def get_flan_dataset():

    flan = FlanModule(
Member

You don't need the module for this; calling load_dataset(dataset_name) directly should work.
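A minimal sketch of what the two suggestions above could look like together, assuming the datasets involved can be loaded by name with `load_dataset` from the Hugging Face `datasets` library. The `split` and `loader` parameters are illustrative assumptions and not part of this PR; `loader` exists only so the function can be exercised without network access.

```python
from typing import Any, Callable, Optional


def get_dataset(
    dataset_name: str,
    split: str = "train",  # assumption: the PR does not say which split is used
    loader: Optional[Callable[..., Any]] = None,  # hypothetical hook for offline testing
) -> Any:
    """Load a dataset by name instead of using one function per dataset."""
    if loader is None:
        # Imported lazily so the module can be imported without `datasets` installed.
        from datasets import load_dataset

        loader = load_dataset
    return loader(dataset_name, split=split)
```

With something like this, get_orca_dataset() and get_flan_dataset() could both collapse into calls to get_dataset with the appropriate dataset name.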

    enumerate(train_dataloader), total=len(train_dataloader), desc="dataset"
):
    if "source" in batch:
        embedding = get_text_encode(batch["source"], model)
Member

Can you add a command-line argument that specifies which field of the dataset to use?

--text_column_name "source"
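One possible shape for this, sketched with argparse. The flag name and the "source" value come from the comment above; the default value, the help text, and the `build_parser` and `select_text` helpers are hypothetical and not taken from the PR.

```python
import argparse
from typing import Any, Mapping


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the text column as a command-line option."""
    parser = argparse.ArgumentParser(description="Cluster instructions by embedding.")
    parser.add_argument(
        "--text_column_name",
        type=str,
        default="source",  # assumption: keeps the currently hard-coded field as default
        help="Name of the dataset field whose text is encoded.",
    )
    return parser


def select_text(batch: Mapping[str, Any], text_column_name: str) -> Any:
    """Return the configured column from a batch, failing loudly if it is absent."""
    if text_column_name not in batch:
        raise KeyError(f"column {text_column_name!r} not found in batch")
    return batch[text_column_name]
```

The loop body could then pass select_text(batch, args.text_column_name) to get_text_encode instead of batch["source"]. Raising on a missing column is a design choice of this sketch; the existing `if "source" in batch` check skips such batches silently.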

sordonia closed this Sep 5, 2024
sordonia deleted the get_cluster_by_embedding branch February 13, 2025 20:50
2 participants