Sindhi Sentence Embedding #14138

ronit450 · 2024-01-17T16:33:15Z

Hello Everyone,
I am working on a project where I need sentence-level embeddings for Sindhi. For this I am using the available pretrained Word2vec model, as described in the sample code. The sample code only covers word-level embeddings, whereas I want an embedding for the entire sentence; any pooling strategy, such as averaging, would do. However, I am facing issues in my pipeline:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Use WordEmbeddings instead of WordEmbeddingsModel
word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "sd") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

# Use SentenceEmbeddings for obtaining sentence embeddings
sentence_embeddings = SentenceEmbeddings() \
    .setInputCols(["document", "word_embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, word_embeddings, sentence_embeddings])

data = spark.createDataFrame([["مون کي اسپارڪ اين ايل پي سان پيار آهي"]]).toDF("text")

result = pipeline.fit(data).transform(data)

# Extract the final embeddings
sentence_embeddings = result.select("sentence_embeddings.result").first()[0]
print(sentence_embeddings)

The error is:

maziyarpanahi · 2024-01-17T17:38:08Z · Maintainer

Hi @ronit450
One of the input columns of your SentenceEmbeddings annotator doesn't exist: .setInputCols(["document", "word_embeddings"])
The output of WordEmbeddingsModel is called embeddings, so you should either rename that output to word_embeddings, or change the input in SentenceEmbeddings to embeddings to match. (The output of one annotator is fed into the next annotator.)
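
The wiring rule described here (every input column of an annotator must be produced by an earlier stage) can be checked before fitting. What follows is a minimal pure-Python sketch, not a Spark NLP API: `find_missing_inputs` and the `(name, inputs, output)` tuples are hypothetical stand-ins for the stages in the question, with the column names copied from the posted pipeline.

```python
def find_missing_inputs(stages, initial_columns):
    """Return (stage_name, missing_column) pairs for inputs no earlier stage produces.

    `stages` is a list of (name, input_columns, output_column) tuples.
    Hypothetical helper for illustration only; it does not inspect real annotators.
    """
    available = set(initial_columns)
    problems = []
    for name, inputs, output in stages:
        for col in inputs:
            if col not in available:
                problems.append((name, col))
        # The stage's output becomes available to every later stage.
        available.add(output)
    return problems


# Column wiring as posted in the question.
question_stages = [
    ("DocumentAssembler", ["text"], "document"),
    ("Tokenizer", ["document"], "token"),
    ("WordEmbeddingsModel", ["document", "token"], "embeddings"),
    ("SentenceEmbeddings", ["document", "word_embeddings"], "sentence_embeddings"),
]
print(find_missing_inputs(question_stages, ["text"]))
# [('SentenceEmbeddings', 'word_embeddings')]

# Same wiring with the SentenceEmbeddings input renamed to match.
fixed_stages = question_stages[:3] + [
    ("SentenceEmbeddings", ["document", "embeddings"], "sentence_embeddings"),
]
print(find_missing_inputs(fixed_stages, ["text"]))
# []
```

Either rename works; the only requirement is that the two names agree.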

9 replies

ronit450 Jan 17, 2024
Author

This actually gives me the same Sindhi text as output:

sentence_embeddings = result.select("sentence_embeddings.result").first()[0]

print(sentence_embeddings)

maziyarpanahi Jan 17, 2024
Maintainer

Thanks for sharing the notebook. My last question would be: what is the end use case? Classification? Similarity? A vector database? That way I can share the relevant example notebooks we have with you.

ronit450 Jan 17, 2024
Author

So the end use case is classifying different poets based on their style and genre of poetry. Similar work has been done for English, German, and other languages, but not for Sindhi. This way we can preserve the historical context of Sindhi culture.

maziyarpanahi Jan 17, 2024
Maintainer

So you have examples for classifications here:

The training for text classification (multi-class/multi-label) happens inside Spark NLP, and what comes out of SentenceEmbeddings is compatible with those annotators. (Also, the actual embeddings are inside sentence_embeddings.embeddings, if you want to look at them.)

Following these notebooks you should be able to train a classifier either in Spark ML or Spark NLP.
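
To make the difference between `result` and `embeddings` concrete, here is a minimal pure-Python sketch of average pooling. It does not call Spark NLP: the `average_pool` helper, the toy 3-dimensional vectors, and the dictionary layout are assumptions for illustration, and only the field names `result` and `embeddings` come from the discussion above.

```python
def average_pool(token_vectors):
    """Average a list of equal-length word vectors into one sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(v) != dim for v in token_vectors):
        raise ValueError("all token vectors must have the same length")
    count = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / count for i in range(dim)]


# Toy 3-dimensional vectors standing in for the 300-dimensional word vectors;
# one vector per token of the four-word sentence below.
sentence = "مون کي پيار آهي"
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 4.0],
    [2.0, 2.0, 2.0],
]

# Hypothetical stand-in for one sentence annotation: the text sits in
# `result`, the pooled vector sits in `embeddings`.
annotation = {"result": sentence, "embeddings": average_pool(token_vectors)}

print(annotation["result"])      # the original text, not numbers
print(annotation["embeddings"])  # [2.0, 2.0, 2.0]
```

This is why selecting `sentence_embeddings.result` prints the Sindhi sentence back, while `sentence_embeddings.embeddings` is the field that holds the vector.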

ronit450 Jan 24, 2024
Author

Hello Maziyar!
I am very grateful for your kind help, but I am stuck and need your help again. I have tried the available classification examples, but they are not what I want. I am using DL models such as Flatten, GRU, multi-layer perceptron, LSTM, and Bi-directional. With these I am not getting the desired accuracy: the maximum accuracy I get is 28%, which is not what I require. Is there any other way I can work with these embeddings, or any other example I can look at?
