Need some testers #11
I would be delighted to! I am a great fan of your work. I wrote the blog post you mention in response to a question on the RStudio forum about the best tokenizer for keras (https://community.rstudio.com/t/what-is-the-best-tokenizer-to-be-used-for-keras/) because I honestly believe it really is the best tokenizer for keras :) I am not familiar (yet) with the Starspace methodology. |
All credit regarding udpipe should really go to @foxik, who created the UDPipe C++ library. No doubt about it, he tuned the model well for Czech. |
I need a bit of time to go through the embed_tagspace feature of ruimtehol. It is fast, both in the sense that it trains faster than Keras and in the sense that it is faster to set up a model, as it does not require extensive preprocessing. Thumbs up for that. But I need to develop some toy example to benchmark it against Keras for accuracy; this will take some time. I will report back once I am ready, but I can already tell that I am intrigued :) J. |
Take your time, it also took me some time to get familiar with the Starspace methodology and the training parameters and format. |
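For reference, training a tagspace model with ruimtehol looks roughly like the sketch below. The functions (`embed_tagspace`, `predict`, `starspace_embedding`) come from the ruimtehol package; the hyperparameter values and the `texts`/`labels` objects are illustrative assumptions, not tuned settings.

```r
## Minimal sketch of a supervised tagspace model with ruimtehol.
## `texts` is a character vector, `labels` the label(s) per text.
library(ruimtehol)

model <- embed_tagspace(x = tolower(texts), y = labels,
                        dim = 50, epoch = 20, lr = 0.01,
                        loss = "softmax", adagrad = TRUE,
                        similarity = "cosine", negSearchLimit = 50,
                        ngrams = 2, minCount = 2)

## Top 3 predicted labels for a new text
predict(model, "some new text to categorise", k = 3)

## Document embedding, usable as input to other models
emb <- starspace_embedding(model, "some new text to categorise")
```

Note that no vocabulary building or padding step is needed beforehand, which is what makes the setup faster than a comparable Keras pipeline.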
Hi @systats
I have a political dataset of questions/answers in the Belgian parliament at https://github.com/bnosac/ruimtehol/raw/master/inst/extdata/dekamer-2014-2018.RData |
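Getting that dataset into an R session is a one-liner; a sketch, assuming the object stored inside the .RData file is called `dekamer` (as in the ruimtehol package data):

```r
## Download and load the parliament questions/answers dataset
con <- url("https://github.com/bnosac/ruimtehol/raw/master/inst/extdata/dekamer-2014-2018.RData")
load(con)   # assumed to create an object named `dekamer`
close(con)

str(dekamer)  # inspect the available text and label columns
```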
Hi guys: sounds very good to gather the collective wisdom on NLP. Altough 9000 tweets seem to be a little bit few data at least for keras. But sure I will look into the project. |
@systats If you have other datasets which you think are interesting, let us know. |
Honestly I don't see value in this exercise. For text classification
we can almost always get near-maximum accuracy using
bag-of-words/bag-of-ngrams and logistic regression.
--
Regards
Dmitriy Selivanov
|
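The bag-of-ngrams plus logistic regression baseline referred to above can be sketched with text2vec and glmnet; `texts` and `y` are placeholder objects, and the n-gram range is an illustrative choice.

```r
## Hedged sketch of a bag-of-words/bag-of-ngrams baseline.
library(text2vec)
library(glmnet)

it    <- itoken(texts, preprocessor = tolower, tokenizer = word_tokenizer)
vocab <- create_vocabulary(it, ngram = c(1L, 2L))  # unigrams + bigrams
dtm   <- create_dtm(it, vocab_vectorizer(vocab))

## Cross-validated regularised logistic regression on the sparse dtm
fit <- cv.glmnet(x = dtm, y = y, family = "binomial", type.measure = "auc")
```

This is usually the first model worth running on any new text classification dataset, which is the point being made.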
Just to clarify: I believe there are many interesting things we can do with
the current set of NLP-related packages in R. Many thanks to Jan! But text
classification is really the most boring of them and can be considered "solved".
--
Regards
Dmitriy Selivanov
|
That is certainly what is coming out of the above exercise, and it is always the first approach I try out. But I'm not 100% sure that statement holds, especially for larger datasets and for multi-label classification. |
Sure, there are probably too many. But it strongly depends on what you are personally interested in. |
I too think classification and regression are well documented everywhere (in other languages), but some specific problems remain difficult (unbalanced data / multi-outcome). I'd love to work on projects involving Seq2Seq models (text generation, summarization, autoencoders for topic models), but there is little R documentation out there. I thought it might be useful to build a unifying wrapper for all first-party algorithms in R (like RTextTools). This would enable us to benchmark (purrr) all the datasets and models at once. |
FWIW, Starspace handles multi-outcome unbalanced classification data pretty gracefully. I'm in if you want to compare a seq2seq model to a combination of textrank with some embedding models (either starspace/fasttext or glove) in order to summarise data. Do you have some data in mind which would make this a nice exercise? |
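The textrank side of that comparison would look roughly like this sketch. The assumed inputs are a `sentences` data.frame (columns `textrank_id`, `sentence`) and a `terminology` data.frame (columns `textrank_id`, `term`, e.g. lemmas from udpipe), following the textrank package conventions.

```r
## Extractive summarisation via PageRank over sentence similarity
library(textrank)

tr <- textrank_sentences(data = sentences, terminology = terminology)
summary(tr, n = 3)  # the 3 most central sentences as the summary
```

A seq2seq model would instead generate new text, so the comparison is extractive versus abstractive summarisation.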
Generally, I am interested in all the topics you mentioned. Some weeks ago I tried to build a Keras summariser, but the predictions were bad. From my point of view, Keras is not the best suited library for this task, because one has to iterate over whole parts of a model per token. textgenrnn might be easier for predicting next words in a sequence.

At the moment I have to write a paper about insights from large collections of publications (N > millions) regarding a specific research field. Through unsupervised document clustering, topicmodels (too slow) / autoencoders (very complex), I hope to recover hidden structures in the corpus. Do you think starspace allows me to map the documents, based on their vocabulary, to a latent sub-space [10 <= k <= 30] that I can access and work with, e.g. for predictions on new data? There is a package for Keras autoencoders in R, ruta, which does exactly that, but I have to figure out how to adapt the model to text input. Maybe you guys have some suggestions on this topic.

Finally, I added three models which perform okay on the celebrity-faceoff dataset, like yours. And there are hundreds more to discover. Our results strongly depend on tokens that appear only once in the corpus; do the models therefore overfit massively? As supervised learning is still the backbone of most NLP applications, this effort could be scaled. We could standardize model inputs/interfaces to easily run grid searches or special optimizers for hyperparameter selection. This would also be beneficial for automatically benchmarking models on a variety of datasets (comparing all R packages?). This article suggests that the choice of dataset strongly matters for how a specific model performs (maybe starspace slightly underperforms only on small datasets?). Maybe you could provide pretrained word embeddings for keras? |
Even more general (and better) than that: https://github.com/systats/tidykeras |
About the collection of publications: I haven't tried using the Starspace embeddings for clustering, but it can give embeddings of words as well as sentences, paragraphs, full documents or authors, which you can then cluster with e.g. DBSCAN or visualise with the projector R package. |
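The clustering idea could be sketched as follows; `model` stands for an already trained ruimtehol/Starspace model and `docs` for a character vector of documents, and the `eps`/`minPts` values are illustrative and would need tuning:

```r
## Hedged sketch: cluster Starspace document embeddings with DBSCAN
library(ruimtehol)
library(dbscan)

emb <- starspace_embedding(model, docs, type = "document")
cl  <- dbscan(emb, eps = 0.5, minPts = 5)

table(cl$cluster)  # cluster 0 contains the noise points
```

Because DBSCAN does not require choosing the number of clusters upfront, it fits the exploratory goal of recovering hidden structures in the corpus.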
Thanks for your kind words & for your work on the celebrity-faceoff repo. I managed to have only a quick look at it, and it looked inspiring indeed. If you do not mind, I will rework it from Rmd format to plain R code to keep it more in line with the other sample data. Unfortunately it seems (with the benefit of hindsight) that my choice of dataset was not a particularly good one. As intriguing as the idea of keeping up with the Kardashians seemed, the sample was just too small. With regard to using the StarSpace algorithm on millions of documents: this seems interesting and could work. I was amazed by the speed with which ruimtehol / StarSpace worked, especially since I was used to working with Keras. Totally different ballpark!
|
Are starspace embeddings trained unsupervised? And do you think they provide a meaningful latent space (compared to topic proportions in topicmodels)?
Personally I'm interested in social science literature, which comes partly from Scopus (a ton of labels and metadata) and a free DOI dump (for abstracts). Of course you can reshape the file to your needs. What's next for the celebrity-faceoff repo? |
They can be trained supervised, unsupervised or semi-supervised. I haven't compared them to LDA/BTM models yet; currently I have used them only for document proposals and multi-label classification. |
I will of course look into it in the coming weeks. Thanks so far; your project inspired this: https://github.com/systats/textlearnR . I'm going to focus on this lightweight R package for benchmarking NLP models on a range of datasets, hoping not to reinvent the wheel. If you think I am, please let me know. |
I am open to advice in this matter. My current plan is to consolidate the various scripts into a more uniform format and publish a short summary post on some blog or other. Even though the dataset proved too small to let the neural models truly shine, I believe it could serve as a nice showcase of possible approaches.
|
I think there is certainly some value in comparing different approaches and best practices across different packages. I know there is https://github.com/tidymodels/textrecipes by @EmilHvitfeldt which tries to standardise NLP data preparation but mainly for non-neural supervised approaches. I'm not aware of similar neural tuning/comparison approaches on the R side. |
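As a concrete illustration, textrecipes standardises the preprocessing step like this; the column names `label`/`text`, the data.frame `df` and the token limit are assumptions for the sketch:

```r
## Standardised text preprocessing with recipes + textrecipes
library(recipes)
library(textrecipes)

rec <- recipe(label ~ text, data = df) %>%
  step_tokenize(text) %>%                     # split into tokens
  step_tokenfilter(text, max_tokens = 1000) %>%  # keep the 1000 most frequent
  step_tf(text)                               # term-frequency features

prepped <- prep(rec)
baked   <- bake(prepped, new_data = NULL)  # the model-ready feature matrix
```

The resulting features feed naturally into non-neural learners; an equivalent standard interface for the neural side is what seems to be missing.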
I am looking for testers to try out the package and compare it to other approaches. Inviting @jlacko: I see you used keras in this blog post
http://www.jla-data.net/eng/vocabulary-based-text-classification/ and it would be nice to compare it to the embed_tagspace model. What do you think?