
Need some testers #11

Open
jwijffels opened this issue Jan 29, 2019 · 22 comments

@jwijffels
Contributor

I am looking for testers to try out the package and compare it to other approaches. Inviting @jlacko: I see you used keras in this blog post
http://www.jla-data.net/eng/vocabulary-based-text-classification/ and it would be nice to compare it to the embed_tagspace model. What do you think?
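
To give an idea of the interface, something like this is all it takes (a rough sketch; `df` with `text` and `label` columns just stands in for the data from your blog post):

```r
library(ruimtehol)
## rough sketch: df is assumed to be a data.frame with columns text and label
model <- embed_tagspace(x = tolower(df$text), y = df$label,
                        dim = 50, epoch = 20, loss = "softmax",
                        ngrams = 2, minCount = 2)
## top 3 label predictions for a new text
predict(model, tolower("some new text to categorise"), k = 3)
```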

@jlacko

jlacko commented Jan 29, 2019

I would be delighted to!

I am a great fan of your udpipe package. The lemmatization feature is a godsend for people doing text analysis in languages with complicated grammatical inflections (like Czech). English speaking people have it easy.

I wrote the blog post you mention in response to a question on the RStudio forum about the best tokenizer for keras (https://community.rstudio.com/t/what-is-the-best-tokenizer-to-be-used-for-keras/) because I honestly believe it really is the best tokenizer for keras :)

I am not familiar (yet) with the ruimtehol package, but I will familiarize myself with it and give you feedback / suggestions on what I find.

@jwijffels
Contributor Author

All credit regarding udpipe should really go to @foxik, who created the UDPipe C++ library. No doubt he tuned the model well for Czech.
Looking forward to seeing your feedback on the ruimtehol-to-keras comparison. FYI, there is more info on the ruimtehol package, which wraps Starspace, at
http://www.bnosac.be/index.php/blog/86-neural-text-modelling-with-r-package-ruimtehol

@jlacko

jlacko commented Jan 31, 2019

I need a bit of time to go through the embed_tagspace feature of the ruimtehol package. My first impression is that it is fast.

Both in the sense that it trains faster than Keras and in the sense that it is faster to set up a model, as it does not require extensive preprocessing. Thumbs up for that.

But I need to develop a toy example to benchmark it against Keras for accuracy; this will take some time.

I will report back once I am ready, but I can already tell that I am intrigued :)

J.

@jwijffels
Contributor Author

Take your time, it also took me some time to get familiar with the Starspace methodology and the training parameters and format.
I have only trained on larger datasets myself. My starting advice for a toy example dataset is that it should contain at least 25,000 text records.

@jwijffels
Contributor Author

jwijffels commented Feb 19, 2019

Hi @systats
Are you interested in chiming in on this exercise comparing different text modelling approaches? @jlacko made a repository for a basic, balanced 6-label classification problem at https://github.com/jlacko/celebrity-faceoff
Results from that setup were basically:

- penalised multinomial regression with glmnet: 74.39%
- naive bayes model with quanteda: 75.78%
- fasttext: 74.11%
- ruimtehol (Starspace)
  - vanilla: 71.39%
  - tuned: 73.28%
  - with transfer learning: 75.11%
- keras
  - Single LSTM: 75.83%
  - Stacked LSTM: 76.44%

I have a political dataset of questions/answers in the Belgian parliament at https://github.com/bnosac/ruimtehol/raw/master/inst/extdata/dekamer-2014-2018.RData
Would love to see other approaches and results popping up, comparing different techniques on different types of datasets (preferably larger volumes of text), supervised as well as unsupervised. I've made an overview of NLP techniques at http://bnosac.be/index.php/blog/87-an-overview-of-the-nlp-ecosystem-in-r-nlproc-textasdata.
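
To get started quickly with that file (I haven't hard-coded the object name here; load() tells you what is inside):

```r
## download the .RData file and load it; load() returns the name(s)
## of the object(s) it restored into the workspace
tmp <- tempfile(fileext = ".RData")
download.file("https://github.com/bnosac/ruimtehol/raw/master/inst/extdata/dekamer-2014-2018.RData",
              destfile = tmp, mode = "wb")
objects <- load(tmp)
str(get(objects[1]))
```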

@systats

systats commented Feb 25, 2019

Hi guys: gathering the collective wisdom on NLP sounds very good. Although 9,000 tweets seems like rather little data, at least for keras. But sure, I will look into the project.

@jwijffels
Contributor Author

@systats If you have other datasets which you think are interesting, let us know.

@dselivanov

dselivanov commented Feb 25, 2019 via email

@dselivanov

dselivanov commented Feb 25, 2019 via email

@jwijffels
Contributor Author

jwijffels commented Feb 25, 2019

That is certainly what is coming out of the above exercise, and it's always the first approach I try out. But I'm not 100% sure about that statement, especially for larger datasets and for multi-label classification.
Either way, I would mainly be interested in some embedding comparisons, because general entity embeddings are what this Starspace model provides. If you have ideas on datasets & approaches, let me know.

@systats

systats commented Feb 25, 2019

Sure, there are probably too many.

But it strongly depends on what you are personally interested in.

@systats

systats commented Feb 25, 2019

I too think classification and regression are well documented (in other languages) everywhere, but some specific problems remain difficult (unbalanced data / multi-outcome). I'd love to work on projects involving Seq2Seq models (text generation, summarization, autoencoders for topic models), but there is little R documentation out there.

I thought it might be useful to build a unifying wrapper for all first-party algorithms in R (like RTextTools). This would enable us to benchmark (with purrr) all the datasets and models at once.

@jwijffels
Contributor Author

jwijffels commented Feb 25, 2019

FWIW, Starspace handles multi-outcome, unbalanced classification data pretty gracefully.
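
For the multi-label case, y can simply be a list with one or several labels per text. A quick sketch with toy placeholder data:

```r
library(ruimtehol)
## multi-label sketch: y is a list, texts can have one or several labels
txt    <- c("parliament debates the budget",
            "the match ended in a draw")
labels <- list(c("politics", "economy"), "sports")
model <- embed_tagspace(x = txt, y = labels, dim = 10, epoch = 5)
predict(model, "budget talks continue", k = 2)
```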

I'm in if you want to compare a seq2seq model to a combination of textrank with some embedding models (either starspace/fasttext or glove) in order to summarise data; see the sketch below for the embedding side. Do you have some data in mind that would make this a nice exercise?
Or I don't mind doing the exercise of using starspace in a similar fashion as an autoencoder, to provide labels/categories to images.
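
Roughly what I mean by the embedding-based summarisation (a sketch only; `model` stands for a trained Starspace model and `sentences` for the sentences of one document):

```r
library(ruimtehol)
## extractive summary: rank sentences by cosine similarity to the
## embedding of the full document and keep the top ones
emb_sentences <- starspace_embedding(model, sentences)
emb_document  <- starspace_embedding(model, paste(sentences, collapse = " "))
similarities  <- embedding_similarity(emb_sentences, emb_document, type = "cosine")
summary_sentences <- sentences[order(similarities, decreasing = TRUE)[1:3]]
```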

@systats

systats commented Feb 26, 2019

Generally, I am interested in all the topics you mentioned. Some weeks ago I tried to build a Keras summariser, but the predictions were bad. From my point of view, Keras is not the library best suited for this task, because one has to iterate whole parts of a model per token. textgenrnn might be easier for predicting next words in a sequence.

At the moment I have to write a paper about insights from large collections of publications (N in the millions) regarding a specific research field. Through unsupervised document clustering, via topicmodels (too slow) or autoencoders (very complex), I hope to recover hidden structures in the corpus. Do you think starspace allows me to map the documents, based on their vocabulary, to a latent subspace [10 <= k <= 30] that I can access and work with, e.g. for predictions on new data? There is a package for Keras autoencoders in R, ruta, which does exactly that, but I have to figure out how to adapt the model to text input. Maybe you guys have some suggestions on this topic.

Finally, I added three models which perform okay on the celebrity-faceoff dataset, like yours. And there are hundreds more to discover. Our results strongly depend on tokens that appear only once in the corpus; do the models therefore overfit massively?

As supervised learning is still the backbone of most NLP applications, this effort could be scaled. We could standardize model inputs/interfaces to easily run grid searches or special optimizers for hyperparameter selection. This would also be beneficial for automatically benchmarking models on a variety of datasets (comparing all R packages?). This article suggests that the choice of dataset strongly matters for how a specific model performs (maybe starspace slightly underperforms only on small datasets?). Maybe you could provide pretrained word embeddings for keras?

@systats

systats commented Feb 26, 2019

Even more general (and better) than that: https://github.com/systats/tidykeras

@jwijffels
Contributor Author

About the collection of publications: I haven't tried using the Starspace embeddings for clustering, but the model can give you embeddings of words as well as of sentences, paragraphs, full documents or authors, which you can then cluster (with DBSCAN, for example) or visualise with the projector R package.
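For example, something along these lines (a sketch; `model` stands for any trained ruimtehol model, `docs` for your character vector of documents, and the dbscan parameters would need tuning):

```r
library(ruimtehol)
library(dbscan)
## one embedding per document, then density-based clustering on top
emb <- starspace_embedding(model, docs)
clustering <- dbscan(emb, eps = 0.5, minPts = 5)
table(clustering$cluster)  # cluster 0 contains the noise points
```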
The main advantage of starspace is that the embeddings of the labels and of the ngrams which construct the embedding space lie in the same space, and that it works on a multitude of setups (it's a generic entity embedding framework).
Interesting package, https://github.com/systats/tidykeras. A generic package for tuning NLP models would be interesting to have, but it is out of my personal scope.
Is that collection of publications public domain?

@jlacko

jlacko commented Feb 26, 2019

Thanks for your kind words & for your work on the celebrity-faceoff repo. I managed to have only a quick look at it and it looked inspiring indeed. If you do not mind, I will rework it from Rmd format to plain R code to keep it more in line with the other samples.

Unfortunately, it seems (with the benefit of hindsight) that my choice of dataset was not a particularly good one. As intriguing as the idea of keeping up with the Kardashians seemed, the sample was just too small.

With regard to using the StarSpace algorithm on millions of documents: this seems interesting and could work. I was amazed by the speed with which ruimtehol / StarSpace worked, especially since I was used to working with Keras. Totally different ballpark!


@systats

systats commented Feb 26, 2019

The main advantage of starspace is that the embeddings of the labels and of the ngrams which construct the embedding space lie in the same space, and that it works on a multitude of setups (it's a generic entity embedding framework).

Are starspace embeddings trained unsupervised? And do you think they provide a meaningful latent space (compared to topic proportions in topicmodels)?

Is that collection of publications public domain?

Personally, I'm interested in social science literature, which comes partly from Scopus (a ton of labels and metadata) and a free DOI dump (for abstracts).

Of course you can reshape the file to your needs. What's next for the celebrity-faceoff repo?

@jwijffels
Contributor Author

jwijffels commented Feb 26, 2019

Are starspace embeddings trained unsupervised? And do you think they provide a meaningful latent space (compared to topic proportions in topicmodels)?

They can be trained supervised, unsupervised or semi-supervised. I haven't compared them to LDA/BTM models yet; so far I have only used them for document proposals and multi-label classification.
@systats I'll add some examples to the documentation on semi-supervised learning and also on transfer learning.
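Roughly along these lines (a sketch only, the final examples in the docs may look different; `txt`, `labels`, `labels_partly_NA` and `big_corpus` are placeholders):

```r
library(ruimtehol)
## semi-supervised: leave the label missing (NA) for part of the texts
model <- embed_tagspace(x = txt, y = labels_partly_NA,
                        dim = 50, epoch = 20)

## transfer learning: start from word embeddings trained on a larger corpus
pretrained <- embed_wordspace(big_corpus, dim = 50, epoch = 20)
model2 <- embed_tagspace(x = txt, y = labels,
                         embeddings = as.matrix(pretrained),
                         dim = 50, epoch = 20)
```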

@systats

systats commented Feb 26, 2019

I will of course look into it in the coming weeks. Thanks so far; your project inspired this: https://github.com/systats/textlearnR. I'm going to focus on this lightweight R package for benchmarking NLP models on a range of datasets. I hope I'm not reinventing the wheel; if you see it that way, please let me know.

@jlacko

jlacko commented Feb 26, 2019

What's next for the celebrity-faceoff repo?

I am open to advice in this matter. My current plan is to consolidate the various scripts into a more uniform format and publish a short summary post on some blog or other.

Even though it proved too small to let the neural models truly shine, I believe it could serve as a nice showcase of possible approaches.


@jwijffels
Contributor Author

I think there is certainly some value in comparing different approaches and best practices across different packages. I know there is https://github.com/tidymodels/textrecipes by @EmilHvitfeldt, which tries to standardise NLP data preparation, but mainly for non-neural supervised approaches. I'm not aware of similar neural tuning/comparison approaches on the R side.
For datasets at textlearnR: last week I used the CRAN DESCRIPTION files to see which packages could be added to task views (https://github.com/jwijffels/starspace-examples).
I think that an approach where different package authors in NLP can publish an Rmd file on their best practices in a certain NLP area would certainly be a good thing for the R NLP community.
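
For anyone wanting to redo that DESCRIPTION exercise, the metadata is one call away (the filtering afterwards is up to you):

```r
## data.frame with one row per CRAN package, Description field included
pkgs <- tools::CRAN_package_db()
descriptions <- setNames(pkgs$Description, pkgs$Package)
## e.g. candidate packages for an NLP task view
head(grep("natural language|text mining", descriptions,
          ignore.case = TRUE, value = TRUE), 3)
```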
