
presubmission inquiry for pangoling: Access to word predictability using large language (transformer) models. #573

Closed
bnicenboim opened this issue Feb 13, 2023 · 12 comments

Comments

@bnicenboim

Submitting Author Name: Bruno Nicenboim
Submitting Author Github Handle: @bnicenboim
Repository: https://github.com/bnicenboim/pangoling
Submission type: Pre-submission
Language: en


  • Paste the full DESCRIPTION file inside a code block below:
Package: pangoling
Type: Package
Title: Access to Large Language Model Predictions
Version: 0.0.0.9002
Authors@R: c(
    person("Bruno", "Nicenboim",
    email = "[email protected]",
    role = c( "aut","cre"),
    comment = c(ORCID = "0000-0002-5176-3943")),
    person("Chris", "Emmerly", role = "ctb"),
    person("Giovanni", "Cassani", role = "ctb"))
Description: Access to word predictability using large language (transformer) models.
URL: https://bruno.nicenboim.me/pangoling, https://github.com/bnicenboim/pangoling
BugReports: https://github.com/bnicenboim/pangoling/issues
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: false
Config/reticulate:
  list(
    packages = list(
      list(package = "torch"),
      list(package = "transformers")
    )
  )
Imports: 
    data.table,
    memoise,
    reticulate,
    tidyselect,
    tidytable (>= 0.7.2),
    utils,
    cachem
Suggests: 
    rmarkdown,
    knitr,
    testthat (>= 3.0.0),
    tictoc,
    covr,
    spelling
Config/testthat/edition: 3
RoxygenNote: 7.2.3
Roxygen: list(markdown = TRUE)
Depends: 
    R (>= 2.10)
VignetteBuilder: knitr
StagedInstall: yes
Language: en-US

Scope

  • Please indicate which category or categories from our package fit policies or statistical package categories this package falls under. (Please check an appropriate box below):

    Data Lifecycle Packages

    • data retrieval
    • data extraction
    • data munging
    • data deposition
    • data validation and testing
    • workflow automation
    • version control
    • citation management and bibliometrics
    • scientific software wrappers
    • field and lab reproducibility tools
    • database software bindings
    • geospatial data
    • text analysis

    Statistical Packages

    • Bayesian and Monte Carlo Routines
    • Dimensionality Reduction, Clustering, and Unsupervised Learning
    • Machine Learning
    • Regression and Supervised Learning
    • Exploratory Data Analysis (EDA) and Summary Statistics
    • Spatial Analyses
    • Time Series Analyses
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of:

The package is a wrapper around the transformers Python package; it can tokenize text, get word predictability, and calculate perplexity, which falls under text analysis.
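To make the perplexity part concrete, here is a minimal sketch (not pangoling's actual code; the log-probabilities are invented): perplexity is the exponential of the negative mean log-probability of the observed words.

```python
import math

# Hypothetical per-word log-probabilities from a language model.
log_probs = [-2.3, -0.7, -1.5, -3.1]

# Perplexity: exp of the negative mean log-probability.
perplexity = math.exp(-sum(log_probs) / len(log_probs))
print(perplexity)
```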

NA

  • Who is the target audience and what are scientific applications of this package?

This is mostly for psycho-/neuro-linguists who use word predictability as a predictor in their research, for example in ERP and reading studies.

Another R package that acts as a wrapper for transformers is text. However, text is more general, and its focus is on Natural Language Processing and Machine Learning. pangoling is much more specific: its focus is on measures used as predictors in analyses of data from experiments, rather than on NLP.

Yes, the output of pkgcheck fails only because of the use of <<-. But this is done in order to use memoise, as recommended in its documentation. The <<- in the package appears inside .onLoad:

.onLoad <- function(libname, pkgname) {
  # caching:
  tokenizer <<- memoise::memoise(tokenizer)
  lang_model <<- memoise::memoise(lang_model)
  transformer_vocab <<- memoise::memoise(transformer_vocab)
}

The pkgcheck output is the following:

── pangoling 0.0.0.9002 ────────────────────────────

✔ Package name is available
✔ has a 'codemeta.json' file.
✔ has a 'contributing' file.
✔ uses 'roxygen2'.
✔ 'DESCRIPTION' has a URL field.
✔ 'DESCRIPTION' has a BugReports field.
✔ Package has at least one HTML vignette
✔ All functions have examples.
✖ Package uses global assignment operator ('<<-').
✔ Package has continuous integration checks.
✔ Package coverage is 94.4%.
✔ R CMD check found no errors.
✔ R CMD check found no warnings.

ℹ Current status:
✖ This package is not ready to be submitted.

@bnicenboim bnicenboim changed the title presubmission inquiry for pangoling presubmission inquiry for pangoling: Access to word predictability using large language (transformer) models. Feb 14, 2023
@maurolepore
Member

maurolepore commented Feb 14, 2023

Thanks @bnicenboim for your pre-submission. I'll come back to you ASAP.
Thanks also for explaining your use of <<-.

@maurolepore
Member

maurolepore commented Feb 15, 2023

Dear @bnicenboim,

It's my first rotation as EiC and I'm still learning the nuances of assessing the eligibility of each submission. I need to show other editors evidence of how pangoling meets our criteria for fit and overlap. Can you please help me by answering the following questions as succinctly and precisely as possible? I numbered the actionable items to help you respond to them specifically.

Package categories

scientific software wrappers: Packages that wrap non-R utility programs used for scientific research.

  • ml01. Must be specific to research fields, not general computing utilities.

What research field is pangoling specific to? Is that psycho-/neuro-linguistics?

  • ml02. Must be non-trivial.

Can you please expand on what value pangoling adds? That is, beyond a simple system() call or bindings, whether in parsing inputs and outputs, data handling, etc. An improved installation process, or extended compatibility to more platforms, may constitute added value if installation is complex.

Other scope considerations

  • ml03. Should be general in the sense that they should solve a problem as broadly as possible while maintaining a coherent user interface and code base. For instance, if several data sources use an identical API, we prefer a package that provides access to all the data sources, rather than just one.

The 'pangoling' package states that the overlapping package 'text' is more general. Can you please explain how, despite this, pangoling still meets rOpenSci's guidelines for fit and overlap?

Package overlap

  • ml04. Avoids duplication of functionality of existing R packages in any repo without significant improvements.

Can you please explain whether 'pangoling' duplicates functionality in 'text' or another R package, and, if it does, how 'pangoling' represents a significant improvement (see our guide for details on what we mean by 'significant improvement')?

  • ml05. Also, as it becomes increasingly easy to call python packages from R, can you please explain how straightforward it would be to access the underlying python functionality without 'pangoling'?

Thanks for your patience :-)

@bnicenboim
Author

bnicenboim commented Feb 15, 2023

OK, sure, I'll answer inline.

Dear @bnicenboim,

It's my first rotation as EiC and I'm still learning the nuances of assessing the eligibility of each submission. I need to show other editors evidence of how pangoling meets our criteria for fit and overlap. Can you please help me by answering the following questions as succinctly and precisely as possible? I numbered the actionable items to help you respond to them specifically.

Package categories

scientific software wrappers: Packages that wrap non-R utility programs used for scientific research.

* [x]  ml01. Must be specific to research fields, not general computing utilities.

What research field is pangoling specific to? Is that psycho-/neuro-linguistics?

Yes, it's common to use word predictability as a predictor in models, and pangoling extracts predictability from transformer models.

* [ ]  ml02. Must be non-trivial.

Can you please expand on what value pangoling adds? That is, beyond a simple system() call or bindings, whether in parsing inputs and outputs, data handling, etc. An improved installation process, or extended compatibility to more platforms, may constitute added value if installation is complex.

Transformer models are "meant" to be used for computational-linguistics tasks. For example, GPT-like models produce a (random) continuation given a context. That's trivial to get, since there is a shortcut in Python called pipeline() that does exactly that. The thing is that one can also get the probability of each word in a given text without generating anything; that's less trivial to obtain, but it's very useful in *-linguistics. It's less trivial because one needs to know how to set up the language model; then one obtains a huge tensor (which is not trivial to manipulate for most R users); and finally one needs to take care of the mapping between the words and phrases (the important thing in *-linguistics) and the tokens (which is how the model encodes the words), a correspondence that might be one-to-one or one-to-many. For BERT-like models the challenges are similar. Crucially, one needs to understand how these large language transformer models work. The package has two contributors, and that was exactly their role: explaining to me how these models work so that I could figure out which Python functions I needed :)
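To make the token-to-word mapping concrete, here is a minimal sketch (not pangoling's actual code; the tokens and log-probabilities are invented) of the aggregation step for a GPT-2-style tokenizer, where a leading space marks the start of a new word and a word's log-probability is the sum of its tokens' log-probabilities:

```python
# Hypothetical subword tokens and their log-probabilities; note that
# "fox" is split into two tokens (" fo" and "x"), a one-to-many case.
tokens    = ["The", " quick", " fo", "x", " jumps"]
log_probs = [-2.1, -4.3, -7.0, -0.5, -3.2]

words, word_lps = [], []
for tok, lp in zip(tokens, log_probs):
    if tok.startswith(" ") or not words:  # a new word begins here
        words.append(tok.strip())
        word_lps.append(lp)
    else:                                 # continuation of the same word
        words[-1] += tok
        word_lps[-1] += lp

print(list(zip(words, word_lps)))
# "fox" gets the summed log-probability -7.0 + -0.5 = -7.5
```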

Also, the point of using memoise is that the package is not object-oriented (like R6 packages, which are more confusing for basic R users) but completely function-based. It just remembers the last type of language model that was used.

I hope it's clearer, but feel free to ask!

Other scope considerations

* [ ]  ml03. Should be general in the sense that they should solve a problem as broadly as possible while maintaining a coherent user interface and code base. For instance, if several data sources use an identical API, we prefer a package that provides access to all the data sources, rather than just one.

The 'pangoling' package states that the overlapping package 'text' is more general. Can you please explain how, despite this, pangoling still meets rOpenSci's guidelines for fit and overlap?

text brings the transformers Python package to R and adds some machine-learning functionality. I would say that, here too, the overlap is that text is more general, and it doesn't allow generating pangoling's output in a straightforward way. In fact, I'm not sure it's even possible, since text seems more limited than transformers.
I would say that the users of pangoling would be mostly psycho-/neuro-linguists, while the users of text are computational linguists.

Package overlap

* [ ]  ml04. Avoids duplication of functionality of existing R packages in any repo without significant improvements.

Can you please explain whether 'pangoling' duplicates functionality in 'text' or another R package, and, if it does, how 'pangoling' represents a significant improvement (see our guide for details on what we mean by 'significant improvement')?

I think I answered this in the previous point. I'm not even sure that you can get the output of pangoling using just text. I think the overlap is that they are both wrappers of transformers.

* [ ]  ml05. Also, as it becomes increasingly easy to call python packages from R, can you please explain how straightforward it would be to access the underlying python functionality without 'pangoling'?

I'm not sure I understand this. One would need to set up the models in Python, then extract the tensors and manipulate them. Finally, one needs to take care of the mapping between words and tokens, but Python is not needed in that last step.
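As a rough sketch of the "extract the tensors and manipulate them" step (not pangoling's actual code; the toy vocabulary and logits are invented), the predictability of the observed next word is its softmax probability over the model's output logits for that position:

```python
import math

# Made-up logits for one position over a toy four-word vocabulary.
vocab  = ["the", "cat", "sat", "mat"]
logits = [1.0, 3.0, 0.5, 0.2]

# Numerically stable softmax: subtract the max before exponentiating.
m = max(logits)
exps = [math.exp(l - m) for l in logits]
probs = [e / sum(exps) for e in exps]

# Probability of the observed next word.
p = probs[vocab.index("cat")]
print(p)
```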

Thanks for your patience :-)

OK, there was a lot of overlap in my answers, so feel free to ask me more specific questions if something is not clear.

@maurolepore
Member

maurolepore commented Feb 17, 2023

@bnicenboim, I'm still discussing the scope with other editors.

  • ml06. Did you consider submitting pangoling as a stats package? If so, what convinced you to submit as a general package?

@bnicenboim
Author

bnicenboim commented Feb 17, 2023 via email

@maurolepore
Member

Thanks.

I ask because the category "text analysis" in the standard-package guide states:

Machine-learning and packages implementing NLP analysis algorithms should be submitted under statistical software peer review.

Knowing that you at least considered it, I can now be sure the standard-package review is your informed decision.

@bnicenboim
Author

bnicenboim commented Feb 17, 2023 via email

@maurolepore
Member

@bnicenboim

I now have enough opinions from the editorial team to consider this package in scope. Please go ahead with a full submission.

Thanks for your patience.

@bnicenboim
Author

Thanks. Should I do something about "Package uses global assignment operator ('<<-')"? If I add pkgcheck to GitHub Actions, it will just fail.

@maurolepore
Member

Please use the same justification you wrote here.

@mpadge
Member

mpadge commented Feb 21, 2023

@bnicenboim @maurolepore I've just updated pkgcheck via the issue linked above to allow <<- in an .onLoad function for use of memoise. The {pangoling} package still fails because it also has two .onLoad entries that use <<- with reticulate::import. This pattern is also recommended by the reticulate package. Although not yet permitted in pkgcheck, this use for reticulate imports will also be permitted soon, and the pangoling package will then pass all tests. In the meantime, @bnicenboim, please simply add an explanatory note, with links to this comment or the pkgcheck issue as you see fit. Thanks

@maurolepore
Member

Closing because there is now a full submission at #575
