
presubmission inquiry for pangoling: Access to word predictability using large language (transformer) models. #573

Closed
bnicenboim opened this issue Feb 13, 2023 · 12 comments

Comments

@bnicenboim

Submitting Author Name: Bruno Nicenboim
Submitting Author Github Handle: @bnicenboim
Repository: https://github.com/bnicenboim/pangoling
Submission type: Pre-submission
Language: en


  • Paste the full DESCRIPTION file inside a code block below:
Package: pangoling
Type: Package
Title: Access to Large Language Model Predictions
Version: 0.0.0.9002
Authors@R: c(
    person("Bruno", "Nicenboim",
    email = "[email protected]",
    role = c( "aut","cre"),
    comment = c(ORCID = "0000-0002-5176-3943")),
    person("Chris", "Emmerly", role = "ctb"),
    person("Giovanni", "Cassani", role = "ctb"))
Description: Access to word predictability using large language (transformer) models.
URL: https://bruno.nicenboim.me/pangoling, https://github.com/bnicenboim/pangoling
BugReports: https://github.com/bnicenboim/pangoling/issues
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: false
Config/reticulate:
  list(
    packages = list(
      list(package = "torch"),
      list(package = "transformers")
    )
  )
Imports: 
    data.table,
    memoise,
    reticulate,
    tidyselect,
    tidytable (>= 0.7.2),
    utils,
    cachem
Suggests: 
    rmarkdown,
    knitr,
    testthat (>= 3.0.0),
    tictoc,
    covr,
    spelling
Config/testthat/edition: 3
RoxygenNote: 7.2.3
Roxygen: list(markdown = TRUE)
Depends: 
    R (>= 2.10)
VignetteBuilder: knitr
StagedInstall: yes
Language: en-US

Scope

  • Please indicate which category or categories from our package fit policies or statistical package categories this package falls under. (Please check an appropriate box below):

    Data Lifecycle Packages

    • data retrieval
    • data extraction
    • data munging
    • data deposition
    • data validation and testing
    • workflow automation
    • version control
    • citation management and bibliometrics
    • scientific software wrappers
    • field and lab reproducibility tools
    • database software bindings
    • geospatial data
    • text analysis

    Statistical Packages

    • Bayesian and Monte Carlo Routines
    • Dimensionality Reduction, Clustering, and Unsupervised Learning
    • Machine Learning
    • Regression and Supervised Learning
    • Exploratory Data Analysis (EDA) and Summary Statistics
    • Spatial Analyses
    • Time Series Analyses
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of:

The package is a wrapper around the transformers Python package; it can tokenize text, get word predictability, and calculate perplexity, which falls under text analysis.
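To make the perplexity part concrete, here is a minimal sketch (not pangoling's actual code; the log-probabilities are invented): perplexity is the exponential of the negative mean log-probability of the observed words.

```python
import math

# Hypothetical per-word log-probabilities from a language model.
log_probs = [-2.3, -0.7, -1.5, -3.1]

# Perplexity: exp of the negative mean log-probability.
perplexity = math.exp(-sum(log_probs) / len(log_probs))
print(perplexity)
```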

NA

  • Who is the target audience and what are scientific applications of this package?

This is mostly for psycho-/neuro-linguists who use word predictability as a predictor in their research, for example in ERP and reading studies.

Another R package that acts as a wrapper for transformers is text. However, text is more general, and its focus is on Natural Language Processing and Machine Learning. pangoling is much more specific: its focus is on measures used as predictors in analyses of data from experiments, rather than on NLP.

Yes, the output of pkgcheck fails only because of the use of <<-. But this is done in order to use memoise, as recommended in its documentation. The <<- in the package appears inside .onLoad:

.onLoad <- function(libname, pkgname) {
  # caching:
  tokenizer <<- memoise::memoise(tokenizer)
  lang_model <<- memoise::memoise(lang_model)
  transformer_vocab <<- memoise::memoise(transformer_vocab)
}

The pkgcheck output is the following:

── pangoling 0.0.0.9002 ────────────────────────────

✔ Package name is available
✔ has a 'codemeta.json' file.
✔ has a 'contributing' file.
✔ uses 'roxygen2'.
✔ 'DESCRIPTION' has a URL field.
✔ 'DESCRIPTION' has a BugReports field.
✔ Package has at least one HTML vignette
✔ All functions have examples.
✖ Package uses global assignment operator ('<<-').
✔ Package has continuous integration checks.
✔ Package coverage is 94.4%.
✔ R CMD check found no errors.
✔ R CMD check found no warnings.

ℹ Current status:
✖ This package is not ready to be submitted.

@bnicenboim bnicenboim changed the title presubmission inquiry for pangoling presubmission inquiry for pangoling: Access to word predictability using large language (transformer) models. Feb 14, 2023
@maurolepore
Member

maurolepore commented Feb 14, 2023

Thanks @bnicenboim for your pre-submission. I'll come back to you ASAP.
Thanks also for explaining your use of <<-.

@maurolepore
Member

maurolepore commented Feb 15, 2023

Dear @bnicenboim,

It's my first rotation as EiC and I'm still learning the nuances of assessing the eligibility of each submission. I need to show other editors evidence of how pangoling meets our criteria for fit and overlap. Can you please help me by answering the following questions as succinctly and precisely as possible? I numbered the actionable items to help you respond to them specifically.

Package categories

scientific software wrappers: Packages that wrap non-R utility programs used for scientific research.

  • ml01. Must be specific to research fields, not general computing utilities.

What research field is pangoling specific to? Is that psycho-/neuro-linguistics?

  • ml02. Must be non-trivial.

Can you please expand on what value pangoling adds? That is, beyond a simple system() call or bindings, whether in parsing inputs and outputs, data handling, etc. An improved installation process, or extended compatibility to more platforms, may constitute added value if installation is complex.

Other scope considerations

  • ml03. Should be general in the sense that they should solve a problem as broadly as possible while maintaining a coherent user interface and code base. For instance, if several data sources use an identical API, we prefer a package that provides access to all the data sources, rather than just one.

The 'pangoling' package states that the overlapping package 'text' is more general. Can you please explain how, despite this, pangoling still meets rOpenSci's guidelines for fit and overlap?

Package overlap

  • ml04. Avoids duplication of functionality of existing R packages in any repo without significant improvements.

Can you please explain whether 'pangoling' duplicates functionality in 'text' or another R package, and, if it does, how 'pangoling' represents a significant improvement (see our guide for details on what we mean by 'significant improvement')?

  • ml05. Also, as it becomes increasingly easy to call python packages from R, can you please explain how straightforward it would be to access the underlying python functionality without 'pangoling'?

Thanks for your patience :-)

@bnicenboim
Author

bnicenboim commented Feb 15, 2023

OK, sure, I'll answer inline.

Dear @bnicenboim,

It's my first rotation as EiC and I'm still learning the nuances of assessing the eligibility of each submission. I need to show other editors evidence of how pangoling meets our criteria for fit and overlap. Can you please help me by answering the following questions as succinctly and precisely as possible? I numbered the actionable items to help you respond to them specifically.

Package categories

scientific software wrappers: Packages that wrap non-R utility programs used for scientific research.

* [x]  ml01. Must be specific to research fields, not general computing utilities.

What research field is pangoling specific to? Is that psycho-/neuro-linguistics?

Yes, it's common to use word predictability as a predictor in models, and pangoling extracts predictability from transformer models.

* [ ]  ml02. Must be non-trivial.

Can you please expand on what value pangoling adds? That is, beyond a simple system() call or bindings, whether in parsing inputs and outputs, data handling, etc. An improved installation process, or extended compatibility to more platforms, may constitute added value if installation is complex.

Transformer models are "meant" to be used for computational-linguistics tasks. For example, GPT-like models produce a (random) continuation given a context. That's trivial to get, since there is a shortcut in Python called pipeline() that does exactly that. The thing is that one can also get the probability of each word in a given text without generating anything; that's less trivial to obtain, but it's very useful in *-linguistics. It's less trivial because one needs to know how to set up the language model; then one obtains a huge tensor (which is not trivial to manipulate for most R users); and finally one needs to take care of the mapping between the words and phrases (the important thing in *-linguistics) and the tokens (which is how the model encodes the words), a correspondence that might be one-to-one or one-to-many. For BERT-like models the challenges are similar. Crucially, one needs to understand how these large language transformer models work. The package has two contributors, and that was exactly their role: explaining to me how these models work so that I could figure out which Python functions I needed :)
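To make the token-to-word mapping concrete, here is a minimal sketch (not pangoling's actual code; the tokens and log-probabilities are invented) of the aggregation step for a GPT-2-style tokenizer, where a leading space marks the start of a new word and a word's log-probability is the sum of its tokens' log-probabilities:

```python
# Hypothetical subword tokens and their log-probabilities; note that
# "fox" is split into two tokens (" fo" and "x"), a one-to-many case.
tokens    = ["The", " quick", " fo", "x", " jumps"]
log_probs = [-2.1, -4.3, -7.0, -0.5, -3.2]

words, word_lps = [], []
for tok, lp in zip(tokens, log_probs):
    if tok.startswith(" ") or not words:  # a new word begins here
        words.append(tok.strip())
        word_lps.append(lp)
    else:                                 # continuation of the same word
        words[-1] += tok
        word_lps[-1] += lp

print(list(zip(words, word_lps)))
# "fox" gets the summed log-probability -7.0 + -0.5 = -7.5
```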

Also, the point of using memoise is that the package is not object-oriented (like R6 packages, which are more confusing for basic R users) but completely function-based. It just remembers the last type of language model that was used.

I hope it's clearer, but feel free to ask!

Other scope considerations

* [ ]  ml03. Should be general in the sense that they should solve a problem as broadly as possible while maintaining a coherent user interface and code base. For instance, if several data sources use an identical API, we prefer a package that provides access to all the data sources, rather than just one.

The 'pangoling' package states that the overlapping package 'text' is more general. Can you please explain how, despite this, pangoling still meets rOpenSci's guidelines for fit and overlap?

text brings the transformers Python package to R and adds some machine-learning functionality. I would say that, here too, the overlap is that text is more general, and it doesn't allow generating pangoling's output in a straightforward way. In fact, I'm not sure it's even possible, since text seems more limited than transformers.
I would say that the users of pangoling would be mostly psycho-/neuro-linguists, while the users of text are computational linguists.

Package overlap

* [ ]  ml04. Avoids duplication of functionality of existing R packages in any repo without significant improvements.

Can you please explain whether 'pangoling' duplicates functionality in 'text' or another R package, and, if it does, how 'pangoling' represents a significant improvement (see our guide for details on what we mean by 'significant improvement')?

I think I answered this in the previous point. I'm not even sure that you can get the output of pangoling using just text. I think the overlap is that they are both wrappers of transformers.

* [ ]  ml05. Also, as it becomes increasingly easy to call python packages from R, can you please explain how straightforward it would be to access the underlying python functionality without 'pangoling'?

I'm not sure I understand this. One would need to set up the models in Python, then extract the tensors and manipulate them. Finally, one needs to take care of the mapping between words and tokens, but Python is not needed in that last step.
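As a rough sketch of the "extract the tensors and manipulate them" step (not pangoling's actual code; the toy vocabulary and logits are invented), the predictability of the observed next word is its softmax probability over the model's output logits for that position:

```python
import math

# Made-up logits for one position over a toy four-word vocabulary.
vocab  = ["the", "cat", "sat", "mat"]
logits = [1.0, 3.0, 0.5, 0.2]

# Numerically stable softmax: subtract the max before exponentiating.
m = max(logits)
exps = [math.exp(l - m) for l in logits]
probs = [e / sum(exps) for e in exps]

# Probability of the observed next word.
p = probs[vocab.index("cat")]
print(p)
```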

Thanks for your patience :-)

OK, there was a lot of overlap in my answers, so feel free to ask me more specific questions if something is not clear.

@maurolepore
Member

maurolepore commented Feb 17, 2023

@bnicenboim, I'm still discussing the scope with other editors.

  • ml06. Did you consider submitting pangoling as a stats package? If so, what convinced you to submit as a general package?

@bnicenboim
Author

bnicenboim commented Feb 17, 2023 via email

@maurolepore
Member

Thanks.

I ask because the category "text analysis" in the standard-package guide states:

Machine-learning and packages implementing NLP analysis algorithms should be submitted under statistical software peer review.

Knowing that you at least considered it, I can now be sure the standard-package review is your informed decision.

@bnicenboim
Author

bnicenboim commented Feb 17, 2023 via email

@maurolepore
Member

@bnicenboim

I now have enough opinions from the editorial team to consider this package in scope. Please go ahead with a full submission.

Thanks for your patience.

@bnicenboim
Author

Thanks. Should I do something about "Package uses global assignment operator ('<<-')"? If I add pkgcheck to GitHub Actions, it will just fail.

@maurolepore
Member

Please use the same justification you wrote here.

@mpadge
Member

mpadge commented Feb 21, 2023

@bnicenboim @maurolepore I've just updated pkgcheck via the issue linked above to allow <<- in an .onLoad function for use of memoise. The {pangoling} package still fails because it also has two .onLoad entries that use <<- with reticulate::import. This pattern is also recommended by the reticulate package. Although not yet permitted in pkgcheck, this use for reticulate imports will also be permitted soon, and the pangoling package will then pass all tests. In the meantime, @bnicenboim, please simply add an explanatory note, with links to this comment or the pkgcheck issue as you see fit. Thanks

@maurolepore
Member

Closing because there is now a full submission at #575
