Why does GitHub still wrongly mislabel R / RMarkdown projects? #6869

EarlGlynn · 2024-06-07T06:48:10Z


Why does GitHub automatically "guess wrong" about repos with Rmarkdown code (Rmd) files?

Why can't you label this an "RStudio Notebook" like you label a "Jupyter Notebook"?

Why is the "solution" to show wrong information instead of fixing this very old problem? My Rmd files are not "Jekyll using Docker image" or "SLSA Generic generator".

Forcing a .gitattributes file is not a good fix when the default is to show wrong information.

lildude (Maintainer) · 2024-06-07T07:06:19Z

> Why does GitHub automatically "guess wrong" about repos with Rmarkdown code (Rmd) files?

In what way? Rmd files are recognised but are considered prose so Linguist doesn't include them in the language stats by default. The same applies to repos that contain nothing but Markdown files.

> Why can't you label this an "RStudio Notebook" like you label a "Jupyter Notebook"?

Because Linguist considers Jupyter a programming language but not Rmd files. Why? See #5208. Additionally, GitHub (independent of Linguist) has support for rendering Jupyter files but not Rmd.

Also, Linguist doesn't know a language called "RStudio Notebook". If this is something distinct from a standalone Rmd file, please feel free to submit a pull request to add support.

> Why is the "solution" to show wrong information instead of fixing this very old problem? My Rmd files are not "Jekyll using Docker image" or "SLSA Generic generator".

This is unlikely to come from your Rmd files. As I mentioned above, Linguist treats Rmd as prose, so those files are excluded from the stats and don't contribute to the languages shown in the sidebar. If you are seeing other names, it is because other languages are being detected in your other files, or because your Rmd files are not using the .rmd or .qmd extensions Linguist expects:

linguist/lib/linguist/languages.yml

Lines 5759 to 5770 in e2012cd

    RMarkdown:
      type: prose
      color: "#198ce7"
      wrap: true
      ace_mode: markdown
      codemirror_mode: gfm
      codemirror_mime_type: text/x-gfm
      extensions:
      - ".qmd"
      - ".rmd"
      tm_scope: text.md
      language_id: 313

If you provide a link to a repo showing the problem, I can take a look and explain the behaviour you are seeing more precisely.

> Forcing a .gitattributes file is not a good fix when the default is to show wrong information.

A .gitattributes file is the only way to override Linguist's default behaviour (in this case, not counting prose files) or to tell it what you really want if you're not using the expected extensions.
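To illustrate why prose files vanish from the sidebar and what an override changes, here is a minimal sketch in Python. It is not Linguist's code: the language table is a trimmed, hypothetical stand-in for languages.yml, the assumption that only `programming` and `markup` types are counted by default is taken from Linguist's overrides documentation as I understand it, and `detectable_overrides` is an invented parameter standing in for a `.gitattributes` line such as `*.Rmd linguist-detectable`.

```python
from collections import defaultdict

# Assumed, simplified language table; the real one lives in languages.yml.
LANGUAGE_TYPES = {
    "R": "programming",
    "RMarkdown": "prose",
    "Dockerfile": "programming",
    "YAML": "data",
}

# Types counted by default in this sketch (an assumption, see the lead-in).
DEFAULT_DETECTABLE_TYPES = {"programming", "markup"}


def language_stats(files, detectable_overrides=frozenset()):
    """Return {language: percent of counted bytes}.

    files: iterable of (language, size_in_bytes) pairs.
    detectable_overrides: hypothetical stand-in for languages forced into
    the stats by a `linguist-detectable` line in .gitattributes.
    """
    totals = defaultdict(int)
    for language, size in files:
        counted = (
            LANGUAGE_TYPES.get(language) in DEFAULT_DETECTABLE_TYPES
            or language in detectable_overrides
        )
        if counted:
            totals[language] += size
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {lang: round(100 * size / grand_total, 1) for lang, size in totals.items()}


repo = [("RMarkdown", 9000), ("Dockerfile", 100), ("YAML", 400)]
print(language_stats(repo))                 # {'Dockerfile': 100.0}
print(language_stats(repo, {"RMarkdown"}))  # {'RMarkdown': 98.9, 'Dockerfile': 1.1}
```

In this toy repo a 100-byte Dockerfile ends up as 100% of the stats until the prose language is explicitly marked as detectable, which is the kind of surprise described in this thread.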


EarlGlynn (Author) · 2024-06-07T07:48:34Z

So why don't you fix Linguist's hallucinations? Why does Linguist want to show biased language statistics? Perhaps some directive from the Ministry of Truth?

How is an RStudio notebook (an .Rmd file), with very distinctive R code chunks and comments between the code chunks, conceptually different from a C file with comments? Why does Linguist not recognize R code chunks -- easily parsable -- as code? Why does Linguist not classify a C file with comments as prose? Isn't Linguist being inconsistent?

Linguist is having a hallucination that needs to be fixed by looking at the code chunks in a notebook. If there are no code chunks, then perhaps "prose" is an adequate answer. If a C file has no C code but only comments, do you label it as "prose"?

An .Rmd file (or a Jupyter notebook) can have R or Python code (or other languages, too). Why not classify a file by the percentages of code chunks in the file? You could have an .Rmd file that is 75% R and 25% Python or vice versa. And perhaps a percentage for "prose" for the comment chunks? Most .Rmd files are not 100% "prose" whether Linguist says so or not.
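To make the per-chunk idea concrete, here is a minimal Python sketch of that kind of accounting. It is not something Linguist does; `chunk_breakdown` and its regexes are hypothetical, it counts non-blank lines rather than bytes, and it only recognises chunks opened with three or more backticks followed by `{engine ...}`. YAML front matter and plain Markdown code fences are simply counted as prose.

```python
import re
from collections import Counter

# Chunk header such as three backticks followed by {r setup, echo=FALSE}.
CHUNK_OPEN = re.compile(r"^\s*`{3,}\s*\{\s*([A-Za-z0-9_]+)")
# Closing fence: backticks only.
CHUNK_CLOSE = re.compile(r"^\s*`{3,}\s*$")


def chunk_breakdown(rmd_text):
    """Return {label: percent of non-blank lines}; label is an engine or 'prose'."""
    counts = Counter()
    current = None  # engine of the chunk we are inside, or None when in prose
    for line in rmd_text.splitlines():
        if current is None:
            match = CHUNK_OPEN.match(line)
            if match:
                current = match.group(1).lower()
            elif line.strip():
                counts["prose"] += 1
        elif CHUNK_CLOSE.match(line):
            current = None
        elif line.strip():
            counts[current] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


FENCE = "`" * 3  # built at runtime so this example contains no literal fence lines
sample = "\n".join([
    "# Report",
    "",
    "Some text.",
    "",
    FENCE + "{r load}",
    "x <- 1",
    "y <- 2",
    "z <- x + y",
    FENCE,
    "",
    "More text.",
    "",
    FENCE + "{python}",
    "print(1)",
    FENCE,
])
print(chunk_breakdown(sample))  # {'prose': 42.9, 'r': 42.9, 'python': 14.3}
```

Note that this requires reading the full content of every file, which is exactly the cost trade-off under discussion.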

And I'd argue that you should have Quarto notebooks instead of "prose," too.

I'd accept no classification as a solution to .Rmd files instead of a wrong solution. Why is "wrong" the default? Why is a "wrong" default ever acceptable? Because the Ministry of Truth says so?

I've had this same argument about Pascal and Delphi and got nowhere in the past. Delphi is very different from Pascal, but not to the Linguist.

1 reply

lildude Jun 7, 2024
Maintainer

Please tone down your language. The aggressive nature of your comments is not appreciated.

> So why don't you fix Linguist's hallucinations? Why does Linguist want to show biased language statistics?

I'm not entirely sure what you mean here. There's no hallucinating or bias. Linguist's purpose is to detect programming languages for GitHub. A design decision was made many, many years ago that Linguist would concentrate on programming languages when calculating the usage statistics, and the option of an override was provided for those who prefer everything to be considered and reflected in the stats.

> How is an RStudio notebook (an .Rmd file), with very distinctive R code chunks and comments between the code chunks, conceptually different from a C file with comments? Why does Linguist not recognize R code chunks -- easily parsable -- as code? Why does Linguist not classify a C file with comments as prose? Isn't Linguist being inconsistent?
>
> Linguist is having a hallucination that needs to be fixed by looking at the code chunks in a notebook. If there are no code chunks, then perhaps "prose" is an adequate answer. If a C file has no C code but only comments, do you label it as "prose"?
>
> An .Rmd file (or a Jupyter notebook) can have R or Python code (or other languages, too). Why not classify a file by the percentages of code chunks in the file? You could have an .Rmd file that is 75% R and 25% Python or vice versa. And perhaps a percentage for "prose" for the comment chunks? Most .Rmd files are not 100% "prose" whether Linguist says so or not.

There's the rub. You're expecting Linguist to do something it isn't designed to do.

Linguist uses a funnel-like approach of whittling down a list of languages to a single language using a series of strategies in decreasing order of specificity to determine the language of a file. It's important to note that Linguist identifies files in isolation; it does not take other files or directories into account.
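As a rough illustration of that funnel, here is a small Python sketch. It is not Linguist's implementation (Linguist is written in Ruby): the lookup tables are hypothetical and heavily trimmed, only three strategies are shown, and the later tie-breaking stages that Linguist has are left out, so an ambiguous file simply yields `None` here.

```python
from pathlib import PurePosixPath

# Hypothetical, heavily trimmed lookup tables for illustration only.
BY_FILENAME = {"Dockerfile": ["Dockerfile"], "Makefile": ["Makefile"]}
BY_INTERPRETER = {"Rscript": ["R"], "python3": ["Python"]}
BY_EXTENSION = {
    ".rmd": ["RMarkdown"],
    ".qmd": ["RMarkdown"],
    ".r": ["R", "Rebol"],
}


def by_filename(name, data):
    return BY_FILENAME.get(PurePosixPath(name).name, [])


def by_shebang(name, data):
    first = data.splitlines()[0] if data else ""
    if not first.startswith("#!"):
        return []
    parts = first[2:].split()
    if not parts:
        return []
    interpreter = PurePosixPath(parts[0]).name
    if interpreter == "env" and len(parts) > 1:
        interpreter = parts[1]
    return BY_INTERPRETER.get(interpreter, [])


def by_extension(name, data):
    return BY_EXTENSION.get(PurePosixPath(name).suffix.lower(), [])


# Most specific first; each strategy sees one file in isolation.
STRATEGIES = [by_filename, by_shebang, by_extension]


def detect(name, data=""):
    """Return a single language name, or None if the funnel never narrows to one."""
    candidates = []
    for strategy in STRATEGIES:
        found = strategy(name, data)
        if candidates and found:
            # Keep only languages that earlier strategies also allowed.
            found = [lang for lang in found if lang in candidates]
        if found:
            candidates = found
        if len(candidates) == 1:
            return candidates[0]
    return None


print(detect("analysis.Rmd"))                                # RMarkdown
print(detect("script.r"))                                    # None (still ambiguous)
print(detect("script.r", "#!/usr/bin/env Rscript\nx <- 1"))  # R
```

The point of the sketch is that the decision is made per file, from cheap signals such as the name and the first line, and stops as soon as one candidate remains.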

Linguist doesn't do partial file classification either, and isn't likely to without a complete rewrite. As currently written, it would be incredibly resource intensive for Linguist to analyse the content of every file uploaded to GitHub for a form of language detection that would rarely be needed, when simpler and less resource-intensive approaches, like the ones Linguist uses, exist. Yes, they're not perfect, but they're good enough for the vast majority of cases. Overrides exist for those occasions where things go awry or differ from people's desires or expectations.

I appreciate we live in an era of proliferating AI, but that isn't Linguist, and it won't be coming to Linguist. If a full-file, multi-language analysis method is implemented, with or without the help of AI, it will be in a new project outside of Linguist, and even then I'd expect things to sometimes go wrong.

So yes, Linguist is not perfect and makes opinionated decisions; however, it also provides overrides to help with those imperfections and for those who don't like its decisions.
