[A semi-feature request] Linguist for Jupyter notebook #5456

scaomath · 2021-07-11T19:52:03Z

scaomath
Jul 11, 2021

As is well known, if Linguist overrides are not set up in .gitattributes, the generated code in IPython notebooks is likely to take over the language stats of a repo. Right now there are two "solutions" to handle this in .gitattributes.

Method 1

The first one, which I consider unfavorable, is to set the language for .ipynb files to Python (or Julia, R, etc.):

*.ipynb linguist-language=Python

Not only will the Linguist stats be heavily inflated by the JavaScript lines in the notebook, the commits will be messed up as well. The output cells are counted as lines too, and comparing differences becomes a nightmare for memory, since even a simple, light notebook whose output cells display tensors may easily have over 100k lines.

Method 2

I believe right now the following is the method most people are using: adding

*.ipynb linguist-vendored

or

*.ipynb linguist-documentation

in .gitattributes. This excludes *.ipynb from the stats once and for all.

~~However, there is a downside in that now any commits involving any changes in *.ipynb will just be counted as 1 line change~~ (UPDATE: this claim is wrong; it was caused by a specific file that somehow had messed-up lines...). This is also a nightmare for difference comparing between commits.

A semi feature proposal

I am curious how difficult it is to implement a more adaptive detector for codes in Linguist to achieve the following:

Detect only the code cells as Python (or a user-set language), so that they get a line-by-line diff in VSCode or other tools.
Detect output cells and markdown cells as linguist-documentation to exclude them from the line-by-line diff.
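To make the proposal a bit more concrete, here is a minimal Python sketch of the kind of per-cell split it would need. This is not how Linguist works; it only assumes the usual nbformat 4 JSON layout (`cells`, `cell_type`, `source`, `outputs`), and the function names `count_notebook_lines` and `_as_text` are made up for illustration.

```python
import json


def _as_text(value):
    """nbformat stores text either as a list of lines or as a single string."""
    if isinstance(value, list):
        return "".join(value)
    return value if isinstance(value, str) else ""


def count_notebook_lines(notebook_text):
    """Count lines of code, markdown and output in a notebook given as a JSON string."""
    nb = json.loads(notebook_text)
    counts = {"code": 0, "markdown": 0, "output": 0}
    for cell in nb.get("cells", []):
        n_lines = len(_as_text(cell.get("source", "")).splitlines())
        cell_type = cell.get("cell_type")
        if cell_type == "code":
            counts["code"] += n_lines
            for out in cell.get("outputs", []):
                # Stream outputs keep their text under "text".
                counts["output"] += len(_as_text(out.get("text", "")).splitlines())
                # Rich outputs keep a plain-text rendering under data["text/plain"].
                plain = out.get("data", {}).get("text/plain", "")
                counts["output"] += len(_as_text(plain).splitlines())
        elif cell_type == "markdown":
            counts["markdown"] += n_lines
    return counts
```

Only the "code" count would feed the language stats under this proposal; the other two would be treated like documentation.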

lildude · 2021-07-12T08:34:31Z

lildude
Jul 12, 2021
Maintainer

I am curious how difficult it is to implement a more adaptive detector for codes in Linguist to achieve the following:

It's not trivial at all, primarily because of the performance impact this would have and how accurate it would be. It's come up a few times in the past in the discussions around detecting Python in Jupyter files in #3316, R in Rmarkdown files in #5208, and JavaScript in HTML files in #5248, and more generically in #5326

5 replies

scaomath Jul 15, 2021
Author

So the take is "don't do dev using notebook", otherwise you have to do eyeball diff 😂

lildude Jul 22, 2021
Maintainer

🤔 I'm not sure that's right, same with your original comment in the OP:

However, there is a downside in that now any commits involving any changes in *.ipynb will just be counted as 1 line change. This is also a nightmare for difference comparing between commits.

The diff will be collapsed but it should still be a normal multi-line diff when expanded. Do you have an example to demonstrate the behaviour you're seeing?

scaomath Jul 22, 2021
Author

Screencast.from.07-22-2021.mp4

Please refer to this video. Apparently more than 1 addition and 1 deletion are committed, but the diff is not working properly, and I do not know why...

lildude Jul 22, 2021
Maintainer

🤔 that single-line diff behaviour is unrelated to Linguist and would suggest your raw data may all be in a single line.

If you view your file and then click the "Raw" button, does it show all your code in a single line? I suspect it does.

scaomath Jul 22, 2021
Author

😂 you are absolutely correct. I didn't know how that happened to that specific file.
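For anyone hitting the same thing, a rough Python sketch of how one might check for a single-line notebook and rewrite it across multiple lines. The helper names `is_single_line` and `reindent_notebook` are hypothetical, and `indent=1` is an assumption about what Jupyter typically writes, not something confirmed in this thread.

```python
import json


def is_single_line(raw_text):
    """Return True if the content has no line breaks apart from leading/trailing ones."""
    return len(raw_text.strip().splitlines()) <= 1


def reindent_notebook(raw_text, indent=1):
    """Re-serialize notebook JSON over multiple lines so diffs can work line by line."""
    nb = json.loads(raw_text)
    return json.dumps(nb, indent=indent, ensure_ascii=False) + "\n"
```

Reading the .ipynb file as text, checking `is_single_line`, and writing back the result of `reindent_notebook` keeps the JSON content identical while restoring one value per line.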

BOSOEK · 2021-07-23T14:32:22Z

BOSOEK
Jul 23, 2021

Hello. I am currently a student studying artificial intelligence!
I uploaded the .ipynb file to the repository, but GitHub does not recognize it. Can you help?

2 replies

scaomath Jul 23, 2021
Author

😉 It seems that this is the wrong place to ask this question. However, GitHub does recognize your ipynb files.

This is what I see on GitHub:

lildude Jul 24, 2021
Maintainer

@BOSOEK you really should have started a new discussion as your issue is completely unrelated to this discussion.

That said, your repo isn't analysed for the same reason as that discussed in #5445. From my #5445 (comment):

The issue here is that repo is heeeeaaawg and has a massive object tree and a lot of files so it's hitting the guard at:

https://github.com/github/linguist/blob/16c70aef8cd62ca071231a380c69050f5e83c900/lib/linguist/repository.rb#L132-L135

This has been implemented so that repos like this don't negatively impact the analysis of others or excessively hog resources.
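As an illustration only, the general idea of such a guard can be sketched in Python as a bounded count that gives up early. This is not the Linguist code linked above (which is Ruby and works on the Git object tree); the function names, the use of a directory walk, the `limit` parameter, and the empty-dict result are all assumptions made for the sketch.

```python
import os


def count_files_up_to(root, limit):
    """Count files under root, stopping as soon as the count reaches limit."""
    count = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        count += len(filenames)
        if count >= limit:
            # No need to walk the rest of a huge tree once the limit is hit.
            return limit
    return count


def analyse_if_small_enough(root, limit, analyse):
    """Run analyse(root) unless the tree holds at least `limit` files."""
    if count_files_up_to(root, limit) >= limit:
        return {}  # assumed behaviour: skip the analysis and report nothing
    return analyse(root)
```

The early return is what keeps an oversized repo cheap to reject: the cost of the check is capped by the limit rather than by the size of the repo.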
