
Curated dataset of RSE repos #41

Closed
2 tasks done
karacolada opened this issue Aug 21, 2023 · 2 comments


@karacolada
Member

karacolada commented Aug 21, 2023

We have true and false positives for repositories created for the publication versus those merely used in it or referenced as related work.

  • check whether we can observe clustering when comparing publication dates with repo creation dates. I'd expect used software to be older than software created for the publication. This would also give us an interesting insight into how long it takes for software to become widespread. There might also be a difference depending on whether software was cited as "used tools" or "related work".
  • bring the labelled dataset into a publishable form
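
The comparison in the first task could be sketched roughly as below. This is only an illustration, assuming each labelled mention carries a boolean label (such as the `mention_created` column listed later in this thread), a publication date, and a repo creation date. The repo creation dates and all rows here are made up for the example; `gap_days` is a hypothetical helper, not part of the project code.

```python
from datetime import date
from statistics import median


def gap_days(published: date, repo_created: date) -> int:
    """Days from repo creation to publication; positive means the repo is older."""
    return (published - repo_created).days


# Illustrative rows only, not taken from the labelled dataset:
# (created for the publication?, publication date, assumed repo creation date)
rows = [
    (True, date(2021, 6, 1), date(2021, 3, 1)),
    (False, date(2021, 6, 1), date(2016, 6, 1)),
    (True, date(2020, 1, 10), date(2019, 12, 11)),
    (False, date(2022, 3, 15), date(2019, 3, 15)),
]

# Collect the gaps separately for "created" and "used / related work" mentions.
gaps_by_type = {True: [], False: []}
for created_for_pub, published, repo_created in rows:
    gaps_by_type[created_for_pub].append(gap_days(published, repo_created))

# A large difference between the two medians would support the expectation
# that used software tends to be older than software created for the paper.
median_gap = {label: median(gaps) for label, gaps in gaps_by_type.items()}
print(median_gap)
```

With real data, the per-type gaps could be plotted as a histogram or timeline rather than reduced to a median.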
@karacolada karacolada added this to the Outputs milestone Aug 21, 2023
@karacolada karacolada self-assigned this Nov 6, 2023
karacolada added a commit that referenced this issue Nov 15, 2023
@karacolada
Member Author

karacolada commented Nov 15, 2023

Dataset prepared at data/outputs/eprints_w_intent.csv. Schema:

Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   github_repo_id              130 non-null    object        
 1   mention_created             130 non-null    bool          
 2   pub_title                   130 non-null    object        
 3   pub_author_for_reference    130 non-null    object        
 4   pdf_url                     130 non-null    object        
 5   page_no                     130 non-null    int64         
 6   detected_github_url         130 non-null    object        
 7   pattern_matched_github_url  130 non-null    object        
 8   eprints_date                130 non-null    datetime64[ns]
 9   eprints_pub_year            130 non-null    int64         
 10  eprints_repo                130 non-null    object        
dtypes: bool(1), datetime64[ns](1), int64(2), object(7)
memory usage: 15.4+ KB
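
For anyone wanting to read the file back with matching column types, a minimal sketch using `pandas.read_csv` could look like this. It assumes the CSV uses the column names from the schema above; the two sample rows are placeholders invented for the example (the real ID and URL formats may differ), and `load_mentions` is a hypothetical helper name.

```python
import io

import pandas as pd

# Two made-up rows in the column order of the schema; all values are placeholders.
SAMPLE_CSV = """github_repo_id,mention_created,pub_title,pub_author_for_reference,pdf_url,page_no,detected_github_url,pattern_matched_github_url,eprints_date,eprints_pub_year,eprints_repo
example-org/example-repo,True,Example title,Doe,https://example.org/a.pdf,3,github.com/example-org/example-repo,github.com/example-org/example-repo,2021-06-01,2021,example-eprints
other-org/other-repo,False,Another title,Roe,https://example.org/b.pdf,12,github.com/other-org/other-repo,github.com/other-org/other-repo,2019-02-15,2019,example-eprints
"""


def load_mentions(source):
    """Read the mentions CSV, parsing the date column and keeping IDs as text."""
    return pd.read_csv(
        source,
        dtype={"github_repo_id": str, "page_no": "int64", "eprints_pub_year": "int64"},
        parse_dates=["eprints_date"],  # parsed into a datetime column
    )


# In practice `source` would be the path data/outputs/eprints_w_intent.csv.
df = load_mentions(io.StringIO(SAMPLE_CSV))
print(df.dtypes)
```

Passing `dtype=str` for `github_repo_id` is a precaution so that purely numeric IDs, if any, are not silently converted to integers.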

karacolada added a commit that referenced this issue Nov 16, 2023
karacolada added a commit that referenced this issue Nov 16, 2023
@karacolada
Member Author

Regarding first checkbox: yes!

[Figure: mention_type_timeline]

The grey filled areas mark one, two, and three years respectively.
