AI page rank repos #257

Merged
merged 6 commits into from
Oct 1, 2024

Conversation

Prudhvivuda
Collaborator

Take the top p% of contributors by PageRank score and identify which repos they also belong to.
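
A minimal sketch of that idea, assuming an undirected collaboration graph among contributors and a contributor-to-repos mapping. The data, the function names, and the hand-rolled power iteration are all hypothetical stand-ins; the notebook may well compute PageRank with a graph library instead.

```python
import math
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on an undirected graph given as (u, v) pairs."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    nodes = sorted(neighbors)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Each node receives an equal share of every neighbor's current score.
        rank = {
            n: (1 - damping) / len(nodes)
            + damping * sum(rank[m] / len(neighbors[m]) for m in neighbors[n])
            for n in nodes
        }
    return rank


def top_fraction(rank, p):
    """Return the top p (0 < p <= 1) fraction of nodes by score, at least one."""
    k = max(1, math.ceil(p * len(rank)))
    return sorted(rank, key=rank.get, reverse=True)[:k]


def repos_of(contributors, contributor_repos):
    """Count how many of the given contributors are associated with each repo."""
    counts = defaultdict(int)
    for c in contributors:
        for repo in contributor_repos.get(c, ()):
            counts[repo] += 1
    return dict(counts)


# Hypothetical data: "a" collaborates with everyone else (a star graph).
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e")]
contributor_repos = {"a": ["org/seed", "org/other"], "b": ["org/seed"]}

rank = pagerank(edges)
top = top_fraction(rank, 0.20)  # top 20% of 5 contributors
common_repos = repos_of(top, contributor_repos)
```

Repos that show up in `common_repos` but are not in the seed list are the candidates for discovery.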

@Prudhvivuda Prudhvivuda marked this pull request as draft August 12, 2024 16:00
@Prudhvivuda Prudhvivuda marked this pull request as ready for review August 12, 2024 23:16
@cdolfi
Contributor

cdolfi commented Aug 14, 2024

@Prudhvivuda Is this based on one of the WASM notebooks? If so, can you link it?

@Prudhvivuda
Collaborator Author

> @Prudhvivuda Is this based on one of the WASM notebooks? If so, can you link it?

Sure. https://github.com/oss-aspen/Rappel/blob/main/notebooks/project_discovery/pagerank_top_repos/top_pagerank_common_repos.ipynb

@@ -0,0 +1,2609 @@
{
Contributor

@cdolfi cdolfi Aug 16, 2024

Add credit to the original author of this notebook, note that this is the second iteration of this process, and summarize what was learned from the WASM iteration.


Collaborator Author

ok

@@ -0,0 +1,2609 @@
{
Contributor

@cdolfi cdolfi Aug 16, 2024

Is this CSV produced by running the Rappel/notebooks/emerging_ai_projects/collabs.ipynb notebook? If so, include it in this PR.


Collaborator Author

ok

@@ -0,0 +1,2609 @@
{
Contributor

@cdolfi cdolfi Aug 16, 2024

Add back the attribution. I would advise against removing credit to the original code writers.


Collaborator Author

I never had this intention (out of ignorance). I am so sorry about this. I am adding it back.

@@ -0,0 +1,2609 @@
{
Contributor

@cdolfi cdolfi Aug 16, 2024

Where does the NaN come from?


Collaborator Author

@Prudhvivuda Prudhvivuda Aug 17, 2024

The NaN comes from the contributor_repo table, for the cntrb_id '0106ecc9-c100-0000-0000-000000000000'.
Should we drop it? If we do, we might lose the data related to this cntrb_id.
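
One way to weigh that trade-off is to look at what would disappear before dropping anything. The sketch below uses hypothetical rows and only the `cntrb_id` and `repo_git` column names from this thread; it is an illustration, not the notebook's actual code.

```python
import pandas as pd

# Hypothetical event-stream rows; the last contributor has no repo information.
events = pd.DataFrame(
    {
        "cntrb_id": ["c1", "c1", "c2", "c3"],
        "repo_git": [
            "https://github.com/org/a",
            "https://github.com/org/b",
            "https://github.com/org/a",
            None,
        ],
    }
)

# Rows whose repo_git is missing, and the contributors they belong to.
missing = events[events["repo_git"].isna()]
affected = set(missing["cntrb_id"])

# Drop only the incomplete rows, then see which contributors vanish entirely.
cleaned = events.dropna(subset=["repo_git"])
lost = affected - set(cleaned["cntrb_id"])
```

If `lost` is empty, dropping the rows removes no contributor from the analysis; otherwise it names exactly the contributors whose data would be lost.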

Contributor

Could you provide a screenshot of the column or a query so I can look at this? I'm confused about how there is a repo_git that is NaN.

Collaborator Author

@Prudhvivuda Prudhvivuda Aug 20, 2024

[image]
These are the contributors to our seed repositories. One or two contributors from the known list have this issue.

Collaborator Author

@Prudhvivuda Prudhvivuda Aug 20, 2024

When we take the top 10% of the known contributors and join them with the event stream, we get None for some (one or two) contributors.
[image]

Query used to fetch event stream:
event_stream_query = salc.sql.text(
    f"""
    SET SCHEMA 'augur_data';
    SELECT
        c.cntrb_id,
        c.event_id,
        c.created_at,
        c.cntrb_repo_id as repo_id,
        c.repo_git,
        c.repo_name,
        c.gh_repo_id,
        c.cntrb_category as event_type
    FROM
        contributor_repo c
    """
)
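
One possible source of such missing values, not confirmed in this thread, is a left join in which a top contributor has no matching rows in the event stream. A small sketch with hypothetical data shows how pandas' merge indicator makes those contributors visible; only the column names `cntrb_id` and `repo_git` come from the query above.

```python
import pandas as pd

# Hypothetical top contributors (e.g. the top 10% by PageRank).
top = pd.DataFrame({"cntrb_id": ["c1", "c2", "c3"]})

# Hypothetical event stream with two of the columns selected by the query.
events = pd.DataFrame(
    {
        "cntrb_id": ["c1", "c2", "c2"],
        "repo_git": [
            "https://github.com/org/a",
            "https://github.com/org/a",
            "https://github.com/org/b",
        ],
    }
)

# how="left" keeps every top contributor; indicator=True adds a "_merge"
# column that marks rows with no match on the right as "left_only".
merged = top.merge(events, on="cntrb_id", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "cntrb_id"].tolist()
```

If `unmatched` is empty but NaN values remain, the missing values are already present in the table itself rather than introduced by the join.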

Contributor

@JamesKunstle @hemajv do these results make sense to you?

Contributor

So I think the reason we are seeing NaN values is that the event stream data only fetches the latest 2 months of data. Hence, the contributor may have had contributions that are older than 2 months and are therefore captured as NaN.

@JamesKunstle can confirm if this is accurate

Collaborator

@JamesKunstle JamesKunstle Sep 10, 2024

It looks like the event stream just has NaN entries for individual contributors. I have no idea why this would be the case, but this isn't a critical consideration to me.

These results are cool because they find a couple of repos that I know would be interesting, like 'llvm' and 'triton-lang/triton', which are related to one another in the AI space, and also others like meta-llama/llama3.

Therefore I'm inclined to believe that this is an Augur peculiarity, perhaps rooted in what Hema suggested.

Collaborator Author

> So I think the reason we are seeing NaN values is that the event stream data only fetches the latest 2 months of data. Hence, the contributor may have had contributions that are older than 2 months and are therefore captured as NaN.
>
> @JamesKunstle can confirm if this is accurate

@cdolfi The event stream data is fetched from the Augur DB, and Augur periodically fetches data using the GitHub APIs. So I assume it should have data older than 2 months? Please correct me if I am wrong!
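
The two-month question could be checked directly against the `created_at` column. The sketch below uses hypothetical timestamps and a rough 61-day threshold of my own choosing; real values would come from the contributor_repo table.

```python
import pandas as pd

# Hypothetical event-stream sample; real data would come from contributor_repo.
events = pd.DataFrame(
    {
        "cntrb_id": ["c1", "c2", "c3"],
        "created_at": ["2024-01-15", "2024-07-01", "2024-08-10"],
    }
)
events["created_at"] = pd.to_datetime(events["created_at"])

# Time span covered by the fetched events.
earliest = events["created_at"].min()
latest = events["created_at"].max()
span_days = (latest - earliest).days

# True if the data reaches back further than roughly two months.
older_than_two_months = span_days > 61
```

A span well beyond two months would suggest the NaN values have a different cause than a limited fetch window.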

Contributor

@sgoggins ^ could you provide insight on this?

@@ -0,0 +1,2609 @@
{
Contributor

@cdolfi cdolfi Aug 16, 2024

Add analysis interpretation for these repositories. What are your takeaways? What do you notice?


Collaborator Author

ok

Contributor

@hemajv hemajv left a comment

@Prudhvivuda Overall this looks good to me. I think it would be useful to add a comment about the event stream data, i.e. that it only captures the most recent 2 months of data, and how that might introduce NaN values in our data.

@@ -0,0 +1,2666 @@
{
Collaborator

@JamesKunstle JamesKunstle Sep 10, 2024

For jax and xla, what's happened is the identification of a pillar of the AI space that wasn't initially identified because of ecosystem ignorance in our assumptions. This is something that we could easily find out from documentation, but this confirms (and measures) the extent to which a non-tf and non-pt player is in the game.


Collaborator Author

Thank you for the comment. Adding this to the conclusion.

Contributor

@cdolfi cdolfi left a comment

lgtm

@cdolfi cdolfi merged commit 3d41fca into oss-aspen:main Oct 1, 2024
4 participants