AI page rank repos #257
Check out this pull request on ReviewNB. See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB
@Prudhvivuda Is this based on one of the WASM notebooks? If so, can you link it?
@@ -0,0 +1,2609 @@
{
Add credit to the original author of this notebook, note that this is the second iteration of this process, and summarize what was learned from the WASM iteration.
ok
@@ -0,0 +1,2609 @@
{
Is this CSV produced by running the Rappel/notebooks/emerging_ai_projects/collabs.ipynb notebook? If so, include it in this PR.
ok
@@ -0,0 +1,2609 @@
{
Add the attribution back. I would advise against removing credit to the original code authors.
That was never my intention; I removed it out of ignorance. I am so sorry about this, and I am adding it back.
@@ -0,0 +1,2609 @@
{
The NaN is coming from the contributor_repo table for the cntrb_id '0106ecc9-c100-0000-0000-000000000000'.
Should we drop these rows? If yes, we might lose the data related to this cntrb_id.
Could you provide a screenshot of the column or a query so I can look at this? I'm confused about how there is a repo_git that is NaN.
When we take the top 10% of the known contributors and join them with the event stream, we get None for some (one or two) contributors.
Query used to fetch the event stream:
event_stream_query = salc.sql.text(
    f"""
    SET SCHEMA 'augur_data';
    SELECT
        c.cntrb_id,
        c.event_id,
        c.created_at,
        c.cntrb_repo_id as repo_id,
        c.repo_git,
        c.repo_name,
        c.gh_repo_id,
        c.cntrb_category as event_type
    FROM
        contributor_repo c
    """
)
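Two different causes could produce a missing repo_git after this join: the contributor has no rows in the event stream at all, or a row exists but its repo_git is already null in contributor_repo. The following is a minimal sketch for telling them apart, using hypothetical toy data; the ids, URLs, and the `left_join` helper are illustrative and are not taken from the notebook. In pandas, `merge(..., how="left", indicator=True)` gives the same distinction through its `_merge` column.

```python
# Hypothetical toy data: placeholder ids and URLs, not real Augur rows.
top_contributors = ["cntrb-a", "cntrb-b", "cntrb-c"]

# Each event mirrors a subset of the columns selected from contributor_repo.
event_stream = [
    {"cntrb_id": "cntrb-a", "repo_git": "https://example.org/org/repo1"},
    {"cntrb_id": "cntrb-b", "repo_git": None},  # row exists, repo_git is null
]


def left_join(contributors, events):
    """Left-join contributors to events and record why repo_git is missing."""
    by_contributor = {}
    for event in events:
        by_contributor.setdefault(event["cntrb_id"], []).append(event)

    rows = []
    for cntrb_id in contributors:
        matches = by_contributor.get(cntrb_id)
        if not matches:
            # No event at all for this contributor: the join itself creates the gap.
            rows.append({"cntrb_id": cntrb_id, "repo_git": None, "reason": "no_events"})
            continue
        for event in matches:
            # An event exists; a missing repo_git here comes from the source table.
            reason = "ok" if event["repo_git"] is not None else "null_in_source"
            rows.append({"cntrb_id": cntrb_id, "repo_git": event["repo_git"], "reason": reason})
    return rows


joined = left_join(top_contributors, event_stream)
missing = {row["cntrb_id"]: row["reason"] for row in joined if row["repo_git"] is None}
print(missing)
```

If every missing value falls under `no_events`, the gap is introduced by the join (for example, by a limited time window in the event stream); if any fall under `null_in_source`, the nulls are already present in the table being queried.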
@JamesKunstle @hemajv do these results make sense to you?
I think the reason we are seeing NaN values is that the event stream only fetches the latest 2 months of data. A contributor whose contributions are all older than 2 months would therefore have no matching events and would be captured as NaN.
@JamesKunstle can confirm if this is accurate
It looks like the event stream just has NaN entries for individual contributors. I have no idea why this would be the case, but this isn't a critical consideration to me.
These results are cool because they find a couple of repos that I know would be interesting, like 'llvm' and 'triton-lang/triton', which are related to one another in the AI space, and also others like meta-llama/llama3.
Therefore I'm inclined to believe that this is an Augur peculiarity, perhaps rooted in what Hema suggested.
> I think the reason we are seeing NaN values is that the event stream only fetches the latest 2 months of data. A contributor whose contributions are all older than 2 months would therefore have no matching events and would be captured as NaN.
> @JamesKunstle can confirm if this is accurate
@cdolfi The event stream data is fetched from the Augur DB, and Augur periodically fetches data through the GitHub APIs, so I assume it should have data older than 2 months? Please correct me if I am wrong!
@sgoggins ^ could you provide insight on this?
@@ -0,0 +1,2609 @@
{
Add analysis interpretation for these repositories. What are your takeaways? What do you notice?
ok
…repos_page_rank pull from main
@Prudhvivuda Overall this looks good to me. I think it would be useful to add a comment about the event stream data, i.e. that it only captures the most recent 2 months of data and how that might introduce NaN values in our data.
@@ -0,0 +1,2666 @@
{
For jax and xla, what's happened is the identification of a pillar of the AI space that wasn't initially identified because of ecosystem ignorance in our assumptions. This is something that we could easily have found in documentation, but this confirms (and measures) the extent to which a non-tf and non-pt player is in the game.
Thank you for the comment. Adding this to the conclusion.
lgtm
Take the top p% of contributors by their PageRank score and identify which repos they also belong to.
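The step above can be sketched roughly as follows. This is a minimal illustration, assuming contributors and repos are linked in an undirected bipartite graph built from (contributor, repo) pairs; the notebook's actual graph construction and PageRank library are not shown in this thread, so the `pagerank` and `repos_of_top_contributors` helpers and the toy pairs below are hypothetical.

```python
import math


def pagerank(adjacency, damping=0.85, iterations=100):
    """Power-iteration PageRank on an undirected graph {node: set(neighbors)}.

    Assumes every node has at least one neighbor, which holds for a graph
    built purely from (contributor, repo) pairs.
    """
    nodes = list(adjacency)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            # Each node splits its damped rank evenly among its neighbors.
            share = damping * rank[node] / len(adjacency[node])
            for neighbor in adjacency[node]:
                new_rank[neighbor] += share
        rank = new_rank
    return rank


def repos_of_top_contributors(pairs, p):
    """Return the top fraction p of contributors by PageRank and their repos."""
    adjacency = {}
    for cntrb, repo in pairs:
        adjacency.setdefault(cntrb, set()).add(repo)
        adjacency.setdefault(repo, set()).add(cntrb)
    contributors = {cntrb for cntrb, _ in pairs}
    scores = pagerank(adjacency)
    ranked = sorted(contributors, key=lambda c: scores[c], reverse=True)
    k = max(1, math.ceil(p * len(ranked)))  # keep at least one contributor
    top = ranked[:k]
    repos = sorted({repo for cntrb in top for repo in adjacency[cntrb]})
    return top, repos


# Hypothetical toy data: one contributor active in three repos, three in one each.
pairs = [
    ("alice", "repo1"), ("alice", "repo2"), ("alice", "repo3"),
    ("bob", "repo1"), ("carol", "repo2"), ("dave", "repo3"),
]
top, repos = repos_of_top_contributors(pairs, p=0.25)
print(top, repos)  # ['alice'] ['repo1', 'repo2', 'repo3']
```

Contributor and repo names must not collide in this sketch, since both share one node namespace; prefixing or tupling the node keys would remove that restriction.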