-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NBs for cleaning and storing email data #11
base: master
Are you sure you want to change the base?
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
@MichaelClifford how does this compare to https://github.com/aicoe-aiops/mailing-list-analysis-toolkit ? Is this essentially the same thing? |
@tumido these projects are somewhat similar : ) Insofar as they both are performing some analysis on the Fedora mailing list. I think the main difference at the moment is this is a specific "discriminatory text detection" project @cdolfi is working on for the OSPO team. And https://github.com/aicoe-aiops/mailing-list-analysis-toolkit is less specific as far as analysis goals. Does that answer your question? |
I wonder if there would be any benefit to unify it. @cdolfi can benefit from the automation (automated analysis runs and new data collection) we already have set up for the other project and the other project can benefit from additional analysis it can be capable of... WDYT? |
@tumido I would be happy to have this work integrated into the toolkit analysis as it appears to represent another analysis type that would likely be useful for many emailing lists. That said, @cdolfi is the owner of this project so I will leave it up to her if/when she wants to contribute this work into the mailing list toolkit project. @cdolfi if you decide you'd like us to merge these two efforts, let us know and we'd be happy to connect and figure out the best way to perform the integration. |
FTR: I don't want to block this PR by any means or question it by any way. I just got curious by the similarity of these projects. 🙂 |
@tumido @MichaelClifford I talked with @oindrillac today about working toward merging these two efforts together. I think this is a great idea and would definitely want to connect to see what is the best way to do that is. I feel like the work that each of us are doing is complimentary of each other and leaves a lot of room to save doing the same work twice. |
Great! and sorry for hijacking this PR. I've opened an issue here to discuss this integration stuff further. Lets move this discussion there and use this thread to discuss the PR. 😄 |
@@ -0,0 +1,172 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What bucket are you using for this data? I can't find a cdolfi
subdir on the DH-PLAYPEN instance.
Also the naming convention you are using is a bit confusing. I think s3_bucket_name
should be something like s3_prefix
since its the directory name you want to use inside of the bucket, right?
Reply via ReviewNB
@@ -0,0 +1,841 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you rerun this cell so that it does not contain the error output? I tried to reproduce it but could not. I did find changing the year of interest from 2006 to 2010 made it run quickly without issue whereas 2006 seemed to hang indefinitely.
Reply via ReviewNB
@@ -0,0 +1,841 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is anything below this cell needed? It looks like its all duplicate of the cells above.
Also does this notebook include things from retrieve_mbox
and parsing_mbox
? If so, are the other notebooks necessary any longer?
Reply via ReviewNB
@@ -0,0 +1,841 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this threw an error for me since you already removed(name) above in an earlier cell. Do you want to delete this cell?
Reply via ReviewNB
@@ -2,7 +2,7 @@ | |||
"cells": [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this wants to open an mbox file stored locally. The first argument should be a path name, not the body of a file. Saving the object to "data" first might help. See this notebook for another way to read the mbox files https://github.com/aicoe-aiops/mailing-list-analysis-toolkit/blob/master/notebooks/stage/raw_to_text.ipynb
Reply via ReviewNB
/hold please read https://chat.google.com/room/AAAAQcVnQvs/kDNfLz86AMs before unholding |
these two notebooks take the data from hyper kitty, clean and save them as CSVs, and then combines them to a large data set