Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NBs for cleaning and storing email data #11

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

cdolfi
Copy link
Collaborator

@cdolfi cdolfi commented Jan 19, 2021

these two notebooks take the data from hyper kitty, clean and save them as CSVs, and then combines them to a large data set

@cdolfi cdolfi requested a review from oindrillac January 19, 2021 17:03
@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@tumido
Copy link
Member

tumido commented Jan 19, 2021

@MichaelClifford how does this compare to https://github.com/aicoe-aiops/mailing-list-analysis-toolkit ? Is this essentially the same thing?

@MichaelClifford
Copy link
Member

@tumido these projects are somewhat similar : ) Insofar as they both are performing some analysis on the Fedora mailing list. I think the main difference at the moment is this is a specific "discriminatory text detection" project @cdolfi is working on for the OSPO team. And https://github.com/aicoe-aiops/mailing-list-analysis-toolkit is less specific as far as analysis goals.

Does that answer your question?

@tumido
Copy link
Member

tumido commented Jan 20, 2021

I wonder if there would be any benefit to unify it. @cdolfi can benefit from the automation (automated analysis runs and new data collection) we already have set up for the other project and the other project can benefit from additional analysis it can be capable of... WDYT?

@MichaelClifford
Copy link
Member

@tumido I would be happy to have this work integrated into the toolkit analysis as it appears to represent another analysis type that would likely be useful for many emailing lists. That said, @cdolfi is the owner of this project so I will leave it up to her if/when she wants to contribute this work into the mailing list toolkit project. @cdolfi if you decide you'd like us to merge these two efforts, let us know and we'd be happy to connect and figure out the best way to perform the integration.

@tumido
Copy link
Member

tumido commented Jan 20, 2021

FTR: I don't want to block this PR by any means or question it by any way. I just got curious by the similarity of these projects. 🙂

@cdolfi
Copy link
Collaborator Author

cdolfi commented Jan 20, 2021

@tumido @MichaelClifford I talked with @oindrillac today about working toward merging these two efforts together. I think this is a great idea and would definitely want to connect to see what is the best way to do that is. I feel like the work that each of us are doing is complimentary of each other and leaves a lot of room to save doing the same work twice.

@MichaelClifford
Copy link
Member

Great! and sorry for hijacking this PR. I've opened an issue here to discuss this integration stuff further. Lets move this discussion there and use this thread to discuss the PR. 😄

@@ -0,0 +1,172 @@
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What bucket are you using for this data? I can't find a cdolfi subdir on the DH-PLAYPEN instance.

Also the naming convention you are using is a bit confusing. I think s3_bucket_name should be something like s3_prefix since its the directory name you want to use inside of the bucket, right?


Reply via ReviewNB

@@ -0,0 +1,841 @@
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rerun this cell so that it does not contain the error output? I tried to reproduce it but could not. I did find changing the year of interest from 2006 to 2010 made it run quickly without issue whereas 2006 seemed to hang indefinitely.


Reply via ReviewNB

@@ -0,0 +1,841 @@
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is anything below this cell needed? It looks like its all duplicate of the cells above.

Also does this notebook include things from retrieve_mboxand parsing_mbox? If so, are the other notebooks necessary any longer?


Reply via ReviewNB

@@ -0,0 +1,841 @@
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this threw an error for me since you already removed(name) above in an earlier cell. Do you want to delete this cell?


Reply via ReviewNB

@@ -2,7 +2,7 @@
"cells": [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this wants to open an mbox file stored locally. The first argument should be a path name, not the body of a file. Saving the object to "data" first might help. See this notebook for another way to read the mbox files https://github.com/aicoe-aiops/mailing-list-analysis-toolkit/blob/master/notebooks/stage/raw_to_text.ipynb


Reply via ReviewNB

@tumido
Copy link
Member

tumido commented Mar 22, 2021

/hold please read https://chat.google.com/room/AAAAQcVnQvs/kDNfLz86AMs before unholding

@sesheta sesheta added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants