Clean bogus in-reply-to headers before store it #65

Open
gpoo opened this issue Mar 10, 2016 · 3 comments
Comments

@gpoo
Member

gpoo commented Mar 10, 2016

Some email clients add noise to the In-Reply-To header field. For example, from
http://lists.openstack.org/pipermail/openstack-dev/2012-August.txt.gz :

From julien at danjou.info  Wed Aug  8 19:52:08 2012
From: julien at danjou.info (Julien Danjou)
Date: Wed, 08 Aug 2012 21:52:08 +0200
Subject: [openstack-dev] [ceilometer] weekly meeting - CloudWatch
        functionality
In-Reply-To: <[email protected]> (Nick Barcet's message of "Wed, 
 08 Aug 2012 17:45:54 +0100")
References: <[email protected]>
 <[email protected]>
Message-ID: <[email protected]>

On Wed, Aug 08 2012, Nick Barcet wrote:
[...]

The actual Message-ID is <[email protected]>, without the trailing garbage, which is how it appears in References.

@gpoo
Member Author

gpoo commented Mar 10, 2016

BTW, the consequence of bogus In-Reply-To data is that you cannot reconstruct a discussion thread unless you are willing to parse this bogus format.

Something else that MLStats does not handle is multiple Message-IDs in that header, which RFC 2822 allows.
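
To illustrate the multiple-ID case, here is a minimal sketch that pulls every message ID out of a raw header value. The function name `extract_message_ids` and the sample IDs are hypothetical, and the pattern assumes an ID is an angle-bracketed token with no whitespace or nested brackets, which is stricter than the full RFC 2822 grammar.

```python
import re

# Assumption: a message ID is "<...>" with no whitespace or nested
# angle brackets inside. This does not cover obsolete RFC 2822 syntax.
MSG_ID_RE = re.compile(r'<[^<>\s]+>')


def extract_message_ids(header_value):
    """Return every message ID found in a raw header value, in order."""
    if not header_value:
        # Missing or empty header: nothing to extract.
        return []
    return MSG_ID_RE.findall(header_value)


# Placeholder IDs, not taken from a real archive. The value is folded
# across two lines, as a long header would be.
header = '<first@example.org>\n <second@example.org> (some comment)'
print(extract_message_ids(header))
```

Returning a list leaves it to the caller to decide how to store more than one parent ID.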

@gpoo
Member Author

gpoo commented Mar 10, 2016

Another example, where the Message-ID in the `In-Reply-To` field comes after the noise instead of before it:

From yamamoto at valinux.co.jp  Sun Jun  1 03:18:49 2014
From: yamamoto at valinux.co.jp (YAMAMOTO Takashi)
Date: Sun,  1 Jun 2014 12:18:49 +0900 (JST)
Subject: [openstack-dev] [Neutron][L3] BGP Dynamic Routing Proposal
In-Reply-To: Your message of "Fri, 30 May 2014 16:50:09 -0700"
 <CABJepwg3DJT8ST0tS0Mi5P98ovgbjVB_n3DSvGniMYHEpqUkCw@mail.gmail.com>
References: <CABJepwg3DJT8ST0tS0Mi5P98ovgbjVB_n3DSvGniMYHEpqUkCw@mail.gmail.com>
Message-ID: <[email protected]>

@gpoo
Member Author

gpoo commented Mar 11, 2016

Something like this should work:

diff --git a/pymlstats/analyzer.py b/pymlstats/analyzer.py
index 8c0fa63..7f98bdd 100644
--- a/pymlstats/analyzer.py
+++ b/pymlstats/analyzer.py
@@ -87,8 +87,7 @@ class MailArchiveAnalyzer:
                         'in-reply-to',
                         'subject',
                         'body']
-    common_headers = ['message-id', 'in-reply-to', 'list-id',
-                      'content-type', 'references']
+    common_headers = ['message-id', 'list-id', 'content-type', 'references']

     def __init__(self, archive=None):
         self.archive = archive
@@ -155,6 +154,9 @@ class MailArchiveAnalyzer:
             filtered_message['date'] = msgdate
             filtered_message['date_tz'] = str(tz_secs)

+            in_reply_to = message.get('in-reply-to')
+            filtered_message['in-reply-to'] = re.sub(r'^.*[^<]*(<.*>).*', r'\1', in_reply_to)
+
             # Retrieve other headers requested
             for header in self.common_headers:
                 msg = message.get(header)
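
Two cases may need care with the patch as posted: `re.sub` raises a `TypeError` when the message has no In-Reply-To header (`message.get` returns `None`), and because `.` does not match a newline by default, a folded header such as the first example above could keep its continuation line. A minimal standalone sketch of the same cleanup that covers both, using a simpler pattern than the one in the patch, might look like this. The name `clean_in_reply_to` and the ID `<id@example.org>` are hypothetical, and keeping only the first ID is an assumption.

```python
import re

# Assumption: a message ID is "<...>" with no whitespace or nested
# angle brackets inside. This is not the regex from the patch.
MSG_ID_RE = re.compile(r'<[^<>\s]+>')


def clean_in_reply_to(value):
    """Return the first message ID in value, or None if there is none."""
    if value is None:
        # The message has no In-Reply-To header at all.
        return None
    match = MSG_ID_RE.search(value)
    return match.group(0) if match else None


# Noise after the ID, folded across two lines (placeholder ID).
trailing = ("<id@example.org> (Nick Barcet's message of \"Wed, \n"
            " 08 Aug 2012 17:45:54 +0100\")")

# Noise before the ID, as in the second example in this issue.
leading = ('Your message of "Fri, 30 May 2014 16:50:09 -0700"\n'
           ' <CABJepwg3DJT8ST0tS0Mi5P98ovgbjVB_n3DSvGniMYHEpqUkCw@mail.gmail.com>')

print(clean_in_reply_to(trailing))
print(clean_in_reply_to(leading))
```

Searching for the ID directly, rather than substituting away everything around it, means the result does not depend on where the noise sits or on how the header was folded.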
