Feature/multipart match #231
base: main
Changes from all commits
ee46c52
076e1c4
e1eb8d4
e795893
6862559
a98f08f
f864423
45cca10
9768cb3
21ed331
@@ -1855,12 +1855,31 @@ ruleset:

 header:
 	for _, t := range rs.HeadersRegexpCompiled {
+		isSubjectMatch := t[0].MatchString("subject")
 		for k, vl := range header {
 			k = strings.ToLower(k)
+			if t[0].MatchString("body") { // message body match
Review comment: You mentioned elsewhere that it may be good to separate body matching from header matching. That does seem better, at minimum to avoid confusion between a header that happens to be called "body" and the actual message body. I was thinking we could maybe use an empty header key to indicate a body match, but HeadersRegexp is a map, and it would probably look odd in the config file, if it works at all. Also, for this code, shouldn't the "if" statement come before its for-loop ("range header")? The body match does not need to run for each header key/value in the message.

Reply: OK. I also think separating header matching and body matching is the right approach. Yes, in hindsight I see that I wrote naive code...
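A minimal sketch of the structure discussed in this thread, assuming a hypothetical `rule` type that keeps body patterns in their own field instead of under a "body" header key. The names `rule`, `newRule`, and `matches` are illustrative and are not part of mox; the body is passed as a plain string only to keep the sketch short.

```go
package main

import (
	"regexp"
	"strings"
)

// rule is a hypothetical ruleset shape: header patterns and body patterns
// live in separate fields, so a header named "body" cannot be confused
// with the message body.
type rule struct {
	headers [][2]*regexp.Regexp // pairs of (header key pattern, header value pattern)
	body    []*regexp.Regexp
}

// newRule compiles the given patterns. It panics on an invalid pattern,
// which is acceptable for a sketch but not for configuration parsing.
func newRule(headers [][2]string, body []string) rule {
	var r rule
	for _, h := range headers {
		pair := [2]*regexp.Regexp{regexp.MustCompile(h[0]), regexp.MustCompile(h[1])}
		r.headers = append(r.headers, pair)
	}
	for _, b := range body {
		r.body = append(r.body, regexp.MustCompile(b))
	}
	return r
}

// matches reports whether every body pattern and every header pattern pair
// matches the message.
func (r rule) matches(header map[string][]string, body string) bool {
	// Body patterns are evaluated once per message, outside the header loop.
	for _, re := range r.body {
		if !re.MatchString(body) {
			return false
		}
	}
header:
	for _, t := range r.headers {
		for k, vl := range header {
			if !t[0].MatchString(strings.ToLower(k)) {
				continue
			}
			for _, v := range vl {
				if t[1].MatchString(strings.ToLower(strings.TrimSpace(v))) {
					continue header
				}
			}
		}
		// No header key/value satisfied this pattern pair.
		return false
	}
	return true
}
```

With this shape, a rule built as `newRule([][2]string{{"^subject$", "invoice"}}, []string{"total"})` needs both a matching subject and a matching body, and the body check runs once no matter how many headers the message has.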
+				ws := PrepareWordSearch([]string{t[1].String()}, []string{})
Review comment: I'm no longer sure that PrepareWordSearch is the best way to do the matching. It is used by IMAP search and webmail search, and it can require the presence or absence of certain words, but that isn't needed for these matches, and we want to match on regular expressions (at least for now; in the future we could perhaps add more elaborate matching mechanisms, including "not"-matches). I think we can use https://pkg.go.dev/regexp#Regexp.MatchReader. The RuneReader interface is implemented by bufio.Reader: https://pkg.go.dev/bufio#Reader.ReadRune. So I think we can wrap the io.Reader returned by https://pkg.go.dev/github.com/mjl-/mox/message#Part.Reader in a bufio.Reader and call MatchReader (or a similar method) on it. We would also do that for each of Part.Parts (multipart messages) recursively (see https://pkg.go.dev/github.com/mjl-/mox/message#Part), until we have a match.

Reply: Yes, I based this on the code used in webmail search. I will look into MatchReader.
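The MatchReader suggestion can be sketched with the standard library alone. The `part` type below is a hypothetical stand-in for `message.Part`, and its `Reader` method only mimics the role of `Part.Reader`; `matchReader` and `matchPattern` are illustrative names, not mox functions.

```go
package main

import (
	"bufio"
	"io"
	"regexp"
	"strings"
)

// part is a hypothetical stand-in for message.Part, holding an already
// decoded body.
type part struct {
	body string
}

// Reader plays the role of message.Part.Reader in this sketch.
func (p part) Reader() io.Reader {
	return strings.NewReader(p.body)
}

// matchReader streams r through the regular expression. bufio.Reader
// implements io.RuneReader, which is what Regexp.MatchReader requires.
func matchReader(re *regexp.Regexp, r io.Reader) bool {
	return re.MatchReader(bufio.NewReader(r))
}

// matchPattern compiles pattern and matches it against the body of p.
// Real code would compile the expression once, when loading the config.
func matchPattern(pattern string, p part) (bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, err
	}
	return matchReader(re, p.Reader()), nil
}
```

Because the body is consumed through a reader, the content does not have to be loaded into memory with io.ReadAll first. Note that MatchReader may read well past the end of a match, so the reader should not be reused afterwards.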
+				// todo: regexp match
+				ok, err := ws.MatchPart(log, &p, true)
+				if err != nil {
+					log.Errorx("Failed to match body: %v", err)
+				}
+				if ok {
+					continue header
+				}
+			}
 			if !t[0].MatchString(k) {
 				continue
 			}
 			for _, v := range vl {
+				if isSubjectMatch {
Review comment: Decoding RFC 2047 encoded words is a good idea. Decoding should probably be done with mime.WordDecoder, as is done at https://www.xmox.nl/xr/v0.0.12/message/part.go.html#L480. The code at https://www.xmox.nl/xr/v0.0.12/message/part.go.html#L448 also handles the various character encodings (though more may need to be added explicitly: I think "ianaindex" misses a few character sets, and I'm not sure about the Japanese ones). I think RFC 2047 decoding of headers could be a separate PR; it isn't tied to matching words in the body.

Reply: Thank you. I'll check them.
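As a rough illustration of the mime.WordDecoder approach, the sketch below decodes a header value with the standard library only. `decodeHeaderValue` is a hypothetical helper, not the mox code linked above. The standard decoder handles UTF-8, ISO-8859-1, and US-ASCII by itself; any other charset goes through the `CharsetReader` hook, which this sketch leaves unimplemented (a real version would plug in decoders such as those from golang.org/x/text).

```go
package main

import (
	"fmt"
	"io"
	"mime"
)

// decodeHeaderValue decodes all RFC 2047 encoded-words in v. Whitespace
// that only separates two adjacent encoded-words is dropped by the decoder.
func decodeHeaderValue(v string) (string, error) {
	dec := mime.WordDecoder{
		// CharsetReader is called with the lower-cased charset name for
		// charsets the decoder does not handle natively. This sketch
		// rejects them; a real hook would return a decoding reader.
		CharsetReader: func(charset string, input io.Reader) (io.Reader, error) {
			return nil, fmt.Errorf("unsupported charset %q", charset)
		},
	}
	return dec.DecodeHeader(v)
}
```

A value without encoded-words is returned unchanged, so the helper can be applied to every subject before matching.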
+					// todo: memorize decoded text
+					v, err = decodeRFC2047(v)
+					if err != nil {
+						log.Errorx("Failed to decode subject: %v", err, slog.String("v", v))
+					}
+				}
 				v = strings.ToLower(strings.TrimSpace(v))
 				if t[1].MatchString(v) {
 					continue header
@@ -2,13 +2,22 @@ package store

 import (
 	"bytes"
+	"encoding/base64"
+	"fmt"
 	"io"
+	"mime/quotedprintable"
+	"regexp"
 	"strings"
 	"unicode"
 	"unicode/utf8"

 	"github.com/mjl-/mox/message"
 	"github.com/mjl-/mox/mlog"
+
+	"golang.org/x/text/encoding"
+	"golang.org/x/text/encoding/japanese"
+	encUnicode "golang.org/x/text/encoding/unicode"
+	"golang.org/x/text/transform"
 )

 // WordSearch holds context for a search, with scratch buffers to prevent

@@ -82,11 +91,26 @@ func (ws WordSearch) matchPart(log mlog.Log, p *message.Part, headerToo bool, se
 	}

 	if len(p.Parts) == 0 {
+		var tp io.Reader
 		if p.MediaType != "TEXT" {
-			// todo: for other types we could try to find a library for parsing and search in there too.
-			return false, nil
+			if p.MediaType == "MULTIPART" {
Review comment: This looks suspicious. The "if" above, for "len(p.Parts) == 0", should cause this branch to be taken only if this is not a multipart (i.e. it is a leaf part). The multipart matching should be handled by "for _, pp := range p.Parts {" below (called recursively).

Reply: I had the same thought when I looked at this code. With my multipart sample mail, len(p.Parts) == 0 evaluates to true, but I may be misunderstanding something. I'll check.
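The control flow the reviewer describes can be sketched as follows. The `part` type is a hypothetical, simplified stand-in for `message.Part` (only a media type, some text, and sub-parts), and `leafTexts` is an illustrative helper: leaves are handled in the `len(p.Parts) == 0` branch, and multipart containers are only ever handled by recursing into their sub-parts.

```go
package main

// part is a hypothetical, simplified stand-in for message.Part.
type part struct {
	MediaType string // upper-case, e.g. "TEXT", "MULTIPART", "IMAGE"
	Text      string // decoded content, only meaningful for leaf parts
	Parts     []part // sub-parts; empty for leaf parts
}

// leafTexts collects the text of all TEXT leaf parts, depth first.
func leafTexts(p part) []string {
	if len(p.Parts) == 0 {
		// Leaf part: a correctly parsed multipart never ends up here.
		if p.MediaType != "TEXT" {
			return nil
		}
		return []string{p.Text}
	}
	// Container part: recurse into each sub-part.
	var out []string
	for _, pp := range p.Parts {
		out = append(out, leafTexts(pp)...)
	}
	return out
}
```

If a message that is known to be multipart shows up with no sub-parts, that points at how the message was parsed rather than at a case the leaf branch should handle.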
+				// Decode and make io.Reader
+				// todo: avoid to load all content
+				content, err := io.ReadAll(p.RawReader())
Review comment: This would have to use p.Reader() (https://pkg.go.dev/github.com/mjl-/mox/message#Part.Reader), which should already decode the character set. If decoding doesn't work yet for the Japanese encoding, it may require changing the "wordDecoder" as mentioned earlier.

Reply: OK. Thanks.
+				if err != nil {
+					return false, err
+				}
+				tp, err = decodeMultiPart(string(content), p.GetBound())
+				if err != nil {
+					return false, err
+				}
+			} else {
+				// todo: for other types we could try to find a library for parsing and search in there too.
+				return false, nil
+			}
+		} else {
+			tp = p.ReaderUTF8OrBinary()
 		}
-		tp := p.ReaderUTF8OrBinary()
 		// todo: for html and perhaps other types, we could try to parse as text and filter on the text.
 		miss, err := ws.searchReader(log, tp, seen)
 		if miss || err != nil || ws.isQuickHit(seen) {

@@ -193,3 +217,148 @@ func toLower(buf []byte) []byte {
 	}
 	return r
 }
+
+func decodeRFC2047(encoded string) (string, error) {
+	// match e.g. =?(iso-2022-jp)?(B)?(Rnc6...)?=
+	r := regexp.MustCompile(`(?i)=\?([^?]+)\?([BQ])\?([^?]+)\?=`)
+	matches := r.FindAllStringSubmatch(encoded, -1)
+
+	if len(matches) == 0 { // no match. Looks ASCII.
+		return encoded, nil
+	}
+
+	var decodedStrings []string
+	for _, match := range matches {
+		charset := match[1]
+		encodingName := match[2]
+		encodedText := match[3]
+
+		reader, err := decodeTransferEncodeAndCharset(encodingName, charset, encodedText)
+		if err != nil {
+			return encoded, err
+		}
+
+		decodedText, err := io.ReadAll(reader)
+		if err != nil {
+			return encoded, err
+		}
+
+		decodedStrings = append(decodedStrings, string(decodedText))
+	}
+
+	// Concat multiple strings
+	return strings.Join(decodedStrings, ""), nil
+}
+
+func decodeTransferEncodeAndCharset(encodingName string, charset string, encodedText string) (io.Reader, error) {
+	decodedString, err := decodeTransferEncode(encodingName, encodedText)
+	if len(decodedString) == 0 && err != nil {
+		return nil, err
+	}
+
+	// try to decode even if unknown encoding
+	reader, err := decodeCharset(charset, decodedString)
+	if err != nil {
+		return nil, err
+	}
+	return reader, nil
+}
+
+// Decode Base64 or Quoted Printable
+func decodeTransferEncode(encodingName string, encodedText string) (string, error) {
+	// Decode Base64 or Quoted-Printable
+	var decodedBytes []byte
+	var err error
+	switch strings.ToUpper(encodingName) {
+	case "B": // Base64
+		decodedBytes, err = base64.StdEncoding.DecodeString(encodedText)
+		if err != nil {
+			return string(decodedBytes), fmt.Errorf("Base64 decode error: %w", err)
+		}
+	case "Q": // Quoted-Printable
+		decodedBytes, err = io.ReadAll(quotedprintable.NewReader(strings.NewReader(encodedText)))
+		if err != nil {
+			return string(decodedBytes), fmt.Errorf("Quoted-Printable decode error: %w", err)
+		}
+	default:
+		return encodedText, fmt.Errorf("not supported encoding: %s", encodingName)
+	}
+	return string(decodedBytes), nil
+}
+
+func decodeCharset(charset string, decodedString string) (io.Reader, error) {
+	// Select charset
+	var enc encoding.Encoding
+	switch strings.ToLower(charset) {
+	case "iso-2022-jp":
+		enc = japanese.ISO2022JP
+	case "utf-8":
+		enc = encUnicode.UTF8
+	case "us-ascii":
+		return strings.NewReader(decodedString), nil
+	default:
+		return nil, fmt.Errorf("not supported charset: %s", charset)
+	}
+
+	// Decode with charset
+	reader := transform.NewReader(strings.NewReader(decodedString), enc.NewDecoder())
+	return reader, nil
+}
+
+func decodeMultiPart(body string, boundary string) (io.Reader, error) {
+	encPattern := `Content-Transfer-Encoding:\s+(\w+)`
+	charsetPattern := `charset="((?:\w|-)+)"`
+
+	// Regexp for MIME encode type & Charset match
+	encRe, err := regexp.Compile(encPattern)
+	if err != nil {
+		return nil, fmt.Errorf("error compiling regex:%v", err)
+	}
+	charsetRe, err := regexp.Compile(charsetPattern)
+	if err != nil {
+		return nil, fmt.Errorf("error compiling regex:%v", err)
+	}
+
+	// Split by boundary
+	parts := strings.Split(body, boundary)
+	var readers []io.Reader
+
+	// Make decoded io.Readers for each part
+	for _, part := range parts {
+		part = strings.TrimSpace(part)
+		if len(part) == 0 {
+			continue
+		}
+
+		// Extract MIME header and body
+		headerBody := strings.SplitN(part, "\r\n\r\n", 2)
+		if len(headerBody) < 2 {
+			// retry
+			headerBody = strings.SplitN(part, "\n\n", 2)
+			if len(headerBody) < 2 {
+				continue
+			}
+		}
+
+		mimeHeader := headerBody[0]
+		encodedBody := headerBody[1]
+
+		// Find encode types
+		encMatches := encRe.FindStringSubmatch(mimeHeader)
+		charsetMatches := charsetRe.FindStringSubmatch(mimeHeader)
+
+		// Decode
+		if len(encMatches) > 1 && len(charsetMatches) > 1 {
+			reader, err := decodeTransferEncodeAndCharset(encMatches[1][0:1], charsetMatches[1], encodedBody)
+			if err != nil {
+				return nil, err
+			}
+			readers = append(readers, reader)
+
+		} else {
+			return nil, fmt.Errorf("failed to match encoding and charset in:\n%s", mimeHeader)
+		}
+	}
+
+	return io.MultiReader(readers...), nil
+}
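The review asks for p.Reader() instead of RawReader plus a hand-written splitter such as decodeMultiPart. Purely to illustrate why a MIME parser is preferable to strings.Split on the boundary, here is a standard-library sketch (not mox code, and not the route suggested in the review). `textParts` is a hypothetical helper; it does no character set conversion and does not recurse into nested multiparts.

```go
package main

import (
	"encoding/base64"
	"io"
	"mime/multipart"
	"strings"
)

// textParts parses raw as a multipart body with the given boundary and
// returns the transfer-decoded content of each part.
func textParts(raw string, boundary string) ([]string, error) {
	mr := multipart.NewReader(strings.NewReader(raw), boundary)
	var out []string
	for {
		p, err := mr.NextPart()
		if err == io.EOF {
			return out, nil // closing boundary reached
		}
		if err != nil {
			return nil, err
		}
		// NextPart already decodes quoted-printable transparently;
		// base64 has to be handled by the caller.
		var r io.Reader = p
		if strings.EqualFold(p.Header.Get("Content-Transfer-Encoding"), "base64") {
			r = base64.NewDecoder(base64.StdEncoding, p)
		}
		b, err := io.ReadAll(r)
		if err != nil {
			return nil, err
		}
		out = append(out, string(b))
	}
}
```

The parser takes care of boundary lines, the closing delimiter, and header/body separation, which the string-splitting approach has to approximate with TrimSpace and two SplitN attempts.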
@@ -0,0 +1,86 @@
+package store
+
+import (
+	"fmt"
+	"io"
+	"log/slog"
+	"os"
+	"strings"
+	"testing"
+
+	"github.com/mjl-/mox/message"
+	"github.com/mjl-/mox/mlog"
+)
+
+func TestSubjectMatch(t *testing.T) {
+	// Auto detect subject text encoding and decode
+
+	//log := mlog.New("search", nil)
+
+	originalSubject := `テストテキスト Abc 123...`
+	asciiSubject := "test text Abc 123..."
+
+	encodedSubjectUTF8 := `=?UTF-8?b?44OG44K544OI44OG44Kt44K544OIIEFiYyAxMjMuLi4=?=`
+	encodedSubjectISO2022 := `=?iso-2022-jp?B?GyRCJUYlOSVIJUYlLSU5JUgbKEIgQWJjIDEyMy4uLg==?=`
+	encodedSubjectUTF8 = encodedSubjectUTF8 + " \n " + encodedSubjectUTF8
+	encodedSubjectISO2022 = encodedSubjectISO2022 + " \n " + encodedSubjectISO2022
+	originalSubject = originalSubject + originalSubject
+
+	encodedTexts := map[string]string{encodedSubjectUTF8: originalSubject, encodedSubjectISO2022: originalSubject, asciiSubject: asciiSubject}
+
+	for encodedSubject, originalSubject := range encodedTexts {
+
+		// Autodetect & decode
+		decodedSubject, err := decodeRFC2047(encodedSubject)
+
+		fmt.Printf("decoded text:%s\n", decodedSubject)
+		if err != nil {
+			t.Fatalf("Decode error: %v", err)
+		}
+
+		if originalSubject != decodedSubject {
+			t.Fatalf("Decode mismatch %s != %s", originalSubject, decodedSubject)
+		}
+	}
+}
+
+func TestMultipartMailDecode(t *testing.T) {
+	log := mlog.New("search", nil)
+
+	// Load raw mail file
+	filePath := "../../data/mail_raw.txt" // multipart mail raw data
+	wordFilePath := "../../data/word.txt"
+
+	msgFile, err := os.Open(filePath)
+	if err != nil {
+		t.Fatalf("Failed to open file: %v", err)
+	}
+	defer msgFile.Close()
+
+	// load word
+	wordFile, err := os.Open(wordFilePath)
+	if err != nil {
+		t.Fatalf("Failed to open file: %v", err)
+	}
+	defer wordFile.Close()
+	tmp, err := io.ReadAll(wordFile)
+	if err != nil {
+		t.Fatalf("Failed to load search word: %v", err)
+	}
+	searchWord := strings.TrimSpace(string(tmp))
+
+	// Parse mail
+	mr := FileMsgReader([]byte{}, msgFile)
+	p, err := message.Parse(log.Logger, false, mr)
+	if err != nil {
+		t.Fatalf("parsing message for evaluating rulesets, continuing with headers %v, %s", err, slog.String("parse", ""))
+	}
+
+	// Match
+	ws := PrepareWordSearch([]string{searchWord}, []string{})
+	ok, _ := ws.MatchPart(log, &p, true)
+	if !ok {
+		t.Fatalf("Match failed %s", ws.words)
+	}
+	log.Debug("Check match", slog.String("word", string(searchWord)), slog.Bool("ok", ok))
+}
Review comment: I don't see where this is used. Was it for testing?

Reply: That may be a mistake. I will check.