slow matchfile parsing due to duplicate lines check #306

huispaty · 2023-08-08T10:40:05Z

Parsing match files takes much longer than in previous versions because of the duplicate-line check in importmatch.py. The attached test file (converted to .txt) takes more than 3 hours to load.
beethoven_op026_mv3.txt


manoskary · 2023-08-10T14:58:52Z

I wrote the duplicate-line check, but that does not sound reasonable. Are you sure the issue comes from the duplicate-line checking?
I tested without the duplicate-line validation and it still takes a long time (though I didn't wait 3 hours).

huispaty · 2023-08-10T15:48:52Z

Yes, I'm sure the issue stems from that part. Without this check, the file takes ~27 s to load (it is quite large). With the check, it takes more than 3.5 hours. I noticed this mainly because I was working on different branches, one of which had not yet been merged into develop (and thus main) and still had a version without the duplicate-line check.
My proposed solution looks like this:

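    # Read the whole file once; splitlines() drops the trailing newlines
    with open(filename) as f:
        raw_lines = f.read().splitlines()

    # The match file version is read from the first line
    version = get_version(raw_lines[0])

    # Pick the parser table that matches the file version
    from_matchline_methods = FROM_MATCHLINE_METHODSV1
    if version < Version(1, 0, 0):
        from_matchline_methods = FROM_MATCHLINE_METHODSV0

    # Drop exact duplicate lines before parsing
    # (note: set() does not preserve the original line order)
    raw_lines = list(set(raw_lines))
    parsed_lines = [
        parse_matchline(line, from_matchline_methods, version) for line in raw_lines
    ]

    # Discard lines that could not be parsed
    parsed_lines = [pl for pl in parsed_lines if pl is not None]

    mf = MatchFile(lines=parsed_lines)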

With this approach the same file takes ~25 s to load. This is currently on my local branch only; I have not pushed it yet because I would first like to address some other raised issues that also relate to match file importing.
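
One thing to keep in mind with `list(set(raw_lines))` is that a set does not preserve the order of the lines. If line order turns out to matter downstream, an order-preserving variant could look roughly like the sketch below. The helper name `drop_blank_and_duplicate_lines` is hypothetical and not part of partitura, and the sample lines are placeholders rather than real match file syntax.

```python
def drop_blank_and_duplicate_lines(raw_lines):
    """Remove blank lines and exact duplicates, keeping first-seen order.

    Hypothetical helper for illustration; not part of partitura.
    """
    # dict keys keep insertion order (guaranteed since Python 3.7),
    # so dict.fromkeys acts as an ordered set in linear time.
    unique_lines = dict.fromkeys(raw_lines)
    # Skip lines that are empty or contain only whitespace.
    return [line for line in unique_lines if line.strip()]


if __name__ == "__main__":
    sample = ["header", "a", "", "a", "  ", "b", "header"]
    print(drop_blank_and_duplicate_lines(sample))  # ['header', 'a', 'b']
```

Like the `set` version, this runs in roughly linear time in the number of lines, so it should not reintroduce the slowdown.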

huispaty added the bug Something isn't working label Aug 8, 2023

huispaty self-assigned this Aug 8, 2023

manoskary added a commit that referenced this issue Aug 10, 2023

Addresses #306

76381eb

This PR presents a solution for faster checking of empty and duplicate lines in matchfiles using numpy arrays and numpy vectorization.
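
For readers who want a feel for what a numpy-based check of empty and duplicate lines can look like, here is a minimal sketch. It is not the code from the commit above; the function name `unique_nonempty_lines` is hypothetical, and the sample lines are placeholders rather than real match file syntax.

```python
import numpy as np


def unique_nonempty_lines(raw_lines):
    """Drop blank and duplicate lines with numpy, keeping first-seen order.

    Hypothetical helper for illustration; not the implementation in the PR.
    """
    lines = np.asarray(raw_lines, dtype=str)
    # Vectorized blank check: strip whitespace, then test the remaining length.
    non_empty = np.char.str_len(np.char.strip(lines)) > 0
    lines = lines[non_empty]
    # np.unique sorts the values; return_index gives the first occurrence
    # of each unique line, so sorting those indices restores file order.
    _, first_idx = np.unique(lines, return_index=True)
    return lines[np.sort(first_idx)]


if __name__ == "__main__":
    sample = ["header", "a", "", "a", "  ", "b", "header"]
    print(unique_nonempty_lines(sample).tolist())  # ['header', 'a', 'b']
```

The deduplication here is sort-based, so it scales as O(n log n) in the number of lines rather than comparing every line against every other line.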

manoskary mentioned this issue Aug 10, 2023

Faster Matchifile Parsing and Checks. #307

Merged

manoskary linked a pull request Aug 10, 2023 that will close this issue

Faster Matchifile Parsing and Checks. #307

Merged

CarlosCancino-Chacon closed this as completed Sep 21, 2023

huispaty mentioned this issue Sep 22, 2023

PR for Release 1.4.0 #320

Merged
