slow matchfile parsing due to duplicate lines check #306

Closed
huispaty opened this issue Aug 8, 2023 · 2 comments · Fixed by #307

@huispaty
Collaborator

huispaty commented Aug 8, 2023

Parsing match files is much slower than in previous versions because of the duplicate-line check in importmatch.py. The attached test file (converted to .txt) takes more than 3 hours to load.
beethoven_op026_mv3.txt

@huispaty huispaty added the bug Something isn't working label Aug 8, 2023
@huispaty huispaty self-assigned this Aug 8, 2023
@manoskary
Member

I wrote the duplicate-line check, but that does not sound reasonable. Are you sure the issue comes from the duplicate-line checking?
I tried without the duplicate-line validation and it still takes a long time (though I didn't wait the full 3 hours).

@huispaty
Collaborator Author

Yes, I'm sure the issue stems from that part. Without the check, the file takes ~27 s to load (it is quite large). With the check, it takes more than 3.5 hours. I noticed this mainly because I was working on several branches, one of which had not yet been merged into develop (and thus main) and still had a version without the duplicate-line check.
My proposed solution looks like this:

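    # Read the whole file at once and split it into lines
    with open(filename) as f:
        raw_lines = f.read().splitlines()

    # The version is read from the first line, before any deduplication
    version = get_version(raw_lines[0])

    # Pick the line parsers that match the file's format version
    from_matchline_methods = FROM_MATCHLINE_METHODSV1
    if version < Version(1, 0, 0):
        from_matchline_methods = FROM_MATCHLINE_METHODSV0

    # Drop duplicate lines in one pass; note that set() does not
    # preserve the original line order
    raw_lines = list(set(raw_lines))
    parsed_lines = [
        parse_matchline(line, from_matchline_methods, version) for line in raw_lines
    ]

    # Discard lines that could not be parsed
    parsed_lines = [pl for pl in parsed_lines if pl is not None]

    mf = MatchFile(lines=parsed_lines)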

With this approach, the same file takes ~25 s to load. For now this lives only on my local branch. I haven't pushed it yet because I'd first like to address some other open issues that also relate to match file importing.
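One caveat with `list(set(raw_lines))` is that a set does not keep the lines in file order. If the order matters downstream, a possible variant is to deduplicate through dictionary keys, which are unique and keep insertion order in Python 3.7+. The sketch below is only an illustration: `drop_duplicate_lines` is a hypothetical helper, not a function from the library, and filtering out blank lines here is an added assumption.

```python
def drop_duplicate_lines(raw_lines):
    """Return the lines with blanks and repeats removed, in original order.

    Hypothetical helper for illustration; not part of importmatch.py.
    """
    # dict.fromkeys keeps only the first occurrence of each line and
    # preserves insertion order, in a single linear pass.
    unique_lines = dict.fromkeys(raw_lines)
    # Skip lines that are empty or contain only whitespace.
    return [line for line in unique_lines if line.strip()]


# Placeholder strings stand in for real match file lines.
example_lines = ["header", "line-1", "", "line-1", "line-2", "header"]
deduped = drop_duplicate_lines(example_lines)
```

In the snippet above this would replace the `raw_lines = list(set(raw_lines))` step. The cost stays linear in the number of lines, so the load time should be comparable to the set-based version.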

manoskary added a commit that referenced this issue Aug 10, 2023
This PR presents a solution for faster checking of empty and duplicate lines in matchfiles using numpy arrays and numpy vectorization.
@manoskary manoskary linked a pull request Aug 10, 2023 that will close this issue
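The commit message mentions NumPy arrays and vectorization for the empty and duplicate line check. The linked PR is not shown in this thread, so the following is only a rough sketch of what such a check could look like, not the actual implementation. `unique_nonempty_lines` is a hypothetical name.

```python
import numpy as np


def unique_nonempty_lines(raw_lines):
    """Drop blank and duplicate lines with NumPy, keeping file order.

    Hypothetical sketch; the real fix may differ.
    """
    lines = np.asarray(raw_lines, dtype=str)
    if lines.size == 0:
        return []
    # Vectorized test: a line counts as empty if nothing is left
    # after stripping whitespace.
    nonempty = np.char.str_len(np.char.strip(lines)) > 0
    lines = lines[nonempty]
    # np.unique returns sorted unique values; return_index gives the
    # index of the first occurrence of each one.
    _, first_idx = np.unique(lines, return_index=True)
    # Sorting those indices restores the original order.
    return lines[np.sort(first_idx)].tolist()


# Placeholder strings stand in for real match file lines.
cleaned = unique_nonempty_lines(["b", "a", "", "b", "c", "a"])
```

Because `np.unique` sorts its input, this runs in O(n log n) rather than comparing every line against every other line.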