Add tests suite #43

Serene-Arc · 2022-12-11T06:58:42Z

I thought it would be good to have this draft PR open so that we can discuss bugs that the tests reveal as I find them.

First one is a bug I've been trying to nail down for a while: @Neurrone, what is the reason that the path of the items was used to sort instead of the name? There's a comment that specifies that it needs to be that way but not why. I've made a bunch of test cases to check sorting under different naming schemes and the paths mess up, while the names do not (so far).

If you remember the use cases that necessitate using the path, could you please describe them so I can add them to the tests? It's possible that a more complex sorting algorithm will be required, but it needs tests before I can start working on one.

Serene-Arc · 2022-12-12T04:53:41Z

Also, question: this plugin seems to have a different mechanism of matching tracks than the normal beets algorithm, like the methods for importing them are different. Is this intentional? It's very likely that I'm not understanding the whole process but there are issues that this impacts.

Consider the ordering issue; if beets can normally order the tracks according to the returned album when searching (either on MB or Audible), then that might be the solution to the ordering issue for the tracks, but that doesn't happen, and instead there is the natsort that fails in a lot of cases.

Neurrone · 2022-12-13T05:42:44Z

Apologies for the delayed response.

I am trying to remember the reason for this.

I believe it was because I found it unreliable to rely on the track titles and it was usually far more accurate to rely on the file names to natsort them to get the correct order of the tracks. Otherwise, I would have stuck to using the track titles, per default Beets behaviour.

If the ordering of the files on disk that Beets detects doesn't match the data source from Audible, then Beets would try to assign files to the wrong chapter titles from Audible.

Serene-Arc · 2022-12-13T05:59:38Z

So I did some more testing and reimporting a bunch of books into my library to find those edge cases. Natural sorting works well for a lot of cases but not when bytes are used and not when there are different naming schemes (e.g. Chapter 1-10 and then a chapter named Prologue; the prologue will be placed at the end because P comes after C).
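To illustrate the failure mode described above, here is a minimal sketch of a natural sort key built only from the standard library. This is an assumption for illustration: the plugin itself uses natsort, and `natural_key` is a hypothetical helper, not code from the PR.

```python
import re

def natural_key(name: str):
    """Split a name into text and integer chunks so that '10' sorts after '2'."""
    parts = re.split(r"(\d+)", name)
    return [int(p) if p.isdigit() else p.lower() for p in parts]

titles = ["Chapter 10", "Prologue", "Chapter 2", "Chapter 1"]
ordered = sorted(titles, key=natural_key)
print(ordered)
```

The numbered chapters come out in the right order, but "Prologue" lands last because "p" sorts after "c", which is the mixed-naming problem mentioned above.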

I'm not sure if there's any good way to sort this with a one-size-fits-all approach given the sheer breadth of possible track namings. I would like to try playing around with the track matching and distance algorithms, with your permission, to try and get that working correctly instead of this system with natsorting. There are a couple of reasons for this:

1. I have had issues where the track sorting still fails and it renames chapters inappropriately, so this will have to be addressed anyway at some point.
2. The Audible response is the master record of the chapters, and known to be correct. If we can match to that then we should; if we can't then natural sorting or a more complex algorithm can be a fallback.
3. This is somewhat at odds with the rest of beets and their methodology, so intrinsically this method will not benefit from any further improvements that beets makes to their core functions.

It's up to you if you'd like to take this approach but right now, the tests I've got don't seem to have an immediate answer to get them working. They fail with the old method and they fail with a natural sort on the titles too.

Neurrone · 2022-12-13T06:34:32Z

I'd be happy to use a better solution.

I used the Natsort method on paths because it seemed to be the easiest way for it to work for most cases. Part of the trade-off was accepting it would never work all the time.

Serene-Arc · 2022-12-17T11:41:42Z

Ignore all the commits here; I need to upload them to GitHub to test them on my system. I'm working out the kinks with a matching method, with some promising results! We'll see how it goes.

Serene-Arc · 2023-06-01T04:41:58Z

It is finally complete! I have created a series of algorithms that allow for most books to be matched and which also allows for extension in the future. It is completely configurable and comes with tests that cover the entire chapter matching process along with a couple of other tests that cover the things that are likely to fail.

What do you think @Neurrone?

Neurrone · 2023-06-05T06:38:12Z

Thanks for working on this. Given this is a large change, I'll probably only be able to take a look at this in a week or two.

Serene-Arc · 2023-06-05T08:33:27Z

Sure, no rush.

Neurrone

Thanks for the hard work.

Did an initial pass to understand what was done. Here are some initial questions.

Neurrone · 2023-07-01T05:50:36Z

README.md

-   scrub:
-     auto: yes # optional, enabling this is personal preference
-   ```
+```yaml


Could you re-indent this so that the code is correctly rendered as a code block as part of the list item? Also helps with reducing the size of the diff so that only the significant changes made show up.

Neurrone · 2023-07-01T06:41:30Z

beetsplug/audible.py

@@ -104,11 +248,117 @@ def __init__(self):
        self.add_media_field("subtitle", subtitle)

    def album_distance(self, items, album_info, mapping):
-        dist = get_distance(data_source=self.data_source, info=album_info, config=self.config)
+        dist = beets.autotag.hooks.Distance()


What is the difference between this function and the previous one that was used?

It seems to just ignore the source weight set in config.

Neurrone · 2023-07-01T06:46:40Z

beetsplug/audible.py

        return dist

    def track_distance(self, item, track_info):
-        return get_distance(data_source=self.data_source, info=track_info, config=self.config)
+        dist = beets.autotag.hooks.Distance()
+        dist.add_string("track_title", item.title, track_info.title)


Curious why this is needed? I'd assume that the default distance algorithm already factors the distance between titles.

Serene-Arc · 2023-07-02

This is the better way to do it according to the beets source code when constructing a custom distance. I didn't actually do that in the end because it would have taken a while, but when I said we could use a customised Levenshtein algorithm, this is where it would be called.

Neurrone · 2023-07-01T06:49:35Z

beetsplug/audible.py

+
+
+def is_continuous_number_series(numbers: Iterable[Optional[int]]):
+    return all([n is not None for n in numbers]) and all(b - a == 1 for a, b in zip(numbers, numbers[1:]))


Why does this need to accept a possibly blank list of None?

Serene-Arc · 2023-07-02

This typing means that there is an iterable which may have any combination of None or integers. This is because, when the algorithm that calls this code runs, any file with no source numbering yields None. If that happens, the series is rejected, as there is no way to know where the unnumbered file will fit into the numbering.
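A minimal sketch of the behaviour described above, for reference. It is a variation on the function in the diff, not the PR's code: it annotates the argument as a `Sequence` (an assumption on my part, since slicing with `numbers[1:]` needs a sequence rather than an arbitrary iterable) and rejects any series containing None before comparing neighbours.

```python
from typing import Optional, Sequence

def is_continuous_number_series(numbers: Sequence[Optional[int]]) -> bool:
    """Return True only if every entry is an int and each is one more than the previous."""
    if any(n is None for n in numbers):
        # A file without source numbering cannot be placed in the series.
        return False
    return all(b - a == 1 for a, b in zip(numbers, numbers[1:]))

print(is_continuous_number_series([1, 2, 3]))
print(is_continuous_number_series([1, None, 3]))
print(is_continuous_number_series([1, 3, 4]))
```

Note that an empty sequence passes vacuously in this sketch, as it does in the one-line version in the diff.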

Neurrone · 2023-07-01T06:57:20Z

beetsplug/audible.py

+        dist.add_string("track_title", item.title, track_info.title)
+        return dist
+
+    def attempt_match_trust_source_numbering(self, items: List[Item], album: AlbumInfo) -> Optional[List[Item]]:


Could you remove the unused album parameter?

Likewise for the other attempt match functions below.

Serene-Arc · 2023-07-02

No. All of the functions need to have the exact same signature to work in the given code. It doesn't call any function explicitly but reads the configuration and then matches it to the appropriate function. This should also make it easy to add new methods.
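The shared-signature design can be sketched roughly as follows. Everything here is hypothetical and simplified for illustration: the strategy names, the `STRATEGIES` registry and `run_strategies` are not taken from the plugin, and plain strings and dicts stand in for the beets `Item` and `AlbumInfo` objects.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-ins for beets Item and AlbumInfo objects.
Items = List[str]
Album = Dict[str, List[str]]
Strategy = Callable[[Items, Album], Optional[Items]]

def match_single_item(items: Items, album: Album) -> Optional[Items]:
    # Uses only `items`; `album` is accepted to keep the shared signature.
    return items if len(items) == 1 else None

def match_same_length_sorted(items: Items, album: Album) -> Optional[Items]:
    # Uses both arguments: only applies when the counts agree.
    return sorted(items) if len(items) == len(album["tracks"]) else None

STRATEGIES: Dict[str, Strategy] = {
    "single_item": match_single_item,
    "same_length_sorted": match_same_length_sorted,
}

def run_strategies(names: List[str], items: Items, album: Album) -> Optional[Items]:
    """Try each configured strategy in order and return the first match."""
    for name in names:
        strategy = STRATEGIES.get(name)
        if strategy is None:
            continue  # unknown name in the configuration: skip it
        result = strategy(items, album)
        if result is not None:
            return result
    return None

album = {"tracks": ["Chapter 1", "Chapter 2"]}
result = run_strategies(["typo", "single_item", "same_length_sorted"], ["b.mp3", "a.mp3"], album)
print(result)
```

Because the dispatcher calls every strategy the same way, a strategy that ignores `album` still has to accept it; adding a new method is then a matter of registering one more function.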

Neurrone · 2023-07-01T07:01:28Z

beetsplug/audible.py

+        # if len(items) > len(album.tracks):
+        #     # TODO: find a better way to handle this
+        #     # right now just reject this match
+        #     return None


Since this is commented out, what happens in this situation? None should still be returned but the item would fail to match.

I believe this case is currently being handled correctly in the existing matching algorithm.

Neurrone · 2023-07-08T09:27:12Z

beetsplug/audible.py

+        """If the input album has a single item, use that; if the album also has a single item, prefer that."""
+        if len(items) == 1:
+            # Prefer a single named book from the remote source
+            if len(album.tracks) == 1 and album.tracks[0].title != "Chapter 1":


Is the extra condition on chapter 1 needed? It seems safe enough to trust it if the audible source has a single track.

Would prefer removing this check if it's not needed to simplify the logic.

Neurrone · 2023-07-08T09:42:21Z

beetsplug/audible.py

+                matches = sorted_tracks
+            return matches
+
+    def attempt_match_starting_numbers(self, items: List[Item], album: AlbumInfo) -> Optional[List[Item]]:


I'm having some trouble understanding what this is trying to do, could you help walk me through this?

In what situations does this algorithm help? I.e, are there cases where this algorithm is better than the next one down the list (natural sorting) which is much simpler?

Neurrone · 2023-07-08T09:46:28Z

beetsplug/audible.py

+        # magic number here, it's a judgement call
+        if max(average_title_change) < 4:
+            # can't assume that the tracks actually match even when there are the same number of items, since lengths
+            # can be different e.g. an even split into n parts that aren't necessarily chapter-based so just natsort


In what situations is the extra logic to find and strip out common prefixes or suffixes needed, vs the previous method that just did a natural sort on the titles?

Neurrone · 2023-07-08T10:01:20Z

I've finished looking at the implementation but haven't gone through the tests in detail yet.

Really appreciate the concept of a list of strategies that you introduced, since it does simplify the overall logic and allows customization.

My main concern is the complexity of some of the algorithms relative to how the matching was done previously. The previous implementation used the file input if the number of tracks didn't match the album info from Audible. I suspect that trying to perform matches in these cases (e.g, chapter_levenshtein) is hard to debug and may not work properly.

What do you think of restoring that logic and having a simpler default list? So something like:

1. source_numbering
2. Instead of having a special case for matching on a single file, replace it with a strategy that replaces the info from Audible with the input list if the number of chapters don't match
3. natural_sort

Serene-Arc · 2023-07-09T00:26:48Z

Complexity is required for complex cases. The simple fact is that the old method for doing this didn't cope well with the varied track titles that I found and used personally. Simple methods don't work when the data input isn't simple.

> The previous implementation used the file input if the number of tracks didn't match the album info from Audible.

To address your point, using the file input doesn't always work. It doesn't work when you don't have source numbering, which is a lot of audiobooks in my experience. Even if the number of tracks do match, that's not actually a guarantee that the actual tracks match. Consider the difference between a book with 20 chapters on Audible, and the actual audiobook which has been split into 20 tracks of the same size. There are 20 tracks in both cases, but using the Audible method is objectively wrong and won't match the actual book.

The goal of the many different measures is that each person can change the ordering based on what audiobooks they're importing at the time. If you're importing a book that you know Levenshtein will work well at, you can make it the primary method that the plugin uses.

> Instead of having a special case for matching on a single file, replace it with a strategy that replaces the info from Audible with the input list if the number of chapters don't match

This is the Levenshtein function, but it is very bad at doing this. Consider that you don't know which chapters match up, because there might not be any ordering. Any numbers in the title are ignored (they're considered in the natural sort) but you don't know which matches to what track from Audible. You don't know if any of them match at all. What the program does is calculate all of the distances from every track in the files to every track from Audible. Then, it matches the closest match with the file and overwrites the title. This isn't actually guaranteed to be a close match. If there's a single letter in common, then the closest match will be that, but it's not guaranteed to be accurate or even relevant.

> I suspect that trying to perform matches in these cases (e.g, chapter_levenshtein) is hard to debug and may not work properly.

Your concern over the Levenshtein is a little strange to me. This is the method that was used before. The Levenshtein function is what was used before when you called the track distance function. I just made it explicit so that we could improve on it. It's the last in the list because it's the least effective and most prone to errors. That is the old method.

Neurrone · 2023-07-09T14:15:18Z

> the old method for doing this didn't cope well with the varied track titles that I found and used personally. Simple methods don't work when the data input isn't simple.

I understand, I'm just trying to get a feel for what examples in the wild would work better with these algorithms. I'll admit the inputs that I have are generally quite simple - either they match on a chapter to chapter basis, or they don't.

> Even if the number of tracks do match, that's not actually a guarantee that the actual tracks match. Consider the difference between a book with 20 chapters on audible, and the actual audiobook which has been split into 20 tracks of the same size. There are 20 tracks in both cases, but using the audible method is objectively wrong and won't match the actual book.

This should already be handled currently. If the number of tracks matches, the previous method wouldn't override the match from Audible and it falls back to the default matching, so you'd see that the matches fail if the lengths are off for example.

> This is the Levenshtein function, but it is very bad at doing this. Consider that you don't know which chapters match up, because there might not be any ordering. Any numbers in the title are ignored

What I think is missing here is the old behaviour where if the folder has 2 files while Audible returns 20 chapters, I want to have the ability to completely ignore the contents from Audible, since I know it's not split by chapter. It is meant to prevent the poor results from falling back to the default Levenshtein function.

> Your concern over the Levenshtein is a little strange to me. This is the method that was used before. The Levenshtein function is what was used before when you called the track distance function.

I didn't realize that, I'll take a look at the Beets source to understand this better.

Serene-Arc · 2023-07-10T02:18:58Z

The tests that I've included are real world examples. I got a bunch of different audiobooks (hundreds) and started importing them. When I ran into issues, I added a test for the condition and worked to make sure that they passed correctly. That's why the tests try the correct ordering, then reversing, then randomisation, to make sure that they come up with the correct answer every time.
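The ordering checks described above could look something like this sketch. The names are hypothetical and the function under test is a toy stand-in that sorts by the first integer in each title; the PR's actual tests are not reproduced here.

```python
import random
import re

def sort_by_first_number(titles):
    """Toy function under test: orders titles by the first integer they contain."""
    return sorted(titles, key=lambda t: int(re.search(r"\d+", t).group()))

def check_order_invariant(sort_function, expected):
    """Assert the function recovers `expected` from in-order, reversed and shuffled input."""
    shuffled = list(expected)
    random.Random(0).shuffle(shuffled)  # fixed seed keeps the check reproducible
    for candidate in (list(expected), list(reversed(expected)), shuffled):
        assert sort_function(candidate) == expected

check_order_invariant(sort_by_first_number, ["Part 1", "Part 2", "Part 10"])
```

Feeding the same titles in several orders guards against an algorithm that only appears to work because the input happened to arrive sorted.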

> This should already be handled currently. If the number of tracks matches, the previous method wouldn't override the match from Audible and it falls back to the default matching, so you'd see that the matches fail if the lengths are off for example.

I don't think it did this? The old method was to match if the number of tracks were the same, regardless of length. That wasn't part of the calculation.

> What I think is missing here is the old behaviour where if the folder has 2 files while Audible returns 20 chapters, I want to have the ability to completely ignore the contents from Audible, since I know it's not split by chapter. It is meant to prevent the poor results from falling back to the default Levenshtein function.

You can! This would be the natural sort, source numbering, or chapter numbering algorithm. The order in which the algorithms are listed in the configuration is the order in which they're tried. If you have two files that are numbered Part 1 and Part 2, then the natsort function will work, as will the 'starting number' one. Actually, my code mostly takes only the metadata for the album as a whole, i.e. the data taken from Audible is rarely, if ever, the track data, because that's really hard to match. Instead it's all the album-level data, like author, ID, narrator, etc.

The Levenshtein algorithm is only called, with the default configuration I've provided, if the following conditions are met:

- All of the chapter titles are very different (have a Levenshtein difference more than 4)
- There is no contiguous source matching
- There are no common affixes to the titles that, when removed, result in a number e.g. 'Chapter 1', 'Chapter 2', etc
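The affix condition in the list above can be sketched as follows. This is an illustrative approximation, not the PR's code: `numbers_after_stripping_affixes` is a hypothetical helper that removes the prefix and suffix shared by all titles and checks whether only integers remain.

```python
import os
from typing import List, Optional

DIGITS = "0123456789"

def numbers_after_stripping_affixes(titles: List[str]) -> Optional[List[int]]:
    """Strip the prefix and suffix shared by all titles; return the leftover integers, or None."""
    # Trim trailing digits so 'Chapter 10' and 'Chapter 11' do not lose their leading '1'.
    prefix = os.path.commonprefix(titles).rstrip(DIGITS)
    remainders = [t[len(prefix):] for t in titles]
    # The common suffix is the common prefix of the reversed strings.
    suffix = os.path.commonprefix([r[::-1] for r in remainders])[::-1].lstrip(DIGITS)
    cores = [r[: len(r) - len(suffix)] for r in remainders]
    if all(core.isdigit() for core in cores):
        return [int(core) for core in cores]
    return None

print(numbers_after_stripping_affixes(["Chapter 1", "Chapter 2", "Chapter 10"]))
print(numbers_after_stripping_affixes(["01 - Book.mp3", "02 - Book.mp3"]))
print(numbers_after_stripping_affixes(["Prologue", "Chapter 1"]))
```

When this returns a list, the numbers give an ordering directly; when it returns None (as for the mixed "Prologue" case), a later strategy has to take over.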

The Levenshtein option, which is basically the track distance, is the last option that is called because it is the most prone to errors. But again, that is the method that the beets library itself uses when you call the track_distance function.

The reason I made it explicit is because, when using the Levenshtein, there are certain replacements that lead to greater errors, particularly numbers. If the Levenshtein function is replacing a 2 with a 4 for example, that's almost certainly a bigger error than a space to a dash or whatever. My idea was to create a custom Levenshtein function that penalises number replacements more, which would make it more accurate imo, but I haven't actually done that yet because it's already a method of last resort.
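A minimal sketch of the idea, under the assumption that only digit-for-digit substitutions are penalised. `weighted_levenshtein` and `digit_penalty` are hypothetical names, and this has not been implemented in the plugin.

```python
def weighted_levenshtein(a: str, b: str, digit_penalty: int = 2) -> int:
    """Edit distance where replacing one digit with a different digit costs `digit_penalty`."""

    def substitution_cost(x: str, y: str) -> int:
        if x == y:
            return 0
        return digit_penalty if x.isdigit() and y.isdigit() else 1

    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,       # delete char_a
                current[j - 1] + 1,    # insert char_b
                previous[j - 1] + substitution_cost(char_a, char_b),
            ))
        previous = current
    return previous[-1]

print(weighted_levenshtein("Chapter 2", "Chapter 4"))  # digit replaced
print(weighted_levenshtein("Chapter 2", "Chapter-2"))  # separator replaced
```

With these costs, swapping the chapter number is twice as expensive as swapping a space for a dash. One caveat of this sketch: because a deletion plus an insertion costs 2, raising `digit_penalty` above 2 has no further effect unless digit insertions and deletions are also made more expensive.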

Serene-Arc · 2023-09-12T09:25:38Z

I do apologise, I'm quite busy in my semester and it's hard finding the spare time to come back to this. I'm trying to knock some things off.

With regards to the source_weight option, which you mentioned earlier, do you know what it's for? Because looking through the code, I get the impression that there's no reason to have this exposed to the user at all, based on how I think you intend for this plugin to be used.

By that, I mean that the option is meant to rank different sources compared to Musicbrainz. Having it be zero means that it is compared equally to Musicbrainz. But the thing is, there will NEVER be an audiobook match from Musicbrainz; for audiobooks, it should always be zero.

If this plugin is used for a separate library (meaning that a beets instance with this plugin will never be used for music), then there's never a reason for this to be anything other than zero. In that case, I think it should be removed from the configuration and we should permanently set it to zero in the code.

Do you have a reason for it to be exposed to the user through the configuration? And do you expect for this to be used in the same library as music?

Neurrone · 2023-09-24T08:49:06Z

Np.

Agree that source weight shouldn't be exposed, since the goal is to ignore Musicbrainz.

Serene-Arc · 2023-09-26T00:24:51Z

Great, I removed that from the example config in the README. Do you have any other concerns about the new code?

Serene-Arc force-pushed the test_suite branch from 513994e to 55a1ba8 Compare February 16, 2023 04:00

Serene-Arc force-pushed the test_suite branch from faf4e1a to ee97b42 Compare May 22, 2023 12:28

Serene-Arc marked this pull request as ready for review June 1, 2023 04:27

Serene-Arc force-pushed the test_suite branch from dbf5e38 to 05fcf4f Compare June 1, 2023 04:33

Neurrone reviewed Jul 1, 2023

Neurrone reviewed Jul 8, 2023

Serene-Arc added 10 commits September 26, 2023 10:23

Add tests for chapter sorting

7303a8d

Add test for Audnex API call

e6588c1

Fix tests

0b49b90

Remove import

1bc63fd

Fix reference name

32e140a

Add additional test

5a8dbe6

Fix import

2d90c5f

Add test cases

94cce6c

Add typing for method

0205ebd

Add alternative method of sorting chapters

cb9c744

Serene-Arc and others added 28 commits September 26, 2023 10:23

Add comment test markers for convienience

8580b1b

Fix bug with ordering of trusted source orderings

9fc0a54

Refactor methods out

60d9722

Rename function

a25e599

Add comments for methods

85c1732

Rename readme to follow convention

70afac8

Re-indent YAML

11317e5

Switch to using configuration to determine chapter algorithm

ac55b38

Update pyproject with new README name

41a3c02

Continue on wrong algorithm specification

901cb29

Fix bug with options

191aba8

Refactor out methods

d058810

Add more tests

d60d524

Fix bug with zero indexed tracks

a338dcd

Catch error in levenshtein function

71680fc

Update README with information

37a9598

Reformat table

b6c83b6

Remove old test case

678d143

Fix indent

6611bfc

Reformat according to black

9938c81

Remove old tests

f6b6f99

Fix bug where a single file is named 'chapter 1'

b4dbd78

This isn't really a bug with the audible plugin, but something that audnex returns. I don't really know why but since it could be misleading this ignores those cases.

Add narrator field for albums

563e83f

Remove old option

2c0fd3b

Add requirement for testing

e133b5e

Fix issue with unused parameters

925960b

Fix album property

4b9fdaf

Remove source weight from example doc

5aecb02

Serene-Arc force-pushed the test_suite branch from 45f5e35 to 5aecb02 Compare September 26, 2023 00:24
