Optimize the nameMap of ZipFile #378
base: master
Conversation
Codecov Report

Attention:

Additional details and impacted files

@@             Coverage Diff              @@
##             master     #378      +/-   ##
============================================
- Coverage     80.55%   80.53%    -0.03%
- Complexity     6764     6766        +2
============================================
  Files           342      342
  Lines         24860    24874       +14
  Branches       4017     4022        +5
============================================
+ Hits          20027    20033        +6
- Misses         3298     3302        +4
- Partials       1535     1539        +4

☔ View full report in Codecov by Sentry.
If I understand it right, instead of having a reverse map with the names and a list of duplicated entries, here we would have a new separate map to keep the duplicates. In case there are no entries with the same name, that data structure would not be used, and it would perform a little better. getEntry will definitely be faster.

"For most zip files..." I think that is correct for most cases. +0, as I am not using the component with a large-ish project where I could test it.

Thanks!
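Roughly, the idea in code - a simplified sketch based on this discussion, not the exact PR diff (class and method names here are illustrative only):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;

// Sketch of the two-map idea: the common case touches only nameMap;
// duplicateNameMap is allocated lazily and only on a name collision.
class NameMaps {
    private final Map<String, ZipArchiveEntry> nameMap = new HashMap<>();
    private Map<String, List<ZipArchiveEntry>> duplicateNameMap;

    void addEntry(final ZipArchiveEntry entry) {
        final String name = entry.getName();
        final ZipArchiveEntry first = nameMap.putIfAbsent(name, entry);
        if (first == null) {
            return; // unique name so far: no list allocated at all
        }
        // Rare path: a second entry with this name showed up.
        if (duplicateNameMap == null) {
            duplicateNameMap = new HashMap<>();
        }
        duplicateNameMap.computeIfAbsent(name, k -> {
            final List<ZipArchiveEntry> list = new ArrayList<>(2);
            list.add(first);
            return list;
        }).add(entry);
    }
}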
No tests. Code coverage is down. No proof of performance change one way or another. Non-trivial additional complexity. So -1 as it stands. "For extremely rare situations, this PR may have slight negative effects," Which are? "But I think this should not be a problem." Because?
The core intention of this PR is to reduce memory allocation. As mentioned earlier, we currently allocate a LinkedList for each entry in ZipFile, resulting in an additional 32 (LinkedList) + 24 (LinkedList$Node) bytes of memory overhead. This PR no longer allocates them. As for the side effect - a performance improvement - these are the results of a simple JMH benchmark:

Benchmark source code
import java.io.IOException;
import java.nio.file.Path;
import java.util.Collections;

import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipFile;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

public class Compress {
    private static final Path jarPath = Path.of(/* path of commons-compress-1.23.0.jar */);
    private static final ZipFile globalFile;
    private static final String[] entryNames;

    static {
        try {
            globalFile = new ZipFile(jarPath);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        entryNames = Collections.list(globalFile.getEntries())
                .stream()
                .map(ZipArchiveEntry::getName)
                .toArray(String[]::new);
    }

    @Benchmark
    public void testOpen() throws IOException {
        // Measures parsing the central directory and building the name maps.
        try (ZipFile file = new ZipFile(jarPath)) {
        }
    }

    @Benchmark
    public void testGetEntry(Blackhole blackhole) throws IOException {
        for (String entryName : entryNames) {
            blackhole.consume(globalFile.getEntry(entryName));
        }
    }

    @Benchmark
    public void testGetEntries(Blackhole blackhole) throws IOException {
        for (String entryName : entryNames) {
            blackhole.consume(globalFile.getEntries(entryName));
        }
    }

    @Benchmark
    public void testGetEntriesInPhysicalOrder(Blackhole blackhole) throws IOException {
        for (String entryName : entryNames) {
            blackhole.consume(globalFile.getEntriesInPhysicalOrder(entryName));
        }
    }
}

commons-compress 1.23.0:
This PR:
For the same jar file (commons-compress-2.13.0.jar):
The only problem is getEntries(String), and I don't think it matters, because this situation is too rare. As far as I know, most tools (7zip, the jar command, etc.) don't create zip files containing entries with duplicate names. Such unusual zip files can generally only be created with some low-level API, so these files themselves are very rare; I can't even easily find examples in the real world. And even for such an unusual file, it's debatable whether this PR has a positive or negative effect. If the majority of entries do not share a name, this PR still has a positive effect. If there are a large number of entries with the same name, the effect of this PR may differ under different conditions. But I don't think this is a case worth digging into, as I can't imagine a real-world use case for it. If you know of a real use case for entries with duplicate names, I'll look into it further. For now, I don't see much point in worrying about this situation, since regular zip files are far, far more numerous.
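For what it's worth, such a file is easy to produce with Commons Compress's own low-level writer, since ZipArchiveOutputStream (unlike java.util.zip.ZipOutputStream, which rejects duplicates with a ZipException) does not check entry names. A sketch; the output path and entry name are arbitrary:

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream;

public class DuplicateNames {
    public static void main(String[] args) throws IOException {
        try (ZipArchiveOutputStream out = new ZipArchiveOutputStream(new File("dup.zip"))) {
            // Both entries are written under the exact same name.
            for (String content : new String[] { "first", "second" }) {
                out.putArchiveEntry(new ZipArchiveEntry("a.txt"));
                out.write(content.getBytes(StandardCharsets.UTF_8));
                out.closeArchiveEntry();
            }
        }
    }
}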
List<ZipArchiveEntry> entriesOfThatName = duplicateNameMap.computeIfAbsent(name, k -> {
    ArrayList<ZipArchiveEntry> list = new ArrayList<>(2);
Why 2?
This list is only created once there are two entries with the same name. If no more entries with that name are encountered later, using 2 as the initial capacity avoids unnecessary memory footprint.
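For example, with compressed oops (using the per-object sizes from the footprint breakdown in the next comment): new ArrayList<>(2) holding two elements costs about 24 + 16 + 2*4 = 48 bytes, while a default-capacity ArrayList grows its backing array to 10 slots on the first add and costs about 24 + 16 + 10*4 = 80 bytes.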
The reason for using ArrayList instead of LinkedList here is that ArrayList is almost always better than LinkedList.

In general, we can calculate the memory footprint of a list with N elements like this:
- LinkedList: 24 + 24*N (one LinkedList$Node per element)
- ArrayList, minimal (internal array is fully used, no wasted space): 24 (ArrayList) + 16 (array object header) + 4*N
- ArrayList, maximum (internal array has just grown, so the reserved space is largest): 24 (ArrayList) + 16 (array object header) + 6*N

If the list has only two elements, an ArrayList with 2 as the initial capacity saves 24 bytes of memory compared to a LinkedList. As the list grows, the memory footprint of ArrayList is usually only 1/6 to 1/4 of that of LinkedList. The combined memory footprint of all objects allocated during ArrayList construction (including the small temporary arrays discarded while growing) is smaller than the long-term memory footprint of LinkedList, so using ArrayList is more memory efficient.

For smaller lists, iterating over a LinkedList might be faster, but ArrayList has better data locality. So we usually have no reason to use LinkedList; ArrayList/ArrayDeque are the better choices.
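If it helps, these numbers can be checked empirically with OpenJDK's JOL tool; a sketch, assuming the org.openjdk.jol:jol-core dependency is on the classpath:

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import org.openjdk.jol.info.GraphLayout;

public class ListFootprint {
    public static void main(String[] args) {
        Object a = new Object();
        Object b = new Object();
        List<Object> linked = new LinkedList<>();
        List<Object> array = new ArrayList<>(2);
        linked.add(a);
        linked.add(b);
        array.add(a);
        array.add(b);
        // totalSize() walks the whole object graph, so it includes the two
        // shared elements (16 bytes each); subtract them to compare the
        // lists' own overhead.
        System.out.println("LinkedList: " + GraphLayout.parseInstance(linked).totalSize() + " bytes");
        System.out.println("ArrayList:  " + GraphLayout.parseInstance(array).totalSize() + " bytes");
    }
}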
Can someone review this PR again?

@garydgregory Can you review this PR again?

# Conflicts:
#	src/main/java/org/apache/commons/compress/archivers/zip/ZipFile.java

@garydgregory Hey, can you take a look at this PR?

Looks OK to me, though claims of performance improvements really need measurements and benchmarks.

Here are some JMH benchmark results: #378 (comment)

But where is the benchmark? You should make it part of this PR so that anyone can run it.
Made a couple of (mostly nitpicky) comments on the diffs.
Regarding the benchmarks: it might be nice to have the benchmarks as part of the testsuite, but there are currently no benchmarks in compress. Might be useful as part of another MR?
@@ -368,9 +369,14 @@ public static void closeQuietly(final ZipFile zipFile) {
     private final List<ZipArchiveEntry> entries = new LinkedList<>();

     /**
-     * Maps String to list of ZipArchiveEntrys, name -> actual entries.
+     * Maps a string to the first entry named it.
Nitpicking: "Maps a string to the first entry by that name" sounds clearer to me.
I'm not good at English so thank you very much for pointing out the poor grammar.
@@ -820,7 +839,13 @@ public Enumeration<ZipArchiveEntry> getEntries() {
      * @since 1.6
      */
     public Iterable<ZipArchiveEntry> getEntries(final String name) {
-        return nameMap.getOrDefault(name, ZipArchiveEntry.EMPTY_LINKED_LIST);
You mentioned a potential 30% slowdown in this method in return for additional safety, which may or may not be a fair tradeoff. But I'm a bit concerned that this change potentially breaks the current behaviour. As you say, this currently returns a modifiable LinkedList, which some people may (though I hope not) depend on? Because it requires an explicit cast, I'm not too worried.
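For illustration, a hypothetical caller relying on the old return type would look like this; it compiles either way, but the cast fails at runtime once getEntries no longer returns the internal LinkedList:

import java.util.LinkedList;
import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipFile;

class OldBehaviourCaller {
    // Hypothetical code: this cast only worked because getEntries(name)
    // used to return ZipFile's live, modifiable LinkedList.
    static void dropFirst(ZipFile zipFile, String name) {
        LinkedList<ZipArchiveEntry> list =
                (LinkedList<ZipArchiveEntry>) zipFile.getEntries(name);
        list.removeFirst(); // mutated ZipFile's internal state; throws ClassCastException after this change
    }
}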
I think the slowdown only occurs when there are duplicate entries, which is extremely unlikely. I'd still simplify the code, as per the earlier comment.
-    private final Map<String, LinkedList<ZipArchiveEntry>> nameMap = new HashMap<>(HASH_SIZE);
+    private final Map<String, ZipArchiveEntry> nameMap = new HashMap<>(HASH_SIZE);
+
+    /**
+     * If multiple entries have the same name, maps the name to entries by that name.
+     */
+    private Map<String, List<ZipArchiveEntry>> duplicateNameMap;
Given that this is extremely unlikely, would it make sense to keep the original map, change the value type to Object, and let it hold either a ZipArchiveEntry or the List? It would complicate the lookup, but that would be confined to a single place.
I REALLY don't like that. When you look at the pack200 code, which is written at the pre-generics Java version level, and some of its lists, you can't tell if some of the lists are buggy, sloppy, or trying to be clever. That, or the lack of generics, lets you get away with not having fully thought through the hierarchy of types you put in those lists.
That's true, such optimizations typically come at the cost of readability and safety. However, keeping the amount of code touching the data structure limited may help - we would need only three methods - first-by-name, all-by-name and add-entry - and document the field as not to be touched except via those three methods :-)

Just a suggestion if it comes to true optimization, given this PR is purely about improving time and space performance. Note that getEntries(name) is not slower only because of wrapping into a Collection, but also because it performs a double lookup.
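A sketch of that suggestion (class, method, and helper names here are hypothetical; the unchecked cast is confined to one helper):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;

// Single map whose value is either a ZipArchiveEntry or, rarely, a
// List<ZipArchiveEntry>. Only these three methods may touch it.
class NameIndex {
    private final Map<String, Object> nameMap = new HashMap<>();

    void addEntry(ZipArchiveEntry entry) {
        Object existing = nameMap.putIfAbsent(entry.getName(), entry);
        if (existing == null) {
            return; // common case: first entry by that name
        }
        if (existing instanceof List) {
            asList(existing).add(entry);
        } else {
            // Second entry by this name: switch the value to a list.
            List<ZipArchiveEntry> list = new ArrayList<>(2);
            list.add((ZipArchiveEntry) existing);
            list.add(entry);
            nameMap.put(entry.getName(), list);
        }
    }

    ZipArchiveEntry firstByName(String name) {
        Object value = nameMap.get(name);
        return value instanceof List ? asList(value).get(0) : (ZipArchiveEntry) value;
    }

    List<ZipArchiveEntry> allByName(String name) {
        Object value = nameMap.get(name);
        if (value == null) {
            return Collections.emptyList();
        }
        if (value instanceof List) {
            return Collections.unmodifiableList(asList(value));
        }
        return Collections.singletonList((ZipArchiveEntry) value);
    }

    @SuppressWarnings("unchecked")
    private static List<ZipArchiveEntry> asList(Object value) {
        return (List<ZipArchiveEntry>) value;
    }
}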
Hello All
For most regular zip files, all of their entries will have different names.
The current implementation of ZipFile maps each name to a LinkedList in the nameMap in order to deal with the case of duplicate entry names. Normally (64-bit platform, compressed oops enabled), these LinkedLists occupy an additional 56 bytes of memory per entry.
This PR optimizes the implementation of ZipFile, thereby:
- reducing the memory footprint of ZipFile;
- making ZipFile::getEntry(String) faster.

For extremely rare situations, this PR may have slight negative effects, but I think this should not be a problem.