Optimize the nameMap of ZipFile #378
base: master
Conversation
Codecov Report

Attention:

Additional details and impacted files

@@             Coverage Diff              @@
##             master     #378      +/-   ##
============================================
- Coverage     80.55%   80.53%    -0.03%
- Complexity     6764     6766        +2
============================================
  Files           342      342
  Lines         24860    24874       +14
  Branches       4017     4022        +5
============================================
+ Hits          20027    20033        +6
- Misses         3298     3302        +4
- Partials       1535     1539        +4

☔ View full report in Codecov by Sentry.
If I understand it right, instead of having a reverse map with the names and a list of duplicated entries, here we would have a new separate map to keep the duplicates. In case there are no entries with the same name, that data structure would not be used, and it would perform a little better. getEntry will definitely be faster.

"For most zip files..." I think that is correct for most cases. +0, as I am not using the component with a large-ish project where I could test it.

Thanks!
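Roughly, the idea in code - a simplified sketch based on this discussion, not the exact PR diff (class and method names here are illustrative only):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;

// Sketch of the two-map idea: the common case touches only nameMap;
// duplicateNameMap is allocated lazily and only on a name collision.
class NameMaps {
    private final Map<String, ZipArchiveEntry> nameMap = new HashMap<>();
    private Map<String, List<ZipArchiveEntry>> duplicateNameMap;

    void addEntry(final ZipArchiveEntry entry) {
        final String name = entry.getName();
        final ZipArchiveEntry first = nameMap.putIfAbsent(name, entry);
        if (first == null) {
            return; // unique name so far: no list allocated at all
        }
        // Rare path: a second entry with this name showed up.
        if (duplicateNameMap == null) {
            duplicateNameMap = new HashMap<>();
        }
        duplicateNameMap.computeIfAbsent(name, k -> {
            final List<ZipArchiveEntry> list = new ArrayList<>(2);
            list.add(first);
            return list;
        }).add(entry);
    }
}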
No tests. Code coverage is down. No proof of performance change one way or another. Non-trivial additional complexity. So -1 as it stands. "For extremely rare situations, this PR may have slight negative effects," Which are? "But I think this should not be a problem." Because?
The core intention of this PR is to reduce memory allocation. As mentioned earlier, we currently allocate a LinkedList for each entry in ZipFile, resulting in an additional 32 (LinkedList) + 24 (LinkedList$Node) bytes of memory overhead. This PR no longer allocates them. As for the side effect - a performance improvement - these are the results of a simple JMH benchmark:

Benchmark source code
import java.io.IOException;
import java.nio.file.Path;
import java.util.Collections;

import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipFile;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

public class Compress {
    private static final Path jarPath = Path.of(/* path of commons-compress-1.23.0.jar */);
    private static final ZipFile globalFile;
    private static final String[] entryNames;

    static {
        try {
            globalFile = new ZipFile(jarPath);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        entryNames = Collections.list(globalFile.getEntries())
                .stream()
                .map(ZipArchiveEntry::getName)
                .toArray(String[]::new);
    }

    @Benchmark
    public void testOpen() throws IOException {
        // Measures parsing the central directory and building the name maps.
        try (ZipFile file = new ZipFile(jarPath)) {
        }
    }

    @Benchmark
    public void testGetEntry(Blackhole blackhole) throws IOException {
        for (String entryName : entryNames) {
            blackhole.consume(globalFile.getEntry(entryName));
        }
    }

    @Benchmark
    public void testGetEntries(Blackhole blackhole) throws IOException {
        for (String entryName : entryNames) {
            blackhole.consume(globalFile.getEntries(entryName));
        }
    }

    @Benchmark
    public void testGetEntriesInPhysicalOrder(Blackhole blackhole) throws IOException {
        for (String entryName : entryNames) {
            blackhole.consume(globalFile.getEntriesInPhysicalOrder(entryName));
        }
    }
}

commons-compress 1.23.0:
This PR:
For the same jar file (commons-compress-2.13.0.jar):
The only problem is getEntries(String), and I don't think it matters, because this situation is too rare. As far as I know, most tools (7zip, the jar command, etc.) don't create zip files containing entries with duplicate names. Such unusual zip files can generally only be created with some low-level API, so these files themselves are very rare; I can't even easily find examples in the real world. And even for such an unusual file, it's debatable whether this PR has a positive or negative effect. If the majority of entries do not share a name, this PR still has a positive effect. If there are a large number of entries with the same name, the effect of this PR may differ under different conditions. But I don't think this is a case worth digging into, as I can't imagine a real-world use case for it. If you know of a real use case for entries with duplicate names, I'll look into it further. For now, I don't see much point in worrying about this situation, since regular zip files are far, far more numerous.
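For what it's worth, such a file is easy to produce with Commons Compress's own low-level writer, since ZipArchiveOutputStream (unlike java.util.zip.ZipOutputStream, which rejects duplicates with a ZipException) does not check entry names. A sketch; the output path and entry name are arbitrary:

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream;

public class DuplicateNames {
    public static void main(String[] args) throws IOException {
        try (ZipArchiveOutputStream out = new ZipArchiveOutputStream(new File("dup.zip"))) {
            // Both entries are written under the exact same name.
            for (String content : new String[] { "first", "second" }) {
                out.putArchiveEntry(new ZipArchiveEntry("a.txt"));
                out.write(content.getBytes(StandardCharsets.UTF_8));
                out.closeArchiveEntry();
            }
        }
    }
}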
List<ZipArchiveEntry> entriesOfThatName = duplicateNameMap.computeIfAbsent(name, k -> {
    ArrayList<ZipArchiveEntry> list = new ArrayList<>(2);
Why 2?
This list is only created once there are two entries with the same name. If no more entries with that name are encountered later, using 2 as the initial capacity avoids unnecessary memory footprint.
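For example, with compressed oops (using the per-object sizes from the footprint breakdown in the next comment): new ArrayList<>(2) holding two elements costs about 24 + 16 + 2*4 = 48 bytes, while a default-capacity ArrayList grows its backing array to 10 slots on the first add and costs about 24 + 16 + 10*4 = 80 bytes.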
The reason for using ArrayList instead of LinkedList here is that ArrayList is almost always better than LinkedList.

In general, we can calculate the memory footprint of a list with N elements like this:
- LinkedList: 24 + 24*N (one LinkedList$Node per element)
- ArrayList, minimal (internal array is fully used, no wasted space): 24 (ArrayList) + 16 (array object header) + 4*N
- ArrayList, maximum (internal array has just grown, so the reserved space is largest): 24 (ArrayList) + 16 (array object header) + 6*N

If the list has only two elements, an ArrayList with 2 as the initial capacity saves 24 bytes of memory compared to a LinkedList. As the list grows, the memory footprint of ArrayList is usually only 1/6 to 1/4 of that of LinkedList. The combined memory footprint of all objects allocated during ArrayList construction (including the small temporary arrays discarded while growing) is smaller than the long-term memory footprint of LinkedList, so using ArrayList is more memory efficient.

For smaller lists, iterating over a LinkedList might be faster, but ArrayList has better data locality. So we usually have no reason to use LinkedList; ArrayList/ArrayDeque are the better choices.
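If it helps, these numbers can be checked empirically with OpenJDK's JOL tool; a sketch, assuming the org.openjdk.jol:jol-core dependency is on the classpath:

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import org.openjdk.jol.info.GraphLayout;

public class ListFootprint {
    public static void main(String[] args) {
        Object a = new Object();
        Object b = new Object();
        List<Object> linked = new LinkedList<>();
        List<Object> array = new ArrayList<>(2);
        linked.add(a);
        linked.add(b);
        array.add(a);
        array.add(b);
        // totalSize() walks the whole object graph, so it includes the two
        // shared elements (16 bytes each); subtract them to compare the
        // lists' own overhead.
        System.out.println("LinkedList: " + GraphLayout.parseInstance(linked).totalSize() + " bytes");
        System.out.println("ArrayList:  " + GraphLayout.parseInstance(array).totalSize() + " bytes");
    }
}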
Can someone review this PR again?

@garydgregory Can you review this PR again?

# Conflicts:
#	src/main/java/org/apache/commons/compress/archivers/zip/ZipFile.java

@garydgregory Hey, can you take a look at this PR?

Looks OK to me, though claims of performance improvements really need measurements and benchmarks.

Here are some JMH benchmark results: #378 (comment)

But where is the benchmark? You should make it part of this PR so that anyone can run it.
Made a couple of (mostly nitpicky) comments on the diffs.
Regarding the benchmarks: it might be nice to have the benchmarks as part of the testsuite, but there are currently no benchmarks in compress. Might be useful as part of another MR?
@@ -368,9 +369,14 @@ public static void closeQuietly(final ZipFile zipFile) {
     private final List<ZipArchiveEntry> entries = new LinkedList<>();

     /**
-     * Maps String to list of ZipArchiveEntrys, name -> actual entries.
+     * Maps a string to the first entry named it.
Nitpicking: "Maps a string to the first entry by that name" sounds clearer to me.
I'm not good at English so thank you very much for pointing out the poor grammar.
@@ -820,7 +839,13 @@ public Enumeration<ZipArchiveEntry> getEntries() {
      * @since 1.6
      */
     public Iterable<ZipArchiveEntry> getEntries(final String name) {
-        return nameMap.getOrDefault(name, ZipArchiveEntry.EMPTY_LINKED_LIST);
You mentioned a potential 30% slowdown in this method in return for additional safety, which may or may not be a fair tradeoff. But I'm a bit concerned that this change potentially breaks the current behaviour. As you say, this currently returns a modifiable LinkedList, which some people may (though I hope not) depend on? Because it requires an explicit cast, I'm not too worried.
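For illustration, a hypothetical caller relying on the old return type would look like this; it compiles either way, but the cast fails at runtime once getEntries no longer returns the internal LinkedList:

import java.util.LinkedList;
import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipFile;

class OldBehaviourCaller {
    // Hypothetical code: this cast only worked because getEntries(name)
    // used to return ZipFile's live, modifiable LinkedList.
    static void dropFirst(ZipFile zipFile, String name) {
        LinkedList<ZipArchiveEntry> list =
                (LinkedList<ZipArchiveEntry>) zipFile.getEntries(name);
        list.removeFirst(); // mutated ZipFile's internal state; throws ClassCastException after this change
    }
}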
I think the slowdown only occurs when there are duplicate entries, which is extremely unlikely. I'd still simplify the code, as per the earlier comment.
-    private final Map<String, LinkedList<ZipArchiveEntry>> nameMap = new HashMap<>(HASH_SIZE);
+    private final Map<String, ZipArchiveEntry> nameMap = new HashMap<>(HASH_SIZE);
+
+    /**
+     * If multiple entries have the same name, maps the name to entries by that name.
+     */
+    private Map<String, List<ZipArchiveEntry>> duplicateNameMap;
Given that this is extremely unlikely, would it make sense to keep the original map, change the value type to Object, and let it hold either a ZipArchiveEntry or the List? It would complicate the lookup, but that would be confined to a single place.
I REALLY don't like that. When you look at the pack200 code, which is written at the pre-generics Java version level, and some of its lists, you can't tell if some of the lists are buggy, sloppy, or trying to be clever. That, or the lack of generics, lets you get away with not having fully thought through the hierarchy of types you put in those lists.
That's true, such optimizations typically come at the cost of readability and safety. However, keeping the amount of code touching the data structure limited may help - we would need only three methods - first-by-name, all-by-name and add-entry - and document the field as not to be touched except via those three methods :-)

Just a suggestion if it comes to true optimization, given this PR is purely about improving time and space performance. Note that getEntries(name) is not slower only because of wrapping into a Collection, but also because it performs a double lookup.
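A sketch of that suggestion (class, method, and helper names here are hypothetical; the unchecked cast is confined to one helper):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;

// Single map whose value is either a ZipArchiveEntry or, rarely, a
// List<ZipArchiveEntry>. Only these three methods may touch it.
class NameIndex {
    private final Map<String, Object> nameMap = new HashMap<>();

    void addEntry(ZipArchiveEntry entry) {
        Object existing = nameMap.putIfAbsent(entry.getName(), entry);
        if (existing == null) {
            return; // common case: first entry by that name
        }
        if (existing instanceof List) {
            asList(existing).add(entry);
        } else {
            // Second entry by this name: switch the value to a list.
            List<ZipArchiveEntry> list = new ArrayList<>(2);
            list.add((ZipArchiveEntry) existing);
            list.add(entry);
            nameMap.put(entry.getName(), list);
        }
    }

    ZipArchiveEntry firstByName(String name) {
        Object value = nameMap.get(name);
        return value instanceof List ? asList(value).get(0) : (ZipArchiveEntry) value;
    }

    List<ZipArchiveEntry> allByName(String name) {
        Object value = nameMap.get(name);
        if (value == null) {
            return Collections.emptyList();
        }
        if (value instanceof List) {
            return Collections.unmodifiableList(asList(value));
        }
        return Collections.singletonList((ZipArchiveEntry) value);
    }

    @SuppressWarnings("unchecked")
    private static List<ZipArchiveEntry> asList(Object value) {
        return (List<ZipArchiveEntry>) value;
    }
}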
Hello All
For most regular zip files, all of their entries will have different names.
The current implementation of ZipFile maps each name to a LinkedList in the nameMap in order to deal with the case of duplicate entry names. Normally (64-bit platform, compressed oops enabled), these LinkedLists occupy an additional 56 bytes of memory per entry.
This PR optimizes the implementation of ZipFile, thereby:
- reducing the memory footprint of ZipFile;
- making ZipFile::getEntry(String) faster.

For extremely rare situations, this PR may have slight negative effects, but I think this should not be a problem.