Skip to content

Conversation

@vikiv480
Copy link
Contributor

@vikiv480 vikiv480 commented Aug 7, 2024

The current implementation of #unnormalize iterates over matched entity references that already has been substituted. With these changes we will reduce the number of redundant calls to rv.gsub!.

  • Reject filtered matches earlier in the loop
  • Improve #unnormalize by removing redundant calls to rv.gsub!
  • Improve entity_expansion_limit tests

Example:

require "rexml/parsers/baseparser"

entity_less_than = "<"
entitiy_length = 100

filler_text = "A"
filler_length = 100

feed = "#{entity_less_than * entitiy_length}#{filler_text * filler_length}"

base_parser = REXML::Parsers::BaseParser.new("")
base_parser.unnormalize(feed) # => "<" * 100 + "A" * 100

Before this PR, the example above would require 100 iterations. After this PR, 1 iteration.

@naitoh
Copy link
Contributor

naitoh commented Aug 7, 2024

Thanks for this pull request.
I think this fix looks good.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if the bytesize would be present in the 'raise' statement, so that one can at least adjust the @@entity_expansion_text_limit.

"entity expansion has grown too large: size: XY exceeded @@entity_expansion_text_limit"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll be helpful but let's work on it in a separated PR.

@kou
Copy link
Member

kou commented Aug 10, 2024

This fixes the sum calculation but breaks entity_expansion_limit.

@naitoh The current entity_expansion_limit doesn't have any problem, right? (We have the same result with REXML::Document.new and pull parser, right?)

Could you use the following for entity_expand_limit tests?

index 5e3ad75..aeef268 100644
--- a/test/test_sax.rb
+++ b/test/test_sax.rb
@@ -145,17 +145,19 @@ module REXMLTests
 </member>
           XML
 
+          REXML::Security.entity_expansion_limit = 100000
           sax = REXML::Parsers::SAX2Parser.new(source)
-          assert_raise(RuntimeError.new("number of entity expansions exceeded, processing aborted.")) do
-            sax.parse
-          end
+          sax.parse
+          assert_equal(11111, sax.entity_expansion_count)
 
-          REXML::Security.entity_expansion_limit = 100
+          REXML::Security.entity_expansion_limit = @default_entity_expansion_limit
           sax = REXML::Parsers::SAX2Parser.new(source)
           assert_raise(RuntimeError.new("number of entity expansions exceeded, processing aborted.")) do
             sax.parse
           end
-          assert_equal(101, sax.entity_expansion_count)
+          assert do
+            sax.entity_expansion_count > @default_entity_expansion_limit
+          end
         end
 
         def test_with_default_entity

I think that we need another approach something like the following:

diff --git a/lib/rexml/parsers/baseparser.rb b/lib/rexml/parsers/baseparser.rb
index 28810bf..699ed91 100644
--- a/lib/rexml/parsers/baseparser.rb
+++ b/lib/rexml/parsers/baseparser.rb
@@ -547,22 +547,31 @@ module REXML
           [Integer(m)].pack('U*')
         }
         matches.collect!{|x|x[0]}.compact!
+        if filter
+          matches.reject! do |entity_reference|
+            filter.include?(entity_reference)
+          end
+        end
         if matches.size > 0
           sum = 0
-          matches.each do |entity_reference|
-            unless filter and filter.include?(entity_reference)
-              entity_value = entity( entity_reference, entities )
-              if entity_value
-                re = Private::DEFAULT_ENTITIES_PATTERNS[entity_reference] || /&#{entity_reference};/
-                rv.gsub!( re, entity_value )
-                sum += rv.bytesize
-                if sum > Security.entity_expansion_text_limit
-                  raise "entity expansion has grown too large"
-                end
-              else
-                er = DEFAULT_ENTITIES[entity_reference]
-                rv.gsub!( er[0], er[2] ) if er
+          matches.tally.each do |entity_reference, n|
+            entity_expansion_count_before = @entity_expansion_count
+            entity_value = entity( entity_reference, entities )
+            entity_expansion_count_delta =
+              @entity_expansion_count - entity_expansion_count_before
+            if n > 1
+              record_entity_expansion(entity_expansion_count_delta * (n - 1))
+            end
+            if entity_value
+              re = Private::DEFAULT_ENTITIES_PATTERNS[entity_reference] || /&#{entity_reference};/
+              rv.gsub!( re, entity_value )
+              sum += rv.bytesize
+              if sum > Security.entity_expansion_text_limit
+                raise "entity expansion has grown too large"
               end
+            else
+              er = DEFAULT_ENTITIES[entity_reference]
+              rv.gsub!( er[0], er[2] ) if er
             end
           end
           rv.gsub!( Private::DEFAULT_ENTITIES_PATTERNS['amp'], '&' )
@@ -572,8 +581,8 @@ module REXML
 
       private
 
-      def record_entity_expansion
-        @entity_expansion_count += 1
+      def record_entity_expansion(delta=1)
+        @entity_expansion_count += delta
         if @entity_expansion_count > Security.entity_expansion_limit
           raise "number of entity expansions exceeded, processing aborted."
         end

I could add tests for entity_expansion_text_limit if I get a suggestion on where they would fit

Could you use this?

diff --git a/test/test_sax.rb b/test/test_sax.rb
index 5e3ad75..d2bc231 100644
--- a/test/test_sax.rb
+++ b/test/test_sax.rb
@@ -102,10 +102,12 @@ module REXMLTests
     class EntityExpansionLimitTest < Test::Unit::TestCase
       def setup
         @default_entity_expansion_limit = REXML::Security.entity_expansion_limit
+        @default_entity_expansion_text_limit = REXML::Security.entity_expansion_text_limit
       end
 
       def teardown
         REXML::Security.entity_expansion_limit = @default_entity_expansion_limit
+        REXML::Security.entity_expansion_text_limit = @default_entity_expansion_text_limit
       end
 
       class GeneralEntityTest < self
@@ -124,6 +126,17 @@ module REXMLTests
 </member>
           XML
 
+          REXML::Security.entity_expansion_limit = 100_000
+          REXML::Security.entity_expansion_text_limit = 1_000_000_000
+          sax = REXML::Parsers::SAX2Parser.new(source)
+          text_size = nil
+          sax.listen(:characters, ["member"]) do |text|
+            text_size = text.size
+          end
+          sax.parse
+          assert_equal(300002, text_size)
+
+          REXML::Security.entity_expansion_text_limit = @default_entity_expansion_text_limit
           sax = REXML::Parsers::SAX2Parser.new(source)
           assert_raise(RuntimeError.new("entity expansion has grown too large")) do
             sax.parse

Could you do similar one for REXMLTests::PullParserTester::EntityExpansionLimitTest::GeneralEntityTest#test_have_value too?

@naitoh
Copy link
Contributor

naitoh commented Aug 10, 2024

@naitoh The current entity_expansion_limit doesn't have any problem, right? (We have the same result with REXML::Document.new and pull parser, right?)

@kou
Sorry.
The entity_expansion_limit is no problem.
But, the entity_expansion_text_limit has problem, so I created #195.

* Reject filtered matches earlier in the loop
* Improve `#unnormalize` by removing redundant calls to `rv.gsub!`
* Improve `entity_expansion_limit` tests

Co-Authored-By: Sutou Kouhei <[email protected]>
@vikiv480 vikiv480 force-pushed the improve-unnormalize-and-fix-sum-calculation branch from 4fd8b6b to 83be597 Compare August 13, 2024 19:05
@vikiv480
Copy link
Contributor Author

This fixes the sum calculation but breaks entity_expansion_limit.

Ah, I see, you are right!

Could you use the following for entity_expand_limit tests?
I think that we need another approach something like the following:

I've added your suggested changes to 83be597, but I moved the calculation of entity_expansion_count_delta deeper into the loop.

Since issue #193 was handled by #195, I'll edit the PR description to match the current situation.

@vikiv480 vikiv480 changed the title Improve BaseParser#unnormalize and fix sum calculation Improve BaseParser#unnormalize Aug 13, 2024
@kou
Copy link
Member

kou commented Aug 14, 2024

Ah, #tally doesn't exist in Ruby 2.5 and 2.6. How about using refinements?

diff --git a/lib/rexml/parsers/baseparser.rb b/lib/rexml/parsers/baseparser.rb
index 342f948..0ac243a 100644
--- a/lib/rexml/parsers/baseparser.rb
+++ b/lib/rexml/parsers/baseparser.rb
@@ -8,6 +8,22 @@ require "strscan"
 
 module REXML
   module Parsers
+    unless [].respond_to?(:tally)
+      module EnumerableTally
+        refine Enumerable do
+          def tally
+            counts = {}
+            each do |item|
+              counts[item] ||= 0
+              counts[item] += 1
+            end
+            counts
+          end
+        end
+      end
+      using EnumerableTally
+    end
+
     if StringScanner::Version < "3.0.8"
       module StringScannerCaptures
         refine StringScanner do

`#tally` doesn't exist in Ruby 2.5 and 2.6

* Refine `Enumerable` to support `#tally` in `REXML::Parsers`

Co-Authored-By: Sutou Kouhei <[email protected]>
@vikiv480
Copy link
Contributor Author

Ah, #tally doesn't exist in Ruby 2.5 and 2.6. How about using refinements?

Thanks for the patch, added in b0949d8

@kou
Copy link
Member

kou commented Aug 14, 2024

@naitoh Could you review this before we merge this?

@naitoh
Copy link
Contributor

naitoh commented Aug 15, 2024

Could you review this before we merge this?

I have checked this PR.
I think this PR is good.
Thanks!

@kou kou merged commit 2f019f9 into ruby:master Aug 16, 2024
@kou
Copy link
Member

kou commented Aug 16, 2024

Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants