Preserve control characters
If a control character like `\u0002` appears in the XML, the REXML parser preserves it, but the Nokogiri parser bails out and leaves the document only partially parsed. Scrubbing the string does not help here, because the character is valid Unicode; it is only invalid in XML 1.0.
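
The distinction can be checked in plain Ruby. The sketch below assumes the XML 1.0 `Char` production (tab, LF, CR, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF). The helper name `xml10_char?` is hypothetical and is not part of Nori or Nokogiri.

```ruby
# Hypothetical helper (not part of Nori): true if the code point is
# allowed by the XML 1.0 `Char` production.
def xml10_char?(codepoint)
  [0x9, 0xA, 0xD].include?(codepoint) ||
    (0x20..0xD7FF).cover?(codepoint) ||
    (0xE000..0xFFFD).cover?(codepoint) ||
    (0x10000..0x10FFFF).cover?(codepoint)
end

text = "a\u0002c"

# U+0002 is valid Unicode, so scrubbing leaves the string unchanged.
scrub_is_noop = text.valid_encoding? && text.scrub == text

# The same character falls outside the XML 1.0 `Char` ranges.
invalid_in_xml = text.each_char.reject { |c| xml10_char?(c.ord) }
```

Here `scrub_is_noop` is true while `invalid_in_xml` still contains the control character, which is why the problem has to be handled in the parser callbacks and not by cleaning the input string.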

To handle this we extract the character's code point from Nokogiri's error message and pass the character on as ordinary text. For parsing to continue past the error we must also tell Nokogiri to recover from errors.
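
A minimal sketch of the extraction step, runnable without Nokogiri. The `FakeDocument` class, the sample error message, and the callback order are assumptions for illustration: the message imitates the error text that the regular expression in this commit matches, and the real handler lives in the `Document` class in `lib/nori/parser/nokogiri.rb`.

```ruby
# Stand-in for the SAX document (hypothetical; only collects text).
class FakeDocument
  attr_reader :text

  def initialize
    @text = +""
  end

  # Called by a SAX parser for each run of character data.
  def characters(string)
    @text << string
  end

  # Called by a SAX parser for each error. An error about an invalid
  # character is turned back into that character; anything else is raised.
  def error(message)
    raise message unless (invalid_chr = message[/PCDATA invalid Char value (\d+)/, 1])

    characters(invalid_chr.to_i.chr)
  end
end

doc = FakeDocument.new
# Simulate the callbacks a recovering parser might make for "a\u0002c".
doc.characters("a")
doc.error("PCDATA invalid Char value 2") # assumed message format
doc.characters("c")
```

After these calls `doc.text` holds the original string with the control character in place. Note that `Integer#chr` without an encoding argument raises `RangeError` for values above 255, so this sketch only covers low code points such as the control characters discussed here.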
stenlarsson committed Apr 5, 2024
1 parent dbbd948 commit 855fb6b
Showing 2 changed files with 12 additions and 1 deletion.
9 changes: 8 additions & 1 deletion lib/nori/parser/nokogiri.rb
@@ -44,13 +44,20 @@ def characters(string)

   alias cdata_block characters

+  def error(message)
+    raise message unless (invalid_chr = message[/PCDATA invalid Char value (\d+)/, 1])
+
+    characters(invalid_chr.to_i.chr)
+  end
 end

 def self.parse(xml, options)
   document = Document.new
   document.options = options
   parser = ::Nokogiri::XML::SAX::Parser.new document
-  parser.parse xml
+  parser.parse xml do |ctx|
+    ctx.recovery = true
+  end
   document.stack.length > 0 ? document.stack.pop.to_hash : {}
 end

4 changes: 4 additions & 0 deletions spec/nori/nori_spec.rb
@@ -640,6 +640,10 @@
       expect(parse(' ')).to eq({})
     end

+    it "should preserve control characters" do
+      xml = "<tag>a\u0002c</tag>".force_encoding('UTF-8')
+      expect(parse(xml)["tag"]).to eq("a\u0002c")
+    end
   end
 end
