Recently the new method `#scan_integer` was introduced (see #113) to optimize scanning integer values.
The current implementation works regardless of what follows the integer, i.e. scanning `123`, `123 something`, `123,something`, `123.32` and `123something` all work and would return 123.
However, in what I suspect are many cases, an integer is only valid if it is (or is not) followed by certain characters. One example is the input `123d`, which leads to an error when interpreted as Ruby code.
My use case is PDF syntax. There, a token is an integer only when it is followed by a whitespace character (ASCII decimal 0, 9, 10, 12, 13 and 32) or a delimiter character (`( ) < > [ ] / %`); otherwise it is a generic token. To handle this, the implementation using `#scan_integer` looks like this:
```ruby
# Parses the number (integer or real) at the current position.
#
# See: PDF2.0 s7.3.3
def parse_number
  prepare_string_scanner(20)
  pos = self.pos
  if (tmp = @ss.scan_integer)
    if @ss.eos? || @ss.match?(WHITESPACE_OR_DELIMITER_RE)
      # Handle object references, see PDF2.0 s7.3.10
      prepare_string_scanner(10)
      if @ss.scan(REFERENCE_RE)
        tmp = if tmp > 0
                Reference.new(tmp, @ss[1].to_i)
              else
                maybe_raise("Invalid indirect object reference (#{tmp},#{@ss[1].to_i})")
                nil
              end
      end
      return tmp
    else
      self.pos = pos
    end
  end
  val = scan_until(WHITESPACE_OR_DELIMITER_RE) || @ss.scan(/.*/)
  if val.match?(/\A[+-]?(?:\d+\.\d*|\.\d+)\z/)
    val << '0' if val.getbyte(-1) == 46 # dot '.'
    Float(val)
  else
    TOKEN_CACHE[val] # val is keyword
  end
end
```
As you can see, we

- need to store the current scan position,
- check if scanning an integer works at the current position,
- scan the content after the integer to verify that it is indeed an integer and work with it, or
- if the previous step didn't work, reset the scan position.
This could be simplified to just a call of `#scan_integer` if this method could optionally check the contents after the integer. Something like `#scan_integer(separator: SEPARATOR_PATTERN)` or maybe `#scan_integer(separator_chars: STRING)` (where `STRING` contains separator characters, similar to how `String#tr` works).
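To illustrate the intended semantics, here is a rough sketch written as a standalone helper on top of the existing `StringScanner` API. The name `scan_integer_with_separator` and the `PDF_SEPARATOR` constant are made up for this example and are not part of strscan; the sketch also uses a plain `scan` with a regexp instead of `#scan_integer` so that it runs on older strscan versions.

```ruby
require 'strscan'

# Hypothetical pattern for the PDF whitespace and delimiter characters
# listed above (not the constant used in the real tokenizer).
PDF_SEPARATOR = /[\x00\t\n\f\r ()<>\[\]\/%]/

# Hypothetical helper, not part of StringScanner: scans a decimal integer
# only if it is followed by +separator+ or the end of the string.
# Returns the Integer on success; otherwise returns nil and leaves the
# scan position untouched.
def scan_integer_with_separator(scanner, separator)
  start = scanner.pos
  digits = scanner.scan(/[+-]?\d+/)
  return nil unless digits
  # match? only looks ahead, so the separator itself is not consumed.
  return digits.to_i if scanner.eos? || scanner.match?(separator)

  scanner.pos = start # not a standalone integer, rewind
  nil
end

ss = StringScanner.new("123 something")
scan_integer_with_separator(ss, PDF_SEPARATOR) # integer followed by a space

ss = StringScanner.new("123something")
scan_integer_with_separator(ss, PDF_SEPARATOR) # no separator, position is reset
```

With a built-in `separator:` option, the position bookkeeping and the look-ahead check would move out of the caller and into `#scan_integer` itself.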
Would it make sense to include such functionality?