Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide an efficient way to parse Integers (and Floats)? #113

Closed
casperisfine opened this issue Nov 7, 2024 · 16 comments · Fixed by #115
Closed

Provide an efficient way to parse Integers (and Floats)? #113

casperisfine opened this issue Nov 7, 2024 · 16 comments · Fixed by #115

Comments

@casperisfine
Copy link

Previous discussion: https://bugs.ruby-lang.org/issues/20394

Context

When trying to write pure Ruby gems that are competitive in term of performance with C extensions, a very common bottleneck is parsing of text based protocols and formats, such as the Redis RESP protocol, or even the PDF format (FYI @gettalong).

As a result, currently the most efficient way to parse integers in a string in Ruby, is to reimplement atoi using String#getbyte, which is a bit ridiculous.

Otherwise if you create a substring with String#slice or StringScanner#scan and then call to_i or Integer, instantiating the sub string and copying the bytes really tank the performance.

Proposal

Given that StringScanner is a default gem, is often involved in string parsing, and already act as a "pointer into a String", I think it's well positioned to offer an efficient way to parse an Integer without instantiating a useless temporary string.

Basically an optimized way to do scanner.scan(/\d+/).to_i.

The API could be any of:

  • scanner.scan(/\d+/, :to_i)
  • scanner.scan(/\d+/, Integer)
  • scanner.scan_integer(/\d+/)

Logically the two supported types would be Integer and Float, but perhaps others would be helpful for other protocols?

@kou as maintainer of strscan, do you have any opinion? I'm happy to put the work on this, but I'd need to know if the feature is desired, and which API would be deemed acceptable.

Also cc @tenderlove @mame from previous discussions.

@tenderlove
Copy link
Member

I think it makes sense, but we shouldn't we support things besides base 10? I'd expect to be able to provide a base (I think)

@casperisfine
Copy link
Author

The way I was seeing it, these would be a different format name, e.g. with the first syntax:

scanner.scan(/\d+/, :to_i)
scanner.scan(/\d[a-f]+/, :hexa)
# ...

But maybe that's not flexible enough? The issue is that for this to perform, we need to invoke a C function with a pointer, so argument passing a bit more complex I think?

@gettalong
Copy link

I plan on using the changes from #89 in HexaPDF (thanks @tenderlove) once they are released. This proposal feels quite similar and from a user's perspective I would probably go with the method names scan_int (or scan_integer) and scan_float.

The StringScanner#scan method is widely used and overloading it the proposed manner (e.g. additional argument and different return type) feels not right.

@kou
Copy link
Member

kou commented Nov 8, 2024

I like scanner.scan_integer. scanner.scan_integer(base=10) or scanner.scan_integer(base: 10) for non base 10 support.

Do we need the pattern argument (/d+/) for integer format? Can we always use String#to_i/Kernel.#Integer compatible format?

@casperisfine
Copy link
Author

Do we need the pattern argument (/d+/) for integer format? Can we always use String#to_i/Kernel.#Integer compatible format?

I think so, because Ruby's to_i / Integer accepts things you might not want, e.g. _ or 0x...:

>> "1_234".to_i
=> 1234
>> Integer("0x80")
=> 128

So if we're going to use something like rb_cstr2inum under the hood, we need to allow the user to restrict the class of character.

Alternatively we can not take a pattern argument, but we shouldn't allow thing like _ etc.

What do you think?

@kou
Copy link
Member

kou commented Nov 11, 2024

We can avoid the 0x... case by accepting base.

I have some concerns about mandatory pattern argument:

  • We may be able to optimize detecting the target substring without mandatory pattern
    • For example, we may be able to avoid using regular expression to detect the target substring
  • The user specified pattern may be buggy
    • For example, users may want to use /d+/ but it doesn't accept +/- sign. It may be buggy in most cases.
    • I want to design APIs that are difficult to misuse.

Can we add accept_underscore: false, make the pattern optional (scan_integer(pattern: /\d+/)) or something instead of mandatory pattern argument?

@gettalong
Copy link

For the HexaPDF use case there would need to be support for two things:

  • Integer: "An integer shall be written as one or more decimal digits optionally preceded by a sign." So strings like 123 43445 +17 -98 0.

  • Floats: "A real value shall be written as one or more decimal digits with an optional sign and a leading, trailing,
    or embedded period".
    So strings like 34.5 -3.62 +123.6 4. -.002.

For the integer use case I think this is what most integers look like. For the float use case the PDF syntax is probably special because a digit before or after the period is not necessary, as shown in the above examples. I'm not sure if only providing a scan pattern would work because the method itself would still need to know how to deal with that situation. So maybe a keyword argument for scan_float would be better than providing a pattern?

@gettalong
Copy link

How would you determine whether the digits at the current scan position are actually an integer and not something else? For example, is 123d something else considered as starting with an integer? In PDF syntax 123d would not be an integer but a general token.

Would it make sense to specify a "separator" pattern or "separator" characters?

@casperisfine
Copy link
Author

  • We may be able to optimize detecting the target substring without mandatory pattern

True, if we could avoid matching a regexp it would be preferable.

For example, users may want to use /d+/ but it doesn't accept +/- sign. It may be buggy in most cases.

That indeed sound like an easy mistake to do.

Can we add accept_underscore: false, make the pattern optional (scan_integer(pattern: /\d+/)) or something instead of mandatory pattern argument?

In my opinion, underscores should either not be supported or be opt-in, not opt-out, as it's a common thing in Ruby and YAML, but fairly rare otherwise, and I'm not even sure Ruby and YAML agree exactly on how underscores can be used.

So I think we shouldn't even support underscores, or at least not initially. strscan is used to parse all sorts of format, we probably shouldn't leak "rubyism's" into its default behavior.

For the float use case the PDF syntax is probably special because a digit before or after the period is not necessary

I think it's not that uncommon, Ruby, JavaScript, Python and likely many others support that, one notable exception I can think of being JSON.

So the decision is of course @kou call, but at that point I think scan_integer(pattern: /\d+/) is probably the most flexible? But I think even if we started with a version that doesn't take any option, it would already be very useful. So perhaps that's what we should do until some real use case show up?

@kou
Copy link
Member

kou commented Nov 14, 2024

So I think we shouldn't even support underscores, or at least not initially.

I'm OK with this.

So the decision is of course @kou call, but at that point I think scan_integer(pattern: /\d+/) is probably the most flexible? But I think even if we started with a version that doesn't take any option, it would already be very useful. So perhaps that's what we should do until some real use case show up?

How about implementing scan_integer() as the fist step and then implementing scan_integer(pattern:) as the next step?

@kou
Copy link
Member

kou commented Nov 14, 2024

We may implement scan_float later but it should be a separated task.

@casperisfine
Copy link
Author

How about implementing scan_integer() as the fist step and then implementing scan_integer(pattern:) as the next step?

Sounds good. I'll work on this today.

casperisfine pushed a commit to casperisfine/strscan that referenced this issue Nov 14, 2024
Fix: ruby#113

This allows to directly parse an Integer from a String without needing
to first allocate a sub string.

Notes:

The implementation is limited by design, it's meant as a first step,
only the most straightforward, based 10 integers are supported.
@casperisfine
Copy link
Author

#114

casperisfine pushed a commit to casperisfine/strscan that referenced this issue Nov 14, 2024
Fix: ruby#113

This allows to directly parse an Integer from a String without needing
to first allocate a sub string.

Notes:

The implementation is limited by design, it's meant as a first step,
only the most straightforward, based 10 integers are supported.
casperisfine pushed a commit to casperisfine/strscan that referenced this issue Nov 14, 2024
Fix: ruby#113

This allows to directly parse an Integer from a String without needing
to first allocate a sub string.

Notes:

The implementation is limited by design, it's meant as a first step,
only the most straightforward, based 10 integers are supported.
casperisfine pushed a commit to casperisfine/strscan that referenced this issue Nov 14, 2024
Fix: ruby#113

This allows to directly parse an Integer from a String without needing
to first allocate a sub string.

Notes:

The implementation is limited by design, it's meant as a first step,
only the most straightforward, based 10 integers are supported.
casperisfine pushed a commit to casperisfine/strscan that referenced this issue Nov 14, 2024
Fix: ruby#113

This allows to directly parse an Integer from a String without needing
to first allocate a sub string.

Notes:

The implementation is limited by design, it's meant as a first step,
only the most straightforward, based 10 integers are supported.
casperisfine pushed a commit to casperisfine/strscan that referenced this issue Nov 18, 2024
Fix: ruby#113

This allows to directly parse an Integer from a String without needing
to first allocate a sub string.

Notes:

The implementation is limited by design, it's meant as a first step,
only the most straightforward, based 10 integers are supported.
@gettalong
Copy link

@casperisfine I have been using the code from your pull request to benchmark HexaPDF with an adapted implementation that uses StringScanner#scan_integer (without YJIT and then with YJIT) and it looks really good:

$ RUBYOPT=-Ilib:/tmp/strscan/lib benchmark-driver bm.yaml
Warming up --------------------------------------
             current   203.166k i/s -    214.346k times in 1.055027s (4.92μs/i)
                 new   244.051k i/s -    259.589k times in 1.063666s (4.10μs/i)
Calculating -------------------------------------
             current   209.346k i/s -    609.499k times in 2.911445s (4.78μs/i)
                 new   251.667k i/s -    732.153k times in 2.909214s (3.97μs/i)

Comparison:
                 new:    251666.9 i/s
             current:    209345.9 i/s - 1.20x  slower

$ RUBYOPT="-Ilib:/tmp/strscan/lib --yjit" benchmark-driver bm.yaml
Warming up --------------------------------------
             current   244.532k i/s -    245.898k times in 1.005587s (4.09μs/i)
                 new   447.758k i/s -    450.714k times in 1.006603s (2.23μs/i)
Calculating -------------------------------------
             current   313.995k i/s -    733.595k times in 2.336329s (3.18μs/i)
                 new   460.687k i/s -      1.343M times in 2.915804s (2.17μs/i)

Comparison:
                 new:    460686.7 i/s
             current:    313994.7 i/s - 1.47x  slower

The adjusted implementation looks like this:

    # Parses the number (integer or real) at the current position.
    #
    # See: PDF2.0 s7.3.3
    def parse_number
      prepare_string_scanner(20)
      pos = self.pos
      if (tmp = @ss.scan_integer)
        if @ss.eos? || @ss.match?(WHITESPACE_OR_DELIMITER_RE)
          # Handle object references, see PDF2.0 s7.3.10
          prepare_string_scanner(10)
          if @ss.scan(REFERENCE_RE)
            tmp = if tmp > 0
                    Reference.new(tmp, @ss[1].to_i)
                  else
                    maybe_raise("Invalid indirect object reference (#{tmp},#{@ss[1].to_i})")
                    nil
                  end
          end
          return tmp
        else
          self.pos = pos
        end
      end

      val = scan_until(WHITESPACE_OR_DELIMITER_RE) || @ss.scan(/.*/)
      if val.match?(/\A[+-]?(?:\d+\.\d*|\.\d+)\z/)
        val << '0' if val.getbyte(-1) == 46 # dot '.'
        Float(val)
      else
        TOKEN_CACHE[val] # val is keyword
      end
    end
  end

Would it be possible to provide a "separator" argument to #scan_integer that needs to be matched after the integer? Then invalid cases like "1234d" would not match. And the above implementation could be simplified because the if @ss.eos? || @ss.match?(WHITESPACE_OR_DELIMITER_RE) part would already be handled by #scan_integer (and I wouldn't need to store the scan pointer and reset it if there is no match).

byroot added a commit to byroot/strscan that referenced this issue Nov 25, 2024
Fix: ruby#113

This allows to directly parse an Integer from a String without needing
to first allocate a sub string.

Notes:

The implementation is limited by design, it's meant as a first step,
only the most straightforward, based 10 integers are supported.
byroot added a commit to byroot/strscan that referenced this issue Nov 26, 2024
Fix: ruby#113

This allows to directly parse an Integer from a String without needing
to first allocate a sub string.

Notes:

The implementation is limited by design, it's meant as a first step,
only the most straightforward, based 10 integers are supported.
byroot added a commit to byroot/strscan that referenced this issue Nov 26, 2024
Fix: ruby#113

This allows to directly parse an Integer from a String without needing
to first allocate a sub string.

Notes:

The implementation is limited by design, it's meant as a first step,
only the most straightforward, based 10 integers are supported.
@kou kou closed this as completed in #115 Nov 26, 2024
kou pushed a commit that referenced this issue Nov 26, 2024
Fix: #113

This allows to directly parse an Integer from a String without needing
to first allocate a sub string.

Notes:

The implementation is limited by design, it's meant as a first step,
only the most straightforward, based 10 integers are supported.
@kou
Copy link
Member

kou commented Nov 27, 2024

@gettalong Could you open a new issue for your use case? Let's discuss it in a new issue not in a closed PR.

matzbot pushed a commit to ruby/ruby that referenced this issue Nov 27, 2024
(ruby/strscan#115)

Fix: ruby/strscan#113

This allows to directly parse an Integer from a String without needing
to first allocate a sub string.

Notes:

The implementation is limited by design, it's meant as a first step,
only the most straightforward, based 10 integers are supported.

ruby/strscan@6a3c74b4c8
@gettalong
Copy link

@gettalong Could you open a new issue for your use case? Let's discuss it in a new issue not in a closed PR.

Done: #119.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
4 participants