Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make x3 interger parser less dependent on fundamental character types #762

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

wanghan02
Copy link
Contributor

@wanghan02 wanghan02 commented May 5, 2023

Many of the spirit x3 parsers are natualy able to work with user defined char types. It would be very helpful if x3's integer parser could be one of those parsers. And it could be done by more canonical implementation of radix_traits::is_valid and radix_traits::digit.

In radix_traits::is_valid, input character ch of type Char is compared with several ascii characters of type char. The static_cast destination type should still be ascii character type char, not the input character type Char. Comparing input character ch with ascii type char should be the more canonical way to do the range check just like the comparison between ch with other ascii char literals ('0', '9', 'a', 'A'). If we have a user defined character type which doesn't support implicit conversion to/from the promoted integer type, the range check still works!

In the old implementation of radix_traits::digit, the Radix check is not necessary and incomplete.

            return (Radix <= 10 || (ch >= '0' && ch <= '9'))
                ? ch - '0'
                : char_encoding::ascii::tolower(ch) - 'a' + 10;

Since radix_traits::digit is only called after the range check radix_traits::is_valid, we don't have to do any range check at all in radix_traits::digit. The input character ch is guaranteed to be in one of the 3 ranges: '0'-'9', 'a' - 'a' + Radix -10 -1 and 'A' - 'A' + Radix -10 -1. We also don't have to call char_encoding::ascii::tolower(ch) which adds another layer of unnecessary dependency. Given the valid range is right above, a simpler implementation as below is clear and efficient. And it doesn't require implicit conversion to integer type from input character type as well!

            return ch < 'A'
                ? ch - '0'
                : ch < 'a'
                    ? ch - 'A' + 10
                    : ch - 'a' + 10;

Both old and new implementations assume char literal uses ascii table.

Copy link
Collaborator

@Kojoley Kojoley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

user defined char types.

I don't think I ever seen one. It would help me in understanding the intent of the change if you show one.

Please also notice that when a patch doesn't add tests there is no guarantee that your use case will not be broken in a future by some unrelated change.

Comment on lines +115 to +117
return (ch >= '0' && ch <= (Radix > 10 ? '9' : static_cast<char>('0' + Radix -1)))
|| (Radix > 10 && ch >= 'a' && ch <= static_cast<char>('a' + Radix -10 -1))
|| (Radix > 10 && ch >= 'A' && ch <= static_cast<char>('A' + Radix -10 -1));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that at least for consistency it is better, though the original code is relying without a static assert or a comment on a fact that all these constants always are positive numbers and radix addition to a constant would never overflow or will be caught by the compiler.

}

template <typename Char>
inline static unsigned digit(Char ch)
{
return (Radix <= 10 || (ch >= '0' && ch <= '9'))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a compile time branch, an if constexpr that doesn't produce a range of compiler warnings 26356b5 think-cell@7771d20. You basically replaced a compile time branch with a runtime one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this minor optimization is better than simplicity. But If it is really important that the commit won't change the performance in any branch, we could add this compile time optimization back in. Then we could keep this optimization and don't rely on char_encoding::ascii::tolower.

     return (Radix <= 10 || ch < 'A')
        ? ch - '0'
        : ch < 'a'
            ? ch - 'A' + 10
            : ch - 'a' + 10;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a char_ascii type that checks ascii range during construction. It is implicitly convertible and comparable to fundamental char types but not integer types.

IMO the two changes (with Radix constexpr optimization for the 2nd one?) are better implementation than the original code. They provide better semantics and less dependency. If you @Kojoley agree with both changes I could add a simplified char_ascii type and test cases to cover this change.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does your 'user defined char type' support bitwise operators? Will replacing char_encoding::ascii::tolowe with (ch | 0x20) - 'a' + 10 work for you?

Copy link
Contributor Author

@wanghan02 wanghan02 May 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For fundamental char types, integral promotion is performed before doing bitwise operations. Basically it is an integer operation and relies on implicit conversion to integer type. For our char_ascii type, we focus more on character side of the type and don't support any integer operations other than getting the distance between 2 char_asciis.

But I do see there are some performance improvement with (ch | 0x20) - 'a' + 10 than ternary operator in case of fundamental character types. Can we take std::is_convertible<Char const&, int>::value into the path to make user-defined char types similar to our char_ascii work without downgrade the performance of fundamental char types?

     return (Radix <= 10 || ch < 'A')
        ? ch - '0'
        : std::is_convertible<Char const&, int>::value
            ? (ch | 0x20) - 'a' + 10
            : ch < 'a'
                ? ch - 'A' + 10
                : ch - 'a' + 10;

The performance is the same as below with fundamental char types.

    return (Radix <= 10 || ch < 'A')
        ? ch - '0'
        : (ch | 0x20) - 'a' + 10;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your char type design is just broken. Is there tolower associated with it?

Can we take std::is_convertible<Char const&, int>::value into the path to make user-defined char types similar to our char_ascii work without downgrade the performance of fundamental char types?

I don't want to add a code path that literally nobody else is using and also is not tested by the test suite.

Copy link
Contributor Author

@wanghan02 wanghan02 May 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We simply don't use tolower. It doesn't matter how we design our character type. The user has freedom to define any character type. It's just if some day the community wants to add constraints to every spirit x3 parser, what is the minimum request on the Char type for this parser? I don't see depending on tolower or implicit cast to int is necessary in this function performance-wise and semantics-wise. We can split the function into overloads for better semantics. The Radix<=10 compile time optimization mixed in the code path is already weird.

        template <typename Char, unsigned Radix_=Radix, std::enable_if_t<Radix_<=10>* = nullptr>
        inline static unsigned digit(Char ch)
        {
            return ch - '0';
        }
        
        template <typename Char, unsigned Radix_=Radix, std::enable_if_t<10<Radix_ && std::is_convertible<Char, int>::value>* = nullptr>
        inline static unsigned digit(Char ch)
        {
            return ch < 'A'
                ? ch - '0'
                : (ch | 0x20) - 'a' + 10;
        }
        
        template <typename Char, unsigned Radix_=Radix, std::enable_if_t<10<Radix_ && !std::is_convertible<Char, int>::value>* = nullptr>
        inline static unsigned digit(Char ch)
        {
            return ch < 'A'
                ? ch - '0'
                : ch < 'a'
                    ? ch - 'A' + 10
                    : ch - 'a' + 10;
        }

You can take this pull request as a refactoring suggestion. The first change static_cast<Char> -> static_cast<char> for me definitely makes the code cleaner. The second change makes digit more aligned with the implementation of is_valid which doesn't use tolower either. Since digit is only called if is_valid is true, the implementation of these 2 functions should be aligned with each other. But it's only a refactoring suggestion. As I said if you agree, I could add test cases to cover the change. Otherwise we have to specialize this struct to workaround the dependency.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants