Enum types are not properly created when a Unicode character is used [rt.cpan.org #123698] #52

Open
rabbiveesh opened this issue Nov 20, 2022 · 0 comments

Migrated from rt.cpan.org#123698 (status was 'open')

Requestors:

From [email protected] on 2017-11-21 09:54:01:

The {extra}{list} enum values are not correctly encoded. I use the same connection settings for the app itself, and all data from the database is correctly encoded except this enum.


> \dT+
...
 steinhaus_main | enum_tasks_status   | enum_tasks_status   | 4     | offen         +| 
                |                     |                     |       | erledigt      +| 
                |                     |                     |       | zurückgestellt | 
...


$ grep status -C5 Tasks.pm
...
  "status",
  {
    data_type => "enum",
    default_value => "offen",
    extra => {
      custom_type_name => "enum_tasks_status",
      list => ["offen", "erledigt", "zur\xFCckgestellt"],
    },
    is_nullable => 0,
  },
...

The file is UTF-8 encoded and has use utf8; at the beginning, so I expected:

      list => ["offen", "erledigt", "zurückgestellt"],

From [email protected] on 2017-11-21 11:08:27:

These representations of the string are equivalent:

    $ perl -Mutf8 -E 'say "zur\xFCckgestellt" eq "zurückgestellt"'
    1

Schema::Loader uses Data::Dump to serialise method call arguments in the generated files, and it encodes all non-ASCII (and non-printable) characters using \x notation.

For aesthetic reasons it might be desirable to output Unicode word characters literally too, but the current output is not incorrect.

- ilmari

From [email protected] on 2017-11-21 11:43:13:

It is not really the same ...

In the real code I have to call Encode::decode('ISO-8859-15', $enum) as a quick fix.

$ cat ticket123698.pl 
use utf8;
use 5.20.0;
use Data::Dumper;
say "zur\xFCckgestellt" eq "zurückgestellt";
print Dumper("zur\xFCckgestellt","zurückgestellt");
$ perl ticket123698.pl 
1
$VAR1 = 'zur�ckgestellt';
$VAR2 = "zur\x{fc}ckgestellt";

From [email protected] on 2017-11-21 12:07:59:

"Felix Antonius Wilhelm Ostmann via RT"
<[email protected]> writes:

> It is not really the same ...

The _internal_ representation is not the same; the \x form will be
represented internally as one byte per code point ("downgraded"), while
the literal form will be utf-8-encoded ("upgraded"). Semantically they
are the same, as evidenced by "eq" returning true.
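A minimal sketch of that distinction, using only core modules. The Encode::decode call is an assumption standing in for a string that a database driver has already decoded; it is not taken from the ticket. The source below is kept ASCII-only so it does not depend on use utf8.

```perl
use strict;
use warnings;
use Encode ();

# Form written into the generated file: \xFC stays one byte per code point.
my $from_file = "zur\xFCckgestellt";

# ASSUMPTION: stand-in for a value that arrived as UTF-8 bytes and was
# decoded into a character string, as a database driver would do.
my $from_db = Encode::decode('UTF-8', "zur\xC3\xBCckgestellt");

# Same characters, so the strings compare equal and have the same length.
my $equal      = $from_file eq $from_db ? 1 : 0;
my $same_chars = length($from_file) == length($from_db) ? 1 : 0;

# Only the internal storage flag differs.
my $file_flag = utf8::is_utf8($from_file) ? 1 : 0;
my $db_flag   = utf8::is_utf8($from_db)   ? 1 : 0;

print "equal=$equal same_chars=$same_chars ",
      "file_flag=$file_flag db_flag=$db_flag\n";
```

Code that is free of the Unicode Bug should behave identically for both strings; code that inspects the flag (directly or indirectly) may not.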

> In the real code i have to make a Encode::decode('ISO-8859-15', $enum) as a quickfix. 

Please show where in the real code you have to do this.  It smells like
something you're passing it to suffering from the Unicode Bug,
i.e. treating the characters in the 128..255 range differently depending
on the internal representation (see
https://metacpan.org/pod/perlunicode#The-%22Unicode-Bug%22 for details).

> $ cat ticket123698.pl 
> use utf8;
> use 5.20.0;
> use Data::Dumper;
> say "zur\xFCckgestellt" eq "zurückgestellt";
> print Dumper("zur\xFCckgestellt","zurückgestellt");
> $ perl ticket123698.pl 
> 1
> $VAR1 = 'zur�ckgestellt';
> $VAR2 = "zur\x{fc}ckgestellt";

The different outputs here are a quirk of how Data::Dumper deals with
downgraded vs. upgraded strings (which could be viewed as an instance of
the Unicode Bug, but doesn't actually affect semantics).  The first one
is showing as � because you haven't told perl that your terminal
expects UTF-8-encoded strings.  Adding

    use open qw(:std :utf8);

to the script will make it apply a UTF-8 encoding layer to the standard
input/output/error filehandles, so non-ASCII characters show correctly.
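The effect of an encoding layer can be sketched without a terminal by printing to in-memory filehandles and inspecting the bytes. This is only an illustration; the variable names are made up, and the :encoding(UTF-8) layer is used here in place of the :utf8 shorthand above.

```perl
use strict;
use warnings;

my $word = "zur\xFCckgestellt";

# No layer: the character U+00FC goes out as the single byte 0xFC,
# which a UTF-8 terminal cannot display.
open my $raw_fh, '>', \my $raw_bytes
    or die "Cannot open in-memory handle: $!";
print {$raw_fh} $word;
close $raw_fh;

# With an encoding layer: the same character goes out as 0xC3 0xBC.
open my $utf8_fh, '>:encoding(UTF-8)', \my $utf8_bytes
    or die "Cannot open in-memory handle: $!";
print {$utf8_fh} $word;
close $utf8_fh;

printf "no layer:    %s\n", unpack 'H*', $raw_bytes;
printf "UTF-8 layer: %s\n", unpack 'H*', $utf8_bytes;
```

The string itself is the same in both cases; only the bytes written to the handle differ.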

- ilmari
-- 
"I use RMS as a guide in the same way that a boat captain would use
 a lighthouse.  It's good to know where it is, but you generally
 don't want to find yourself in the same spot." - Tollef Fog Heen

From [email protected] on 2017-11-21 13:35:39:

OK, here is the real-world scenario in pseudocode. I am using DBIx::Class + Catalyst + Template Toolkit.

ResultSet:
use Encode ();

sub enum_status {
    my ($self) = @_;
    my $list = $self->result_source->column_info('status')->{extra}{list};
    # FIXME see https://rt.cpan.org/Public/Bug/Update.html?id=123698
    # Workaround: treat the values as ISO-8859-15 bytes and decode them.
    return map { Encode::decode("ISO-8859-15", $_) } @$list;
    # Without the workaround this would simply be:
    # return @$list;
}

Catalyst-Controller:
$c->stash->{status_order} = [ $rs->enum_status ];

Template:
[% FOREACH status IN status_order %]
<a href="[% c.request.uri_with({status => status}) %]">
[% END %]

Without the FIXME workaround, the links are encoded as ISO-8859-15.
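How a link can end up Latin-encoded is sketched below. The escape_flag_dependent function is hypothetical and is NOT the Catalyst or URI implementation; it only illustrates the kind of flag-dependent behaviour that would produce the symptom described, next to a variant that always encodes to UTF-8.

```perl
use strict;
use warnings;

# Percent-encode every byte outside the RFC 3986 unreserved set.
sub percent_encode_bytes {
    my ($bytes) = @_;
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;
    return $bytes;
}

# HYPOTHETICAL buggy escaper: encodes to UTF-8 only when the internal
# flag is set, so downgraded strings leak out as Latin-1 bytes.
sub escape_flag_dependent {
    my ($text) = @_;
    my $copy = $text;
    utf8::encode($copy) if utf8::is_utf8($copy);
    return percent_encode_bytes($copy);
}

# Consistent escaper: always encodes characters to UTF-8 bytes.
sub escape_consistent {
    my ($text) = @_;
    my $copy = $text;
    utf8::encode($copy);
    return percent_encode_bytes($copy);
}

my $downgraded = "zur\xFCckgestellt";
my $upgraded   = $downgraded;
utf8::upgrade($upgraded);

print "flag-dependent, downgraded: ", escape_flag_dependent($downgraded), "\n";
print "flag-dependent, upgraded:   ", escape_flag_dependent($upgraded),   "\n";
print "consistent, downgraded:     ", escape_consistent($downgraded),     "\n";
print "consistent, upgraded:       ", escape_consistent($upgraded),       "\n";
```

If something in the chain behaves like the first function, upgrading the enum values hides the problem, while fixing the escaper removes it for every downgraded string.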


After reading your reply and the docs about the Unicode Bug, I changed the code to the following:

__PACKAGE__->add_columns(
...
  {
    data_type => "enum",
    default_value => "offen",
    extra => {
      custom_type_name => "enum_tasks_status",
      list => ["offen", "erledigt", "zur\xFCckgestellt"],
    },
    is_nullable => 0,
  },
...
);
...
# DO NOT MODIFY THIS OR ANYTHING ABOVE! md5sum:W4KhHAXiEW35h5XWiZwhFg
utf8::upgrade($_) for @{ __PACKAGE__->column_info('status')->{extra}->{list} };
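What that line does can be tried on a plain data structure. The %column_info hash below is a hypothetical stand-in for what column_info('status') returns; no DBIx::Class is involved in this sketch.

```perl
use strict;
use warnings;

# HYPOTHETICAL stand-in for the hash returned by column_info('status').
my %column_info = (
    data_type     => 'enum',
    default_value => 'offen',
    extra         => {
        custom_type_name => 'enum_tasks_status',
        list             => [ 'offen', 'erledigt', "zur\xFCckgestellt" ],
    },
);

my $list   = $column_info{extra}{list};
my @before = @$list;    # copies, to compare against afterwards

# The loop variable aliases each element, so the upgrade happens in place.
utf8::upgrade($_) for @$list;

my @flags     = map { utf8::is_utf8($_) ? 1 : 0 } @$list;
my @unchanged = map { $before[$_] eq $list->[$_] ? 1 : 0 } 0 .. $#before;

print "flags:     @flags\n";
print "unchanged: @unchanged\n";
```

The upgrade changes only the internal storage of each value; the strings still compare equal to their earlier copies.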



But in my opinion this is kind of a bug. Why are all other strings coming from the database already upgraded, but not this one?

