Add c_char type #875

Closed
tiehuis opened this issue Mar 31, 2018 · 14 comments · Fixed by #15263
Labels
accepted This proposal is planned. proposal This issue suggests modifications. If it also has the "accepted" label then it is planned.
Comments

@tiehuis
Member

tiehuis commented Mar 31, 2018

See #861.

This will be identical to a u8 internally except generated header files will result in a char type instead of a uint8_t type.

We should be able to cast between the two types transparently.

var a: c_char = u8(5);

You should never see c_char unless you have to generate a char type for a c header.

@tiehuis tiehuis added the proposal This issue suggests modifications. If it also has the "accepted" label then it is planned. label Mar 31, 2018
@tiehuis tiehuis mentioned this issue Mar 31, 2018
5 tasks
@andrewrk andrewrk added the accepted This proposal is planned. label Mar 31, 2018
@andrewrk andrewrk added this to the 0.4.0 milestone Mar 31, 2018
@eduardosm
Contributor

Using C's char without the correct sign can cause ABI issues, as the following example shows.
Suppose we have the following C sources:

void test_func(signed char c);

int main(void) {
    test_func(-1);
}

and

#include <stdio.h>

void test_func(unsigned char c) {
    // should print a value between 0 and 255
    printf("%u\n", (unsigned int)c);
}

The expected output is 255

However, if I compile them with clang into an executable, with optimizations (-Os, -O1, -O2 or -O3) for x86_64, I get 4294967295. Equivalently, you can declare the parameter with type char in both files and compile the second one with -funsigned-char.

This is probably due to zeroext and signext LLVM attributes (https://llvm.org/docs/LangRef.html#parameter-attributes).

I think that this is what is happening:
In the first source, to pass the parameter to the function, the value is sign extended to place it in a register because it is declared as signed.
In the second source, the parameter is declared as unsigned, so the compiler assumes that it has been zero extended. As an optimization, to convert the parameter from unsigned char to unsigned int, instead of reading the lower 8 bits and doing a zero extension, it reads 32 bits from the register, assuming that the remaining 24 bits are zero, resulting in an incorrect value.

That means that declaring c_char as u8 can lead to problems when the C compiler treats char as signed.

@andrewrk
Member

andrewrk commented Jun 5, 2018

test.c:

void test_func_signed(signed char c);
void test_func_unsigned(unsigned char c);

Running zig translate-c test.c produces:

pub extern fn test_func_signed(c: i8) void;
pub extern fn test_func_unsigned(c: u8) void;

Zig correctly understands signed char and unsigned char.

@eduardosm
Contributor

void test_func(char c);

produces

pub extern fn test_func(c: u8) void;

char can be signed or unsigned, but it is usually signed. I used signed char and unsigned char on the example to emphasize the signedness.

@eduardosm
Contributor

eduardosm commented Jun 5, 2018

To better show how Zig is involved in this issue, let's consider this example.

test.h:

extern void test_func(char c);

test.c:

#include <stdio.h>
#include "test.h"

void test_func(char c) {
    // should print a value between -128 and 127
    printf("%i\n", (int)(signed char)c);
}

test.zig:

const c = @cImport({
    @cInclude("test.h");
});

pub fn main() void {
    c.test_func(255);
}

Build and run:

clang -O3 -c test.c
zig run --libc-include-dir . test.zig --object test.o --object /usr/lib/libc.so

The output is 255, although we expected -1. If you pass -funsigned-char to clang, then it correctly prints -1.

@andrewrk
Member

andrewrk commented Jun 5, 2018

If you are faced with a .h file which has the function prototype:

void print_char(char c);

Given that char can be signed or unsigned, as the API consumer of this function, you would have to accept that passing a value with the most significant bit set may be interpreted in 2 possible ways. -1 is one such value.

The difference between zig choosing i8 and u8 applies to precisely the range of values that the API consumer cannot be sure about, and therefore zig is free to choose either value.

Therefore print_char(255) is inherently ambiguous. It may print -1 or it may print 255. If the C API producer wished to make it explicit, they could do that. Given that they have chosen to leave it ambiguous, zig chooses to use u8.

@thejoshwolfe
Contributor

thejoshwolfe commented Jun 6, 2018

From the C language spec:

> The implementation shall define char to have the same range, representation, and behavior as either signed char or unsigned char.

Zig is an implementation of a subset of the C language. In the Zig implementation of C, char has the same range, representation, and behavior as unsigned char.

Every implementation must make a choice one way or the other. Zig baked its choice into the semantics of the C-to-Zig interface and guarantees that it will not change without a major version bump.

The indeterminate signedness of char in C is not a restriction on implementations but on C programmers. A programmer who puts char in their API bears the burden of the ambiguity. An implementation that decides the signedness of char is doing what all implementations must do.

@eduardosm
Contributor

If you look at my example, I am explicitly casting c to signed char, so it should always print -1 regardless of char's signedness.

> Zig is an implementation of a subset of the C language. In the Zig implementation of C, char has the same range, representation, and behavior as unsigned char.

Shouldn't Zig's implementation be compatible with the system's C (at least in terms of ABI compatibility)?

This does not mean that char needs to be translated to i8. If you add the c_char type, you can make it behave like a u8 everywhere in Zig code, except when it is used as a parameter for functions with C calling conventions. In that case, the value would be sign- or zero-extended (that is, the signext or zeroext LLVM attribute would be set), depending on how the system's C implements it.

@andrewrk
Member

andrewrk commented Jun 6, 2018

> Shouldn't Zig's implementation be compatible with the system's C (at least in terms of ABI compatibility)?

Yes, that is correct. I need to go back and look at your examples more closely and think about the ABI more carefully. Thanks for taking the time to make sure I understood your point, I appreciate it.

To clarify, Zig guarantees compatibility with the C ABI of the chosen target, which includes a C environment, for example you can choose MSVC or GNU as your "C environment".

@shawnl
Contributor

shawnl commented Jan 20, 2020

char and uint8_t are not identical, because char pointers are exempt from strict aliasing while uint8_t pointers are not.

@daurnimator
Contributor

As mentioned in #3999 (comment), some platforms have char as signed, some as unsigned. Also note the C compiler flag -fsigned-char

@pixelherodev
Contributor

C's char can actually be semantically neither signed nor unsigned - which is all the more reason c_char is needed. It is currently impossible to define an exported function in Zig which takes in a C string correctly. Generated C headers need to be manually adjusted to account for this.

ctest.c:6:24: warning: pointer targets in initialization of 'const uint8_t *' {aka 'const unsigned char *'} from 'char *' differ in signedness [-Wpointer-sign]
    6 |     const uint8_t *t = "H";
      |                        ^~~
ctest.c:7:23: warning: pointer targets in initialization of 'const int8_t *' {aka 'const signed char *'} from 'char *' differ in signedness [-Wpointer-sign]
    7 |     const int8_t *v = "H";
      |                       ^~~

@Mouvedia

If you #include <limits.h> and then look at CHAR_MIN, you can find out if plain char is signed or unsigned (if CHAR_MIN is less than 0 or equal to 0)

CHAR_MIN is SCHAR_MIN or 0, I wonder if that could be introspected somehow.

@matu3ba
Contributor

matu3ba commented Dec 13, 2022

> CHAR_MIN is SCHAR_MIN or 0, I wonder if that could be introspected somehow.

Yes, the sign of char can be introspected with macros:

#ifdef IS_SIGNED
#error "IS_SIGNED already defined"
#else
#define IS_SIGNED(Type) (((Type)-1) < 0)
#endif

For more context, see also marler8997/ziglibc#1: ziglibc must eventually decide how to handle this internally with respect to C compatibility guarantees.

@andrewrk andrewrk modified the milestones: 0.11.0, 0.12.0 Apr 9, 2023
andrewrk added a commit that referenced this issue Apr 13, 2023
andrewrk added a commit that referenced this issue Apr 13, 2023
@tau-dev
Contributor

tau-dev commented Apr 10, 2024

> Char and uint_8 are not identical, because char pointers are exempt from strict aliasing while uint_8 pointers are not.

Most of the time, uint8_t is an alias for unsigned char, and any value may be accessed through char, unsigned char, or signed char pointers. However, char is still a distinct type from both unsigned char and signed char, so calling a function defined as void print(char c) through a declaration void print(unsigned char) is undefined behavior, even when the platform's char is unsigned by default. I think LLVM chooses not to exploit this UB, but it is something to keep in mind for the C backend.

Sorry for the nitpick, but strict aliasing rules are just finicky : )
