writing SPSS sav file with long strings changes the column names and values #260

Open
ofajardo opened this issue Dec 15, 2021 · 2 comments

Comments

@ofajardo

ofajardo commented Dec 15, 2021

While doing some experiments I found the following strange thing:

The program below writes three variables: the first is a string and the last two are doubles. I then convert the resulting file to CSV with the readstat binary and inspect the CSV visually. If the length of the string is 756 or less (it does not matter whether the last character is the null character, as here, or a normal character), the CSV looks as expected, i.e.:

"aaaaa2","y","a0"
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",0.000000,0.000000

But if the length of the string is 757 (again, independently of the last character being null or not), then I get this:

"AAAAA2","AAAAA1","aaaaa2"
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"

Here there are two new variables, AAAAA2 and AAAAA1, the two numeric variables have disappeared, and the values are all "a"s. Every value seems to be 255 characters long, except the last one, which is 246.

Maybe I did something wrong in the program, sorry if that is the case.

I am also not sure whether this is of any help; if not, feel free to ignore it and close the issue. I suspect this may be related to #236 and #241.

Here is the program:

#include "readstat.h"
#include <stdio.h>   /* printf */
#include <unistd.h>
#include <fcntl.h>

/* A callback for writing bytes to your file descriptor of choice */
/* The ctx argument comes from the readstat_begin_writing_xxx function */
static ssize_t write_bytes(const void *data, size_t len, void *ctx) {
    int fd = *(int *)ctx;
    return write(fd, data, len);
}

int main(int argc, char *argv[]) {
    readstat_writer_t *writer = readstat_writer_init();
    readstat_set_data_writer(writer, &write_bytes);
    readstat_writer_set_file_label(writer, "My data set");

    int row_count = 1;
    int strlen = 757; /* with 756 still fine */

    readstat_variable_t *variable1 = readstat_add_variable(writer, "aaaaa2", READSTAT_TYPE_STRING, strlen);
    readstat_variable_set_label(variable1, "x");
    readstat_variable_t *variable2 = readstat_add_variable(writer, "y", READSTAT_TYPE_DOUBLE, 0);
    readstat_variable_set_label(variable2, "y");
    readstat_variable_t *variable3 = readstat_add_variable(writer, "a0", READSTAT_TYPE_DOUBLE, 0);
    readstat_variable_set_label(variable3, "z");

    int fd = open("test.sav", O_CREAT | O_WRONLY | O_TRUNC, 0644); /* O_CREAT requires a mode argument */
    readstat_begin_writing_sav(writer, &fd, row_count);

    char weirdstr[strlen];
    int i;
    for (i=0; i<strlen-1; i++){
        weirdstr[i] = 'a';
    }
    weirdstr[strlen-1] = '\0'; /* seems to be optional: the result is the same if the last character is 'a' */
    printf("%s\n", weirdstr);

    for (i=0; i<row_count; i++) {
        readstat_begin_row(writer);
        readstat_insert_string_value(writer, variable1, weirdstr);
        readstat_insert_double_value(writer, variable2, 1.0 * i);
        readstat_insert_double_value(writer, variable3, 1.0 * i);
        readstat_end_row(writer);
    }

    readstat_end_writing(writer);
    readstat_writer_free(writer);
    close(fd);

    return 0;
}
@ofajardo
Author

ofajardo commented Dec 15, 2021

Here is another variation. My aim was to reproduce #241. If I introduce an international character at the end, the result is the same as reported above: with a length shorter than 756 everything looks fine, including the international character, but with 757 I get the split shown above, again with the international character appearing correctly. However, if I replace the numbers with NaNs, the file is written without error, but converting it to CSV with readstat then fails with:

Error processing test.sav: Unable to convert string to the requested encoding (invalid byte sequence)

Actually, an international character is not needed at all: it also happens with normal characters, as shown below, so this is caused by the NaNs. I see the same effect using Python (no international character is needed to cause the error); there, too, I need 757 characters to cause the error, and with 756 it is fine. In Python with the international character, 756 characters were enough to cause the issue, I guess because the international character takes two bytes.

Program

#include "readstat.h"
#include <math.h>    /* NAN */
#include <stdio.h>   /* printf */
#include <unistd.h>
#include <fcntl.h>
#include <locale.h>

/* A callback for writing bytes to your file descriptor of choice */
/* The ctx argument comes from the readstat_begin_writing_xxx function */
static ssize_t write_bytes(const void *data, size_t len, void *ctx) {
    int fd = *(int *)ctx;
    return write(fd, data, len);
}

int main(int argc, char *argv[]) {
    setlocale(LC_ALL, "en_US.UTF-8");
    readstat_writer_t *writer = readstat_writer_init();
    readstat_set_data_writer(writer, &write_bytes);
    readstat_writer_set_file_label(writer, "My data set");

    int row_count = 1;
    int strlen = 757; /*with 756 still fine*/

    readstat_variable_t *variable1 = readstat_add_variable(writer, "aaaaa2", READSTAT_TYPE_STRING, strlen);
    readstat_variable_set_label(variable1, "x");
    readstat_variable_t *variable2 = readstat_add_variable(writer, "y", READSTAT_TYPE_DOUBLE, 0);
    readstat_variable_set_label(variable2, "y");
    readstat_variable_t *variable3 = readstat_add_variable(writer, "a0", READSTAT_TYPE_DOUBLE, 0);
    readstat_variable_set_label(variable3, "z");

    int fd = open("test.sav", O_CREAT | O_WRONLY | O_TRUNC, 0644); /* O_CREAT requires a mode argument */
    readstat_begin_writing_sav(writer, &fd, row_count);

    char weirdstr[strlen];
    int i;
    /* Note: this loop stops at strlen-2, so weirdstr[strlen-2] is never assigned. */
    for (i=0; i<strlen-2; i++){
        weirdstr[i] = 'a';
    }

    weirdstr[strlen-1] = '\0';
    printf("%s\n", weirdstr);
    /* I could also insert an international character at the end, but the important thing is the NaNs */
    /* char extra[] = "ü\"";
    strcat(weirdstr, extra); */
 
    for (i=0; i<row_count; i++) {
        readstat_begin_row(writer);
        readstat_insert_string_value(writer, variable1, weirdstr);
        readstat_insert_double_value(writer, variable2, NAN);
        readstat_insert_double_value(writer, variable3, NAN);
        readstat_end_row(writer);
    }

    readstat_end_writing(writer);
    readstat_writer_free(writer);
    close(fd);

    return 0;
}

@evanmiller
Contributor

Thanks for the report. The best way to confirm a bug is to create a test case in

https://github.com/WizardMac/ReadStat/blob/master/src/test/test_list.h

This will perform a round-trip on the data and confirm whether the same values are read and written. (You can run the test suite with "make check".)

I will dig further into it when I get a chance.

2 participants