Automatically determine the encoding of the file #29

dkalinai · 2018-02-11T20:51:51Z

Hi there, again thanks for making this since it saves tons of time.

Could you point me to the place in the code, or explain, how the importer determines which encoding a file is in when importing? I need to extract this information and am not sure how to do that. Maybe you can give me a hint where to look. This is not a bug, more a request for information. And is there actually automatic encoding detection, or am I misinterpreting things?

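```
guard let csv = CSVImporter<[String: String]>(url: fileURL) else {
    return
}

csv.startImportingRecords(structure: { headerValues in
    // Called with the column names from the header row.
    print(headerValues)
}) { $0 }.onFinish { importedRecords in
    // Called with the imported records once the import has finished.
    print(importedRecords)
}
```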

Jeehut · 2018-02-12T09:38:49Z

I think what you're looking for is this line. It means that when creating a CSVImporter object, you can pass a parameter named encoding with your file's encoding. By default it's set to .utf8.

dkalinai · 2018-02-12T10:28:19Z

I see, so there is no way to determine the encoding automatically then. Tough :( There is a method on NSString that computes it from an NSData object. I will try to look into that on my own. Thanks for having a look.
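For illustration, here is a minimal sketch of guessing an encoding from raw bytes. It is not the NSString method mentioned above (presumably `NSString.stringEncoding(for:encodingOptions:convertedString:usedLossyConversion:)`, which applies its own heuristics). This stand-in only checks for a byte order mark and then for valid UTF-8. The name `sniffEncoding` is made up for this example.

```swift
import Foundation

/// Hypothetical helper: guesses an encoding from the bytes of a file.
/// Returns nil when there is no BOM and the bytes are not valid UTF-8.
func sniffEncoding(_ data: Data) -> String.Encoding? {
    let bytes = [UInt8](data.prefix(4))

    // UTF-32 BOMs are checked first, because FF FE is also a prefix of FF FE 00 00.
    if bytes.starts(with: [0x00, 0x00, 0xFE, 0xFF]) { return .utf32BigEndian }
    if bytes.starts(with: [0xFF, 0xFE, 0x00, 0x00]) { return .utf32LittleEndian }
    if bytes.starts(with: [0xEF, 0xBB, 0xBF]) { return .utf8 }
    if bytes.starts(with: [0xFE, 0xFF]) { return .utf16BigEndian }
    if bytes.starts(with: [0xFF, 0xFE]) { return .utf16LittleEndian }

    // No BOM: accept UTF-8 only if every byte sequence decodes cleanly.
    if String(data: data, encoding: .utf8) != nil { return .utf8 }
    return nil
}
```

A nil result means the caller still has to choose a legacy encoding (for example Latin-1) some other way; a BOM check alone cannot tell those apart.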

Jeehut · 2018-02-13T07:24:42Z

As you can see here, we already have logic in place that automatically determines the type of line ending of the file when .unknown is specified by the user. I'm not against using the exact same strategy for encoding as well. This could be done e.g. by making encoding an optional parameter on the init method, and if it is nil we could use the FileSource to determine the encoding.

Would you be up for adding this feature yourself and sending a PR with tests and docs updated? If there's a method on NSString/NSData which can handle that, then it should be pretty straightforward to implement, since the same logic for line endings is already in place. That would be really awesome. I'm reopening this issue and renaming it to describe this feature.
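To make the "optional parameter, detect when nil" idea concrete, here is a small sketch. It does not touch CSVImporter's real initializer or FileSource; `resolveEncoding` and its `detect` closure are hypothetical names, and falling back to UTF-8 when detection fails is an assumption chosen to match the current default.

```swift
import Foundation

/// Hypothetical helper, not part of CSVImporter: picks the encoding to use.
/// An explicit value wins; otherwise the detector is asked; UTF-8 is the last resort.
func resolveEncoding(
    explicit: String.Encoding?,
    detect: () -> String.Encoding?
) -> String.Encoding {
    // `??` evaluates its right-hand side lazily, so `detect` only runs
    // when no explicit encoding was passed.
    return explicit ?? detect() ?? .utf8
}
```

Passing the detector as a closure keeps the detection cost at zero for callers who already know their encoding.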

dkalinai · 2018-02-13T07:29:13Z

I can make a PR, probably in the next few days. I have already found a solution to this, by the way, and made a simple String extension that returns the encoding to me in String.Encoding format.

The only other issue, and a bit off topic here, is the delimiters (it can sometimes be ; as well) and whether one can process a string from memory as a CSV file. The NSString method I am referring to not only guesses the encoding but also returns the string to you, which would potentially need to be handled by the importer on the fly rather than from a file.

Jeehut · 2018-02-13T07:43:01Z

Sounds good. Note that one of the advantages of CSVImporter is that it's able to read big files faster and more safely since it doesn't read the entire file at once, which your solution probably does. So that's another plus for implementing this in CSVImporter.
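As a sketch of how detection could avoid loading the whole file, the helper below reads only the first bytes through FileHandle. The name `readPrefix` and the default of 4096 bytes are assumptions for illustration, not anything CSVImporter currently does.

```swift
import Foundation

/// Hypothetical helper: reads at most `byteCount` bytes from the start of a file,
/// so that encoding detection never needs the whole file in memory.
func readPrefix(of url: URL, byteCount: Int = 4096) throws -> Data {
    let handle = try FileHandle(forReadingFrom: url)
    defer { handle.closeFile() }
    // Returns fewer bytes than requested if the file is shorter.
    return handle.readData(ofLength: byteCount)
}
```

One caveat with this approach: a fixed-size prefix can end in the middle of a multi-byte character, so a detector that validates the sample has to tolerate a truncated final sequence.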

I don't really understand your other problems though. You probably would need to post some code so I can understand. Note though, that if it's a different problem than this one, it's probably better you open another issue for each problem.

gaming-hacker · 2018-03-09T18:01:43Z

How does CSVImporter handle garbage? I have a specific data structure, but it can be corrupted, or fields can be missing or added, so I need to add some regex.

Jeehut · 2018-03-12T09:19:36Z

CSVImporter generally expects a valid CSV file according to RFC 4180 which specifies:

> Within the header and each record, there may be one or more
> fields, separated by commas. Each line should contain the same
> number of fields throughout the file. Spaces are considered part
> of a field and should not be ignored. The last field in the
> record must not be followed by a comma.

When a line for example doesn't have the same number of fields, then – at the moment – the entire line is simply ignored. That's not required by the RFC though (that's why it's a "should" not a "must"), so we could implement multiple different fallback strategies and let the user choose between them.

Can you give examples of lines and how they are "corrupted"? Depending on the case, I'm perfectly okay with a little more accommodating behavior, so long as it doesn't conflict with the RFC.

Feel free to post a PR with the changes you need and I'll have a look. As long as it is an opt-in feature, is documented (in the README) and is covered by tests (your corrupted file), I'm happy to merge it!
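To illustrate what opt-in fallback strategies for a wrong field count might look like, here is a rough sketch. The enum, its cases and the `normalize` function are all hypothetical names; only the skip behavior reflects what is described above as the current handling.

```swift
/// Hypothetical opt-in strategies for rows whose field count differs from the header.
enum FieldCountMismatchStrategy {
    case skipRow        // ignore the whole line (the behavior described above)
    case padOrTruncate  // fill missing fields with "", drop surplus fields
}

/// Returns the row adjusted to `expectedCount` fields, or nil if it should be skipped.
func normalize(
    _ fields: [String],
    expectedCount: Int,
    strategy: FieldCountMismatchStrategy
) -> [String]? {
    // Rows that already match are passed through unchanged.
    if fields.count == expectedCount { return fields }

    switch strategy {
    case .skipRow:
        return nil
    case .padOrTruncate:
        let missing = max(0, expectedCount - fields.count)
        let padded = fields + Array(repeating: "", count: missing)
        return Array(padded.prefix(expectedCount))
    }
}
```

Keeping the skip case as the default would leave existing behavior untouched for anyone who does not opt in.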

Jeehut closed this as completed Feb 12, 2018

Jeehut reopened this Feb 13, 2018

Jeehut changed the title ~~Question: How to determine encoding of the file~~ Automatically determine the encoding of the file Feb 13, 2018

Jeehut added the enhancement label Feb 13, 2018
