Automatically determine the encoding of the file #29

Open
dkalinai opened this issue Feb 11, 2018 · 7 comments

Comments

@dkalinai

Hi there, thanks again for making this, it saves tons of time.

Could you point me to the place in the code, or explain, how the importer determines which encoding a file is in when importing? I need to extract this information somehow and am not sure how to do that. Maybe you can give me a hint where to look. This is not a bug, more a request for information. Also, is there actually automatic encoding detection, or am I misinterpreting things?

```
guard let csv = CSVImporter<[String: String]>(url: fileURL) else {
    return
}

csv.startImportingRecords(structure: { (headerValues) -> Void in
    print(headerValues)
}) { $0 }.onFinish { importedRecord in
    print(importedRecord)
}
```

@Jeehut
Member

Jeehut commented Feb 12, 2018

I think what you're looking for is this line. It means that when creating a CSVImporter object, you can pass a parameter named encoding with your file's encoding. By default it's set to .utf8.

@Jeehut Jeehut closed this as completed Feb 12, 2018
@dkalinai
Author

I see, so there is no way to determine the encoding automatically then. Tough :( There is a method on NSString that computes it from an NSData object. I will try to look into that on my own then. Thanks for having a look.
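
For reference, the NSString method mentioned here is presumably `stringEncoding(for:encodingOptions:convertedString:usedLossyConversion:)`. Below is a minimal sketch of wrapping it, assuming an Apple platform and that the whole file already sits in memory as `Data`. The function name `detectEncoding(of:)` is made up for illustration and is not part of CSVImporter.

```swift
import Foundation

/// Hypothetical helper: asks Foundation to guess the text encoding of raw bytes.
/// Returns nil when Foundation cannot settle on an encoding.
func detectEncoding(of data: Data) -> String.Encoding? {
    var converted: NSString?                    // receives the decoded string, unused here
    var usedLossyConversion: ObjCBool = false   // set to true if characters had to be dropped
    let rawValue = NSString.stringEncoding(
        for: data,
        encodingOptions: nil,
        convertedString: &converted,
        usedLossyConversion: &usedLossyConversion
    )
    // A raw value of 0 signals that no encoding could be determined.
    guard rawValue != 0 else { return nil }
    return String.Encoding(rawValue: rawValue)
}
```

Since the result is a guess, it is probably wise to treat it as a default that the caller can still override.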

@Jeehut
Member

Jeehut commented Feb 13, 2018

As you can see here, we already have logic in place which automatically determines the type of line ending of the file when .unknown is specified by the user. I'm not against using the exact same strategy for encoding as well. This could be done, for example, by making encoding an optional parameter on the init method, and if it is nil we could use the FileSource to determine the encoding.

Would you be up for adding this feature yourself and sending a PR with tests and docs updated? If there's a method on NSString/NSData which can handle that, then it should be pretty straightforward to implement, since the same logic for line endings is already in place. That would be really awesome. I'm reopening this issue and renaming it to describe this feature.
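
To make the "optional encoding" idea concrete, here is a rough sketch that resolves a nil encoding by looking only at the first few bytes of the file. It checks for a byte order mark, which is a simpler technique than the NSString-based detection discussed in this thread and only works for files that start with one. The names `sniffEncoding(fromPrefix:)` and `resolveEncoding(_:prefix:)` are hypothetical and not part of CSVImporter.

```swift
import Foundation

/// Hypothetical helper: maps a leading byte order mark (BOM) to an encoding.
/// Returns nil when the bytes do not start with a known BOM.
func sniffEncoding(fromPrefix bytes: [UInt8]) -> String.Encoding? {
    // UTF-32 marks come before UTF-16 because the UTF-32 LE mark
    // begins with the same two bytes as the UTF-16 LE mark.
    let knownMarks: [(mark: [UInt8], encoding: String.Encoding)] = [
        ([0xEF, 0xBB, 0xBF], .utf8),
        ([0xFF, 0xFE, 0x00, 0x00], .utf32LittleEndian),
        ([0x00, 0x00, 0xFE, 0xFF], .utf32BigEndian),
        ([0xFF, 0xFE], .utf16LittleEndian),
        ([0xFE, 0xFF], .utf16BigEndian),
    ]
    for candidate in knownMarks where bytes.starts(with: candidate.mark) {
        return candidate.encoding
    }
    return nil
}

/// Hypothetical helper: an explicit encoding wins, then a sniffed one,
/// then the .utf8 default mentioned earlier in the thread.
func resolveEncoding(_ explicit: String.Encoding?, prefix: [UInt8]) -> String.Encoding {
    return explicit ?? sniffEncoding(fromPrefix: prefix) ?? .utf8
}
```

Because only a handful of leading bytes are inspected, this fits an importer that streams the file instead of loading it whole.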

@Jeehut Jeehut reopened this Feb 13, 2018
@Jeehut Jeehut changed the title from "Question: How to determine encoding of the file" to "Automatically determine the encoding of the file" Feb 13, 2018
@dkalinai
Author

I can make a PR, probably in the next few days. By the way, I have already found a solution to this and made a simple String extension that returns the encoding to me as a String.Encoding.

The only other issue, and a bit off topic here, is the delimiters (sometimes ; is used as well), and whether one can process a string from memory as a CSV file. The NSString method I am referring to not only guesses the encoding but also returns the converted string, which would potentially need to be handled by the importer on the fly rather than read from a file.

@Jeehut
Member

Jeehut commented Feb 13, 2018

Sounds good. Note that one of the advantages of CSVImporter is that it's able to read big files faster and more safely since it doesn't read the entire file at once, which your solution probably does. So that's another plus for implementing this in CSVImporter.

I don't really understand your other problems though. You probably would need to post some code so I can understand. Note though, that if it's a different problem than this one, it's probably better you open another issue for each problem.

@gaming-hacker

How does CSVImporter handle garbage? I have a specific data structure, but it can be corrupted, or fields can be missing or added, so I need to add some regex.

@Jeehut
Member

Jeehut commented Mar 12, 2018

CSVImporter generally expects a valid CSV file according to RFC 4180 which specifies:

Within the header and each record, there may be one or more
fields, separated by commas. Each line should contain the same
number of fields throughout the file. Spaces are considered part
of a field and should not be ignored. The last field in the
record must not be followed by a comma.

When a line for example doesn't have the same number of fields, then – at the moment – the entire line is simply ignored. That's not required by the RFC though (that's why it's a "should" not a "must"), so we could implement multiple different fallback strategies and let the user choose between them.

Can you give examples of lines and how they are "corrupted"? Depending on the case, I'm perfectly okay with a little more accommodating behavior, so long as it doesn't conflict with the RFC.

Feel free to post a PR with the changes you need and I'll have a look. As long as it is an opt-in feature, is documented (in the README) and is covered by tests (your corrupted file), I'm happy to merge it!
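
To illustrate what such an opt-in fallback could look like, here is a small sketch. It is not existing CSVImporter API. The enum `FieldCountStrategy` and the function `normalize(_:expectedCount:strategy:)` are hypothetical names, and the sketch assumes the line has already been split into fields.

```swift
/// Hypothetical strategies for lines whose field count differs from the header.
enum FieldCountStrategy {
    case skipLine        // matches the current behavior: the line is ignored
    case padOrTruncate   // assumed opt-in fallback: fill with "" or drop extras
}

/// Returns the fields adjusted to `expectedCount`, or nil if the line should be skipped.
func normalize(_ fields: [String], expectedCount: Int, strategy: FieldCountStrategy) -> [String]? {
    // Lines with the right number of fields pass through untouched.
    if fields.count == expectedCount { return fields }
    switch strategy {
    case .skipLine:
        return nil
    case .padOrTruncate:
        if fields.count > expectedCount {
            // Too many fields: keep only the leading ones.
            return Array(fields.prefix(expectedCount))
        }
        // Too few fields: append empty strings until the count matches.
        return fields + Array(repeating: "", count: expectedCount - fields.count)
    }
}
```

Keeping `.skipLine` as the default would leave existing behavior unchanged for anyone who does not opt in.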
