Characters in text files are valid according to declared encoding #450

Open
jeanetteclark opened this issue Apr 23, 2024 · 0 comments

jeanetteclark commented Apr 23, 2024

Status : ⌛ Not Started

Description

Check that the text in each file falls within the valid range for its declared encoding.

e.g., ASCII files only contain bytes in the range \x00 to \x7F
e.g., files declared with a Unicode encoding only contain byte sequences that are well-formed in that encoding (e.g., no invalid or truncated multi-byte sequences in UTF-8)
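
As a rough illustration of the check described above, here is a minimal Python sketch that tries a strict decode of a file's raw bytes under the declared encoding and reports where the first invalid byte is. The function name `find_invalid_byte` is hypothetical and not part of any existing check suite; this only shows the idea, not a proposed implementation.

```python
import codecs
from typing import Optional


def find_invalid_byte(data: bytes, encoding: str) -> Optional[int]:
    """Return None if `data` decodes cleanly under `encoding`,
    otherwise the byte offset where decoding first fails.

    `find_invalid_byte` is a hypothetical helper for illustration only.
    """
    # Raises LookupError early if the declared encoding name is unknown.
    codecs.lookup(encoding)
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


if __name__ == "__main__":
    print(find_invalid_byte(b"plain text", "ascii"))
    print(find_invalid_byte(b"caf\xe9", "ascii"))
    print(find_invalid_byte("café".encode("utf-8"), "utf-8"))
```

One caveat: single-byte encodings such as Latin-1 accept every byte value, so a strict decode can never fail for them and the check is only informative for encodings like ASCII or UTF-8.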

Priority

  • Data Quality: Required

Issues

  • Most files don't have a declared encoding, so I'm not sure how we would check this other than by assuming that everything we see is UTF-8 (or maybe ASCII?) unless declared otherwise. Thoughts, @mbjones?
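
One possible way to handle the missing-declaration case, sketched in Python under the assumption that UTF-8 is the default: since ASCII is a strict subset of UTF-8, a UTF-8 default also accepts pure-ASCII files, so there may be no need to choose between the two. The function `check_text` and the shape of its result are hypothetical, purely for illustration.

```python
from typing import Optional


def check_text(data: bytes, declared: Optional[str] = None) -> dict:
    """Validate `data` against the declared encoding, or against an
    assumed default of UTF-8 when nothing is declared.

    `check_text` and its result keys are hypothetical names.
    """
    declared_clean = (declared or "").strip()
    assumed = declared_clean == ""
    encoding = "utf-8" if assumed else declared_clean
    try:
        # Strict decoding fails on any byte sequence invalid in `encoding`.
        # An unrecognized encoding name raises LookupError instead.
        data.decode(encoding, errors="strict")
        valid = True
    except UnicodeDecodeError:
        valid = False
    # Record whether the encoding was assumed so a report can flag it.
    return {"encoding": encoding, "assumed": assumed, "valid": valid}


if __name__ == "__main__":
    print(check_text(b"abc"))
    print(check_text(b"\xff\xfe"))
    print(check_text(b"caf\xe9", "latin-1"))
```

Reporting the `assumed` flag alongside the result would let a check distinguish "invalid for the declared encoding" from "invalid under the default we guessed".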

Procedure

  • In R, we could use validUTF8(), which reports whether each element of a character vector is valid UTF-8.
Projects
Status: Backlog