Notes contain non-visible characters #137

Open
tnaumann opened this issue Oct 3, 2016 · 2 comments

Comments

@tnaumann
Contributor

tnaumann commented Oct 3, 2016

There are three non-visible characters (\x7f, \x14, and \x13) that appear in the notes. These characters cause issues when working with many text processing tools. Removing these characters will make it easier to work with the notes and should not change the underlying meaning.

Currently, when using tools that process a whole directory (e.g., cTAKES), one can work around this issue by manually removing the offending characters:

find notes -type f -exec sed -i 's/\x7f//g; s/\x14//g; s/\x13//g' {} +
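For anyone without GNU sed, the same workaround can be sketched in Python. This is only an illustration: the names `strip_offending` and `clean_directory` are hypothetical, and it assumes the notes use an ASCII-compatible encoding (such as UTF-8 or Latin-1), so that removing these three byte values cannot damage a multi-byte character.

```python
from pathlib import Path

# DEL (\x7f), DC4 (\x14), DC3 (\x13): the three characters named in this issue.
OFFENDING = b"\x7f\x14\x13"


def strip_offending(data: bytes) -> bytes:
    """Return data with every occurrence of the three bytes removed."""
    return data.translate(None, OFFENDING)


def clean_directory(notes_dir: str) -> int:
    """Rewrite affected files under notes_dir in place; return how many changed."""
    changed = 0
    for path in Path(notes_dir).rglob("*"):
        if not path.is_file():
            continue
        original = path.read_bytes()
        cleaned = strip_offending(original)
        if cleaned != original:  # leave clean files untouched
            path.write_bytes(cleaned)
            changed += 1
    return changed
```

Working on raw bytes avoids having to guess the encoding of each note, and only files that actually contain one of the characters are rewritten.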

These characters appear in ~700 notes. A truncated character distribution shows:

[ # ...
 ('\x13', 677),
 ('\x14', 49),
 ('\x7f', 2)]
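
The exact script behind the distribution above is not shown. A minimal sketch of how a similar tally could be produced follows; `control_char_counts` is a hypothetical name, and treating "non-visible" as "ASCII control byte other than tab, newline, or carriage return" is an assumption.

```python
from collections import Counter
from pathlib import Path

# Tab, line feed, and carriage return are ordinary whitespace, so skip them.
ORDINARY_WHITESPACE = {0x09, 0x0A, 0x0D}


def is_ascii_control(byte: int) -> bool:
    """True for ASCII control bytes (below 0x20, or DEL) that are not whitespace."""
    return (byte < 0x20 or byte == 0x7F) and byte not in ORDINARY_WHITESPACE


def control_char_counts(notes_dir: str) -> list[tuple[str, int]]:
    """Count control characters across all files, most common first."""
    counts: Counter[str] = Counter()
    for path in Path(notes_dir).rglob("*"):
        if path.is_file():
            data = path.read_bytes()
            counts.update(chr(b) for b in data if is_ascii_control(b))
    return counts.most_common()
```

Note that this counts character occurrences, not affected notes, so a note containing the same character twice contributes two to its total.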

There does not appear to be overlap among the affected notes.

find notes -type f | xargs grep --color='auto' -P -l "[\x7f]" | sort | uniq -c
      1 notes/1062359
      1 notes/862440
find notes -type f | xargs grep --color='auto' -P -l "[\x14]" | sort | uniq -c
      1 notes/1437693
      1 notes/1482241
...
find notes -type f | xargs grep --color='auto' -P -l "[\x13]" | sort | uniq -c
      1 notes/1901399
      1 notes/1902291
...
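
The overlap claim can also be checked programmatically rather than by comparing the listings by eye. The following is a sketch under the same directory layout as the commands above (`notes/`); the function names `files_by_char` and `overlaps` are hypothetical.

```python
from itertools import combinations
from pathlib import Path

CHARS = (b"\x7f", b"\x14", b"\x13")


def files_by_char(notes_dir: str) -> dict[bytes, set[str]]:
    """Map each character to the set of relative file paths that contain it."""
    hits: dict[bytes, set[str]] = {c: set() for c in CHARS}
    for path in Path(notes_dir).rglob("*"):
        if path.is_file():
            data = path.read_bytes()
            for c in CHARS:
                if c in data:
                    hits[c].add(str(path.relative_to(notes_dir)))
    return hits


def overlaps(hits: dict[bytes, set[str]]) -> dict[tuple[bytes, bytes], set[str]]:
    """For every pair of characters, the files that contain both."""
    return {(a, b): hits[a] & hits[b] for a, b in combinations(hits, 2)}
```

If every set returned by `overlaps(files_by_char("notes"))` is empty, no note contains more than one of the three characters.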
@alistairewj
Member

Thanks for the report. I think there are some issues in the conversion from the original Microsoft SQL Server database to Oracle, which was done before Tom and I got here. The notes are actually stored in a hexadecimal format and need decoding to be readable as plain text. I tried decoding them as UTF-8, but then Latin-1 characters sneak in (in particular the Latin-1 character for a space, which is \x13 I think). I also tried decoding them as Latin-1, but then I received other errors. When we re-extract the notes, this is definitely something we will pay attention to.

Have you noticed that it only occurs in Metavision notes? That would be consistent with what I've found. The Metavision notes have the categories Nursing, Rehab Services, Case Management, General, Consult, Nutrition, Social Work, Pharmacy, Physician, and Respiratory. The rest are sourced elsewhere (Nursing/other is from CareVue, and the others are from the hospital database).
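
To make the decoding step described in this comment concrete, here is a rough sketch of hex decoding with a UTF-8 attempt and a Latin-1 fallback. It is not the actual extraction code, which is not shown in this thread: `decode_hex_note` is a hypothetical helper, the sample bytes are made up, and it assumes the stored value is a plain hex string without a `0x` prefix.

```python
def decode_hex_note(hex_text: str) -> tuple[str, str]:
    """Decode a hex-encoded note; return (text, name of the encoding that worked)."""
    raw = bytes.fromhex(hex_text)  # raises ValueError on non-hex input
    try:
        # Strict UTF-8 fails loudly on bytes that are not valid UTF-8.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this never fails,
        # but it may silently produce the wrong characters.
        return raw.decode("latin-1"), "latin-1"


# Made-up samples: the same word encoded two different ways.
print(decode_hex_note("636166c3a9"))  # valid UTF-8
print(decode_hex_note("636166e9"))    # not valid UTF-8, falls back to Latin-1
```

Control characters such as \x13 decode identically under both encodings, so the choice of decoder alone would not remove them from the text.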

@tnaumann
Contributor Author

tnaumann commented Oct 6, 2016

They seem to appear in CareVue. If you look at the breakdown of categories, it's primarily Nursing/other.

mimic=# select category, count(1) from mimiciii.noteevents where text ~ '\x7f' group by category;
 category  | count
-----------+-------
 Radiology |     2
(1 row)
mimic=# select category, count(1) from mimiciii.noteevents where text ~ '\x14' group by category;
   category    | count
---------------+-------
 Nursing/other |    47
(1 row)
mimic=# select category, count(1) from mimiciii.noteevents where text ~ '\x13' group by category;
   category    | count
---------------+-------
 Nursing/other |   593
(1 row)
