There are three non-visible characters (\x7f, \x14, and \x13) that appear in the notes. These characters cause issues when working with many text processing tools. Removing these characters will make it easier to work with the notes and should not change the underlying meaning.
Currently, when using tools that process a whole directory (e.g., cTAKES), one can work around this issue by manually removing the offending characters:
find notes -type f -exec sed -i 's/\x7f//g; s/\x14//g; s/\x13//g' {} +
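For anyone loading the notes in Python rather than editing files on disk, the same removal can be sketched in memory. This is a minimal sketch, not part of the original report: the names `OFFENDING_CHARS` and `strip_offending_chars` are hypothetical, and it assumes each note is already available as a `str`.

```python
# The three code points reported in this issue: DC3 (\x13), DC4 (\x14), DEL (\x7f).
OFFENDING_CHARS = "\x13\x14\x7f"

# Translation table that deletes exactly those characters and nothing else.
_DELETE_TABLE = str.maketrans("", "", OFFENDING_CHARS)


def strip_offending_chars(text: str) -> str:
    """Return `text` with the three reported control characters removed."""
    return text.translate(_DELETE_TABLE)


if __name__ == "__main__":
    # Made-up sample text, not taken from the dataset.
    sample = "pt resting\x13 comfortably\x14.\x7f"
    print(repr(strip_offending_chars(sample)))
```

Like the `sed` command, this deletes the characters outright rather than replacing them with a space, so adjacent words stay as they were in the source text.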
These characters appear in ~700 notes. A truncated character distribution shows:
[ # ...
('\x13', 677),
('\x14', 49),
('\x7f', 2)]
There does not appear to be overlap among the affected notes.
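A distribution like the one above, and the overlap check, could be reproduced roughly as follows. This is only a sketch under assumptions: the notes are taken to be an in-memory sequence of strings, "suspicious" is defined here as any C0 control character or DEL other than newline, carriage return, and tab, and all function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Sequence

# Ordinary whitespace controls that should not be flagged.
ALLOWED_CONTROLS = {"\n", "\r", "\t"}


def is_suspicious(ch: str) -> bool:
    """True for C0 control characters and DEL, except common whitespace."""
    return (ord(ch) < 0x20 or ord(ch) == 0x7F) and ch not in ALLOWED_CONTROLS


def control_char_distribution(notes: Iterable[str]) -> Counter:
    """Count every occurrence of a suspicious character across all notes."""
    counts: Counter = Counter()
    for text in notes:
        counts.update(ch for ch in text if is_suspicious(ch))
    return counts


def notes_with_multiple_kinds(notes: Sequence[str]) -> List[int]:
    """Indices of notes that contain more than one distinct suspicious character."""
    return [
        i
        for i, text in enumerate(notes)
        if len({ch for ch in text if is_suspicious(ch)}) > 1
    ]


if __name__ == "__main__":
    # Made-up sample notes, not taken from the dataset.
    notes = ["bp stable\x13", "resp even\x14 unlabored", "no issues\n"]
    print(control_char_distribution(notes).most_common())
    print(notes_with_multiple_kinds(notes))
```

An empty list from `notes_with_multiple_kinds` would correspond to the "no overlap" observation. Note that this counts character occurrences, which can be higher than the number of affected notes.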
Thanks for the report. I think there are some issues in the conversion from the original Microsoft SQL Server to Oracle, which was done before Tom and I got here. The notes are actually stored in a hexadecimal format and need decoding in order to be readable as plain text. I tried decoding it as UTF-8, but then you get Latin-1 characters sneaking in (in particular the Latin-1 character for a space, which is \x13 I think). I also tried decoding it as Latin-1, but then I received other errors. When we re-extract the notes this is definitely something we will pay attention to. Have you noticed that it only occurs in Metavision notes? That would be consistent with what I've found. The Metavision notes have categories: Nursing, Rehab Services, Case Management, General, Consult, Nutrition, Social Work, Pharmacy, Physician, Respiratory. The rest are sourced elsewhere (Nursing/other is CareVue, and the others are from the hospital database).
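To make the decoding step described above concrete, here is a rough sketch of "try UTF-8, then fall back to Latin-1" for a hex-encoded note. It is an illustration only: the actual storage format and extraction code are not shown in this thread, so the assumption that each note is a plain hex string, and the function name `decode_hex_note`, are hypothetical.

```python
def decode_hex_note(hex_text: str) -> str:
    """Decode a hex-encoded note, preferring UTF-8 and falling back to Latin-1.

    Assumes `hex_text` holds pairs of hexadecimal digits (whitespace is ignored).
    """
    raw = bytes.fromhex(hex_text)
    try:
        # Strict UTF-8: raises if the bytes are not valid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte 0x00-0xFF to a code point, so this cannot fail.
        return raw.decode("latin-1")


if __name__ == "__main__":
    # "Hello" encoded as hex; made-up sample, not taken from the dataset.
    print(decode_hex_note("48656c6c6f"))
```

Two caveats. Python's Latin-1 decoder accepts every byte, so this sketch does not reproduce the errors mentioned for the Latin-1 attempt, which presumably came from a different toolchain. And bytes below 0x80, including 0x13, 0x14, and 0x7f, decode to the same code points under both encodings, so a fallback like this would not by itself remove the characters reported in this issue.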
They seem to appear in CareVue. If you look at the breakdown of categories, it's primarily Nursing/other.
mimic=# select category, count(1) from mimiciii.noteevents where text ~ '\x7f' group by category;
 category  | count
-----------+-------
 Radiology |     2
(1 row)

mimic=# select category, count(1) from mimiciii.noteevents where text ~ '\x14' group by category;
   category    | count
---------------+-------
 Nursing/other |    47
(1 row)

mimic=# select category, count(1) from mimiciii.noteevents where text ~ '\x13' group by category;
   category    | count
---------------+-------
 Nursing/other |   593
(1 row)
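For those working from an export rather than a live database, the per-category breakdown in the queries above can be approximated in Python. This is a sketch under assumptions: rows are taken to be `(category, text)` pairs already loaded into memory, and `count_notes_by_category` is a hypothetical helper, not part of any MIMIC tooling.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_notes_by_category(rows: Iterable[Tuple[str, str]], char: str) -> Counter:
    """Count notes containing `char`, grouped by category.

    Each note is counted once per category, however many times `char` occurs in it,
    mirroring `count(1) ... group by category` in the SQL queries.
    """
    return Counter(category for category, text in rows if char in text)


if __name__ == "__main__":
    # Made-up sample rows, not taken from the dataset.
    rows = [
        ("Nursing/other", "pt alert\x13 and oriented"),
        ("Nursing/other", "no changes"),
        ("Radiology", "chest film\x7f"),
    ]
    for char in ("\x7f", "\x14", "\x13"):
        print(repr(char), dict(count_notes_by_category(rows, char)))
```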