Handle line breaks encapsulated in XML tags #46

Open
andreasnoack opened this issue Feb 23, 2018 · 1 comment

@andreasnoack
Contributor

Admittedly, this is a pretty exotic feature request, but I have some CSVs where the last column contains mixed text including XML, and the text within the XML tags can contain newline characters that shouldn't be interpreted as row breaks when parsing the file. Two such examples:

<PAGE_AUTHORS>&#xD;\n&#xD;\n&#xD;\n&#xD;\n&#xD;\nHACKETT;Ark. &#xC3;&#xA2;&#xC2;&#x80;&#xC2;&#x94; A sheriff;admin;About the Author</PAGE_AUTHORS>

and

<PAGE_AUTHORS>K G Rana;\nMax Planck Institute of Microstructure Physics;Weinberg 2;D-06120 Halle;Germany;\nMax Planck Institute for Chemical Physics of Solids;N&#xC3;&#xB6;thnitzer Str. 40;D-01187 Dresden;O Meshcheriakova;J K&#xC3;&#xBC;bler;\nInstitut f&#xC3;&#xBC;r Festk&#xC3;&#xB6;rperphysik;Technische Universit&#xC3;&#xA4;t Darmstadt;D-64289 Darmstadt;B Ernst;J Karel;R Hillebrand;E Pippel;P Werner;A K Nayak;C Felser;S S P Parkin</PAGE_AUTHORS>

The first example is taken from the file 20160810171500.gkg.csv in the GDELT2 dataset.

@davidanthoff
Member

Are these columns surrounded by quotation marks? If not, we would have to add XML support to the parser to handle this. That doesn't seem like a good idea :) Or am I misunderstanding something here?
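
If the fields are not quoted, one possible workaround that keeps XML handling out of the parser is to preprocess the file and replace the line breaks that fall between a matching pair of tags before handing the text to the CSV reader. The following is only a minimal sketch in Python, not something the package provides: the helper name `flatten_newlines_in_tags`, the choice of a space as the replacement, and the sample row are all assumptions made for illustration. It also assumes simple, non-nested tags without attributes, as in the `<PAGE_AUTHORS>` examples above.

```python
import re

# Matches <TAG>...</TAG> with the same simple tag name on both sides.
# DOTALL lets "." span line breaks, and ".*?" stops at the first closing tag.
_TAG_BODY = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def flatten_newlines_in_tags(text, replacement=" "):
    """Replace raw CR/LF characters found between matching XML tags.

    Hypothetical helper: line breaks outside of tags are left untouched,
    so they still separate rows for a downstream CSV parser.
    """
    def _flatten(match):
        tag, body = match.group(1), match.group(2)
        body = re.sub(r"\r\n|\r|\n", replacement, body)
        return f"<{tag}>{body}</{tag}>"

    return _TAG_BODY.sub(_flatten, text)


# Made-up two-row input; the first row has a line break inside the tag.
raw = (
    "1,a,<PAGE_AUTHORS>K G Rana;\nMax Planck Institute</PAGE_AUTHORS>\n"
    "2,b,plain text\n"
)
cleaned = flatten_newlines_in_tags(raw)
rows = cleaned.splitlines()  # two rows instead of three
```

The cleaned text can then be written back to disk or passed to any CSV reader. This does not handle nested tags of the same name, tags with attributes, or delimiters inside the tag body, so it is a stopgap for files like the ones described here rather than general XML support.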
