Add other formats? #8
Comments
xlsx support without formatting can be achieved quite easily using pyexcel.
This might already take you quite a long way without too many changes to pantable.
I've considered something like that. But the hard part is getting "proper" xlsx support. Otherwise we're just "saving CSV in an Excel container", so to speak. For example, looking at a macro in Excel for LaTeX (I forgot the name, but it's virtually the only game in town), people would expect to format the spreadsheet in Excel the usual way and have all the translation happen automatically. Ideally, something like pandoc's docx parser, but for xlsx, would be appropriate. It is just that such a tool isn't available. I've even considered converting the xlsx to HTML first and using pandoc to translate it to the AST, but I don't know how to do the first step "elegantly" and reliably.
Maybe use pandas? pandas can read data from xlsx, JSON, HTML, and many other formats. There's an attempt to implement this in #37. Right now I'm using it myself, and it works fine.
I considered pandas before, but its CSV parser is much less lenient. One example is rows with different numbers of columns: the Python standard library parses such a file fine, while pandas emits an error.
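To make the "different numbers of columns" case concrete, here is a minimal sketch using only the standard library's `csv` module. The sample data and the `read_ragged` helper are made up for illustration; this is not pantable's actual code, and padding short rows with empty cells is just one possible way to normalize such a table.

```python
import csv
import io

# Hypothetical sample: rows with 3, 2, and 4 fields.
RAGGED_CSV = "a,b,c\n1,2\n3,4,5,6\n"


def read_ragged(text):
    """Parse CSV text and pad short rows so every row has the same width."""
    # csv.reader accepts rows of any length without complaint.
    rows = list(csv.reader(io.StringIO(text)))
    width = max((len(row) for row in rows), default=0)
    # Padding with empty strings is an assumption made for this sketch.
    return [row + [""] * (width - len(row)) for row in rows]


if __name__ == "__main__":
    for row in read_ragged(RAGGED_CSV):
        print(row)
```

The reader itself never rejects the input; any decision about how to square up the table is left to the caller.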
Pantable works very well with CSV. It's just that for xlsx, JSON, HTML (even from an online webpage), and many others, pandas may be a solution. That's just my opinion; for me, CSV and my own implementation of xlsx are enough.
But I think if you tried pandas's CSV parser on the test CSV files in this repo, it wouldn't pass. I tested this earlier. The main reason is that from the beginning of pantable's design, I tried to make it as lenient as possible. CSV is a poorly defined format, so there are tons of "valid" CSV files out there that some parsers choke on. One example is the one I mentioned: different rows having different numbers of columns. However, since what you said concerns the xlsx parser, and there isn't any existing xlsx reader in pantable, having one might be better than none. My main issue with an xlsx reader is that there isn't one in Python that really parses the markup, such as bold text (correct me if I'm wrong about this). So I have trouble understanding why someone would want to enter markdown/plain text in Excel and let pantable read that. In that case one could save the file as CSV instead of xlsx without any loss (if one really writes plain text/markdown only). So in this use case there should really be another pipeline handling the xlsx-to-CSV step, i.e. "do one thing and do it well". For usability one could even define a filter that reads xlsx files, converts them to CSV, and presents the result as a code block to be piped to the next filter, such as pantable. When I started thinking about an xlsx reader, I had in mind something that truly uses non-trivial Excel features that would be lost in an xlsx-to-CSV conversion, e.g. bold and italic rather than markdown syntax. But that kind of xlsx parser doesn't seem to exist in Python. I will think more about this. Currently my thought is to have another function that converts xlsx to CSV first, using different engines (e.g. depending on which engines are available, defaulting to the "best" one, with an environment variable to select), and feeds that CSV into the usual pantable CSV pipeline. This is redundant, but I guess it won't matter much performance-wise (testing needed). One could have an option to choose among multiple xlsx engines, since none in Python is perfect, unlike their CSV counterparts, which are incredibly robust. Even so, one needs to emphasize that this is for plain text/markdown in xlsx only.
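The engine-selection idea could be sketched roughly as follows. Everything named here is hypothetical: the `ENGINES` registry and its preference order, the `PANTABLE_XLSX_ENGINE` environment variable, and the helper functions are illustrations, not existing pantable settings or APIs. The step that actually reads cells from an xlsx file with a given engine is deliberately left out; the sketch only covers picking an engine and serializing the resulting rows to CSV for the existing pipeline.

```python
import csv
import importlib.util
import io
import os

# Hypothetical registry: engine name -> module that must be importable.
# The preference order is an assumption for illustration only.
ENGINES = {"pandas": "pandas", "pyexcel": "pyexcel", "openpyxl": "openpyxl"}
ENV_VAR = "PANTABLE_XLSX_ENGINE"  # hypothetical name, not a real pantable setting


def available_engines(is_installed=None):
    """Return registered engine names whose backing module can be imported."""
    if is_installed is None:
        def is_installed(module):
            return importlib.util.find_spec(module) is not None
    return [name for name, module in ENGINES.items() if is_installed(module)]


def select_engine(environ=None, is_installed=None):
    """Pick the engine named in the environment, else the first available one."""
    environ = os.environ if environ is None else environ
    candidates = available_engines(is_installed)
    requested = environ.get(ENV_VAR)
    if requested is not None:
        if requested not in candidates:
            raise RuntimeError(f"xlsx engine {requested!r} is not available")
        return requested
    if not candidates:
        raise RuntimeError("no xlsx engine is installed")
    return candidates[0]


def rows_to_csv(rows):
    """Serialize rows of cell values to CSV text for the existing CSV pipeline."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Keeping the selection logic separate from the readers means each backend only has to produce a list of rows, and the lenient CSV path stays the single place where tables are parsed.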
And I probably need to take a more "adventurous" approach to reading other formats. I try to make pantable and pantable2csv incredibly robust, but that put me in a corner: I couldn't do the same for other formats and so haven't really touched them. Maybe just emphasize in the docs that other formats are alpha/beta only.
Ha, maybe I didn't express myself clearly: the way pantable deals with CSV is very good. I wasn't suggesting switching to pandas, and I wasn't pushing my opinion or my pull request either. As for xlsx, I hadn't thought about it as much as you have. I actually don't need any of the advantages of xlsx, so what I did here works for me. As for why xlsx at all: where I work, nobody around uses CSV, nobody knows pandoc, nobody even knows markdown; the whole world is full of docx and xlsx. (^_^ I'm lonely.) The format I import all the data from and export to is xlsx; it is what I'm given and what I have to share. But I honestly don't need any more markup from xlsx: for bold, italic, or an equation, I can use raw markdown in xlsx. So basically, for me, xlsx is just a CSV file with no formatting markup that others can open directly in Excel. Following your line of thinking, pandas isn't a good way to read the table, just a convenient way to read data (at least it gets the data right). I like the "adventurous" idea, so maybe you can implement some formats one by one and mark the others as alpha only. Thank you for making this very useful software.
Excel opens CSV just fine, so I've been trying to save as CSV rather than xlsx when I'm not using "advanced" Excel features. But I recently ran into another problem: I end up saving as xlsx even when all the cells are plain text, because I need the split and freeze functionality to navigate a big table effectively. So I guess there is a valid argument for saving as xlsx even when the cells are plain text. I'm leaning more toward writing another filter that preprocesses the files into CSV first, e.g. a filter using pandas exclusively to do just that, leaving room for another filter using a different backend. Naming this would be interesting: combining pandoc and pandas? pandacs?
Currently, pantable supports converting to and from CSV.
Potentially, other table formats could be supported:
- `.xlsx`: this one will be useful but difficult:
    - Handle `.xlsx` just like how pandoc reads/writes `.docx`. But there seems to be no good CLI to convert between `.docx` and `.xlsx` (to pass the `.docx` to pandoc).
    - Plain-text `.xlsx`, which might work just like the current `.csv` but seems counterintuitive (people expect rich formatting in Excel).
    - Convert `.xlsx` to `.html` to pass the `.html` to pandoc (quite lossy though).