Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add other formats? #8

Open
4 tasks
ickc opened this issue Jan 27, 2017 · 10 comments
Open
4 tasks

Add other formats? #8

ickc opened this issue Jan 27, 2017 · 10 comments

Comments

@ickc
Copy link
Owner

ickc commented Jan 27, 2017

Currently, pantable support to and from CSV.

Potentially, other table formats could be supported:

  • .xlsx: this one will be useful but difficult:
    • Ideally, one would want to read/write .xlsx just like how pandoc read/write .docx. But there's seems no good cli to convert between .docx and .xlsx (to pass the .docx to pandoc).
    • So one might allow "markdown syntax" in .xlsx, which might works just like the current .csv but seems counterintuitive (people expect rich formatting in Excel).
    • Or find a tool to convert .xlsx to .html to pass the .html to pandoc (quite lossy though).
    • And then this' a question of what if people want to intermix rich text and markdown syntax (e.g. for LaTeX equation).
  • HTML: if someone find writing the source table in HTML easier, say when it is a big table, but want to output to other formats as well
  • YAML (table representing general YAML is not ideal, I'm thinking more like 2 column tables to show the key-value pairs)
  • JSON (similar to above)
@ickc ickc changed the title Add other format? Add other formats? Jan 27, 2017
This was referenced Jan 27, 2017
@ghost
Copy link

ghost commented Oct 12, 2017

xlsx-support without formating can be achieved quite easily using pyexcel:

import pyexcel
raw_table_list = pyexcel.get_array(file_name="input.xlsx")

This might already take you quite a long way without too many changes to pantable.

@ickc
Copy link
Owner Author

ickc commented Oct 12, 2017

I've considered something like that. But the hard part is to get "proper" xlsx support. Otherwise we're just "saving CSV in an excel container", so to speak.

e.g. looking at a macro in Excel for LaTeX (forgot the name but that's virtually the only game on town), people would expect one just to format the Excel in the usual way, and all translation happened 'automatically'.

Ideally, something like pandoc that parses docx but for xlsx would be appropriate. It is just that such tool isn't available. I've even considered converting the xlsx to HTML first and uses pandoc to translate it to the AST, but don't know how to do the first step "elegantly" and reliably.

@fralau
Copy link

fralau commented Oct 3, 2018

  1. The question of the csv separator and dialect is, in my view, slightly different because it involves very few changes in the code (the csv module provides the necessary arguments "out of the box"). It is also very important, because csv files come in all forms and shapes (see RFC 480).

  2. I really have a preference for the Excel dialect with semicolons (";") as a default format, because that is the format that Excel can import automatically, without any manipulations (just double-click on the file). So working with that format saves a lot of time with manipulation, debugging, etc.

  3. I am glad to see that pyexcel has gone such a long way with importing files (and notably arrays) from an Excel (native .xlrx) file and that it has become so easy: a few years ago, I had to develop my own logic with xlrd, and that was not exactly fun. It worked very well, though.

  4. My suggestion is that if if you import .xlrx files, you do some reasoning about worksheets.

    • if no worksheet is specified, import data from the first worksheet
    • otherwise, accepte a 'worksheet' argument (either by name or by number). The use cases with the name or the number could be different, so I would provide both possibilities (and I suppose pyexcel should be agnostic anyway about this).
  5. Also, note that csv files (and possibly Excel files) do not always start with the list of fields at the first line. This is due to the fact that some SQL export tools write the query on the first line of the export file, by default. Having to edit csv files manually every time can be annoying, so an argument saying e.g. "first_line = 2" could be useful.

@gepcel
Copy link
Contributor

gepcel commented Jan 26, 2019

Maybe use pandas? pandas can read data from xlsx, json, html, and many others.

There's a temp to implement this in #37 . Right now, I'm using it myself, works fine.

@ickc
Copy link
Owner Author

ickc commented Jan 29, 2019

I considered pandas before but its CSV parser is much less lenient. One example is rows of different no. of columns. In this case the Python standard library can parse it fine and the pandas will emit an error.

@gepcel
Copy link
Contributor

gepcel commented Jan 29, 2019

Pantable works very well with csv. Just for xlsx, json, html, even from an online webpage and many others, pandas may be a solution. Just my opinion, for me, csv and my own implementation of xlsx are enough for me.

@ickc
Copy link
Owner Author

ickc commented Jan 29, 2019

But I think if you try using panda's csv parser for the test csv files in this repo then it wouldn't pass. I tested this earlier. The main reason is really in the beginning of the design of pantable, I try to make it as lenient as possible. CSV is a poorly defined format so there are tons of "valid" csv out there that some parsers might choke on. I think one example is that I mentioned, when different rows have different no. of columns.

However, since what you said concerns the xlsx parser, and there aren't any existing reader in pantable, so having one might be better than none.

My main issue about xlsx reader is that there really isn't one in Python that really parse the markups (correct me if I'm wrong about this), such as bold text, etc. So I have trouble understanding why would someone want to enter markdown/plain text in excel and let pantable read that. In this case one could save the xlsx as csv instead of xlsx without any loss (if one really write plain text / markdown only.) So in this use case one should really has another pipeline handling the xlsx to csv process. i.e. the "do one thing but one thing good". For usability one can even define a filter that read xlsx files and convert to csv and present it as a code block to be piped with the next filter such as pantable.

I think when I started to think about xlsx reader I was thinking more like something truly use some non-trivial feature in Excel which would have been lost in xlsx to csv conversion, e.g. bold, italic rather than markdown syntax. But that kind of xlsx parser doesn't seem to exist in Python.

I will think more about this. Currently my thought on this will be to have another function that convert xlsx to csv first, by different engines (e.g. depending on availability of engines, default to the "best" engine, and e.g. has a env. var. to select.), and feed that csv into the usual pantable csv pipeline. This is redundant but I guess performance-wise it won't matter much (test needed) but one could has an option to choose multiple xlsx engine since none in Python is perfect, unlike CSV counterparts which are incredibly robust. And even so one need to emphasize this is for plain text/markdown in xlsx only.

@ickc
Copy link
Owner Author

ickc commented Jan 29, 2019

and probably I need to take a more "adventurous" approach in reading other formats. I try to make pantable and pantable2csv incredibly robust but that put me in a corner that I couldn't do the same for other formats and so hasn't really touched them. May be just emphasize in the doc that other formats are beta/alpha only.

@gepcel
Copy link
Contributor

gepcel commented Jan 29, 2019

Ha, maybe I wasn't express myself clearly, the way pantale delt with csv is very good. I wasn't suggesting switching to pandas. And I wan't pushing my opition or my pull request either.

As for xlsx, I wasn't thinking as much as you do. For me, I actually don't need any advantage of xlsx, so what I did here works for me.

As for why xlsx at all, from where I'm working, nobody around uses csv, nobody knows pandoc, even nobody knows markdown, the whole world is full of docx and xlsx. (^_^ I'm lonely.). The format I import all the data from and export to is xlsx, this format is what I'm given with and what I will share with. But I honestly don't need any more markups from xlsx, just like bold, or italic or some equation, I can use raw markdown in xlsx.

So basically, for me, xlsx is just a csv file with no format markups so that others can be opened directly with excel.

Following the way you were thinking of, pandas wasn't a good way to read the table, but just a convenient way to read data (at least it gets the data right). I like the 'adventurous' idea, so maybe for some formats, you can implement them one by one, and for others, make theme under alpha only.

Thank you for making this very useful soft.

@ickc
Copy link
Owner Author

ickc commented Jan 30, 2019

Excel open CSV just fine. So I've been trying to save as CSV rather than xlsx if I'm not using "advanced" Excel features. But then I recently also run into another problem that I end up saving in xlsx even when all the cells are plain text, because I need the split and freeze functionality in order to navigate a big table effectively.

So I guess there's valid argument to save in xlsx even when cells are plain texts.

I'm more leaning toward writing another filter to preprocess the files as csv first. e.g. a filter using pandas exclusively to do just that, leaving options for another filter using another backend. Naming this would be interesting, combining pandoc and pandas? pandacs?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants