rowCount wrong for converted excel ODS files #64
Comments
That sounds familiar. I think Calc wants to set a style on every cell in the spreadsheet, and it does that by writing empty cells all the way up to its maximum capacity. |
First we need to detect identical rows that should be merged into a single repeated-rows entry; as I said, there is one last extra row at the very end of the document. If we can merge such entries, we can remove or ignore those huge repeated-rows entries at the very end of the document. |
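The merging step mentioned above could be sketched roughly as follows. This is only an illustration on a simplified model, not ODFDOM code: `StyledRun`, its `signature` field (assumed to be some string summarising a row's style and cell content) and `RunMerger` are hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for one row entry; not an ODFDOM class. */
final class StyledRun {
    final int repeated;      // value of the repeated-rows attribute, 1 if absent
    final String signature;  // assumed summary of the row's style and cell content

    StyledRun(int repeated, String signature) {
        this.repeated = repeated;
        this.signature = signature;
    }
}

final class RunMerger {
    /** Merges neighbouring entries with equal signatures into one entry. */
    static List<StyledRun> mergeAdjacent(List<StyledRun> runs) {
        List<StyledRun> merged = new ArrayList<>();
        for (StyledRun run : runs) {
            int lastIdx = merged.size() - 1;
            if (lastIdx >= 0 && merged.get(lastIdx).signature.equals(run.signature)) {
                // Same kind of row as the previous entry: add up the repetitions.
                StyledRun prev = merged.get(lastIdx);
                merged.set(lastIdx, new StyledRun(prev.repeated + run.repeated, prev.signature));
            } else {
                merged.add(run);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<StyledRun> merged = mergeAdjacent(Arrays.asList(
            new StyledRun(1, "data"),
            new StyledRun(1_048_574, "empty"),
            new StyledRun(1, "empty")));   // the extra row at the very end
        System.out.println(merged.size()); // prints 2
    }
}
```

Once the trailing entries are merged, a single large empty entry remains at the end, which is easy to recognise and skip.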
The table is something of an anomaly in the XML document format. I had a problem similar to yours years ago, when I transformed ODS to HTML and some ODF applications expressed the table background via cell styles instead of using the table's background-colour style. I would suggest keeping the repeated rows/columns/cells in the DOM model and not splitting them during loading or iteration. Could you please attach a problematic test document to this issue and point to the problematic source code in master (as in the table style example above, you might point to any GitHub line number)? PS: With the heatwave starting and the last week of the school holidays in Berlin, I decided to spend some time in the cool countryside until next week, so my answers (and reviews of pull requests) might be delayed (due to a final summer holiday)... ;-) |
The problem with that higher-level API approach is that I currently need to iterate through 1 million x 1024 cells, roughly 1 billion, because I do not know whether I have reached the end of the "real" content. I would suggest doing a "Find all usages" on TableTableRowElement:getTableNumberRowsRepeatedAttribute, TableTableRowElement:getRepetition and OdfTableRow:getRowsRepeatedNumber to pinpoint the problematic source code; it seems only the first has really problematic code attached to it (see below): OdfTable:getRowCount, OdfTable:getRowElementByIndex, OdfTable:getRowList, OdfTable:getHeaderRowCount; OdfTableCell: ignorable. In the end, an additional method along the lines of "getLastRowWithContentIndex" and some documentation to keep people from falling into this pitfall might be sufficient. What do you think? |
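To make the "getLastRowWithContentIndex" idea a bit more concrete, here is a minimal sketch on a simplified model. It does not use ODFDOM: `RowRun`, `hasContent` and `LastContentRow` are hypothetical names, and how "has content" would actually be decided for a real row is left open.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for one row entry; not an ODFDOM class. */
final class RowRun {
    final int repeated;        // value of the repeated-rows attribute, 1 if absent
    final boolean hasContent;  // assumed flag: true if any cell carries a value

    RowRun(int repeated, boolean hasContent) {
        this.repeated = repeated;
        this.hasContent = hasContent;
    }
}

final class LastContentRow {
    /** Returns the zero-based index of the last row with content, or -1 if none. */
    static long lastRowWithContentIndex(List<RowRun> runs) {
        long next = 0;   // logical index of the first row covered by the current entry
        long last = -1;
        for (RowRun run : runs) {
            if (run.hasContent) {
                last = next + run.repeated - 1;
            }
            next += run.repeated;  // skip the whole entry without expanding it
        }
        return last;
    }

    public static void main(String[] args) {
        List<RowRun> runs = Arrays.asList(
            new RowRun(1, true), new RowRun(2, false), new RowRun(1, true),
            new RowRun(1_048_571, false), new RowRun(1, false));
        System.out.println(lastRowWithContentIndex(runs)); // prints 3
    }
}
```

The loop visits one entry per row element rather than one step per logical row, so the million-row filler costs a single iteration.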
We might be talking past each other, so let me give you an example. Would it not be helpful to have an API that returns, in document order, all cells (and their positions) that differ from the 'default' empty cells? What if we return not only cells but even cell ranges? That way, we might turn the repeated row/cell feature into an advantage and return those without splitting them. In other words, solve the problem by getting away from the usual one-cell-at-a-time paradigm. I am going to discuss this with others in the next few days and will likely come back with an update afterwards. |
As I understand it, there should not be a "foreach row; foreach column ()" but a "foreach area; foreach cell ()". It runs over the same ODS/ODF document format as we know it today.
|
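A rough sketch of what a "foreach area" traversal could look like, again on a simplified model rather than ODFDOM itself. `CellRun`, `RowSpan`, `CellRange` and `RangeWalker` are hypothetical names, and a `null` value is assumed to mark a default empty cell.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical cell entry: 'repeated' equal cells; value == null means default/empty. */
final class CellRun {
    final int repeated;
    final String value;
    CellRun(int repeated, String value) { this.repeated = repeated; this.value = value; }
}

/** Hypothetical row entry: 'repeated' equal rows sharing the same cell entries. */
final class RowSpan {
    final int repeated;
    final List<CellRun> cells;
    RowSpan(int repeated, List<CellRun> cells) { this.repeated = repeated; this.cells = cells; }
}

/** A rectangular area of identical non-default cells (zero-based, inclusive bounds). */
final class CellRange {
    final long firstRow, lastRow;
    final int firstCol, lastCol;
    final String value;
    CellRange(long firstRow, long lastRow, int firstCol, int lastCol, String value) {
        this.firstRow = firstRow; this.lastRow = lastRow;
        this.firstCol = firstCol; this.lastCol = lastCol;
        this.value = value;
    }
}

final class RangeWalker {
    /** Returns non-default areas in document order without expanding repetitions. */
    static List<CellRange> nonDefaultRanges(List<RowSpan> rows) {
        List<CellRange> ranges = new ArrayList<>();
        long row = 0;
        for (RowSpan r : rows) {
            int col = 0;
            for (CellRun c : r.cells) {
                if (c.value != null) {
                    ranges.add(new CellRange(row, row + r.repeated - 1,
                                             col, col + c.repeated - 1, c.value));
                }
                col += c.repeated;
            }
            row += r.repeated;
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<RowSpan> rows = Arrays.asList(
            new RowSpan(1, Arrays.asList(new CellRun(1, "a"), new CellRun(1023, null))),
            new RowSpan(2, Arrays.asList(new CellRun(1, null), new CellRun(3, "x"))),
            new RowSpan(1_048_573, Arrays.asList(new CellRun(1024, null))));
        System.out.println(nonDefaultRanges(rows).size()); // prints 2
    }
}
```

In this model the caller sees two areas (a single cell and a 2 x 3 block) and never touches the million empty rows, which is the advantage of returning ranges instead of individual cells.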
Allow me to answer indirectly ;-)
Now in regard to your question: "foreach row; foreach column ()" vs. "foreach area; foreach cell ()". This high-level traversal over the DOM ranges might be a generic function, where we might provide some query/filter, like only certain content, or only formula, pictures, etc. You asked for a scenario for such an API. Imagine we would not only offer this API on top of the DOM, but also on top of the SAX API. We would have a high-level stream of ODF API. Remember an ODF document is equal to a list of its user changes creating it. Hopefully, I have covered your questions and comments. |
One last comment in regard to cell ranges. If you select any rectangle of cells within a table (every sheet of a spreadsheet is a table) it is a cell range (or how I learned and defined it). When I - in my prior life - once added ODF spreadsheet support to OX Documents, which used a fork of ODFDOM in the backend, we allowed the user to do arbitrary actions on cell ranges of the spreadsheet. Aside from that (fixable mistake), I loved the simplicity of that generic cell-ranged API! :-) |
LibreOffice generates an ODS file from an Excel file that contains one row with a repeated-rows value of about 1 million, followed by another empty row.
Reference: https://forum.lazarus.freepascal.org/index.php?topic=35451.0
This is still an issue, I found the same behavior in our ODS document.
Because of this, a
for (OdfTableRow row : table.getRowList())
iterates through a million empty rows and table.getRowCount() returns the maximum (~1 million).
As a workaround, a repeated-rows value above, let's say, 10000 should simply be ignored. There is very probably no reason to have anything repeated ten thousand times.
I did not search for other occurrences of getTableNumberRowsRepeatedAttribute that would need to be considered.
I did not research a correct way to know whether that row is "empty" in reality - even then one might add one for visuals.
I did not research the count of columns in such documents in detail, but they seem to be fixed to 1024.
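The threshold workaround described above could look roughly like this. It is a sketch on a simplified model, not a patch for ODFDOM: `RowEntry`, `RowCountWorkaround` and `MAX_PLAUSIBLE_REPEAT` are hypothetical names, the limit of 10000 is just the value suggested above, and the sketch does not check whether an ignored row is really empty.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for one row entry; not an ODFDOM class. */
final class RowEntry {
    final int repeated;  // value of the repeated-rows attribute, 1 if absent
    RowEntry(int repeated) { this.repeated = repeated; }
}

final class RowCountWorkaround {
    /** Assumed limit taken from the suggestion above; not an ODF or ODFDOM constant. */
    static final int MAX_PLAUSIBLE_REPEAT = 10_000;

    /** Counts rows, ignoring entries whose repetition exceeds the limit. */
    static long effectiveRowCount(List<RowEntry> entries) {
        long count = 0;
        for (RowEntry entry : entries) {
            if (entry.repeated > MAX_PLAUSIBLE_REPEAT) {
                continue;  // treated as filler; emptiness is NOT verified here
            }
            count += entry.repeated;
        }
        return count;
    }

    public static void main(String[] args) {
        List<RowEntry> entries = Arrays.asList(
            new RowEntry(1), new RowEntry(3), new RowEntry(1_048_571), new RowEntry(1));
        System.out.println(effectiveRowCount(entries)); // prints 5
    }
}
```

Note that the extra empty row after the large entry is still counted here, and that skipping a large entry in the middle of a sheet would shift the indices of the rows after it, so this only fits the trailing-filler case described in this issue.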