rowCount wrong for converted excel ODS files #64

HeikoStudt · 2020-08-05T12:14:18Z

Libreoffice generates an ODS file from Excel having a row with repeated-rows: 1mio and another empty one.
Reference: https://forum.lazarus.freepascal.org/index.php?topic=35451.0
This is still an issue, I found the same behavior in our ODS document.

Because of this, a
for (OdfTableRow row : table.getRowList())
iterates through a million empty rows and table.getRowCount() returns the maximum (~1 mio).

As a workaround, a repeated-rows of above, let's say 10000, should be simply ignored. There is very probably no reason to have anything repeated ten thousand times.

I did not search whether there are other occurances of getTableNumberRowsRepeatedAttribute to be regarded.
I did not research a correct way to know whether that row is "empty" in reality - even then one might add one for visuals.
I did not research the count of columns in such documents in detail, but they seem to be fixed to 1024.

mistmist · 2020-08-05T12:28:07Z

that sounds familiar, i think Calc wants to set some style on every cell in the spreadsheet and it does that by writing empty cells as far as its max capacity goes.
so either there need to be some way to handle this efficiently (by not materializing a million empty rows), or to ignore empty rows at the end of the sheet - but not if they're followed by non-empty rows...

HeikoStudt · 2020-08-06T11:03:56Z

First we need to detect "same rows which should be repeadted-rows", as I said there is one last extra-row at the very end of the document. If we can merge such entries, we can remove/ignore those big-size repeated-rows at the very end of the document.
We could try to count the (real) columns of the repeated-rows, if there is exactly one (being repeated) which is empty-typed, it is a good indicator for en "empty" row.

svanteschubert · 2020-08-06T12:38:03Z

The table is like an anomaly in the XML document format.
While everything else fits nicely in the XML tree format, the table being a two-dimensional array does not and had to be adopted to be serialized into the XML.
The equality of rows and columns had to be broken up. As you know, columns became a second class citizen being serialized in the beginning without any content, followed by the rows with the real cell content. It could have been designed the other away around, but other formats as HTML have chosen the same design.
From today perspective, I do not like the repeated empty approach, either. An index on cells to jump to the relevant cells with content seem more appropriate, but the redesign of ODF XML is not ready, but will happen hopefully come during my lifetime ;-)

I had a similar problem as yours years ago, when I transformed ODS to HTML and some ODF applications were expressing the table background via cell styles instead of using the table background-colour style.

I would suggest keeping the repeated rows/columns/cells in the DOM model and not to split them during loading nor iteration.
On top of this DOM layer, should be the high-level API, which abstracts from the design decision of the XML and handles tables again as two-dimensional arrays.
For instance, our new collaboration API, which is dispatching changes between collaborators (instead of full documents), should be able to rely on a similar high-level API on top of the DOM - likely an extension of the current DOC API.

Could you attach a problematic test document to this issue and point to the problematic source code in the master (as in the table style example above, you might point to any GitHub line number), please?

PS: During the beginning of the heatwave and last week of school holiday in Berlin, I decided to spend some time on the cold countryside till next week, therefore my answers (and review of pull requests) might be delayed (due to a final summer holiday)... ;-)

HeikoStudt · 2020-08-09T07:08:46Z

The problem of that higher-level API approach is, that currently I need to iterate trough 1 mio x 1024 cells = 1 billion, because I do not know if I have reached the end of the "real" content.
It would help to give a method "isEmptyEndOfDocumentRow" and (if possible) "isEmptyEndOfRowColumn". Then, in my own iteration I can fast-break out of the loop. Currently, I put a fixed limit of 2000 rows.

I would suggest to do a "Find all usages" on TableTableRowElement:getTableNumberRowsRepeatedAttribute, TableTableRowElement:getRepetition and OdfTableRow:getRowsRepeatedNumber to pinpoint the problematic source codes, it seems only the former has the real problematic code attached to (see below).

OdfTable:getRowCount
=> additional method around "getLastRowWithContentIndex"

OdfTable:getRowElementByIndex
=> probably OK

OdfTable:getRowList
=> additional method like "getContentRowsList", though with this also the content-less columns need to be filtered, or even "look-nice" empty rows in between. This behavior might however be helpfull for backend business logic anyway.
=> or a documented "you might want to break after getLastRowWithContentIndex was reached"

OdfTable:getHeaderRowCount
=> Currently, I would not change

OdfTableCell: ignorable

At the end, it might be sufficient for an additional method around "getLastRowWithContentIndex" and some documentation efforts for people not pitfalling into. What do you think?

svanteschubert · 2020-08-10T23:08:39Z

We might be crosstalking, let me give you an example.

Would it not be helpful to have an API that returns all cells (and their positions) in document-order which are different from the 'default' empty cells?
Like returning the red, green and yellow cell:

MergeTest.odt - download MergeTest.zip & rename to ods

What if we are not only returning cells but even cell ranges? By this, we might turn the repeated row/cell feature into an advantage and return those without splitting them.

In other words, solve the problem by getting away from the usual one-cell at the time paradigm.

Going to discuss this with others in the next days and likely come back with an update afterwards.
Not sure how this API might look like in detail in Java, yet! ;-)

HeikoStudt · 2020-08-11T05:16:50Z

As I understand, there should not be a "foreach row; foreach column ()" but a "foreach area; foreach cell ()". It runs over the same ODS/ODF-Document format as we know today.
I'll try to argument about the idea in the following, hopefully, I understood your idea correctly and do not cross-talk again. :-)

Which Cells are given back in which order?
a) There should be some kind of Filter-Flags like "Ignore-StyleOnly"
b) In which order do you want to Iterate? In the document above, it might be cool to iterate reds, greens, yellows - but this iteration would be very chaotic (from a programmer's point of view) and may change through versions of the API, as we change the identification of these areas. Additionally, in most documents (I am aware of) you will have a simple table for the main data - do you want to go through this row-based or column-based? Both might have reasoning behind.
c) For whom is this new API? Business-Apps trying to import some data, only interested in textual context? Business-Apps trying to understand the structure for some analysis? ODF-Apps for templating ODS-files? View-/Edit-Tools for the user?
Does this solve the issue rowCount wrong for converted excel ODS files #64?
The decision of "this is an empty area or not" is probably better of in the API instead of the user of the API. But it might change dramatically due to the use-case (see 1.c)
In the document all the 1 billion cells are non-empty (they have got some style). Even subtle differences in the style (even of empty cells) or how the style was defined, might change the outcome of the API but also whether the user wants to see that cell. Like auto-format coiumns.
What do you do with areas-within-areas? Like manually styled first-row/first column and the data? Are there one, two, three, or even four areas? I have arguments for all variants - the notion of an "area" in a table does work best for rectangular areas and even then only if they do not overlap - so in this example, there should be the non-natural four areas. (A1, A2-A99, B1-Z1, B2-Z99).
Documents and their areas might get way more complicated then first-column/first-row... there was some artist painting huge images in Excel with each cell being a pixel.
In Excel there often whole columns (or rows) auto-formatted due to the cell content. These "areas" go until the end of rows (columns).
I did not thought this one through, but there was some NP-complete problem being somehow similar (provided there are no limits on count of columns/rows). It might not be feasable to detect the areas "perfectly".
In the full API, I think it is more complicated than a programmer (with a mild knowledge of Excel/Calc) understands about his documents - for him, you have one area of rows,columns of data and you can scroll infinitly.

svanteschubert · 2020-08-11T10:53:57Z

Allow me to answer indirectly ;-)
Let's say there are different types of APIs for ODF (or in general).

Initially, when we load the ODF XML we receive the XML as XML (SAX) Events, it's a "streaming API", where we have to go through the complete document, we might stop (e.g. StAX API), but we need to memorize everything we need to use "later" in a run-time memory model.
The DOM is the typical XML run-time API created from a streaming API as SAX - we have the full document as tree structure at our fingertips. We can traverse the tree and we might have our own created/memorized references to quickly jump to areas of interest and have improved data structures, like a HashMap for the styles using their name (in conjunction with their family) as a key - that we are keeping in synch with the DOM.
The new collaborative remote change API is different. It focuses on changes. The need arises when you work offline/async with coeditors at the same document because if you want to synchronise with other editors, you have to ask what they have changed to merge these changes.
A change is like a well-defined macro close to the user understanding, like "insert a new 3rd column @1stTable", which comes with a "logical" position of the change to a certain state. Changes are like DIFFs in Git and like DIFFS they are based on a certain state and we might need to rebase them later. Or changes are like ODF RPC.
Very similar to new change API will be the remote query API - like the search for "Harry Potter" in a document or corpus. Queries might be combined with changes, like doing a search&replace. There are no queries yet in ODFOM and changes are going to be refactored - more generated instead of manually written. This should be my focus of work in the next year.
On top of the run-time DOM model (see No2 above) should be a high-level run-time API that abstracts from the ODF XML details. Like the deprecated Simple API or the DOC API aimed to be. This is the kind of API I am currently talking about. I placed this API at the end of this list, as it should represent capsulate the run-time model and match the in- and outgoing prior mentioned high-level RPCs (like change & query API).
Ideally, every given change can be mapped 1:1 to an API call of this high-level API.

Now in regard to your question: "foreach row; foreach column ()" vs. "foreach area; foreach cell ()".
First, it is still a "foreach row; foreach column ()". We are traversing the two-dimensional table array first by row from left to right: A1, B1,..., then A2, B2,...
In my previous example, we are skipping default empty cells and returning the red merged cell first (and every cell in general only once), as we encounter it in A1, the green cell we encounter in B2 and yellow in C4.
The query is based on the DOM, but returning as little results as possible to be efficient. These can be merged cells but in general cell ranges, like repeated cells with contents, which I have added to the second sheet of my new extended example:

MergeTest2.odt - download MergeTest2.zip & rename to ods

This high-level traversal over the DOM ranges might be a generic function, where we might provide some query/filter, like only certain content, or only formula, pictures, etc.
In MergeTest2.odt we have for instance covered content hidden beyond the red cell, so the query would provide additional results for the merged cell.
I have added also a formula that sums up the cells in the second sheet, which includes content that is repeated in such a range.
Some users might want to place behind these results (pipe) a foreach cell loop, but that does not have always be the case and might vary, therefore should be offered as an optional generic function, we might see the details when it evolves, but it certainly should fix #64

You asked for a scenario for such an API. Imagine we would not only offer this API on top of the DOM, but also on top of the SAX API. We would have a high-level stream of ODF API. Remember an ODF document is equal to a list of its user changes creating it.
Therefore, we might stream these changes to create visual content for instance in a browser (or SKIA API?) abstracting XML complexity.

Hopefully, I have covered your questions and comments.
If not, can you create a test document that covers your use cases or the edge cases you have mentioned?
PS: I am will focus for now on the OASIS ODF standard and the change API (aside of the long-overdue releases, including ODF 1.3 support), therefore I would be happy if you could give it a try, Heiko! ;-)

svanteschubert · 2020-08-11T12:19:33Z

One last comment in regard to cell ranges.

If you select any rectangle of cells within a table (every sheet of a spreadsheet is a table) it is a cell range (or how I learned and defined it).
The smallest cell range is a single cell, the largest cell range is the table itself.
Columns and rows are special pre-defined cell ranges.
Therefore cell-range is a high-level interface for a destination of a change within a table or as in the example above is the return value of a high-level iteration over a table.

When I - in my prior life - once added ODF spreadsheet support to OX Documents, which used a fork of ODFDOM in the backend, we allowed the user to do arbitrary actions on cell ranges of the spreadsheet.
I traversed the DOM only according to the type of action and skipped - according to the type - sometimes DOM sections (like some actions only required the XML column or row areas).
What I, unfortunately, missed at the beginning of the design was that the users sometimes desire to alter the difference of two cell-ranges, like adding a background colour to the cells around the table content.

Aside from that (fixable mistake), I loved the simplicity of that generic cell-ranged API! :-)

svanteschubert added this to the 1.0.0 milestone Oct 22, 2020

Nnnes mentioned this issue Apr 12, 2024

OdfTableCellRange() is unusably slow for spreadsheets with whole-sheet formatting #285

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rowCount wrong for converted excel ODS files #64

rowCount wrong for converted excel ODS files #64

HeikoStudt commented Aug 5, 2020

mistmist commented Aug 5, 2020

HeikoStudt commented Aug 6, 2020

svanteschubert commented Aug 6, 2020

HeikoStudt commented Aug 9, 2020

svanteschubert commented Aug 10, 2020

HeikoStudt commented Aug 11, 2020

svanteschubert commented Aug 11, 2020

svanteschubert commented Aug 11, 2020

rowCount wrong for converted excel ODS files #64

rowCount wrong for converted excel ODS files #64

Comments

HeikoStudt commented Aug 5, 2020

mistmist commented Aug 5, 2020

HeikoStudt commented Aug 6, 2020

svanteschubert commented Aug 6, 2020

HeikoStudt commented Aug 9, 2020

svanteschubert commented Aug 10, 2020

HeikoStudt commented Aug 11, 2020

svanteschubert commented Aug 11, 2020

svanteschubert commented Aug 11, 2020