Parsing weird-but-valid sheets (namespaces and `r` attribute on cells) #64

skipchris · 2025-01-09T12:17:38Z

Hi,

In real-world use we’ve encountered a few valid OOXML spreadsheets that simple_xlsx_reader can’t parse correctly. These sheets are annoying, but they are spec-compliant, and as far as I can tell actually come from Microsoft tools like PowerBI.

First of all, sheets can contain namespaced tags, e.g.:

<?xml version="1.0" encoding="utf-8"?>
<x:worksheet xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <x:sheetData>
    <x:row>
      <x:c s="2" t="inlineStr">
        <x:is>
          <x:t>Hello</x:t>
        </x:is>
      </x:c>
    </x:row>
  </x:sheetData>
</x:worksheet>

As far as I can tell, the pragmatic approach here is to just drop the namespace prefix, i.e. `name.split(':').last` in `start_element`. This may be 'wrong' as far as correct XML parsing goes, but it’s simple, works in my testing, and is also basically the approach that xsv takes to solve the problem.
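To make the idea concrete, here is a minimal sketch of the prefix-dropping approach. `SheetHandler` and `local_name` are hypothetical names for illustration, not simple_xlsx_reader's actual internals; the only assumption is a SAX-style handler whose `start_element` receives the qualified tag name as a string.

```ruby
# Hypothetical stand-in for a SAX handler; not the library's real class.
class SheetHandler
  attr_reader :seen

  def initialize
    @seen = []
  end

  # Drop any namespace prefix: "x:row" -> "row", "row" -> "row".
  def local_name(name)
    name.split(':').last
  end

  # Record the unprefixed tag name, so the rest of the handler can keep
  # matching on plain names like "row" and "c".
  def start_element(name, _attrs = [])
    @seen << local_name(name)
  end
end

handler = SheetHandler.new
%w[x:worksheet x:sheetData row c].each { |tag| handler.start_element(tag) }
p handler.seen
```

Prefixed and unprefixed tags end up looking the same to the handler, which is the whole point. The trade-off is that the prefix is never checked against its namespace URI.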

Second, it’s not mandatory for cells to have the `r` attribute; when it’s absent, parsers should infer the column from the cell’s position in the row.

This is easy enough to fix with a `@column_counter` variable which is set to 0 at the start of each row and incremented for each `c`, then doing something like `@cell_name = attrs['r'] || column_number_to_letter(@column_counter)`.
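A rough sketch of that counter, under some assumptions: `ColumnTracker`, `start_row`, and `cell_name` are hypothetical names, `attrs` is assumed to already be a Hash of attribute names to values, and `column_number_to_letter` is written here as a 1-based conversion (1 is "A", 27 is "AA") because the counter is incremented before it is used.

```ruby
# Convert a 1-based column number to spreadsheet letters (1 -> "A", 27 -> "AA").
def column_number_to_letter(number)
  letters = +''
  while number > 0
    number, remainder = (number - 1).divmod(26)
    letters.prepend((65 + remainder).chr)
  end
  letters
end

# Hypothetical helper tracking the position of each cell within a row.
class ColumnTracker
  # Call when a row element starts: reset the counter.
  def start_row
    @column_counter = 0
  end

  # Call for each cell element: prefer the explicit "r" attribute and
  # fall back to the position-derived column letter when it is missing.
  def cell_name(attrs)
    @column_counter += 1
    attrs['r'] || column_number_to_letter(@column_counter)
  end
end

tracker = ColumnTracker.new
tracker.start_row
p tracker.cell_name({})            # no "r": inferred from position
p tracker.cell_name({})            # next cell in the same row
p tracker.cell_name('r' => 'C1')   # explicit reference wins
```

As in the expression above, the fallback yields only the column letter; whether a row number also needs to be appended depends on how the reader consumes `@cell_name`.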

Again, this won’t stand up to something truly truly horrible like mixed present/absent r attributes, but as far as I’ve seen in real-world use, worksheets either have it, or they don’t.

If you’d be happy to merge a pull request to accommodate both of these cases, I’m happy to submit one!


skipchris mentioned this issue Jan 14, 2025

Parse sheets containing namespaces and no 'r' att #65

Open
