Parser does not handle nested tables which also have content #181

Open
safranchik opened this issue Nov 18, 2018 · 1 comment
Labels
bug Something isn't working help wanted Extra attention is required

Comments

@safranchik

When parsing certain HTML files, the parser fails to add to the database the sections whose parent is a table.

As a quick workaround, I added two lines of code after line 583 in parser.py, after which the documents could be read. However, this may not be the most efficient way (or even a correct one) to build the corresponding tree structure for the section.

HTML file parsed:
20841.txt

Screenshot of error:
screen shot 2018-11-17 at 9 45 43 pm

Image of a quick fix after line 583:
screen shot 2018-11-17 at 9 56 12 pm

@lukehsiao
Contributor

lukehsiao commented Nov 18, 2018

Here's a minimal snippet showing the issue:

<table>
    <tbody>
        <tr>
            <th class="ssTableHeader" valign="top" rowspan="2" id="PIr03">State</th>
            <th class="ssTableHeader" valign="top" id="PIr13">State of Texas</th>
            <td rowspan="2"></td>
            <td rowspan="2" valign="top" headers="
                PIr03 
                PIr13"> 
            </td>
            <td rowspan="2" valign="top" headers="
                PIr03 
                PIr13 
                PIc5">
                <b>Tamara Y S Keener</b><br>830-997-9542(W)
                <table>
                    <tbody>
                        <tr height="25">
                            <td>&nbsp;</td>
                        </tr>
                    </tbody>
                </table>
                Jay Weinheimer<br>997-2149(W)
            </td>
        </tr>
    </tbody>
</table>

The challenge here is that a table cell contains a nested table, with the cell's own text content both before and after it.
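To check whether a given document contains this pattern before parsing it, something like the following might help. It is only a diagnostic sketch built on Python's standard-library `html.parser`, not the project's parser code; `NestedTableFinder` and `find_mixed_cells` are hypothetical names made up for this example.

```python
from html.parser import HTMLParser


class NestedTableFinder(HTMLParser):
    """Hypothetical helper: record cells holding both a nested <table> and text."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0   # number of currently open <table> elements
        self.cell_stack = []   # one dict per currently open <td>/<th>
        self.mixed_cells = []  # text fragments of each offending cell

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # A <table> opened directly inside the innermost open cell
            # marks that cell as containing a nested table.
            if self.cell_stack and self.cell_stack[-1]["depth"] == self.table_depth:
                self.cell_stack[-1]["has_table"] = True
            self.table_depth += 1
        elif tag in ("td", "th"):
            self.cell_stack.append(
                {"depth": self.table_depth, "has_table": False, "text": []}
            )

    def handle_endtag(self, tag):
        if tag == "table":
            self.table_depth -= 1
        elif tag in ("td", "th") and self.cell_stack:
            cell = self.cell_stack.pop()
            if cell["has_table"] and cell["text"]:
                self.mixed_cells.append(cell["text"])

    def handle_data(self, data):
        # Attribute text to the innermost open cell only while we are at
        # that cell's own table level; whitespace-only data is ignored.
        text = data.strip()
        if text and self.cell_stack and self.cell_stack[-1]["depth"] == self.table_depth:
            self.cell_stack[-1]["text"].append(text)


def find_mixed_cells(html):
    """Return the text fragments of every cell that also holds a nested table."""
    finder = NestedTableFinder()
    finder.feed(html)
    finder.close()
    return finder.mixed_cells
```

Run against the snippet above, this should report the last cell, whose text (the two names and phone numbers) surrounds the nested table. The sketch assumes reasonably well-formed markup with explicit closing tags for cells and tables.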

@lukehsiao lukehsiao changed the title Parser is not adding documents to database because table is parent of paragraph Parser does not handle nested tables which also have content Nov 18, 2018
@lukehsiao lukehsiao added bug Something isn't working help wanted Extra attention is required labels Nov 18, 2018