Extend Top Level Section detection to work for more 10-Q reports #48
Replies: 7 comments
-
This is a common issue in:
Almost all of them (RTX, TMO, MDT, KHC, GM) have the following pattern:
<div ...>
<span ...>
Item 6. Exhibits
</span>
<table ...>
...
</table>
</div>
UPS, with a small difference:
<div ...>
<span ...>
Item 6.
</span>
<span ...>
Exhibits
</span>
<table ...>
...
</table>
</div>
MCD has a major difference: the title "Item 6. Exhibits" is written inside the table:
<div>
<table>
<tr>
</tr>
<tr>
<td ...>
<div ...>
<span ...>
Item 6. Exhibits
</span>
</div>
</td>
</tr>
...
<tr>
</tr>
<tr>
</tr>
</table>
</div>
Screenshots for the two major cases:
Table at the same level as the top-level section title
Top-level section title written inside the table
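The three layouts above could be handled by looking at the text nodes rather than the tag structure. The following is only a minimal sketch using Python's standard `html.parser`, not sec-parser's actual implementation; `TextCollector`, `find_item_title`, and `ITEM_RE` are hypothetical names introduced here for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern: "Item 6." optionally followed by a title on the same text node.
ITEM_RE = re.compile(r"^item\s+(\d+[a-z]?)\.\s*(.*)$", re.IGNORECASE)


class TextCollector(HTMLParser):
    """Collect text chunks, remembering whether each one sits inside a <table>."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0
        self.chunks = []  # list of (text, inside_table)

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace and newlines
        if text:
            self.chunks.append((text, self.table_depth > 0))


def find_item_title(html):
    """Return the first 'Item N.' heading found, or None."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    for i, (text, inside_table) in enumerate(parser.chunks):
        match = ITEM_RE.match(text)
        if not match:
            continue
        number, title = match.group(1), match.group(2)
        # UPS-style: "Item 6." and "Exhibits" live in separate <span>s,
        # so take the title from the next text chunk.
        if not title and i + 1 < len(parser.chunks):
            title = parser.chunks[i + 1][0]
        return {"item": number, "title": title, "inside_table": inside_table}
    return None
```

The `inside_table` flag distinguishes the MCD-style layout (title inside the table) from the others, so a caller could decide whether the table needs to be split before the title is classified.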
-
Currently, where ~55% of text content is bold with some font weight. The Because of this
Solution: Change
In order to find a sweet spot, I checked whether there were more failures like this where the threshold needs to be decreased, but could not find any in the dataset. So, I think the best value as of now would be
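To make the "share of bold text versus threshold" idea concrete, here is a rough sketch of how such a share could be measured. It is an illustration under assumptions, not sec-parser's code: the names `bold_fraction` and `is_mostly_bold`, the font-weight cutoff of 600, and whatever threshold the caller passes in are all hypothetical.

```python
import re
from html.parser import HTMLParser

FONT_WEIGHT_RE = re.compile(
    r"font-weight\s*:\s*(bold|bolder|normal|lighter|\d+)", re.IGNORECASE
)
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


def explicit_bold(tag, attrs):
    """Return True/False if the tag sets boldness itself, None if it inherits."""
    if tag in ("b", "strong"):
        return True
    style = dict(attrs).get("style") or ""
    match = FONT_WEIGHT_RE.search(style)
    if not match:
        return None
    value = match.group(1).lower()
    if value.isdigit():
        return int(value) >= 600  # assumed cutoff for "bold"
    return value in ("bold", "bolder")


class BoldCounter(HTMLParser):
    """Count non-whitespace characters, split into bold and total."""

    def __init__(self):
        super().__init__()
        self.stack = []  # boldness of each currently open element
        self.bold_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        state = explicit_bold(tag, attrs)
        if state is None:  # inherit from the parent element
            state = self.stack[-1] if self.stack else False
        self.stack.append(state)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        n = len("".join(data.split()))
        self.total_chars += n
        if self.stack and self.stack[-1]:
            self.bold_chars += n


def bold_fraction(html):
    counter = BoldCounter()
    counter.feed(html)
    counter.close()
    if counter.total_chars == 0:
        return 0.0
    return counter.bold_chars / counter.total_chars


def is_mostly_bold(html, threshold):
    """threshold is a fraction in [0, 1]; its best value is dataset-dependent."""
    return bold_fraction(html) >= threshold
```

With a snippet whose bold share is 5 of 9 characters (about 55%), a threshold of 0.5 would accept it and a threshold of 0.6 would reject it, which is the kind of borderline case described above.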
-
Found a case where un-highlighted text can be a title too. (See Item 1)
-
Case 1: INTC
A. "Consolidated Condensed Statements of Income" instead of the traditional "Item 1. Financial Statements."
B. No "Item 3." in "Quantitative and Qualitative Disclosures About Market Risk"
C. This example does not have the "Part I - Financial Information" heading as text either. (Also the case in
D. It contains the "Form 10-Q Cross-Reference Index" section with hyperlinks to the actual sections.
Test:
Case 2: GE
Similar to case 1, the "FORM 10-Q CROSS REFERENCE INDEX" section is present, and there are no prefixes to section headings.
Test:
Case 3: JPM
The "Item 2. Management’s Discussion and Analysis of Financial Condition and Results of Operations" (as in its table of contents) is present in its section without any heading.
Test:
Case 4: MS
All top-level sections are present without
Test:
Case 5: HON
Missing "ITEM 1." prefix
Test:
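Several of the cases above drop the "Item N." prefix but keep the standard title. One possible fallback, sketched here under assumptions and not taken from sec-parser, is to match headings by title text against the standard Form 10-Q items. `TEN_Q_ITEMS`, `normalize`, and `match_section` are hypothetical names for illustration.

```python
import re

# Standard Form 10-Q items, keyed by (part, item).
TEN_Q_ITEMS = {
    ("I", "1"): "Financial Statements",
    ("I", "2"): "Management's Discussion and Analysis of Financial Condition "
                "and Results of Operations",
    ("I", "3"): "Quantitative and Qualitative Disclosures About Market Risk",
    ("I", "4"): "Controls and Procedures",
    ("II", "1"): "Legal Proceedings",
    ("II", "1A"): "Risk Factors",
    ("II", "2"): "Unregistered Sales of Equity Securities and Use of Proceeds",
    ("II", "3"): "Defaults Upon Senior Securities",
    ("II", "4"): "Mine Safety Disclosures",
    ("II", "5"): "Other Information",
    ("II", "6"): "Exhibits",
}

# Optional "Item 1." / "ITEM 1A:" style prefix at the start of a heading.
PREFIX_RE = re.compile(r"^\s*item\s+\d+[a-z]?\s*[.:]?\s*", re.IGNORECASE)


def normalize(text):
    """Strip the item prefix, unify apostrophes, drop punctuation and case."""
    text = PREFIX_RE.sub("", text)
    text = text.replace("\u2019", "'").lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return " ".join(text.split())


LOOKUP = {normalize(title): key for key, title in TEN_Q_ITEMS.items()}


def match_section(heading):
    """Return (part, item) for a heading with or without its prefix, else None."""
    return LOOKUP.get(normalize(heading))
```

This only covers headings that keep the standard wording (for example HON's missing "ITEM 1." prefix, or INTC's missing "Item 3."). A company-specific title such as "Consolidated Condensed Statements of Income" returns `None` and would need a different signal, such as the cross-reference index.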
Some Other Off-Topic Fails
1. Filing:
-
AA/000119312518236766
-
Current state with
-
Caution: There are cases where even sec-api.io fails to extract top-level sections correctly. More about this, with examples, is in this Discord thread.
-
What are "Top Level Sections"?
Top-level sections are needed in most 10-Q reports as required by the SEC. These sections are:
Introduction to Testing
We have set up generalization tests to test how certain features of our parser generalize across our document dataset.
Execute this command to run the tests:
You should see an output similar to this:
In this image, we see that 42 reports have their top-level sections identified correctly, while 30 reports have some issues.
Enabling a failing test
In the YAML file, uncomment a report (i.e. remove the # symbol from its line) to include it in the test. For example, to include "10-Q ABT Abbott Laboratories", uncomment the line in: https://github.com/alphanome-ai/sec-parser/blob/3078b5fd6e24f0dae796a76217bff25204b356af/tests/generalization/processing_steps/test_top_level_section_title_classifier.yaml#L4-L6
so it becomes:
Now, executing the test command:
again would include the failing test:
Helpful tools
You can use the UI dashboard to check the elements
Consider viewing the source code to better understand any problems. Go to "Show Contents" -> "HTML Code"