Commit 4e44342

committed
Addressing issues
1 parent 4c7492f commit 4e44342

1 file changed

+25
-16
lines changed

doc/adr/0021-xml-with-no-docx.md

+25-16
@@ -1,4 +1,4 @@
1-
# 21. XML with no docx
1+
# 21. XML format for documents where source text cannot be extracted by the parser
22

33
Date: 2024-11-20
44

@@ -8,7 +8,7 @@ Draft
88

99
## Context
1010

11-
We need to support documents which do not have a source DOCX. They will typically be PDF-only, although we're imagining there might be other formats (zip files full of jpegs).
11+
We need to support documents where the structured contents of the document cannot necessarily be extracted by the parser. They will typically be PDF-only, although we're imagining there might be other formats (such as zip files full of jpegs).
1212

1313
These documents might not have neutral citation numbers -- this is outside the scope of this document.
1414

@@ -24,43 +24,52 @@ The emitted XML will validate against our modified AkomaNtoso schema
2424

2525
Metadata is provided to the parser via spreadsheets. We will want to preserve some of this data.
2626

27-
The existing tag names (see the appendix) will be used inside a `uk:external-metadata` tag in the `proprietary` tag to allow this metadata to be searched at the same time as other documents that have these tags in the running body of the judgment. Values MUST NOT be deliberately replicated across `external-metadata` and other places, but MAY be replicated if the text body is successfully parsed and the string appeared in externally provided data sources.
27+
The existing tag names (see the appendix) will be used inside a `uk:external-metadata` tag in the `proprietary` tag to allow this metadata to be searched at the same time as other documents that have these tags in the running body of the judgment; however, `akn:` namespaced tags are forbidden, so the `uk:` namespace MUST be used instead.
2828

29-
Editors SHOULD be able to edit the external metadata and the Editor UI have affordances to do so.
29+
Values MUST NOT be deliberately replicated across `external-metadata` and other places, but MAY be replicated if the text body is successfully parsed and the string appeared in externally provided data sources.
3030

31-
### Marked as Low Quality
31+
Editors SHOULD be able to edit these metadata values within the `proprietary` tag, and the Editor UI SHOULD have affordances to do so.
3232

33-
The `proprietary` tag will contain a `uk:source-document` tag, with attributes:
33+
### Marked as Low Quality
3434

35-
TODO: I think this part wants tearing apart carefully.
35+
The `proprietary` tag will contain a `uk:source-document` tag, containing two fractional attributes:
3636

37-
Each of these metrics defaults to `1`: we have no reason to doubt the quality of the document. Values of `0` imply a complete lack of quality, or an absence of that feature. Values below `0.5` should be considered poor, and mitigations applied. Values above `1` might be signs of additional verification.
37+
Each of these numerical metrics defaults to `1.0`: we have no reason to doubt the quality of the document. Values of `0.0` imply a complete lack of quality, or an absence of that feature. Values below `0.5` should be considered poor, and mitigations applied. Values above `1.0` might be signs of additional verification.
3838

3939
- `uk:markup-quality-score`: Is the body of the document marked up well with appropriate HTML and AkomaNtoso tags?
4040

4141
- `uk:text-quality-score`: Are all the words believed to be in the right order, with appropriate interword spacing? (Linebreaks optional)
4242

43-
- `uk:parsed-format`: default `docx`, even if the docx isn't original. `pdf` an obvious value.
43+
There is also:
4444

45-
The tag SHALL contain human-readable text warning that the XML is low quality and should not be relied upon.
45+
- `uk:source-document-format`: the MIME type of the document from which this XML was generated. For Rich Text
46+
documents, and others where the document was converted to docx first, the MIME type will still be `application/vnd.openxmlformats-officedocument.wordprocessingml.document`.
4647

47-
### Purposes for Low Quality Marks
48+
- `uk:markup-human-reviewed`: boolean -- An editor has reviewed the document. This will not be added by any parser, but may be added in the EUI.
4849

49-
#### Deprioritised in Search
50+
- `uk:quality-warning`: Human readable text warning that the XML is low quality and should not be relied upon.
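
A minimal sketch of how a consumer might read the two quality scores described above, applying the stated default of `1.0` and the `0.5` "poor" threshold. The namespace URI, the sample XML, and the function names `quality_scores` and `is_poor` are assumptions made for illustration; the ADR text does not fix the namespace URI or any API.

```python
import xml.etree.ElementTree as ET

# Hypothetical placeholder: the real URI bound to the `uk:` prefix is not given in the ADR.
UK_NS = "https://example.org/uk"
POOR_THRESHOLD = 0.5  # "Values below 0.5 should be considered poor"

# Illustrative sample only; not taken from a real document.
SAMPLE = f"""
<proprietary xmlns:uk="{UK_NS}">
  <uk:source-document uk:markup-quality-score="0.2" uk:text-quality-score="0.9"/>
</proprietary>
"""

def quality_scores(proprietary):
    """Return both scores, falling back to the documented default of 1.0."""
    node = proprietary.find(f"{{{UK_NS}}}source-document")
    scores = {}
    for name in ("markup-quality-score", "text-quality-score"):
        raw = None if node is None else node.get(f"{{{UK_NS}}}{name}")
        scores[name] = 1.0 if raw is None else float(raw)
    return scores

def is_poor(scores):
    """True if either score falls below the threshold, so mitigations apply."""
    return any(value < POOR_THRESHOLD for value in scores.values())

scores = quality_scores(ET.fromstring(SAMPLE))
print(scores, is_poor(scores))
```

Treating a missing tag or attribute as `1.0` follows the stated default: there is no reason to doubt a document that carries no score.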
5051

51-
Since the risk of documents having bad search experiences due to broken OCR / text flow issues, documents with low quality scores should be deprioritised in search.
52+
### Purposes for Low Quality Marks
5253

5354
#### Rejected for Impossible Tasks
5455

55-
Documents with no DOCX cannot be reparsed, so SHOULD be excluded from the list of reparse candidates, SHOULD not be reparsable in the UI and MUST not be sent for reparsing.
56+
Documents with no DOCX cannot be reparsed via the existing framework, so SHOULD be excluded from the list of reparse candidates, SHOULD NOT be reparsable in the UI and MUST NOT be sent for reparsing.
57+
58+
Documents without text SHOULD NOT be sent to enrichment.
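
The two rules above could be expressed as simple eligibility checks. This is a sketch under assumptions: `DocumentInfo`, `can_reparse`, and `can_enrich` are hypothetical names, and it assumes "no DOCX" can be detected by comparing the `uk:source-document-format` value against the docx MIME type.

```python
from dataclasses import dataclass

DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

@dataclass
class DocumentInfo:
    """Hypothetical summary of a document, for illustration only."""
    source_format: str  # assumed to hold the uk:source-document-format value
    has_text: bool      # whether any body text was extracted

def can_reparse(doc: DocumentInfo) -> bool:
    # Only documents parsed from a DOCX can go back through the existing framework.
    return doc.source_format == DOCX_MIME

def can_enrich(doc: DocumentInfo) -> bool:
    # Documents without text SHOULD NOT be sent to enrichment.
    return doc.has_text

pdf_only = DocumentInfo(source_format="application/pdf", has_text=False)
print(can_reparse(pdf_only), can_enrich(pdf_only))
```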
5659

5760
#### Hidden Information
5861

59-
The Public UI WILL NOT reveal the body of the XML, including HTML transforms, to users who have not opted in to receiving a poor quality document. It may reveal suitable subsections.
62+
The Atom Feed MAY allow searching for only documents with sufficient quality scores.
63+
64+
The Public UI SHOULD NOT attempt to display XML or HTML transforms of XML to users where the text and/or markup quality scores are insufficient. (0.5 is probably the threshold)
65+
66+
XML representations MAY be available to users who explicitly request them but SHOULD NOT be routinely displayed
67+
in the PUI.
6068

6169
#### Warning the user
6270

63-
The PUI MAY flag that a document exists only as a PDF so they are not surprised by the absence of a HTML version.
71+
The Public UI SHOULD indicate to users that a document is not available as HTML and instead is only available
72+
as the original source document (or other compiled artifact in the future).
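
One way the Public UI display rules in the two sections above might combine is sketched here. The `Presentation` enum and `choose_presentation` function are hypothetical, and the `0.5` threshold is the tentative value mentioned in the text, not a settled one.

```python
from enum import Enum

class Presentation(Enum):
    """Hypothetical outcomes for how the PUI presents a document."""
    HTML = "html"                      # normal HTML transform of the XML
    XML_ON_REQUEST = "xml-on-request"  # user explicitly asked for the XML
    SOURCE_ONLY = "source-only"        # show a notice plus the original source document

def choose_presentation(markup_score, text_score, xml_requested=False, threshold=0.5):
    # Both scores must be sufficient before XML or HTML transforms are displayed.
    if markup_score >= threshold and text_score >= threshold:
        return Presentation.HTML
    # Low-quality XML MAY be served, but only on explicit request.
    if xml_requested:
        return Presentation.XML_ON_REQUEST
    # Otherwise indicate that only the original source document is available.
    return Presentation.SOURCE_ONLY

print(choose_presentation(1.0, 0.3))
```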
6473

6574
### Mediocre-Effort Text
6675
