SFR-2500: Refactor DOAB mapping #558

jackiequach · 2025-02-05T21:42:05Z

Description

- Refactors and simplifies the DOAB mapping to use functional programming.
- Adds data fields that were missing from the previous mapping.
  - It is unclear whether datacite has been completely replaced by dc in DOAB records. In the ~10 records I checked on the DOAB DSpace, datacite is not used for many of the fields, so I left the datacite field retrieval in place in case some records still have those fields.
- Updates .gitignore to allow the test file for the functional test.
- Removes the unit tests for DOABMapping.

Testing

python main.py -p DOABProcess -e local -r "20.500.12854/62823"

kylevillegas93 · 2025-02-06T21:10:29Z

.gitignore

+launcher manifest.xml
+nosetests.xml
+coverage.xml


we could do *.xml and !test-doab.xml if that works!

kylevillegas93 · 2025-02-06T21:10:51Z

mappings/doab.py

+        if identifiers is None or len(identifiers) == 0:
+            return None


Nice - this is a good check!

kylevillegas93 · 2025-02-06T21:12:16Z

mappings/doab.py


-        # Clean up links
-        self.record.has_part = self.parseLinks()
+        return [f'{author}|||true' for author in authors]


We may want to add if author is not None or author == '' to ensure we aren't adding empty strings

kylevillegas93 · 2025-02-06T21:14:33Z

tests/functional/mappings/test_doab.py

+OAI_NAMESPACES = {
+    'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/',
+    'dc': 'http://purl.org/dc/elements/1.1/',
+    'datacite': 'https://schema.datacite.org/meta/kernel-4.1/metadata.xsd',
+    'oapen': 'http://purl.org/dc/elements/1.1/',
+    'oaire': 'https://raw.githubusercontent.com/rcic/openaire4/master/schemas/4.0/oaire.xsd'
+}


I think these should be accessible from the DspaceService - so you could potentially grab them from there

kylevillegas93 · 2025-02-06T21:15:00Z

tests/functional/mappings/test_doab.py

+        assert parsed_record.source == Source.DOAB.value
+        assert parsed_record.source_id == '20.500.12854/62823'
+        assert parsed_record.title == 'A World of Nourishment'


Nice looks good!

kylevillegas93 · 2025-02-06T21:17:37Z

mappings/doab.py

+            identifiers= identifiers,
+            authors=self._get_authors(doab_record, namespaces=namespaces),
+            contributors=self._get_contributors(doab_record, namespaces=namespaces),
+            title=title[0] if len(title) > 0 else None,


hmm thoughts on not bringing the record in if there's no title?

jackiequach · 2025-02-06

Good idea, moved the check above with the identifiers.
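
A minimal sketch of the guard discussed here, assuming a hypothetical `map_record` helper that receives already-extracted `identifiers` and `title` lists. The actual function in mappings/doab.py reads these from the XML record and may be structured differently.

```python
from typing import Optional


def map_record(identifiers: Optional[list], title: Optional[list]) -> Optional[dict]:
    """Hypothetical stand-in for the DOAB mapping function."""
    # Skip the record entirely when it has no identifiers or no title.
    # `not x` covers both None and an empty list.
    if not identifiers or not title:
        return None

    # With the guard above, title[0] is safe to read unconditionally.
    return {'identifiers': identifiers, 'title': title[0]}
```

Checking both fields up front means the rest of the mapping does not need a conditional such as `title[0] if len(title) > 0 else None`.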

kylevillegas93 · 2025-02-06T21:19:10Z

mappings/doab.py

+            authors=self._get_authors(doab_record, namespaces=namespaces),
+            contributors=self._get_contributors(doab_record, namespaces=namespaces),
+            title=title[0] if len(title) > 0 else None,
+            is_part_of=[f'{part}||series' for part in relations],


You may want to add if part checks in these list comprehensions to ensure we aren't adding strings like 'None||series' or '||series'

kylevillegas93 · 2025-02-07T14:18:10Z

tests/fixtures/test-doab.xml

Nice - i like this idea of keeping the test files in a fixtures folder

kylevillegas93 · 2025-02-07T14:19:34Z

services/sources/dspace_service.py

@@ -64,8 +64,7 @@ def get_records(self, full_import=False, start_timestamp=None, offset: Optional[

    def parse_record(self, record):
        try:
-            record = self.source_mapping(record, self.OAI_NAMESPACES, self.constants)
-            record.applyMapping()
+            record = self.source_mapping(record, self.OAI_NAMESPACES)


May just want to update this lhs variable name to parsed_record or the like to distinguish between record and parsed_record

kylevillegas93 · 2025-02-07T14:23:23Z

mappings/doab.py

+            authors=self._get_authors(doab_record, namespaces=namespaces),
+            contributors=self._get_contributors(doab_record, namespaces=namespaces),
+            title=title[0],
+            is_part_of=[f'{part}||series' for part in relations if part is not None or part == ''],


oops - i think this is my bad here - we should ensure it's not None and not empty.

I think if you simplify these if checks to just if x, for example, that if statement checks the truthiness of the value x. So if x is either the empty string or None, then it evaluates to false.

If the value is an integer tho, just a heads up that 0 will also be falsy, but we do want to include those cases!
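
A small sketch of the difference between the two checks, using made-up sample values (`relations` and `counts` here are illustrative, not data from a DOAB record):

```python
relations = ['Series A', None, '', 'Series B']

# `part is not None or part == ''` is also True for '', so empty strings slip through.
loose = [f'{part}||series' for part in relations if part is not None or part == '']

# A plain truthiness check drops both None and ''.
strict = [f'{part}||series' for part in relations if part]

# When 0 is a legitimate value, test for None and '' explicitly instead,
# because `if value` would also drop 0.
counts = [0, None, '', 12]
kept = [value for value in counts if value is not None and value != '']
```

Here `loose` still contains '||series', while `strict` contains only the two real series entries and `kept` retains the 0.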

kylevillegas93

Looks good! just a couple last comments!

kylevillegas93 · 2025-02-07T14:34:46Z

mappings/doab.py

+            dates=self._get_dates(doab_record, namespaces=namespaces),
+            languages=[f'||{language}' for language in languages if language is not None or language == ''],
+            extent=[f'{extent} pages' for extent in extents if extent is not None or extent == ''],
+            abstract=doab_record.xpath('./dc:description/text()', namespaces=namespaces),


This may return a list - I think we should grab the first value if it exists

{"296 pages"} should be 296 pages

kylevillegas93 · 2025-02-07T14:37:13Z

mappings/doab.py

+            dates=self._get_dates(doab_record, namespaces=namespaces),
+            languages=[f'||{language}' for language in languages if language is not None or language == ''],
+            extent=[f'{extent} pages' for extent in extents if extent is not None or extent == ''],
+            abstract=doab_record.xpath('./dc:description/text()', namespaces=namespaces),


Same as my other comment on these lines - when we call xpath, i believe it pulls all elements that match that xpath, so we should just grab the first one
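
A rough illustration of why the first value needs to be picked out. The PR uses an `.xpath(...)` call on the record; this sketch swaps in the standard library's `xml.etree.ElementTree.findall` so it runs without extra dependencies. The XML snippet, the `first_or_none` helper, and the use of `dc:format` as an absent element are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {'dc': 'http://purl.org/dc/elements/1.1/'}

XML = """
<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:description>First abstract</dc:description>
  <dc:description>Second abstract</dc:description>
</record>
"""


def first_or_none(values):
    """Return the first item of a list, or None when the list is None or empty."""
    return values[0] if values else None


record = ET.fromstring(XML.strip())

# The query returns every matching element, so the result is always a list.
descriptions = [el.text for el in record.findall('./dc:description', NAMESPACES)]
# An element that is not present yields an empty list rather than None.
missing = [el.text for el in record.findall('./dc:format', NAMESPACES)]

abstract = first_or_none(descriptions)
absent_value = first_or_none(missing)
```

Assigning `descriptions` directly to a scalar field would store the whole list, which is how a value like {"296 pages"} ends up in the record instead of 296 pages.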

kylevillegas93 · 2025-02-07T14:41:19Z

mappings/doab.py

+        datacite_dates = self._get_text_type_data(record, namespaces, './datacite:date', '{}|||{}')
+        dc_dates = self._get_text_type_data(record, namespaces, './dc:date', '{}|||{}')


Here I recommend excluding a date if we do not know its type, so

{2021-04-20T15:29:32Z|||,2021-04-20T15:29:32Z|||,2012|||Issued}

becomes

{2012|||Issued}

jackiequach · 2025-02-07

Updated _get_text_type_data to do this, since it also makes sense for the other fields we call it for (alternative identifiers and contributors).
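
A sketch of the filtering behaviour described above. The body of `_get_text_type_data` is not shown in this thread, so `format_typed_values` below is a hypothetical helper that works on already-extracted (text, type) pairs rather than on the XML record.

```python
from typing import List, Optional, Tuple


def format_typed_values(
    pairs: List[Tuple[Optional[str], Optional[str]]],
    template: str,
) -> List[str]:
    """Hypothetical helper: format (text, type) pairs, skipping incomplete ones."""
    # Keep a pair only when both the text and its type are present and non-empty.
    return [
        template.format(text, value_type)
        for text, value_type in pairs
        if text and value_type
    ]


dates = [
    ('2021-04-20T15:29:32Z', None),  # no type attribute
    ('2021-04-20T15:29:32Z', ''),    # empty type attribute
    ('2012', 'Issued'),
]

typed_dates = format_typed_values(dates, '{}|||{}')
```

With the sample pairs, only the date that carries a type survives, which matches the example in the comment above.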

kylevillegas93 · 2025-02-07T15:47:44Z

mappings/doab.py

+            extent=f'{extent[0]} pages' if extent else None,
+            abstract=abstract[0] if abstract else None,


Nice - i like this - especially because it'll check if the array is both not none and not empty

jackiequach added 6 commits February 5, 2025 16:40

- 8f424ae: initial refactor for doab mapping
- f989f8c: fix identifiers, rights, and title fields
- 93d8ddd: update has_part parsing
- a75587c: account for missing dc values from mapping and misc clean up
- 75a91f2: integrate refactor with the ingest process
- 084dba1: add functional test and remove unit test for mapping

jackiequach changed the title from "[WIP] SFR-2500: Refactor DOAB mapping" to "SFR-2500: Refactor DOAB mapping" Feb 6, 2025

jackiequach marked this pull request as ready for review February 6, 2025 19:19

jackiequach requested review from mitri-slory and kylevillegas93 February 6, 2025 19:20

kylevillegas93 reviewed Feb 6, 2025

jackiequach added 4 commits February 6, 2025 16:46

- 3b89758: address comments and minor cleanup
- 777aa44: Merge branch 'main' into SFR-2500/refactor-doab-mapping
- 023b518: update gitignore
- a92c1be: update integration test

kylevillegas93 reviewed Feb 7, 2025

kylevillegas93 approved these changes Feb 7, 2025

kylevillegas93 self-requested a review February 7, 2025 14:34

kylevillegas93 reviewed Feb 7, 2025

jackiequach added 2 commits February 7, 2025 10:38

- b03b72a: address comments
- a03da54: Merge branch 'main' into SFR-2500/refactor-doab-mapping

kylevillegas93 reviewed Feb 7, 2025

kylevillegas93 approved these changes Feb 7, 2025

jackiequach merged commit 30854df into main Feb 7, 2025
1 check passed

jackiequach deleted the SFR-2500/refactor-doab-mapping branch February 7, 2025 15:56
