Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SFR-2500: Refactor DOAB mapping #558

Merged
merged 12 commits into from
Feb 7, 2025
Merged

Conversation

jackiequach
Copy link
Contributor

@jackiequach jackiequach commented Feb 5, 2025

Description

  • Refactors and simplifies DOAB mapping to use functional programming
  • Adds data fields that were missing in the previous mapping.
    • Unsure if datacite was completely replaced by dc in DOAB records. I looked on the DOAB DSpace and for the records I checked, ~10, datacite is not being used for many of the fields. As a result, I left the datacite field retrieval in case there are still records with those fields
  • Updates .gitignore to allow test file for the functional test
  • Removes unit tests for DOABMapping

Testing

python main.py -p DOABProcess -e local -r "20.500.12854/62823"

@jackiequach jackiequach changed the title [WIP] SFR-2500: Refactor DOAB mapping SFR-2500: Refactor DOAB mapping Feb 6, 2025
@jackiequach jackiequach marked this pull request as ready for review February 6, 2025 19:19
.gitignore Outdated
Comment on lines 6 to 8
launcher manifest.xml
nosetests.xml
coverage.xml
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could do *.xml and !test-doab.xml if that works!

mappings/doab.py Outdated
Comment on lines 23 to 24
if identifiers is None or len(identifiers) == 0:
return None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice - this is a good check!

mappings/doab.py Outdated Show resolved Hide resolved
mappings/doab.py Outdated

# Clean up links
self.record.has_part = self.parseLinks()
return [f'{author}|||true' for author in authors]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to add if author is not None or author == '' to ensure we aren't adding empty strings

Comment on lines 5 to 11
OAI_NAMESPACES = {
'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/',
'dc': 'http://purl.org/dc/elements/1.1/',
'datacite': 'https://schema.datacite.org/meta/kernel-4.1/metadata.xsd',
'oapen': 'http://purl.org/dc/elements/1.1/',
'oaire': 'https://raw.githubusercontent.com/rcic/openaire4/master/schemas/4.0/oaire.xsd'
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These I think should be accessible form the DspaceService - so you could potentially grab them from there

Comment on lines +22 to +24
assert parsed_record.source == Source.DOAB.value
assert parsed_record.source_id == '20.500.12854/62823'
assert parsed_record.title == 'A World of Nourishment'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice looks good!

mappings/doab.py Outdated
identifiers= identifiers,
authors=self._get_authors(doab_record, namespaces=namespaces),
contributors=self._get_contributors(doab_record, namespaces=namespaces),
title=title[0] if len(title) > 0 else None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm thoughts on not bringing the record in if there's no title?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, moved the check above with the identifiers

mappings/doab.py Outdated
authors=self._get_authors(doab_record, namespaces=namespaces),
contributors=self._get_contributors(doab_record, namespaces=namespaces),
title=title[0] if len(title) > 0 else None,
is_part_of=[f'{part}||series' for part in relations],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may want to add if part checks in these list comprehensions to ensure we aren't adding strings like 'None||series' or '||series'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice - i like this idea of keeping the test files in a fixtures folder

@@ -64,8 +64,7 @@ def get_records(self, full_import=False, start_timestamp=None, offset: Optional[

def parse_record(self, record):
try:
record = self.source_mapping(record, self.OAI_NAMESPACES, self.constants)
record.applyMapping()
record = self.source_mapping(record, self.OAI_NAMESPACES)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May just want to update this lhs variable name to parsed_record or the like to distinguish between record and parsed_record

mappings/doab.py Outdated
authors=self._get_authors(doab_record, namespaces=namespaces),
contributors=self._get_contributors(doab_record, namespaces=namespaces),
title=title[0],
is_part_of=[f'{part}||series' for part in relations if part is not None or part == ''],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops - i think this is my bad here - we should ensure it's not None and not empty.

I think if you simplify these if checks to just if x for example, that if statement checks the truthyness of the value x. So if x is either the empty string or null, then it evaluates to false.

If the value is in integer tho just a heads up that 0 will also by falsy but we do want to include those cases!

Copy link
Contributor

@kylevillegas93 kylevillegas93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! just a couple last comments!

@kylevillegas93 kylevillegas93 self-requested a review February 7, 2025 14:34
mappings/doab.py Outdated
dates=self._get_dates(doab_record, namespaces=namespaces),
languages=[f'||{language}' for language in languages if language is not None or language == ''],
extent=[f'{extent} pages' for extent in extents if extent is not None or extent == ''],
abstract=doab_record.xpath('./dc:description/text()', namespaces=namespaces),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may return a list - I think we should grab the first value if it exists

{"296 pages"} should be 296 pages

mappings/doab.py Outdated
dates=self._get_dates(doab_record, namespaces=namespaces),
languages=[f'||{language}' for language in languages if language is not None or language == ''],
extent=[f'{extent} pages' for extent in extents if extent is not None or extent == ''],
abstract=doab_record.xpath('./dc:description/text()', namespaces=namespaces),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as below - when we call xpath - i believe it tries to pull all elements that match that xpath. So we should just grab the first 1

Comment on lines +114 to +115
datacite_dates = self._get_text_type_data(record, namespaces, './datacite:date', '{}|||{}')
dc_dates = self._get_text_type_data(record, namespaces, './dc:date', '{}|||{}')
Copy link
Contributor

@kylevillegas93 kylevillegas93 Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I recommend that if we do not know the date type, we may want to exclude it so

{2021-04-20T15:29:32Z|||,2021-04-20T15:29:32Z|||,2012|||Issued}

becomes

{,2012|||Issued}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated _get_text_type_data to do this since this also makes sense for the other fields we are calling it for (alternative identifiers and contributors)

Comment on lines +45 to +46
extent=f'{extent[0]} pages' if extent else None,
abstract=abstract[0] if abstract else None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice - i like this - especially because it'll check if the array is both not none and not empty

@jackiequach jackiequach merged commit 30854df into main Feb 7, 2025
1 check passed
@jackiequach jackiequach deleted the SFR-2500/refactor-doab-mapping branch February 7, 2025 15:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants