-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SFR-2500: Refactor DOAB mapping #558
Conversation
.gitignore
Outdated
launcher manifest.xml | ||
nosetests.xml | ||
coverage.xml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could do *.xml and !test-doab.xml if that works!
mappings/doab.py
Outdated
if identifiers is None or len(identifiers) == 0: | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice - this is a good check!
mappings/doab.py
Outdated
|
||
# Clean up links | ||
self.record.has_part = self.parseLinks() | ||
return [f'{author}|||true' for author in authors] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to add if author is not None or author == ''
to ensure we aren't adding empty strings
OAI_NAMESPACES = { | ||
'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/', | ||
'dc': 'http://purl.org/dc/elements/1.1/', | ||
'datacite': 'https://schema.datacite.org/meta/kernel-4.1/metadata.xsd', | ||
'oapen': 'http://purl.org/dc/elements/1.1/', | ||
'oaire': 'https://raw.githubusercontent.com/rcic/openaire4/master/schemas/4.0/oaire.xsd' | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These I think should be accessible form the DspaceService - so you could potentially grab them from there
assert parsed_record.source == Source.DOAB.value | ||
assert parsed_record.source_id == '20.500.12854/62823' | ||
assert parsed_record.title == 'A World of Nourishment' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice looks good!
mappings/doab.py
Outdated
identifiers= identifiers, | ||
authors=self._get_authors(doab_record, namespaces=namespaces), | ||
contributors=self._get_contributors(doab_record, namespaces=namespaces), | ||
title=title[0] if len(title) > 0 else None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm thoughts on not bringing the record in if there's no title?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, moved the check above with the identifiers
mappings/doab.py
Outdated
authors=self._get_authors(doab_record, namespaces=namespaces), | ||
contributors=self._get_contributors(doab_record, namespaces=namespaces), | ||
title=title[0] if len(title) > 0 else None, | ||
is_part_of=[f'{part}||series' for part in relations], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may want to add if part checks in these list comprehensions to ensure we aren't adding strings like 'None||series' or '||series'
tests/fixtures/test-doab.xml
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice - i like this idea of keeping the test files in a fixtures folder
services/sources/dspace_service.py
Outdated
@@ -64,8 +64,7 @@ def get_records(self, full_import=False, start_timestamp=None, offset: Optional[ | |||
|
|||
def parse_record(self, record): | |||
try: | |||
record = self.source_mapping(record, self.OAI_NAMESPACES, self.constants) | |||
record.applyMapping() | |||
record = self.source_mapping(record, self.OAI_NAMESPACES) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May just want to update this lhs variable name to parsed_record or the like to distinguish between record and parsed_record
mappings/doab.py
Outdated
authors=self._get_authors(doab_record, namespaces=namespaces), | ||
contributors=self._get_contributors(doab_record, namespaces=namespaces), | ||
title=title[0], | ||
is_part_of=[f'{part}||series' for part in relations if part is not None or part == ''], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops - i think this is my bad here - we should ensure it's not None and not empty.
I think if you simplify these if checks to just if x for example, that if statement checks the truthyness of the value x. So if x is either the empty string or null, then it evaluates to false.
If the value is in integer tho just a heads up that 0 will also by falsy but we do want to include those cases!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! just a couple last comments!
mappings/doab.py
Outdated
dates=self._get_dates(doab_record, namespaces=namespaces), | ||
languages=[f'||{language}' for language in languages if language is not None or language == ''], | ||
extent=[f'{extent} pages' for extent in extents if extent is not None or extent == ''], | ||
abstract=doab_record.xpath('./dc:description/text()', namespaces=namespaces), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may return a list - I think we should grab the first value if it exists
{"296 pages"}
should be 296 pages
mappings/doab.py
Outdated
dates=self._get_dates(doab_record, namespaces=namespaces), | ||
languages=[f'||{language}' for language in languages if language is not None or language == ''], | ||
extent=[f'{extent} pages' for extent in extents if extent is not None or extent == ''], | ||
abstract=doab_record.xpath('./dc:description/text()', namespaces=namespaces), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as below - when we call xpath - i believe it tries to pull all elements that match that xpath. So we should just grab the first 1
datacite_dates = self._get_text_type_data(record, namespaces, './datacite:date', '{}|||{}') | ||
dc_dates = self._get_text_type_data(record, namespaces, './dc:date', '{}|||{}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I recommend that if we do not know the date type, we may want to exclude it so
{2021-04-20T15:29:32Z|||,2021-04-20T15:29:32Z|||,2012|||Issued}
becomes
{,2012|||Issued}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated _get_text_type_data to do this since this also makes sense for the other fields we are calling it for (alternative identifiers and contributors)
extent=f'{extent[0]} pages' if extent else None, | ||
abstract=abstract[0] if abstract else None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice - i like this - especially because it'll check if the array is both not none and not empty
Description
Testing
python main.py -p DOABProcess -e local -r "20.500.12854/62823"