SFR-2292: Skip clustering records with no title #423

Merged Oct 30, 2024 (1 commit)

kylevillegas93
Contributor

Description

  • Clustering fails if we try to tokenize a null title
  • This change skips trying to cluster records without a title
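A minimal sketch of the behavior described above, not the code from this PR. The `tokenize_title` helper, the `titles_to_cluster` function, and the `(id, title)` tuple shape are hypothetical and exist only for illustration. Only the guard itself (warn, then `continue`) follows the change under review.

```python
import logging
import re

logger = logging.getLogger(__name__)


def tokenize_title(title):
    # Hypothetical tokenizer: lowercase the title and split it into word tokens.
    # Calling this with None raises AttributeError, which is the kind of
    # failure the description refers to.
    return set(re.findall(r'\w+', title.lower()))


def titles_to_cluster(matched_records):
    """Tokenize titles for clustering, skipping records that have no title."""
    tokenized = []
    for matched_record_id, matched_record_title in matched_records:
        # Guard: a missing (None) or empty title cannot be tokenized, so log and skip.
        if not matched_record_title:
            logger.warning(f'Matched record with id {matched_record_id} has no title')
            continue
        tokenized.append((matched_record_id, tokenize_title(matched_record_title)))
    return tokenized


# Illustrative input (made-up ids and titles): one titled record, one None, one empty.
records = [(1, 'Moby Dick'), (2, None), (3, '')]
result = titles_to_cluster(records)
```

Because the check is `if not matched_record_title`, it treats an empty string the same way as `None`, so both kinds of record are skipped rather than passed to the tokenizer.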

Testing

python main.py -p ClusterProcess -e local-qa -i single -r 1ad13bf1-e48b-4f8f-8868-457d98b302a8

{"timestamp":1730226129848,"message":"Starting process ClusterProcess in local-qa","log.level":"INFO","logger.name":"__main__","thread.id":140704663367552,"thread.name":"MainThread","process.id":13466,"process.name":"MainProcess","file.name":"/Users/kylejacksonvillegas/workspace/drb-etl-pipeline-venv/drb-etl-pipeline/main.py","line.number":32,"entity.type":"SERVICE"}
{"timestamp":1730226132070,"message":"Clustering 1ad13bf1-e48b-4f8f-8868-457d98b302a8","log.level":"INFO","logger.name":"processes.cluster","thread.id":140704663367552,"thread.name":"MainThread","process.id":13466,"process.name":"MainProcess","file.name":"/Users/kylejacksonvillegas/workspace/drb-etl-pipeline-venv/drb-etl-pipeline/processes/cluster.py","line.number":71,"entity.type":"SERVICE"}
{"timestamp":1730226132524,"message":"Matched record with id 36624282 has no title","log.level":"WARNING","logger.name":"processes.cluster","thread.id":140704663367552,"thread.name":"MainThread","process.id":13466,"process.name":"MainProcess","file.name":"/Users/kylejacksonvillegas/workspace/drb-etl-pipeline-venv/drb-etl-pipeline/processes/cluster.py","line.number":178,"entity.type":"SERVICE"}
{"timestamp":1730226182016,"message":"Clustered 1 works","log.level":"INFO","logger.name":"processes.cluster","thread.id":140704663367552,"thread.name":"MainThread","process.id":13466,"process.name":"MainProcess","file.name":"/Users/kylejacksonvillegas/workspace/drb-etl-pipeline-venv/drb-etl-pipeline/processes/cluster.py","line.number":101,"entity.type":"SERVICE"}

In this case, one of our NYPL bibs caused clustering to fail.

Comment on lines +176 to +178
if not matched_record_title:
    logger.warning(f'Matched record with id {matched_record_id} has no title')
    continue

A record with no title should not be clustered, so it's great that this is noted.

Contributor

mitri-slory left a comment

Great error to catch in the cluster records.

kylevillegas93 merged commit 66faee2 into main on Oct 30, 2024
1 check passed