Repair BDC pipeline runs with forceRefresh=False #235

Closed
Tims777 opened this issue Feb 3, 2024 · 6 comments · Fixed by #239
Labels
bugfix Something isn't working
Comments

@Tims777
Contributor

Tims777 commented Feb 3, 2024

When running the pipeline run_all_steps.json with forceRefresh set to false everywhere, several steps fail with errors. These need to be fixed, or the affected pipeline steps should be removed from the pipeline.

List of errors

Ordered by severity

  • at the end of regional atlas step: | ERROR | pipeline.py:57 | Step Regional_Atlas failed! Columns must be same length as key
  • all the time in GPT and insights enhancer steps: | ERROR | s3_repository.py:212 | Error loading review from S3 with id ChIJkdTnnsMzs1IRlCF2m6bKYsU. Error: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist. (might indicate a problem with the Google step)
  • address scraping hangs: Getting addresses from custom domains...: 47%|████████████████████████▏ | 241/518 [17:00<19:32, 4.23s/it]

Note

The current pipeline run_all_steps.json should be changed to have forceRefresh: false set everywhere. The current configuration can optionally be copied to a new pipeline config force_refresh_all_steps.json.
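
A minimal sketch of how this config change could be scripted, assuming the pipeline config is plain JSON and the flag is spelled `forceRefresh` wherever it appears. The sample layout (`steps`, `name`, `init_args`) is hypothetical and only serves the illustration; the real config may be structured differently.

```python
import json


def set_force_refresh(node, value):
    """Return a copy of a parsed JSON config with every forceRefresh key set to value."""
    if isinstance(node, dict):
        return {
            key: value if key == "forceRefresh" else set_force_refresh(child, value)
            for key, child in node.items()
        }
    if isinstance(node, list):
        return [set_force_refresh(item, value) for item in node]
    return node  # strings, numbers, booleans, None are returned unchanged


# Hypothetical config layout, for illustration only.
sample_config = {
    "steps": [
        {"name": "Step_A", "forceRefresh": True},
        {"name": "Step_B", "init_args": {"forceRefresh": True}},
    ]
}

print(json.dumps(set_force_refresh(sample_config, False), indent=2))
```

Because the function returns a new structure and leaves its input untouched, the original configuration can be written out unchanged as force_refresh_all_steps.json before the modified version replaces run_all_steps.json.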

Acceptance Criteria

  • It is possible to run the run_all_steps.json with forceRefresh set to false everywhere
    • All steps complete successfully (i.e. their output will appear in the enriched.csv file)
    • If steps cannot be repaired easily, they should be excluded from the pipeline and the problem should be documented
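
One way the "output appears in enriched.csv" criterion could be checked is sketched below. It assumes each step writes at least one known column; the column names used here are hypothetical placeholders, not the pipeline's actual output columns.

```python
import csv
import io


def missing_step_columns(csv_file, expected_columns):
    """Return the expected columns that are absent or empty in every row."""
    reader = csv.DictReader(csv_file)
    filled = set()
    for row in reader:
        for column in expected_columns:
            if row.get(column):  # present in the header and non-empty in this row
                filled.add(column)
    return [column for column in expected_columns if column not in filled]


# Tiny in-memory stand-in for enriched.csv (hypothetical columns).
sample = io.StringIO(
    "name,step_a_value,step_b_value\n"
    "Lead 1,42,\n"
    "Lead 2,17,\n"
)
print(missing_step_columns(sample, ["step_a_value", "step_b_value", "step_c_value"]))
```

A column counts as missing both when the step never wrote it and when it exists but is empty for all rows, which is how a silently failing step would typically show up.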
@Tims777 Tims777 converted this from a draft issue Feb 3, 2024
@Tims777 Tims777 added this to the Demo Day milestone Feb 3, 2024
@Tims777 Tims777 added the bugfix Something isn't working label Feb 3, 2024
@Tims777 Tims777 changed the title Repair BDC pipeline Repair BDC pipeline runs with forceRefresh=False Feb 3, 2024
@luccalb luccalb self-assigned this Feb 5, 2024
@luccalb
Collaborator

luccalb commented Feb 5, 2024

The issue with the Regional_Atlas step seems to be related to the hashtables: the error only occurs on the first run, when no hashtables exist yet.

@luccalb
Collaborator

luccalb commented Feb 5, 2024

The address scraping step was a very early experiment and is not "production ready". It was never meant to end up in the final pipeline, as we get the address from Google. I'll create a special demo pipeline config for the BDC.

@luccalb
Collaborator

luccalb commented Feb 5, 2024

The GPT Errors look worse than they are. It just means that the data was not present in the cache files. I adjusted the error logging.

@Tims777
Contributor Author

Tims777 commented Feb 5, 2024

> The GPT Errors look worse than they are. It just means that the data was not present in the cache files. I adjusted the error logging.

Good to know. However, the error message is still appearing:

Running sentiment analysis on reviews:  78%|████████████████████████████████████████████████████████████                 | 928/1190
[04:11<02:55,  1.49it/s]2024-02-05 22:20:13,197 |    ERROR | s3_repository.py:212 | Error loading review from S3 with id ChIJFzDNMdC_xkcRmSfqst9o1g8.
Error: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.

@luccalb
Collaborator

luccalb commented Feb 6, 2024

This seems to be S3 specific, I'll check again.

@Tims777 Tims777 moved this from Sprint Backlog to In Progress in amos2023ws06-feature-board Feb 6, 2024
@luccalb
Collaborator

luccalb commented Feb 6, 2024

It's just ungraceful error handling. When a Google place has no reviews, we don't save any to S3. The sentiment analyzer simply assumes where the review file should be and can't find it. The sentiment score will be None in that case.
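
A minimal sketch of what more graceful handling could look like, not the project's actual code. It assumes a boto3-style client whose `get_object` raises an exception carrying the S3 error code in `err.response["Error"]["Code"]` (as botocore's `ClientError` does). The function names, the `reviews/<id>.json` key layout, and the `score_fn` callback are hypothetical.

```python
import json
import logging

log = logging.getLogger(__name__)


def load_reviews(s3_client, bucket, place_id):
    """Return the stored reviews for a place, or [] if none were ever saved."""
    key = f"reviews/{place_id}.json"  # hypothetical key layout
    try:
        obj = s3_client.get_object(Bucket=bucket, Key=key)
    except Exception as err:
        # botocore's ClientError exposes the S3 error code via err.response.
        code = getattr(err, "response", {}).get("Error", {}).get("Code")
        if code != "NoSuchKey":
            raise  # real failures (permissions, network) still surface
        # A missing key is expected for places without reviews: not an ERROR.
        log.debug("No reviews stored for place %s", place_id)
        return []
    return json.loads(obj["Body"].read())


def average_sentiment(reviews, score_fn):
    """Average the per-review scores; None when there is nothing to score."""
    if not reviews:
        return None
    return sum(score_fn(review) for review in reviews) / len(reviews)
```

Only the `NoSuchKey` case is downgraded to a debug message; every other error is re-raised, so genuine S3 problems remain visible in the log.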

@luccalb luccalb moved this from In Progress to Awaiting Review in amos2023ws06-feature-board Feb 6, 2024
@Tims777 Tims777 linked a pull request Feb 6, 2024 that will close this issue
@Tims777 Tims777 closed this as completed Feb 7, 2024
@github-project-automation github-project-automation bot moved this from Awaiting Review to Feature Archive in amos2023ws06-feature-board Feb 7, 2024