
Merge sitemap scaleway from polomarcus/barometre #83

Merged: 89 commits, Nov 24, 2023

@polomarcus polomarcus commented Oct 30, 2023

To simplify the merge, I suggest a force push (`git push --force`).

Features

  • add media outlets
  • parse some media sites to get each article's description using BeautifulSoup

Tests

  • some tests with pytest, also run on the CI
  • Test coverage published on codecov

Ops

  • Docker Compose to run the project
  • Scaleway Docker images and serverless containers

To do

polomarcus and others added 21 commits October 25, 2023 14:30
Some media changed their publication date instead of their last-modification date,
which broke the primary key (PK)


To do after deployment:
* delete the duplicates caused by the new PK

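```
psql
>
-- Delete duplicates, keeping one row per news_title.
-- Without a tie-breaker, every row matches itself and the whole table is deleted.
DELETE FROM
    sitemap_table a
        USING sitemap_table b
WHERE a.news_title = b.news_title
  AND a.ctid < b.ctid; -- ctid (PostgreSQL physical row id) is an arbitrary tie-breaker
```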
Use env variable to connect to PG
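
As an illustration of that commit, here is a minimal sketch of building a PostgreSQL connection URL from environment variables. The variable names (`POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_HOST`, `POSTGRES_PORT`, `POSTGRES_DB`), their defaults, and the function `build_pg_url` are hypothetical and may not match what the project actually uses.

```python
import os
from urllib.parse import quote_plus


def build_pg_url(env=os.environ):
    """Build a PostgreSQL connection URL from environment variables.

    Every variable name and default below is a hypothetical placeholder.
    """
    user = env.get("POSTGRES_USER", "user")
    password = env.get("POSTGRES_PASSWORD", "password")
    host = env.get("POSTGRES_HOST", "localhost")
    port = env.get("POSTGRES_PORT", "5432")
    database = env.get("POSTGRES_DB", "barometre")
    # Percent-encode credentials so characters such as "@" do not break the URL.
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )
```

Passing the mapping as a parameter keeps the function testable without touching the real process environment.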
To be as generic as possible, we parse this tag from every news article:
 ```
<meta name="description" content="coucou">
```
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta
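
A rough sketch of extracting that tag's `content` value. The PR uses BeautifulSoup for this; to stay dependency-free, this example swaps in the standard library's `html.parser` instead, so it is not the project's actual code. The names `MetaDescriptionParser` and `extract_description` are made up for illustration.

```python
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Remember the content of the first <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything except the first matching <meta> tag.
        if tag != "meta" or self.description is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "description":
            self.description = attributes.get("content")


def extract_description(html):
    """Return the page's meta description, or None if the tag is absent."""
    parser = MetaDescriptionParser()
    parser.feed(html)
    return parser.description


page = '<html><head><meta name="description" content="coucou"></head></html>'
print(extract_description(page))
```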

Something to keep in mind during the first deployments: execution time, and the concurrency handled with asyncio
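
One way to keep both execution time and concurrency under control with asyncio is to cap the number of simultaneous requests with a semaphore. The sketch below only illustrates that idea: `fetch_description` is a hypothetical stand-in that sleeps instead of doing a real HTTP request, and the `max_concurrency` default is an arbitrary assumption.

```python
import asyncio


async def fetch_description(url):
    # Hypothetical placeholder for an HTTP request followed by HTML parsing.
    await asyncio.sleep(0.01)
    return f"description of {url}"


async def fetch_all(urls, max_concurrency=5):
    """Fetch every URL concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with semaphore:
            return await fetch_description(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(url) for url in urls))


results = asyncio.run(fetch_all(["https://example.com/a", "https://example.com/b"]))
```

A lower cap is gentler on the scraped sites but lengthens the total run, which matters if the job runs in a serverless container with a time limit.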

:warning: Websites rendered with JavaScript cannot be parsed yet (see https://www.zenrows.com/blog/scraping-javascript-rendered-web-pages#requirements for what this would need)
…astmod/publication date

To avoid wasteful scraping:
* we first compare the parsed sitemap.xml with the sitemap entries already stored in PG,
then decide from the difference whether to keep parsing.
* when reading sitemap.xml, we keep only news at most 7 days old
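
The two rules above could be sketched roughly as follows. This is an assumption-laden illustration, not the project's code: it treats the article URL as the comparison key, assumes timezone-aware publication dates, and the name `select_urls_to_scrape` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def select_urls_to_scrape(sitemap_entries, known_urls, now=None, max_age_days=7):
    """Return URLs that are recent enough and not already stored.

    sitemap_entries: iterable of (url, publication_date) pairs, where
    publication_date is a timezone-aware datetime.
    known_urls: set of URLs already present in the database (assumed key).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [
        url
        for url, published in sitemap_entries
        if published >= cutoff and url not in known_urls
    ]
```

Only the URLs returned here would then go through the more expensive page download and description parsing.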
@polomarcus polomarcus changed the title Merge sitemap scaleway WIP: Merge sitemap scaleway Oct 30, 2023
@polomarcus polomarcus force-pushed the merge-sitemap-scaleway branch from ecf0d86 to b857083 Compare October 30, 2023 17:04
@polomarcus polomarcus changed the title WIP: Merge sitemap scaleway Merge sitemap scaleway from polomarcus/barometre Oct 30, 2023
@estellerambier

I'll take a look today 😬

@estellerambier estellerambier left a comment

Let's do it! 🚀

Are you going to squash?

@polomarcus polomarcus merged commit 7e0ba98 into main Nov 24, 2023
1 check passed
@polomarcus polomarcus deleted the merge-sitemap-scaleway branch November 24, 2023 10:30
2 participants