Refacto/Modin to_sql (#201)
* wip

* wip: readme

* doc: delete where example
polomarcus authored Jul 3, 2024
1 parent 9ea9484 commit d737df0
Showing 5 changed files with 10 additions and 5 deletions.
6 changes: 6 additions & 0 deletions README.md
@@ -262,6 +262,12 @@ If our media perimeter evolves, we have to reimport it all using env variable `S

Otherwise, default is yesterday midnight date (default cron job)

**As pandas to_sql does not enable upsert (update/insert)**, if we want to update already saved rows, we have to delete first the rows and then start the program with `START_DATE` :
```
DELETE FROM keywords
WHERE start BETWEEN '2024-05-01' AND '2024-05-30';
```
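
The delete-then-reimport step described in this README change can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses the standard-library `sqlite3` module as a stand-in for PostgreSQL, the `id` and `plaintext` columns are a simplified guess at the `keywords` table, and `replace_rows_in_range` is a hypothetical helper that does not exist in the repository.

```python
import sqlite3


def replace_rows_in_range(conn, rows, start_date, end_date):
    """Delete rows whose `start` lies in [start_date, end_date], then insert `rows`.

    Hypothetical helper: it mimics "DELETE, then rerun the import with START_DATE".
    """
    with conn:  # single transaction: commit on success, roll back on error
        conn.execute(
            "DELETE FROM keywords WHERE start BETWEEN ? AND ?",
            (start_date, end_date),
        )
        conn.executemany(
            "INSERT INTO keywords (id, start, plaintext) VALUES (?, ?, ?)",
            rows,
        )


# In-memory database with a simplified, assumed schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keywords (id TEXT PRIMARY KEY, start TEXT, plaintext TEXT)")
conn.executemany(
    "INSERT INTO keywords VALUES (?, ?, ?)",
    [("a", "2024-05-10", "old text"), ("b", "2024-06-02", "untouched")],
)
conn.commit()

# Re-import May: row "a" is replaced, row "b" (June) is left alone.
replace_rows_in_range(
    conn, [("a", "2024-05-10", "new text")], "2024-05-01", "2024-05-30"
)
remaining = conn.execute("SELECT id, plaintext FROM keywords ORDER BY id").fetchall()
```

Running the delete and the insert in one transaction avoids leaving the table empty for that date range if the import fails midway. Note also that if `start` is a timestamp column in PostgreSQL, an upper bound written as `'2024-05-30'` is read as midnight at the start of May 30, so later rows on that day (and all of May 31) would not be deleted.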

### Based on channel
Use env variable `CHANNEL` like in docker compose (string: tf1)

1 change: 0 additions & 1 deletion alembic/README

This file was deleted.

2 changes: 1 addition & 1 deletion docker-compose.yml
@@ -137,7 +137,7 @@ services:
#entrypoint: ["python", "quotaclimat/data_processing/mediatree/api_import.py"]
environment:
ENV: docker # change me to prod for real cases
- LOGLEVEL: DEBUG # Change me to info (debug, info, warning, error) to have less log
+ LOGLEVEL: WARNING # Change me to info (debug, info, warning, error) to have less log
PYTHONPATH: /app
POSTGRES_USER: user
POSTGRES_DB: barometre
1 change: 1 addition & 0 deletions postgres/schemas/models.py
@@ -66,6 +66,7 @@ class Keywords(Base):
plaintext= Column(Text)
theme=Column(JSON) #keyword.py # ALTER TABLE keywords ALTER theme TYPE json USING to_json(theme);
created_at = Column(DateTime(timezone=True), server_default=text("(now() at time zone 'utc')")) # ALTER TABLE ONLY keywords ALTER COLUMN created_at SET DEFAULT (now() at time zone 'utc');
updated_at = Column(DateTime(), default=datetime.now, onupdate=text("now() at time zone 'Europe/Paris'"), nullable=True)
keywords_with_timestamp = Column(JSON) # ALTER TABLE keywords ADD keywords_with_timestamp json;
number_of_keywords = Column(Integer) # ALTER TABLE keywords ADD number_of_keywords integer;
srt = Column(JSON) # ALTER TABLE keywords ADD srt json;
5 changes: 2 additions & 3 deletions quotaclimat/data_processing/mediatree/api_import.py
@@ -109,9 +109,8 @@ async def get_and_save_api_data(exit_event):
logging.info(f"Querying API for {channel} - {channel_program} - {channel_program_type} - {start_epoch} - {end_epoch}")
df = extract_api_sub(token, channel, type_sub, start_epoch,end_epoch, channel_program,channel_program_type)
if(df is not None):
- # must ._to_pandas() because modin to_sql is not working
- save_to_pg(df._to_pandas(), keywords_table, conn)
- else:
+ save_to_pg(df, keywords_table, conn)
+ else:
logging.info("Nothing to save to Postgresql")
except Exception as err:
logging.error(f"continuing loop but met error : {err}")

1 comment on commit d737df0

@github-actions

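Coverage Report

| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| postgres/insert_data.py | 44 | 7 | 84% | 36–38, 57–59, 64 |
| postgres/insert_existing_data_example.py | 19 | 3 | 84% | 25–27 |
| postgres/schemas/models.py | 147 | 10 | 93% | 121–128, 140–141, 199–200, 214–215 |
| quotaclimat/data_ingestion/scrap_sitemap.py | 134 | 17 | 87% | 27–28, 33–34, 66–71, 95–97, 138–140, 202, 223–228 |
| quotaclimat/data_ingestion/ingest_db/ingest_sitemap_in_db.py | 55 | 37 | 33% | 21–42, 45–58, 62–73 |
| quotaclimat/data_ingestion/scrap_html/scrap_description_article.py | 36 | 3 | 92% | 19–20, 32 |
| quotaclimat/data_processing/mediatree/api_import.py | 204 | 127 | 38% | 43–47, 52–67, 71–74, 80, 83–121, 127–142, 146–147, 160–172, 176–182, 195–206, 209–213, 219, 254–255, 259, 263–297, 300–302 |
| quotaclimat/data_processing/mediatree/channel_program.py | 136 | 51 | 62% | 30–32, 43–45, 59, 95, 104, 142–183 |
| quotaclimat/data_processing/mediatree/config.py | 15 | 2 | 87% | 7, 16 |
| quotaclimat/data_processing/mediatree/detect_keywords.py | 213 | 8 | 96% | 169–172, 216, 271–273 |
| quotaclimat/data_processing/mediatree/update_pg_keywords.py | 52 | 37 | 29% | 14–97, 120–121, 144–170, 176 |
| quotaclimat/data_processing/mediatree/utils.py | 64 | 22 | 66% | 26–50, 53, 62, 78–79 |
| quotaclimat/utils/healthcheck_config.py | 29 | 14 | 52% | 22–24, 27–38 |
| quotaclimat/utils/logger.py | 24 | 11 | 54% | 22–24, 28–37 |
| quotaclimat/utils/sentry.py | 10 | 2 | 80% | 21–22 |
| TOTAL | 1208 | 351 | 71% | |

| Tests | Skipped | Failures | Errors | Time |
|---|---|---|---|---|
| 81 | 0 💤 | 0 ❌ | 0 🔥 | 1m 32s ⏱️ |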
