(Re)move news content & optimize repo #1611

Open
3 tasks
bwbroersma opened this issue Jan 10, 2025 · 0 comments

Currently news content and, worse, the large /assets files (PDFs/videos) attached to it are committed to the main repo.
This results in:

  1. a large repository (a lot of data and waiting on clone; after a clone, .git is about 220MB):
    Receiving objects: 100% (9650/9650), 218.77 MiB
  2. large release artifacts
  3. a large app container, since:
    COPY --chown=nobody:nogroup assets ./assets
  4. publishing news requires a new release, even for news that is not about a release

None of this is ideal.

A solution would be to:

  • Periodically fetch the /assets and /main/translations/en/news.po content from a URL parameter (like HOSTERS_HOF_URL, but adding -L to follow redirects and using https://api.github.com/repos/internetstandards/new-news-repo/tarball)

This would fix points 2, 3 and 4, but not 1.
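The periodic fetch could look roughly like the sketch below. It assumes curl and tar are available in the container. NEWS_CONTENT_URL and both function names are hypothetical (only HOSTERS_HOF_URL and the tarball URL appear above). GitHub's tarball endpoint answers with a redirect, hence -L, and wraps the content in a single top-level directory, hence --strip-components=1.

```shell
#!/bin/sh
# Sketch only: NEWS_CONTENT_URL and both function names are hypothetical.
NEWS_CONTENT_URL="${NEWS_CONTENT_URL:-https://api.github.com/repos/internetstandards/new-news-repo/tarball}"

# Unpack gzipped tarball $1 into directory $2, dropping the single
# top-level directory that GitHub adds to its tarballs.
unpack_tarball() {
    mkdir -p "$2" && tar -xzf "$1" -C "$2" --strip-components=1
}

# Download the tarball at URL $1 and unpack it into directory $2.
fetch_news_content() {
    tmp="$(mktemp)" || return 1
    # -f: fail on HTTP errors, -sS: quiet but show errors, -L: follow redirects.
    if curl -fsSL "$1" -o "$tmp" && unpack_tarball "$tmp" "$2"; then
        status=0
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}

# Example (e.g. from a periodic job):
#   fetch_news_content "$NEWS_CONTENT_URL" /tmp/news-content
```

A real job would probably unpack into a temporary directory first and only swap it in after a successful download, so a failed fetch never leaves half-updated content.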

Maybe look into git replace to optimize the repo, so that fresh clones are small again without rewriting history. See this Stack Overflow post and the Git book documentation, which has clear visuals about it. However, this does not seem ideal, unless perhaps the complete 'filtered' (rewritten/rebased) history is replayed?

Another option might be to repack the pack files so that the removed /assets end up in one pack file that does not need to be downloaded, since its objects are not referenced from the current HEAD. But I'm unsure whether this can be done with GitHub, and whether a pack file can be 'skipped', e.g.:

git rm -r assets && git commit -m "removed assets demo" && git repack --filter='blob:limit=1m' -ad

(The sparse:path filter was removed from Git, hence blob:limit.) Although the result is 2 pack files (12MiB and 207MiB), pushing this to a new GitHub repo does not do the trick: it is cloned again as one pack file.

  • Git replace or repack the deleted assets files
  • Prevent bloat from coming back with PR-check
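A PR check against new bloat could be as small as the sketch below. It reuses the rev-list / cat-file approach from this issue, but only on the commits a PR adds. The function names, MAX_BLOB_BYTES and the 1 MiB limit are assumptions for illustration. It compares the uncompressed blob size, not the on-disk size.

```shell
#!/bin/sh
# Sketch only: the function names and MAX_BLOB_BYTES are hypothetical.
MAX_BLOB_BYTES="${MAX_BLOB_BYTES:-1048576}"   # 1 MiB, an arbitrary example limit

# Print "<size> <path>" for every blob that is reachable from commit $2 but
# not from commit $1 and is larger than $3 bytes (uncompressed size).
find_large_blobs() {
    git rev-list --objects "$1..$2" \
    | git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
    | awk -v limit="$3" '$1 == "blob" && $2 + 0 > limit + 0 { sub(/^blob /, ""); print }'
}

# Succeed only if the range $1..$2 adds no blob larger than MAX_BLOB_BYTES.
check_no_large_blobs() {
    large="$(find_large_blobs "$1" "$2" "$MAX_BLOB_BYTES")"
    if [ -n "$large" ]; then
        echo "Blobs larger than $MAX_BLOB_BYTES bytes were added:" >&2
        echo "$large" >&2
        return 1
    fi
}

# Example (base branch name is an assumption):
#   check_no_large_blobs origin/main HEAD
```

In CI the job needs enough history fetched for both commits to be present, otherwise the range cannot be resolved.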

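Largest files in the history (all refs):

# Sum the on-disk (packed) size and count the versions of every blob path
# reachable from any ref, then print the 50 largest in human-readable units.
git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize:disk) %(rest)' \
| sed -n 's/^blob //p' \
| awk '{f[$3]=f[$3]+$2;c[$3]++}END{for(i in f){print f[i], c[i], i}}' \
| sort -nr \
| head -n 50 \
| numfmt --field=1 --to=iec-i --suffix=B --padding=7 --round=nearest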
50 largest files (on disk)

   size  commits  path
  86MiB        1  assets/article/terugblik-10-jarig-jubileum/Vint-Cerf-ISOCNL-PLIS-NLIGF-2024.mp4
  35MiB        1  assets/article/terugblik-10-jarig-jubileum/Vint-Cerf-ISOCNL-PLIS-NLIGF-2024.webm
  26MiB        1  assets/article/nederland-voor-veilig-emailverkeer/20170202_ondertekening_veilige_email_coalitie_WEBM_1280x720.webm
  19MiB        1  assets/article/nederland-voor-veilig-emailverkeer/20170202_ondertekening_veilige_email_coalitie_MP4_1280x720.mp4
  15MiB        1  assets/article/DMARC-masterclass-authenticatie-afzender-noodzaak-geworden/internet.nl-NL_2.pdf
 8.9MiB        1  assets/article/nederland-voor-veilig-emailverkeer/ondertekening_veilig_email_coalitie-scaled.webm
 6.5MiB        1  assets/article/DMARC-masterclass-authenticatie-afzender-noodzaak-geworden/4_DMARC_masterclass_april_2015.pdf
 5.7MiB        1  assets/article/DMARC-masterclass-authenticatie-afzender-noodzaak-geworden/3_DMARC_NL.pdf
 1.4MiB        1  vendor/zlib-1.2.13.tar.gz
 1.4MiB       12  remote_data/macs/padded_macs.json
 1.1MiB        1  assets/article/ncsc-one-conference-2018/GKB-DANE_for_mail_One-Conference.pdf
 1.0MiB        1  assets/article/ncsc-one-conference-2018/PBK-dane-one.pdf
 709KiB       14  remote_data/certs/ca-bundle.crt
 706KiB        1  assets/article/terugblik-10-jarig-jubileum/Vint-Cerf-ISOCNL-PLIS-NLIGF-2024.png
 642KiB        1  assets/article/terugblik-10-jarig-jubileum/ISOCNL-PLIS-NLIGF-2024-sessie.jpg
 596KiB       11  remote_data/certs/certdata.txt
 478KiB        1  assets/article/ncsc-one-conference-2018/JPC-OneConf-DANE-XS4ALL.pdf
 368KiB        1  interface/static/accessibility/check_6967_auto_rapport.pdf
 349KiB        1  assets/article/rol-IPv6-task-force-ondergebracht-bij-platform-internetstandaarden/ipv6-task-force.jpg
 320KiB        1  documentation/images/dockerfile.png
 304KiB        1  assets/article/nederland-voor-veilig-emailverkeer/20170201a_Intentieverklaring_Veilige_E-mail_Coalitie.pdf
 264KiB        3  documentation/images/dockerfiles.png
 247KiB        1  assets/article/nederland-voor-veilig-emailverkeer/20170202_ondertekening_veilige_email_coalitie_PNG_1280x720.png
 245KiB       74  translations/nl/main.po
 222KiB       76  translations/en/main.po
 195KiB        1  documentation/images/integration_test_environment.png
 168KiB        1  assets/article/nederland-voor-veilig-emailverkeer/20170202_ondertekening_veilige_email_coalitie_JPEG_1280x720.jpg
 167KiB        1  documentation/images/docs-summary.png
 153KiB        1  documentation/images/development_environment_volumes.png
 145KiB        1  assets/article/terugblik-10-jarig-jubileum/ISOCNL-PLIS-NLIGF-2024-afsluiting.jpeg
 141KiB        1  documentation/images/all-green.png
 135KiB       80  checks/tasks/tls.py
 135KiB        1  documentation/database_model_2022.png
 125KiB       40  translations/nl/news.po
 120KiB        1  assets/article/website-internet-nl-drukbezocht-na-cybertop/GCCS2015-OlafKolkman-600x399.jpg
 111KiB        1  documentation/images/production.png
 110KiB        1  documentation/images/batch.png
 105KiB        1  documentation/images/development_environment.png
 103KiB        1  documentation/images/metrics.png
  98KiB        1  assets/article/DMARC-masterclass-authenticatie-afzender-noodzaak-geworden/TimDraegen-slide19-600x450.png
  92KiB        1  documentation/inl_architecture.png
  91KiB        1  assets/article/DMARC-masterclass-authenticatie-afzender-noodzaak-geworden/BK-IMAG1622-unsh-cropped-600x468.jpg
  76KiB        1  docker/integration-tests/www/static/list/public_suffix_list.dat
  75KiB        1  documentation/images/live_tests.png
  70KiB        1  assets/article/DMARC-masterclass-authenticatie-afzender-noodzaak-geworden/BK-IMAG1620-unsh-cropped-600x468.jpg
  69KiB        1  docker/it/targetbase/html/list/public_suffix_list.dat
  67KiB        1  assets/author/erik-huizer/picture.jpg
  64KiB       29  translations/en/news.po
  54KiB      114  Changelog.md
  45KiB        1  documentation/images/internetnl-docker-workflow.png

This results in the following summary, which takes pack compression into account.

Total size  First part of path  Reason
208MiB      /assets             News-related assets like mp4, webm, pdf
2.7MiB      /remote_data        decompressed current snapshot is 3.6MiB (tar.gz 1.1MiB)
2.3MiB      /documentation      1.9MiB in /images
1.4MiB      /vendor             1.4MiB zlib-1.2.13.tar.gz file
955KiB      /interface          790KiB in /static;
                                368KiB of this is accessibility/check_6967_auto_rapport.pdf,
                                172KiB are (unused?) fonts
661KiB      /checks             all code, although some deleted /static/ files do exist;
                                unsure whether these were 'replaced' or really removed
                                (if the former, cleaning up will not reduce the pack size)
659KiB      /translations       245KiB (nl) and 222KiB (en) main.po,
                                125KiB (nl) and 64KiB (en) news.po
449KiB      /docker             145KiB are two versions of public_suffix_list.dat (one is deleted)
57KiB       /tests
54KiB       /Changelog.md
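A per-directory summary like this could be produced by grouping the per-blob on-disk sizes on the first path component. How the totals above were actually computed is not stated, so this is only one possible sketch; sum_by_top_dir is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: sum_by_top_dir is a hypothetical helper name.

# Read "<size-in-bytes> <path>" lines on stdin and print the total size per
# first path component as "<total> /<component>", largest first.
sum_by_top_dir() {
    awk '{
        size = $1
        sub(/^[^ ]+ /, "")          # strip the size, keep the full path
        split($0, parts, "/")
        total[parts[1]] += size
    }
    END { for (dir in total) print total[dir], "/" dir }' \
    | sort -nr
}

# Example: feed it the on-disk size of every blob in the history.
#   git rev-list --objects --all \
#   | git cat-file --batch-check='%(objecttype) %(objectsize:disk) %(rest)' \
#   | sed -n 's/^blob //p' \
#   | sum_by_top_dir \
#   | numfmt --field=1 --to=iec-i --suffix=B --round=nearest
```

Because it sums %(objectsize:disk), deleted files and old versions still count for as long as they remain in the history.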

All /assets and the old news.po can be cleaned up.
We should look into optimizing /documentation/images/ and /interface/static/.
/remote_data probably cannot be optimized.
