Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Code refactoring and speed optimizations #24

Merged
merged 21 commits into from
Dec 23, 2024
Merged

Conversation

Hukyl
Copy link
Contributor

@Hukyl Hukyl commented Oct 26, 2024

General overview

This PR introduces a major update from v0.01 to v0.02, which comes with a lot of new features.

Improved general code structure

  1. Now each entity/class is separated into different files and folders for better logical representation. This will also allow for extending the functionality, for example, with export types.
  2. Added type hints and type declarations for basic entities, like Message or Dialog.
  3. Refactored some functions.
  4. Exported settings to a separate file settings.py, which allows for easy setup and explanations on some variables. A few of the most important settings include concurrent dialog processing count, which allows for high customizations regarding the speed and proved to be useful when adjusting to Telegram's request blocking (429 Too Many Requests).
  5. Added CONTRIBUTING.md for future contributors.

Speed optimizations

The initial version of the script, while utilizing the asynchronous nature of the Telethon library, didn't use it to full capacity.
Now, with including asynchronous operation between separate dialogs, the speed has drastically increased.

Note: all the following tests (unless specified) were conducted locally using my local data. It consists of 328 Telegram dialogs, where around 5-10 contained >100k messages. These measurements are not extremely accurate and should be taken with a grain of salt.

0_download_dialogs_list.py:

Initial version: 1m41s
Optimized version: 1m9s

1_download_dialogs_data.py (on 5 chats with >100k msgs):

Initial version: 40k msgs/hour
Improved version: 200k msgs/hour
When testing for another user (kind regards to the tester @velosypedno), the speed (after editing the concurrency parameter) reached 340k msgs/hour.

Speed comes very important as the number of chats starts to increase. In order to show the difference, with the optimized version it took around 10–12 hours, when with the original it would take a whopping 52 hours!

Important updates

Each of the following changes were made out of improvement considerations, are debatable and can be reverted upon request.

  1. In the initial version, from_id parameter for the second script was saved as a representation of a Python object (e.g., PeerUser(user_id=...)). Now, instead we retrieve the ID from any Telethon object (see TypePeer) using utility Telethon function (see telethon.utils.get_peer_id), which lead to some changes in ID. The next part is retrieved directly from documentation:

    Convert the given peer into its marked ID by default.

    This "mark" comes from the "bot api" format, and with it the peer type
    can be identified back. User ID is left unmodified, chat ID is negated,
    and channel ID is "prefixed" with -100:

    • user_id
    • -chat_id
    • -100channel_id

    The original ID and the peer type class can be returned with
    a call to :meth:resolve_id(marked_id).

  2. Instead of previous dialog types ("Private dialog", "Group", "Channel" and "" for unknown type) more standardized dialog types were introduced, which are listed in an Enum (see dict_types): "private", "group", "channel" and "unknown".

  3. Migrated to poetry (see here) package manager instead of default pip. It proved to be more user-friendly and better at resolving dependencies.

  4. Used ruff (see here) instead of black formatter. While the code styles are highly debatable, it is a lot faster and more responsive than black.

Possible improvements

  1. Add threading support to increase throughput. Sadly, couldn't utilize nor test it myself, as received a lot of 429 status codes, but proved to be useful for other users.
  2. Improved code style and responsibility separation.

Summary

All-in-all, each improvement was made with a possible use case in mind. Feel free to leave your comments and suggest improvements!

@Hukyl Hukyl changed the title Code refactoring and optimizations Code refactoring and speed optimizations Oct 30, 2024
Copy link
Owner

@SanGreel SanGreel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

general comments:

  1. move tests to one folder
  2. add instructions how to run tests
  3. add test coverage percentage
  4. run black
  5. add docstrings
  6. add pylint to the project - https://docs.pylint.org/index.html
  7. add commands how to run pylint

1_download_dialogs_data.py Show resolved Hide resolved
1_download_dialogs_data.py Outdated Show resolved Hide resolved
1_download_dialogs_data.py Outdated Show resolved Hide resolved
1_download_dialogs_data.py Show resolved Hide resolved
1_download_dialogs_data.py Outdated Show resolved Hide resolved
telegram_data_downloader/processor/message_downloader.py Outdated Show resolved Hide resolved
telegram_data_downloader/processor/message_downloader.py Outdated Show resolved Hide resolved
telegram_data_downloader/processor/message_downloader.py Outdated Show resolved Hide resolved
telegram_data_downloader/settings.py Outdated Show resolved Hide resolved
telegram_data_downloader/utils.py Outdated Show resolved Hide resolved
@Hukyl
Copy link
Contributor Author

Hukyl commented Nov 23, 2024

  1. Please, consider using ruff instead of pylint and black.

    In short, it is a blazingly fast (yeah, it's written in Rust) code linter and formatter. It's 99% compatible with black (doc source), and supports some error codes from pylint (here is a list of all supported). Maybe we could use a combination of ruff for linting and formatting, and run pylint before commiting, and suggest this to other contributors as a contributing guideline.

  2. About docstring, I suggest adding only basic docstrings, as I believe code should be mostly self-documenting. Therefore, I'll try to improve the namings and function return types, so that it would be so, while providing basic docstring about the method function.

  3. Perhaps keep unittests in the same folder as the file it is testing? This is a project structuring technique I've adopted from Golang, which allows for lower cognitive load when running tests and lays more incentive on the test file path (useful for pytest) and reducing the naming requirements. That being said, integration and e2e tests IMO should still be placed in the tests/ dir in the project root.

  4. Testing instruction is already present in CONTRIBUTING.md. Is it better to move to README.md?

  5. It appears as if you cannot fetch all reactions to the message (i.e., set to None), see here. I still exported the setting to config file tho

Waiting for some feedback and further discussion on these suggestions! @SanGreel

@Hukyl
Copy link
Contributor Author

Hukyl commented Nov 25, 2024

Also, while debugging, I noticed a special case: when Dialog is of type Group or Channel, then Telethon's message.from_id is not empty, but when it's a Private Dialog, it is None, but the user can still be identified by the dialog.id, which is the user's id.

Therefore, all messages from groups or channels have a user_id in from_id field, but all private chats - don't. Again, user can still be identified by dialog_id field

Semantically, does this make sense? Should this be left as is, or modified so that it isn't null even in case of private chat? @SanGreel

@Hukyl Hukyl requested a review from SanGreel December 12, 2024 09:58
@SanGreel SanGreel merged commit 0623975 into SanGreel:master Dec 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants