This Telegram scraper collects Telegram messages, comments (comments.bot/comments.app) and media files. It was originally built for this story on behalf of Addendum.
Contributors: @PeterWalchhofer, @vali101, @fin
This scraper was written before Telegram introduced its native comment feature for broadcasting channels. Nowadays, the comments.bot/comments.app extensions are rarely used. Feel free to open a PR to support the native comment feature. @vali101 rewrote the scraper for his bachelor thesis here, with a focus on snowball sampling. His code may help with extending the scraper or with finding interesting channels in the first place. Check it out!
- Google Chrome (in theory you could also use Firefox if you install the necessary driver manually)
- Python 3
- Telegram Account (Phone Number)
- A lot of storage if downloading media
- Time: around 3,000 messages and comments per minute.
- Install the dependencies with `make install`, or just install `requirements.txt` (using a venv is recommended).
- Create your own `channels.csv` as explained in the next section.
- Put the phone number of the linked Telegram account into the `config.yaml`.
- Get your API key here and put it inside the `config.yaml`.
- Run `sh scrape.sh` to start the scraper, or run `python app.py` from the `channelscraper/` directory.
- The outputs will be stored in the `/output` directory.
You need to create your own `channels.csv` and put it in the `/input` folder.
Only the Link and Broadcast columns are relevant for scraping. The CSV should have the form described below. There is also an example CSV in the folder.
Kategorie | Name | Link | @ | Broadcast |
---|---|---|---|---|
Gruppe Typ XY | Example Channel | https://t.me/example_channel | example_channel | TRUE |
- Kategorie (optional): Metadata to annotate the channel
- Name (optional): Display name of the channel (not the identifier)
- Link: Link to the channel
- @ (optional): Identifier name
- Broadcast: `True` if the channel is a broadcasting channel, else `False`. Broadcasting channels are large one-to-many channels that only allow owners to write messages.
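
A minimal sketch of how such a file could be read, assuming it is comma-separated and starts with the header row shown in the table. `load_channels` is a hypothetical helper for illustration, not necessarily how the scraper loads its input; it keeps only the two columns that matter for scraping.

```python
import csv
import io


def load_channels(csv_file):
    """Return (link, is_broadcast) pairs from an open channels.csv file.

    Hypothetical helper. Assumes a comma-separated file whose header row
    contains at least the 'Link' and 'Broadcast' columns.
    """
    channels = []
    for row in csv.DictReader(csv_file):
        link = (row.get("Link") or "").strip()
        if not link:
            continue  # skip rows without a channel link
        # Accept TRUE/True/true; anything else counts as a non-broadcast channel.
        is_broadcast = (row.get("Broadcast") or "").strip().lower() == "true"
        channels.append((link, is_broadcast))
    return channels


# In-memory example that mirrors the table row above.
example = io.StringIO(
    "Kategorie,Name,Link,@,Broadcast\n"
    "Gruppe Typ XY,Example Channel,https://t.me/example_channel,example_channel,TRUE\n"
)
print(load_channels(example))
```

With a real file, the same function could be called as `load_channels(open("input/channels.csv", newline="", encoding="utf-8"))`; the path and encoding here are assumptions.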
The scraper can be further configured via `channelscraper/config.yaml`.
The scraper extracts all messages from a channel. It is also possible to scrape only those messages that were written in the last x days. This can be set in the `config.yaml`.
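
The "last x days" restriction amounts to a cutoff comparison on each message's timestamp. The sketch below illustrates the idea only: `is_recent` and its `days` parameter are hypothetical names, not the actual `config.yaml` key, and it assumes message timestamps are timezone-aware UTC datetimes.

```python
from datetime import datetime, timedelta, timezone


def is_recent(message_date, days, now=None):
    """Return True if message_date lies within the last `days` days.

    Hypothetical helper. Assumes message_date is a timezone-aware datetime.
    `now` can be passed in to make the check reproducible.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return message_date >= cutoff


# Fixed reference time so the example is deterministic.
reference = datetime(2021, 1, 10, tzinfo=timezone.utc)
print(is_recent(datetime(2021, 1, 8, tzinfo=timezone.utc), 3, reference))
print(is_recent(datetime(2021, 1, 1, tzinfo=timezone.utc), 3, reference))
```

Since messages are usually returned newest first, a scraper could stop iterating at the first message for which such a check fails, but that ordering is also an assumption here.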
- In many broadcasting channels, comment bots are used to collect feedback from the audience. These comments can be scraped as well. Currently the comments.app and comments.bot bots are supported.
- Be careful: the date format differs from the one used by the Telegram API and has to be parsed manually (e.g. Dec 09).
- Because there is a "Load more comments" button, it has to be clicked using JavaScript. Selenium is used to interact with the Chrome driver, which is installed automatically.
- Unique username extraction is working most of the time. However, if the user has deleted their account, this is not possible.
- Unique usernames are not extracted, because we found no way to find out without querying the API at a high cost.
- Only the display name is persisted.
- Messages and comments are persisted in the same CSV file. To tell them apart, use the `isComment` column. Additionally, the ID of a comment includes a period in the format msgId.commId (e.g. 101.3).
- We allocated IDs to the comments manually. The Telegram message IDs are unique within a channel.
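
Two of the points above can be sketched in a few lines: parsing a year-less comment date such as `Dec 09`, and splitting a combined `msgId.commId` value. Both helpers are hypothetical illustrations, not the scraper's own code. The date helper assumes English month abbreviations and that a comment was posted on the most recent matching day that is not in the future.

```python
from datetime import date, datetime
from typing import Optional, Tuple


def parse_comment_date(text: str, today: date) -> date:
    """Parse a date like 'Dec 09' that carries no year (hypothetical helper).

    Assumption: the most recent such day on or before `today` is meant.
    Limitation: 'Feb 29' raises ValueError when the guessed year is not a leap year.
    """
    parsed = datetime.strptime(f"{text} {today.year}", "%b %d %Y").date()
    if parsed > today:
        # The date would be in the future, so it must belong to the previous year.
        parsed = parsed.replace(year=today.year - 1)
    return parsed


def split_id(raw_id: str) -> Tuple[int, Optional[int]]:
    """Split 'msgId.commId' into its parts; plain message IDs have no comment part."""
    msg_id, _, comm_id = raw_id.partition(".")
    return int(msg_id), (int(comm_id) if comm_id else None)


print(parse_comment_date("Dec 09", date(2021, 1, 15)))
print(split_id("101.3"))
print(split_id("101"))
```

When loading the output, it is safer to read the ID column as text: parsed as a number, `101.10` and `101.1` would collapse into the same value.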
If you want to run Selenium with Docker, use `selenium/standalone-chrome:3.141.59-yttrium`.
Also see:
https://stackoverflow.com/questions/45323271/how-to-run-selenium-with-chrome-in-docker