Telegram Scraper

This Telegram scraper collects telegram messages, comments (comments.bot/comments.app) and media files. It was originally build for this story on behalf of Addendum.

Contributors: @PeterWalchhofer, @vali101, @fin

Notes

This scraper was written before Telgram introduced its native comment feature for broadcasting channels. Nowadays, comment.bot/comment.app extensions are rarely used anymore. Feel free to make a PR to support this feature. @vali101 re-wrote the scraper for his bachelor thesis here with a focus on snowball sampling. His code may help for extending the scraper or for finding interesting channels in the first place. Check it out!

Setup

Requirements:

Google Chrome (However, in theory you could als use Firefox when installing the necessary driver manually )
Python 3
Telegram Account (Phone Number)
A lot of storage if downloading media
Time - arround 3000msg and comments/ per Minute.

Getting Started

Install dependencies make install OR just install requirements.txt (using venv is recommended)
Create your own channel.csv as explained in the next section
Put the phone-number of the linked telegram account int the config.yaml
Get your API-key here an put them inside the config.yaml.
sh scrape.sh to start the scraper OR run channelscraper/python app.py
The outputs will be stored in the /output directory.

Input Data

You need to create your own channels.csvand put it in the /input folder.

Only Link and Broadcast Relevant for scraping. The csv should have the form described below. There also is an example csv in the folder.

Kategorie	Name	Link	@	Broadcast
Gruppe Typ XY	Example Channel	https://t.me/example_channel	example_channel	TRUE

Kategorie(optional): Metadata to annotate channel
Name(optional): Not identifier Name
Link: Link to channel
@ (optional): Indentifier Name
Broadcast: ´True´ if channel is Broadcasting Channel ´else´ false. Broadcasting channels are large one-to-many channels that only allow owners to write messages.

Configuration

The Scraper can be further configured via the channelscraper/config.yaml.

Features

Messages

The Scraper extracts all messages from a channel. It is also possible to scrape only those messages that were written in the last x days. This can be set in the config.yaml.

Comment Bots

In many broadcasting channels comment-bots are used in order to provide feedback from the audience. It is also possible to scrape those messages. Currently comments.app and comments.bot bots are supported.
Be careful. The date format is different from the telegram-api and has to be parsed manually (e.g. Dec 09)
As there is a "Load more comments" button it has to be clicked using javascript. Selenium is used to interact with the chrome driver that is installed automatically.

Comments.app

Unique username extraction is working most of the time. However, if the user has deleted its account, this is not possible.

Comments.bot

Unique usernames are not extracted, because we found no way to find out without querying the api at a high cost.
Only the display name is persited

Further remarks

Messages and comments are persisted in the same csv-file. To tell them apart use the isComment column. Additionally, the ID includes a period in the format msgId.commId (e.g. 101.3)
We allocated ids to the comments manually. The telegram message ids are unique within a channel.

Optional

If you want to run selenium with docker use selenium/standalone-chrome:3.141.59-yttrium Also see: https://stackoverflow.com/questions/45323271/how-to-run-selenium-with-chrome-in-docker

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
channelscraper		channelscraper
input		input
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
requirements.txt		requirements.txt
scrape.sh		scrape.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Telegram Scraper

Notes

Setup

Requirements:

Getting Started

Input Data

Configuration

Features

Messages

Comment Bots

Comments.app

Comments.bot

Further remarks

Optional

About

Releases

Packages

Contributors 3

Languages

License

PeterWalchhofer/Telescrape

Folders and files

Latest commit

History

Repository files navigation

Telegram Scraper

Notes

Setup

Requirements:

Getting Started

Input Data

Configuration

Features

Messages

Comment Bots

Comments.app

Comments.bot

Further remarks

Optional

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages