Class Documentation

Documentation

class `BaseCrawler`

Main class inherited by all crawler classes.

Methods

`BaseCrawler(self, name, start_url)`

Constructor for the class.
Arguments -

name : Name of crawler.
start_url : Base URL of the website.

class `CrawlerType0`

Base class : BaseCrawler
Crawler for the websites Hindi Lyrics, Smriti and Lyrics Masti.

Methods

`CrawlerType0(self, name, start_url, list_of_url, number_of_threads)`

Constructor for the class.
Arguments -

name : Name of crawler.
start_url : Base URL of the website.
list_of_url : List of URL(s) to start with.
number_of_threads : Number of threads to use to crawl.

`threader(self, thread_id)`

Worker method. Gets a task from task_queue and performs the corresponding job.
Arguments -

thread_id : Assigned ID of thread.

`run(self)`

Method called from derived classes to start the crawling process.
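The worker pattern described for `threader` and `run` can be sketched with the standard `queue` and `threading` modules. This is a minimal sketch, not the project's code: the dict-based task layout, the `None` sentinel, the module-level `task_queue`, and the stand-in bodies of `get_movies` and `download_movie` are all assumptions made for illustration, and the functions are written without `self` so the example runs on its own.

```python
import queue
import threading

# Assumed task layout: the documentation does not say how tasks are encoded,
# so a dict with a 'type' key is used here purely for illustration.
task_queue = queue.Queue()
results = []
results_lock = threading.Lock()


def get_movies(thread_id, url):
    # Stand-in: pretend every page lists exactly one movie and queue it.
    task_queue.put({'type': 'movie', 'url': url + '/movie1', 'movie': 'movie1'})


def download_movie(thread_id, url, movie):
    # Stand-in: record the work instead of writing to a database.
    with results_lock:
        results.append((thread_id, url, movie))


def threader(thread_id):
    """Worker loop: take a task, dispatch on its type, mark it done."""
    while True:
        task = task_queue.get()
        try:
            if task is None:  # sentinel telling this worker to exit
                return
            if task['type'] == 'page':
                get_movies(thread_id, task['url'])
            elif task['type'] == 'movie':
                download_movie(thread_id, task['url'], task['movie'])
        finally:
            task_queue.task_done()


def run(list_of_url, number_of_threads):
    """Seed the queue, start the workers, and wait for all tasks to finish."""
    for url in list_of_url:
        task_queue.put({'type': 'page', 'url': url})
    workers = [threading.Thread(target=threader, args=(i,))
               for i in range(number_of_threads)]
    for worker in workers:
        worker.start()
    task_queue.join()            # blocks until every queued task is done
    for _ in workers:
        task_queue.put(None)     # one sentinel per worker
    for worker in workers:
        worker.join()
    return results
```

Because a handler queues its follow-up tasks before the current task is marked done, `task_queue.join()` does not return until the whole chain of page and movie tasks has been processed.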

`download_movie(self, thread_id, url, movie)`

Method called from threader if the task type is to get songs from a movie. Gets all songs from a movie and saves them in the database.
Arguments -

thread_id : Assigned ID of thread.
url : URL for the movie.
movie : Name of movie.

`get_movies(self, thread_id, url)`

Method called from threader if the task type is to get movies from a page. Gets movies from a webpage.
Arguments -

thread_id : Assigned ID of thread.
url : URL of page.

`get_movies_with_url(self, raw_html)`

User overrides this method to return the list of movies with URLs in the following format -

[
    ('link1', 'movie1'),
    ('link2', 'movie2'),
]

Arguments -

raw_html : Raw HTML code of the page.
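As an illustration of an override that produces this format, the sketch below collects every `<a href=...>` link on a page with the standard library's `html.parser`. The page markup is hypothetical (a real site will need a more selective filter), `_LinkCollector` is a helper invented for this example, and the function is shown without `self` so it can run on its own.

```python
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Hypothetical helper: collects (href, text) for every <a> with an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a> tag.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None


def get_movies_with_url(raw_html):
    """Return [('link1', 'movie1'), ...] for all links found in raw_html."""
    collector = _LinkCollector()
    collector.feed(raw_html)
    collector.close()
    return collector.links
```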

`get_songs_with_url(self, raw_html)`

User overrides this method to return the list of songs with URLs from a movie page in the following format -

[
    ('link1', 'song1'),
    ('link2', 'song2'),
]

Arguments -

raw_html : Raw HTML of the page.
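Links scraped from a page are often relative. The documentation does not say whether the crawler resolves them against `start_url` or expects the override to do so, so the helper below is only an assumption about one way to handle it. `absolutize` is a hypothetical name; `urljoin` is the standard `urllib.parse` function.

```python
from urllib.parse import urljoin


def absolutize(start_url, pairs):
    """Resolve each link in [(link, title), ...] against start_url.

    Links that are already absolute are returned unchanged by urljoin.
    """
    return [(urljoin(start_url, link), title) for link, title in pairs]
```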

`get_song_details(self, raw_html)`

User overrides this method to return details for a song from raw HTML in the following format -

(
    'lyrics',
    [
        'singer1',
        'singer2',
    ],
    [
        'director1',
        'director2',
    ],
    [
        'lyricist1',
        'lyricist2',
    ]
)

Arguments -

raw_html : Raw HTML of song page.
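A minimal sketch of an override that returns this four-part tuple is shown below. The markup it expects (a `div` with class `lyrics` and `span` elements with classes `singer`, `director` and `lyricist`) is entirely hypothetical, `_find_all` is a helper invented for the example, and the function is written without `self` so it runs on its own.

```python
import re


def _find_all(css_class, raw_html):
    """Hypothetical helper: text of every <span class="css_class"> element."""
    pattern = r'<span class="%s">(.*?)</span>' % re.escape(css_class)
    return [text.strip() for text in re.findall(pattern, raw_html, flags=re.S)]


def get_song_details(raw_html):
    """Return (lyrics, singers, directors, lyricists) from a song page."""
    match = re.search(r'<div class="lyrics">(.*?)</div>', raw_html, flags=re.S)
    lyrics = ''
    if match:
        # Turn <br> line breaks into newlines before trimming whitespace.
        lyrics = re.sub(r'<br\s*/?>', '\n', match.group(1)).strip()
    return (
        lyrics,
        _find_all('singer', raw_html),
        _find_all('director', raw_html),
        _find_all('lyricist', raw_html),
    )
```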

class `CrawlerType1`

Base Class : BaseCrawler
Crawler for Az Lyrics.

Methods

`CrawlerType1(self, name, start_url, list_of_url, number_of_threads)`

Constructor for the class.
Arguments -
Same as that for CrawlerType0.

`run(self)`

Same as that for CrawlerType0.

`threader(self, thread_id)`

Same as that for CrawlerType0.

`get_artists(self, thread_id, url)`

Method called from threader if task type is to get artists from a page. Gets artists from a page and puts each of them back in the task_queue.
Arguments -
As usual.

`get_artist_albums(self, thread_id, url, artist)`

Method called from threader if task type is to get songs for an artist. Gets all songs for an artist and their details, storing them in database.
Arguments -
As usual.

`get_artists_with_url(self, raw_html)`

User overrides this method to get artists with URLs in following format -

[
    ('link1', 'artist1'),
    ('link2', 'artist2'),
]

Arguments -
As usual.

`get_albums_with_songs(self, raw_html)`

User overrides this method to get albums for an artist with all songs in it from artist's page in following format -

[
    (
        'album1',
        [
            ('url1', 'song1'),
            ('url2', 'song2')
        ]
    ),
    (
        'album2',
        [
            ('url3', 'song3'),
            ('url4', 'song4')
        ]
    )
]

Arguments -
As usual.
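The nested album format above can be built in a single pass over the page. The sketch below assumes, hypothetically, that each album title sits in a `<div class="album">` and that the song links following it belong to that album; the real artist page may be laid out differently. `_AlbumCollector` is a helper invented for the example, and the function is shown without `self`.

```python
from html.parser import HTMLParser


class _AlbumCollector(HTMLParser):
    """Hypothetical helper: groups song links under the preceding album title."""

    def __init__(self):
        super().__init__()
        self.albums = []    # [(album_name, [(url, song), ...]), ...]
        self._mode = None   # 'album' or 'song' while inside a relevant tag
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div' and attrs.get('class') == 'album':
            self._mode, self._text = 'album', []
        elif tag == 'a' and 'href' in attrs and self.albums:
            # Links seen before the first album title are ignored.
            self._mode, self._href, self._text = 'song', attrs['href'], []

    def handle_data(self, data):
        if self._mode:
            self._text.append(data)

    def handle_endtag(self, tag):
        text = ''.join(self._text).strip()
        if self._mode == 'album' and tag == 'div':
            self.albums.append((text, []))
            self._mode = None
        elif self._mode == 'song' and tag == 'a':
            self.albums[-1][1].append((self._href, text))
            self._mode = None


def get_albums_with_songs(raw_html):
    """Return [('album1', [('url1', 'song1'), ...]), ...] from an artist page."""
    collector = _AlbumCollector()
    collector.feed(raw_html)
    collector.close()
    return collector.albums
```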

`get_song_details(self, song_html)`

User overrides this method to get the lyrics for the song from the raw HTML of the song page.
Arguments -
As usual.

class `CrawlerType2`

Base class : BaseCrawler
Crawler for website Metro Lyrics.

Methods

`CrawlerType2(self, name, start_url, list_of_urls, number_of_threads)`

Constructor for the class.
Arguments -
As usual.

`run(self)`

Same as that for CrawlerType0.

`threader(self, thread_id)`

Same as that for CrawlerType0.

`get_artists(self, thread_id, url)`

Same as that for CrawlerType1.

`get_artist(self, thread_id, url, artist)`

Method called by threader when the task type is to get an artist's songs. Gets all the songs for an artist and puts each of them in the task_queue.
Arguments -
As usual.

`get_songs_from_page(self, thread_id, url, artist)`

Method called by threader when task type is to get songs from an artist page.
Arguments -
As usual.

`get_song(self, thread_id, url, song, artist)`

Method called by threader to get song details.
Arguments -
As usual.

`get_song_details(self, raw_html)`

User overrides this method to return song details from raw HTML in the following format -

(
    'album',
    'lyrics',
    [
        'lyricist1',
        'lyricist2'
    ],
    [
        'other_artist1',
        'other_artist2',
    ]
)

Arguments -
As usual.

`get_artist_with_url(self, raw_html)`

User overrides this method to get artists with URLs from a page in following format -

[
    ('url1', 'artist1'),
    ('url2', 'artist2')
]

Arguments -
As usual.

`get_pages_for_artist(self, raw_html)`

User overrides this method to return all pages that contain songs by an artist in the following format -

[
    'url1',
    'url2'
]

Arguments -
As usual.
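A short sketch of such an override follows. It assumes, hypothetically, that pagination links are marked up as `<a class="page" href="...">`; the real site's markup is not described here. Duplicate URLs are dropped while keeping their first-seen order, and the function is shown without `self` so it runs on its own.

```python
import re


def get_pages_for_artist(raw_html):
    """Return ['url1', 'url2', ...] for the artist's song-list pages."""
    hrefs = re.findall(r'<a class="page" href="([^"]+)"', raw_html)
    # dict.fromkeys keeps insertion order, so this de-duplicates stably.
    return list(dict.fromkeys(hrefs))
```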

`get_songs(self, raw_html)`

User overrides this method to get list of songs with URLs in following format -

[
    ('url1', 'song1'),
    ('url2', 'song2'),
]

Arguments -
As usual.
