Class Documentation

Pratyush Singh edited this page Jun 15, 2016 · 4 revisions

Documentation

class BaseCrawler

Main class inherited by all crawler classes.

Methods

BaseCrawler(self, name, start_url)

Constructor for the class.
Arguments -

  • name : Name of crawler.
  • start_url : Base URL of the website.

class CrawlerType0

Base class : BaseCrawler
Crawler for websites Hindi Lyrics, Smriti and Lyrics Masti.

Methods

CrawlerType0(self, name, start_url, list_of_url, number_of_threads)

Constructor for the class.
Arguments -

  • name : Name of crawler.
  • start_url : Base URL of the website.
  • list_of_url : List of URL(s) to start with.
  • number_of_threads : Number of threads to use to crawl.

threader(self, thread_id)

Worker method. Gets a task from task_queue and does the corresponding job.
Arguments -

  • thread_id : Assigned ID of thread.

run(self)

Method called from derived classes to start the crawling process.
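
The interplay between run and threader described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: the class name WorkerPoolSketch, the dict-shaped tasks, the None sentinel and the results list are all assumptions made for the example. Only task_queue, threader, run, get_movies, list_of_url and number_of_threads come from this page.

```python
import queue
import threading


class WorkerPoolSketch:
    """Hypothetical stand-in for the run/threader pair; not the library's code."""

    def __init__(self, list_of_url, number_of_threads):
        self.task_queue = queue.Queue()
        self.number_of_threads = number_of_threads
        self.results = []                      # collected only for this demo
        self.results_lock = threading.Lock()
        for url in list_of_url:
            # Assumed task shape: a dict with a 'type' and a 'url'.
            self.task_queue.put({'type': 'get_movies', 'url': url})

    def threader(self, thread_id):
        # Worker loop: take a task, dispatch on its type, mark it done.
        while True:
            task = self.task_queue.get()
            if task is None:                   # sentinel: no more work
                self.task_queue.task_done()
                break
            try:
                if task['type'] == 'get_movies':
                    self.get_movies(thread_id, task['url'])
            finally:
                self.task_queue.task_done()

    def get_movies(self, thread_id, url):
        # Placeholder job: a real crawler would download and parse the page.
        with self.results_lock:
            self.results.append(url)

    def run(self):
        threads = [
            threading.Thread(target=self.threader, args=(i,))
            for i in range(self.number_of_threads)
        ]
        for t in threads:
            t.start()
        self.task_queue.join()                 # wait until every task is processed
        for _ in threads:
            self.task_queue.put(None)          # one sentinel per worker
        for t in threads:
            t.join()
```

A job that discovers more work (for example, movies found on a page) can put new tasks on task_queue before it returns; the join call in run then waits for those tasks as well.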

download_movie(self, thread_id, url, movie)

Method called from threader if the task type is to get songs from a movie. Gets all songs from a movie and saves them in the database.
Arguments -

  • thread_id : Assigned ID of thread.
  • url : URL for the movie.
  • movie : Name of movie.

get_movies(self, thread_id, url)

Method called from threader if the task type is to get movies from a page. Gets the movies listed on a webpage.
Arguments -

  • thread_id : Assigned ID of thread.
  • url : URL of page.

get_movies_with_url(self, raw_html)

User overrides this method to get list of movies with URL in following format -

[
    ('link1', 'movie1'),
    ('link2', 'movie2'),
]

Arguments -

  • raw_html : Raw HTML code of the page.

get_songs_with_url(self, raw_html)

User overrides this method to get list of songs with URL from a movie page in following format -

[
    ('link1', 'song1'),
    ('link2', 'song2'),
]

Arguments -

  • raw_html : Raw HTML of the page.

get_song_details(self, raw_html)

User overrides this method to get details for a song from raw HTML in the following format -

(
    'lyrics',
    [
        'singer1',
        'singer2',
    ],
    [
        'director1',
        'director2',
    ],
    [
        'lyricist1',
        'lyricist2',
    ]
)

Arguments -

  • raw_html : Raw HTML of song page.
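
As an illustration of the three overridable methods above, here is a minimal sketch of what a site-specific implementation might look like. The HTML layout (a pre tag with id "lyrics", span tags with classes "singer", "director" and "lyricist") is invented for the example, and ExampleSiteOverrides is a hypothetical name. In real use these methods would live in a class derived from CrawlerType0 and would need to match the actual markup of the target site.

```python
from html.parser import HTMLParser
import re


class _LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a href=...> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None


def _links(raw_html):
    collector = _LinkCollector()
    collector.feed(raw_html)
    return collector.links


def _names(raw_html, css_class):
    # Hypothetical markup: <span class="singer">Name</span>
    pattern = r'<span class="%s">(.*?)</span>' % re.escape(css_class)
    return re.findall(pattern, raw_html)


class ExampleSiteOverrides:  # in real use this would derive from CrawlerType0
    def get_movies_with_url(self, raw_html):
        # Returns [('link1', 'movie1'), ...]
        return _links(raw_html)

    def get_songs_with_url(self, raw_html):
        # Returns [('link1', 'song1'), ...]
        return _links(raw_html)

    def get_song_details(self, raw_html):
        # Returns (lyrics, singers, directors, lyricists)
        match = re.search(r'<pre id="lyrics">(.*?)</pre>', raw_html, re.S)
        lyrics = match.group(1).strip() if match else ''
        return (
            lyrics,
            _names(raw_html, 'singer'),
            _names(raw_html, 'director'),
            _names(raw_html, 'lyricist'),
        )
```

Returning an empty list for a missing field keeps the tuple shape constant; whether the crawler accepts empty lists there is an assumption.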

class CrawlerType1

Base Class : BaseCrawler
Crawler for Az Lyrics.

Methods

CrawlerType1(self, name, start_url, list_of_url, number_of_threads)

Constructor for the class.
Arguments -
Same as that for CrawlerType0.

run(self)

Same as that for CrawlerType0.

threader(self, thread_id)

Same as that for CrawlerType0.

get_artists(self, thread_id, url)

Method called from threader if task type is to get artists from a page. Gets artists from a page and puts each of them back in the task_queue.
Arguments -
As usual.

get_artist_albums(self, thread_id, url, artist)

Method called from threader if task type is to get songs for an artist. Gets all songs for an artist and their details, storing them in database.
Arguments -
As usual.

get_artists_with_url(self, raw_html)

User overrides this method to get artists with URLs in following format -

[
    ('link1', 'artist1'),
    ('link2', 'artist2'),
]

Arguments -
As usual.

get_albums_with_songs(self, raw_html)

User overrides this method to get albums for an artist with all songs in it from artist's page in following format -

[
   (
        'album1',
        [
            ('url1', 'song1'),
            ('url2', 'song2')
        ]
    ),
    (
        'album2',
        [
            ('url3', 'song3'),
            ('url4', 'song4')
        ]
    )
]

Arguments -
As usual.
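
The nested return format of get_albums_with_songs can be built in a single pass over the page. The sketch below assumes a hypothetical layout in which each album title sits in a div with class "album" and is followed by the links to its songs. The real Az Lyrics markup may differ, and ExampleArtistPageParser is an illustrative name rather than part of the library.

```python
import re

# One pattern that matches either an album heading or a song link,
# so both can be walked in document order.
TOKEN = re.compile(
    r'<div class="album">(?P<album>.*?)</div>'
    r'|<a href="(?P<url>[^"]*)">(?P<song>.*?)</a>',
    re.S,
)


class ExampleArtistPageParser:  # in real use this would derive from CrawlerType1
    def get_albums_with_songs(self, raw_html):
        albums = []
        for m in TOKEN.finditer(raw_html):
            if m.group('album') is not None:
                # Start a new (album, songs) entry.
                albums.append((m.group('album').strip(), []))
            elif albums:
                # Attach the song to the most recent album heading.
                albums[-1][1].append((m.group('url'), m.group('song').strip()))
        return albums
```

Links that appear before the first album heading are ignored in this sketch; a real override might instead collect them under a placeholder album name.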

get_song_details(self, song_html)

User overrides this method to get lyrics for the song from the raw HTML of the song page.
Arguments -
As usual.

class CrawlerType2

Base class : BaseCrawler
Crawler for website Metro Lyrics.

Methods

CrawlerType2(self, name, start_url, list_of_urls, number_of_threads)

Constructor for the class.
Arguments -
As usual.

run(self)

Same as that for CrawlerType0.

threader(self, thread_id)

Same as that for CrawlerType0.

get_artists(self, thread_id, url)

Same as that for CrawlerType1.

get_artist(self, thread_id, url, artist)

Method called by threader when task type is to get artist songs. Gets all the songs for an artist and puts each of them in the task_queue.
Arguments -
As usual.

get_songs_from_page(self, thread_id, url, artist)

Method called by threader when task type is to get songs from an artist page.
Arguments -
As usual.

get_song(self, thread_id, url, song, artist)

Method called by threader to get song details.
Arguments -
As usual.

get_song_details(self, raw_html)

User overrides this method to get song details from raw HTML in the following format -

(
    'album',
    'lyrics',
    [
        'lyricist1',
        'lyricist2'
    ],
    [
        'other_artist1',
        'other_artist2',
    ]
)

Arguments -
As usual.

get_artist_with_url(self, raw_html)

User overrides this method to get artists with URLs from a page in following format -

[
    ('url1', 'artist1'),
    ('url2', 'artist2')
]

Arguments -
As usual.

get_pages_for_artist(self, raw_html)

User overrides this method to get all pages that contain songs by an artist in the following format -

[
    'url1',
    'url2'
]

Arguments -
As usual.
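
A possible shape for a get_pages_for_artist override is sketched below. It assumes pagination links are marked with a hypothetical class "page" and resolves them against an assumed base_url attribute. This page does not say whether the crawler expects absolute or relative URLs, so treat that choice, and the names ExamplePagination and base_url, as illustrative only.

```python
import re
from urllib.parse import urljoin


class ExamplePagination:  # in real use this would derive from CrawlerType2
    # Hypothetical attribute; the real class may expose the start URL differently.
    base_url = 'http://example.com/artist/'

    def get_pages_for_artist(self, raw_html):
        hrefs = re.findall(r'<a class="page" href="([^"]+)">', raw_html)
        pages = []
        for href in hrefs:
            url = urljoin(self.base_url, href)   # resolve relative links
            if url not in pages:                 # drop duplicates, keep order
                pages.append(url)
        return pages
```

De-duplicating while preserving order avoids queueing the same page twice when a site repeats its pagination links at the top and bottom of a page.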

get_songs(self, raw_html)

User overrides this method to get list of songs with URLs in following format -

[
    ('url1', 'song1'),
    ('url2', 'song2'),
]

Arguments -
As usual.