Supporting a new site
Example of how to support a new site by creating a ripper.
In this example, I'll be supporting share-image.com
- Learn Python
Short refresher: learnxinyminutes/python. Longer read: Learn Python the Hard Way by Zed Shaw.
- Clone the repo (if you haven't already):
git clone [email protected]:4pr0n/rip.git
- Go to the /rip/sites/ directory via command-line
- Ensure the darn thing works. Use the tester script:
python test.py
- Copy the skeleton ripper to a new file.
cp _testsite.py site_shareimage.py
Note that instead of 'shareimage', use the name of the site.
Now we're ready to write some Python.
_testsite.py contains the basic class structure for a ripper. We've copied the file to site_shareimage.py, so we will be editing the contents of that file.
- Change the name of the class from testsite:
class testsite(basesite):
to something more appropriate, like shareimage:
class shareimage(basesite):
sanitize_url() validates and enforces a given URL.
- Edit sanitize_url() to check for share-image.
Before:
if not 'testsite.com/' in url:
raise Exception('')
After:
if not 'share-image.com/' in url:
raise Exception('')
The blank Exception should be thrown if the site is not related to this ripper. This tells the caller to move on to another site's ripper.
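To see why the blank exception matters, here is a hypothetical sketch of a caller trying rippers in turn. The names fakesite and find_ripper are illustrative only, not the project's actual API:

```python
class fakesite(object):
    """Illustrative stand-in for a ripper class."""
    def __init__(self, url):
        if not 'fakesite.com/' in url:
            raise Exception('')  # blank exception: "this URL is not mine"
        self.url = url

def find_ripper(url, rippers):
    # Try each ripper; a blank exception means "move on to the next one"
    for ripper in rippers:
        try:
            return ripper(url)
        except Exception:
            continue
    return None

print(find_ripper('http://fakesite.com/123', [fakesite]))  # a fakesite instance
print(find_ripper('http://other.com/123', [fakesite]))     # None
```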
- Enforce the desired URL
Share-image has lots of URLs. We only want to accept URLs that point to galleries.
The format for a share-image gallery is http://share-image.com/1234-gallery-name, where 1234 is the gallery ID and 'gallery-name' is the name.
We want to make sure the format is share-image.com/[numbers]
# Crazy regex
if not re.compile('^.*share-image\.com\/\d*-?[a-zA-Z0-9\-]*$').match(url):
raise Exception('required "share-image.com/[numbers]-..." not found')
This exception stops the ripping process and notifies the user of an issue with the URL. Be descriptive!
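To get a feel for what that regex accepts, you can poke at it in a quick standalone check (separate from the ripper itself):

```python
import re

pattern = re.compile(r'^.*share-image\.com/\d*-?[a-zA-Z0-9\-]*$')

# A proper gallery URL matches:
print(bool(pattern.match('http://share-image.com/1234-gallery-name')))  # True
# A bare gallery ID with no name also matches:
print(bool(pattern.match('http://share-image.com/5078')))               # True
# Other pages on the site do not (the '.' in '.html' is outside the char class):
print(bool(pattern.match('http://share-image.com/about.html')))         # False
```

One caveat: since \d* allows zero digits, a URL like share-image.com/gallery-name with no numeric ID would also slip through; use \d+ instead if you want to require the ID.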
- Format the URL as needed
Sometimes we want to strip out unnecessary text from URLs (like hashtags or query strings).
if '#' in url: url = url[:url.find('#')]
if '?' in url: url = url[:url.find('?')]
This doesn't seem to be the case with share-image, but it might come in handy for other sites.
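As a quick standalone demonstration of what those two lines do when a URL does carry extra fields (wrapped in a throwaway function just for this example):

```python
def strip_extras(url):
    # Drop the fragment first, then the query string
    if '#' in url: url = url[:url.find('#')]
    if '?' in url: url = url[:url.find('?')]
    return url

print(strip_extras('http://share-image.com/1234-abc?page=2#top'))
# http://share-image.com/1234-abc
```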
get_dir() returns a unique working directory, album name, and zip name for a given album.
Note: The name must be file-system safe! No special characters like /, ?, *, etc.
Also, make the name unique so as to not clobber other existing galleries.
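If a site's gallery names can contain special characters, one generic way to make them file-system safe is to substitute anything suspicious. This helper is a sketch of the idea, not something basesite provides:

```python
import re

def safe_name(name):
    # Replace anything that isn't alphanumeric, dash, underscore, or dot
    return re.sub(r'[^A-Za-z0-9._-]', '_', name)

print(safe_name('my/gallery: cool*pics?'))  # my_gallery__cool_pics_
```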
- Get a unique name or ID from the given URL.
For share-image, we only need the digits that follow share-image.com/:
gid = url[url.rfind('/')+1:]
gid = gid[:gid.find('-')]
Using share-image.com/1234-abc, the first line grabs 1234-abc, and the second strips the trailing gallery name, leaving 1234 in gid.
- Return the ripper name with the unique name.
return 'shareimage_%s' % gid
Using the above example, this would return the directory name shareimage_1234
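The whole extraction can be sketched end-to-end as a standalone function. The `if '-' in gid` guard is a small addition here: without it, `gid[:gid.find('-')]` would chop the last digit off a URL that has no dash at all:

```python
def get_dir(url):
    gid = url[url.rfind('/')+1:]   # e.g. '1234-abc'
    if '-' in gid:
        gid = gid[:gid.find('-')]  # keep only the gallery ID: '1234'
    return 'shareimage_%s' % gid

print(get_dir('http://share-image.com/1234-abc'))  # shareimage_1234
print(get_dir('http://share-image.com/5078'))      # shareimage_5078
```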
download() downloads the images in the album. The process is to: initialize, get the album source, find the images, and download them.
- Initialize
self.init_dir()
This initializes (creates) the working directory.
- Get album source
r = self.web.get(self.url)
This retrieves the contents of the URL and stores it in r.
- Find the images
This varies from site to site, and you may have to experiment to get it to work with other sites. Sometimes you will have to perform a while loop if there are multiple pages of images.
Share-image stores the full-size images very similarly to how the thumbnails are stored. Because of this, we can download the full-sized images if we know the thumbnail URLs.
In the share-image album source, thumbnail URLs are wrapped between _self"><img src="
and "
tags.
thumbs = self.web.between(r, '_self"><img src="', '"')
Lots of things happen here. The self.web.between() method:
Looks through the page source `r`,
Returns a list of all strings in `r` between `_self"><img src="` and `"`.
Now thumbs is a list containing all thumbnail URLs for the album.
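self.web.between() comes from the web helper in the superclass. A minimal sketch of what a between-style helper does, written here from scratch for illustration (not the project's actual implementation), looks like:

```python
def between(source, start, finish):
    # Collect every substring of `source` sitting between `start` and `finish`
    results = []
    i = source.find(start)
    while i != -1:
        i += len(start)
        j = source.find(finish, i)
        if j == -1:
            break
        results.append(source[i:j])
        i = source.find(start, j + len(finish))
    return results

html = ('... _self"><img src="http://pics.share-image.com/thumb/1.jpg" ...'
        ' _self"><img src="http://pics.share-image.com/thumb/2.jpg" ...')
print(between(html, '_self"><img src="', '"'))
# ['http://pics.share-image.com/thumb/1.jpg', 'http://pics.share-image.com/thumb/2.jpg']
```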
- Download the images
Before we can download, we have to iterate over the thumbnails and alter the thumbnail URLs to point at the full-sized images.
for (index, thumb) in enumerate(thumbs):
full = thumb.replace('pics.share-image.com', 'pictures.share-image.com')
full = full.replace('/thumb/', '/big/')
full now contains the path to the full-sized image. Now we can download it:
self.download_image( full, index + 1, total=len(thumbs) )
This kicks off the threaded downloader managed in the superclass basesite. The image is downloaded in a new thread and saved accordingly.
- Wait for threads to finish
We don't want to exit this function until the threads have completed. Luckily, there's a helper method for that:
self.wait_for_threads()
One benefit is that this method will also delete the working directory if no images are found after the threads finish. This prevents 'orphaned' or empty archives.
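The threaded download-and-wait pattern that basesite manages for you can be pictured roughly like this. The names download_image and the joining step come from this guide, but the body here is a stand-in that only records progress instead of fetching anything:

```python
import threading

results = []

def download_image(url, index, total):
    # Stand-in for the real downloader: record progress instead of fetching
    results.append('[%d/%d] %s' % (index, total, url))

urls = ['http://pictures.share-image.com/big/%d.jpg' % n for n in (1, 2, 3)]
threads = []
for index, url in enumerate(urls):
    t = threading.Thread(target=download_image, args=(url, index + 1, len(urls)))
    threads.append(t)
    t.start()
for t in threads:
    t.join()  # joining every thread is, roughly, what wait_for_threads() does

print(sorted(results))
```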
A few more things needed to be added (explained via comments), but here's the final result:
class shareimage(basesite):

    def sanitize_url(self, url):
        if not 'share-image.com/' in url: # Verify URL is share-image
            raise Exception('')
        # Ensure URL points to a share-image album
        if not re.compile('^.*share-image\.com\/\d*-?[a-zA-Z0-9\-]*$').match(url):
            raise Exception('required share-image.com/[numbers]-... not found in URL')
        # Strip excess fields from URL
        if '#' in url: url = url[:url.find('#')]
        if '?' in url: url = url[:url.find('?')]
        return url

    def get_dir(self, url):
        gid = url[url.rfind('/')+1:] # Get trailing full gallery name
        gid = gid[:gid.find('-')]    # Strip off trailing gallery name
        return 'shareimage_%s' % gid # Working directory is now 'shareimage_[galleryid]'

    def download(self):
        self.init_dir()
        r = self.web.get(self.url) # Get page source
        thumbs = self.web.between(r, '_self"><img src="', '"') # Extract thumbnail URLs
        for index, thumb in enumerate(thumbs):
            # Convert thumbnail URL to full-size image URL
            full = thumb.replace('pics.share-image.com', 'pictures.share-image.com')
            full = full.replace('/thumb/', '/big/')
            if self.urls_only:
                # User only wants URLs to direct images, not the downloaded images
                self.add_url(index, full, total=len(thumbs))
            else:
                # Download the image (threaded)
                self.download_image(full, index + 1, total=len(thumbs))
            if self.hit_image_limit(): break # Stop if we hit the maximum number of images
        self.wait_for_threads() # Wait for existing threads to finish
Only a few dozen lines of code! That wasn't so bad.
Testing the ripper via the command-line is much easier than using the web UI: you can debug and view stack traces directly. This is accomplished using the test.py script. We add an album to rip to this script and execute it.
This will require editing test.py, so open it up and get coding.
- Import the new ripper
At the top of test.py, import the new ripper:
from site_shareimage import shareimage
- Add a 'test case' for the new ripper
In this file, you will find lots of integration tests for various rippers. They all start with i =
. Scroll to the bottom of this list, comment out the last test, and add your test.
i = shareimage('http://www.share-image.com/5078-tanya-a-cute-teen-with-puffy-nipples')
The code below this initializes the ripper stored in i and attempts to download the album. No other changes are needed.
- Save test.py and execute it:
python test.py
You should see the ripper work its magic.
If the command-line ripper works for you, that's good enough for me. I can take it from there.
If you want to go all-out and get the Web UI to support your new ripper, you will need to edit one more file: rip.cgi in the base directory.
- Import your ripper
from site_shareimage import shareimage
- Add your ripper to the list inside of get_ripper()
There is a list of rippers that the website iterates through to find the appropriate ripper to use.
The list starts with sites = [
and you can't miss it.
Add your ripper to the bottom (or top) of the list
shareimage, \
- Direct your browser to your local Apache instance, paste in a URL, and hit Rip & Zip
You can add debug statements to the ripper using
self.debug('this is a debug statement')
These statements won't print unless debugging is enabled. This is done when the ripper is initialized:
i = shareimage('http://www.share-image.com/5078-tanya-a-cute-teen-with-puffy-nipples', debugging=True)
Do not throw exceptions within download() without calling self.wait_for_threads() first!
Not doing so may leave the working directory in whatever state it was in (log.txt may contain log lines, half-downloaded images could be left over). wait_for_threads() manages all of this for you.
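That rule can be demonstrated with a tiny stand-in class. This is hypothetical scaffolding for illustration; only the method names download() and wait_for_threads() mirror this guide's API, and the real helper joins threads and deletes empty directories rather than flipping a flag:

```python
class stubsite(object):
    """Hypothetical minimal stand-in for basesite, just to show the pattern."""
    def __init__(self):
        self.cleaned_up = False

    def wait_for_threads(self):
        # Stand-in: the real helper joins download threads and tidies the
        # working directory; here we just record that cleanup happened
        self.cleaned_up = True

    def download(self, thumbs):
        if not thumbs:
            # Flush threads (and clean up) BEFORE raising
            self.wait_for_threads()
            raise Exception('no images found in album')
        return len(thumbs)

site = stubsite()
try:
    site.download([])
except Exception as e:
    print(e)             # no images found in album
print(site.cleaned_up)   # True
```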
Albums spread across multiple pages are tricky to rip. Look at other examples of rippers that handle pagination.