Supporting a new site
Example of how to support a new site by creating a ripper.
In this example, I'll be supporting share-image.com
- Learn Python
Short refresher: learnxinyminutes/python. Longer read: Learn Python the Hard Way by Zed Shaw.
- Clone the repo (if you haven't already):
git clone [email protected]:4pr0n/rip.git
- Go to the /rip/sites/ directory via command-line
- Ensure the darn thing works. Use the tester script:
python test.py
- Copy the skeleton ripper to a new file.
cp _testsite.py site_shareimage.py
Note that instead of 'shareimage', use the name of the site.
Now we're ready to write some Python.
_testsite.py contains the basic class structure for a ripper. We've copied the file to site_shareimage.py, so we will be editing the contents of that file.
- Change the name of the class from testsite:
class testsite(basesite):
to something more appropriate, like shareimage:
class shareimage(basesite):
sanitize_url() validates and enforces a given URL.
- Edit sanitize_url() to check for share-image.
Before:
if not 'testsite.com/' in url:
raise Exception('')
After:
if not 'share-image.com/' in url:
raise Exception('')
The blank Exception should be thrown if the site is not related to this ripper. This tells the caller to move on to another site's ripper.
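To see why the blank exception matters, here is a hypothetical sketch of a caller trying rippers in turn. The names fakesite and find_ripper are illustrative only, not the project's actual API:

```python
class fakesite(object):
    """Illustrative stand-in for a ripper class."""
    def __init__(self, url):
        if not 'fakesite.com/' in url:
            raise Exception('')  # blank exception: "this URL is not mine"
        self.url = url

def find_ripper(url, rippers):
    # Try each ripper; a blank exception means "move on to the next one"
    for ripper in rippers:
        try:
            return ripper(url)
        except Exception:
            continue
    return None

print(find_ripper('http://fakesite.com/123', [fakesite]))  # a fakesite instance
print(find_ripper('http://other.com/123', [fakesite]))     # None
```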
- Enforce the desired URL
Share-image has lots of URLs. We only want to accept URLs that point to galleries.
The format for a share-image gallery is http://share-image.com/1234-gallery-name, where 1234 is the gallery ID and 'gallery-name' is the name.
We want to make sure the format is share-image.com/[numbers]
# Crazy regex
if not re.compile('^.*share-image\.com\/\d*-?[a-zA-Z0-9\-]*$').match(url):
raise Exception('required "share-image.com/[numbers]-..." not found')
This exception stops the ripping process and notifies the user of an issue with the URL. Be descriptive!
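To get a feel for what that regex accepts, you can poke at it in a quick standalone check (separate from the ripper itself):

```python
import re

pattern = re.compile(r'^.*share-image\.com/\d*-?[a-zA-Z0-9\-]*$')

# A proper gallery URL matches:
print(bool(pattern.match('http://share-image.com/1234-gallery-name')))  # True
# A bare gallery ID with no name also matches:
print(bool(pattern.match('http://share-image.com/5078')))               # True
# Other pages on the site do not (the '.' in '.html' is outside the char class):
print(bool(pattern.match('http://share-image.com/about.html')))         # False
```

One caveat: since \d* allows zero digits, a URL like share-image.com/gallery-name with no numeric ID would also slip through; use \d+ instead if you want to require the ID.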
- Format the URL as needed
Sometimes we want to strip out unnecessary text from URLs (like hashtags or query strings).
if '#' in url: url = url[:url.find('#')]
if '?' in url: url = url[:url.find('?')]
This doesn't seem to be the case with share-image, but it might come in handy for other sites.
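As a quick standalone demonstration of what those two lines do when a URL does carry extra fields (wrapped in a throwaway function just for this example):

```python
def strip_extras(url):
    # Drop the fragment first, then the query string
    if '#' in url: url = url[:url.find('#')]
    if '?' in url: url = url[:url.find('?')]
    return url

print(strip_extras('http://share-image.com/1234-abc?page=2#top'))
# http://share-image.com/1234-abc
```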
get_dir() returns a unique working directory, album name, and zip name for a given album.
Note: The name must be file-system safe! No special characters like /, ?, *, etc.
Also, make the name unique so as to not clobber other existing galleries.
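If a site's gallery names can contain special characters, one generic way to make them file-system safe is to substitute anything suspicious. This helper is a sketch of the idea, not something basesite provides:

```python
import re

def safe_name(name):
    # Replace anything that isn't alphanumeric, dash, underscore, or dot
    return re.sub(r'[^A-Za-z0-9._-]', '_', name)

print(safe_name('my/gallery: cool*pics?'))  # my_gallery__cool_pics_
```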
- Get a unique name or ID from the given URL.
For share-image, we only need the digits that follow share-image.com/:
gid = url[url.rfind('/')+1:]
gid = gid[:gid.find('-')]
Using share-image.com/1234-abc, the first line grabs 1234-abc, and the second strips the trailing gallery name, leaving 1234 in gid.
- Return the ripper name with the unique name.
return 'shareimage_%s' % gid
Using the above example, this would return the directory name shareimage_1234
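The whole extraction can be sketched end-to-end as a standalone function. The `if '-' in gid` guard is a small addition here: without it, `gid[:gid.find('-')]` would chop the last digit off a URL that has no dash at all:

```python
def get_dir(url):
    gid = url[url.rfind('/')+1:]   # e.g. '1234-abc'
    if '-' in gid:
        gid = gid[:gid.find('-')]  # keep only the gallery ID: '1234'
    return 'shareimage_%s' % gid

print(get_dir('http://share-image.com/1234-abc'))  # shareimage_1234
print(get_dir('http://share-image.com/5078'))      # shareimage_5078
```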
download() downloads the images in the album. The process is to: initialize, get the album source, find the images, and download them.
- Initialize
self.init_dir()
This initializes (creates) the working directory.
- Get album source
r = self.web.get(self.url)
This retrieves the contents of the URL and stores it in r.
- Find the images
This varies from site to site, and you may have to experiment to get it to work with other sites. Sometimes you will have to perform a while loop if there are multiple pages of images.
Share-image stores the full-size images very similarly to how the thumbnails are stored. Because of this, we can download the full-sized images if we know the thumbnail URLs.
In the share-image album source, thumbnail URLs are wrapped between _self"><img src="
and "
tags.
thumbs = self.web.between(r, '_self"><img src="', '"')
Lots of things happen here. The self.web.between() method:
Looks through the page source `r`,
Returns a list of all strings in `r` between `_self"><img src="` and `"`.
Now thumbs is a list containing all thumbnail URLs for the album.
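self.web.between() comes from the web helper in the superclass. A minimal sketch of what a between-style helper does, written here from scratch for illustration (not the project's actual implementation), looks like:

```python
def between(source, start, finish):
    # Collect every substring of `source` sitting between `start` and `finish`
    results = []
    i = source.find(start)
    while i != -1:
        i += len(start)
        j = source.find(finish, i)
        if j == -1:
            break
        results.append(source[i:j])
        i = source.find(start, j + len(finish))
    return results

html = ('... _self"><img src="http://pics.share-image.com/thumb/1.jpg" ...'
        ' _self"><img src="http://pics.share-image.com/thumb/2.jpg" ...')
print(between(html, '_self"><img src="', '"'))
# ['http://pics.share-image.com/thumb/1.jpg', 'http://pics.share-image.com/thumb/2.jpg']
```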
- Download the images
Before we can download, we have to iterate over the thumbnails and alter the thumbnail URLs to point at the full-sized images.
for (index, thumb) in enumerate(thumbs):
full = thumb.replace('pics.share-image.com', 'pictures.share-image.com')
full = full.replace('/thumb/', '/big/')
full now contains the path to the full-sized image. Now we can download it:
self.download_image( full, index + 1, total=len(thumbs) )
This kicks off the threaded downloader managed in the superclass basesite. The image is downloaded in a new thread and saved accordingly.
- Wait for threads to finish
We don't want to exit this function until the threads have completed. Luckily, there's a helper method for that:
self.wait_for_threads()
One benefit is that this method will also delete the working directory if no images are found after the threads finish. This prevents 'orphaned' or empty archives.
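The threaded download-and-wait pattern that basesite manages for you can be pictured roughly like this. The names download_image and the joining step come from this guide, but the body here is a stand-in that only records progress instead of fetching anything:

```python
import threading

results = []

def download_image(url, index, total):
    # Stand-in for the real downloader: record progress instead of fetching
    results.append('[%d/%d] %s' % (index, total, url))

urls = ['http://pictures.share-image.com/big/%d.jpg' % n for n in (1, 2, 3)]
threads = []
for index, url in enumerate(urls):
    t = threading.Thread(target=download_image, args=(url, index + 1, len(urls)))
    threads.append(t)
    t.start()
for t in threads:
    t.join()  # joining every thread is, roughly, what wait_for_threads() does

print(sorted(results))
```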
A few more things needed to be added (explained via comments), but here's the final result:
class shareimage(basesite):

    def sanitize_url(self, url):
        if not 'share-image.com/' in url: # Verify URL is share-image
            raise Exception('')
        # Ensure URL points to a share-image album
        if not re.compile('^.*share-image\.com\/\d*-?[a-zA-Z0-9\-]*$').match(url):
            raise Exception('required share-image.com/[numbers]-... not found in URL')
        # Strip excess fields from URL
        if '#' in url: url = url[:url.find('#')]
        if '?' in url: url = url[:url.find('?')]
        return url

    def get_dir(self, url):
        gid = url[url.rfind('/')+1:] # Get trailing full gallery name
        gid = gid[:gid.find('-')]    # Strip off trailing gallery name
        return 'shareimage_%s' % gid # Working directory is now 'shareimage_[galleryid]'

    def download(self):
        self.init_dir()
        r = self.web.get(self.url) # Get page source
        thumbs = self.web.between(r, '_self"><img src="', '"') # Extract thumbnail URLs
        for index, thumb in enumerate(thumbs):
            # Convert thumbnail URL to full-size image URL
            full = thumb.replace('pics.share-image.com', 'pictures.share-image.com')
            full = full.replace('/thumb/', '/big/')
            if self.urls_only:
                # User only wants URLs to direct images, not the downloaded images
                self.add_url(index, full, total=len(thumbs))
            else:
                # Download the image (threaded)
                self.download_image(full, index + 1, total=len(thumbs))
            if self.hit_image_limit(): break # Stop if we hit the maximum number of images
        self.wait_for_threads() # Wait for existing threads to finish
Only a few dozen lines of code! That wasn't so bad.
Testing the ripper via the command-line is much easier than using the web UI: you can debug and view stack traces directly. This is accomplished using the test.py script. We add an album to rip to this script and execute it.
This will require editing test.py, so open it up and get coding.
- Import the new ripper
At the top of test.py, import the new ripper:
from site_shareimage import shareimage
- Add a 'test case' for the new ripper
In this file, you will find lots of integration tests for various rippers. They all start with i =
. Scroll to the bottom of this list, comment out the last test, and add your test.
i = shareimage('http://www.share-image.com/5078-tanya-a-cute-teen-with-puffy-nipples')
The code below this initializes the ripper stored in i and attempts to download the album. No other changes are needed.
- Save test.py and execute it:
python test.py
You should see the ripper work its magic.
If the command-line ripper works for you, that's good enough for me. I can take it from there.
If you want to go all-out and get the Web UI to support your new ripper, you will need to edit one more file: rip.cgi in the base directory.
- Import your ripper
from site_shareimage import shareimage
- Add your ripper to the list inside of get_ripper()
There is a list of rippers that the website iterates through to find the appropriate ripper to use.
The list starts with sites = [
and you can't miss it.
Add your ripper to the bottom (or top) of the list
shareimage, \
- Direct your browser to your local Apache instance, paste in a URL, and hit Rip & Zip
You can add debug statements to the ripper using
self.debug('this is a debug statement')
These statements won't print unless debugging is enabled. This is done when the ripper is initialized:
i = shareimage('http://www.share-image.com/5078-tanya-a-cute-teen-with-puffy-nipples', debugging=True)
Do not throw exceptions within download() without calling self.wait_for_threads() first!
Not doing so may leave the working directory in whatever state it was in (log.txt may contain log lines, half-downloaded images could be left over). wait_for_threads() manages all of this for you.
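That rule can be demonstrated with a tiny stand-in class. This is hypothetical scaffolding for illustration; only the method names download() and wait_for_threads() mirror this guide's API, and the real helper joins threads and deletes empty directories rather than flipping a flag:

```python
class stubsite(object):
    """Hypothetical minimal stand-in for basesite, just to show the pattern."""
    def __init__(self):
        self.cleaned_up = False

    def wait_for_threads(self):
        # Stand-in: the real helper joins download threads and tidies the
        # working directory; here we just record that cleanup happened
        self.cleaned_up = True

    def download(self, thumbs):
        if not thumbs:
            # Flush threads (and clean up) BEFORE raising
            self.wait_for_threads()
            raise Exception('no images found in album')
        return len(thumbs)

site = stubsite()
try:
    site.download([])
except Exception as e:
    print(e)             # no images found in album
print(site.cleaned_up)   # True
```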
Albums spread across multiple pages are tricky to rip. Look at other examples of rippers that handle pagination.