Datasets #2

Open
saurabh-khanna opened this issue Dec 20, 2024 · 9 comments
@saurabh-khanna
Member

saurabh-khanna commented Dec 20, 2024

Let's start looking at different book datasets and, for each one, add:

  • Which variables exist in the data
    • Does it have what we really need: Popularity (ratings/copies sold/views/publisher), Quality (plot, synopsis, title)
  • How many books are covered
  • Pros of using it
  • Cons of using it
@saurabh-khanna
Member Author

saurabh-khanna commented Dec 20, 2024

Possible starting point here.

One approach could be to collect all available ISBNs from Open Library (and any other source), then send those ISBNs as API requests to ISBNdb. We can decide on this once we have seen which variables each dataset offers.
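
A minimal sketch of the first half of that approach, assuming ISBNs arrive as plain strings from several sources: merge them, drop duplicates, and split the result into batches for later API requests. The function names and the batch size are hypothetical; the actual request format and limits would have to come from the ISBNdb documentation.

```python
from itertools import islice


def merge_isbns(*sources):
    """Merge ISBN lists from several sources, dropping duplicates.

    Hyphens and spaces are stripped so '978-0-306-40615-7' and
    '9780306406157' count as the same ISBN. First-seen order is kept.
    """
    seen = {}
    for source in sources:
        for raw in source:
            isbn = raw.replace("-", "").replace(" ", "").upper()
            if isbn:
                seen.setdefault(isbn, None)
    return list(seen)


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


# Made-up input lists standing in for Open Library and one other source
open_library = ["978-0-306-40615-7", "9781609450786"]
other_source = ["9780306406157", "0-8044-2957-X"]

isbns = merge_isbns(open_library, other_source)
for batch in batched(isbns, 2):  # batch size of 2 is only for illustration
    print(batch)
```

Keeping the merge separate from the batching means the same clean list can be reused for any other API we end up querying.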

@yuetongwu7
Collaborator

yuetongwu7 commented Jan 8, 2025

Book Database Size Comparison

  • ISBNdb: nearly 40 million books
  • Open Library: 34,640,099 ISBNs retrieved
  • BookCrossing: 1,149,780 ratings for 271,379 books

@yuetongwu7
Collaborator

Comparison of Open Library and ISBNdb Metadata

| Open Library | ISBNdb |
| --- | --- |
| Title | Title |
| Subtitle | Long Title |
| Authors | Authors |
| Works (Related Works) | Related Books |
| ISBN-10 | ISBN |
| ISBN-13 | ISBN-13 |
| Library of Congress Control Number (LCCN) | Dewey Decimal Classification |
| OCLC Numbers | |
| Local ID | |
| Cover Image | Cover Image |
| Links | |
| Languages | Language |
| Translated From | |
| Translation Of | |
| Edition Name | Edition |
| Number of Pages | Page Count |
| Pagination | |
| Physical Dimensions | Book Dimensions (Length, Width, Height, Weight) |
| Physical Format | Binding Type |
| Copyright Date | |
| Publish Country | |
| Publish Date | Publication Date |
| Publish Places | |
| Publishers | Publisher |
| Contributions | |
| Dewey Decimal Classification | Dewey Decimal Classification |
| Genres | Subjects |
| Library of Congress Classifications (LCC) | |
| Other Titles | |
| Series | |
| Source Records | |
| Subjects | Subjects |
| Work Titles | |
| Table of Contents | |
| Description | Overview |
| First Sentence | |
| Notes | |
| Created Date | |
| Last Modified Date | |
| Revision History | |
| | MSRP (Manufacturer’s Suggested Retail Price) |
| | Excerpt |
| | Synopsis |
| | Reviews |
| | Prices from different merchants (Condition, Merchant, Shipping, Price, Total Price, Purchase Link) |
| | Other ISBNs (with bindings) |
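
If we end up combining both sources, a field crosswalk could help. The sketch below is only an illustration: the keys are the display labels from the comparison above, not confirmed API field names, the function name is hypothetical, and the sample record is made up.

```python
# Crosswalk from Open Library labels to ISBNdb labels, copied from the
# comparison above. These are display labels, not confirmed API keys.
OPEN_LIBRARY_TO_ISBNDB = {
    "Title": "Title",
    "Subtitle": "Long Title",
    "Authors": "Authors",
    "ISBN-10": "ISBN",
    "ISBN-13": "ISBN-13",
    "Languages": "Language",
    "Number of Pages": "Page Count",
    "Physical Format": "Binding Type",
    "Publish Date": "Publication Date",
    "Publishers": "Publisher",
    "Description": "Overview",
}


def to_isbndb_labels(record):
    """Rename the keys of an Open Library style record to ISBNdb labels.

    Fields without a counterpart in the crosswalk are returned separately
    so they are not silently dropped.
    """
    mapped, unmapped = {}, {}
    for key, value in record.items():
        if key in OPEN_LIBRARY_TO_ISBNDB:
            mapped[OPEN_LIBRARY_TO_ISBNDB[key]] = value
        else:
            unmapped[key] = value
    return mapped, unmapped


# Made-up record, for illustration only
sample = {"Title": "Example Title", "Number of Pages": 100, "OCLC Numbers": ["0000"]}
mapped, unmapped = to_isbndb_labels(sample)
print(mapped)
print(unmapped)
```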

@saurabh-khanna
Member Author

saurabh-khanna commented Jan 15, 2025

This seems useful for handling ISBNs: https://github.com/xlcnd/isbnlib

@yuetongwu7

@yuetongwu7
Collaborator

yuetongwu7 commented Jan 22, 2025

About Goodreads Data
I checked the scraper, and it requires a book ID to scrape metadata and ratings. There is no official list of IDs, but they seem to follow a sequential pattern (possibly the upload order), starting from 1 and increasing with each new addition. To collect data, we could iterate through these sequential IDs and match the results to our records by ISBN.

For example, a large ID such as https://www.goodreads.com/book/show/223810007 points to a newly uploaded book (published on January 19, 2025) that does not yet have any reviews. So by going through the IDs sequentially, we could scrape all available book data and match it with our ISBNs.

Reference of Scraper Version
GrimmXoXo. (2024, May 14). Feature Additions and Improvements: Goodreads Scraper #43 [Pull request]. GitHub. maria-antoniak/goodreads-scraper#43
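
A rough sketch of the sequential-ID idea, with no network calls: one helper generates the page URLs for a range of IDs, and another keeps only scraped records whose ISBN is in our own list. The record layout (a dict with an "isbn" key) and both function names are assumptions, not the scraper's actual output format.

```python
def goodreads_book_urls(start_id, stop_id):
    """Yield (book_id, url) pairs for IDs in the range [start_id, stop_id)."""
    for book_id in range(start_id, stop_id):
        yield book_id, f"https://www.goodreads.com/book/show/{book_id}"


def match_by_isbn(scraped_records, wanted_isbns):
    """Keep only scraped records whose ISBN appears in our own ISBN list.

    Assumes each record is a dict with an "isbn" key (hypothetical layout).
    """
    wanted = set(wanted_isbns)
    return {r["isbn"]: r for r in scraped_records if r.get("isbn") in wanted}


for book_id, url in goodreads_book_urls(1, 4):
    print(book_id, url)

# Made-up scraped records, for illustration only
records = [{"isbn": "9781609450786", "rating": None}, {"isbn": None}]
print(match_by_isbn(records, ["9781609450786"]))
```

With IDs already above 223 million, a full sweep would mean a very large number of requests, so the range would probably need to be split into chunks and run over time.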

@saurabh-khanna
Member Author

We also might be able to get some useful info from the ISBN itself?

Image
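
One possible reading of this, sketched under assumptions: an ISBN-13 starts with a 3-digit prefix, followed by a registration group that hints at language or region, and ends with a check digit. The lookup below covers only the single-digit groups under the 978 prefix; multi-digit groups and the 979 prefix are left out, and the function name is hypothetical.

```python
# Registration groups under the 978 prefix that are a single digit.
# Multi-digit groups exist too and are not covered in this sketch.
SINGLE_DIGIT_GROUPS_978 = {
    "0": "English language",
    "1": "English language",
    "2": "French language",
    "3": "German language",
    "4": "Japan",
    "5": "Russia and former USSR",
    "7": "China",
}


def describe_isbn13(isbn13):
    """Split an ISBN-13 into prefix, a coarse group label, and check digit."""
    digits = isbn13.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("Expected 13 digits")
    prefix, group_digit = digits[:3], digits[3]
    # Only 978 single-digit groups are resolved; anything else maps to None
    group = SINGLE_DIGIT_GROUPS_978.get(group_digit) if prefix == "978" else None
    return {"prefix": prefix, "group": group, "check_digit": digits[12]}


print(describe_isbn13("9781609450786"))
```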

@saurabh-khanna
Member Author

@yuetongwu7 I think we can reach the book using an ISBN. For example, I tried the ISBN (9781609450786) of My Brilliant Friend and it actually leads me to the book:
https://www.goodreads.com/search?q=9781609450786

Python code (we can tweak it to get the book ID and then feed that into the maria-antoniak package, OR we can scrape the rating info directly ourselves):

from bs4 import BeautifulSoup
import requests

def get_goodreads_info(isbn):
    """Look up an ISBN via Goodreads search and return the first matching title."""
    url = f"https://www.goodreads.com/search?q={isbn}"
    headers = {
        "User-Agent": "Mozilla/5.0"
    }
    # A timeout keeps one slow request from stalling a long batch
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, "html.parser")
        # Query the selector once and reuse the result
        title_tag = soup.select_one("a.bookTitle span")
        book_title = title_tag.text if title_tag else "Not found"
        return {"ISBN": isbn, "Title": book_title}
    return None

# Example usage
isbns = ["9780143126560", "9780062316097"]
for isbn in isbns:
    book_info = get_goodreads_info(isbn)
    print(book_info)
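
For the "get the book ID" tweak, one option is to read the `href` of the `a.bookTitle` link and pull the number out of it. The sketch below assumes links have the shape `/book/show/<numeric id>` optionally followed by a slug, which matches the book URL quoted earlier in this thread; the helper name is hypothetical.

```python
import re

# Assumed link shape: /book/show/<numeric id>, optionally followed by a slug
BOOK_ID_PATTERN = re.compile(r"/book/show/(\d+)")


def extract_book_id(href):
    """Return the numeric Goodreads book ID from a link, or None if absent."""
    match = BOOK_ID_PATTERN.search(href or "")
    return int(match.group(1)) if match else None


print(extract_book_id("https://www.goodreads.com/book/show/223810007"))
print(extract_book_id("/search?q=9781609450786"))
```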

@saurabh-khanna
Member Author

saurabh-khanna commented Jan 23, 2025

Check data quality

Next steps:

  • Check the quality of the relevant variables from Google, ISBNdb, and Goodreads (summary stats, missingness using skimpy)
  • Check the Amazon books API and the Google Books API
  • Query Goodreads for popularity and quality
  • Get access to ISBNdb and get metadata for all ISBNs
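
As a plain-Python stand-in for the missingness check (the plan above is to use skimpy on the real data), the sketch below computes the share of missing values per field over a list of record dicts. The field names and sample records are made up.

```python
def missingness(records, fields):
    """Return the share of records in which each field is missing or empty."""
    total = len(records)
    result = {}
    for field in fields:
        # None, empty strings, and empty containers all count as missing
        missing = sum(1 for r in records if r.get(field) in (None, "", [], {}))
        result[field] = missing / total if total else 0.0
    return result


# Made-up records, for illustration only
records = [
    {"title": "A", "synopsis": ""},
    {"title": "B"},
    {"title": None, "synopsis": "text"},
]
print(missingness(records, ["title", "synopsis"]))
```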

Send to @saurabh-khanna

  • List of clean ISBNs
  • Code to access the ISBNdb API
  • Code to access isbnlib functions
  • Code to access Goodreads
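
For the clean ISBN list, a possible starting point is the standard ISBN-13 checksum: digits are weighted 1, 3, 1, 3, ... and the weighted sum must be divisible by 10. The sketch below is standard-library only, and both function names are hypothetical; isbnlib may well cover the same ground.

```python
def is_valid_isbn13(isbn):
    """Check length, digits, and the ISBN-13 checksum."""
    digits = isbn.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    # Weights alternate 1, 3, 1, 3, ... across all 13 digits
    total = sum((3 if i % 2 else 1) * int(d) for i, d in enumerate(digits))
    return total % 10 == 0


def clean_isbn13_list(raw_isbns):
    """Return valid, unhyphenated, de-duplicated ISBN-13s in first-seen order."""
    cleaned, seen = [], set()
    for raw in raw_isbns:
        digits = raw.replace("-", "").replace(" ", "")
        if is_valid_isbn13(digits) and digits not in seen:
            seen.add(digits)
            cleaned.append(digits)
    return cleaned


raw = ["978-0-306-40615-7", "9780306406157", "9780306406158", "9781609450786"]
print(clean_isbn13_list(raw))
```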

@saurabh-khanna
Member Author

ISBN-10 to ISBN-13 code:

def isbn10_to_isbn13(isbn10):
    """
    Convert an ISBN-10 to ISBN-13.
    
    Args:
    isbn10 (str): ISBN-10 number (with or without hyphens)
    
    Returns:
    str: Corresponding ISBN-13 number
    
    Raises:
    ValueError: If the input is not a valid ISBN-10
    """
    # Remove hyphens and spaces
    isbn10 = isbn10.replace('-', '').replace(' ', '')
    
    # Validate ISBN-10 format
    if len(isbn10) != 10 or not isbn10[:-1].isdigit() or (isbn10[-1] not in '0123456789X'):
        raise ValueError("Invalid ISBN-10 format")
    
    # Build the 12-digit ISBN-13 prefix: '978' plus the first nine ISBN-10 digits
    prefix = '978' + isbn10[:9]
    
    # Calculate check digit
    total = sum((3 if i % 2 else 1) * int(digit) for i, digit in enumerate(prefix))
    check_digit = (10 - (total % 10)) % 10
    
    # Construct ISBN-13
    isbn13 = prefix + str(check_digit)
    
    return isbn13

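# Example usage
print(isbn10_to_isbn13('0-306-40615-2'))  # Prints 9780306406157 (output is unhyphenated)
try:
    print(isbn10_to_isbn13('007-6092012X'))
except ValueError as err:
    # This input has 11 characters once the hyphen is removed, so it is rejected
    print(err)  # Prints "Invalid ISBN-10 format"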
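
The converter above checks the format of the input but not its check digit. A possible companion, sketched with a hypothetical function name, validates the ISBN-10 checksum first: the ten values are weighted 10 down to 1, with a final "X" counting as 10, and the weighted sum must be divisible by 11.

```python
def is_valid_isbn10(isbn10):
    """Check format and the ISBN-10 checksum (weights 10..1, sum mod 11 == 0)."""
    chars = isbn10.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10:
        return False
    body, last = chars[:9], chars[9]
    if not (body.isascii() and body.isdigit()) or last not in "0123456789X":
        return False
    # A trailing X stands for the value 10
    values = [int(c) for c in body] + [10 if last == "X" else int(last)]
    total = sum(w * v for w, v in zip(range(10, 0, -1), values))
    return total % 11 == 0


print(is_valid_isbn10("0-306-40615-2"))
print(is_valid_isbn10("0-8044-2957-X"))
print(is_valid_isbn10("0-306-40615-3"))
```

Running this check before the conversion would catch mistyped ISBN-10s that still happen to have the right length.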
