Datasets #2

saurabh-khanna · 2024-12-20T13:31:45Z

Let's start looking at different book datasets and, for each one, note:

Which variables exist in the data
- Does it have what we really need: Popularity (ratings, copies sold, views, publisher) and Quality (plot, synopsis, title)?
How many books are covered
Pros of using it
Cons of using it

saurabh-khanna · 2024-12-20T13:37:27Z

Possible starting point here.

One approach could be to collect all available ISBNs from Open Library (and anywhere else), and then send these ISBNs as API requests to ISBNdb. We can decide on this once we have seen which variables each dataset offers.
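The ISBN-harvesting idea above can be sketched as a small preparation step: normalize and de-duplicate the collected ISBNs, split them into batches, and build one lookup request per ISBN. This is only a sketch under assumptions. The base URL and the `Authorization` header are assumed to follow ISBNdb's v2 REST API and should be checked against the current ISBNdb documentation; `unique_isbns`, `build_book_request`, and `batches` are hypothetical helper names. Nothing here sends a request.

```python
from __future__ import annotations

from typing import Iterable, Iterator

# Assumption: base URL and auth header follow ISBNdb's v2 REST API.
# Verify both against the current ISBNdb documentation before use.
ISBNDB_BASE_URL = "https://api2.isbndb.com"


def unique_isbns(isbns: Iterable[str]) -> list[str]:
    """Strip hyphens/spaces and drop duplicates, keeping first-seen order."""
    seen: set[str] = set()
    result: list[str] = []
    for raw in isbns:
        isbn = raw.replace("-", "").replace(" ", "").upper()
        if isbn and isbn not in seen:
            seen.add(isbn)
            result.append(isbn)
    return result


def build_book_request(isbn: str, api_key: str) -> tuple[str, dict[str, str]]:
    """Return the (url, headers) pair for a single-book lookup (hypothetical helper)."""
    return f"{ISBNDB_BASE_URL}/book/{isbn}", {"Authorization": api_key}


def batches(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Batching keeps a long run restartable: each batch can be fetched, written to disk, and skipped on a rerun.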

yuetongwu7 · 2025-01-08T16:48:20Z

Book Database Size Comparison

ISBNdb: Nearly 40 million books.
Open Library: retrieved 34,640,099 ISBNs
BookCrossing: Includes 1,149,780 ratings for 271,379 books.
- Kaggle Dataset

yuetongwu7 · 2025-01-08T16:53:19Z

Comparison of Open Library and ISBNdb Metadata

Open Library	ISBNdb
Title	Title
Subtitle	Long Title
Authors	Authors
Works (Related Works)	Related Books
ISBN-10	ISBN
ISBN-13	ISBN-13
Library of Congress Control Number (LCCN)	Dewey Decimal Classification
OCLC Numbers
Local ID
Cover Image	Cover Image
Links
Languages	Language
Translated From
Translation Of
Edition Name	Edition
Number of Pages	Page Count
Pagination
Physical Dimensions	Book Dimensions (Length, Width, Height, Weight)
Physical Format	Binding Type
Copyright Date
Publish Country
Publish Date	Publication Date
Publish Places
Publishers	Publisher
Contributions
Dewey Decimal Classification	Dewey Decimal Classification
Genres	Subjects
Library of Congress Classifications (LCC)
Other Titles
Series
Source Records
Subjects	Subjects
Work Titles
Table of Contents
Description	Overview
First Sentence
Notes
Created Date
Last Modified Date
Revision History
	MSRP (Manufacturer’s Suggested Retail Price)
	Excerpt
	Synopsis
	Reviews
	Prices from different merchants (Condition, Merchant, Shipping, Price, Total Price, Purchase Link)
	Other ISBNs (with bindings)

saurabh-khanna · 2025-01-15T01:33:43Z

this seems useful while handling ISBNs: https://github.com/xlcnd/isbnlib

@yuetongwu7

yuetongwu7 · 2025-01-22T11:49:15Z

About Goodreads Data
I checked the scraper, and it requires a book ID to scrape metadata and ratings. There is no official list of IDs, but they appear to be sequential (possibly in upload order), starting from 1 and increasing with each new addition. To collect data, we could iterate through these IDs and match each record to our data by ISBN.

For example, a large ID such as https://www.goodreads.com/book/show/223810007 is a newly added book (published January 19, 2025) that does not yet have any reviews. So by going through the IDs sequentially, we could scrape all available book data and match it to the ISBNs.

Reference of Scraper Version
GrimmXoXo. (2024, May 14). Feature Additions and Improvements: Goodreads Scraper #43 [Pull request]. GitHub. maria-antoniak/goodreads-scraper#43

saurabh-khanna · 2025-01-22T12:04:35Z

We also might be able to get some useful info from the ISBN itself?
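As a rough illustration of what the ISBN itself can tell us: the first three digits of an ISBN-13 are the GS1 prefix (978 or 979), and under 978 the next digit often identifies a language or region group. The sketch below only handles the well-known single-digit groups under 978; multi-digit groups (for example those starting with 6, 8, or 9) are returned as unknown. `isbn13_hints` and `GROUPS_978` are hypothetical names, and the function does not verify the check digit.

```python
# Single-digit registration groups under the 978 prefix.
# Multi-digit groups are deliberately left out of this sketch.
GROUPS_978 = {
    "0": "English",
    "1": "English",
    "2": "French",
    "3": "German",
    "4": "Japan",
    "5": "former USSR",
    "7": "China",
}


def isbn13_hints(isbn13: str) -> dict:
    """Return the GS1 prefix and, where known, the registration group."""
    digits = isbn13.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not digits.isdigit():
        raise ValueError("expected 13 digits")
    prefix = digits[:3]
    # Only 978 is mapped here; 979 uses a different group assignment.
    group = GROUPS_978.get(digits[3]) if prefix == "978" else None
    return {"prefix": prefix, "group": group}


print(isbn13_hints("9781609450786"))
```

The publisher (registrant) element also sits inside the ISBN, but its length varies, so recovering it needs the official range tables rather than a fixed slice.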

saurabh-khanna · 2025-01-22T12:09:17Z

About Goodreads Data
I checked the scraper, and it requires a book ID to scrape metadata and ratings. There is no official list of IDs, but they appear to be sequential (possibly in upload order), starting from 1 and increasing with each new addition. To collect data, we could iterate through these IDs and match each record to our data by ISBN.

For example, a large ID such as https://www.goodreads.com/book/show/223810007 is a newly added book (published January 19, 2025) that does not yet have any reviews. So by going through the IDs sequentially, we could scrape all available book data and match it to the ISBNs.

Reference of Scraper Version
GrimmXoXo. (2024, May 14). Feature Additions and Improvements: Goodreads Scraper #43 [Pull request]. GitHub. maria-antoniak/goodreads-scraper#43

@yuetongwu7 I think we can reach the book using an ISBN. For example, I tried the ISBN (9781609450786) of My Brilliant Friend and it actually leads me to the book:
https://www.goodreads.com/search?q=9781609450786

Python code (we can tweak it to get the book ID and then feed that into the maria-antoniak scraper, or we can scrape the rating info directly ourselves):

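from bs4 import BeautifulSoup
import requests

def get_goodreads_info(isbn):
    """Search Goodreads for an ISBN and return the first matching title, if any."""
    url = "https://www.goodreads.com/search"
    headers = {"User-Agent": "Mozilla/5.0"}
    # Let requests build the query string; the timeout avoids hanging on a stalled connection
    response = requests.get(url, params={"q": isbn}, headers=headers, timeout=10)
    if response.status_code != 200:
        return None
    soup = BeautifulSoup(response.text, "html.parser")
    # Look the element up once instead of running the selector twice
    title_tag = soup.select_one("a.bookTitle span")
    book_title = title_tag.text.strip() if title_tag else "Not found"
    return {"ISBN": isbn, "Title": book_title}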

# Example usage
isbns = ["9780143126560", "9780062316097"]
for isbn in isbns:
    book_info = get_goodreads_info(isbn)
    print(book_info)
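
For the "get the book ID" tweak mentioned above, one possible sketch is to pull the numeric ID out of a Goodreads book URL with a regular expression. It assumes book links contain `/book/show/<digits>`, as in the URL cited earlier in this thread; the trailing slug and query string in the second example are an assumption about how such links may look. `goodreads_book_id` is a hypothetical helper name.

```python
import re
from typing import Optional

# Assumption: book links contain "/book/show/<numeric id>", optionally
# followed by a title slug or query string.
_BOOK_ID = re.compile(r"/book/show/(\d+)")


def goodreads_book_id(url: str) -> Optional[str]:
    """Return the numeric book ID found in a Goodreads book URL, or None."""
    match = _BOOK_ID.search(url)
    return match.group(1) if match else None


print(goodreads_book_id("https://www.goodreads.com/book/show/223810007"))
```

The same function can be applied to `response.url` after the search request, or to the `href` of a result link, to obtain an ID for the scraper.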

saurabh-khanna · 2025-01-23T15:38:02Z

Check data quality

Next steps:

Check the quality of the relevant variables from Google, ISBNdb, and Goodreads (summary stats, missingness using skimpy)
Check the Amazon books API and the Google Books API
Query Goodreads for popularity and quality
Get access to ISBNdb and fetch metadata for all ISBNs
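
For the missingness check in the first step, skimpy would summarize a DataFrame directly. As a dependency-free sketch of the same idea, the function below reports the share of records in which each field is absent, `None`, or blank. The record layout and the name `missingness` are hypothetical and only for illustration.

```python
def missingness(records, fields):
    """Share of records where each field is absent, None, or blank."""
    total = len(records)
    shares = {}
    for field in fields:
        missing = sum(
            1
            for record in records
            if record.get(field) is None or str(record.get(field)).strip() == ""
        )
        shares[field] = missing / total if total else 0.0
    return shares


# Made-up records, only to show the call shape
sample = [
    {"title": "A", "synopsis": ""},
    {"title": "B"},
    {"title": None, "synopsis": "x"},
    {"title": "D", "synopsis": "y"},
]
print(missingness(sample, ["title", "synopsis"]))
```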

Send to @saurabh-khanna

List of clean ISBNs
Code to access the ISBNdb API
Code to access isbnlib functions
Code to access Goodreads

saurabh-khanna · 2025-01-23T15:53:10Z

ISBN-10 to ISBN-13 code:

def isbn10_to_isbn13(isbn10):
    """
    Convert an ISBN-10 to ISBN-13.
    
    Args:
    isbn10 (str): ISBN-10 number (with or without hyphens)
    
    Returns:
    str: Corresponding ISBN-13 number
    
    Raises:
    ValueError: If the input is not a valid ISBN-10
    """
    # Remove hyphens and spaces
    isbn10 = isbn10.replace('-', '').replace(' ', '')
    
    # Validate ISBN-10 format
    if len(isbn10) != 10 or not isbn10[:-1].isdigit() or (isbn10[-1] not in '0123456789X'):
        raise ValueError("Invalid ISBN-10 format")
    
    # Build the 12-digit prefix: '978' plus the first nine ISBN-10 digits
    # (the ISBN-10 check digit is dropped and recomputed below)
    prefix = '978' + isbn10[:9]
    
    # Calculate check digit
    total = sum((3 if i % 2 else 1) * int(digit) for i, digit in enumerate(prefix))
    check_digit = (10 - (total % 10)) % 10
    
    # Construct ISBN-13
    isbn13 = prefix + str(check_digit)
    
    return isbn13

# Example usage (the result is returned without hyphens)
print(isbn10_to_isbn13('0-306-40615-2'))  # Will print 9780306406157
print(isbn10_to_isbn13('0-8044-2957-X'))  # Will print 9780804429573
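
The converter above only checks the format of its input, not the check digit. A possible companion for building the list of clean ISBNs is a pair of checksum validators, sketched here with the standard rules: ISBN-10 uses weights 10 down to 1 modulo 11 (with `X` standing for 10), and ISBN-13 uses alternating weights 1 and 3 modulo 10. The function names are hypothetical.

```python
def is_valid_isbn10(isbn: str) -> bool:
    """Check length, characters, and the mod-11 check digit of an ISBN-10."""
    s = isbn.replace("-", "").replace(" ", "").upper()
    if len(s) != 10 or not s[:9].isdigit() or s[9] not in "0123456789X":
        return False
    values = [int(c) for c in s[:9]] + [10 if s[9] == "X" else int(s[9])]
    # Weights run from 10 (first digit) down to 1 (check digit)
    return sum(w * v for w, v in zip(range(10, 0, -1), values)) % 11 == 0


def is_valid_isbn13(isbn: str) -> bool:
    """Check length, characters, and the mod-10 check digit of an ISBN-13."""
    s = isbn.replace("-", "").replace(" ", "")
    if len(s) != 13 or not s.isdigit():
        return False
    # Weights alternate 1, 3, 1, 3, ... across all 13 digits
    return sum((3 if i % 2 else 1) * int(c) for i, c in enumerate(s)) % 10 == 0


print(is_valid_isbn10("0-306-40615-2"), is_valid_isbn13("9780306406157"))
```

Running `is_valid_isbn10` before the conversion would catch mistyped ISBN-10s that still have the right shape.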

saurabh-khanna assigned yuetongwu7 Dec 20, 2024

saurabh-khanna added this to Team planning Dec 22, 2024

saurabh-khanna moved this to In Progress in Team planning Dec 22, 2024
