- File: "complete_w_ratings.csv"
- 13639 books (rows)
- Fields: "book_id", "book_title", "author", "publication_date", "genre", "summary", "ISBN13, "Book-Rating", "Rating-Count"
- This file is the result of running the script preprocessed_isbns.py
- Note: complete_w_ratings.csv is a subset of the "complete_w_embeddings" dataset, but for perfromance reasons we use this file, rather than the more resource intensive "complete_w_embeddings" file when embeddings are not required.
- Folder: "complete_w_embeddings"
- Contains the data from "complete_w_ratings.csv" augmented by semantic embeddings as described in "Embeddings.py"
- Due to space limitations, data is chunked into four files and dynamically reassembled when needed: ./complete_w_embeddings.csv_part_1.csv ./complete_w_embeddings.csv_part_2.csv ./complete_w_embeddings.csv_part_3.csv ./complete_w_embeddings.csv_part_4.csv
- File "distances_updated.npy"
- For each book, semantic distances to the next closest 21 books, based on semantic distances computed via Voyeate API
- Output of script ["Semantic Scores.py"](../../scripts/Semantic Scores.py)
- File "indices_updated.npy"
- For each book, indices of the top 21 closest books, based on semantic distances computed via Voyeage API
- Output of script ["Semantic Scores.py"](../../scripts/Semantic Scores.py)
- File "genre.csv"
- 26347 rows
- A single book can appear in multiple rows if multiple generic_genre classificatinos.
- Fields: "book_id", "book_title", "author", "publication_date", "genre", "summary", "ISBN13, "Book-Rating", "Rating-Count", "generic_genre"
- 26347 rows
Test analogues to above files. Generally same/similar datasets as described above on a samller scale, for purposes of testing.