diff --git a/.env.example b/.env.example index 759b915..7964e08 100644 --- a/.env.example +++ b/.env.example @@ -3,6 +3,7 @@ MIN_CONFIDENCE="0.5" CODA_TOKEN="" YOUTUBE_API_KEY="" +AIRTABLE_API_KEY="" ARD_DB_USER="user" ARD_DB_PASSWORD="we all live in a yellow submarine" diff --git a/README.md b/README.md index 9a55754..770a75e 100644 --- a/README.md +++ b/README.md @@ -64,146 +64,206 @@ Additional keys may be available depending on the source document. ## Development Environment -To set up the development environment, run the following steps: +### 1. Clone the repository: -```bash +```sh git clone https://github.com/StampyAI/alignment-research-dataset cd alignment-research-dataset -pip install -r requirements.txt ``` -### Database - -You'll also have to set up a MySQL database. To do so with Docker, you can run `./local_db.sh` which should spin up a container -with the database initialised. - -### CLI options - -The available CLI options are list, fetch, fetch-all, and count-tokens. +### 2. Set up Environment Variables: -To get a list of all available datasets: +Duplicate the provided `.env.example` to create your environment configuration: ```sh -python main.py list +cp .env.example .env ``` -To fetch a specific dataset, replace [DATASET_NAME] with the name of the dataset you want to fetch. The optional `--rebuild` parameter allows you to remove the previous build before running, scraping everything from scratch. Otherwise, only the new files will be scraped. +This `.env` file contains placeholders for several configuration options. Further details about how to configure them are in the [Configuration section](#configuration). + +### 3. Install Dependencies: ```sh -python main.py fetch [DATASET_NAME] --rebuild +pip install -r requirements.txt ``` -The command to fetch all datasets is below. Again, the optional `--rebuild` parameter allows you to scrape everything from scratch. 
+**Optional:** For testing purposes, you can also install testing dependencies:

```sh
-python main.py fetch-all --rebuild
+pip install -r requirements-test.txt
```

-To get a summary of the merged dataset, Replace [MERGED_DATASET_PATH] with the path to the merged dataset file.
+### 4. Database Setup:
+
+Set up a MySQL database. To spin up a [Docker](https://docs.docker.com/get-docker/) container with the database initialised, run the following:

```sh
-python main.py count-tokens [MERGED_DATASET_PATH]
+./local_db.sh
```

-## New Datasets
+## Configuration

-Adding a new dataset consists of:
+Various subcomponents in this project rely on external services, so they need credentials set. This is done via environment variables. The file `.env` is the central location for these settings.

-1. Subclassing `AlignmentDataset` to implement any additional functionality needed
-2. Creating an instance of your class somewhere
-3. Adding the instance to `DATASET_REGISTRY` so it can be found
+### Logging

-### AlignmentDataset class
+The log level can be configured with the `LOG_LEVEL` environment variable. The default level is 'WARNING'.

-This is the main workhorse for processing datasets. The basic idea is that it provided a list of items to be processed, and after processing a given item, appends it to the appropriate jsonl file, where each line of the file is a JSON object with all the data. The `AlignmentDataset` class has various methods that can be implemented to handle various cases. A few assumptions are made as to the data it will use, i.e.:
+### Coda

-* `self.data_path` is where data will be written to and read from - by default it's the `data/` directory
-* `self.raw_data_path` is where downloaded files etc. should go - by default it's the `data/raw` directory
-* `self.files_path` is where data to be processed is expected to be. This is used e.g. when a collection of html files are to be processed
-* `self.jsonl_path` is the path to the output JSONL file, by default `data/.jsonl`
-* `self.txt_path` is the path to the debug file
+To update the stampy portion of the dataset, you will need a Coda token. Follow these instructions:
+ 1. Go to [coda.io](https://coda.io/)
+ 2. Create an account and log in
+ 3. Go to the API SETTINGS section of your [account settings](https://coda.io/account), and select `Generate API token`. Give your API token a name, and add the following restrictions:
+    1. Type of restriction: Doc or table
+    2. Type of access: Read only
+    3. Doc or table to grant access to: https://coda.io/d/_dfau7sl2hmG
+ 4. Copy this token to your `.env` file: `CODA_TOKEN=""`
+It will then be accessible in `align_data/stampy/stampy.py`.

-The `AlignmentDataset` is a dataclass, so it has a couple of settings that control it:
+### MySQL

-* `name` - this is a string that identifies the dataset, i.e. 'lesswrong'
-* `done_key` - used to check if a given item has already been processed. This is a key in the JSON object that gets written to the output file - any subsequent entries with the same value for that key will be skipped
-* `glob` - a glob used to select files from the `self.files_path` - this controls what files are processed
-* `COOLDOWN` - an optional value of the amount of seconds to wait between processing items - this is useful e.g. when fetching items from an API in order to avoid triggering rate limits
+The datasets are stored in MySQL. The connection string can be configured via the `ARD_DB_USER`,
+`ARD_DB_PASSWORD`, `ARD_DB_HOST`, `ARD_DB_PORT` and `ARD_DB_NAME` environment variables in `.env`. A local
+database can be started in Docker by running
+```sh
+./local_db.sh
+```

-The basic processing flow is:
+### Pinecone
`self._load_outputted_items()` - go through `self.jsonl_path` and construct a set of the `self.done_key` values of each item - this is used to skip items that have already been processed
-3. `self.items_list` - returns a list of items to be processed - the default is to use `self.glob` on `self.files_path`
-4. `self.fetch_entries()` - for each of the resulting items:
+For Pinecone updates to work, you'll need to configure the API key:

-* extract its key, using `self.get_item_key(item)`
-* check if its key has already been processed - if so, skip it
-* run `self.process_entry(item)` to get a data entry, which is then yielded
-* the data entry is written to `self.jsonl_path`
+1. Get an API key, as described [here](https://docs.pinecone.io/docs/quickstart#2-get-and-verify-your-pinecone-api-key)
+2. Create a Pinecone index named "stampy-chat-ard" (or whatever is set as `PINECONE_INDEX_NAME`) with the `dotproduct` metric and `1536` dimensions
+3. Set the `PINECONE_API_KEY` to the key from step 1
+4. Set the `PINECONE_ENVIRONMENT` to whatever is the environment of your index

-### Adding a new instance
-There are Datasets defined for various types of data sources - first check if any of them match your use case. If so, it's just a matter of adding a new entry to the `__init__.py` module of the appropriate data source. If not, you'll have to add your own one - use the prexisting ones as examples. Either way, you should end up with an instance of an `AlignmentDataset` subclass added to one of the registries. If you add a new registry, make sure to add it to `align_data.DATASET_REGISTRY`.
+### Google API
+
+To autopopulate the metadata files, you'll need Google Cloud credentials. This is a Google system, so it is of course complicated and prone to arbitrary changes, but as of writing the process is:

-## Running the code

+1. Go to the [Google Cloud Console](https://console.cloud.google.com/)
+2. Create a new project or select an existing project.
+3. Google Sheets etc. will have to be enabled:
+   * Enable the Google Sheets API for your project at https://console.cloud.google.com/apis/api/sheets.googleapis.com/metrics?project=
+   * Enable the Google Drive API for your project at https://console.cloud.google.com/apis/api/drive.googleapis.com/metrics?project=
+   * Enable the YouTube API for your project at https://console.cloud.google.com/apis/library/youtube.googleapis.com?project=. Note that the YouTube API is quite limited in the number of requests it can perform.
+   An alternative to this step is that when running the program without these enabled, an exception will be raised telling you how to enable it - you can then just open the link in the exception message
+4. Navigate to the "Credentials" section, and to `+ Create Credentials`.
+5. Select "Service Account"
+6. Fill in the required information for the service account:
+   1. A descriptive name, a short service account ID, and description. Press `Create and Continue`
+   2. Leave the optional sections empty
+7. At https://console.cloud.google.com/apis/credentials?project=, select your new Service Account, and go to the KEYS section. Select ADD KEY, "Create New Key", the JSON key type and click "Create".
+8. The JSON file containing your credentials will be downloaded. Save it as `credentials.json` in the top-level directory of the project.
+9. Again in the "Credentials" section, `+ Create Credentials`, select "API key", and add the created API key as your `YOUTUBE_API_KEY`.

-When wishing to update the whole dataset, run `python main.py fetch_all`. You can also fetch a specific subsection of a dataset by its name, for example `python main.py fetch aisafety.info`.
+Once you have working credentials, you will be able to fetch data from public sheets and gdrive. For writing to sheets and drives, or accessing private ones within the code, you will need to request permissions from the owner of the particular sheet/gdrive.
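As a sketch of how the downloaded key could then be used in code (assumptions: the `google-auth` package is the client library in play, and `load_credentials` is a hypothetical helper for illustration, not a function from this repo; only the `credentials.json` file name comes from step 8 above):

```python
from pathlib import Path

# Read-only scopes matching the Sheets/Drive APIs enabled in step 3 above.
SCOPES = [
    "https://www.googleapis.com/auth/spreadsheets.readonly",
    "https://www.googleapis.com/auth/drive.readonly",
]

def load_credentials(path: str = "credentials.json"):
    """Return service-account credentials, or None if the key file is absent."""
    if not Path(path).exists():
        return None
    # Imported lazily so the existence check works even before the
    # google-auth dependency is installed.
    from google.oauth2.service_account import Credentials
    return Credentials.from_service_account_file(path, scopes=SCOPES)
```

Returning `None` when the key file is missing lets a caller fail with a clear message instead of a raw traceback when step 8 hasn't been completed yet.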
-## Configuration
+#### Metadata updates

-Various subcomponents use various external services, so need credentials set. This is done via environment variables, the easiest way of setting which is by copying `~/.env.example` to `~/.env` and changing the appropriate values.
+There are a couple of datasources that consist of singular articles (html, pdfs, ebooks, etc), rather than all the contents of a given website. These are managed in [Google sheets](https://docs.google.com/spreadsheets/d/1l3azVJVukGAvZPgg0GyeqiaQe8bEMZvycBJaA8cRXf4/edit#gid=0). It's assumed that the contents of that document are clean, in that all required fields are set, and that there is a `source_url` pointing to a valid document. Rather than having to manually fill these fields, there is a magical script that automatically populates them from a messy [input worksheet](https://docs.google.com/spreadsheets/d/1pgG3HzercOhf4gniaqp3tBc3uvZnHpPhXErwHcthmbI/edit?pli=1#gid=980957638), which contains all kinds of info.

-### Logging
+### OpenAI API

-The log level can be configured with the `LOG_LEVEL` environment variable. The default level is 'WARNING'.
+1. Go to [the OpenAI API website](https://platform.openai.com/). Create an account and add payment information, if needed.
+2. In https://platform.openai.com/account/api-keys, create a new secret key or use an existing one.
+3. Add this secret key to the `.env`, as `OPENAI_API_KEY`.

-### Coda
+### Airtable API

-To update the stampy portion of the dataset, you will need a Coda token. go to coda.io, log in, and generate an API token in your account settings. Add restrictions: Doc or table, Read only, for the doc with url https://coda.io/d/_dfau7sl2hmG. Then, create a .env file at the root of the alignment research dataset, and write CODA_TOKEN="". It will be accessible in align_data/stampy/stampy.py
+The Airtable we currently scrape is https://airtable.com/appbiNKDcn1sGPGOG/shro9Bx4f2i6QgtTM/tblSicSC1u6Ifddrq.
#TODO: document how this is done / reproducible

-### MySQL
+## CLI Usage

-The datasets are stored in MySQL. The connection string can be configured via the `ARD_DB_USER`,
-`ARD_DB_PASSWORD`, `ARD_DB_HOST`, `ARD_DB_PORT` and `ARD_DB_NAME` environment variables. A local
-database can be started in Docker by running
+There are various commands available to interact with the datasets:
+- **Start the MySQL database in a separate terminal before running most commands:**
+  ```sh
+  ./local_db.sh
+  ```
+
+- **Listing all datasets:**
+  ```sh
+  python main.py list
+  ```
+
+- **Fetching a specific dataset:**
+  Replace `[DATASET_NAME]` with the desired dataset. The optional `--rebuild` parameter allows you to remove the previous build before running, scraping everything from scratch. Otherwise, only the new files will be scraped.
+
+  ```sh
+  python main.py fetch [DATASET_NAME] --rebuild
+  ```
+
+- **Fetching all datasets:**
+  Again, the optional `--rebuild` parameter allows you to scrape everything from scratch.
+  ```sh
+  python main.py fetch-all --rebuild
+  ```
+
+- **Getting a summary of a merged dataset:**
+  Replace `[MERGED_DATASET_PATH]` with your dataset's path. You'll get access to the dataset's total token count, word count and character count.
+  ```sh
+  python main.py count-tokens [MERGED_DATASET_PATH]
+  ```
+
+- **Updating the metadata in the metadata spreadsheet:**
+  You can give the command optional information about the names and ids of the sheets; by default it uses the values defined in `align_data/settings.py`:
+  ```sh
+  python main.py update_metadata
+  python main.py update_metadata 1pgG3HzercOhf4gniaqp3tBc3uvZnHpPhXErwHcthmbI special_docs.csv 1l3azVJVukGAvZPgg0GyeqiaQe8bEMZvycBJaA8cRXf4
+  ```

-### Pinecone

+- **Updating the pinecone index with newly modified entries:**
+  Replace `[DATASET_NAME]` with one or many dataset names whose entries you want to embed and add to the pinecone index.
+  `--force_update` is an optional parameter for updating all the dataset's articles, rather than newly fetched ones.
```sh
  python main.py pinecone_update [DATASET_NAME] --force_update
  ```
  Or run it on all articles, as seen below; using `--force_update` is not recommended in this case.
  ```sh
  python main.py pinecone_update_all
  ```

+## Adding New Datasets

-For Pinecone updates to work, you'll need to configure the API key:
+Adding a new dataset consists of:

-1. Get an API key, as described [here](https://docs.pinecone.io/docs/quickstart#2-get-and-verify-your-pinecone-api-key)
-2. Create a Pinecone index named "stampy-chat-ard" (or whatever is set as `PINECONE_INDEX_NAME`) with the `dotproduct` metric and 1536 dimensions
-3. Set the `PINECONE_API_KEY` to the key from step 1
-4. Set the `PINECONE_ENVIRONMENT` to whatever is the environment of your index
+1. Subclassing `AlignmentDataset` to implement any additional functionality needed, within `align_data/sources/`
+2. Creating an instance of your class somewhere, such as an `__init__.py` file (you can take inspiration from other such files)
+3. Adding the instance to `DATASET_REGISTRY` so it can be found

-### Metadata updates
+### AlignmentDataset class

-There are a couple of datasources that consist of singular articles (html, pdfs, ebooks, etc), rather than all the contents of a given website. These are managed in [Google sheets](https://docs.google.com/spreadsheets/d/1l3azVJVukGAvZPgg0GyeqiaQe8bEMZvycBJaA8cRXf4/edit#gid=0). It's assumed that the contents of that document are clean, in that all required fields are set, and that there is a `source_url` pointing to a valid document. Rather than having to manually fill these fields, there is a magical script that automatically populates them from a messy [input worksheet](https://docs.google.com/spreadsheets/d/1pgG3HzercOhf4gniaqp3tBc3uvZnHpPhXErwHcthmbI/edit?pli=1#gid=980957638), which contains all kinds of info. The following will execute this script:
- python main.py update_metadata
+This is the main workhorse for processing datasets. The basic idea is that it provides a list of items to be processed, and after processing a given item, creates an article object, which is added to the MySQL database. The `AlignmentDataset` class has various methods that can be implemented to handle various cases. A few assumptions are made as to the data it will use, i.e.:

-which for the current documents would be:

+* `self.data_path` is where data will be written to and read from - by default it's the `data/` directory
+* `self.raw_data_path` is where downloaded files etc. should go - by default it's the `data/raw` directory
+* `self.files_path` is where data to be processed is expected to be. This is used e.g. when a collection of html files are to be processed

- python main.py update_metadata 1pgG3HzercOhf4gniaqp3tBc3uvZnHpPhXErwHcthmbI special_docs.csv 1l3azVJVukGAvZPgg0GyeqiaQe8bEMZvycBJaA8cRXf4

+The `AlignmentDataset` is a dataclass, so it has a couple of settings that control it:

-#### Google API

+* `name` - this is a string that identifies the dataset, e.g. 'lesswrong'
+* `done_key` - used to check if a given item has already been processed.
+* `COOLDOWN` - an optional number of seconds to wait between processing items - this is useful e.g. when fetching items from an API in order to avoid triggering rate limits

-To autopopulate the metadata files, you'll need Google Cloud credentials. This is a google system, so of course is complicated and prone to arbitrary changes, but as of writing this the process is:

+The basic processing flow is:

-1. Go to the [Google Cloud Console](https://console.cloud.google.com/)
-2. Create a new project or select an existing project (it doesn't matter either way)
-3.
Google sheets etc will have to be enabled - * Enable the Google Sheets API for your project at https://console.cloud.google.com/apis/api/sheets.googleapis.com/metrics?project= - * Enable the Google Drive API for your project at https://console.cloud.google.com/apis/api/drive.googleapis.com/metrics?project= - An alternative to this step is that when running the program without these enabled, an exception will be raised telling you how to enable it - you can then just open the link in the exception message -4. Navigate to the "Credentials" section -5. Click on "Create Credentials" and select "Service Account" -6. Fill in the required information for the service account -7. On the "Create key" page, select the JSON key type and click "Create" -8. The JSON file containing your credentials will be downloaded -> save as credentials.json in the folder from which you're running the code +1. `self.setup()` - any instance level initialization stuff should go here, e.g. fetching zip files with data +2. `self._load_outputted_items()` - goes through articles in the database, loads the value of their `self.done_key`, and outputs a simplified version of these strings using `normalize_url` +3. `self.items_list` - returns a list of items to be processed. +4. `self.fetch_entries()` - for each of the resulting items: + +* extract its key, using `self.get_item_key(item)` +* check if its key has already been processed - if so, skip it +* run `self.process_entry(item)` to get an article, which is then yielded +* the article is added to the database if it satisfies some conditions, like being a modification of the previous instance of that article, having the minimal required keys, etc. + +### Adding a new instance + +There are Datasets defined for various types of data sources - first check if any of them match your use case. If so, it's just a matter of adding a new entry to the `__init__.py` module of the appropriate data source. 
If not, you'll have to add your own - use the preexisting ones as examples. Either way, you should end up with an instance of an `AlignmentDataset` subclass added to one of the registries. If you add a new registry, make sure to add it to `align_data.DATASET_REGISTRY`.

## Contributing