add video to readme

chiral-carbon · chiral-carbon · commit add96bc7dd63 · 2024-09-17T19:59:55.000-04:00
diff --git a/.gitattributes b/.gitattributes
@@ -1 +1,2 @@
 *.large_file_extension filter=lfs diff=lfs merge=lfs -text
+*.mp4 filter=lfs diff=lfs merge=lfs -text
diff --git a/README.md b/README.md
@@ -1,12 +1,20 @@
 # Mapping the Data Landscape For Generalizable Scientific Models
 
-This is a WIP that builds a knowledge base to store structured information extracted from scientific publications, datasets and articles using LLMs. 
+We introduce a method to build a knowledge base to store structured information extracted from scientific publications, datasets and articles by leveraging large language models!
 
-We want to cover all of "science", and perform semantic search and interact with the tool. 
+We want to cover all of "science", and perform semantic search over scientific literature for highly specific knowledge discovery. 
 
-This tool helps us identify the gaps where current foundation models lack coverage and where they can generalize well. It also helps us discover overlaps of methods used across different fields, that can help facilitate in building more unified foundation models for science. 
+This tool helps us find aggregate information and statistics about the current state of scientific research, identify the gaps where current foundation models lack coverage and where they can generalize well, and discover overlaps of methods used across different fields, which can facilitate building more unified foundation models for science. 
 
-We use the Llama-3-70B-Instruct model for structured information extraction. 
+### Example Preview: Concept Co-occurrence Connectivity Graph for Astrophysics!
+
+
+
+https://github.com/user-attachments/assets/d0c2c4ac-924d-4ba8-80d5-5c665d910652
+
+
+
+We use the Llama-3-70B-Instruct model with 2 A100 80GB GPUs for structured information extraction. 
 
 ## Workflow
 
@@ -54,7 +62,7 @@ pre-commit install
 
 ## Running the tool
 
-### 1. Download raw data from arXiv
+### On new data: Download raw data from arXiv
 
 Run `scripts/collect_data.py` to download papers from arXiv:
 ```
@@ -66,15 +74,15 @@ These are the default arguments, you can modify them to specify the arxiv channe
 The data is stored in the `data/raw/<out_dir>` directory.
 The `out_dir` is a required argument that creates a new directory in `data/raw` and stores the scraped data in a jsonl file inside `data/raw/<out_dir>`. Refer to the [raw data README](data/raw/README.md) to see how the files are named. 
 
-### 2. Schema and Annotations
+### Schema and Annotations
 
 A schema was prepared by the authors of this project who were also the annotators for a small subset of the downloaded papers. 
 This schema defined tags to extract concepts from the downloaded scientific papers. They were used as reference by the annotators when manually creating a small subset of annotated papers. They were also passed as instructions to the language model to tag the papers. A set of constituency tests was also defined to resolve ambiguity and guide the annotation process. 
 
 The schema, tests and manual_annotations are stored in the `data/manual` directory. Refer to the [README on manual work](data/manual/README.md) done for the extraction process.
 
 
-### 3. Run the model on downloaded arXiv raw data
+### Run the model on downloaded arXiv raw data
 
 Run `main.py` to call Llama-3 70B Instruct and perform extractions on the downloaded papers from any of the `data/raw` folders using Slurm jobs:
 ```
@@ -100,7 +108,7 @@ Options:
   --help                   Show this message and exit.
 ```
 
-We use 2 A100 80GB GPUs to perform extractions with Llama-3 70B. You can choose a different model if limited by memory and GPU.  Since we ust the Huggingface transformers API and the model hosted on Huggingface, any new model you want to load should be hosted there as well. 
+If you are limited by memory or GPU resources and cannot use the Llama-3-70B-Instruct model, you can choose a different model. Since we use the Hugging Face transformers library and a model hosted on Hugging Face, any new model you want to load should be hosted there as well. 
 
 **Note:** In the eval mode when running on the dev set, the model was run for different sweeps for prompt optimization. The sweep details are stored in `sweep_config.json`.
 
@@ -123,7 +131,7 @@ The current best performance on the dev set:
 
 The processed data gets stored in `data/raw/results` under new directories named with arguments passed to `main.py`. Refer to the [results README](data/results/README.md) for inspecting the files that each directory stores and the naming convention.
 
-### 4. To create a SQLite3 database of the predictions, run:
+### Create a SQLite3 database of the predictions
 ```
 python scripts/create_db.py --data_path <path to the jsonl file with data> --pred_path <path to the predictions.json file>
 ```
@@ -141,11 +149,16 @@ Options:
   ```
 
 All current databases are in the ```data/databases``` directory which can be downloaded and loaded with ```sqlite3``` to run queries on your own terminal. Refer to the [databases README](data/databases/README.md) for information on the tables that constitute each of the databases.
+
+## Run an existing DB
+
+To run an existing database in the `databases` directory:
+
 ```
 sqlite3 databases/<table_name>
 ```
 
-### 5. To launch a Gradio interface for SQL query search over the created databases, run:
+## Launch a Gradio interface for SQL query search over the created databases
 ```
 gradio scripts/run_db_interface.py
 ```
@@ -162,4 +175,4 @@ The interface shows all the created databases in the `data/databases` directory
 ### Research Papers
 - Knowledge Graph in Astronomical Research with Large Language Models: Quantifying Driving Forces in Interdisciplinary Scientific Discovery | [arXiv](https://arxiv.org/pdf/2406.01391)
 - Graph of Thoughts: Solving Elaborate Problems with Large Language Models | [arXiv](https://arxiv.org/pdf/2308.09687)
-- Bring Your Own KG: Self-Supervised Program Synthesis for Zero-Shot KGQA | [arXiv](https://arxiv.org/pdf/2311.07850)
+- Bring Your Own KG: Self-Supervised Program Synthesis for Zero-Shot KGQA | [arXiv](https://arxiv.org/pdf/2311.07850)
diff --git a/misc/graph-prev.mp4 b/misc/graph-prev.mp4
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:263c60b4a26b6d32e96ccdacd92a4edf9787581f5dee9b9eb182d8b1502044c0
+size 7540171
diff --git a/misc/graph.mp4 b/misc/graph.mp4
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:28392771ef7d72e204adf2273cadc2728a6f89f9e54ab71c162bf387d596faa4
+size 21734557
diff --git a/misc/kg4s-demo.mp4 b/misc/kg4s-demo.mp4
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f3faa731c6f5cb5bc694a56c80690acc0486356fd51bbe5163b4feff50d316f7
+size 90726535
diff --git a/misc/kg4s-graph.mp4 b/misc/kg4s-graph.mp4
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ef8d4980ac07ef08ea940fbc6ddb5c0fed460e385074be6eae1f069a9c7c4844
+size 27332577
diff --git a/misc/kg4s-preview.mp4 b/misc/kg4s-preview.mp4
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:48078c4726cbea8dae008016bcafff51c40c16cd4d7914f8e01a16bac4102901
+size 8467193
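
The five `.mp4` entries above are Git LFS pointer files rather than the videos themselves: each records a spec version, a SHA-256 object ID, and the size of the real file in bytes. As a minimal sketch of how such a pointer can be read (the `LfsPointer` class and `parse_lfs_pointer` function are hypothetical names used only for illustration and are not part of this repository), the three fields can be parsed like this:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LfsPointer:
    """Fields of a Git LFS pointer file (hypothetical helper type)."""
    version: str  # URL of the pointer spec version
    oid: str      # hex digest, without the "sha256:" prefix
    size: int     # size of the real file in bytes


def parse_lfs_pointer(text: str) -> LfsPointer:
    """Parse the "key value" lines of a Git LFS pointer file."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each line is a key, one space, then the value.
        key, _, value = line.partition(" ")
        fields[key] = value.strip()
    algo, _, digest = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unexpected hash algorithm: {algo!r}")
    return LfsPointer(fields["version"], digest, int(fields["size"]))


# Pointer content for misc/graph-prev.mp4, copied from the diff above.
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:263c60b4a26b6d32e96ccdacd92a4edf9787581f5dee9b9eb182d8b1502044c0\n"
    "size 7540171\n"
)
print(pointer.size)
```

Because the `.gitattributes` change routes every `*.mp4` through the `lfs` filter, a clone without Git LFS installed would likely see these small text pointers in place of the video files.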
