Commit
Document Processor v2 (#442)
* wip: init refactor of document processor to JS

* add NodeJs PDF support

* wip: parity with python processor
feat: add pptx support

* fix: forgot files

* Remove python scripts totally

* wip:update docker to boot new collector

* add package.json support

* update dockerfile for new build

* update gitignore and linting

* add more protections on file lookup

* update package.json

* test build

* update docker commands to use cap-add=SYS_ADMIN so web scraper can run
update all scripts to reflect this
remove docker build for branch
timothycarambat authored Dec 14, 2023
1 parent 5f6a013 commit 719521c
Showing 69 changed files with 3,682 additions and 1,925 deletions.
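The operational takeaway from this commit is that every docker launch command now adds the SYS_ADMIN capability so the built-in web scraper can run. As a rough reference, the updated command looks like the sketch below — the volume paths and image tag are taken from the deployment scripts changed further down and may differ for your setup:

```
# Sketch only: the launch command as updated across this commit's deployment scripts.
# Volume paths and the image tag mirror those scripts; adjust them for your environment.
docker run -d -p 3001:3001 \
  --cap-add SYS_ADMIN \
  -v /home/anythingllm:/app/server/storage \
  -v /home/anythingllm/.env:/app/server/.env \
  -e STORAGE_DIR="/app/server/storage" \
  mintplexlabs/anythingllm:master
```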
12 changes: 3 additions & 9 deletions README.md
@@ -74,10 +74,10 @@ Some cool features of AnythingLLM

### Technical Overview
This monorepo consists of three main sections:
- `collector`: Python tools that enable you to quickly convert online resources or local documents into LLM useable format.
- `frontend`: A viteJS + React frontend that you can run to easily create and manage all your content the LLM can use.
- `server`: A nodeJS + express server to handle all the interactions and do all the vectorDB management and LLM interactions.
- `server`: A NodeJS express server to handle all the interactions and do all the vectorDB management and LLM interactions.
- `docker`: Docker instructions and build process + information for building from source.
- `collector`: A NodeJS express server that processes and parses documents from the UI.

### Minimum Requirements
> [!TIP]
@@ -86,7 +86,6 @@ This monorepo consists of three main sections:
> you will be storing (documents, vectors, models, etc). Minimum 10GB recommended.
- `yarn` and `node` on your machine
- `python` 3.9+ for running scripts in `collector/`.
- access to an LLM running locally or remotely.

*AnythingLLM by default uses a built-in vector database powered by [LanceDB](https://github.com/lancedb/lancedb)
Expand All @@ -112,6 +111,7 @@ export STORAGE_LOCATION="/var/lib/anythingllm" && \
mkdir -p $STORAGE_LOCATION && \
touch "$STORAGE_LOCATION/.env" && \
docker run -d -p 3001:3001 \
--cap-add SYS_ADMIN \
-v ${STORAGE_LOCATION}:/app/server/storage \
-v ${STORAGE_LOCATION}/.env:/app/server/.env \
-e STORAGE_DIR="/app/server/storage" \
@@ -141,12 +141,6 @@ To boot the frontend locally (run commands from root of repo):

[Learn about vector caching](./server/storage/vector-cache/VECTOR_CACHE.md)

## Standalone scripts

This repo contains standalone scripts you can run to collect data from a Youtube Channel, Medium articles, local text files, word documents, and the list goes on. This is where you will use the `collector/` part of the repo.

[Go set up and run collector scripts](./collector/README.md)

## Contributing
- create issue
- create PR with branch name format of `<issue number>-<short name>`
5 changes: 2 additions & 3 deletions cloud-deployments/aws/cloudformation/DEPLOY.md
@@ -1,6 +1,6 @@
# How to deploy a private AnythingLLM instance on AWS

With an AWS account you can easily deploy a private AnythingLLM instance on AWS. This will create a url that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys and they will not be exposed - however if you want your instance to be protected it is highly recommend that you set the `AUTH_TOKEN` and `JWT_SECRET` variables in the `docker/` ENV.
With an AWS account you can easily deploy a private AnythingLLM instance on AWS. This will create a URL that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys and they will not be exposed - however, if you want your instance to be protected, it is highly recommended that you set a password once setup is complete.

**Quick Launch (EASY)**
1. Log in to your AWS account
@@ -30,12 +30,11 @@ The output of this cloudformation stack will be:

**Requirements**
- An AWS account with billing information.
- AnythingLLM (GUI + document processor) must use a t2.small minimum and 10Gib SSD hard disk volume

## Please read this notice before submitting issues about your deployment

**Note:**
Your instance will not be available instantly. Depending on the instance size you launched with it can take varying amounts of time to fully boot up.
Your instance will not be available instantly. Depending on the instance size you launched with, it can take 5-10 minutes to fully boot up.

If you want to check the instance's progress, navigate to [your deployed EC2 instances](https://us-west-1.console.aws.amazon.com/ec2/home) and connect to your instance via SSH in browser.
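Once connected over SSH, a quick way to watch first-boot progress is to tail the cloud-init log and then hit the health endpoint — the same checks the user-data script below and the GCP guide rely on. A sketch, assuming the default log location on the instance:

```
# Follow the first-boot log until the Docker image finishes deploying
sudo tail -f /var/log/cloud-init-output.log

# Then confirm the container answers on the published port
curl -Is http://localhost:3001/api/ping | head -n 1
```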

@@ -89,7 +89,7 @@
"touch /home/ec2-user/anythingllm/.env\n",
"sudo chown ec2-user:ec2-user -R /home/ec2-user/anythingllm\n",
"docker pull mintplexlabs/anythingllm:master\n",
"docker run -d -p 3001:3001 -v /home/ec2-user/anythingllm:/app/server/storage -v /home/ec2-user/anythingllm/.env:/app/server/.env -e STORAGE_DIR=\"/app/server/storage\" mintplexlabs/anythingllm:master\n",
"docker run -d -p 3001:3001 --cap-add SYS_ADMIN -v /home/ec2-user/anythingllm:/app/server/storage -v /home/ec2-user/anythingllm/.env:/app/server/.env -e STORAGE_DIR=\"/app/server/storage\" mintplexlabs/anythingllm:master\n",
"echo \"Container ID: $(sudo docker ps --latest --quiet)\"\n",
"export ONLINE=$(curl -Is http://localhost:3001/api/ping | head -n 1|cut -d$' ' -f2)\n",
"echo \"Health check: $ONLINE\"\n",
8 changes: 2 additions & 6 deletions cloud-deployments/digitalocean/terraform/DEPLOY.md
@@ -1,8 +1,6 @@
# How to deploy a private AnythingLLM instance on DigitalOcean using Terraform

With a DigitalOcean account, you can easily deploy a private AnythingLLM instance using Terraform. This will create a URL that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys, and they will not be exposed. However, if you want your instance to be protected, it is highly recommended that you set the `AUTH_TOKEN` and `JWT_SECRET` variables in the `docker/` ENV.

[Refer to .env.example](../../../docker/HOW_TO_USE_DOCKER.md) for data format.
With a DigitalOcean account, you can easily deploy a private AnythingLLM instance using Terraform. This will create a URL that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys, and they will not be exposed. However, if you want your instance to be protected, it is highly recommended that you set a password once setup is complete.

The output of this Terraform configuration will be:
- 1 DigitalOcean Droplet
@@ -12,8 +10,6 @@ The output of this Terraform configuration will be:
- A DigitalOcean account with billing information
- Terraform installed on your local machine
- Follow the instructions in the [official Terraform documentation](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli) for your operating system.
- `.env` file that is filled out with your settings and set up in the `docker/` folder


## How to deploy on DigitalOcean
Open your terminal and navigate to the `digitalocean/terraform` folder
@@ -36,7 +32,7 @@ terraform destroy
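For readers new to Terraform, the commands referenced in this guide follow the standard CLI lifecycle. A generic sketch, run from the `digitalocean/terraform` folder (the repo-specific variable setup is in the portion of the doc not shown here):

```
terraform init      # download the provider plugins
terraform plan      # preview the droplet that will be created
terraform apply     # create the droplet running AnythingLLM
terraform destroy   # tear everything down when you are finished
```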
## Please read this notice before submitting issues about your deployment
**Note:**
Your instance will not be available instantly. Depending on the instance size you launched with it can take anywhere from 10-20 minutes to fully boot up.
Your instance will not be available instantly. Depending on the instance size you launched with, it can take anywhere from 5-10 minutes to fully boot up.
If you want to check the instance's progress, navigate to [your deployed instances](https://cloud.digitalocean.com/droplets) and connect to your instance via SSH in browser.
2 changes: 1 addition & 1 deletion cloud-deployments/digitalocean/terraform/user_data.tp1
@@ -12,7 +12,7 @@ mkdir -p /home/anythingllm
touch /home/anythingllm/.env

sudo docker pull mintplexlabs/anythingllm:master
sudo docker run -d -p 3001:3001 -v /home/anythingllm:/app/server/storage -v /home/anythingllm/.env:/app/server/.env -e STORAGE_DIR="/app/server/storage" mintplexlabs/anythingllm:master
sudo docker run -d -p 3001:3001 --cap-add SYS_ADMIN -v /home/anythingllm:/app/server/storage -v /home/anythingllm/.env:/app/server/.env -e STORAGE_DIR="/app/server/storage" mintplexlabs/anythingllm:master
echo "Container ID: $(sudo docker ps --latest --quiet)"

export ONLINE=$(curl -Is http://localhost:3001/api/ping | head -n 1|cut -d$' ' -f2)
15 changes: 4 additions & 11 deletions cloud-deployments/gcp/deployment/DEPLOY.md
@@ -1,8 +1,6 @@
# How to deploy a private AnythingLLM instance on GCP

With a GCP account you can easily deploy a private AnythingLLM instance on GCP. This will create a url that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys and they will not be exposed - however if you want your instance to be protected it is highly recommend that you set the `AUTH_TOKEN` and `JWT_SECRET` variables in the `docker/` ENV.

[Refer to .env.example](../../../docker/HOW_TO_USE_DOCKER.md) for data format.
With a GCP account you can easily deploy a private AnythingLLM instance on GCP. This will create a URL that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys and they will not be exposed - however, if you want your instance to be protected, it is highly recommended that you set a password once setup is complete.

The output of this cloudformation stack will be:
- 1 GCP VM
@@ -11,19 +9,15 @@ The output of this cloudformation stack will be:

**Requirements**
- A GCP account with billing information.
- AnythingLLM (GUI + document processor) must use a n1-standard-1 minimum and 10Gib SSD hard disk volume
- `.env` file that is filled out with your settings and set up in the `docker/` folder

## How to deploy on GCP
Open your terminal
1. Generate your specific cloudformation document by running `yarn generate:gcp_deployment` from the project root directory.
2. This will create a new file (`gcp_deploy_anything_llm_with_env.yaml`) in the `gcp/deployment` folder.
3. Log in to your GCP account using the following command:
1. Log in to your GCP account using the following command:
```
gcloud auth login
```
4. After successful login, Run the following command to create a deployment using the Deployment Manager CLI:
2. After successful login, run the following command to create a deployment using the Deployment Manager CLI:
```

@@ -57,5 +51,4 @@ If you want to check the instances progress, navigate to [your deployed instance
Once connected run `sudo tail -f /var/log/cloud-init-output.log` and wait for the file to conclude deployment of the docker image.
Additionally, your use of this deployment process means you are fully responsible for any costs of these GCP resources.
@@ -34,7 +34,7 @@ resources:
touch /home/anythingllm/.env
sudo docker pull mintplexlabs/anythingllm:master
sudo docker run -d -p 3001:3001 -v /home/anythingllm:/app/server/storage -v /home/anythingllm/.env:/app/server/.env -e STORAGE_DIR="/app/server/storage" mintplexlabs/anythingllm:master
sudo docker run -d -p 3001:3001 --cap-add SYS_ADMIN -v /home/anythingllm:/app/server/storage -v /home/anythingllm/.env:/app/server/.env -e STORAGE_DIR="/app/server/storage" mintplexlabs/anythingllm:master
echo "Container ID: $(sudo docker ps --latest --quiet)"
export ONLINE=$(curl -Is http://localhost:3001/api/ping | head -n 1|cut -d$' ' -f2)
61 changes: 0 additions & 61 deletions cloud-deployments/gcp/deployment/generate.mjs

This file was deleted.

1 change: 0 additions & 1 deletion collector/.env.example

This file was deleted.

10 changes: 4 additions & 6 deletions collector/.gitignore
@@ -1,8 +1,6 @@
outputs/*/*.json
hotdir/*
hotdir/processed/*
hotdir/failed/*
!hotdir/__HOTDIR__.md
!hotdir/processed
!hotdir/failed

yarn-error.log
!yarn.lock
outputs
scripts
1 change: 1 addition & 0 deletions collector/.nvmrc
@@ -0,0 +1 @@
v18.13.0
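The new `.nvmrc` pins the collector to Node v18.13.0. If you use nvm, a minimal sketch for matching that pin locally (assumes nvm is already installed):

```
cd collector
nvm install   # installs the version listed in .nvmrc (v18.13.0)
nvm use       # switches the current shell to that version
node -v       # should print v18.13.0
```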
62 changes: 0 additions & 62 deletions collector/README.md

This file was deleted.

32 changes: 0 additions & 32 deletions collector/api.py

This file was deleted.

16 changes: 1 addition & 15 deletions collector/hotdir/__HOTDIR__.md
@@ -1,17 +1,3 @@
### What is the "Hot directory"

This is the location where you can dump all supported file types and have them automatically converted and prepared to be digested by the vectorizing service and selected from the AnythingLLM frontend.

Files dropped in here will only be processed when you are running `python watch.py` from the `collector` directory.

Once converted the original file will be moved to the `hotdir/processed` folder so that the original document is still able to be linked to when referenced when attached as a source document during chatting.

**Supported File types**
- `.md`
- `.txt`
- `.pdf`

__requires more development__
- `.png .jpg etc`
- `.mp3`
- `.mp4`
This is a pre-set file location that documents will be written to when uploaded by AnythingLLM. There is really no need to touch it.

3 comments on commit 719521c

@franzbischoff
Contributor

So I found the mess!!! Goodbye Python?

@timothycarambat
Member Author

Yes finally!! Which should make the devcontainer stuff now easy to wrap my head around, since most of the unknowns came from supporting that!

@franzbischoff
Contributor

@franzbischoff commented on 719521c Dec 18, 2023 via email
