Dockerized and modified ingest.py #144

Open
wants to merge 3 commits into
base: main
34 changes: 34 additions & 0 deletions .dockerignore
@@ -0,0 +1,34 @@
# Include any files or directories that you don't want to be copied to your
# container here (e.g., local build artifacts, temporary files, etc.).
#
# For more help, visit the .dockerignore file reference guide at
# https://docs.docker.com/go/build-context-dockerignore/

**/.DS_Store
**/__pycache__
**/.venv
**/.classpath
**/.dockerignore
**/.env
**/.git
**/.gitignore
**/.project
**/.settings
**/.toolstarget
**/.vs
**/.vscode
**/*.*proj.user
**/*.dbmdl
**/*.jfm
**/bin
**/charts
**/docker-compose*
**/compose.y*ml
**/Dockerfile*
**/node_modules
**/npm-debug.log
**/obj
**/secrets.dev.yaml
**/values.dev.yaml
LICENSE
README.md
6 changes: 6 additions & 0 deletions .gitignore
@@ -11,3 +11,9 @@ TODO.md
runs/
_*/
env/
/.idea/.gitignore
/.idea/aws.xml
/.idea/misc.xml
/.idea/modules.xml
/.idea/vcs.xml
/.idea/warc-gpt-public.iml
30 changes: 30 additions & 0 deletions Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
# syntax=docker/dockerfile:1

# Comments are provided throughout this file to help you get started.
# If you need more help, visit the Dockerfile reference guide at
# https://docs.docker.com/go/dockerfile-reference/

# Want to help us make this template better? Share your feedback here: https://forms.gle/ybq9Krt8jtBL3iCk7

ARG PYTHON_VERSION=3.12.4
FROM python:${PYTHON_VERSION}-slim AS base

# Prevents Python from writing pyc files.
ENV PYTHONDONTWRITEBYTECODE=1

# Keeps Python from buffering stdout and stderr to avoid situations where
# the application crashes without emitting any logs due to buffering.
ENV PYTHONUNBUFFERED=1

# Copy the source code into the container.
COPY . .

# Install Poetry, then the project's dependencies.
RUN pip install poetry
RUN poetry env use 3.12 && poetry install

# Run the application on localhost:5000 (reachable only from inside the container).
#CMD ["poetry", "run", "flask", "run"]

# Run the application on 0.0.0.0:5000 so the published port is reachable from the host.
CMD ["poetry", "run", "flask", "run", "--host", "0.0.0.0"]
28 changes: 28 additions & 0 deletions README.Docker.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
### Building and running your application

When you're ready, start your application by running:
`docker compose up --build`.

Your application will be available at http://localhost:5000.

### Deploying your application to the cloud

First, build your image, e.g.: `docker build -t myapp .`.
If your cloud uses a different CPU architecture than your development
machine (e.g., you are on a Mac M1 and your cloud provider is amd64),
you'll want to build the image for that platform, e.g.:
`docker build --platform=linux/amd64 -t myapp .`.

Then, push it to your registry, e.g. `docker push myregistry.com/myapp`.

Consult Docker's [getting started](https://docs.docker.com/go/get-started-sharing/)
docs for more detail on building and pushing.

### Executing ingestion and visualization commands

With the container running, run the `ingest` and `visualize` commands inside it:

Ingest: `docker exec -it warc-gpt poetry run flask ingest`

Visualize: `docker exec -it warc-gpt poetry run flask visualize`

### References
* [Docker's Python guide](https://docs.docker.com/language/python/)
19 changes: 19 additions & 0 deletions docker-compose.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
services:
  warc-gpt:
    container_name: warc-gpt
    env_file: .env
    build: .
    ports:
      - "5000:5000"
    restart: always
    environment:
      OLLAMA_API_URL: ${OLLAMA_API_URL}
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      OPENAI_ORG_ID: ${OPENAI_ORG_ID}
      WARC_FOLDER_PATH: ${WARC_FOLDER_PATH}
      VISUALIZATIONS_FOLDER_PATH: ${VISUALIZATIONS_FOLDER_PATH}
      VECTOR_SEARCH_PATH: ${VECTOR_SEARCH_PATH}
    volumes:
      - "${WARC_FOLDER_PATH}:/warc"
      - "${VECTOR_SEARCH_PATH}:/chromadb"
      - "${VISUALIZATIONS_FOLDER_PATH}:/visualizations"
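
The compose file above interpolates six variables from the environment or `.env`, and three of them become the host side of volume mounts. A small pre-flight check can catch unset or empty values before `docker compose up`. This is only a sketch: `parse_env_lines` and `missing_vars` are hypothetical helpers, the parser handles plain `KEY=VALUE` lines only (not the full `.env` syntax), and whether every variable is actually required depends on application configuration that is not shown in this PR.

```python
from typing import Dict, Iterable, List, Mapping

# Variable names copied from docker-compose.yaml in this PR.
COMPOSE_VARS = (
    "OLLAMA_API_URL",
    "OPENAI_API_KEY",
    "OPENAI_ORG_ID",
    "WARC_FOLDER_PATH",
    "VISUALIZATIONS_FOLDER_PATH",
    "VECTOR_SEARCH_PATH",
)


def parse_env_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments.

    Simplified on purpose: no multi-line values, escapes, or `export` prefixes.
    """
    values: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop one layer of surrounding quotes, if any.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def missing_vars(names: Iterable[str], env: Mapping[str, str]) -> List[str]:
    """Return the names that are absent or empty in env, in the given order."""
    return [name for name in names if not env.get(name)]


# Illustrative .env content (made up for this example).
sample = [
    "# local settings",
    "WARC_FOLDER_PATH=./warc",
    "VECTOR_SEARCH_PATH=",
    "OPENAI_API_KEY='sk-test'",
]
print(missing_vars(COMPOSE_VARS, parse_env_lines(sample)))
```

To check a real file, pass `open(".env")` to `parse_env_lines` in place of `sample`.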
11 changes: 7 additions & 4 deletions warc_gpt/commands/ingest.py
@@ -174,10 +174,13 @@ def ingest(batch_size) -> None:
 #
 if record_data["warc_record_content_type"].startswith("application/pdf"):
     raw = io.BytesIO(record.raw_stream.read())
-    pdf = PdfReader(raw)
-
-    for page in pdf.pages:
-        record_data["warc_record_text"] += page.extract_text()
+    try:
+        pdf = PdfReader(raw)
+        for page in pdf.pages:
+            record_data["warc_record_text"] += page.extract_text()
+    except Exception as exc:
+        print(f"- Could not extract text from {record_data['warc_record_target_uri']}")
+        continue
 
 #
 # Stop here if we don't have text, or text contains less than 5 words