[Bug] dataprep doesn't support uploading *.md files #1241

Open

lianhao opened this issue Jan 26, 2025 · 4 comments
@lianhao (Collaborator)

lianhao commented Jan 26, 2025

Priority

P2-High

OS type

Ubuntu

Hardware type

Xeon-ICX

Installation method

  • Pull docker images from hub.docker.com
  • Build docker images from source
  • Other

Deploy method

  • Docker
  • Docker Compose
  • Kubernetes Helm Charts
  • Other

Running nodes

Single Node

What's the version?

docker compose file is from git commit 09c3eeb
docker image version is 1.2

Description

When uploading the file https://raw.githubusercontent.com/opea-project/GenAIExamples/refs/heads/main/README.md to the 1.2 version of dataprep, we encountered the following error in the dataprep container:

... ...
  File "/home/user/.local/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 105, in _get_punkt_tokenizer
    return PunktTokenizer(language)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/nltk/tokenize/punkt.py", line 1744, in __init__
    self.load_lang(lang)
  File "/home/user/.local/lib/python3.11/site-packages/nltk/tokenize/punkt.py", line 1749, in load_lang
    lang_dir = find(f"tokenizers/punkt_tab/{lang}/")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/nltk/data.py", line 579, in find
    raise LookupError(resource_not_found)
LookupError:

  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/home/user/nltk_data'
    - '/usr/local/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/home/user/.local/lib/python3.11/site-packages/llama_index/core/_static/nltk_cache'

INFO: 127.0.0.1:44654 - "POST /v1/dataprep/ingest HTTP/1.1" 200 OK

Reproduce steps

  1. build and launch the dataprep microservice using docker compose
  2. curl http://localhost:6007/v1/dataprep/ingest -X POST -H "Content-Type: multipart/form-data" -F "files=@./README.md"

Raw log

Attachments

No response

@lianhao lianhao added the bug Something isn't working label Jan 26, 2025
@ashahba ashahba self-assigned this Jan 31, 2025
@dmsuehir (Contributor)

dmsuehir commented Jan 31, 2025

We have seen the same issue when reading from a URL in DocSum. The root cause is recent library updates that now require additional nltk resources to be downloaded (punkt_tab and averaged_perceptron_tagger_eng). We can help with fixing this.

Here is the similar PR from @okhleif-IL for DocSum: opea-project/GenAIExamples#1487
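For reference, the two resources named above can be fetched ahead of time so the container doesn't hit the lookup failure at ingest. This is a minimal sketch, assuming nltk is already installed in the dataprep image; the resource names are taken from this thread, not from the dataprep code itself:

```python
# Workaround sketch: pre-download the NLTK resources that recent
# library updates require. Resource names come from this issue thread.
REQUIRED_NLTK_RESOURCES = ["punkt_tab", "averaged_perceptron_tagger_eng"]

def download_nltk_resources(resources=REQUIRED_NLTK_RESOURCES):
    import nltk  # third-party; imported lazily so this module loads without it
    for name in resources:
        # nltk.download() is effectively a no-op if the resource is cached
        nltk.download(name)
```

Running `download_nltk_resources()` once at image build time (or container startup) would populate `/home/user/nltk_data`, the first directory in the search list shown in the traceback.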

@dmsuehir (Contributor)

@lianhao Which variant of dataprep are you running? I am trying to reproduce this issue. I'm running dataprep with redis, and I've tried building the container from main as well as from commit 09c3eeb; in both cases the data ingestion is working.

I was following the README instructions running with docker. I set the env vars that are specified and built the dataprep image.

I started the redis container:

docker run -d -p 6379:6379 -p 8001:8001 redis/redis-stack:7.2.0-v9

And then started the data prep container:

Note the 6007:5000 port mapping.

docker run -d --name="dataprep-redis-server" -p 6007:5000 --runtime=runc --ipc=host \
  -e http_proxy=$http_proxy -e https_proxy=$https_proxy \
  -e REDIS_URL=$REDIS_URL -e INDEX_NAME=$INDEX_NAME \
  -e TEI_ENDPOINT=$TEI_ENDPOINT -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN \
  opea/dataprep:latest

I saw that you were using docker compose; however, the dataprep redis README says that method is deprecated.

After the data prep container is ready, I downloaded your test file and ingested it:

wget https://raw.githubusercontent.com/opea-project/GenAIExamples/refs/heads/main/README.md

curl -X POST -H "Content-Type: multipart/form-data" -F "files=@./README.md" http://$your_ip:6007/v1/dataprep/ingest
{"status":200,"message":"Data preparation succeeded"}

I also checked the logs and I don't see the error that you're seeing.

@lianhao (Collaborator, Author)

lianhao commented Feb 5, 2025

> @lianhao Which variant of dataprep are you running? I am trying to reproduce this issue. I'm running data prep with redis and I've tried building the container from main as well as from the 09c3eeb and in both cases the data ingestion is working.

I've updated the bug description. I'm using the opea/dataprep:1.2 image with the redis DB. The 09c3eeb version of the dataprep image seems fine; I don't know why the 1.2 version is buggy, though.

@dmsuehir (Contributor)

dmsuehir commented Feb 5, 2025

This is an issue that was fixed in a recent release of unstructured. With unstructured version 0.16.16 and later, the required nltk resources are downloaded automatically, so this error should no longer occur.

To work around the issue, you can build the container from source (using the v1.2 tag, if that's the release you want to use), which will pull in the latest unstructured library that includes the automatic nltk download.
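Concretely, the build-from-source workaround could look like the sketch below. The repository URL and the Dockerfile path are assumptions based on the usual GenAIComps layout (the dataprep Dockerfile location has moved between releases), so check them against your checkout before building:

```shell
# Hedged sketch: build the dataprep image from source at the v1.2 tag.
# Assumption: dataprep lives in opea-project/GenAIComps and its
# Dockerfile sits at comps/dataprep/src/Dockerfile -- verify both.
git clone https://github.com/opea-project/GenAIComps.git
cd GenAIComps
git checkout v1.2
docker build -t opea/dataprep:1.2 -f comps/dataprep/src/Dockerfile .
```

A fresh build picks up the newer unstructured release at pip-install time, which is what makes the nltk download automatic.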
