[Bug] dataprep doesn't support uploading *.md files #1241
Comments
We have seen the same issue when reading from a URL in DocSum. The root cause is that recent library updates require additional NLTK downloads (punkt_tab and averaged_perceptron_tagger_eng). We can help with fixing this. Here is a similar PR from @okhleif-IL for DocSum: opea-project/GenAIExamples#1487
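For illustration, here is a minimal sketch of how the two resources named above could be fetched only when they are missing. It is not taken from the dataprep code or from the linked PR; the helper names `ensure_resources` and `ensure_with_nltk` are hypothetical, and the tagger lookup path is an assumption based on how recent NLTK releases lay out their data. The `punkt_tab` path is the one shown in the traceback in this issue.

```python
from typing import Callable, Dict, List

# Resource name -> lookup path passed to nltk.data.find.
# The punkt_tab path matches the traceback in this issue; the tagger path
# is an assumption about the layout used by recent NLTK releases.
REQUIRED_RESOURCES: Dict[str, str] = {
    "punkt_tab": "tokenizers/punkt_tab/english/",
    "averaged_perceptron_tagger_eng": "taggers/averaged_perceptron_tagger_eng/",
}


def ensure_resources(
    find: Callable[[str], object],
    download: Callable[[str], object],
    resources: Dict[str, str] = REQUIRED_RESOURCES,
) -> List[str]:
    """Download every resource that find() cannot locate.

    Returns the names of the resources that had to be downloaded.
    """
    fetched: List[str] = []
    for name, path in resources.items():
        try:
            find(path)  # nltk.data.find raises LookupError when missing
        except LookupError:
            download(name)
            fetched.append(name)
    return fetched


def ensure_with_nltk() -> List[str]:
    """Hypothetical convenience wrapper that wires in the real NLTK calls."""
    import nltk  # imported lazily so the helper above has no hard dependency

    return ensure_resources(nltk.data.find, nltk.download)
```

Calling `ensure_with_nltk()` once at container build time or at service startup, before the first ingest request, would be one way to use this. It needs network access the first time it runs.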
@lianhao Which variant of dataprep are you running? I am trying to reproduce this issue. I'm running data prep with redis and I've tried building the container from I was following the README instructions running with docker. I set the env vars that are specified and built the dataprep image. I started the redis container:
And then started the data prep container:
I saw that you were using docker compose, however the dataprep redis README file says that is deprecated. After the data prep container is ready, I downloaded your test file and ingested it:
I also checked the logs and I don't see the error that you're seeing.
I've updated the bug description. I'm using image opea/dataprep:1.2 with redis DB. The
This is an issue that has been fixed in a recent release of To work around this issue, you can build the container from source (using the
Priority
P2-High
OS type
Ubuntu
Hardware type
Xeon-ICX
Installation method
Deploy method
Running nodes
Single Node
What's the version?
docker compose file is from git commit 09c3eeb
docker image version is 1.2
Description
When uploading the file
https://raw.githubusercontent.com/opea-project/GenAIExamples/refs/heads/main/README.md
to version 1.2 of dataprep, we encountered the following error in the dataprep container:
... ...
File "/home/user/.local/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 105, in _get_punkt_tokenizer
return PunktTokenizer(language)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/nltk/tokenize/punkt.py", line 1744, in __init__
self.load_lang(lang)
File "/home/user/.local/lib/python3.11/site-packages/nltk/tokenize/punkt.py", line 1749, in load_lang
lang_dir = find(f"tokenizers/punkt_tab/{lang}/")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/nltk/data.py", line 579, in find
raise LookupError(resource_not_found)
LookupError:
Resource punkt_tab not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
Attempted to load tokenizers/punkt_tab/english/
Searched in:
- '/home/user/nltk_data'
- '/usr/local/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/local/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- '/home/user/.local/lib/python3.11/site-packages/llama_index/core/_static/nltk_cache'
INFO: 127.0.0.1:44654 - "POST /v1/dataprep/ingest HTTP/1.1" 200 OK
Reproduce steps
Raw log
Attachments
No response