[ML] Add ML limitation for ingesting large documents #2882

Merged 1 commit on Nov 28, 2024
docs/en/stack/ml/nlp/ml-nlp-limitations.asciidoc (8 changes: 7 additions, 1 deletion)
@@ -9,6 +9,12 @@
The following limitations and known problems apply to the {version} release of
the Elastic {nlp} trained models feature.

[discrete]
[[ml-nlp-large-documents-limit-10k-10mb]]
== Document size limitations when using `semantic_text` fields

When you use `semantic_text` fields to ingest documents, chunking takes place automatically. The number of chunks is limited by the {ref}/mapping-settings-limit.html[`index.mapping.nested_objects.limit`] index setting, which defaults to 10000. Documents that produce more chunks than this limit cause errors during ingestion. To avoid this issue, split large documents into parts of roughly 1MB before ingestion.
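
For illustration only (not part of this PR), the following sketch shows one way to apply the splitting workaround with the official Python `elasticsearch` client. The index name `my-index`, the `semantic_text`-mapped `content` field, and the whitespace-based splitting heuristic are assumptions; the byte budget is a rough stand-in for the 1MB guidance above.

[source,python]
----
from elasticsearch import Elasticsearch

MAX_PART_BYTES = 1_000_000  # roughly 1MB per part, per the guidance above


def split_into_parts(text, max_bytes=MAX_PART_BYTES):
    """Split text on whitespace so each part stays under max_bytes when UTF-8 encoded."""
    parts, current, current_size = [], [], 0
    for word in text.split():
        word_size = len(word.encode("utf-8")) + 1  # +1 for the joining space
        if current and current_size + word_size > max_bytes:
            parts.append(" ".join(current))
            current, current_size = [], 0
        current.append(word)
        current_size += word_size
    if current:
        parts.append(" ".join(current))
    return parts


es = Elasticsearch("http://localhost:9200")  # adjust for your deployment

with open("big_document.txt", encoding="utf-8") as f:
    large_text = f.read()

for i, part in enumerate(split_into_parts(large_text)):
    # "content" is assumed to be mapped as a semantic_text field in "my-index"
    es.index(index="my-index", document={"content": part, "part": i})
----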

[discrete]
[[ml-nlp-elser-v1-limit-512]]
== ELSER semantic search is limited to 512 tokens per field that inference is applied to
@@ -17,4 +23,4 @@
When you use ELSER for semantic search, only the first 512 extracted tokens from
each field of the ingested documents that ELSER is applied to are taken into
account for the search process. If your data set contains long documents, divide
them into smaller segments before ingestion if you need the full text to be
searchable.
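
For illustration only (not part of this PR), the sketch below shows one way to segment a long field with the Python `elasticsearch` client before ingestion. The 300-word segment size is an assumed rough proxy for the 512-token window, not an official conversion, and the index and field names are hypothetical.

[source,python]
----
from elasticsearch import Elasticsearch

WORDS_PER_SEGMENT = 300  # assumption: a rough proxy that stays under 512 ELSER tokens


def segment_text(text, size=WORDS_PER_SEGMENT):
    """Split text into consecutive segments of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


es = Elasticsearch("http://localhost:9200")  # adjust for your deployment

with open("long_document.txt", encoding="utf-8") as f:
    long_text = f.read()

for i, segment in enumerate(segment_text(long_text)):
    # "text_field" is assumed to be the field that ELSER inference is applied to
    es.index(index="my-elser-index", document={"text_field": segment, "segment": i})
----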