From 98c163e1f24a87e2e386cc80bb2c7d567db3bf5d Mon Sep 17 00:00:00 2001
From: Max Hniebergall <137079448+maxhniebergall@users.noreply.github.com>
Date: Wed, 27 Nov 2024 15:01:08 -0500
Subject: [PATCH] [ML] Add ML limitation for ingesting large documents (#2877)

* Update ml-nlp-limitations.asciidoc

* change link

* Update docs/en/stack/ml/nlp/ml-nlp-limitations.asciidoc

Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>

---------

Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>
---
 docs/en/stack/ml/nlp/ml-nlp-limitations.asciidoc | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/docs/en/stack/ml/nlp/ml-nlp-limitations.asciidoc b/docs/en/stack/ml/nlp/ml-nlp-limitations.asciidoc
index 8673cdb19..e505bb63b 100644
--- a/docs/en/stack/ml/nlp/ml-nlp-limitations.asciidoc
+++ b/docs/en/stack/ml/nlp/ml-nlp-limitations.asciidoc
@@ -9,6 +9,12 @@
 The following limitations and known problems apply to the {version} release of
 the Elastic {nlp} trained models feature.
 
+[discrete]
+[[ml-nlp-large-documents-limit-10k-10mb]]
+== Document size limitations when using `semantic_text` fields
+
+When using semantic text to ingest documents, chunking takes place automatically. The number of chunks is limited by the {ref}/mapping-settings-limit.html[`index.mapping.nested_objects.limit`] cluster setting, which defaults to 10k. Documents that are too large will cause errors during ingestion. To avoid this issue, please split your documents into roughly 1MB parts before ingestion.
+
 [discrete]
 [[ml-nlp-elser-v1-limit-512]]
 == ELSER semantic search is limited to 512 tokens per field that inference is applied to
@@ -17,4 +23,4 @@ When you use ELSER for semantic search, only the first 512 extracted tokens
 from each field of the ingested documents that ELSER is applied to are taken
 into account for the search process. If your data set contains long documents,
 divide them into smaller segments before ingestion if you need the full text to be
-searchable.
\ No newline at end of file
+searchable.
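
To illustrate the workaround the new section describes, here is a minimal sketch, not part of the patch, of pre-splitting an oversized document into roughly 1 MB parts before bulk ingestion. The cluster URL, the index name `my-index`, the `semantic_text` field name `content`, and the input file name are all hypothetical; only the ~1 MB guidance and the `index.mapping.nested_objects.limit` default of 10k come from the documentation change above.

[source,python]
----
# Sketch of the pre-splitting workaround described above (assumptions: a local
# cluster at http://localhost:9200, an index named "my-index" whose "content"
# field is mapped as semantic_text, and the official Python client).
from elasticsearch import Elasticsearch, helpers

MAX_PART_BYTES = 1_000_000  # roughly 1 MB per part, per the limitation above


def split_into_parts(text, max_bytes=MAX_PART_BYTES):
    """Yield paragraph-aligned slices of `text` whose UTF-8 size stays near max_bytes."""
    part, size = [], 0
    for paragraph in text.split("\n\n"):
        length = len(paragraph.encode("utf-8"))
        if part and size + length > max_bytes:
            yield "\n\n".join(part)
            part, size = [], 0
        # a single paragraph larger than max_bytes is kept whole in this sketch
        part.append(paragraph)
        size += length
    if part:
        yield "\n\n".join(part)


client = Elasticsearch("http://localhost:9200")

with open("large-document.txt", encoding="utf-8") as f:
    full_text = f.read()

# Index each part as its own document so automatic chunking stays well below the
# index.mapping.nested_objects.limit default of 10k chunks per document.
actions = (
    {"_index": "my-index", "_source": {"content": part, "part": i}}
    for i, part in enumerate(split_into_parts(full_text))
)
helpers.bulk(client, actions)
----

Any splitting heuristic would do here; the point is simply that each indexed document stays comfortably below the chunk-count limit so ingestion does not fail.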