Alternative for SetNextReader to return all strings #910

Closed
1 task done
mowali opened this issue Jan 30, 2024 · 1 comment
Labels
docs Applies to the API docs or website

mowali commented Jan 30, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Describe the documentation issue

PaulVrugt asked this question but never got a response:

The FieldCache GetStrings method was replaced by GetTerms, but GetTerms requires an AtomicReader. We used to be able to pass an IndexReader into this method, and it returned a string array containing the values. How do I get the same behavior from the GetTerms method?

Is there no way to get the behavior that GetStrings had in version 3.0.3?

Additional context

Here is the link to that thread:
#398

@mowali added the docs label Jan 30, 2024
@NightOwl888
Contributor

The Migration Guide covers this very issue with an example:

LUCENE-2380: FieldCache.GetStrings/Index --> FieldCache.GetDocTerms/Index

  • The field values returned when sorting by SortField.STRING are now
    BytesRef. You can call value.Utf8ToString() to convert back to
    string, if necessary.

  • In FieldCache, GetStrings (returning string[]) has been replaced
    with GetTerms (returning a BinaryDocValues instance).
    BinaryDocValues provides a Get method, taking a docID and a BytesRef
    to fill (which must not be null), and it fills it in with the
    reference to the bytes for that term.

    If you had code like this before:

    string[] values = FieldCache.DEFAULT.GetStrings(reader, field);
    ...
    string aValue = values[docID];

    you can do this instead:

    BinaryDocValues values = FieldCache.DEFAULT.GetTerms(reader, field);
    ...
    BytesRef term = new BytesRef();
    values.Get(docID, term);
    string aValue = term.Utf8ToString();

    Note however that it can be costly to convert to String, so it's better to work directly with the BytesRef.

  • Similarly, in FieldCache, GetStringIndex (returning a StringIndex
    instance, with direct arrays int[] order and String[] lookup) has
    been replaced with GetTermsIndex (returning a
    SortedDocValues instance). SortedDocValues provides the
    GetOrd(int docID) method to lookup the int order for a document,
    LookupOrd(int ord, BytesRef result) to lookup the term from a given
    order, and the sugar method Get(int docID, BytesRef result)
    which internally calls GetOrd and then LookupOrd.

    If you had code like this before:

    StringIndex idx = FieldCache.DEFAULT.GetStringIndex(reader, field);
    ...
    int ord = idx.order[docID];
    String aValue = idx.lookup[ord];

    you can do this instead:

    SortedDocValues idx = FieldCache.DEFAULT.GetTermsIndex(reader, field);
    ...
    int ord = idx.GetOrd(docID);
    BytesRef term = new BytesRef();
    idx.LookupOrd(ord, term);
    string aValue = term.Utf8ToString();

    Note however that it can be costly to convert to String, so it's better to work directly with the BytesRef.

    SortedDocValues also has a GetTermsEnum() method, which returns an iterator (TermsEnum) over the term values in the index (i.e., iterates ord = 0..NumOrd-1).

Furthermore, if you drill down into the issue LUCENE-2380, there is an explanation for the change: it was done primarily for performance reasons. The field cache no longer stores a string[]. The underlying data is now a byte[], so an extra step is required to decode the UTF-8 bytes into a string.

Do note that you are meant to reuse the BytesRef instance that is passed in to get better performance.
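
To tie this back to the original question, here is a minimal sketch of a helper that takes a plain IndexReader and returns a string[] indexed by docID, roughly as GetStrings did in 3.0.3. It is not part of the Migration Guide: the FieldCacheStrings class and its GetStrings method are hypothetical names, and the sketch assumes Lucene.NET 4.8, where SlowCompositeReaderWrapper.Wrap exposes a composite reader as an AtomicReader and GetTerms takes a third setDocsWithField argument (the guide's snippet shows a two-argument form). A single BytesRef is reused for every document, as recommended above.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Util;

// Hypothetical helper, not part of Lucene.NET.
public static class FieldCacheStrings
{
    public static string[] GetStrings(IndexReader reader, string field)
    {
        // Assumption: viewing the whole index as one AtomicReader is acceptable here.
        // This is convenient but can be slow on large, multi-segment indexes.
        AtomicReader atomicReader = SlowCompositeReaderWrapper.Wrap(reader);

        // Assumption: the three-argument overload (setDocsWithField = false).
        BinaryDocValues values = FieldCache.DEFAULT.GetTerms(atomicReader, field, false);

        var result = new string[atomicReader.MaxDoc];
        var term = new BytesRef(); // reused for every document

        for (int docID = 0; docID < result.Length; docID++)
        {
            values.Get(docID, term);              // points term at the cached bytes
            result[docID] = term.Utf8ToString();  // one string allocation per document
        }

        return result;
    }
}
```

Calling FieldCacheStrings.GetStrings(reader, "title") would then stand in for the old FieldCache.DEFAULT.GetStrings(reader, "title"), where "title" is just an example field name. Documents with no value for the field should come back as an empty string here rather than null, and materializing every value as a string gives up the memory savings that motivated the change, so this is best kept for small indexes or one-off migrations.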

@paulirwin closed this as not planned Oct 28, 2024