search stemming is overzealous #1563

Open
alexduryee opened this issue Nov 5, 2024 · 2 comments

@alexduryee

Following up from the November 4 community call: we discussed search term stemming in Solr and how it is currently too aggressive. Users have found that relevant results get buried because stemming conflates terms such as the following:

  • eugenics matches eugene
  • organs matches organization

There's no way for a user to search for an exact term without stemming, since quotation marks only group phrases and don't bypass stemming.

Duke's approach to this was to include an unstemmed index field (https://gitlab.oit.duke.edu/dul-its/dul-arclight/-/blob/main/solr/arclight/conf/solrconfig.xml#L133), which is weighted above the stemmed ones.
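
As a rough illustration of that pattern (this is not Duke's actual configuration; the field names text_unstemmed_tim and text_stemmed_tesim and the boost values are hypothetical), an edismax handler could list both copies of a field in qf and weight the unstemmed one higher:

```xml
<!-- solrconfig.xml sketch: request handler defaults for an edismax search.
     Field names and boost values are illustrative assumptions. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- An exact match hits the boosted unstemmed field; a stem-only match
         hits only the stemmed field, so it tends to rank lower. -->
    <str name="qf">
      text_unstemmed_tim^10
      text_stemmed_tesim^1
    </str>
  </lst>
</requestHandler>
```

The unstemmed field would also need to be populated at index time, for example through a copyField from the source fields.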

Questions to discuss:

  • Do alternative Solr stemmers provide a better search experience?
  • Does Duke's approach meet user expectations? Are terms still being buried?
  • How important is exact-term searching via quotation marks? Can that be implemented?
@corylown
Contributor

corylown commented Nov 5, 2024

Some initial responses:

  • Solr includes a variety of stemming options. It could be worth investigating whether switching to one of these different stemming strategies improves things. I think we've used EnglishMinimalStemFilterFactory in other projects and it's less aggressive.
  • Duke's approach (indexing both stemmed and unstemmed copies of fields and boosting the unstemmed matches) is the typical solution to this problem. The person searching doesn't have to know any specific querying techniques for it to work: relevance ranking pushes unstemmed matches to the top of the results and less exact matches further down. I think we should consider implementing this strategy in ArcLight. There may also be fields that are being stemmed that we should stop stemming altogether (for example, any fields for names).
  • Quotes indicate a phrase query to Solr's query parser. I'm wary of trying to implement something in ArcLight that would try to use quotes in a query to mean something different from what Solr expects. You'd have to manage query parsing at the application level and translate to meaningful queries for Solr.
  • Another option available to implementers (I'm not sure adding this to ArcLight out of the box makes sense) would be to configure a fielded search option that includes only unstemmed copies of fields, for cases where the searcher knows they don't want stemming. I think Duke's approach is better, but this would give the expert searcher more control.
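
To make the first point concrete, here is a minimal sketch of a field type that uses the lighter stemmer. The type name text_en_minimal is a hypothetical label, and the tokenizer and folding filter are assumptions chosen for illustration rather than ArcLight's current analysis chain:

```xml
<!-- schema.xml sketch: a lighter-stemming text type.
     The type name "text_en_minimal" is a hypothetical label. -->
<fieldType name="text_en_minimal" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- Removes English plural suffixes only; it does not strip
         derivational endings the way Porter-style stemmers do. -->
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Because this changes index-time analysis, existing documents would need to be reindexed before the effect on results could be evaluated.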

@bibliotechy

For reference in this conversation, Blacklight core ships with a single boosted unstemmed field in the default search. Details below.

I think the inclusion of this in Blacklight core makes a strong case that it would not be heavy-handed to also include it in ArcLight by default.


In the solrconfig.xml:

<str name="pf">
  all_text_timv^10
</str>

In the schema:

all_text_timv is defined as a text field:

<field name="all_text_timv" type="text" stored="false" indexed="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>

The text fieldType is defined with no stemming in its analysis chain:

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>  <!-- NFKC, case folding, diacritics removed -->
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

all_text_timv is the destination of multiple copy fields:

<copyField source="*_tsim" dest="all_text_timv" maxChars="3000"/>
<copyField source="*_tesim" dest="all_text_timv" maxChars="3000"/>
<copyField source="*_ssim" dest="all_text_timv" maxChars="3000"/>
<copyField source="*_si" dest="all_text_timv" maxChars="3000"/>
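
Building on this field, the fielded exact-search option raised earlier in the thread could be sketched as an extra pair of named parameters in the handler defaults. The parameter names exact_qf and exact_pf are hypothetical, and this is an assumption about how an implementer might wire it up, not something Blacklight or ArcLight ships:

```xml
<!-- solrconfig.xml sketch, placed inside the search handler's
     <lst name="defaults">. Parameter names are hypothetical. -->
<str name="exact_qf">
  all_text_timv
</str>
<str name="exact_pf">
  all_text_timv^10
</str>
```

A search field in the application could then reference these through Solr local parameters, for example {!edismax qf=$exact_qf pf=$exact_pf}, so that only the unstemmed field is consulted. Note that pf only boosts documents in which the query terms occur close together as a phrase; which documents match at all is governed by qf.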
