No hits with search threshold 0 on documents containing words with common root #911

Open
fturmel opened this issue Mar 12, 2025 · 4 comments

Comments

@fturmel
Contributor

fturmel commented Mar 12, 2025

Describe the bug

When doing full text search with threshold 0 on a document that contains a few words with common roots, we don't get a hit until we've typed enough characters to disambiguate them.

To Reproduce

Run the following searches with threshold 0:

On the indexed value "Phone, phonogram":

  • search for "p", "ph", "pho" or "phon" -> no hits (we should get a hit obviously)
  • search for "phone" or "phono" -> 1 hit (as expected)

On the indexed value "Bet, better":

  • search for "b", "be" or "bet" -> no hits (we should get a hit, it's even worse than the previous case because "bet" is actually a full word match)
  • search for "bett", "bette" or "better" -> 1 hit (as expected)
  • search for "bet hi" -> 1 hit (searching for an additional word now gives us a hit for "bet", puzzling...)

On the indexed value "Some random sentence":

  • search for "s" -> no hits (we have 2 words that start with s, should be getting a hit)
  • search for "r" -> 1 hit
  • search for "se" or "so" -> 1 hit

Expected behavior

See the expected results noted in the reproduction steps above.
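
The expectation in the reproduction steps can be written down as a tiny reference oracle. This is only a sketch and not Orama's implementation: it assumes that, with threshold 0, every search term must prefix-match at least one token of the indexed value. The names `tokenize` and `expectedCount` are hypothetical helpers introduced for illustration.

```typescript
// Hypothetical reference oracle, NOT Orama's code.
// Assumption: with threshold 0, every search term must be a prefix of
// at least one token in the indexed value.

// Lowercase the text and split it on anything that is not a letter or digit.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 0);
}

// Count the documents in which every search term prefix-matches some token.
function expectedCount(docs: string[], term: string): number {
  const terms = tokenize(term);
  if (terms.length === 0) return 0; // empty queries are out of scope for this sketch
  return docs.filter((doc) => {
    const tokens = tokenize(doc);
    return terms.every((t) => tokens.some((token) => token.startsWith(t)));
  }).length;
}

const docs = ["Phone, phonogram", "Bet, better", "Some random sentence"];

for (const term of ["p", "phon", "bet", "bet hi", "s"]) {
  console.log(`"${term}" -> ${expectedCount(docs, term)} expected hit(s)`);
}
```

Under that assumption, "p", "phon", "bet" and "s" each match one document, while "bet hi" matches none because no token starts with "hi".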

Environment Info

OS: macOS 15.3.2
Node: 22.14.0
Orama: 3.1.2

Affected areas

Search

Additional context

No response

@fturmel
Contributor Author

fturmel commented Mar 12, 2025

@micheleriva here are the unit tests to add to packages/orama/tests/threshold.test.ts. 8 out of 14 are failing at the moment.

t.test('should return results for words with same root if threshold is 0', async t => {
  // related issue: https://github.com/oramasearch/orama/issues/911

  const db = create({
    schema: {
      title: 'string'
    }
  })

  await insert(db, { title: 'Phone, phonogram' })
  await insert(db, { title: 'Bet, better' })
  await insert(db, { title: 'Some random sentence' })

  const testCases: [string, number][] = [
    ['p', 1],
    ['ph', 1],
    ['pho', 1],
    ['phone', 1],
    ['phono', 1],

    ['b', 1],
    ['be', 1],
    ['bet', 1],
    ['bett', 1],
    ['bet hi', 0], // the term "hi" is not in any document, there should be no hits with threshold 0

    ['s', 1],
    ['r', 1],
    ['se', 1],
    ['so', 1]
  ]

  t.plan(testCases.length)

  for (const [term, expectedCount] of testCases) {
    const result = await search(db, { term, threshold: 0 })
    t.same(
      result.count,
      expectedCount,
      `Search term "${term}" with threshold 0 should match ${expectedCount} record(s), but matched ${result.count}`
    )
  }
})

@fturmel
Contributor Author

fturmel commented Mar 19, 2025

I'll just add that as far as I can tell, this is a regression from Orama v2.

@micheleriva Is there any way you could confirm this is a bug and not a usage/comprehension issue on my end? I have to solve this for a project, which will require either going back to v2 or dropping Orama altogether. I don't think I have the time or sufficient understanding of the internals to work on a PR myself at the moment.

Let me know if any additional info would be helpful here. Thanks!

@gaurav21r

gaurav21r commented Mar 24, 2025

@fturmel I can confirm this as well. Thanks for the suggestion! Backporting to 2.0.24 makes this work but that has issues too.

I have a feeling this error might be due to some mismatch between tolerance and threshold, though I'm not a search algorithm expert, so I won't comment further without proper investigation.
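
For readers unfamiliar with the two parameters, here is a rough conceptual model of how they differ. It is an illustration under assumptions, not Orama's code: it treats tolerance as a maximum Levenshtein edit distance for matching a single term against a token, and threshold as the switch between "all terms must match" (0) and "any term may match" (1). Values between 0 and 1, and prefix matching, are deliberately left out. `editDistance`, `termMatches` and `documentMatches` are hypothetical names.

```typescript
// Illustrative model only; hypothetical helpers, not Orama's internals.

// Classic Levenshtein edit distance, computed with a single rolling row.
function editDistance(a: string, b: string): number {
  const row: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diagonal = row[0]; // value of the previous row at column j - 1
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]; // value of the previous row at column j
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diagonal + cost);
      diagonal = above;
    }
  }
  return row[b.length];
}

// Tolerance (assumed): how many edits a single term may be away from a token.
function termMatches(term: string, token: string, tolerance: number): boolean {
  return editDistance(term, token) <= tolerance;
}

// Threshold (assumed, endpoints only): 0 requires every term to match,
// 1 accepts a document as soon as any term matches.
function documentMatches(
  queryTerms: string[],
  tokens: string[],
  tolerance: number,
  threshold: 0 | 1
): boolean {
  const hit = (term: string) =>
    tokens.some((token) => termMatches(term, token, tolerance));
  return threshold === 0 ? queryTerms.every(hit) : queryTerms.some(hit);
}

const tokens = ["bet", "better"];
console.log(documentMatches(["bet", "hi"], tokens, 0, 0)); // all terms required
console.log(documentMatches(["bet", "hi"], tokens, 0, 1)); // any term suffices
```

In this model the two knobs are independent: tolerance changes what counts as a match for one term, while threshold changes how many terms need a match. Whether the reported behavior comes from an interaction between them is not established in this thread.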

I am using Orama for a large food dataset, and 3.x is basically unusable for me because of the same issue that @fturmel described. @micheleriva, I think it's imperative to add the cases he listed to the unit tests. I'll also try to contribute more. Since I have a proprietary database right out of a PhD lab, I'll need to do some processing on the data and some paperwork before I can present a small test case here.

@micheleriva
Member

Looking at this. Thanks for noticing the issue.
