docs/reference/aggregations/search-aggregations-bucket-significantterms-aggregation.md
Lines changed: 15 additions & 15 deletions
@@ -253,8 +253,8 @@ Like most design decisions, this is the basis of a trade-off in which we have ch
The JLH score can be used as a significance score by adding the parameter

```js
-"jlh": {
-}
+"jlh": {
+}
```

The scores are derived from the doc frequencies in *foreground* and *background* sets. The *absolute* change in popularity (foregroundPercent - backgroundPercent) would favor common terms, whereas the *relative* change in popularity (foregroundPercent / backgroundPercent) would favor rare terms. Rare vs common is essentially a precision vs recall balance, so the absolute and relative changes are multiplied to provide a sweet spot between precision and recall.
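The multiplication described above can be sketched in a few lines of Python. This is an illustration only, not the aggregation's actual implementation: the function name `jlh_score`, its argument names, and the choice to return `0.0` for terms that are not more popular in the foreground are all assumptions made for the example.

```python
def jlh_score(subset_freq, subset_size, superset_freq, superset_size):
    """Illustrative JLH-style score: absolute change times relative change.

    subset_*   describe the foreground set (docs containing the term, total docs).
    superset_* describe the background set (docs containing the term, total docs).
    """
    if subset_size == 0 or superset_size == 0 or superset_freq == 0:
        return 0.0  # assumption: undefined percentages score zero

    foreground_percent = subset_freq / subset_size
    background_percent = superset_freq / superset_size

    absolute_change = foreground_percent - background_percent
    if absolute_change <= 0:
        return 0.0  # assumption: terms that did not gain popularity score zero

    relative_change = foreground_percent / background_percent
    return absolute_change * relative_change


# A rare term that is concentrated in the foreground.
rare = jlh_score(subset_freq=10, subset_size=100,
                 superset_freq=10, superset_size=1000)

# A common term that is only slightly more popular in the foreground.
common = jlh_score(subset_freq=500, subset_size=1000,
                   superset_freq=4000, superset_size=10000)
```

In this sketch the common term has the larger absolute change and the rare term the larger relative change, and the product ranks the rare but concentrated term higher.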
@@ -265,9 +265,9 @@ The scores are derived from the doc frequencies in *foreground* and *background*
Mutual information as described in "Information Retrieval", Manning et al., Chapter 13.5.1 can be used as significance score by adding the parameter

```js
-"mutual_information": {
-"include_negatives":true
-}
+"mutual_information": {
+"include_negatives":true
+}
```

Mutual information does not differentiate between terms that are descriptive for the subset and terms that are descriptive for documents outside the subset. The significant terms can therefore contain terms that appear either more or less frequently in the subset than outside it. To filter out the terms that appear less often in the subset than in documents outside the subset, `include_negatives` can be set to `false`.
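To make the effect of `include_negatives` concrete, here is a rough Python sketch of mutual information over a 2x2 contingency table (term present or absent, document inside or outside the subset). It is not the aggregation's code: the function name, the argument names, the assumption that the background counts include the subset, and the use of negative infinity to drop "negative" terms are all choices made for this example.

```python
import math


def mutual_information(subset_freq, subset_size, superset_freq, superset_size,
                       include_negatives=True):
    """Illustrative mutual information between term presence and subset membership.

    Assumes the superset (background) counts include the subset's documents.
    """
    n = superset_size
    outside_size = superset_size - subset_size

    n11 = subset_freq                  # term present, doc in subset
    n10 = superset_freq - subset_freq  # term present, doc outside subset
    n01 = subset_size - subset_freq    # term absent,  doc in subset
    n00 = outside_size - n10           # term absent,  doc outside subset

    # Assumption: "negative" terms (rarer inside the subset than outside it)
    # are given the lowest possible score when include_negatives is False.
    if not include_negatives and subset_size > 0 and outside_size > 0:
        if n11 / subset_size < n10 / outside_size:
            return float("-inf")

    def cell(n_cell, n_row, n_col):
        # An empty cell contributes nothing to the sum.
        if n_cell == 0:
            return 0.0
        return (n_cell / n) * math.log2(n * n_cell / (n_row * n_col))

    has_term = n11 + n10
    no_term = n01 + n00
    return (cell(n11, has_term, subset_size)
            + cell(n10, has_term, outside_size)
            + cell(n01, no_term, subset_size)
            + cell(n00, no_term, outside_size))


# A term that is much rarer inside the subset than outside it still gets a
# positive score, because the measure is symmetric.
with_negatives = mutual_information(1, 100, 500, 1000)
without_negatives = mutual_information(1, 100, 500, 1000, include_negatives=False)
```

With `include_negatives=True` the under-represented term receives a positive score, while the second call drops it to the bottom of the ranking.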
@@ -284,8 +284,8 @@ Per default, the assumption is that the documents in the bucket are also contain
Chi square as described in "Information Retrieval", Manning et al., Chapter 13.5.2 can be used as significance score by adding the parameter

```js
-"chi_square": {
-}
+"chi_square": {
+}
```

Chi square behaves like mutual information and can be configured with the same parameters `include_negatives` and `background_is_superset`.
@@ -296,8 +296,8 @@ Chi square behaves like mutual information and can be configured with the same p
Google normalized distance as described in ["The Google Similarity Distance", Cilibrasi and Vitanyi, 2007](https://arxiv.org/pdf/cs/0412098v3.pdf) can be used as significance score by adding the parameter

```js
-"gnd": {
-}
+"gnd": {
+}
```

`gnd` also accepts the `background_is_superset` parameter.
@@ -394,8 +394,8 @@ The benefit of this heuristic is that the scoring logic is simple to explain to
It would be hard for a seasoned boxer to win a championship if the prize was awarded purely on the basis of percentage of fights won - by these rules a newcomer with only one fight under their belt would be impossible to beat. Multiple observations are typically required to reinforce a view so it is recommended in these cases to set both `min_doc_count` and `shard_min_doc_count` to a higher value such as 10 in order to filter out the low-frequency terms that otherwise take precedence.

```js
-"percentage": {
-}
+"percentage": {
+}
```

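The boxer analogy can be illustrated with a small Python sketch. It assumes the percentage score is the number of foreground documents containing a term divided by the number of background documents containing it, and it models only a single `min_doc_count` threshold (no per-shard `shard_min_doc_count`). The function name and the sample counts are hypothetical.

```python
def rank_by_percentage(term_counts, min_doc_count=1):
    """Rank terms by foreground_count / background_count.

    term_counts maps each term to (foreground_count, background_count).
    Terms seen in fewer than min_doc_count foreground docs are dropped.
    """
    scored = []
    for term, (foreground_count, background_count) in term_counts.items():
        if foreground_count < min_doc_count or background_count == 0:
            continue
        scored.append((term, foreground_count / background_count))
    # Highest percentage first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical counts: one "fight" won out of one, versus 45 out of 50.
counts = {"newcomer": (1, 1), "veteran": (45, 50)}

unfiltered = rank_by_percentage(counts)
filtered = rank_by_percentage(counts, min_doc_count=10)
```

Without a threshold, the single-observation term wins with a perfect score. Raising `min_doc_count` to 10 removes it, so the well-supported term ranks first.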
@@ -413,11 +413,11 @@ If none of the above measures suits your usecase than another option is to imple
Customized scores can be implemented via a script: