<?xml version="1.0"?>
<rss version="2.0">
<channel>
<title>Life</title>
<subtitle></subtitle>
<icon>https://songlinlife.top/images/favicon.ico</icon>
<link>https://songlinlife.top</link>
<author>
<name>Kalice</name>
</author>
<description>Life is not about lifestyle, it means Lithium and Ferrum.</description>
<language>zh-CN</language>
<pubDate>Thu, 09 Jun 2022 14:09:30 +0800</pubDate>
<lastBuildDate>Thu, 09 Jun 2022 14:09:30 +0800</lastBuildDate>
<item>
<guid isPermaLink="true">https://songlinlife.top/2022/vespa-%E4%BA%BF%E7%BA%A7%E5%90%91%E9%87%8F%E7%B4%A2%E5%BC%95%E6%96%B9%E6%A1%88/</guid>
<title>Vespa's billion-scale vector indexing scheme</title>
<link>https://songlinlife.top/2022/vespa-%E4%BA%BF%E7%BA%A7%E5%90%91%E9%87%8F%E7%B4%A2%E5%BC%95%E6%96%B9%E6%A1%88/</link>
<category term="Database" scheme="https://songlinlife.top/categories/%E6%95%B0%E6%8D%AE%E5%BA%93/" />
<category term="ANNS" scheme="https://songlinlife.top/tags/ANNS/" />
<pubDate>Thu, 09 Jun 2022 14:09:30 +0800</pubDate>
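<description><![CDATA[ <p>Vespa is another company building a vector database, and it seems to be about as popular as Milvus. This post walks through Vespa's approach to billion-scale vector search, described in<span class="exturl" data-url="aHR0cHM6Ly9ibG9nLnZlc3BhLmFpL3Zlc3BhLWh5YnJpZC1iaWxsaW9uLXNjYWxlLXZlY3Rvci1zZWFyY2gv"> Billion-scale vector search using hybrid HNSW-IF</span>. After a first read, my immediate thought was that this might actually work: the approach is reminiscent of SPANN, but more streamlined.</p>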
<h3 id="spann"><a class="anchor" href="#spann">#</a> Spann</h3>
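<p>I read about SPANN a while ago but never wrote a post on it, because I found the algorithm too convoluted to be engineered into a real system. Here is a brief summary of it:</p>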
<ol>
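<li>Run hierarchical k-means to select the centroids, and build an SPTAG index over the centroids.</li>
<li>Assign each non-centroid vector to its corresponding posting list (a posting list stores both the id and the vector).</li>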
</ol>
<p>SPANN adds two optimizations here:</p>
<p><span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mtable rowspacing="0.15999999999999992em" columnalign="right" columnspacing="1em"><mtr><mtd><mstyle scriptlevel="0" displaystyle="false"><mrow><mi mathvariant="bold">x</mi><mo>∈</mo><msub><mi mathvariant="bold">X</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub><mo>⟺</mo><mi mathvariant="normal">Dist</mi><mo></mo><mrow><mo fence="true">(</mo><mi mathvariant="bold">x</mi><mo separator="true">,</mo><msub><mi mathvariant="bold">c</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub><mo fence="true">)</mo></mrow><mo>≤</mo><mrow><mo fence="true">(</mo><mn>1</mn><mo>+</mo><msub><mi>ϵ</mi><mn>1</mn></msub><mo fence="true">)</mo></mrow><mo>×</mo><mi mathvariant="normal">Dist</mi><mo></mo><mrow><mo fence="true">(</mo><mi mathvariant="bold">x</mi><mo separator="true">,</mo><msub><mi mathvariant="bold">c</mi><mrow><mi>i</mi><mn>1</mn></mrow></msub><mo fence="true">)</mo></mrow></mrow></mstyle></mtd></mtr><mtr><mtd><mstyle scriptlevel="0" displaystyle="false"><mrow><mi mathvariant="normal">Dist</mi><mo></mo><mrow><mo fence="true">(</mo><mi mathvariant="bold">x</mi><mo separator="true">,</mo><msub><mi mathvariant="bold">c</mi><mrow><mi>i</mi><mn>1</mn></mrow></msub><mo fence="true">)</mo></mrow><mo>≤</mo><mi mathvariant="normal">Dist</mi><mo></mo><mrow><mo fence="true">(</mo><mi mathvariant="bold">x</mi><mo separator="true">,</mo><msub><mi mathvariant="bold">c</mi><mrow><mi>i</mi><mn>2</mn></mrow></msub><mo fence="true">)</mo></mrow><mo>≤</mo><mo>⋯</mo><mo>≤</mo><mi mathvariant="normal">Dist</mi><mo></mo><mrow><mo fence="true">(</mo><mi mathvariant="bold">x</mi><mo separator="true">,</mo><msub><mi mathvariant="bold">c</mi><mrow><mi>i</mi><mi>K</mi></mrow></msub><mo fence="true">)</mo></mrow></mrow></mstyle></mtd></mtr></mtable><annotation encoding="application/x-tex">\begin{array}{r}
\mathbf{x} \in \mathbf{X}_{i j} \Longleftrightarrow \operatorname{Dist}\left(\mathbf{x}, \mathbf{c}_{i j}\right) \leq\left(1+\epsilon_{1}\right) \times \operatorname{Dist}\left(\mathbf{x}, \mathbf{c}_{i 1}\right) \\
\operatorname{Dist}\left(\mathbf{x}, \mathbf{c}_{i 1}\right) \leq \operatorname{Dist}\left(\mathbf{x}, \mathbf{c}_{i 2}\right) \leq \cdots \leq \operatorname{Dist}\left(\mathbf{x}, \mathbf{c}_{i K}\right)
\end{array}
</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:2.4000000000000004em;vertical-align:-0.9500000000000004em;"></span><span class="mord"><span class="mtable"><span class="arraycolsep" style="width:0.5em;"></span><span class="col-align-r"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.45em;"><span style="top:-3.61em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord"><span class="mord mathbf">x</span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">∈</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mord"><span class="mord"><span class="mord mathbf">X</span></span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.311664em;"><span style="top:-2.5500000000000003em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mord mathnormal mtight" style="margin-right:0.05724em;">j</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.286108em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">⟺</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mop"><span class="mord mathrm">D</span><span class="mord mathrm">i</span><span class="mord mathrm">s</span><span class="mord mathrm">t</span></span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="minner"><span class="mopen delimcenter" style="top:0em;">(</span><span class="mord"><span class="mord mathbf">x</span></span><span 
class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord"><span class="mord"><span class="mord mathbf">c</span></span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.311664em;"><span style="top:-2.5500000000000003em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mord mathnormal mtight" style="margin-right:0.05724em;">j</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.286108em;"><span></span></span></span></span></span></span><span class="mclose delimcenter" style="top:0em;">)</span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">≤</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="minner"><span class="mopen delimcenter" style="top:0em;">(</span><span class="mord">1</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mord"><span class="mord mathnormal">ϵ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mclose delimcenter" style="top:0em;">)</span></span><span class="mspace" 
style="margin-right:0.2222222222222222em;"></span><span class="mbin">×</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mop"><span class="mord mathrm">D</span><span class="mord mathrm">i</span><span class="mord mathrm">s</span><span class="mord mathrm">t</span></span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="minner"><span class="mopen delimcenter" style="top:0em;">(</span><span class="mord"><span class="mord mathbf">x</span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord"><span class="mord"><span class="mord mathbf">c</span></span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.31166399999999994em;"><span style="top:-2.5500000000000003em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mclose delimcenter" style="top:0em;">)</span></span></span></span><span style="top:-2.4099999999999997em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mop"><span class="mord mathrm">D</span><span class="mord mathrm">i</span><span class="mord mathrm">s</span><span class="mord mathrm">t</span></span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="minner"><span class="mopen delimcenter" style="top:0em;">(</span><span class="mord"><span class="mord mathbf">x</span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord"><span class="mord"><span class="mord 
mathbf">c</span></span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.31166399999999994em;"><span style="top:-2.5500000000000003em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mclose delimcenter" style="top:0em;">)</span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">≤</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mop"><span class="mord mathrm">D</span><span class="mord mathrm">i</span><span class="mord mathrm">s</span><span class="mord mathrm">t</span></span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="minner"><span class="mopen delimcenter" style="top:0em;">(</span><span class="mord"><span class="mord mathbf">x</span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord"><span class="mord"><span class="mord mathbf">c</span></span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.31166399999999994em;"><span style="top:-2.5500000000000003em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mord mtight">2</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mclose delimcenter" 
style="top:0em;">)</span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">≤</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="minner">⋯</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">≤</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mop"><span class="mord mathrm">D</span><span class="mord mathrm">i</span><span class="mord mathrm">s</span><span class="mord mathrm">t</span></span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="minner"><span class="mopen delimcenter" style="top:0em;">(</span><span class="mord"><span class="mord mathbf">x</span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord"><span class="mord"><span class="mord mathbf">c</span></span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.32833099999999993em;"><span style="top:-2.5500000000000003em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mord mathnormal mtight" style="margin-right:0.07153em;">K</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mclose delimcenter" style="top:0em;">)</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.9500000000000004em;"><span></span></span></span></span></span><span class="arraycolsep" style="width:0.5em;"></span></span></span></span></span></span></span></p>
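<p>The first limits how many posting lists a non-centroid vector is assigned to: if another centroid's distance to x exceeds the distance from x to its nearest centroid by more than a given ratio (the factor 1 + ε₁ in the condition above), x is simply not assigned to that centroid.</p>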
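<p>A minimal sketch of this assignment rule, assuming Euclidean distance and a brute-force scan over a small, non-empty list of centroids. The function name <code>assign_posting_lists</code> and the <code>max_replicas</code> cap are hypothetical names introduced for illustration, not SPANN's actual API:</p>

```python
import math

def assign_posting_lists(x, centroids, eps1, max_replicas=8):
    """Return the indices of the centroids whose posting lists receive x.

    max_replicas is a hypothetical cap on how many lists one vector may join.
    """
    # Rank every centroid by its distance to x (ascending).
    ranked = sorted((math.dist(x, c), j) for j, c in enumerate(centroids))
    ranked = ranked[:max_replicas]
    nearest = ranked[0][0]
    # Keep centroid j only if Dist(x, c_j) is within (1 + eps1) * Dist(x, c_1).
    threshold = (1.0 + eps1) * nearest
    return [j for d, j in ranked if d <= threshold]

centroids = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
print(assign_posting_lists((0.4, 0.0), centroids, eps1=1.0))
print(assign_posting_lists((0.4, 0.0), centroids, eps1=0.0))
```

<p>With a larger <code>eps1</code> a boundary vector is replicated into more posting lists, which trades disk space for recall; with <code>eps1 = 0</code> it lands only in the list of its nearest centroid.</p>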
<p>The second optimization skips the assignment whenever <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="normal">Dist</mi><mo></mo><mrow><mo fence="true">(</mo><msub><mi mathvariant="bold">c</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub><mo separator="true">,</mo><mi mathvariant="bold">x</mi><mo fence="true">)</mo></mrow><mo>&gt;</mo><mi mathvariant="normal">Dist</mi><mo></mo><mrow><mo fence="true">(</mo><msub><mi mathvariant="bold">c</mi><mrow><mi>i</mi><mi>j</mi><mo>−</mo><mn>1</mn></mrow></msub><mo separator="true">,</mo><msub><mi mathvariant="bold">c</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub><mo fence="true">)</mo></mrow></mrow><annotation encoding="application/x-tex">\operatorname{Dist}\left(\mathbf{c}_{i j}, \mathbf{x}\right)&gt;\operatorname{Dist}\left(\mathbf{c}_{i j-1}, \mathbf{c}_{i j}\right)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.036108em;vertical-align:-0.286108em;"></span><span class="mop"><span class="mord mathrm">D</span><span class="mord mathrm">i</span><span class="mord mathrm">s</span><span class="mord mathrm">t</span></span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="minner"><span class="mopen delimcenter" style="top:0em;">(</span><span class="mord"><span class="mord"><span class="mord mathbf">c</span></span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.311664em;"><span style="top:-2.5500000000000003em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mord mathnormal mtight" style="margin-right:0.05724em;">j</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" 
style="height:0.286108em;"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord"><span class="mord mathbf">x</span></span><span class="mclose delimcenter" style="top:0em;">)</span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:1.036108em;vertical-align:-0.286108em;"></span><span class="mop"><span class="mord mathrm">D</span><span class="mord mathrm">i</span><span class="mord mathrm">s</span><span class="mord mathrm">t</span></span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="minner"><span class="mopen delimcenter" style="top:0em;">(</span><span class="mord"><span class="mord"><span class="mord mathbf">c</span></span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.311664em;"><span style="top:-2.5500000000000003em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mord mathnormal mtight" style="margin-right:0.05724em;">j</span><span class="mbin mtight">−</span><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.286108em;"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord"><span class="mord"><span class="mord mathbf">c</span></span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.311664em;"><span 
style="top:-2.5500000000000003em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mord mathnormal mtight" style="margin-right:0.05724em;">j</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.286108em;"><span></span></span></span></span></span></span><span class="mclose delimcenter" style="top:0em;">)</span></span></span></span></span>; put simply, this borrows the idea of the RNG (relative neighborhood graph) rule.</p>
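<p>A small sketch of this RNG-style filter, read literally from the inequality: the candidate centroids are sorted by distance to x, and a candidate is skipped when it is farther from x than from the centroid ranked just before it. The function name <code>rng_filter</code> is hypothetical, Euclidean distance is an assumption, and an actual implementation may compare against other centroids than the immediately preceding one:</p>

```python
import math

def rng_filter(x, centroids, candidates):
    """Filter candidate centroids for x with an RNG-style rule.

    candidates: centroid indices sorted by ascending distance to x.
    Returns the indices whose posting lists still receive x.
    """
    if not candidates:
        return []
    kept = [candidates[0]]  # the nearest centroid is always kept
    for prev, cur in zip(candidates, candidates[1:]):
        d_x = math.dist(centroids[cur], x)
        d_prev = math.dist(centroids[prev], centroids[cur])
        # Skip c_cur when Dist(c_cur, x) > Dist(c_prev, c_cur):
        # c_cur sits close to c_prev, so its list would add little.
        if d_x > d_prev:
            continue
        kept.append(cur)
    return kept

centroids = [(1.0, 0.0), (1.2, 0.0), (-1.5, 0.0)]
print(rng_filter((0.0, 0.0), centroids, [0, 1, 2]))
```

<p>In the example the second centroid sits right next to the first and is skipped, while the third lies on the other side of x and is kept, so replicas are spread across different directions instead of piling up in one neighbourhood.</p>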
<p>Search works the same way, with the query q in place of x:</p>
<p><span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mtable rowspacing="0.15999999999999992em" columnalign="right" columnspacing="1em"><mtr><mtd><mstyle scriptlevel="0" displaystyle="false"><mrow><mi mathvariant="bold">q</mi><mo><mover><mo><mo>⟶</mo></mo><mtext> search </mtext></mover></mo><msub><mi mathvariant="bold">X</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub><mo>⟺</mo><mi mathvariant="normal">Dist</mi><mo></mo><mrow><mo fence="true">(</mo><mi mathvariant="bold">q</mi><mo separator="true">,</mo><msub><mi mathvariant="bold">c</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub><mo fence="true">)</mo></mrow><mo>≤</mo><mrow><mo fence="true">(</mo><mn>1</mn><mo>+</mo><msub><mi>ϵ</mi><mn>2</mn></msub><mo fence="true">)</mo></mrow><mo>×</mo><mi mathvariant="normal">Dist</mi><mo></mo><mrow><mo fence="true">(</mo><mi mathvariant="bold">q</mi><mo separator="true">,</mo><msub><mi mathvariant="bold">c</mi><mrow><mi>i</mi><mn>1</mn></mrow></msub><mo fence="true">)</mo></mrow><mo separator="true">,</mo></mrow></mstyle></mtd></mtr><mtr><mtd><mstyle scriptlevel="0" displaystyle="false"><mrow><mi mathvariant="normal">Dist</mi><mo></mo><mrow><mo fence="true">(</mo><mi mathvariant="bold">q</mi><mo separator="true">,</mo><msub><mi mathvariant="bold">c</mi><mrow><mi>i</mi><mn>1</mn></mrow></msub><mo fence="true">)</mo></mrow><mo>≤</mo><mi mathvariant="normal">Dist</mi><mo></mo><mrow><mo fence="true">(</mo><mi mathvariant="bold">q</mi><mo separator="true">,</mo><msub><mi mathvariant="bold">c</mi><mrow><mi>i</mi><mn>2</mn></mrow></msub><mo fence="true">)</mo></mrow><mo>≤</mo><mo>⋯</mo><mo>≤</mo><mi mathvariant="normal">Dist</mi><mo></mo><mrow><mo fence="true">(</mo><mi mathvariant="bold">q</mi><mo separator="true">,</mo><msub><mi mathvariant="bold">c</mi><mrow><mi>i</mi><mi>K</mi></mrow></msub><mo fence="true">)</mo></mrow></mrow></mstyle></mtd></mtr></mtable><annotation 
encoding="application/x-tex">\begin{array}{r}
\mathbf{q} \stackrel{\text { search }}{\longrightarrow} \mathbf{X}_{i j} \Longleftrightarrow \operatorname{Dist}\left(\mathbf{q}, \mathbf{c}_{i j}\right) \leq\left(1+\epsilon_{2}\right) \times \operatorname{Dist}\left(\mathbf{q}, \mathbf{c}_{i 1}\right), \\
\operatorname{Dist}\left(\mathbf{q}, \mathbf{c}_{i 1}\right) \leq \operatorname{Dist}\left(\mathbf{q}, \mathbf{c}_{i 2}\right) \leq \cdots \leq \operatorname{Dist}\left(\mathbf{q}, \mathbf{c}_{i K}\right)
\end{array}
</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:2.857108em;vertical-align:-1.178554em;"></span><span class="mord"><span class="mtable"><span class="arraycolsep" style="width:0.5em;"></span><span class="col-align-r"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.678554em;"><span style="top:-3.678554em;"><span class="pstrut" style="height:3.297108em;"></span><span class="mord"><span class="mord"><span class="mord mathbf">q</span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel"><span class="mop op-limits"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.2971080000000001em;"><span style="top:-3em;"><span class="pstrut" style="height:3em;"></span><span><span class="mop">⟶</span></span></span><span style="top:-3.7110000000000003em;margin-left:0em;"><span class="pstrut" style="height:3em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord text mtight"><span class="mord mtight"> search </span></span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.011em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mord"><span class="mord"><span class="mord mathbf">X</span></span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.311664em;"><span style="top:-2.5500000000000003em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mord mathnormal mtight" style="margin-right:0.05724em;">j</span></span></span></span></span><span class="vlist-s"></span></span><span 
class="vlist-r"><span class="vlist" style="height:0.286108em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">⟺</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mop"><span class="mord mathrm">D</span><span class="mord mathrm">i</span><span class="mord mathrm">s</span><span class="mord mathrm">t</span></span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="minner"><span class="mopen delimcenter" style="top:0em;">(</span><span class="mord"><span class="mord mathbf">q</span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord"><span class="mord"><span class="mord mathbf">c</span></span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.311664em;"><span style="top:-2.5500000000000003em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mord mathnormal mtight" style="margin-right:0.05724em;">j</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.286108em;"><span></span></span></span></span></span></span><span class="mclose delimcenter" style="top:0em;">)</span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">≤</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="minner"><span class="mopen delimcenter" style="top:0em;">(</span><span class="mord">1</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mord"><span class="mord 
mathnormal">ϵ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">2</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mclose delimcenter" style="top:0em;">)</span></span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">×</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mop"><span class="mord mathrm">D</span><span class="mord mathrm">i</span><span class="mord mathrm">s</span><span class="mord mathrm">t</span></span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="minner"><span class="mopen delimcenter" style="top:0em;">(</span><span class="mord"><span class="mord mathbf">q</span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord"><span class="mord"><span class="mord mathbf">c</span></span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.31166399999999994em;"><span style="top:-2.5500000000000003em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mclose delimcenter" style="top:0em;">)</span></span><span 
class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mpunct">,</span></span></span><span style="top:-2.478554em;"><span class="pstrut" style="height:3.297108em;"></span><span class="mord"><span class="mop"><span class="mord mathrm">D</span><span class="mord mathrm">i</span><span class="mord mathrm">s</span><span class="mord mathrm">t</span></span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="minner"><span class="mopen delimcenter" style="top:0em;">(</span><span class="mord"><span class="mord mathbf">q</span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord"><span class="mord"><span class="mord mathbf">c</span></span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.31166399999999994em;"><span style="top:-2.5500000000000003em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mclose delimcenter" style="top:0em;">)</span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">≤</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mop"><span class="mord mathrm">D</span><span class="mord mathrm">i</span><span class="mord mathrm">s</span><span class="mord mathrm">t</span></span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="minner"><span class="mopen delimcenter" style="top:0em;">(</span><span class="mord"><span class="mord mathbf">q</span></span><span class="mpunct">,</span><span class="mspace" 
style="margin-right:0.16666666666666666em;"></span><span class="mord"><span class="mord"><span class="mord mathbf">c</span></span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.31166399999999994em;"><span style="top:-2.5500000000000003em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mord mtight">2</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mclose delimcenter" style="top:0em;">)</span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">≤</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="minner">⋯</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">≤</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mop"><span class="mord mathrm">D</span><span class="mord mathrm">i</span><span class="mord mathrm">s</span><span class="mord mathrm">t</span></span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="minner"><span class="mopen delimcenter" style="top:0em;">(</span><span class="mord"><span class="mord mathbf">q</span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord"><span class="mord"><span class="mord mathbf">c</span></span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.32833099999999993em;"><span style="top:-2.5500000000000003em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord 
mtight"><span class="mord mathnormal mtight">i</span><span class="mord mathnormal mtight" style="margin-right:0.07153em;">K</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mclose delimcenter" style="top:0em;">)</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.178554em;"><span></span></span></span></span></span><span class="arraycolsep" style="width:0.5em;"></span></span></span></span></span></span></span></p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220609154844581.png" alt="image-20220609154844581" /></p>
<h3 id="hybrid-hnsw-if"><a class="anchor" href="#hybrid-hnsw-if">#</a> Hybrid HNSW-IF</h3>
<p>The authors state directly that they were inspired by the SPANN paper:</p>
<blockquote>
<p>Inspired by the <em>SPANN</em> <span class="exturl" data-url="aHR0cHM6Ly9hcnhpdi5vcmcvYWJzLzIxMTEuMDg1NjY=">paper</span>, we at the Vespa team implemented a simplified version of <code>SPANN</code> using <em>Vespa primitives</em>, released as a Vespa <span class="exturl" data-url="aHR0cHM6Ly9naXRodWIuY29tL3Zlc3BhLWVuZ2luZS9zYW1wbGUtYXBwcy90cmVlL21hc3Rlci9iaWxsaW9uLXNjYWxlLXZlY3Rvci1zZWFyY2g=">sample application</span>. We call this <em>hybrid</em> ANN search method for <code>HNSW-IF</code> .</p>
</blockquote>
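<h4 id="索引的构建"><a class="anchor" href="#索引的构建">#</a> Index construction</h4>
<h5 id="质心向量的索引构建"><a class="anchor" href="#质心向量的索引构建">#</a> Indexing the centroid vectors</h5>
<p>Unlike SPANN, which uses hierarchical k-means to find centroids, HNSW-IF randomly picks 20% of the vectors as centroids and then builds an HNSW index over them. Both the resulting graph and the centroid vectors are kept entirely in memory.</p>
<h5 id="非质心向量的索引构建"><a class="anchor" href="#非质心向量的索引构建">#</a> Indexing the non-centroid vectors</h5>
<p>Each non-centroid vector is searched against the HNSW index to get its k closest centroids, and is then assigned to the posting lists of those centroids according to SPANN's pruning rule. Note that a posting list entry no longer stores id + vector; it stores distance + id, which is what makes it an inverted index.</p>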
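<p>The build steps above can be sketched roughly as follows. This is a minimal illustration, not the Vespa sample application: <code>build_hnsw_if</code>, <code>centroid_ratio</code>, <code>k</code> and <code>eps1</code> are hypothetical names, a brute-force scan stands in for the HNSW search over centroids, and the assignment rule (keep a centroid when its distance is within a factor of 1 + eps1 of the closest one) is my reading of SPANN's pruning rule.</p>

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_hnsw_if(vectors, centroid_ratio=0.2, k=3, eps1=0.2, seed=0):
    """Return (centroid_ids, posting_lists) for a dict {id: vector}."""
    rng = random.Random(seed)
    ids = sorted(vectors)
    # Randomly pick a fixed share of the vectors as centroids.
    n_centroids = max(1, int(len(ids) * centroid_ratio))
    centroid_ids = rng.sample(ids, n_centroids)
    centroid_set = set(centroid_ids)
    posting_lists = {c: [] for c in centroid_ids}

    for vid in ids:
        if vid in centroid_set:
            continue
        # Assumption: brute force replaces the HNSW search over centroids.
        nearest = sorted(
            (dist(vectors[vid], vectors[c]), c) for c in centroid_ids
        )[:k]
        closest = nearest[0][0]
        for d, c in nearest:
            # Keep only centroids that are almost as close as the best one.
            if d <= (1 + eps1) * closest:
                # The posting list stores (distance, id), not the vector.
                posting_lists[c].append((d, vid))
    return centroid_ids, posting_lists
```

<p>Every non-centroid vector lands in at least one posting list, because its closest centroid always passes the check.</p>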
<h4 id="查询算法"><a class="anchor" href="#查询算法">#</a> Query algorithm</h4>
<p>First, search the HNSW index for the centroid nodes closest to the query.</p>
<h5 id="cluster-centroid-dynamic-pruning"><a class="anchor" href="#cluster-centroid-dynamic-pruning">#</a> cluster centroid dynamic pruning</h5>
<p>This is the SPANN pruning strategy described above. Once the centroids have been found, their posting lists can be loaded.</p>
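<p>A minimal sketch of that pruning step, assuming the centroid search has already returned (distance, id) pairs. <code>prune_centroids</code> and <code>eps2</code> are hypothetical names; the rule keeps a centroid only when its distance to the query is at most (1 + eps2) times the distance of the closest centroid.</p>

```python
def prune_centroids(candidates, eps2=0.1):
    """Filter centroid hits for one query.

    candidates: list of (distance_to_query, centroid_id) in any order.
    Returns the ids of the centroids whose posting lists should be read.
    """
    if not candidates:
        return []
    ranked = sorted(candidates)
    closest = ranked[0][0]
    # Drop centroids that are much farther away than the closest one.
    return [cid for d, cid in ranked if d <= (1 + eps2) * closest]
```

<p>A query that sits very close to one centroid therefore reads few posting lists, while a query on a cluster boundary reads more.</p>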
<h4 id="retrieve-using-dynamic-pruning"><a class="anchor" href="#retrieve-using-dynamic-pruning">#</a> <strong>Retrieve using dynamic pruning</strong></h4>
<p>The article describes a two-phase ranking strategy here:</p>
<ol>
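<li>Compute a weight z for each point in the posting lists as closeness (q, centroid) * closeness (centroid, v), sort by this weight, and keep a fixed number of points.</li>
<li>Fetch the full-precision vectors from disk, compute their exact distance to the query, and rank them to output the top-k.</li>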
</ol>
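<p>The two phases can be sketched like this. It is only an illustration under assumptions: <code>two_phase_rank</code>, <code>first_phase_size</code> and <code>load_vector</code> are hypothetical names, <code>load_vector</code> stands in for the disk read, and closeness is taken to be 1 / (1 + distance).</p>

```python
import math


def closeness(distance):
    # Assumption: closeness is defined as 1 / (1 + distance).
    return 1.0 / (1.0 + distance)


def two_phase_rank(query, centroid_hits, posting_lists, load_vector,
                   first_phase_size=4, top_k=2):
    """Rank posting-list entries in two phases.

    centroid_hits: list of (dist(q, centroid), centroid_id).
    posting_lists: {centroid_id: [(dist(centroid, v), vector_id), ...]}.
    load_vector:   callable mapping a vector id to its full-precision vector.
    """
    # Phase 1: cheap score built only from the stored distances.
    scored = {}
    for d_qc, cid in centroid_hits:
        for d_cv, vid in posting_lists.get(cid, []):
            z = closeness(d_qc) * closeness(d_cv)
            scored[vid] = max(z, scored.get(vid, 0.0))
    shortlist = sorted(scored, key=lambda v: scored[v], reverse=True)
    shortlist = shortlist[:first_phase_size]

    # Phase 2: exact distance on the full-precision vectors.
    exact = sorted((math.dist(query, load_vector(v)), v) for v in shortlist)
    return [v for _, v in exact[:top_k]]
```

<p>Only the shortlist from phase 1 triggers a vector load, which is where the disk savings come from.</p>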
<h4 id="构建参数"><a class="anchor" href="#构建参数">#</a> Build parameters</h4>
<p>HNSW:</p>
<pre><code> &lt;nodes deploy:environment=&quot;perf&quot; count=&quot;1&quot; groups=&quot;1&quot;&gt;
&lt;resources memory=&quot;128GB&quot; vcpu=&quot;16&quot;
disk=&quot;200Gb&quot; storage-type=&quot;remote&quot;/&gt;
&lt;/nodes&gt;
</code></pre>
<p>Non-centroid index construction:</p>
<pre><code>&lt;nodes deploy:environment=&quot;perf&quot; count=&quot;4&quot; groups=&quot;1&quot;&gt;
&lt;resources memory=&quot;32GB&quot; vcpu=&quot;8&quot;
disk=&quot;300Gb&quot; storage-type=&quot;local&quot;/&gt;
&lt;/nodes&gt;
</code></pre>
]]></description>
</item>
<item>
<guid isPermalink="true">https://songlinlife.top/2022/graphEmbedding/</guid>
<title>graphEmbedding</title>
<link>https://songlinlife.top/2022/graphEmbedding/</link>
<category term="AI" scheme="https://songlinlife.top/categories/ai/" />
<category term="GNN" scheme="https://songlinlife.top/tags/GNN/" />
<pubDate>Sat, 07 May 2022 11:16:32 +0800</pubDate>
<description><![CDATA[ <p>Taking advantage of the weekend to revisit some embedding material I have not looked at for a long time.</p>
<p>DeepWalk code:</p>
<figure class="highlight python"><figcaption data-lang="python"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">def</span> <span class="token function">deepwalk_walk</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> walk_length<span class="token punctuation">,</span> start_node<span class="token punctuation">)</span><span class="token punctuation">:</span></pre></td></tr><tr><td data-num="2"></td><td><pre></pre></td></tr><tr><td data-num="3"></td><td><pre> walk <span class="token operator">=</span> <span class="token punctuation">[</span>start_node<span class="token punctuation">]</span></pre></td></tr><tr><td data-num="4"></td><td><pre></pre></td></tr><tr><td data-num="5"></td><td><pre> <span class="token keyword">while</span> <span class="token builtin">len</span><span class="token punctuation">(</span>walk<span class="token punctuation">)</span> <span class="token operator">&lt;</span> walk_length<span class="token punctuation">:</span></pre></td></tr><tr><td data-num="6"></td><td><pre> cur <span class="token operator">=</span> walk<span class="token punctuation">[</span><span class="token operator">-</span><span class="token number">1</span><span class="token punctuation">]</span></pre></td></tr><tr><td data-num="7"></td><td><pre> cur_nbrs <span class="token operator">=</span> <span class="token builtin">list</span><span class="token punctuation">(</span>self<span class="token punctuation">.</span>G<span class="token punctuation">.</span>neighbors<span class="token punctuation">(</span>cur<span class="token punctuation">)</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="8"></td><td><pre> <span class="token keyword">if</span> <span class="token builtin">len</span><span class="token punctuation">(</span>cur_nbrs<span class="token punctuation">)</span> <span class="token operator">></span> <span class="token number">0</span><span class="token 
punctuation">:</span></pre></td></tr><tr><td data-num="9"></td><td><pre> walk<span class="token punctuation">.</span>append<span class="token punctuation">(</span>random<span class="token punctuation">.</span>choice<span class="token punctuation">(</span>cur_nbrs<span class="token punctuation">)</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="10"></td><td><pre> <span class="token keyword">else</span><span class="token punctuation">:</span></pre></td></tr><tr><td data-num="11"></td><td><pre> <span class="token keyword">break</span></pre></td></tr><tr><td data-num="12"></td><td><pre> <span class="token keyword">return</span> walk</pre></td></tr></table></figure> ]]></description>
</item>
<item>
<guid isPermalink="true">https://songlinlife.top/2022/PPT%E5%AD%A6%E4%B9%A0/</guid>
<title>Learning PPT</title>
<link>https://songlinlife.top/2022/PPT%E5%AD%A6%E4%B9%A0/</link>
<category term="琐事" scheme="https://songlinlife.top/categories/ss/" />
<category term="ppt" scheme="https://songlinlife.top/tags/ppt/" />
<pubDate>Sun, 01 May 2022 17:27:34 +0800</pubDate>
<description><![CDATA[ ]]></description>
</item>
<item>
<guid isPermalink="true">https://songlinlife.top/2022/MIT6-s081-lab7thread/</guid>
<title>MIT6.s081: lab7thread</title>
<link>https://songlinlife.top/2022/MIT6-s081-lab7thread/</link>
<category term="linux" scheme="https://songlinlife.top/categories/linux/" />
<category term="MIT" scheme="https://songlinlife.top/tags/MIT/" />
<pubDate>Sun, 01 May 2022 12:02:54 +0800</pubDate>
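<description><![CDATA[ <p>For this lab I only did the <code>Uthread</code> part. I skipped the later <code>pthread</code> part, partly out of laziness and partly because I rarely use Pthreads nowadays; I mostly use <code>OpenMP</code> for multithreaded programming.</p>
<p>The <code>thread</code> data structure:</p>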
<figure class="highlight c"><figcaption data-lang="c"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">struct</span> <span class="token class-name">thread</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="2"></td><td><pre> <span class="token keyword">char</span> stack<span class="token punctuation">[</span>STACK_SIZE<span class="token punctuation">]</span><span class="token punctuation">;</span> <span class="token comment">/* the thread's stack */</span></pre></td></tr><tr><td data-num="3"></td><td><pre> <span class="token keyword">int</span> state<span class="token punctuation">;</span> <span class="token comment">/* FREE, RUNNING, RUNNABLE */</span></pre></td></tr><tr><td data-num="4"></td><td><pre> <span class="token keyword">char</span> context<span class="token punctuation">[</span><span class="token number">4096</span><span class="token punctuation">]</span><span class="token punctuation">;</span> <span class="token comment">// for context save: context is effectively used as a char * here, i.e. a 4096-byte region of memory</span></pre></td></tr><tr><td data-num="5"></td><td><pre><span class="token punctuation">&#125;</span><span class="token punctuation">;</span></pre></td></tr></table></figure><p><code>thread_create</code> :</p>
<figure class="highlight c"><figcaption data-lang="c"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">void</span> </pre></td></tr><tr><td data-num="2"></td><td><pre><span class="token function">thread_create</span><span class="token punctuation">(</span><span class="token keyword">void</span> <span class="token punctuation">(</span><span class="token operator">*</span>func<span class="token punctuation">)</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="3"></td><td><pre><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="4"></td><td><pre> <span class="token keyword">struct</span> <span class="token class-name">thread</span> <span class="token operator">*</span>t<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="5"></td><td><pre></pre></td></tr><tr><td data-num="6"></td><td><pre> <span class="token keyword">for</span> <span class="token punctuation">(</span>t <span class="token operator">=</span> all_thread<span class="token punctuation">;</span> t <span class="token operator">&lt;</span> all_thread <span class="token operator">+</span> MAX_THREAD<span class="token punctuation">;</span> t<span class="token operator">++</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="7"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>t<span class="token operator">-></span>state <span class="token operator">==</span> FREE<span class="token punctuation">)</span> <span class="token keyword">break</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="8"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="9"></td><td><pre> t<span class="token operator">-></span>state <span class="token operator">=</span> 
RUNNABLE<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="10"></td><td><pre> </pre></td></tr><tr><td data-num="11"></td><td><pre> <span class="token comment">// YOUR CODE HERE</span></pre></td></tr><tr><td data-num="12"></td><td><pre> <span class="token operator">*</span><span class="token punctuation">(</span>uint64<span class="token operator">*</span><span class="token punctuation">)</span><span class="token punctuation">(</span>t<span class="token operator">-></span>context<span class="token punctuation">)</span> <span class="token operator">=</span> <span class="token punctuation">(</span>uint64<span class="token punctuation">)</span>func<span class="token punctuation">;</span> <span class="token comment">// ra slot: the first switch into this thread returns into func</span></pre></td></tr><tr><td data-num="13"></td><td><pre> <span class="token operator">*</span><span class="token punctuation">(</span>uint64<span class="token operator">*</span><span class="token punctuation">)</span><span class="token punctuation">(</span>t<span class="token operator">-></span>context <span class="token operator">+</span> <span class="token number">8</span><span class="token punctuation">)</span> <span class="token operator">=</span> <span class="token punctuation">(</span>uint64<span class="token punctuation">)</span><span class="token punctuation">(</span>t<span class="token operator">-></span>stack<span class="token operator">+</span>STACK_SIZE<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// sp slot: top of this thread's stack</span></pre></td></tr><tr><td data-num="14"></td><td><pre><span class="token punctuation">&#125;</span></pre></td></tr></table></figure><p>Thread scheduling (<code>thread_schedule</code>):</p>
<figure class="highlight c"><figcaption data-lang="c"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">void</span> </pre></td></tr><tr><td data-num="2"></td><td><pre><span class="token function">thread_schedule</span><span class="token punctuation">(</span><span class="token keyword">void</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="3"></td><td><pre><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="4"></td><td><pre> <span class="token keyword">struct</span> <span class="token class-name">thread</span> <span class="token operator">*</span>t<span class="token punctuation">,</span> <span class="token operator">*</span>next_thread<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="5"></td><td><pre></pre></td></tr><tr><td data-num="6"></td><td><pre> <span class="token comment">/* Find another runnable thread. */</span></pre></td></tr><tr><td data-num="7"></td><td><pre> next_thread <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="8"></td><td><pre> t <span class="token operator">=</span> current_thread <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="9"></td><td><pre> <span class="token keyword">for</span><span class="token punctuation">(</span><span class="token keyword">int</span> i <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> i <span class="token operator">&lt;</span> MAX_THREAD<span class="token punctuation">;</span> i<span class="token operator">++</span><span class="token punctuation">)</span><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="10"></td><td><pre> <span class="token keyword">if</span><span class="token punctuation">(</span>t <span class="token 
operator">>=</span> all_thread <span class="token operator">+</span> MAX_THREAD<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="11"></td><td><pre> t <span class="token operator">=</span> all_thread<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="12"></td><td><pre> <span class="token keyword">if</span><span class="token punctuation">(</span>t<span class="token operator">-></span>state <span class="token operator">==</span> RUNNABLE<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="13"></td><td><pre> next_thread <span class="token operator">=</span> t<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="14"></td><td><pre> <span class="token keyword">break</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="15"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="16"></td><td><pre> t <span class="token operator">=</span> t <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="17"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="18"></td><td><pre></pre></td></tr><tr><td data-num="19"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>next_thread <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="20"></td><td><pre> <span class="token function">printf</span><span class="token punctuation">(</span><span class="token string">"thread_schedule: no runnable threads\n"</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="21"></td><td><pre> <span class="token function">exit</span><span 
class="token punctuation">(</span><span class="token operator">-</span><span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="22"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="23"></td><td><pre></pre></td></tr><tr><td data-num="24"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>current_thread <span class="token operator">!=</span> next_thread<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span> <span class="token comment">/* switch threads? */</span></pre></td></tr><tr><td data-num="25"></td><td><pre> next_thread<span class="token operator">-></span>state <span class="token operator">=</span> RUNNING<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="26"></td><td><pre> t <span class="token operator">=</span> current_thread<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="27"></td><td><pre> current_thread <span class="token operator">=</span> next_thread<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="28"></td><td><pre> <span class="token function">thread_switch</span><span class="token punctuation">(</span><span class="token punctuation">(</span>uint64<span class="token punctuation">)</span>t<span class="token operator">-></span>context<span class="token punctuation">,</span> <span class="token punctuation">(</span>uint64<span class="token punctuation">)</span>current_thread<span class="token operator">-></span>context<span class="token punctuation">)</span><span class="token punctuation">;</span> </pre></td></tr><tr><td data-num="29"></td><td><pre> <span class="token comment">/* YOUR CODE HERE</pre></td></tr><tr><td data-num="30"></td><td><pre> * Invoke thread_switch to switch from t to next_thread:</pre></td></tr><tr><td data-num="31"></td><td><pre> * thread_switch(??, 
??);</pre></td></tr><tr><td data-num="32"></td><td><pre> */</span></pre></td></tr><tr><td data-num="33"></td><td><pre></pre></td></tr><tr><td data-num="34"></td><td><pre> <span class="token punctuation">&#125;</span> <span class="token keyword">else</span></pre></td></tr><tr><td data-num="35"></td><td><pre> next_thread <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="36"></td><td><pre><span class="token punctuation">&#125;</span></pre></td></tr></table></figure><p>The switch function (<code>thread_switch</code>):</p>
<pre><code class="language-asm"> .globl thread_switch
thread_switch:
/* YOUR CODE HERE */
sd ra, 0(a0)
sd sp, 8(a0)
sd s0, 16(a0)
sd s1, 24(a0)
sd s2, 32(a0)
sd s3, 40(a0)
sd s4, 48(a0)
sd s5, 56(a0)
sd s6, 64(a0)
sd s7, 72(a0)
sd s8, 80(a0)
sd s9, 88(a0)
sd s10, 96(a0)
sd s11, 104(a0)
ld ra, 0(a1)
ld sp, 8(a1)
ld s0, 16(a1)
ld s1, 24(a1)
ld s2, 32(a1)
ld s3, 40(a1)
ld s4, 48(a1)
ld s5, 56(a1)
ld s6, 64(a1)
ld s7, 72(a1)
ld s8, 80(a1)
ld s9, 88(a1)
ld s10, 96(a1)
ld s11, 104(a1)
ret /* return to ra */
</code></pre>
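<p>The most interesting point is that, at creation time, we set the saved values of the ra and sp registers by hand. Manipulating memory directly through pointers like this really does give a remarkable amount of freedom.</p>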
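<p>To make the layout concrete, here is a small Python model of the context buffer. It only mirrors the offsets used by the assembly above (ra at 0, sp at 8, s0 to s11 at 16 to 104); <code>make_context</code> and <code>read_register</code> are hypothetical helpers, and the addresses in the usage note are made up.</p>

```python
import struct

CONTEXT_SIZE = 4096
# Same order as the sd/ld instructions in thread_switch.
REGISTERS = ["ra", "sp"] + [f"s{i}" for i in range(12)]
OFFSETS = {name: 8 * i for i, name in enumerate(REGISTERS)}


def make_context(func_addr, stack_base, stack_size):
    """Model of thread_create: fill in only the ra and sp slots."""
    ctx = bytearray(CONTEXT_SIZE)
    # ra: where the first `ret` of thread_switch will jump to.
    struct.pack_into("<Q", ctx, OFFSETS["ra"], func_addr)
    # sp: the stack grows down, so start at the end of the stack array.
    struct.pack_into("<Q", ctx, OFFSETS["sp"], stack_base + stack_size)
    return ctx


def read_register(ctx, name):
    """Read one saved 64-bit register from the context buffer."""
    return struct.unpack_from("<Q", ctx, OFFSETS[name])[0]
```

<p>For example, <code>make_context(0x1000, 0x8000, 8192)</code> stores 0x1000 in the ra slot and 0xA000 in the sp slot. Only the first 112 bytes of the 4096-byte buffer are ever touched.</p>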
]]></description>
</item>
<item>
<guid isPermalink="true">https://songlinlife.top/2022/CPP%E5%AD%A6%E4%B9%A0%EF%BC%9Aday1/</guid>
<title>Learning C++: day 1</title>
<link>https://songlinlife.top/2022/CPP%E5%AD%A6%E4%B9%A0%EF%BC%9Aday1/</link>
<category term="cpp" scheme="https://songlinlife.top/categories/cpp/" />
<pubDate>Fri, 29 Apr 2022 19:33:20 +0800</pubDate>
<description><![CDATA[ <h3 id="模板"><a class="anchor" href="#模板">#</a> Templates</h3>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220429200336968.png" alt="image-20220429200336968" /></p>
<h3 id="openmp"><a class="anchor" href="#openmp">#</a> OpenMP</h3>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220429222521347.png" alt="image-20220429222521347" /></p>
<p>The <code>for</code> and <code>single</code> constructs add an implicit <code>barrier</code> automatically, so you need to</p>
]]></description>
</item>
<item>
<guid isPermalink="true">https://songlinlife.top/2022/ANNS%E5%B8%B8%E7%94%A8%E7%9A%84%E6%95%B0%E6%8D%AE%E9%9B%86/</guid>
<title>Commonly used ANNS datasets</title>
<link>https://songlinlife.top/2022/ANNS%E5%B8%B8%E7%94%A8%E7%9A%84%E6%95%B0%E6%8D%AE%E9%9B%86/</link>
<category term="数据库" scheme="https://songlinlife.top/categories/%E6%95%B0%E6%8D%AE%E5%BA%93/" />
<category term="ANNS" scheme="https://songlinlife.top/tags/ANNS/" />
<pubDate>Fri, 29 Apr 2022 14:10:54 +0800</pubDate>
<description><![CDATA[ <p>This post records the datasets I have used, or may use, while studying ANNS, together with their parameter settings and how they are stored.</p>
<h3 id="sift数据集"><a class="anchor" href="#sift数据集">#</a> SIFT dataset</h3>
<p>Official site: <span class="exturl" data-url="aHR0cDovL2NvcnB1cy10ZXhtZXguaXJpc2EuZnIv">http://corpus-texmex.irisa.fr/</span></p>
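<p>As a note on storage, the base and query vectors on that site are distributed as <code>.fvecs</code> files: each vector is a 4-byte little-endian integer giving the dimension, followed by that many 4-byte floats. The reader below is a minimal sketch that uses only the standard library; <code>read_fvecs</code> and <code>write_fvecs</code> are hypothetical helper names, and for large files a memory-mapped array would be the more practical choice.</p>

```python
import struct


def read_fvecs(path):
    """Read every vector from an .fvecs file into a list of float lists."""
    vectors = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break  # clean end of file
            if len(header) != 4:
                raise ValueError("truncated dimension header")
            (dim,) = struct.unpack("<i", header)
            payload = f.read(4 * dim)
            if len(payload) != 4 * dim:
                raise ValueError("truncated vector")
            vectors.append(list(struct.unpack(f"<{dim}f", payload)))
    return vectors


def write_fvecs(path, vectors):
    """Write vectors in the same layout, mainly useful for small tests."""
    with open(path, "wb") as f:
        for v in vectors:
            f.write(struct.pack("<i", len(v)))
            f.write(struct.pack(f"<{len(v)}f", *v))
```

<p>Writing a couple of vectors with <code>write_fvecs</code> and reading them back with <code>read_fvecs</code> is a quick way to check the layout before pointing the reader at the real files.</p>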
]]></description>
</item>
<item>
<guid isPermalink="true">https://songlinlife.top/2022/linux/MIT6-s081-%E5%A4%9A%E7%BA%BF%E7%A8%8B/</guid>
<title>MIT6.s081: Multithreading</title>
<link>https://songlinlife.top/2022/linux/MIT6-s081-%E5%A4%9A%E7%BA%BF%E7%A8%8B/</link>
<category term="linux" scheme="https://songlinlife.top/categories/linux/" />
<pubDate>Fri, 29 Apr 2022 10:27:19 +0800</pubDate>
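<description><![CDATA[ <h3 id="线程的概念"><a class="anchor" href="#线程的概念">#</a> The concept of a thread</h3>
<p>To be honest, even now I cannot say I fully understand threads. <strong>A process is the smallest unit of system resource allocation</strong>, and <strong>a thread is the smallest unit of CPU execution and scheduling</strong>. In essence a thread is the state of a set of registers, an abstraction the operating system builds over register state. In xv6 each process has only one page of memory for its stack, so each process has exactly one thread, and switching threads is equivalent to switching processes. In xv6 all kernel threads share memory, while user processes are completely isolated from one another in memory. A process switch looks a lot like the trap mechanism covered earlier; the difference is that a trap returns to the same process when it finishes, whereas a switch moves to a different process.</p>
<p>I also spent a long time wondering how a kernel thread and a user process relate to each other. From the CPU's point of view they are probably much the same: they share the same pid and have much the same proc data structure. From the point of view of the kernel thread and the user process themselves, however, things are completely different: they have different privileges, different data structures, and different address spaces. A trap switches the <code>satp</code> register, which changes the entire address space: <code>whole world changed</code> . Yet both still run on the same CPU.</p>
<p>The biggest difference between a process switch and a trap is that a user process can only be switched out after it has entered kernel space. The kernel then has to save the context of the current kernel thread and switch to the scheduler thread, and the scheduler thread switches to another RUNNABLE thread. That thread then returns to user space, which completes the switch between user threads.</p>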
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220429105242330.png" alt="image-20220429105242330" /></p>
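<p>Multithreading in Linux lets one user process have several threads that share the process's memory. These threads can be thought of as multiple processes that share the same memory, either by using the same page table or by having page tables that point to the same physical addresses. Either way, for these kernel-scheduled threads the switch cannot happen in user mode: execution has to enter the kernel, the kernel switches to the scheduler thread, and the scheduler switches to a RUNNABLE thread. So there seems to be no guarantee that the threads of one process will run on different CPUs?</p>
<h3 id="scheduler时钟中断"><a class="anchor" href="#scheduler时钟中断">#</a> Scheduler (timer interrupt)</h3>
<p>Start with <code>usertrap</code> . It contains the following code: if the trap turns out to be a timer interrupt, it gives up the CPU.</p>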
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220429114013490.png" alt="image-20220429114013490" /></p>
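<p>Inside <code>yield</code> , the function first acquires the lock and then changes the current process's state from RUNNING to RUNNABLE. The lock prevents another CPU core from picking up this process, because although the process has declared that it is no longer running, it is in fact still executing.</p>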
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220429114037227.png" alt="image-20220429114037227" /></p>
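<p>Moving on into <code>sched</code> and ignoring the sanity checks, <code>swtch</code> is the core: it saves the current thread's context and loads the scheduler's context. Note that the scheduler's context is stored directly in the cpu structure, because scheduling is done per CPU.</p>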
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220429114233882.png" alt="image-20220429114233882" /></p>
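<p>The swtch function saves only ra, sp and the callee-saved registers. swtch is effectively a function call, so it can be handled like one: the compiler already saves the caller-saved registers around the call, and swtch only needs to save the ones the compiler does not.</p>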
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220429114338234.png" alt="image-20220429114338234" /></p>
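<p>In the scheduler function we can see that swtch has returned and scheduler carries on: it finds the next RUNNABLE proc and performs the swtch. Always keep in mind that swtch behaves like a function call. Kernel threads share memory and use the same page table; what differs is that each process has its own kernel stack.</p>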
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220429114726849.png" alt="image-20220429114726849" /></p>
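<p>The yield / scheduler round trip can be imitated with Python generators. This is only an analogy, not xv6 code: each generator plays a kernel thread, its <code>yield</code> stands in for yield() → sched() → swtch(), and <code>scheduler</code> and <code>worker</code> are hypothetical names.</p>

```python
from collections import deque


def worker(name, steps, log):
    """A toy thread: do one unit of work, then give up the CPU."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # stands in for yield() -> sched() -> swtch() to the scheduler


def scheduler(threads):
    """Round-robin over RUNNABLE threads until all of them have finished."""
    runnable = deque(threads)
    while runnable:
        t = runnable.popleft()  # pick the next RUNNABLE thread
        try:
            next(t)             # "swtch" into it; returns when it yields
            runnable.append(t)  # it yielded, so it is RUNNABLE again
        except StopIteration:
            pass                # the thread ran to completion
```

<p>Just as in xv6, control always passes through the scheduler: a thread never switches to another thread directly.</p>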
]]></description>
</item>
<item>
<guid isPermalink="true">https://songlinlife.top/2022/NSG-%E6%BA%90%E7%A0%81%E9%98%85%E8%AF%BB/</guid>
<title>NSG source code reading</title>
<link>https://songlinlife.top/2022/NSG-%E6%BA%90%E7%A0%81%E9%98%85%E8%AF%BB/</link>
<category term="数据库" scheme="https://songlinlife.top/categories/%E6%95%B0%E6%8D%AE%E5%BA%93/" />
<category term="ANNS" scheme="https://songlinlife.top/tags/ANNS/" />
<pubDate>Thu, 28 Apr 2022 17:38:22 +0800</pubDate>
<description><![CDATA[ <h3 id="代码阅读"><a class="anchor" href="#代码阅读">#</a> Code reading</h3>
<p>I have finished reading the Index part, and it is quite interesting. Here is a flowchart of the indexing process:</p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesNSG_Index.png" alt="NSG_Index" /></p>
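<p>NSG's search code contains some very interesting optimizations that the paper does not mention.</p>
<p>1) For each point, its full-precision vector and its neighbor list are stored together, which keeps memory locality as high as possible.</p>
<p>2) <code>_mm_prefetch</code> is used to prefetch data.</p>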
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token function">_mm_prefetch</span><span class="token punctuation">(</span>opt_graph_ <span class="token operator">+</span> node_size <span class="token operator">*</span> id<span class="token punctuation">,</span> _MM_HINT_T0<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// prefetch into all cache levels</span></pre></td></tr></table></figure><p>What <code>_mm_prefetch</code> does is simple: it fetches the data at the given address into the requested cache level, one <code>cache line</code> at a time. <code>_MM_HINT_T0</code> means the line is loaded into all cache levels.</p>
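<p>The first optimization, keeping a point's vector and its neighbor list in one contiguous record, can be modelled as below. This is a simplified sketch rather than NSG's actual layout: <code>pack_nodes</code> and <code>read_node</code> are hypothetical names, and each record is assumed to be [vector floats][neighbor count][padded neighbor ids]. The point is that a node's address is just <code>node_size * id</code>, which is what makes the prefetch above possible.</p>

```python
import struct


def pack_nodes(vectors, neighbors, max_degree):
    """Pack each node as: dim floats, neighbor count, max_degree ids."""
    dim = len(vectors[0])
    node_fmt = f"<{dim}fI{max_degree}I"
    node_size = struct.calcsize(node_fmt)
    buf = bytearray(node_size * len(vectors))
    for i, (vec, nbrs) in enumerate(zip(vectors, neighbors)):
        # Pad the neighbor list so every record has the same size.
        padded = list(nbrs) + [0] * (max_degree - len(nbrs))
        struct.pack_into(node_fmt, buf, node_size * i,
                         *vec, len(nbrs), *padded)
    return buf, node_size


def read_node(buf, node_size, dim, node_id):
    """Fetch vector and neighbors of one node from a single record."""
    base = node_size * node_id  # same arithmetic as opt_graph_ + node_size * id
    vec = struct.unpack_from(f"<{dim}f", buf, base)
    (count,) = struct.unpack_from("<I", buf, base + 4 * dim)
    nbrs = struct.unpack_from(f"<{count}I", buf, base + 4 * dim + 4)
    return list(vec), list(nbrs)
```

<p>With 2-dimensional vectors and <code>max_degree=2</code> , every record is 20 bytes, so computing the distance to a node and then expanding its neighbors touches one small region of memory instead of two separate arrays.</p>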
<h3 id="search算法实现"><a class="anchor" href="#search算法实现">#</a> Search algorithm implementation</h3>
<p>A record of how the <code>search</code> algorithm is implemented:</p>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">void</span> <span class="token class-name">IndexNSG</span><span class="token double-colon punctuation">::</span><span class="token function">get_neighbors</span><span class="token punctuation">(</span><span class="token keyword">const</span> <span class="token keyword">float</span> <span class="token operator">*</span>query<span class="token punctuation">,</span> <span class="token keyword">const</span> Parameters <span class="token operator">&amp;</span>parameter<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="2"></td><td><pre> std<span class="token double-colon punctuation">::</span>vector<span class="token operator">&lt;</span>Neighbor<span class="token operator">></span> <span class="token operator">&amp;</span>retset<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="3"></td><td><pre> std<span class="token double-colon punctuation">::</span>vector<span class="token operator">&lt;</span>Neighbor<span class="token operator">></span> <span class="token operator">&amp;</span>fullset<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="4"></td><td><pre> <span class="token keyword">unsigned</span> L <span class="token operator">=</span> parameter<span class="token punctuation">.</span><span class="token generic-function"><span class="token function">Get</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">unsigned</span><span class="token operator">></span></span></span><span class="token punctuation">(</span><span class="token string">"L"</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="5"></td><td><pre></pre></td></tr><tr><td data-num="6"></td><td><pre> retset<span class="token 
punctuation">.</span><span class="token function">resize</span><span class="token punctuation">(</span>L <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="7"></td><td><pre> std<span class="token double-colon punctuation">::</span>vector<span class="token operator">&lt;</span><span class="token keyword">unsigned</span><span class="token operator">></span> <span class="token function">init_ids</span><span class="token punctuation">(</span>L<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="8"></td><td><pre> <span class="token comment">// initializer_->Search(query, nullptr, L, parameter, init_ids.data());</span></pre></td></tr><tr><td data-num="9"></td><td><pre></pre></td></tr><tr><td data-num="10"></td><td><pre> boost<span class="token double-colon punctuation">::</span>dynamic_bitset<span class="token operator">&lt;</span><span class="token operator">></span> flags<span class="token punctuation">&#123;</span>nd_<span class="token punctuation">,</span> <span class="token number">0</span><span class="token punctuation">&#125;</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="11"></td><td><pre> L <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="12"></td><td><pre> <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">unsigned</span> i <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> i <span class="token operator">&lt;</span> init_ids<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token 
operator">&amp;&amp;</span> i <span class="token operator">&lt;</span> final_graph_<span class="token punctuation">[</span>ep_<span class="token punctuation">]</span><span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="13"></td><td><pre> i<span class="token operator">++</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="14"></td><td><pre> init_ids<span class="token punctuation">[</span>i<span class="token punctuation">]</span> <span class="token operator">=</span> final_graph_<span class="token punctuation">[</span>ep_<span class="token punctuation">]</span><span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">;</span> <span class="token comment">// the init phase picks a random ep_</span></pre></td></tr><tr><td data-num="15"></td><td><pre> flags<span class="token punctuation">[</span>init_ids<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token boolean">true</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="16"></td><td><pre> L<span class="token operator">++</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="17"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="18"></td><td><pre> <span class="token keyword">while</span> <span class="token punctuation">(</span></pre></td></tr><tr><td data-num="19"></td><td><pre> L <span class="token operator">&lt;</span></pre></td></tr><tr><td data-num="20"></td><td><pre> init_ids<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span 
class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span> <span class="token comment">// if there are fewer than L ids, add random points until the candidate pool reaches L</span></pre></td></tr><tr><td data-num="21"></td><td><pre> <span class="token keyword">unsigned</span> id <span class="token operator">=</span> <span class="token function">rand</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">%</span> nd_<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="22"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>flags<span class="token punctuation">[</span>id<span class="token punctuation">]</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="23"></td><td><pre> <span class="token keyword">continue</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="24"></td><td><pre> init_ids<span class="token punctuation">[</span>L<span class="token punctuation">]</span> <span class="token operator">=</span> id<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="25"></td><td><pre> L<span class="token operator">++</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="26"></td><td><pre> flags<span class="token punctuation">[</span>id<span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token boolean">true</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="27"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="28"></td><td><pre></pre></td></tr><tr><td data-num="29"></td><td><pre> L <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="30"></td><td><pre> <span class="token keyword">for</span> <span class="token 
punctuation">(</span><span class="token keyword">unsigned</span> i <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> i <span class="token operator">&lt;</span> init_ids<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> i<span class="token operator">++</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="31"></td><td><pre> <span class="token keyword">unsigned</span> id <span class="token operator">=</span> init_ids<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="32"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>id <span class="token operator">>=</span> nd_<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="33"></td><td><pre> <span class="token keyword">continue</span><span class="token punctuation">;</span> <span class="token comment">// what is the purpose of this check?</span></pre></td></tr><tr><td data-num="34"></td><td><pre> <span class="token comment">// std::cout&lt;&lt;id&lt;&lt;std::endl;</span></pre></td></tr><tr><td data-num="35"></td><td><pre> <span class="token keyword">float</span> dist <span class="token operator">=</span> distance_<span class="token operator">-></span><span class="token function">compare</span><span class="token punctuation">(</span>data_ <span class="token operator">+</span> dimension_ <span class="token operator">*</span> <span class="token punctuation">(</span>size_t<span class="token punctuation">)</span>id<span class="token punctuation">,</span> query<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="36"></td><td><pre> <span class="token punctuation">(</span><span class="token 
keyword">unsigned</span><span class="token punctuation">)</span>dimension_<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// compute the distance to the query</span></pre></td></tr><tr><td data-num="37"></td><td><pre> retset<span class="token punctuation">[</span>i<span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token function">Neighbor</span><span class="token punctuation">(</span>id<span class="token punctuation">,</span> dist<span class="token punctuation">,</span> <span class="token boolean">true</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="38"></td><td><pre> <span class="token comment">// flags[id] = 1;</span></pre></td></tr><tr><td data-num="39"></td><td><pre> L<span class="token operator">++</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="40"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="41"></td><td><pre></pre></td></tr><tr><td data-num="42"></td><td><pre> std<span class="token double-colon punctuation">::</span><span class="token function">sort</span><span class="token punctuation">(</span>retset<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> retset<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span> L<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="43"></td><td><pre> <span class="token keyword">int</span> k <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span></pre></td></tr><tr><td 
data-num="44"></td><td><pre> <span class="token comment">// I have to say, this code is really messy</span></pre></td></tr><tr><td data-num="45"></td><td><pre> <span class="token keyword">while</span> <span class="token punctuation">(</span>k <span class="token operator">&lt;</span> <span class="token punctuation">(</span><span class="token keyword">int</span><span class="token punctuation">)</span>L<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="46"></td><td><pre> <span class="token keyword">int</span> nk <span class="token operator">=</span> L<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="47"></td><td><pre></pre></td></tr><tr><td data-num="48"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>retset<span class="token punctuation">[</span>k<span class="token punctuation">]</span><span class="token punctuation">.</span>flag<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span> <span class="token comment">// prevent backtracking: each candidate is expanded only once</span></pre></td></tr><tr><td data-num="49"></td><td><pre> retset<span class="token punctuation">[</span>k<span class="token punctuation">]</span><span class="token punctuation">.</span>flag <span class="token operator">=</span> <span class="token boolean">false</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="50"></td><td><pre> <span class="token keyword">unsigned</span> n <span class="token operator">=</span> retset<span class="token punctuation">[</span>k<span class="token punctuation">]</span><span class="token punctuation">.</span>id<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="51"></td><td><pre></pre></td></tr><tr><td data-num="52"></td><td><pre> <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">unsigned</span> m <span class="token operator">=</span> <span class="token number">0</span><span 
class="token punctuation">;</span> m <span class="token operator">&lt;</span> final_graph_<span class="token punctuation">[</span>n<span class="token punctuation">]</span><span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token operator">++</span>m<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="53"></td><td><pre> <span class="token keyword">unsigned</span> id <span class="token operator">=</span> final_graph_<span class="token punctuation">[</span>n<span class="token punctuation">]</span><span class="token punctuation">[</span>m<span class="token punctuation">]</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="54"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>flags<span class="token punctuation">[</span>id<span class="token punctuation">]</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="55"></td><td><pre> <span class="token keyword">continue</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="56"></td><td><pre> flags<span class="token punctuation">[</span>id<span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token number">1</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="57"></td><td><pre></pre></td></tr><tr><td data-num="58"></td><td><pre> <span class="token keyword">float</span> dist <span class="token operator">=</span> distance_<span class="token operator">-></span><span class="token function">compare</span><span class="token punctuation">(</span>query<span class="token punctuation">,</span> data_ <span class="token operator">+</span> dimension_ <span class="token operator">*</span> <span class="token 
punctuation">(</span>size_t<span class="token punctuation">)</span>id<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="59"></td><td><pre> <span class="token punctuation">(</span><span class="token keyword">unsigned</span><span class="token punctuation">)</span>dimension_<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="60"></td><td><pre> Neighbor <span class="token function">nn</span><span class="token punctuation">(</span>id<span class="token punctuation">,</span> dist<span class="token punctuation">,</span> <span class="token boolean">true</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="61"></td><td><pre> fullset<span class="token punctuation">.</span><span class="token function">push_back</span><span class="token punctuation">(</span>nn<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="62"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>dist <span class="token operator">>=</span> retset<span class="token punctuation">[</span>L <span class="token operator">-</span> <span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">.</span>distance<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="63"></td><td><pre> <span class="token keyword">continue</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="64"></td><td><pre> <span class="token keyword">int</span> r <span class="token operator">=</span> <span class="token function">InsertIntoPool</span><span class="token punctuation">(</span>retset<span class="token punctuation">.</span><span class="token function">data</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> L<span class="token 
punctuation">,</span> nn<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="65"></td><td><pre></pre></td></tr><tr><td data-num="66"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>L <span class="token operator">+</span> <span class="token number">1</span> <span class="token operator">&lt;</span> retset<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="67"></td><td><pre> <span class="token operator">++</span>L<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="68"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>r <span class="token operator">&lt;</span> nk<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="69"></td><td><pre> nk <span class="token operator">=</span> r<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="70"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="71"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="72"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>nk <span class="token operator">&lt;=</span> k<span class="token punctuation">)</span> <span class="token comment">// a newly inserted point is closer to the query than the current point</span></pre></td></tr><tr><td data-num="73"></td><td><pre> k <span class="token operator">=</span> nk<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="74"></td><td><pre> <span class="token keyword">else</span></pre></td></tr><tr><td data-num="75"></td><td><pre> <span class="token operator">++</span>k<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="76"></td><td><pre> 
<span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="77"></td><td><pre><span class="token punctuation">&#125;</span></pre></td></tr></table></figure><h3 id="总结"><a class="anchor" href="#总结">#</a> Summary</h3>
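<p>The <code>NSG</code> code is the first codebase I have ever read through, and the implementation is quite elegant. At first I knew nothing about SIMD or OMP, so reading it was painful. After filling in that background, I finally spent a day and finished reading the implementation.</p>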
]]></description>
</item>
<item>
<guid isPermalink="true">https://songlinlife.top/2022/MIT6-s081-lab5-COW/</guid>
<title>MIT6.s081: lab5 COW</title>
<link>https://songlinlife.top/2022/MIT6-s081-lab5-COW/</link>
<category term="linux" scheme="https://songlinlife.top/categories/linux/" />
<category term="MIT" scheme="https://songlinlife.top/tags/MIT/" />
<pubDate>Thu, 28 Apr 2022 16:06:57 +0800</pubDate>
<description><![CDATA[ <h3 id="cow"><a class="anchor" href="#cow">#</a> COW</h3>
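<p><code>cow</code> is short for <code>copy on write</code>. When a process creates a child with fork, the child's data, file descriptors, and other data structures are identical to the parent's, so all of the parent's data has to be copied into the child, which means allocating as much memory as the parent uses. But consider the <code>sh</code> process: it first forks a child, and the child then calls exec, which discards the process's data and loads new data and instructions. The copy made by fork is then wasted, and that is the motivation for cow.</p>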
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220428161229686.png" alt="image-20220428161229686" /></p>
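<p>With cow, fork does not allocate new memory; instead, the child's page table entries are mapped to the parent's physical pages. This saves the cost of allocating new space and copying data, but it raises a problem: although the child is forked from the parent, the two must stay isolated, so a write to the parent's data must not be visible to the child, and vice versa. We therefore mark the page table entries of both the parent and the child as read-only.</p>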
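<p>The flag change made at fork time can be sketched as a small user-space function. This is only an illustration, not the lab's <code>uvmcopy</code>: <code>cow_share_pte</code> is a hypothetical helper name, and the bit positions assume the RISC-V Sv39 PTE layout that xv6 uses, with <code>PTE_COW</code> placed in bit 8 (one of the bits reserved for software), as in the macro defined later in this post.</p>

```c
#include <assert.h>
#include <stdint.h>

/* PTE flag bits, assuming the RISC-V Sv39 layout used by xv6.
   PTE_COW lives in bit 8, which the hardware leaves to software. */
#define PTE_V   ((uint64_t)1 << 0)
#define PTE_R   ((uint64_t)1 << 1)
#define PTE_W   ((uint64_t)1 << 2)
#define PTE_COW ((uint64_t)1 << 8)

/* Hypothetical helper: given the parent's PTE, return the PTE value that
   both parent and child should hold after fork. Only pages that were
   writable become cow pages; pages that were already read-only are
   shared unchanged, so a later write to them is still a real fault. */
uint64_t cow_share_pte(uint64_t pte)
{
    if (pte & PTE_W) {
        pte &= ~PTE_W;   /* drop write permission so the next write traps */
        pte |= PTE_COW;  /* remember that the page used to be writable */
    }
    return pte;
}
```

<p>Setting the cow bit only on pages that were writable lets the fault handler tell a cow write apart from a write to a page that was never writable in the first place.</p>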
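<p>When the child or the parent tries to write to one of these cow pages, a page fault is triggered. The trap handler then allocates a free page and copies the data into it. That completes the copy on write.</p>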
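<p>The fault path can be modelled in user space to make the sequence concrete. The sketch below is a simulation under stated assumptions, not kernel code: <code>struct mapping</code>, <code>cow_fault</code>, and the <code>F_WRITE</code>/<code>F_COW</code> flags are hypothetical names, and <code>malloc</code> stands in for the kernel's page allocator.</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096
#define F_WRITE 0x1u  /* mapping may be written */
#define F_COW   0x2u  /* mapping is a shared copy-on-write page */

/* Hypothetical stand-in for one page table entry: the page it points to
   plus its permission flags. */
struct mapping {
    unsigned char *page;
    unsigned flags;
};

/* Simulated write fault on a mapping. Returns 0 on success, or -1 when
   the mapping is not a cow page or no memory is available (the cases in
   which a kernel would kill the process). */
int cow_fault(struct mapping *m)
{
    if (!(m->flags & F_COW))
        return -1;
    unsigned char *copy = malloc(PAGE_SIZE);  /* the "free page" */
    if (copy == NULL)
        return -1;
    memcpy(copy, m->page, PAGE_SIZE);         /* copy the shared contents */
    m->page = copy;                           /* remap to the private copy */
    m->flags = (m->flags | F_WRITE) & ~F_COW; /* writable, no longer cow */
    return 0;
}
```

<p>After the fault is handled, the faulting side writes to its private copy while the other side keeps seeing the original contents. A real kernel would also drop one reference on the old page at this point, which is where the reference counts come in.</p>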
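<p>One question remains: how should these cow pages be reclaimed? The approach suggested for xv6 is to maintain an array that holds the reference count of each physical page. When a page's ref count drops to 0, the page can be reclaimed.</p>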
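<p>A minimal sketch of that bookkeeping, assuming xv6's <code>KERNBASE</code> of <code>0x80000000</code> and 4096-byte pages: the function names (<code>page_index</code>, <code>page_alloc</code>, <code>page_share</code>, <code>page_release</code>) are hypothetical, the pool is shrunk to 8 pages, and locking is left out, so this shows only the counting logic and not a concurrent implementation.</p>

```c
#include <assert.h>
#include <stdint.h>

#define KERNBASE ((uint64_t)0x80000000)  /* start of physical RAM in xv6 */
#define PGSIZE   4096
#define NPAGES   8                       /* tiny pool, just for the sketch */

static int refcnt[NPAGES];               /* one counter per physical page */

/* Map a physical address to its slot in the counter array. */
static uint64_t page_index(uint64_t pa)
{
    return (pa - KERNBASE) / PGSIZE;
}

/* A freshly allocated page has exactly one owner. */
void page_alloc(uint64_t pa)
{
    refcnt[page_index(pa)] = 1;
}

/* fork shares the page with the child: one more owner. */
void page_share(uint64_t pa)
{
    refcnt[page_index(pa)]++;
}

/* One owner lets go of the page. Returns 1 when the count reaches 0,
   meaning the page can go back on the free list, and 0 otherwise. */
int page_release(uint64_t pa)
{
    refcnt[page_index(pa)]--;
    return refcnt[page_index(pa)] == 0;
}
```

<p>In a kernel, the decrement and the test for zero have to happen under the same lock; otherwise two CPUs releasing the same page at once could both observe 0 and free it twice.</p>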
<h3 id="riscvh-添加宏"><a class="anchor" href="#riscvh-添加宏">#</a> Adding macros to riscv.h</h3>
<figure class="highlight c"><figcaption data-lang="c"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">define</span> <span class="token macro-name">PTE_COW</span> <span class="token expression"><span class="token punctuation">(</span><span class="token number">1L</span> <span class="token operator">&lt;&lt;</span> <span class="token number">8</span><span class="token punctuation">)</span> </span><span class="token comment">// cow flag: marks this page table entry as a cow entry</span></span></pre></td></tr><tr><td data-num="2"></td><td><pre><span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">define</span> <span class="token macro-name function">PA2PGREF_ID</span><span class="token expression"><span class="token punctuation">(</span>pa<span class="token punctuation">)</span> <span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token punctuation">(</span>pa<span class="token punctuation">)</span> <span class="token operator">-</span> KERNBASE<span class="token punctuation">)</span><span class="token operator">/</span>PGSIZE<span class="token punctuation">)</span> </span><span class="token comment">// index into the ref array for this physical address</span></span></pre></td></tr><tr><td data-num="3"></td><td><pre><span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">define</span> <span class="token macro-name">REF_MAX</span> <span class="token expression"><span class="token function">PA2PGREF_ID</span><span class="token punctuation">(</span>PHYSTOP<span class="token punctuation">)</span> </span><span class="token comment">// length of the ref array</span></span></pre></td></tr></table></figure><h3 id="kallocc代码"><a class="anchor" href="#kallocc代码">#</a> kalloc.c code</h3>
<figure class="highlight c"><figcaption data-lang="c"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">struct</span> <span class="token class-name">page_ref</span> <span class="token punctuation">&#123;</span> <span class="token comment">// element type of the ref array</span></pre></td></tr><tr><td data-num="2"></td><td><pre> <span class="token keyword">int</span> cnt<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="3"></td><td><pre> <span class="token keyword">struct</span> <span class="token class-name">spinlock</span> lock<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="4"></td><td><pre><span class="token punctuation">&#125;</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="5"></td><td><pre><span class="token keyword">struct</span> <span class="token class-name">page_ref</span> page_ref_list<span class="token punctuation">[</span>REF_MAX<span class="token punctuation">]</span><span class="token punctuation">;</span> <span class="token comment">// the ref array</span></pre></td></tr><tr><td data-num="6"></td><td><pre></pre></td></tr><tr><td data-num="7"></td><td><pre><span class="token keyword">int</span></pre></td></tr><tr><td data-num="8"></td><td><pre><span class="token function">ref_page</span><span class="token punctuation">(</span>uint64 pa<span class="token punctuation">,</span> <span class="token keyword">int</span> i<span class="token punctuation">)</span><span class="token punctuation">&#123;</span> <span class="token comment">// adjust ref by i and return the updated ref count</span></pre></td></tr><tr><td data-num="9"></td><td><pre> <span class="token function">acquire</span><span class="token punctuation">(</span><span class="token operator">&amp;</span>page_ref_list<span class="token punctuation">[</span><span class="token function">PA2PGREF_ID</span><span class="token punctuation">(</span><span class="token punctuation">(</span>uint64<span class="token punctuation">)</span>pa<span 
class="token punctuation">)</span><span class="token punctuation">]</span><span class="token punctuation">.</span>lock<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="10"></td><td><pre> <span class="token keyword">int</span> cnt <span class="token operator">=</span> <span class="token punctuation">(</span>page_ref_list<span class="token punctuation">[</span><span class="token function">PA2PGREF_ID</span><span class="token punctuation">(</span><span class="token punctuation">(</span>uint64<span class="token punctuation">)</span>pa<span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token punctuation">.</span>cnt <span class="token operator">+=</span> i<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// capture the updated count while the lock is still held</span></pre></td></tr><tr><td data-num="11"></td><td><pre> <span class="token function">release</span><span class="token punctuation">(</span><span class="token operator">&amp;</span>page_ref_list<span class="token punctuation">[</span><span class="token function">PA2PGREF_ID</span><span class="token punctuation">(</span><span class="token punctuation">(</span>uint64<span class="token punctuation">)</span>pa<span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token punctuation">.</span>lock<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="12"></td><td><pre> <span class="token keyword">return</span> cnt<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="13"></td><td><pre><span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="14"></td><td><pre><span class="token keyword">void</span></pre></td></tr><tr><td 
data-num="15"></td><td><pre><span class="token function">kinit</span><span class="token punctuation">(</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="16"></td><td><pre><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="17"></td><td><pre> <span class="token keyword">for</span><span class="token punctuation">(</span><span class="token keyword">int</span> i<span class="token operator">=</span><span class="token number">0</span><span class="token punctuation">;</span> i<span class="token operator">&lt;</span>REF_MAX<span class="token punctuation">;</span> <span class="token operator">++</span>i<span class="token punctuation">)</span><span class="token punctuation">&#123;</span> <span class="token comment">// initialize ref to 1; starting from 0 caused errors (likely because kinit's freerange() passes every page through kfree(), which decrements the count)</span></pre></td></tr><tr><td data-num="18"></td><td><pre> <span class="token function">initlock</span><span class="token punctuation">(</span><span class="token operator">&amp;</span><span class="token punctuation">(</span>page_ref_list<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">.</span>lock<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string">"kpage_ref"</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="19"></td><td><pre> page_ref_list<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">.</span>cnt <span class="token operator">=</span><span class="token number">1</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="20"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="21"></td><td><pre> <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span></pre></td></tr><tr><td 
data-num="22"></td><td><pre><span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="23"></td><td><pre></pre></td></tr><tr><td data-num="24"></td><td><pre><span class="token keyword">void</span></pre></td></tr><tr><td data-num="25"></td><td><pre><span class="token function">kfree</span><span class="token punctuation">(</span><span class="token keyword">void</span> <span class="token operator">*</span>pa<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="26"></td><td><pre><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="27"></td><td><pre> <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span></pre></td></tr><tr><td data-num="28"></td><td><pre> <span class="token keyword">int</span> ref <span class="token operator">=</span> <span class="token function">ref_page</span><span class="token punctuation">(</span><span class="token punctuation">(</span>uint64<span class="token punctuation">)</span>pa<span class="token punctuation">,</span> <span class="token operator">-</span><span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="29"></td><td><pre> <span class="token keyword">if</span><span class="token punctuation">(</span>ref <span class="token operator">></span> <span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="30"></td><td><pre> <span class="token keyword">return</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="31"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="32"></td><td><pre> <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span></pre></td></tr><tr><td data-num="33"></td><td><pre><span 
class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="34"></td><td><pre></pre></td></tr><tr><td data-num="35"></td><td><pre><span class="token keyword">void</span> <span class="token operator">*</span></pre></td></tr><tr><td data-num="36"></td><td><pre><span class="token function">kalloc</span><span class="token punctuation">(</span><span class="token keyword">void</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="37"></td><td><pre><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="38"></td><td><pre> <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span></pre></td></tr><tr><td data-num="39"></td><td><pre> <span class="token keyword">if</span><span class="token punctuation">(</span>r<span class="token punctuation">)</span><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="40"></td><td><pre> <span class="token function">memset</span><span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token keyword">char</span><span class="token operator">*</span><span class="token punctuation">)</span>r<span class="token punctuation">,</span> <span class="token number">5</span><span class="token punctuation">,</span> PGSIZE<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// fill with junk</span></pre></td></tr><tr><td data-num="41"></td><td><pre> <span class="token function">acquire</span><span class="token punctuation">(</span><span class="token operator">&amp;</span>page_ref_list<span class="token punctuation">[</span><span class="token function">PA2PGREF_ID</span><span class="token punctuation">(</span><span class="token punctuation">(</span>uint64<span class="token punctuation">)</span>r<span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token 
punctuation">.</span>lock<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="42"></td><td><pre> page_ref_list<span class="token punctuation">[</span><span class="token function">PA2PGREF_ID</span><span class="token punctuation">(</span><span class="token punctuation">(</span>uint64<span class="token punctuation">)</span>r<span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token punctuation">.</span>cnt <span class="token operator">=</span><span class="token number">1</span><span class="token punctuation">;</span> <span class="token comment">// set to 1 on allocation</span></pre></td></tr><tr><td data-num="43"></td><td><pre> <span class="token function">release</span><span class="token punctuation">(</span><span class="token operator">&amp;</span>page_ref_list<span class="token punctuation">[</span><span class="token function">PA2PGREF_ID</span><span class="token punctuation">(</span><span class="token punctuation">(</span>uint64<span class="token punctuation">)</span>r<span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token punctuation">.</span>lock<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="44"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="45"></td><td><pre> <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span></pre></td></tr><tr><td data-num="46"></td><td><pre><span class="token punctuation">&#125;</span></pre></td></tr></table></figure><h3 id="trapc-代码"><a class="anchor" href="#trapc-代码">#</a> trap.c code</h3>
<figure class="highlight c"><figcaption data-lang="c"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">void</span></pre></td></tr><tr><td data-num="2"></td><td><pre><span class="token function">usertrap</span><span class="token punctuation">(</span><span class="token keyword">void</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="3"></td><td><pre><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="4"></td><td><pre> <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span></pre></td></tr><tr><td data-num="5"></td><td><pre> <span class="token punctuation">&#125;</span><span class="token keyword">else</span> <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token function">r_scause</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">==</span> <span class="token number">13</span> <span class="token operator">||</span> <span class="token function">r_scause</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token operator">==</span><span class="token number">15</span><span class="token punctuation">)</span><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="6"></td><td><pre> <span class="token comment">//cow</span></pre></td></tr><tr><td data-num="7"></td><td><pre> uint64 addr <span class="token operator">=</span> <span class="token function">r_stval</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="8"></td><td><pre> <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token function">cow_check</span><span class="token punctuation">(</span>p<span class="token operator">-></span>pagetable<span 
class="token punctuation">,</span> addr<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="9"></td><td><pre> <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token function">cow</span><span class="token punctuation">(</span>addr<span class="token punctuation">)</span><span class="token operator">==</span><span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="10"></td><td><pre> p<span class="token operator">-></span>killed <span class="token operator">=</span> <span class="token number">1</span><span class="token punctuation">;</span> </pre></td></tr><tr><td data-num="11"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="12"></td><td><pre> <span class="token punctuation">&#125;</span><span class="token keyword">else</span><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="13"></td><td><pre> p<span class="token operator">-></span>killed <span class="token operator">=</span> <span class="token number">1</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="14"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="15"></td><td><pre> <span class="token punctuation">&#125;</span><span class="token keyword">else</span> <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token punctuation">(</span>which_dev <span class="token operator">=</span> <span class="token function">devintr</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">!=</span> <span class="token number">0</span><span class="token punctuation">)</span><span class="token 
punctuation">&#123;</span></pre></td></tr><tr><td data-num="16"></td><td><pre> <span class="token comment">// ok</span></pre></td></tr><tr><td data-num="17"></td><td><pre> <span class="token punctuation">&#125;</span> <span class="token keyword">else</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="18"></td><td><pre> <span class="token function">printf</span><span class="token punctuation">(</span><span class="token string">"usertrap(): unexpected scause %p pid=%d\n"</span><span class="token punctuation">,</span> <span class="token function">r_scause</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> p<span class="token operator">-></span>pid<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="19"></td><td><pre> <span class="token function">printf</span><span class="token punctuation">(</span><span class="token string">" sepc=%p stval=%p\n"</span><span class="token punctuation">,</span> <span class="token function">r_sepc</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">r_stval</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="20"></td><td><pre> p<span class="token operator">-></span>killed <span class="token operator">=</span> <span class="token number">1</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="21"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="22"></td><td><pre><span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="23"></td><td><pre></pre></td></tr><tr><td data-num="24"></td><td><pre></pre></td></tr><tr><td 
data-num="25"></td><td><pre><span class="token keyword">extern</span> <span class="token class-name">pte_t</span><span class="token operator">*</span> <span class="token function">walk</span><span class="token punctuation">(</span><span class="token class-name">pagetable_t</span> pagetable<span class="token punctuation">,</span> uint64 va<span class="token punctuation">,</span> <span class="token keyword">int</span> alloc<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="26"></td><td><pre>uint64 <span class="token function">cow</span><span class="token punctuation">(</span>uint64 addr<span class="token punctuation">)</span><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="27"></td><td><pre> <span class="token class-name">pte_t</span><span class="token operator">*</span> pte<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="28"></td><td><pre> uint64 va <span class="token operator">=</span> <span class="token function">PGROUNDDOWN</span><span class="token punctuation">(</span>addr<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="29"></td><td><pre> <span class="token keyword">struct</span> <span class="token class-name">proc</span><span class="token operator">*</span> p <span class="token operator">=</span> <span class="token function">myproc</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="30"></td><td><pre> pte <span class="token operator">=</span> <span class="token function">walk</span><span class="token punctuation">(</span>p<span class="token operator">-></span>pagetable<span class="token punctuation">,</span> addr<span class="token punctuation">,</span> <span class="token number">0</span><span class="token punctuation">)</span><span class="token 
punctuation">;</span></pre></td></tr><tr><td data-num="31"></td><td><pre> uint64 pa <span class="token operator">=</span> <span class="token function">PTE2PA</span><span class="token punctuation">(</span><span class="token punctuation">(</span>uint64<span class="token punctuation">)</span><span class="token operator">*</span>pte<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="32"></td><td><pre> <span class="token keyword">char</span><span class="token operator">*</span> mem <span class="token operator">=</span> <span class="token function">kalloc</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="33"></td><td><pre> <span class="token keyword">if</span><span class="token punctuation">(</span>mem <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="34"></td><td><pre> <span class="token keyword">return</span> <span class="token number">0</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="35"></td><td><pre> <span class="token punctuation">&#125;</span><span class="token keyword">else</span><span class="token punctuation">&#123;</span> </pre></td></tr><tr><td data-num="36"></td><td><pre> <span class="token function">memset</span><span class="token punctuation">(</span>mem<span class="token punctuation">,</span> <span class="token number">0</span><span class="token punctuation">,</span> PGSIZE<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="37"></td><td><pre> <span class="token function">memmove</span><span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token keyword">void</span><span class="token operator">*</span><span 
class="token punctuation">)</span>mem<span class="token punctuation">,</span> <span class="token punctuation">(</span><span class="token keyword">void</span><span class="token operator">*</span><span class="token punctuation">)</span>pa<span class="token punctuation">,</span> PGSIZE<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="38"></td><td><pre> <span class="token operator">*</span>pte <span class="token operator">=</span> <span class="token operator">*</span>pte <span class="token operator">&amp;</span> <span class="token operator">~</span>PTE_V<span class="token punctuation">;</span> <span class="token comment">// mark this PTE invalid so that mappages does not panic on remap</span></pre></td></tr><tr><td data-num="39"></td><td><pre> uint64 flag <span class="token operator">=</span> <span class="token function">PTE_FLAGS</span><span class="token punctuation">(</span><span class="token operator">*</span>pte<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="40"></td><td><pre> flag <span class="token operator">=</span> flag <span class="token operator">|</span> PTE_W<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="41"></td><td><pre> flag <span class="token operator">=</span> flag <span class="token operator">&amp;</span> <span class="token punctuation">(</span><span class="token operator">~</span>PTE_COW<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="42"></td><td><pre> <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token function">mappages</span><span class="token punctuation">(</span>p<span class="token operator">-></span>pagetable<span class="token punctuation">,</span> va <span class="token punctuation">,</span>PGSIZE <span class="token punctuation">,</span> <span class="token punctuation">(</span>uint64<span class="token
punctuation">)</span>mem<span class="token punctuation">,</span> flag<span class="token punctuation">)</span><span class="token operator">!=</span><span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="43"></td><td><pre> <span class="token function">kfree</span><span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token keyword">void</span><span class="token operator">*</span><span class="token punctuation">)</span>mem<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="44"></td><td><pre> <span class="token keyword">return</span> <span class="token number">0</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="45"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="46"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="47"></td><td><pre> <span class="token function">kfree</span><span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token keyword">void</span><span class="token operator">*</span><span class="token punctuation">)</span><span class="token function">PGROUNDDOWN</span><span class="token punctuation">(</span>pa<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="48"></td><td><pre> <span class="token comment">/* panic("cow"); */</span></pre></td></tr><tr><td data-num="49"></td><td><pre> <span class="token keyword">return</span> <span class="token punctuation">(</span>uint64<span class="token punctuation">)</span>mem<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="50"></td><td><pre><span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="51"></td><td><pre><span 
class="token keyword">int</span> <span class="token function">cow_check</span><span class="token punctuation">(</span><span class="token class-name">pagetable_t</span> pagetable<span class="token punctuation">,</span> uint64 va<span class="token punctuation">)</span><span class="token punctuation">&#123;</span> <span class="token comment">// check whether va maps a valid COW page</span></pre></td></tr><tr><td data-num="52"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>va <span class="token operator">>=</span> MAXVA<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="53"></td><td><pre> <span class="token keyword">return</span> <span class="token number">0</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="54"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="55"></td><td><pre> <span class="token class-name">pte_t</span> <span class="token operator">*</span>pte <span class="token operator">=</span> <span class="token function">walk</span><span class="token punctuation">(</span>pagetable<span class="token punctuation">,</span> va<span class="token punctuation">,</span> <span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="56"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>pte <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="57"></td><td><pre> <span class="token keyword">return</span> <span class="token number">0</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="58"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="59"></td><td><pre> <span class="token
keyword">if</span> <span class="token punctuation">(</span><span class="token punctuation">(</span> <span class="token punctuation">(</span><span class="token operator">*</span>pte<span class="token punctuation">)</span> <span class="token operator">&amp;</span> PTE_V<span class="token punctuation">)</span> <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="60"></td><td><pre> <span class="token keyword">return</span> <span class="token number">0</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="61"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="62"></td><td><pre> <span class="token keyword">return</span> <span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token operator">*</span>pte<span class="token punctuation">)</span> <span class="token operator">&amp;</span> <span class="token punctuation">(</span>PTE_COW<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="63"></td><td><pre><span class="token punctuation">&#125;</span></pre></td></tr></table></figure><h3 id="vmc-代码"><a class="anchor" href="#vmc-代码">#</a> vm.c code</h3>
<figure class="highlight c"><figcaption data-lang="c"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">int</span></pre></td></tr><tr><td data-num="2"></td><td><pre><span class="token function">uvmcopy</span><span class="token punctuation">(</span><span class="token class-name">pagetable_t</span> old<span class="token punctuation">,</span> <span class="token class-name">pagetable_t</span> new<span class="token punctuation">,</span> uint64 sz<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="3"></td><td><pre><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="4"></td><td><pre> <span class="token class-name">pte_t</span> <span class="token operator">*</span>pte<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="5"></td><td><pre> uint64 pa<span class="token punctuation">,</span> i<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="6"></td><td><pre> uint flags<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="7"></td><td><pre> <span class="token comment">/* char *mem; */</span></pre></td></tr><tr><td data-num="8"></td><td><pre></pre></td></tr><tr><td data-num="9"></td><td><pre> <span class="token keyword">for</span><span class="token punctuation">(</span>i <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> i <span class="token operator">&lt;</span> sz<span class="token punctuation">;</span> i <span class="token operator">+=</span> PGSIZE<span class="token punctuation">)</span><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="10"></td><td><pre> <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token punctuation">(</span>pte <span class="token operator">=</span> <span class="token function">walk</span><span class="token punctuation">(</span>old<span class="token punctuation">,</span> i<span 
class="token punctuation">,</span> <span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="11"></td><td><pre> <span class="token function">panic</span><span class="token punctuation">(</span><span class="token string">"uvmcopy: pte should exist"</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="12"></td><td><pre> <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token operator">*</span>pte <span class="token operator">&amp;</span> PTE_V<span class="token punctuation">)</span> <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="13"></td><td><pre> <span class="token function">panic</span><span class="token punctuation">(</span><span class="token string">"uvmcopy: page not present"</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="14"></td><td><pre> <span class="token operator">*</span>pte <span class="token operator">=</span> <span class="token punctuation">(</span><span class="token operator">*</span>pte<span class="token operator">|</span>PTE_COW<span class="token punctuation">)</span> <span class="token operator">&amp;</span> <span class="token operator">~</span><span class="token punctuation">(</span>PTE_W<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// only mark this PTE as a COW entry and make it read-only; no page is copied here</span></pre></td></tr><tr><td data-num="15"></td><td><pre> pa <span class="token operator">=</span> <span class="token function">PTE2PA</span><span class="token punctuation">(</span><span class="token
operator">*</span>pte<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="16"></td><td><pre> <span class="token function">ref_page</span><span class="token punctuation">(</span><span class="token punctuation">(</span>uint64<span class="token punctuation">)</span>pa<span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="17"></td><td><pre> flags <span class="token operator">=</span> <span class="token function">PTE_FLAGS</span><span class="token punctuation">(</span><span class="token operator">*</span>pte<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="18"></td><td><pre> <span class="token comment">/* if((mem = kalloc()) == ) */</span></pre></td></tr><tr><td data-num="19"></td><td><pre> <span class="token comment">/* goto err; */</span></pre></td></tr><tr><td data-num="20"></td><td><pre> <span class="token comment">/* memmove(mem, (char*)pa, PGSIZE); */</span></pre></td></tr><tr><td data-num="21"></td><td><pre> <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token function">mappages</span><span class="token punctuation">(</span>new<span class="token punctuation">,</span> i<span class="token punctuation">,</span> PGSIZE<span class="token punctuation">,</span> <span class="token punctuation">(</span>uint64<span class="token punctuation">)</span>pa<span class="token punctuation">,</span> flags<span class="token punctuation">)</span> <span class="token operator">!=</span> <span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="22"></td><td><pre> <span class="token comment">/* kfree(mem); */</span></pre></td></tr><tr><td data-num="23"></td><td><pre> <span class="token 
keyword">goto</span> err<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="24"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="25"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="26"></td><td><pre> <span class="token keyword">return</span> <span class="token number">0</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="27"></td><td><pre></pre></td></tr><tr><td data-num="28"></td><td><pre> err<span class="token operator">:</span></pre></td></tr><tr><td data-num="29"></td><td><pre> <span class="token function">uvmunmap</span><span class="token punctuation">(</span>new<span class="token punctuation">,</span> <span class="token number">0</span><span class="token punctuation">,</span> i <span class="token operator">/</span> PGSIZE<span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="30"></td><td><pre> <span class="token keyword">return</span> <span class="token operator">-</span><span class="token number">1</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="31"></td><td><pre><span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="32"></td><td><pre></pre></td></tr><tr><td data-num="33"></td><td><pre><span class="token keyword">int</span></pre></td></tr><tr><td data-num="34"></td><td><pre><span class="token function">copyout</span><span class="token punctuation">(</span><span class="token class-name">pagetable_t</span> pagetable<span class="token punctuation">,</span> uint64 dstva<span class="token punctuation">,</span> <span class="token keyword">char</span> <span class="token operator">*</span>src<span class="token punctuation">,</span> uint64 len<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="35"></td><td><pre><span 
class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="36"></td><td><pre> uint64 n<span class="token punctuation">,</span> va0<span class="token punctuation">,</span> pa0<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="37"></td><td><pre></pre></td></tr><tr><td data-num="38"></td><td><pre> <span class="token keyword">while</span><span class="token punctuation">(</span>len <span class="token operator">></span> <span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="39"></td><td><pre> va0 <span class="token operator">=</span> <span class="token function">PGROUNDDOWN</span><span class="token punctuation">(</span>dstva<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="40"></td><td><pre> pa0 <span class="token operator">=</span> <span class="token function">walkaddr</span><span class="token punctuation">(</span>pagetable<span class="token punctuation">,</span> va0<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="41"></td><td><pre> <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token function">cow_check</span><span class="token punctuation">(</span>pagetable<span class="token punctuation">,</span> va0<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">&#123;</span> <span class="token comment">// </span></pre></td></tr><tr><td data-num="42"></td><td><pre> <span class="token comment">/* panic("copy"); */</span></pre></td></tr><tr><td data-num="43"></td><td><pre> pa0 <span class="token operator">=</span> <span class="token function">cow</span><span class="token punctuation">(</span>va0<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td 
data-num="44"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="45"></td><td><pre> <span class="token keyword">if</span><span class="token punctuation">(</span>pa0 <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="46"></td><td><pre> <span class="token keyword">return</span> <span class="token operator">-</span><span class="token number">1</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="47"></td><td><pre> n <span class="token operator">=</span> PGSIZE <span class="token operator">-</span> <span class="token punctuation">(</span>dstva <span class="token operator">-</span> va0<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="48"></td><td><pre> <span class="token keyword">if</span><span class="token punctuation">(</span>n <span class="token operator">></span> len<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="49"></td><td><pre> n <span class="token operator">=</span> len<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="50"></td><td><pre> <span class="token function">memmove</span><span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token keyword">void</span> <span class="token operator">*</span><span class="token punctuation">)</span><span class="token punctuation">(</span>pa0 <span class="token operator">+</span> <span class="token punctuation">(</span>dstva <span class="token operator">-</span> va0<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">,</span> src<span class="token punctuation">,</span> n<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="51"></td><td><pre></pre></td></tr><tr><td data-num="52"></td><td><pre> len 
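<span class="token operator">-=</span> n<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="53"></td><td><pre> src <span class="token operator">+=</span> n<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="54"></td><td><pre> dstva <span class="token operator">=</span> va0 <span class="token operator">+</span> PGSIZE<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="55"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="56"></td><td><pre> <span class="token keyword">return</span> <span class="token number">0</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="57"></td><td><pre><span class="token punctuation">&#125;</span></pre></td></tr></table></figure><p><code>copyout</code> has to be modified as well. When the user virtual address points at a COW page, that page must not be written in place. So why not let the trap path handle it? Because we only taught <code>usertrap</code> to handle page faults taken in user mode, whereas <code>copyout</code> runs in kernel mode and <code>kerneltrap</code> has no page-fault handling. Moreover, <code>copyout</code> translates the address in software with <code>walkaddr</code> and writes through the kernel's own mapping of physical memory, so the hardware never checks the user PTE's write bit and no fault is raised in the first place. The COW copy therefore has to be triggered by hand inside <code>copyout</code>.</p>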
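<p>To make the flag handling concrete, here is a minimal user-space sketch of the same idea; it is not the xv6 code. The tiny <code>PGSIZE</code>, the <code>struct pte</code> layout, and the helpers <code>page_alloc</code>, <code>share_cow</code>, <code>cow_resolve</code> and <code>sim_copyout</code> are all hypothetical names made up for the simulation, and <code>PTE_COW</code> is assumed to sit in one of the RISC-V software-reserved bits (bit 8).</p>

```c
#include <stdint.h>
#include <string.h>

#define PGSIZE  64           /* tiny page size, for the simulation only */
#define NPAGES  4
#define PTE_V   (1u << 0)    /* valid */
#define PTE_W   (1u << 2)    /* writable */
#define PTE_COW (1u << 8)    /* assumption: a software-reserved (RSW) bit */

/* Hypothetical, simplified PTE: flag bits plus an index into mem[]. */
struct pte { uint32_t flags; int page; };

char mem[NPAGES][PGSIZE];    /* simulated physical pages */
int  refcnt[NPAGES];         /* per-page reference counts */

/* Return a free page index, or -1 when memory is exhausted. */
int page_alloc(void)
{
  for (int i = 0; i < NPAGES; i++)
    if (refcnt[i] == 0) { refcnt[i] = 1; return i; }
  return -1;
}

/* What uvmcopy does per page: share the page, mark both PTEs COW and read-only. */
void share_cow(struct pte *parent, struct pte *child)
{
  parent->flags = (parent->flags | PTE_COW) & ~PTE_W;
  *child = *parent;
  refcnt[parent->page]++;
}

/* What the fault handler does: copy the page, then remap it writable. */
int cow_resolve(struct pte *pte)
{
  int np = page_alloc();
  if (np < 0)
    return -1;
  memcpy(mem[np], mem[pte->page], PGSIZE);
  refcnt[pte->page]--;                     /* drop our reference to the shared page */
  pte->page = np;
  pte->flags = (pte->flags | PTE_W) & ~PTE_COW;
  return 0;
}

/* Kernel-side write: nothing faults here, so COW must be checked by hand. */
int sim_copyout(struct pte *pte, int off, const char *src, int len)
{
  if (!(pte->flags & PTE_V) || off < 0 || len < 0 || off + len > PGSIZE)
    return -1;
  if ((pte->flags & PTE_COW) && cow_resolve(pte) != 0)
    return -1;
  memcpy(&mem[pte->page][off], src, (size_t)len);
  return 0;
}
```

<p>In this sketch, calling <code>sim_copyout</code> on the child's PTE after <code>share_cow</code> gives the child a private copy and leaves the parent's page untouched, which is the behaviour the modified <code>copyout</code> is meant to have.</p>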
<p>That wraps up the whole lab. It was genuinely hard.</p>
<h3 id="结果"><a class="anchor" href="#结果">#</a> Result</h3>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220428163903522.png" alt="image-20220428163903522" /></p>
<h3 id="参考资料"><a class="anchor" href="#参考资料">#</a> References</h3>
<p>MIT 6.s081 xv6-lab6-cow, an article by 大尾巴羊 on Zhihu: <span class="exturl" data-url="aHR0cHM6Ly96aHVhbmxhbi56aGlodS5jb20vcC80Mjk4MjE5NDA=">https://zhuanlan.zhihu.com/p/429821940</span></p>
]]></description>
</item>
<item>
<guid isPermalink="true">https://songlinlife.top/2022/%E6%95%B0%E6%8D%AE%E5%BA%93/ANNS-EFANN%E8%AE%BA%E6%96%87%E9%98%85%E8%AF%BB/</guid>
<title>ANNS: Reading the EFANNA Paper</title>
<link>https://songlinlife.top/2022/%E6%95%B0%E6%8D%AE%E5%BA%93/ANNS-EFANN%E8%AE%BA%E6%96%87%E9%98%85%E8%AF%BB/</link>
<category term="数据库" scheme="https://songlinlife.top/categories/%E6%95%B0%E6%8D%AE%E5%BA%93/" />
<category term="ANNS" scheme="https://songlinlife.top/tags/ANNS/" />
<pubDate>Wed, 27 Apr 2022 17:36:22 +0800</pubDate>
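<description><![CDATA[ <p>I was not keen on reading the <code>EFANNA</code> paper at first: it is fairly old, and its performance is not particularly outstanding. But the <code>nsg</code> codebase uses <code>efanna</code> to build its initial KNN graph, so I picked it up after all.</p>
<h3 id="作者的动机"><a class="anchor" href="#作者的动机">#</a> The authors' motivation</h3>
<p>The main novelty of <code>EFANNA</code> is that it uses two data structures together, a KD-tree and a graph. The KD-tree serves as an auxiliary structure that helps both the construction of the KNN graph and the later search. The authors' motivation is to get the advantages of tree-based and graph-based methods at the same time. This use of an auxiliary data structure is worth thinking about.</p>
<h3 id="算法拆解"><a class="anchor" href="#算法拆解">#</a> Breaking down the algorithm</h3>
<h4 id="树构建算法"><a class="anchor" href="#树构建算法">#</a> Tree construction</h4>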
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220427201215094.png" alt="image-20220427201215094" /></p>
<p>The biggest difference from a conventional KD-tree is that each leaf node holds multiple points.</p>
<h4 id="initial-graph"><a class="anchor" href="#initial-graph">#</a> initial graph</h4>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220427201342115.png" alt="image-20220427201342115" /></p>
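<p>KGraph starts from a random graph and then refines it with NN-descent. The authors' contribution is to build the initial graph from the KD-tree instead. The algorithm is fairly abstract, but also quite interesting. The key to understanding it is the idea of merging level by level.</p>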
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220427201736932.png" alt="image-20220427201736932" /></p>
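<p>Look at this figure. For the query point q, the tree descent ends at node 8. Node 8 alone is not enough, though: the figure above shows that nodes 9 and 10 are also close to the query, so we need to merge. Merging works bottom-up. At level 2, the branch 4-&gt;9 has not been searched, so we search 4-&gt;9, reach node 9, and add node 9 to the candidates. At level 1, the branch 2-&gt;5 has not been searched, so we descend it with q, reach node 10, and add node 10 to the candidates. At level 0, the branch 1-&gt;3 has not been searched; descending it reaches node 12, which is added to the candidates. That is the level-by-level merge. In practice we only merge up to a certain level and do not go all the way to the root.</p>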
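<p>The walk-through above can be sketched in a few lines of C. This is a simplified illustration under my own assumptions, not the paper's implementation: the tree is a complete binary tree in heap numbering (root 1, leaves 8..15) over one-dimensional data, the split values in <code>split_val</code> are made up, and each leaf stands for a single candidate rather than a bucket of points. The names <code>descend</code> and <code>collect_candidates</code> are hypothetical.</p>

```c
#define FIRST_LEAF 8   /* heap numbering: internal nodes 1..7, leaves 8..15 */

/* Made-up split value for each internal node; index 0 is unused. */
const double split_val[FIRST_LEAF] = {0, 8, 4, 12, 2, 6, 10, 14};

/* Follow the query q down from `node` until a leaf is reached. */
int descend(int node, double q)
{
  while (node < FIRST_LEAF)
    node = 2 * node + (q < split_val[node] ? 0 : 1);
  return node;
}

/*
 * Level-by-level merge: start at the leaf that q falls into, then climb
 * at most max_levels levels. At each level, descend the sibling subtree
 * that has not been searched yet and record the leaf it leads to.
 * Returns the number of candidates written to out[].
 */
int collect_candidates(double q, int max_levels, int out[], int cap)
{
  int n = 0;
  int node = descend(1, q);
  if (n < cap)
    out[n++] = node;
  for (int lvl = 0; lvl < max_levels && node > 1 && n < cap; lvl++) {
    int sibling = node ^ 1;          /* the other child of the same parent */
    out[n++] = descend(sibling, q);
    node /= 2;                       /* climb to the parent */
  }
  return n;
}
```

<p>With these split values and q = 1, the descent ends at leaf 8 and three levels of merging visit 9, 10 and 12, the same order as in the figure. A smaller <code>max_levels</code> stops the merge before reaching the root.</p>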
<h4 id="graph-refine"><a class="anchor" href="#graph-refine">#</a> graph refine</h4>
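<p>I could not follow the algorithm as it is written here. The authors say they restated NN-Descent to make it easier to understand. In short, the refine step is just NN-descent.</p>
<h3 id="查询算法"><a class="anchor" href="#查询算法">#</a> Search algorithm</h3>
<p>Its distinguishing feature is that the KD-tree is used to initialize the candidate set:</p>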
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220427210749507.png" alt="image-20220427210749507" /></p>
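<p>To show why a good starting candidate matters, here is a deliberately simplified sketch of graph search. It is plain hill climbing with a single current node, not the candidate-pool search in the paper, and the toy data (eight points on a line, each linked to its two nearest neighbors) and the names <code>pts</code>, <code>nbr</code> and <code>greedy_search</code> are my own assumptions. In the real algorithm the start would come from the KD-tree; here it is passed in as <code>start</code>.</p>

```c
#define NPTS 8
#define DEG  2

/* Toy data: eight points on a line, each linked to its two nearest neighbors. */
const double pts[NPTS] = {0, 1, 2, 3, 4, 5, 6, 7};
const int nbr[NPTS][DEG] = {
  {1, 2}, {0, 2}, {1, 3}, {2, 4}, {3, 5}, {4, 6}, {5, 7}, {5, 6}
};

/* Absolute difference, used as the 1-D distance. */
double dist(double a, double b) { return a > b ? a - b : b - a; }

/*
 * Move to the neighbor closest to q until no neighbor improves on the
 * current node. If hops is not NULL, it receives the number of moves.
 */
int greedy_search(double q, int start, int *hops)
{
  int cur = start;
  int moves = 0;
  for (;;) {
    int best = cur;
    for (int j = 0; j < DEG; j++) {
      int cand = nbr[cur][j];
      if (dist(pts[cand], q) < dist(pts[best], q))
        best = cand;
    }
    if (best == cur)
      break;
    cur = best;
    moves++;
  }
  if (hops)
    *hops = moves;
  return cur;
}
```

<p>Both a far start and a near start end at the same point in this toy graph, but the near start needs fewer moves, which is the benefit a tree-based initialization is meant to provide.</p>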
]]></description>
</item>
<item>
<guid isPermalink="true">https://songlinlife.top/2022/linux/MIT6-s081-PageFault/</guid>
<title>MIT6.s081: lab5 PageFault</title>
<link>https://songlinlife.top/2022/linux/MIT6-s081-PageFault/</link>
<category term="linux" scheme="https://songlinlife.top/categories/linux/" />
<pubDate>Wed, 27 Apr 2022 09:46:14 +0800</pubDate>
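<description><![CDATA[ <p>Conclusion first: all the tests pass in the end. For a long time I could pass <code>usertests</code>, but <code>lazytests</code> kept getting stuck at the <code>out of mem</code> test and nothing I tried got past it. I wasted a lot of time on this.</p>
<p>The cause is that <code>sbrk</code> is lazy: we simply record <code>addr+n</code> as the size of the proc without allocating anything. The catch is that xv6's virtual address space is capped at <code>MAXVA</code>. Without a bound, walk ends up being called with an address at or beyond MAXVA and panics every time. The fix is simple: add a guard in uvmunmap that forces va to stay below MAXVA.</p>
<p>Lazy allocation code:</p>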
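<p>The guard can be sketched on its own, outside the kernel. The following is an illustration under stated assumptions rather than the actual patch: <code>MAXVA</code> and <code>PGSIZE</code> use the same values as xv6's <code>riscv.h</code>, while <code>va_range_ok</code> and <code>clamp_npages</code> are hypothetical helper names that do not exist in xv6.</p>

```c
#include <stdint.h>

#define PGSIZE 4096ULL
#define MAXVA  (1ULL << (9 + 9 + 9 + 12 - 1))   /* same value as in xv6's riscv.h */

/* Return 1 when [va, va + npages * PGSIZE) lies entirely below MAXVA. */
int va_range_ok(uint64_t va, uint64_t npages)
{
  if (va >= MAXVA)
    return 0;
  /* Compare page counts instead of adding, so the check cannot overflow. */
  return npages <= (MAXVA - va) / PGSIZE;
}

/* Hypothetical helper: shrink npages so that the range stops at MAXVA. */
uint64_t clamp_npages(uint64_t va, uint64_t npages)
{
  if (va >= MAXVA)
    return 0;
  uint64_t room = (MAXVA - va) / PGSIZE;
  return npages < room ? npages : room;
}
```

<p>A loop such as the one in uvmunmap could call a check like this before walking, or simply stop iterating once the address reaches <code>MAXVA</code>.</p>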
<figure class="highlight c"><figcaption data-lang="c"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">int</span> <span class="token function">lazy_alloc</span><span class="token punctuation">(</span>uint64 addr<span class="token punctuation">)</span><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="2"></td><td><pre> uint64 va <span class="token operator">=</span> <span class="token function">PGROUNDDOWN</span><span class="token punctuation">(</span>addr<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="3"></td><td><pre> <span class="token keyword">struct</span> <span class="token class-name">proc</span><span class="token operator">*</span> p <span class="token operator">=</span> <span class="token function">myproc</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="4"></td><td><pre> <span class="token keyword">if</span><span class="token punctuation">(</span>addr <span class="token operator">>=</span> p<span class="token operator">-></span>sz <span class="token operator">||</span> addr <span class="token operator">&lt;</span> p<span class="token operator">-></span>trapframe<span class="token operator">-></span>sp <span class="token punctuation">)</span><span class="token punctuation">&#123;</span> <span class="token comment">// fail if addr is at or beyond sz, or below the user stack pointer</span></pre></td></tr><tr><td data-num="5"></td><td><pre> <span class="token keyword">return</span> <span class="token operator">-</span><span class="token number">1</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="6"></td><td><pre> <span class="token punctuation">&#125;</span><span class="token keyword">else</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="7"></td><td><pre> <span class="token keyword">char</span><span
class="token operator">*</span> mem <span class="token operator">=</span> <span class="token function">kalloc</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="8"></td><td><pre> <span class="token keyword">if</span><span class="token punctuation">(</span>mem <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="9"></td><td><pre> <span class="token keyword">return</span> <span class="token operator">-</span><span class="token number">1</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="10"></td><td><pre> <span class="token punctuation">&#125;</span><span class="token keyword">else</span><span class="token punctuation">&#123;</span> </pre></td></tr><tr><td data-num="11"></td><td><pre> <span class="token function">memset</span><span class="token punctuation">(</span>mem<span class="token punctuation">,</span> <span class="token number">0</span><span class="token punctuation">,</span> PGSIZE<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="12"></td><td><pre> <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token function">mappages</span><span class="token punctuation">(</span>p<span class="token operator">-></span>pagetable<span class="token punctuation">,</span>va <span class="token punctuation">,</span>PGSIZE <span class="token punctuation">,</span> <span class="token punctuation">(</span>uint64<span class="token punctuation">)</span>mem<span class="token punctuation">,</span> PTE_W<span class="token operator">|</span>PTE_X<span class="token operator">|</span>PTE_R<span class="token operator">|</span>PTE_U<span class="token punctuation">)</span><span class="token operator">!=</span><span 
class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="13"></td><td><pre> <span class="token function">kfree</span><span class="token punctuation">(</span>mem<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="14"></td><td><pre> <span class="token function">uvmunmap</span><span class="token punctuation">(</span>p<span class="token operator">-></span>pagetable<span class="token punctuation">,</span>va<span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="15"></td><td><pre> <span class="token keyword">return</span> <span class="token operator">-</span><span class="token number">1</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="16"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="17"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="18"></td><td><pre> <span class="token keyword">return</span> <span class="token number">0</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="19"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="20"></td><td><pre> <span class="token keyword">return</span> <span class="token operator">-</span><span class="token number">1</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="21"></td><td><pre><span class="token punctuation">&#125;</span></pre></td></tr></table></figure><p>Changes to usertrap:</p>
<figure class="highlight c"><figcaption data-lang="c"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">else</span> <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token function">r_scause</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">==</span> <span class="token number">15</span> <span class="token operator">||</span> <span class="token function">r_scause</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">==</span> <span class="token number">13</span><span class="token punctuation">)</span><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="2"></td><td><pre> uint64 addr <span class="token operator">=</span> <span class="token function">r_stval</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="3"></td><td><pre> <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token function">lazy_alloc</span><span class="token punctuation">(</span>addr<span class="token punctuation">)</span> <span class="token operator">!=</span> <span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="4"></td><td><pre> <span class="token function">printf</span><span class="token punctuation">(</span><span class="token string">"usertrap(): unexpected scause %p pid=%d\n"</span><span class="token punctuation">,</span> <span class="token function">r_scause</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> p<span class="token operator">-></span>pid<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td 
data-num="5"></td><td><pre> <span class="token function">printf</span><span class="token punctuation">(</span><span class="token string">" sepc=%p stval=%p\n"</span><span class="token punctuation">,</span> <span class="token function">r_sepc</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">r_stval</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="6"></td><td><pre> p<span class="token operator">-></span>killed <span class="token operator">=</span> <span class="token number">1</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="7"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="8"></td><td><pre><span class="token punctuation">&#125;</span></pre></td></tr></table></figure><p>Changes to uvmunmap:</p>
<figure class="highlight c"><figcaption data-lang="c"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">void</span></pre></td></tr><tr><td data-num="2"></td><td><pre><span class="token function">uvmunmap</span><span class="token punctuation">(</span><span class="token class-name">pagetable_t</span> pagetable<span class="token punctuation">,</span> uint64 va<span class="token punctuation">,</span> uint64 npages<span class="token punctuation">,</span> <span class="token keyword">int</span> do_free<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="3"></td><td><pre><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="4"></td><td><pre> <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span></pre></td></tr><tr><td data-num="5"></td><td><pre> <span class="token keyword">for</span><span class="token punctuation">(</span>a <span class="token operator">=</span> va<span class="token punctuation">;</span> a <span class="token operator">&lt;</span> va <span class="token operator">+</span> npages<span class="token operator">*</span>PGSIZE<span class="token punctuation">;</span> a <span class="token operator">+=</span> PGSIZE<span class="token punctuation">)</span><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="6"></td><td><pre> <span class="token keyword">if</span><span class="token punctuation">(</span>a <span class="token operator">>=</span>MAXVA<span class="token punctuation">)</span><span class="token keyword">break</span><span class="token punctuation">;</span> <span class="token comment">// this check is required: a must never reach MAXVA, the top of virtual memory</span></pre></td></tr><tr><td data-num="7"></td><td><pre> <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token punctuation">(</span>pte <span class="token operator">=</span> <span class="token function">walk</span><span class="token 
punctuation">(</span>pagetable<span class="token punctuation">,</span> a<span class="token punctuation">,</span> <span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="8"></td><td><pre> <span class="token comment">/* panic("uvmunmap: walk"); */</span></pre></td></tr><tr><td data-num="9"></td><td><pre> <span class="token keyword">continue</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="10"></td><td><pre> <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token operator">*</span>pte <span class="token operator">&amp;</span> PTE_V<span class="token punctuation">)</span> <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="11"></td><td><pre> <span class="token comment">/* panic("uvmunmap: not mapped"); */</span></pre></td></tr><tr><td data-num="12"></td><td><pre> <span class="token keyword">continue</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="13"></td><td><pre> <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span></pre></td></tr><tr><td data-num="14"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="15"></td><td><pre><span class="token punctuation">&#125;</span></pre></td></tr></table></figure><p>Changes to walkaddr, so that a legal but not yet mapped address is lazily allocated on the spot. I was not sure this step was needed at all: when I left it out, all the tests still passed for me. My first guess was that a page fault is raised by memory-access instructions such as load and store, so the hardware would trap and usertrap would deal with it. But walkaddr is a software page-table walk that the kernel runs on behalf of system calls such as read and write; when it cannot find a valid pa, no hardware trap is raised and the call simply returns an error, because the kernel cannot tell whether the page is supposed to exist. A trap is also expensive, costing hundreds or even thousands of instructions each time. So it is safer to call lazy_alloc in walkaddr as well:</p>
<figure class="highlight c"><figcaption data-lang="c"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">if</span><span class="token punctuation">(</span>pte <span class="token operator">==</span> <span class="token number">0</span> <span class="token operator">||</span> <span class="token punctuation">(</span><span class="token operator">*</span>pte <span class="token operator">&amp;</span> PTE_V<span class="token punctuation">)</span> <span class="token operator">==</span><span class="token number">0</span> <span class="token punctuation">)</span><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="2"></td><td><pre> <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token function">lazy_alloc</span><span class="token punctuation">(</span>va<span class="token punctuation">)</span> <span class="token operator">!=</span> <span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="3"></td><td><pre> <span class="token keyword">return</span> <span class="token number">0</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="4"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="5"></td><td><pre> pte <span class="token operator">=</span> <span class="token function">walk</span><span class="token punctuation">(</span>pagetable<span class="token punctuation">,</span> va<span class="token punctuation">,</span> <span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="6"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr></table></figure><p>Changes to uvmcopy, so that fork keeps working:</p>
<figure class="highlight c"><figcaption data-lang="c"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">if</span><span class="token punctuation">(</span><span class="token punctuation">(</span>pte <span class="token operator">=</span> <span class="token function">walk</span><span class="token punctuation">(</span>old<span class="token punctuation">,</span> i<span class="token punctuation">,</span> <span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="2"></td><td><pre> <span class="token comment">/* panic("uvmcopy: pte should exist"); */</span></pre></td></tr><tr><td data-num="3"></td><td><pre> <span class="token keyword">continue</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="4"></td><td><pre><span class="token keyword">if</span><span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token operator">*</span>pte <span class="token operator">&amp;</span> PTE_V<span class="token punctuation">)</span> <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="5"></td><td><pre> <span class="token comment">/* panic("uvmcopy: page not present"); */</span></pre></td></tr><tr><td data-num="6"></td><td><pre> <span class="token keyword">continue</span><span class="token punctuation">;</span></pre></td></tr></table></figure><p>Finally, the sbrk code:</p>
<figure class="highlight c"><figcaption data-lang="c"></figcaption><table><tr><td data-num="1"></td><td><pre>uint64</pre></td></tr><tr><td data-num="2"></td><td><pre><span class="token function">sys_sbrk</span><span class="token punctuation">(</span><span class="token keyword">void</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="3"></td><td><pre><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="4"></td><td><pre> <span class="token keyword">int</span> addr<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="5"></td><td><pre> <span class="token keyword">int</span> n<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="6"></td><td><pre></pre></td></tr><tr><td data-num="7"></td><td><pre> <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token function">argint</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token operator">&amp;</span>n<span class="token punctuation">)</span> <span class="token operator">&lt;</span> <span class="token number">0</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="8"></td><td><pre> <span class="token keyword">return</span> <span class="token operator">-</span><span class="token number">1</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="9"></td><td><pre> addr <span class="token operator">=</span> <span class="token function">myproc</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token operator">-></span>sz<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="10"></td><td><pre> <span class="token keyword">if</span><span class="token punctuation">(</span>n <span class="token operator">></span> <span class="token number">0</span><span class="token punctuation">)</span><span class="token 
punctuation">&#123;</span></pre></td></tr><tr><td data-num="11"></td><td><pre> <span class="token function">myproc</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token operator">-></span>sz <span class="token operator">=</span> addr <span class="token operator">+</span> n<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="12"></td><td><pre> <span class="token punctuation">&#125;</span><span class="token keyword">else</span><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="13"></td><td><pre> <span class="token function">uvmdealloc</span><span class="token punctuation">(</span><span class="token function">myproc</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token operator">-></span>pagetable<span class="token punctuation">,</span> addr<span class="token punctuation">,</span> addr <span class="token operator">+</span>n<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="14"></td><td><pre> <span class="token function">myproc</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token operator">-></span>sz <span class="token operator">=</span> addr <span class="token operator">+</span> n<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="15"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="16"></td><td><pre> </pre></td></tr><tr><td data-num="17"></td><td><pre> <span class="token comment">/* if(growproc(n) &lt; 0) */</span></pre></td></tr><tr><td data-num="18"></td><td><pre> <span class="token comment">/* return -1; */</span></pre></td></tr><tr><td data-num="19"></td><td><pre> <span class="token keyword">return</span> addr<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="20"></td><td><pre><span class="token 
punctuation">&#125;</span></pre></td></tr></table></figure><p>Result:</p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220427163441467.png" alt="image-20220427163441467" /></p>
<p>This lab is only rated moderate, but I really struggled while writing it.</p>
]]></description>
</item>
<item>
<guid isPermalink="true">https://songlinlife.top/2022/linux/MIT6-s081-lab4Trap/</guid>
<title>MIT6.s081: lab4Trap</title>
<link>https://songlinlife.top/2022/linux/MIT6-s081-lab4Trap/</link>
<category term="linux" scheme="https://songlinlife.top/categories/linux/" />
<category term="MIT" scheme="https://songlinlife.top/tags/MIT/" />
<pubDate>Mon, 25 Apr 2022 20:42:38 +0800</pubDate>
<description><![CDATA[ <h3 id="risc-v-assembly"><a class="anchor" href="#risc-v-assembly">#</a> RISC-V assembly</h3>
<p>First, call.c:</p>
<figure class="highlight c"><figcaption data-lang="c"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"kernel/param.h"</span></span></pre></td></tr><tr><td data-num="2"></td><td><pre><span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"kernel/types.h"</span></span></pre></td></tr><tr><td data-num="3"></td><td><pre><span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"kernel/stat.h"</span></span></pre></td></tr><tr><td data-num="4"></td><td><pre><span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"user/user.h"</span></span></pre></td></tr><tr><td data-num="5"></td><td><pre></pre></td></tr><tr><td data-num="6"></td><td><pre><span class="token keyword">int</span> <span class="token function">g</span><span class="token punctuation">(</span><span class="token keyword">int</span> x<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="7"></td><td><pre> <span class="token keyword">return</span> x<span class="token operator">+</span><span class="token number">3</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="8"></td><td><pre><span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="9"></td><td><pre></pre></td></tr><tr><td data-num="10"></td><td><pre><span class="token keyword">int</span> <span class="token function">f</span><span class="token punctuation">(</span><span class="token keyword">int</span> x<span class="token punctuation">)</span> <span class="token 
punctuation">&#123;</span></pre></td></tr><tr><td data-num="11"></td><td><pre> <span class="token keyword">return</span> <span class="token function">g</span><span class="token punctuation">(</span>x<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="12"></td><td><pre><span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="13"></td><td><pre></pre></td></tr><tr><td data-num="14"></td><td><pre><span class="token keyword">void</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token keyword">void</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="15"></td><td><pre> <span class="token function">printf</span><span class="token punctuation">(</span><span class="token string">"%d %d\n"</span><span class="token punctuation">,</span> <span class="token function">f</span><span class="token punctuation">(</span><span class="token number">8</span><span class="token punctuation">)</span><span class="token operator">+</span><span class="token number">1</span><span class="token punctuation">,</span> <span class="token number">13</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="16"></td><td><pre> <span class="token function">exit</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="17"></td><td><pre><span class="token punctuation">&#125;</span></pre></td></tr></table></figure><p>Then the disassembly of call:</p>
<pre><code class="language-assembly">int g(int x) &#123;
0: 1141 addi sp,sp,-16
2: e422 sd s0,8(sp)
4: 0800 addi s0,sp,16
return x+3;
&#125;
6: 250d addiw a0,a0,3
8: 6422 ld s0,8(sp)
a: 0141 addi sp,sp,16
c: 8082 ret
000000000000000e &lt;f&gt;:
int f(int x) &#123;
e: 1141 addi sp,sp,-16
10: e422 sd s0,8(sp)
12: 0800 addi s0,sp,16
return g(x);
&#125;
14: 250d addiw a0,a0,3
16: 6422 ld s0,8(sp)
18: 0141 addi sp,sp,16
1a: 8082 ret
000000000000001c &lt;main&gt;:
void main(void) &#123;
1c: 1141 addi sp,sp,-16
1e: e406 sd ra,8(sp)
20: e022 sd s0,0(sp)
22: 0800 addi s0,sp,16
printf(&quot;%d %d\n&quot;, f(8)+1, 13);
24: 4635 li a2,13
26: 45b1 li a1,12
28: 00000517 auipc a0,0x0
2c: 7a050513 addi a0,a0,1952 # 7c8 &lt;malloc+0xe8&gt;
30: 00000097 auipc ra,0x0
34: 5f8080e7 jalr 1528(ra) # 628 &lt;printf&gt;
exit(0);
38: 4501 li a0,0
3a: 00000097 auipc ra,0x0
3e: 274080e7 jalr 628(ra) # 2ae &lt;exit&gt;
</code></pre>
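<p>Let me go over what the key instructions do. <code>auipc</code> adds its immediate (shifted left by 12 bits) to the current pc and stores the result in the destination register. For example:</p>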
<pre><code class="language-assembly">30: 00000097 auipc ra,0x0
34: 5f8080e7 jalr 1528(ra) # 628 &lt;printf&gt;
</code></pre>
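<p>This adds 0x0 to the current pc, which is <code>0x30</code>, and puts the result in ra, the return address register. <code>jalr</code> then stores pc+4 in its destination register; the disassembly omits that register because it defaults to ra, so ra = <code>0x38</code>.</p>
<p><code>1528(ra)</code> means that 1528 is added to the value in ra to form an address (0x30 + 0x5f8 = 0x628, the entry of printf), and execution jumps there. When printf finishes it executes <code>ret</code>, which reads the <code>return address</code> from <code>ra</code> and resumes execution at that address.</p>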
<p>This one is interesting:</p>
<figure class="highlight c"><figcaption data-lang="c"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">unsigned</span> <span class="token keyword">int</span> i <span class="token operator">=</span> <span class="token number">0x00646c72</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="2"></td><td><pre><span class="token function">printf</span><span class="token punctuation">(</span><span class="token string">"H%x Wo%s"</span><span class="token punctuation">,</span> <span class="token number">57616</span><span class="token punctuation">,</span> <span class="token punctuation">(</span><span class="token keyword">char</span><span class="token operator">*</span><span class="token punctuation">)</span><span class="token operator">&amp;</span>i<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr></table></figure><p>This prints <code>He110 World</code>. The code as given by the course has no pointer cast, and it fails to compile that way. RISC-V stores data in little-endian order, meaning the least significant byte sits at the lowest address, so the bytes of <code>i</code> are read as <code>72</code>, <code>6c</code>, <code>64</code>, <code>00</code>, which are <code>r</code>, <code>l</code>, <code>d</code> and the terminating NUL. <code>57616</code> is <code>e110</code> in hexadecimal, so the full output is <code>He110 World</code>.</p>
<p>In the following code, what is going to be printed after <code>'y='</code> ? (note: the answer is not a specific value.) Why does this happen?</p>
<figure class="highlight c"><figcaption data-lang="c"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token function">printf</span><span class="token punctuation">(</span><span class="token string">"x=%d y=%d"</span><span class="token punctuation">,</span> <span class="token number">3</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr></table></figure><p>This printf needs three arguments but only two are passed, so the value printed for y is whatever happens to be in register <code>a2</code>, where the third argument would have been placed.</p>
<h3 id="gdb常用指令"><a class="anchor" href="#gdb常用指令">#</a> Common gdb commands</h3>
<h4 id="开启调试"><a class="anchor" href="#开启调试">#</a> Starting a debug session</h4>
<pre><code class="language-sh">make qemu-gdb CPUS=1 # limit the number of CPUs
riscv64-unknown-elf-gdb # use the RISC-V gdb
</code></pre>
<h4 id="gdb命令"><a class="anchor" href="#gdb命令">#</a> gdb commands</h4>
<p><code>Ctrl-c</code></p>
<p>Halt the machine and break in to GDB at the current instruction. If QEMU has multiple virtual CPUs, this halts all of them.</p>
<p><code>c (or continue)</code></p>
<p>Continue execution until the next breakpoint or <code>Ctrl-c</code> .</p>
<p><code>si (or stepi)</code></p>
<p>Execute one machine instruction.</p>
<p><code>b function or b file:line (or breakpoint)</code></p>
<p>Set a breakpoint at the given function or line.</p>
<p><code>b *<em>addr</em> (or breakpoint)</code></p>
<p>Set a breakpoint at the instruction address <em>addr</em>.</p>
<p><code>set print pretty</code></p>
<p>Enable pretty-printing of arrays and structs.</p>
<p><code>info registers</code></p>
<p>Print the general purpose registers and <code>pc</code>. For a much more thorough dump of the machine register state, see QEMU's own <code>info registers</code> command.</p>
<p><code>x/<em>N</em>x <em>addr</em></code></p>
<p>Display a hex dump of <em>N</em> words starting at virtual address <em>addr</em>. If <em>N</em> is omitted, it defaults to 1. <em>addr</em> can be any expression.</p>
<p><code>x/<em>N</em>i <em>addr</em></code></p>
<p>Display the <em>N</em> assembly instructions starting at <em>addr</em>. Using <code>$pc</code> as <em>addr</em> will display the instructions at the current instruction pointer.</p>
<p><code>symbol-file <em>file</em></code></p>
<p>(Lab 3+) Switch to symbol file <em>file</em>. When GDB attaches to QEMU, it has no notion of the process boundaries within the virtual machine, so we have to tell it which symbols to use. By default, we configure GDB to use the kernel symbol file, <code>obj/kern/kernel</code> . If the machine is running user code, say <code>hello.c</code> , you can switch to the hello symbol file using <code>symbol-file obj/user/hello</code> .</p>
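<p>For the <code>x</code> command, whose general form is <code>x/nfu &lt;addr&gt;</code>, here is a short reference:</p>
<pre><code>n: a positive integer, the number of memory units to display, i.e. n units are shown starting at the given address;
the size of one unit is set by the third parameter, u.
f: the output format for the memory at addr; s prints a string. Pay particular attention to the integer formats:
x hexadecimal.
d signed decimal.
u unsigned decimal.
o octal.
t binary.
a address (hexadecimal, shown relative to the nearest symbol).
c character.
f floating point.
i machine instruction.
u: the number of bytes that make up one memory unit, 4 by default. It is written as one of these letters:
b=1 byte, h=2 bytes, w=4 bytes, g=8 bytes.
&lt;addr&gt;: the memory address.
</code></pre>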
<h4 id="qemu命令"><a class="anchor" href="#qemu命令">#</a> qemu commands</h4>
<p><code>Ctrl+a x</code> quits QEMU.</p>
<p><code>Ctrl+a c</code> switches to the QEMU console, where <code>info mem</code> prints the page table.</p>
<h3 id="trap"><a class="anchor" href="#trap">#</a> Trap</h3>
<h4 id="trap代码执行流程"><a class="anchor" href="#trap代码执行流程">#</a> Trap code execution flow</h4>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220426095322352.png" alt="image-20220426095322352" /></p>
<h4 id="trap进入"><a class="anchor" href="#trap进入">#</a> Entering the trap</h4>
<p>Let's look at this diagram again:</p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220426095540263.png" alt="image-20220426095540263" /></p>
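<p>Our goal is to switch into kernel mode through ecall, and the key pieces are two pages: trampoline and trapframe. The trampoline page holds the code that handles a trap coming from user mode, and the kernel maps it into every process's page table as well as into the kernel page table.</p>
<p>Once qemu is running, <code>info mem</code> shows the page table:</p>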
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220426100039263.png" alt="image-20220426100039263" /></p>
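<p>The three-level page table may have been flattened in this view; either way, the two highest addresses at the bottom of the vaddr column are the trampoline page and the trapframe page.</p>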
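<h5 id="ecall做的事情"><a class="anchor" href="#ecall做的事情">#</a> What ecall does</h5>
<p>First, ecall switches from user mode to supervisor mode. This is needed because neither the trampoline nor the trapframe can be accessed in user mode: their page table entries do not have the <code>u</code> flag.</p>
<p>Second, ecall saves the program counter in the SEPC register, since that is the address the trap returns to.</p>
<p>Third, ecall loads the address held in the stvec register into pc. (stvec is the trap vector address, which here is the address of the trampoline page, because the trampoline holds the instructions that handle the trap.)</p>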
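<h5 id="保存寄存器状态"><a class="anchor" href="#保存寄存器状态">#</a> Saving the register state</h5>
<p>The <code>csrrw a0, sscratch, a0</code> instruction swaps the values of the <code>a0</code> and <code>sscratch</code> registers. sscratch actually holds the address of the trapframe page, so we can then use <code>sd</code> (store) instructions to save the registers there (the screenshot shows only part of it; there are many more register-saving instructions above):</p>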
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220426101940437.png" alt="image-20220426101940437" /></p>
<h5 id="加载处理内核数据"><a class="anchor" href="#加载处理内核数据">#</a> Loading the kernel data</h5>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220426103708995.png" alt="image-20220426103708995" /></p>
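<p>These four load instructions load, in order, the top of the kernel stack, the id of the current CPU, the address of the usertrap() function that handles the trap, and the kernel page table (the value that goes into satp).</p>
<p>After executing the instructions above, running <code>info mem</code> again shows:</p>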
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220426103942318.png" alt="image-20220426103942318" /></p>
<p>We have successfully entered the kernel!</p>
<h4 id="处理系统调用"><a class="anchor" href="#处理系统调用">#</a> Handling the system call</h4>
<p><code>trap.c</code> saves the current <code>sepc</code> and then checks <code>scause</code>, the cause of the trap. If <code>scause</code> is 8, it calls <code>syscall()</code>, which uses <code>a7</code> to decide which system call to run and puts the result in <code>a0</code>.</p>
<h4 id="trap返回"><a class="anchor" href="#trap返回">#</a> Returning from the trap</h4>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220426110357638.png" alt="image-20220426110357638" /></p>
<p>The kernel finds that the process has not been killed, so it performs the trap return, that is, usertrapret().</p>
<p>Stepping into this function:</p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220426110914010.png" alt="image-20220426110914010" /></p>
<p>This part does quite a lot, but all of it mirrors trap entry: whatever the next trap entry will need is saved here.</p>
<p>Fast-forwarding:</p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220426111215836.png" alt="image-20220426111215836" /></p>
<p>Boom! We are back in the trampoline!</p>
<p>Finally the previously saved state is loaded back into the registers, and the whole system call is complete!</p>
<h3 id="参考资料"><a class="anchor" href="#参考资料">#</a> References</h3>
<p><span class="exturl" data-url="aHR0cDovL3h5Zmphc29uLnRvcC8yMDIxLzExLzMwL3h2Ni1taXQtNi1TMDgxLTIwMjAtTGFiNC10cmFwcy8=">http://xyfjason.top/2021/11/30/xv6-mit-6-S081-2020-Lab4-traps/</span></p>
<p><span class="exturl" data-url="aHR0cHM6Ly93d3cueXNibG9nLmNjL2FyY2hpdmVzL21pdDZzMDgxbGFiNA==">https://www.ysblog.cc/archives/mit6s081lab4</span></p>
<p><span class="exturl" data-url="aHR0cHM6Ly9wZG9zLmNzYWlsLm1pdC5lZHUvNi44MjgvMjAxMi9sYWJndWlkZS5odG1s">lab tools</span></p>
]]></description>
</item>
<item>
<guid isPermalink="true">https://songlinlife.top/2022/%E6%95%B0%E6%8D%AE%E5%BA%93/DiskANN%EF%BC%88%E4%B8%80%EF%BC%89%EF%BC%9A%E4%BB%A3%E7%A0%81%E9%98%85%E8%AF%BB%E5%BC%80%E5%A7%8B/</guid>
<title>DiskANN (1): Building the Disk Index</title>
<link>https://songlinlife.top/2022/%E6%95%B0%E6%8D%AE%E5%BA%93/DiskANN%EF%BC%88%E4%B8%80%EF%BC%89%EF%BC%9A%E4%BB%A3%E7%A0%81%E9%98%85%E8%AF%BB%E5%BC%80%E5%A7%8B/</link>
<category term="数据库" scheme="https://songlinlife.top/categories/%E6%95%B0%E6%8D%AE%E5%BA%93/" />
<category term="ANNS" scheme="https://songlinlife.top/tags/ANNS/" />
<pubDate>Mon, 25 Apr 2022 17:09:26 +0800</pubDate>
<description><![CDATA[ <h3 id="编译"><a class="anchor" href="#编译">#</a> Build</h3>
<figure class="highlight cmake"><figcaption data-lang="CMake"></figcaption><table><tr><td data-num="1"></td><td><pre>cmake -DCMAKE_BUILD_TYPE=Debug -B build</pre></td></tr><tr><td data-num="2"></td><td><pre>cmake --build build -j <span class="token number">8</span></pre></td></tr></table></figure><p>If the build fails with <code>CMake is not able to find BOOST libraries</code> (see https://stackoverflow.com/questions/24173330/cmake-is-not-able-to-find-boost-libraries), install the missing packages with <code>sudo apt-get install cmake libblkid-dev e2fslibs-dev libboost-all-dev libaudit-dev</code>.</p>
<h3 id="基于内存方案"><a class="anchor" href="#基于内存方案">#</a> In-memory approach</h3>
<p>Parameters for building the index:</p>
<p>1. data_type, 2. dist_fn, 3. data_file, 4. index_path_prefix, 5. max_degree, 6. size_of_build_search_list, 7. search_DRAM_budget, 8. build_DRAM_budget, 9. num_thread, 10. PQ_disk_bytes</p>
<h4 id="index构建函数"><a class="anchor" href="#index构建函数">#</a> Index constructor</h4>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">template</span><span class="token operator">&lt;</span><span class="token keyword">typename</span> <span class="token class-name">T</span><span class="token punctuation">,</span> <span class="token keyword">typename</span> <span class="token class-name">TagT</span><span class="token operator">></span></pre></td></tr><tr><td data-num="2"></td><td><pre> <span class="token class-name">Index</span><span class="token operator">&lt;</span>T<span class="token punctuation">,</span> TagT<span class="token operator">></span><span class="token double-colon punctuation">::</span><span class="token function">Index</span><span class="token punctuation">(</span>Metric m<span class="token punctuation">,</span> <span class="token keyword">const</span> size_t dim<span class="token punctuation">,</span> <span class="token keyword">const</span> size_t max_points<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="3"></td><td><pre> <span class="token keyword">const</span> <span class="token keyword">bool</span> dynamic_index<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="4"></td><td><pre> <span class="token keyword">const</span> Parameters <span class="token operator">&amp;</span>indexParams<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="5"></td><td><pre> <span class="token keyword">const</span> Parameters <span class="token operator">&amp;</span>searchParams<span class="token punctuation">,</span> <span class="token keyword">const</span> <span class="token keyword">bool</span> enable_tags<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="6"></td><td><pre> <span class="token keyword">const</span> <span class="token keyword">bool</span> support_eager_delete<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="7"></td><td><pre> <span 
class="token operator">:</span> <span class="token function">Index</span><span class="token punctuation">(</span>m<span class="token punctuation">,</span> dim<span class="token punctuation">,</span> max_points<span class="token punctuation">,</span> dynamic_index<span class="token punctuation">,</span> enable_tags<span class="token punctuation">,</span> support_eager_delete<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span> <span class="token comment">// Thank you C++ 11!</span></pre></td></tr><tr><td data-num="8"></td><td><pre> _indexingQueueSize <span class="token operator">=</span> indexParams<span class="token punctuation">.</span><span class="token generic-function"><span class="token function">Get</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">uint32_t</span><span class="token operator">></span></span></span><span class="token punctuation">(</span><span class="token string">"L"</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="9"></td><td><pre> _indexingRange <span class="token operator">=</span> indexParams<span class="token punctuation">.</span><span class="token generic-function"><span class="token function">Get</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">uint32_t</span><span class="token operator">></span></span></span><span class="token punctuation">(</span><span class="token string">"R"</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="10"></td><td><pre> _indexingMaxC <span class="token operator">=</span> indexParams<span class="token punctuation">.</span><span class="token generic-function"><span class="token function">Get</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token 
keyword">uint32_t</span><span class="token operator">></span></span></span><span class="token punctuation">(</span><span class="token string">"C"</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="11"></td><td><pre> _indexingAlpha <span class="token operator">=</span> indexParams<span class="token punctuation">.</span><span class="token generic-function"><span class="token function">Get</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">float</span><span class="token operator">></span></span></span><span class="token punctuation">(</span><span class="token string">"alpha"</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="12"></td><td><pre></pre></td></tr><tr><td data-num="13"></td><td><pre> <span class="token keyword">uint32_t</span> num_threads_srch <span class="token operator">=</span> searchParams<span class="token punctuation">.</span><span class="token generic-function"><span class="token function">Get</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">uint32_t</span><span class="token operator">></span></span></span><span class="token punctuation">(</span><span class="token string">"num_threads"</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="14"></td><td><pre> <span class="token keyword">uint32_t</span> num_threads_indx <span class="token operator">=</span> indexParams<span class="token punctuation">.</span><span class="token generic-function"><span class="token function">Get</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">uint32_t</span><span class="token operator">></span></span></span><span class="token punctuation">(</span><span class="token 
string">"num_threads"</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="15"></td><td><pre> <span class="token keyword">uint32_t</span> num_threads <span class="token operator">=</span> <span class="token function">diskann_max</span><span class="token punctuation">(</span>num_threads_srch<span class="token punctuation">,</span> num_threads_indx<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="16"></td><td><pre> <span class="token keyword">uint32_t</span> search_l <span class="token operator">=</span> searchParams<span class="token punctuation">.</span><span class="token generic-function"><span class="token function">Get</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">uint32_t</span><span class="token operator">></span></span></span><span class="token punctuation">(</span><span class="token string">"L"</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="17"></td><td><pre></pre></td></tr><tr><td data-num="18"></td><td><pre> <span class="token function">initialize_query_scratch</span><span class="token punctuation">(</span>num_threads<span class="token punctuation">,</span> search_l<span class="token punctuation">,</span> _indexingQueueSize<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="19"></td><td><pre> _indexingRange<span class="token punctuation">,</span> dim<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="20"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="21"></td><td><pre></pre></td></tr><tr><td data-num="22"></td><td><pre> <span class="token keyword">template</span><span class="token operator">&lt;</span><span class="token keyword">typename</span> <span 
class="token class-name">T</span><span class="token punctuation">,</span> <span class="token keyword">typename</span> <span class="token class-name">TagT</span><span class="token operator">></span></pre></td></tr><tr><td data-num="23"></td><td><pre> <span class="token class-name">Index</span><span class="token operator">&lt;</span>T<span class="token punctuation">,</span> TagT<span class="token operator">></span><span class="token double-colon punctuation">::</span><span class="token function">Index</span><span class="token punctuation">(</span>Metric m<span class="token punctuation">,</span> <span class="token keyword">const</span> size_t dim<span class="token punctuation">,</span> <span class="token keyword">const</span> size_t max_points<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="24"></td><td><pre> <span class="token keyword">const</span> <span class="token keyword">bool</span> dynamic_index<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="25"></td><td><pre> <span class="token keyword">const</span> <span class="token keyword">bool</span> enable_tags<span class="token punctuation">,</span> <span class="token keyword">const</span> <span class="token keyword">bool</span> support_eager_delete<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="26"></td><td><pre> <span class="token operator">:</span> <span class="token function">_dist_metric</span><span class="token punctuation">(</span>m<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">_dim</span><span class="token punctuation">(</span>dim<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">_max_points</span><span class="token punctuation">(</span>max_points<span class="token punctuation">)</span><span class="token punctuation">,</span></pre></td></tr><tr><td data-num="27"></td><td><pre> <span class="token 
function">_dynamic_index</span><span class="token punctuation">(</span>dynamic_index<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">_enable_tags</span><span class="token punctuation">(</span>enable_tags<span class="token punctuation">)</span><span class="token punctuation">,</span></pre></td></tr><tr><td data-num="28"></td><td><pre> <span class="token function">_support_eager_delete</span><span class="token punctuation">(</span>support_eager_delete<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="29"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>dynamic_index <span class="token operator">&amp;&amp;</span> <span class="token operator">!</span>enable_tags<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="30"></td><td><pre> <span class="token keyword">throw</span> diskann<span class="token double-colon punctuation">::</span><span class="token function">ANNException</span><span class="token punctuation">(</span></pre></td></tr><tr><td data-num="31"></td><td><pre> <span class="token string">"ERROR: Eager Deletes must have Dynamic Indexing enabled."</span><span class="token punctuation">,</span> <span class="token operator">-</span><span class="token number">1</span><span class="token punctuation">,</span></pre></td></tr><tr><td data-num="32"></td><td><pre> __FUNCSIG__<span class="token punctuation">,</span> <span class="token constant">__FILE__</span><span class="token punctuation">,</span> <span class="token constant">__LINE__</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="33"></td><td><pre> diskann<span class="token double-colon punctuation">::</span>cerr</pre></td></tr><tr><td data-num="34"></td><td><pre> <span class="token operator">&lt;&lt;</span> 
<span class="token string">"WARNING: Dynamic Indices must have tags enabled. Auto-enabling."</span></pre></td></tr><tr><td data-num="35"></td><td><pre> <span class="token operator">&lt;&lt;</span> std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="36"></td><td><pre> _enable_tags <span class="token operator">=</span> <span class="token boolean">true</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="37"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="38"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>support_eager_delete <span class="token operator">&amp;&amp;</span> <span class="token operator">!</span>dynamic_index<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="39"></td><td><pre> diskann<span class="token double-colon punctuation">::</span>cerr <span class="token operator">&lt;&lt;</span> <span class="token string">"ERROR: Eager Deletes must have Dynamic Indexing "</span></pre></td></tr><tr><td data-num="40"></td><td><pre> <span class="token string">"enabled. Exitting."</span></pre></td></tr><tr><td data-num="41"></td><td><pre> <span class="token operator">&lt;&lt;</span> std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="42"></td><td><pre> <span class="token keyword">throw</span> diskann<span class="token double-colon punctuation">::</span><span class="token function">ANNException</span><span class="token punctuation">(</span></pre></td></tr><tr><td data-num="43"></td><td><pre> <span class="token string">"ERROR: Eager deletes are possible only if dynamic indexing is "</span></pre></td></tr><tr><td data-num="44"></td><td><pre> <span class="token string">"enabled. 
Exiting."</span><span class="token punctuation">,</span></pre></td></tr><tr><td data-num="45"></td><td><pre> <span class="token operator">-</span><span class="token number">1</span><span class="token punctuation">,</span> __FUNCSIG__<span class="token punctuation">,</span> <span class="token constant">__FILE__</span><span class="token punctuation">,</span> <span class="token constant">__LINE__</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="46"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="47"></td><td><pre> <span class="token comment">// data is stored to _nd * aligned_dim matrix with necessary</span></pre></td></tr><tr><td data-num="48"></td><td><pre> <span class="token comment">// zero-padding</span></pre></td></tr><tr><td data-num="49"></td><td><pre> _aligned_dim <span class="token operator">=</span> <span class="token function">ROUND_UP</span><span class="token punctuation">(</span>_dim<span class="token punctuation">,</span> <span class="token number">8</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="50"></td><td><pre></pre></td></tr><tr><td data-num="51"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>dynamic_index<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="52"></td><td><pre> _num_frozen_pts <span class="token operator">=</span> <span class="token number">1</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="53"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="54"></td><td><pre> <span class="token comment">// Sanity check. 
While logically it is correct, max_points ==0 causes</span></pre></td></tr><tr><td data-num="55"></td><td><pre> <span class="token comment">// downstream problems.</span></pre></td></tr><tr><td data-num="56"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>_max_points <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="57"></td><td><pre> _max_points <span class="token operator">=</span> <span class="token number">1</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="58"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="59"></td><td><pre></pre></td></tr><tr><td data-num="60"></td><td><pre> <span class="token function">alloc_aligned</span><span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token keyword">void</span> <span class="token operator">*</span><span class="token operator">*</span><span class="token punctuation">)</span> <span class="token operator">&amp;</span>_data<span class="token punctuation">)</span><span class="token punctuation">,</span></pre></td></tr><tr><td data-num="61"></td><td><pre> <span class="token punctuation">(</span>_max_points <span class="token operator">+</span> _num_frozen_pts<span class="token punctuation">)</span> <span class="token operator">*</span> _aligned_dim <span class="token operator">*</span> <span class="token keyword">sizeof</span><span class="token punctuation">(</span>T<span class="token punctuation">)</span><span class="token punctuation">,</span></pre></td></tr><tr><td data-num="62"></td><td><pre> <span class="token number">8</span> <span class="token operator">*</span> <span class="token keyword">sizeof</span><span class="token punctuation">(</span>T<span class="token 
punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="63"></td><td><pre> std<span class="token double-colon punctuation">::</span><span class="token function">memset</span><span class="token punctuation">(</span>_data<span class="token punctuation">,</span> <span class="token number">0</span><span class="token punctuation">,</span></pre></td></tr><tr><td data-num="64"></td><td><pre> <span class="token punctuation">(</span>_max_points <span class="token operator">+</span> _num_frozen_pts<span class="token punctuation">)</span> <span class="token operator">*</span> _aligned_dim <span class="token operator">*</span> <span class="token keyword">sizeof</span><span class="token punctuation">(</span>T<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="65"></td><td><pre></pre></td></tr><tr><td data-num="66"></td><td><pre> _ep <span class="token operator">=</span> <span class="token punctuation">(</span><span class="token keyword">unsigned</span><span class="token punctuation">)</span> _max_points<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="67"></td><td><pre></pre></td></tr><tr><td data-num="68"></td><td><pre> <span class="token comment">//_final_graph.reserve(_max_points + _num_frozen_pts);</span></pre></td></tr><tr><td data-num="69"></td><td><pre> _final_graph<span class="token punctuation">.</span><span class="token function">resize</span><span class="token punctuation">(</span>_max_points <span class="token operator">+</span> _num_frozen_pts<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="70"></td><td><pre></pre></td></tr><tr><td data-num="71"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>_support_eager_delete<span class="token punctuation">)</span> 
<span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="72"></td><td><pre> _in_graph<span class="token punctuation">.</span><span class="token function">reserve</span><span class="token punctuation">(</span>_max_points <span class="token operator">+</span> _num_frozen_pts<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="73"></td><td><pre> _in_graph<span class="token punctuation">.</span><span class="token function">resize</span><span class="token punctuation">(</span>_max_points <span class="token operator">+</span> _num_frozen_pts<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="74"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="75"></td><td><pre></pre></td></tr><tr><td data-num="76"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>m <span class="token operator">==</span> diskann<span class="token double-colon punctuation">::</span>Metric<span class="token double-colon punctuation">::</span>COSINE <span class="token operator">&amp;&amp;</span> std<span class="token double-colon punctuation">::</span>is_floating_point<span class="token operator">&lt;</span>T<span class="token operator">></span><span class="token double-colon punctuation">::</span>value<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="77"></td><td><pre> <span class="token comment">// This is safe because T is float inside the if block.</span></pre></td></tr><tr><td data-num="78"></td><td><pre> <span class="token keyword">this</span><span class="token operator">-></span>_distance <span class="token operator">=</span> <span class="token punctuation">(</span>Distance<span class="token operator">&lt;</span>T<span class="token operator">></span> <span class="token operator">*</span><span 
class="token punctuation">)</span> <span class="token keyword">new</span> <span class="token function">AVXNormalizedCosineDistanceFloat</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> </pre></td></tr><tr><td data-num="79"></td><td><pre> <span class="token keyword">this</span><span class="token operator">-></span>_normalize_vecs <span class="token operator">=</span> <span class="token boolean">true</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="80"></td><td><pre> std<span class="token double-colon punctuation">::</span>cout<span class="token operator">&lt;&lt;</span><span class="token string">"Normalizing vectors and using L2 for cosine AVXNormalizedCosineDistanceFloat()."</span> <span class="token operator">&lt;&lt;</span> std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="81"></td><td><pre> <span class="token comment">// std::cout &lt;&lt; "Need to add functionality for COSINE metric" &lt;&lt; std::endl;</span></pre></td></tr><tr><td data-num="82"></td><td><pre> <span class="token punctuation">&#125;</span> <span class="token keyword">else</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="83"></td><td><pre> <span class="token keyword">this</span><span class="token operator">-></span>_distance <span class="token operator">=</span> <span class="token generic-function"><span class="token function">get_distance_function</span><span class="token generic class-name"><span class="token operator">&lt;</span>T<span class="token operator">></span></span></span><span class="token punctuation">(</span>m<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="84"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td 
data-num="85"></td><td><pre></pre></td></tr><tr><td data-num="86"></td><td><pre> _locks <span class="token operator">=</span> std<span class="token double-colon punctuation">::</span><span class="token generic-function"><span class="token function">vector</span><span class="token generic class-name"><span class="token operator">&lt;</span>std<span class="token double-colon punctuation">::</span>mutex<span class="token operator">></span></span></span><span class="token punctuation">(</span>_max_points <span class="token operator">+</span> _num_frozen_pts<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="87"></td><td><pre></pre></td></tr><tr><td data-num="88"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>_support_eager_delete<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="89"></td><td><pre> _locks_in <span class="token operator">=</span> std<span class="token double-colon punctuation">::</span><span class="token generic-function"><span class="token function">vector</span><span class="token generic class-name"><span class="token operator">&lt;</span>std<span class="token double-colon punctuation">::</span>mutex<span class="token operator">></span></span></span><span class="token punctuation">(</span>_max_points <span class="token operator">+</span> _num_frozen_pts<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="90"></td><td><pre></pre></td></tr><tr><td data-num="91"></td><td><pre> _width <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="92"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="93"></td><td><pre></pre></td></tr><tr><td data-num="94"></td><td><pre> <span class="token keyword">template</span><span class="token 
operator">&lt;</span><span class="token keyword">typename</span> <span class="token class-name">T</span><span class="token punctuation">,</span> <span class="token keyword">typename</span> <span class="token class-name">TagT</span><span class="token operator">></span></pre></td></tr><tr><td data-num="95"></td><td><pre> Index<span class="token operator">&lt;</span>T<span class="token punctuation">,</span> TagT<span class="token operator">></span><span class="token double-colon punctuation">::</span><span class="token operator">~</span><span class="token function">Index</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="96"></td><td><pre> <span class="token comment">// Ensure that no other activity is happening before dtor()</span></pre></td></tr><tr><td data-num="97"></td><td><pre> std<span class="token double-colon punctuation">::</span>unique_lock<span class="token operator">&lt;</span>std<span class="token double-colon punctuation">::</span>shared_timed_mutex<span class="token operator">></span> <span class="token function">ul</span><span class="token punctuation">(</span>_update_lock<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="98"></td><td><pre> std<span class="token double-colon punctuation">::</span>unique_lock<span class="token operator">&lt;</span>std<span class="token double-colon punctuation">::</span>shared_timed_mutex<span class="token operator">></span> <span class="token function">tul</span><span class="token punctuation">(</span>_tag_lock<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="99"></td><td><pre> std<span class="token double-colon punctuation">::</span>unique_lock<span class="token operator">&lt;</span>std<span class="token double-colon punctuation">::</span>shared_timed_mutex<span class="token 
operator">></span> <span class="token function">tdl</span><span class="token punctuation">(</span>_delete_lock<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="100"></td><td><pre></pre></td></tr><tr><td data-num="101"></td><td><pre> <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">auto</span> <span class="token operator">&amp;</span>lock <span class="token operator">:</span> _locks<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="102"></td><td><pre> LockGuard <span class="token function">lg</span><span class="token punctuation">(</span>lock<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="103"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="104"></td><td><pre> <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">auto</span> <span class="token operator">&amp;</span>lock <span class="token operator">:</span> _locks_in<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="105"></td><td><pre> LockGuard <span class="token function">lg</span><span class="token punctuation">(</span>lock<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="106"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="107"></td><td><pre></pre></td></tr><tr><td data-num="108"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token keyword">this</span><span class="token operator">-></span>_distance <span class="token operator">!=</span> <span class="token keyword">nullptr</span><span class="token punctuation">)</span> <span 
class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="109"></td><td><pre> <span class="token keyword">delete</span> <span class="token keyword">this</span><span class="token operator">-></span>_distance<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="110"></td><td><pre> <span class="token keyword">this</span><span class="token operator">-></span>_distance <span class="token operator">=</span> <span class="token keyword">nullptr</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="111"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="112"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token keyword">this</span><span class="token operator">-></span>_data <span class="token operator">!=</span> <span class="token keyword">nullptr</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="113"></td><td><pre> <span class="token function">aligned_free</span><span class="token punctuation">(</span><span class="token keyword">this</span><span class="token operator">-></span>_data<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="114"></td><td><pre> <span class="token keyword">this</span><span class="token operator">-></span>_data <span class="token operator">=</span> <span class="token keyword">nullptr</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="115"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="116"></td><td><pre></pre></td></tr><tr><td data-num="117"></td><td><pre> <span class="token keyword">while</span> <span class="token punctuation">(</span><span class="token operator">!</span>_query_scratch<span class="token punctuation">.</span><span class="token function">empty</span><span 
class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="118"></td><td><pre> <span class="token keyword">auto</span> val <span class="token operator">=</span> _query_scratch<span class="token punctuation">.</span><span class="token function">pop</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="119"></td><td><pre> <span class="token keyword">while</span> <span class="token punctuation">(</span>val<span class="token punctuation">.</span>indices <span class="token operator">==</span> <span class="token keyword">nullptr</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="120"></td><td><pre> _query_scratch<span class="token punctuation">.</span><span class="token function">wait_for_push_notify</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="121"></td><td><pre> val <span class="token operator">=</span> _query_scratch<span class="token punctuation">.</span><span class="token function">pop</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="122"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="123"></td><td><pre> val<span class="token punctuation">.</span><span class="token function">destroy</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="124"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="125"></td><td><pre> <span class="token 
punctuation">&#125;</span></pre></td></tr><tr><td data-num="126"></td><td><pre></pre></td></tr><tr><td data-num="127"></td><td><pre> _locks <span class="token operator">=</span> std<span class="token double-colon punctuation">::</span><span class="token generic-function"><span class="token function">vector</span><span class="token generic class-name"><span class="token operator">&lt;</span>std<span class="token double-colon punctuation">::</span>mutex<span class="token operator">></span></span></span><span class="token punctuation">(</span>_max_points <span class="token operator">+</span> _num_frozen_pts<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="128"></td><td><pre></pre></td></tr><tr><td data-num="129"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>_support_eager_delete<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="130"></td><td><pre> _locks_in <span class="token operator">=</span> std<span class="token double-colon punctuation">::</span><span class="token generic-function"><span class="token function">vector</span><span class="token generic class-name"><span class="token operator">&lt;</span>std<span class="token double-colon punctuation">::</span>mutex<span class="token operator">></span></span></span><span class="token punctuation">(</span>_max_points <span class="token operator">+</span> _num_frozen_pts<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="131"></td><td><pre></pre></td></tr><tr><td data-num="132"></td><td><pre> _width <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="133"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr></table></figure><p>Open questions: what is search_invocation, and what is _dynamic_index?</p>
<p>Edge selection strategy:</p>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">template</span><span class="token operator">&lt;</span><span class="token keyword">typename</span> <span class="token class-name">T</span><span class="token punctuation">,</span> <span class="token keyword">typename</span> <span class="token class-name">TagT</span><span class="token operator">></span></pre></td></tr><tr><td data-num="2"></td><td><pre> <span class="token keyword">void</span> <span class="token class-name">Index</span><span class="token operator">&lt;</span>T<span class="token punctuation">,</span> TagT<span class="token operator">></span><span class="token double-colon punctuation">::</span><span class="token function">occlude_list</span><span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span>vector<span class="token operator">&lt;</span>Neighbor<span class="token operator">></span> <span class="token operator">&amp;</span>pool<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="3"></td><td><pre> <span class="token keyword">const</span> <span class="token keyword">float</span> alpha<span class="token punctuation">,</span> <span class="token keyword">const</span> <span class="token keyword">unsigned</span> degree<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="4"></td><td><pre> <span class="token keyword">const</span> <span class="token keyword">unsigned</span> maxc<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="5"></td><td><pre> std<span class="token double-colon punctuation">::</span>vector<span class="token operator">&lt;</span>Neighbor<span class="token operator">></span> <span class="token operator">&amp;</span>result<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="6"></td><td><pre> std<span class="token double-colon punctuation">::</span>vector<span class="token 
operator">&lt;</span><span class="token keyword">float</span><span class="token operator">></span> <span class="token operator">&amp;</span> occlude_factor<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="7"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>pool<span class="token punctuation">.</span><span class="token function">empty</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="8"></td><td><pre> <span class="token keyword">return</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="9"></td><td><pre> <span class="token function">assert</span><span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span><span class="token function">is_sorted</span><span class="token punctuation">(</span>pool<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> pool<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="10"></td><td><pre> <span class="token function">assert</span><span class="token punctuation">(</span><span class="token operator">!</span>pool<span class="token punctuation">.</span><span class="token function">empty</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="11"></td><td><pre></pre></td></tr><tr><td data-num="12"></td><td><pre> 
<span class="token keyword">float</span> cur_alpha <span class="token operator">=</span> <span class="token number">1</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="13"></td><td><pre> <span class="token keyword">while</span> <span class="token punctuation">(</span>cur_alpha <span class="token operator">&lt;=</span> alpha <span class="token operator">&amp;&amp;</span> result<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">&lt;</span> degree<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="14"></td><td><pre> <span class="token keyword">unsigned</span> start <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="15"></td><td><pre> <span class="token keyword">float</span> eps <span class="token operator">=</span> cur_alpha <span class="token operator">+</span> <span class="token number">0.01f</span><span class="token punctuation">;</span> <span class="token comment">// used for MIPS, where we store a value</span></pre></td></tr><tr><td data-num="16"></td><td><pre> <span class="token comment">// of eps in cur_alpha to</span></pre></td></tr><tr><td data-num="17"></td><td><pre> <span class="token comment">// denote pruned out entries which we can skip in later rounds.</span></pre></td></tr><tr><td data-num="18"></td><td><pre> <span class="token keyword">while</span> <span class="token punctuation">(</span>result<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">&lt;</span> degree <span class="token operator">&amp;&amp;</span> <span class="token punctuation">(</span>start<span class="token 
punctuation">)</span> <span class="token operator">&lt;</span> pool<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">&amp;&amp;</span> start <span class="token operator">&lt;</span> maxc<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="19"></td><td><pre> <span class="token keyword">auto</span> <span class="token operator">&amp;</span>p <span class="token operator">=</span> pool<span class="token punctuation">[</span>start<span class="token punctuation">]</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="20"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>occlude_factor<span class="token punctuation">[</span>start<span class="token punctuation">]</span> <span class="token operator">></span> cur_alpha<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="21"></td><td><pre> start<span class="token operator">++</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="22"></td><td><pre> <span class="token keyword">continue</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="23"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="24"></td><td><pre> occlude_factor<span class="token punctuation">[</span>start<span class="token punctuation">]</span> <span class="token operator">=</span> std<span class="token double-colon punctuation">::</span><span class="token class-name">numeric_limits</span><span class="token operator">&lt;</span><span class="token keyword">float</span><span class="token operator">></span><span class="token double-colon punctuation">::</span><span class="token function">max</span><span class="token 
punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="25"></td><td><pre> result<span class="token punctuation">.</span><span class="token function">push_back</span><span class="token punctuation">(</span>p<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="26"></td><td><pre> <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">unsigned</span> t <span class="token operator">=</span> start <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">;</span> t <span class="token operator">&lt;</span> pool<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">&amp;&amp;</span> t <span class="token operator">&lt;</span> maxc<span class="token punctuation">;</span> t<span class="token operator">++</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="27"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>occlude_factor<span class="token punctuation">[</span>t<span class="token punctuation">]</span> <span class="token operator">></span> alpha<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="28"></td><td><pre> <span class="token keyword">continue</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="29"></td><td><pre> <span class="token keyword">float</span> djk <span class="token operator">=</span> _distance<span class="token operator">-></span><span class="token function">compare</span><span class="token punctuation">(</span></pre></td></tr><tr><td data-num="30"></td><td><pre> _data <span class="token operator">+</span> _aligned_dim <span 
class="token operator">*</span> <span class="token punctuation">(</span>size_t<span class="token punctuation">)</span> pool<span class="token punctuation">[</span>t<span class="token punctuation">]</span><span class="token punctuation">.</span>id<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="31"></td><td><pre> _data <span class="token operator">+</span> _aligned_dim <span class="token operator">*</span> <span class="token punctuation">(</span>size_t<span class="token punctuation">)</span> p<span class="token punctuation">.</span>id<span class="token punctuation">,</span> <span class="token punctuation">(</span><span class="token keyword">unsigned</span><span class="token punctuation">)</span> _aligned_dim<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="32"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>_dist_metric <span class="token operator">==</span> diskann<span class="token double-colon punctuation">::</span>Metric<span class="token double-colon punctuation">::</span>L2 <span class="token operator">||</span></pre></td></tr><tr><td data-num="33"></td><td><pre> _dist_metric <span class="token operator">==</span> diskann<span class="token double-colon punctuation">::</span>Metric<span class="token double-colon punctuation">::</span>COSINE<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="34"></td><td><pre> occlude_factor<span class="token punctuation">[</span>t<span class="token punctuation">]</span> <span class="token operator">=</span></pre></td></tr><tr><td data-num="35"></td><td><pre> <span class="token comment">//(std::max)(occlude_factor[t], pool[t].distance / djk);</span></pre></td></tr><tr><td data-num="36"></td><td><pre> <span class="token function">diskann_max</span><span class="token punctuation">(</span>occlude_factor<span class="token 
punctuation">[</span>t<span class="token punctuation">]</span><span class="token punctuation">,</span> pool<span class="token punctuation">[</span>t<span class="token punctuation">]</span><span class="token punctuation">.</span>distance <span class="token operator">/</span> djk<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="37"></td><td><pre> <span class="token punctuation">&#125;</span> <span class="token keyword">else</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>_dist_metric <span class="token operator">==</span></pre></td></tr><tr><td data-num="38"></td><td><pre> diskann<span class="token double-colon punctuation">::</span>Metric<span class="token double-colon punctuation">::</span>INNER_PRODUCT<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span> <span class="token comment">// stylized rules for</span></pre></td></tr><tr><td data-num="39"></td><td><pre> <span class="token comment">// inner product since</span></pre></td></tr><tr><td data-num="40"></td><td><pre> <span class="token comment">// we want max instead</span></pre></td></tr><tr><td data-num="41"></td><td><pre> <span class="token comment">// of min distance</span></pre></td></tr><tr><td data-num="42"></td><td><pre> <span class="token keyword">float</span> x <span class="token operator">=</span> <span class="token operator">-</span>pool<span class="token punctuation">[</span>t<span class="token punctuation">]</span><span class="token punctuation">.</span>distance<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="43"></td><td><pre> <span class="token keyword">float</span> y <span class="token operator">=</span> <span class="token operator">-</span>djk<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="44"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>y <span class="token 
operator">></span> cur_alpha <span class="token operator">*</span> x<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="45"></td><td><pre> occlude_factor<span class="token punctuation">[</span>t<span class="token punctuation">]</span> <span class="token operator">=</span></pre></td></tr><tr><td data-num="46"></td><td><pre> <span class="token comment">/* (std::max)*/</span> <span class="token function">diskann_max</span><span class="token punctuation">(</span>occlude_factor<span class="token punctuation">[</span>t<span class="token punctuation">]</span><span class="token punctuation">,</span> eps<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="47"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="48"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="49"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="50"></td><td><pre> start<span class="token operator">++</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="51"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="52"></td><td><pre> cur_alpha <span class="token operator">*=</span> <span class="token number">1.2</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="53"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="54"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr></table></figure><p>One interesting detail here: cur_alpha starts at 1 and is multiplied by 1.2 after every pass, so the pruning runs in rounds with a progressively looser threshold until cur_alpha exceeds alpha or result holds degree neighbors.</p>
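<p>The multi-pass idea can be shown in a simplified, L2-only sketch. It is not DiskANN's code: the Candidate struct, the l2sq helper and the prune_neighbors function are hypothetical names introduced for illustration, and maxc and the inner-product branch are left out. The first pass (cur_alpha = 1) keeps only candidates that no selected neighbor occludes; if fewer than degree neighbors were found, the threshold is relaxed by a factor of 1.2 and candidates rejected earlier get another chance.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for DiskANN's Neighbor: a candidate id plus its
// distance to the point whose neighbor list is being built.
struct Candidate {
  unsigned id;
  float    distance;
};

// Squared L2 distance between two vectors of equal length.
inline float l2sq(const std::vector<float> &a, const std::vector<float> &b) {
  assert(a.size() == b.size());
  float sum = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const float diff = a[i] - b[i];
    sum += diff * diff;
  }
  return sum;
}

// Simplified multi-pass pruning. `pool` must be sorted by ascending distance.
inline std::vector<unsigned> prune_neighbors(
    const std::vector<std::vector<float>> &points,
    const std::vector<Candidate> &pool, float alpha, unsigned degree) {
  const float kChosen = std::numeric_limits<float>::max();
  std::vector<unsigned> result;
  std::vector<float>    occlude_factor(pool.size(), 0.0f);

  float cur_alpha = 1.0f;
  while (cur_alpha <= alpha && result.size() < degree) {
    for (std::size_t s = 0; s < pool.size() && result.size() < degree; ++s) {
      // Skip candidates already chosen or still occluded at this threshold.
      if (occlude_factor[s] > cur_alpha) continue;
      occlude_factor[s] = kChosen;
      result.push_back(pool[s].id);

      // Update how strongly the new neighbor occludes each later candidate.
      for (std::size_t t = s + 1; t < pool.size(); ++t) {
        if (occlude_factor[t] > alpha) continue;  // pruned for every pass
        const float djk = l2sq(points[pool[t].id], points[pool[s].id]);
        if (djk == 0.0f) {  // duplicate point; this guard is an addition of the sketch
          occlude_factor[t] = kChosen;
          continue;
        }
        occlude_factor[t] = std::max(occlude_factor[t], pool[t].distance / djk);
      }
    }
    cur_alpha *= 1.2f;  // relax the threshold for the next pass
  }
  return result;
}
```

<p>For example, with the query at 0 and two candidates on a line at 1 and 12, the farther candidate has an occlusion factor of 144 / 121 ≈ 1.19. It is dropped when alpha = 1 and admitted in the second pass when alpha = 1.2.</p>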
<h3 id="build_disk"><a class="anchor" href="#build_disk">#</a> build_disk</h3>
<p>Entry point for building the index:</p>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">int</span> <span class="token function">build_disk_index</span><span class="token punctuation">(</span><span class="token keyword">const</span> <span class="token keyword">char</span> <span class="token operator">*</span>dataFilePath<span class="token punctuation">,</span> <span class="token keyword">const</span> <span class="token keyword">char</span> <span class="token operator">*</span>indexFilePath<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="2"></td><td><pre> <span class="token keyword">const</span> <span class="token keyword">char</span> <span class="token operator">*</span> indexBuildParameters<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="3"></td><td><pre> diskann<span class="token double-colon punctuation">::</span>Metric compareMetric<span class="token punctuation">)</span></pre></td></tr></table></figure><p>dataFilePath: the base data file (base_bin); indexFilePath: the path prefix under which the index is saved; indexBuildParameters: the parameters for building the index; compareMetric: usually L2.</p>
<p>A series of paths is derived here, which I note down first:</p>
<pre><code class="language-c++">std::string base_file(dataFilePath); // base
std::string data_file_to_use = base_file; // base
std::string index_prefix_path(indexFilePath); // index prefix
std::string pq_pivots_path = index_prefix_path + &quot;_pq_pivots.bin&quot;; // stores the PQ pivots
std::string pq_compressed_vectors_path =
    index_prefix_path + &quot;_pq_compressed.bin&quot;; // stores the PQ-compressed vectors
std::string mem_index_path = index_prefix_path + &quot;_mem.index&quot;; // stores the in-memory index
std::string disk_index_path = index_prefix_path + &quot;_disk.index&quot;;
std::string medoids_path = disk_index_path + &quot;_medoids.bin&quot;;
std::string centroids_path = disk_index_path + &quot;_centroids.bin&quot;;
std::string sample_base_prefix = index_prefix_path + &quot;_sample&quot;;
</code></pre>
<p>Determining the number of PQ chunks:</p>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">double</span> final_index_ram_limit <span class="token operator">=</span> <span class="token function">get_memory_budget</span><span class="token punctuation">(</span>param_list<span class="token punctuation">[</span><span class="token number">2</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="2"></td><td><pre> size_t num_pq_chunks <span class="token operator">=</span></pre></td></tr><tr><td data-num="3"></td><td><pre> <span class="token punctuation">(</span>size_t<span class="token punctuation">)</span><span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span>floor<span class="token punctuation">)</span><span class="token punctuation">(</span><span class="token function">_u64</span><span class="token punctuation">(</span>final_index_ram_limit <span class="token operator">/</span> points_num<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr></table></figure><p>Randomly sample the training data (uniform (0, 1) distribution):</p>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">double</span> p_val <span class="token operator">=</span> <span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token keyword">double</span><span class="token punctuation">)</span> MAX_PQ_TRAINING_SET_SIZE <span class="token operator">/</span> <span class="token punctuation">(</span><span class="token keyword">double</span><span class="token punctuation">)</span> points_num<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// MAX_PQ_TRAINING_SET_SIZE is 256000L, i.e. at most this many points are used for training</span></pre></td></tr><tr><td data-num="2"></td><td><pre><span class="token generic-function"><span class="token function">gen_random_slice</span><span class="token generic class-name"><span class="token operator">&lt;</span>T<span class="token operator">></span></span></span><span class="token punctuation">(</span>data_file_to_use<span class="token punctuation">.</span><span class="token function">c_str</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> p_val<span class="token punctuation">,</span> train_data<span class="token punctuation">,</span> train_size<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="3"></td><td><pre> train_dim<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// sample according to p_val; only 10000 points are used here, so p_val > 1 and all of the data is used for training</span></pre></td></tr></table></figure><p>PQ training:</p>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token function">generate_pq_pivots</span><span class="token punctuation">(</span>train_data<span class="token punctuation">,</span> train_size<span class="token punctuation">,</span> <span class="token punctuation">(</span><span class="token keyword">uint32_t</span><span class="token punctuation">)</span> dim<span class="token punctuation">,</span> <span class="token number">256</span><span class="token punctuation">,</span></pre></td></tr><tr><td data-num="2"></td><td><pre> <span class="token punctuation">(</span><span class="token keyword">uint32_t</span><span class="token punctuation">)</span> num_pq_chunks<span class="token punctuation">,</span> NUM_KMEANS_REPS<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="3"></td><td><pre> pq_pivots_path<span class="token punctuation">,</span> make_zero_mean<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// for L2, make_zero_mean is true</span></pre></td></tr><tr><td data-num="4"></td><td><pre><span class="token keyword">int</span> <span class="token function">generate_pq_pivots</span><span class="token punctuation">(</span><span class="token keyword">const</span> <span class="token keyword">float</span> <span class="token operator">*</span>passed_train_data<span class="token punctuation">,</span> size_t num_train<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="5"></td><td><pre> <span class="token keyword">unsigned</span> dim<span class="token punctuation">,</span> <span class="token keyword">unsigned</span> num_centers<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="6"></td><td><pre> <span class="token keyword">unsigned</span> num_pq_chunks<span class="token punctuation">,</span> <span class="token keyword">unsigned</span> max_k_means_reps<span class="token 
punctuation">,</span></pre></td></tr><tr><td data-num="7"></td><td><pre> std<span class="token double-colon punctuation">::</span>string pq_pivots_path<span class="token punctuation">,</span> <span class="token keyword">bool</span> make_zero_mean<span class="token punctuation">)</span> <span class="token comment">// num_centers is fixed at 256; max_k_means_reps is 12</span></pre></td></tr></table></figure><p>If the pq_pivots_path file already exists, it is read directly; otherwise the pivots have to be trained.</p>
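<p>For L2 there is a preprocessing step that concentrates the training data around its center. The SIFT vectors have 128 dimensions, so with 32 pq_chunks the dimensions divide evenly and each chunk covers 4 dimensions. Because each chunk has 256 PQ centers, a center id fits in 8 bits (1 byte): every group of 4 float dimensions is reduced to a single uint8 code.</p>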
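<p>To make the arithmetic concrete: a 128-dimensional float vector takes 128 × 4 = 512 bytes, while its PQ code takes 32 × 1 = 32 bytes. The sketch below encodes one vector given an already trained codebook. The function pq_encode and the flat codebook layout are assumptions made for this example and are not how DiskANN stores its pivots.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical codebook layout, chosen for this sketch only:
// pivots[(chunk * num_centers + center) * chunk_dim + d]
inline std::vector<std::uint8_t> pq_encode(const std::vector<float> &vec,
                                           const std::vector<float> &pivots,
                                           std::size_t num_chunks,
                                           std::size_t num_centers) {
  assert(!vec.empty());
  assert(num_centers >= 1 && num_centers <= 256);  // a center id must fit in one byte
  assert(num_chunks > 0 && vec.size() % num_chunks == 0);
  const std::size_t chunk_dim = vec.size() / num_chunks;  // 128 / 32 = 4 for SIFT
  assert(pivots.size() == num_chunks * num_centers * chunk_dim);

  std::vector<std::uint8_t> code(num_chunks);
  for (std::size_t chunk = 0; chunk < num_chunks; ++chunk) {
    float       best_dist = std::numeric_limits<float>::max();
    std::size_t best_center = 0;
    for (std::size_t c = 0; c < num_centers; ++c) {
      const float *pivot = &pivots[(chunk * num_centers + c) * chunk_dim];
      float dist = 0.0f;
      for (std::size_t d = 0; d < chunk_dim; ++d) {
        const float diff = vec[chunk * chunk_dim + d] - pivot[d];
        dist += diff * diff;
      }
      if (dist < best_dist) {
        best_dist = dist;
        best_center = c;
      }
    }
    code[chunk] = static_cast<std::uint8_t>(best_center);  // one byte per chunk
  }
  return code;
}
```

<p>Each chunk is quantized independently, which is why the k-means training described next runs once per chunk on 4-dimensional slices of the training data.</p>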
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre>size_t cur_chunk_size <span class="token operator">=</span> chunk_offsets<span class="token punctuation">[</span>i <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">]</span> <span class="token operator">-</span> chunk_offsets<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">;</span> <span class="token comment">// this is simply 4 here</span></pre></td></tr><tr><td data-num="2"></td><td><pre>std<span class="token double-colon punctuation">::</span>unique_ptr<span class="token operator">&lt;</span><span class="token keyword">float</span><span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token operator">></span> cur_pivot_data <span class="token operator">=</span></pre></td></tr><tr><td data-num="3"></td><td><pre> std<span class="token double-colon punctuation">::</span><span class="token generic-function"><span class="token function">make_unique</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">float</span><span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token operator">></span></span></span><span class="token punctuation">(</span>num_centers <span class="token operator">*</span> cur_chunk_size<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="4"></td><td><pre>std<span class="token double-colon punctuation">::</span>unique_ptr<span class="token operator">&lt;</span><span class="token keyword">float</span><span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token operator">></span> cur_data <span class="token operator">=</span></pre></td></tr><tr><td data-num="5"></td><td><pre> std<span class="token 
double-colon punctuation">::</span><span class="token generic-function"><span class="token function">make_unique</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">float</span><span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token operator">></span></span></span><span class="token punctuation">(</span>num_train <span class="token operator">*</span> cur_chunk_size<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="6"></td><td><pre>std<span class="token double-colon punctuation">::</span>unique_ptr<span class="token operator">&lt;</span><span class="token keyword">uint32_t</span><span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token operator">></span> closest_center <span class="token operator">=</span></pre></td></tr><tr><td data-num="7"></td><td><pre> std<span class="token double-colon punctuation">::</span><span class="token generic-function"><span class="token function">make_unique</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">uint32_t</span><span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token operator">></span></span></span><span class="token punctuation">(</span>num_train<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="8"></td><td><pre><span class="token comment">// copy the dimensions belonging to this chunk from the training data into cur_data, i.e. 4-dimensional data</span></pre></td></tr></table></figure><p>Selecting the pivots with k-means++:</p>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre>kmeans<span class="token double-colon punctuation">::</span><span class="token function">kmeanspp_selecting_pivots</span><span class="token punctuation">(</span>cur_data<span class="token punctuation">.</span><span class="token function">get</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> num_train<span class="token punctuation">,</span> cur_chunk_size<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="2"></td><td><pre> cur_pivot_data<span class="token punctuation">.</span><span class="token function">get</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> num_centers<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="3"></td><td><pre><span class="token keyword">void</span> <span class="token function">kmeanspp_selecting_pivots</span><span class="token punctuation">(</span><span class="token keyword">float</span><span class="token operator">*</span> data<span class="token punctuation">,</span> size_t num_points<span class="token punctuation">,</span> size_t dim<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="4"></td><td><pre> <span class="token keyword">float</span><span class="token operator">*</span> pivot_data<span class="token punctuation">,</span> size_t num_centers<span class="token punctuation">)</span></pre></td></tr></table></figure><p>First, an initial point is picked at random:</p>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token comment">// pseudocode: pick init_id at random</span></pre></td></tr><tr><td data-num="2"></td><td><pre>picked<span class="token punctuation">.</span><span class="token function">push_back</span><span class="token punctuation">(</span>init_id<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="3"></td><td><pre>std<span class="token double-colon punctuation">::</span><span class="token function">memcpy</span><span class="token punctuation">(</span>pivot_data<span class="token punctuation">,</span> data <span class="token operator">+</span> init_id <span class="token operator">*</span> dim<span class="token punctuation">,</span> dim <span class="token operator">*</span> <span class="token keyword">sizeof</span><span class="token punctuation">(</span><span class="token keyword">float</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// copy into cur_pivot_data; dim is 4 here</span></pre></td></tr></table></figure><p>The distance from every point to this initial point is then computed, and the remaining centers (256 in total) are picked one by one in a randomized, iterative way. To be honest, I do not really understand why it is done this way.</p>
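<p>A likely explanation for the randomized procedure: the function name points to k-means++ seeding, in which each new center is drawn with probability proportional to its squared distance from the nearest center chosen so far, so the initial centers end up spread out over the data. The sketch below shows that idea only. The function kmeanspp_seed, its parameters and the use of std::discrete_distribution are assumptions of this example, not DiskANN's implementation.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <numeric>
#include <random>
#include <vector>

// Picks up to num_centers seed indices from row-major data (num_points x dim).
inline std::vector<std::size_t> kmeanspp_seed(const std::vector<float> &data,
                                              std::size_t num_points,
                                              std::size_t dim,
                                              std::size_t num_centers,
                                              unsigned seed) {
  assert(dim > 0 && data.size() == num_points * dim);
  assert(num_centers >= 1 && num_centers <= num_points);

  std::mt19937 rng(seed);
  std::vector<std::size_t> picked;
  std::uniform_int_distribution<std::size_t> first(0, num_points - 1);
  picked.push_back(first(rng));  // the first center is uniform at random

  // min_dist[i]: squared distance from point i to its nearest picked center.
  std::vector<double> min_dist(num_points, std::numeric_limits<double>::max());
  while (picked.size() < num_centers) {
    // Only the most recently picked center can lower any min_dist entry.
    const std::size_t c = picked.back();
    for (std::size_t i = 0; i < num_points; ++i) {
      double d = 0.0;
      for (std::size_t k = 0; k < dim; ++k) {
        const double diff = data[i * dim + k] - data[c * dim + k];
        d += diff * diff;
      }
      min_dist[i] = std::min(min_dist[i], d);
    }

    const double total = std::accumulate(min_dist.begin(), min_dist.end(), 0.0);
    if (total <= 0.0) break;  // every remaining point coincides with a center

    // D^2 sampling: points that are already centers have weight 0.
    std::discrete_distribution<std::size_t> next(min_dist.begin(), min_dist.end());
    picked.push_back(next(rng));
  }
  return picked;
}
```

<p>Compared with choosing all 256 centers uniformly at random, this kind of seeding makes it less likely that several initial centers land in the same dense region, which tends to help the Lloyd iterations that follow.</p>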
<p>Once the 256 centers have been selected and stored in cur_pivot_data, they are refined with Lloyd's algorithm:</p>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre>kmeans<span class="token double-colon punctuation">::</span><span class="token function">run_lloyds</span><span class="token punctuation">(</span>cur_data<span class="token punctuation">.</span><span class="token function">get</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> num_train<span class="token punctuation">,</span> cur_chunk_size<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="2"></td><td><pre> cur_pivot_data<span class="token punctuation">.</span><span class="token function">get</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> num_centers<span class="token punctuation">,</span> max_k_means_reps<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="3"></td><td><pre> <span class="token constant">NULL</span><span class="token punctuation">,</span> closest_center<span class="token punctuation">.</span><span class="token function">get</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="4"></td><td><pre><span class="token keyword">float</span> <span class="token function">run_lloyds</span><span class="token punctuation">(</span><span class="token keyword">float</span><span class="token operator">*</span> data<span class="token punctuation">,</span> size_t num_points<span class="token punctuation">,</span> size_t dim<span class="token punctuation">,</span> <span class="token keyword">float</span><span class="token operator">*</span> centers<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="5"></td><td><pre> <span class="token keyword">const</span> size_t num_centers<span class="token 
punctuation">,</span> <span class="token keyword">const</span> size_t max_reps<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="6"></td><td><pre> std<span class="token double-colon punctuation">::</span>vector<span class="token operator">&lt;</span>size_t<span class="token operator">></span><span class="token operator">*</span> closest_docs<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="7"></td><td><pre> <span class="token keyword">uint32_t</span><span class="token operator">*</span> closest_center<span class="token punctuation">)</span> <span class="token comment">// closest_docs lists the points that belong to each center, much like an inverted index</span></pre></td></tr><tr><td data-num="8"></td><td><pre><span class="token keyword">float</span><span class="token operator">*</span> docs_l2sq <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token keyword">float</span><span class="token punctuation">[</span>num_points<span class="token punctuation">]</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="9"></td><td><pre>math_utils<span class="token double-colon punctuation">::</span><span class="token function">compute_vecs_l2sq</span><span class="token punctuation">(</span>docs_l2sq<span class="token punctuation">,</span> data<span class="token punctuation">,</span> num_points<span class="token punctuation">,</span> dim<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="10"></td><td><pre><span class="token comment">// iterative training</span></pre></td></tr><tr><td data-num="11"></td><td><pre><span class="token function">lloyds_iter</span><span class="token punctuation">(</span>data<span class="token punctuation">,</span> num_points<span class="token punctuation">,</span> dim<span class="token punctuation">,</span> centers<span class="token punctuation">,</span> num_centers<span class="token punctuation">,</span></pre></td></tr><tr><td 
data-num="12"></td><td><pre> docs_l2sq<span class="token punctuation">,</span> closest_docs<span class="token punctuation">,</span> closest_center<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="13"></td><td><pre><span class="token keyword">float</span> <span class="token function">lloyds_iter</span><span class="token punctuation">(</span><span class="token keyword">float</span><span class="token operator">*</span> data<span class="token punctuation">,</span> size_t num_points<span class="token punctuation">,</span> size_t dim<span class="token punctuation">,</span> <span class="token keyword">float</span><span class="token operator">*</span> centers<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="14"></td><td><pre> size_t num_centers<span class="token punctuation">,</span> <span class="token keyword">float</span><span class="token operator">*</span> docs_l2sq<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="15"></td><td><pre> std<span class="token double-colon punctuation">::</span>vector<span class="token operator">&lt;</span>size_t<span class="token operator">></span><span class="token operator">*</span> closest_docs<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="16"></td><td><pre> <span class="token keyword">uint32_t</span><span class="token operator">*</span><span class="token operator">&amp;</span> closest_center<span class="token punctuation">)</span></pre></td></tr></table></figure><p>The first step is to compute closest_centers:</p>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre>math_utils<span class="token double-colon punctuation">::</span><span class="token function">compute_closest_centers</span><span class="token punctuation">(</span>data<span class="token punctuation">,</span> num_points<span class="token punctuation">,</span> dim<span class="token punctuation">,</span> centers<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="2"></td><td><pre> num_centers<span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">,</span> closest_center<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="3"></td><td><pre> closest_docs<span class="token punctuation">,</span> docs_l2sq<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="4"></td><td><pre><span class="token keyword">void</span> <span class="token function">compute_closest_centers</span><span class="token punctuation">(</span><span class="token keyword">float</span><span class="token operator">*</span> data<span class="token punctuation">,</span> size_t num_points<span class="token punctuation">,</span> size_t dim<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="5"></td><td><pre> <span class="token keyword">float</span><span class="token operator">*</span> pivot_data<span class="token punctuation">,</span> size_t num_centers<span class="token punctuation">,</span> size_t k<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="6"></td><td><pre> <span class="token keyword">uint32_t</span><span class="token operator">*</span> closest_centers_ivf<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="7"></td><td><pre> std<span class="token double-colon punctuation">::</span>vector<span class="token operator">&lt;</span>size_t<span class="token operator">></span><span 
class="token operator">*</span> inverted_index<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="8"></td><td><pre> <span class="token keyword">float</span><span class="token operator">*</span> pts_norms_squared<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="9"></td><td><pre> </pre></td></tr><tr><td data-num="10"></td><td><pre><span class="token keyword">uint32_t</span><span class="token operator">*</span> closest_centers <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token keyword">uint32_t</span><span class="token punctuation">[</span>PAR_BLOCK_SIZE <span class="token operator">*</span> k<span class="token punctuation">]</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="11"></td><td><pre><span class="token keyword">float</span><span class="token operator">*</span> distance_matrix <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token keyword">float</span><span class="token punctuation">[</span>num_centers <span class="token operator">*</span> PAR_BLOCK_SIZE<span class="token punctuation">]</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="12"></td><td><pre>math_utils<span class="token double-colon punctuation">::</span><span class="token function">compute_closest_centers_in_block</span><span class="token punctuation">(</span></pre></td></tr><tr><td data-num="13"></td><td><pre> data<span class="token punctuation">,</span> num_points<span class="token punctuation">,</span> dim<span class="token punctuation">,</span> pivot_data<span class="token punctuation">,</span> num_centers<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="14"></td><td><pre> pts_norms_squared<span class="token punctuation">,</span> pivs_norms_squared<span class="token punctuation">,</span> closest_centers<span class="token punctuation">,</span> distance_matrix<span class="token 
punctuation">,</span></pre></td></tr><tr><td data-num="15"></td><td><pre> k<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="16"></td><td><pre><span class="token keyword">void</span> <span class="token function">compute_closest_centers_in_block</span><span class="token punctuation">(</span></pre></td></tr><tr><td data-num="17"></td><td><pre> <span class="token keyword">const</span> <span class="token keyword">float</span><span class="token operator">*</span> <span class="token keyword">const</span> data<span class="token punctuation">,</span> <span class="token keyword">const</span> size_t num_points<span class="token punctuation">,</span> <span class="token keyword">const</span> size_t dim<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="18"></td><td><pre> <span class="token keyword">const</span> <span class="token keyword">float</span><span class="token operator">*</span> <span class="token keyword">const</span> centers<span class="token punctuation">,</span> <span class="token keyword">const</span> size_t num_centers<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="19"></td><td><pre> <span class="token keyword">const</span> <span class="token keyword">float</span><span class="token operator">*</span> <span class="token keyword">const</span> docs_l2sq<span class="token punctuation">,</span> <span class="token keyword">const</span> <span class="token keyword">float</span><span class="token operator">*</span> <span class="token keyword">const</span> centers_l2sq<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="20"></td><td><pre> <span class="token keyword">uint32_t</span><span class="token operator">*</span> center_index<span class="token punctuation">,</span> <span class="token keyword">float</span><span class="token operator">*</span> <span class="token keyword">const</span> dist_matrix<span class="token punctuation">,</span> 
size_t k<span class="token punctuation">)</span></pre></td></tr></table></figure><p>Here the code uses MKL routines for the matrix computation. To be honest, I don't fully understand this part yet, so I'm noting a reference here to come back to when I need it: <span class="exturl" data-url="aHR0cHM6Ly9tdXJwaHlwZWkuZ2l0aHViLmlvL2Jsb2cvMjAxOS8wOS9jYmxhcy1nZW1tLWdlbXY=">https://murphypei.github.io/blog/2019/09/cblas-gemm-gemv</span></p>
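<p>As a rough aid for this step: the docs_l2sq and centers_l2sq arguments suggest that the distances are obtained through the expansion ||x - c||^2 = ||x||^2 + ||c||^2 - 2 * dot(x, c), where all the dot products together form one matrix product, which is the kind of work a GEMM routine does. The sketch below illustrates that idea with plain loops instead of MKL. It is not DiskANN's code, and the names row_norms_squared and closest_center_sketch are made up for this example.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Squared L2 norm of every row of a row-major (rows x dim) matrix.
std::vector<float> row_norms_squared(const std::vector<float>& m,
                                     std::size_t rows, std::size_t dim) {
  std::vector<float> norms(rows, 0.0f);
  for (std::size_t r = 0; r < rows; ++r)
    for (std::size_t d = 0; d < dim; ++d)
      norms[r] += m[r * dim + d] * m[r * dim + d];
  return norms;
}

// Hypothetical helper: index of the nearest center for every point,
// using ||x - c||^2 = ||x||^2 + ||c||^2 - 2 * dot(x, c).
std::vector<std::uint32_t> closest_center_sketch(
    const std::vector<float>& data, std::size_t num_points,
    const std::vector<float>& centers, std::size_t num_centers,
    std::size_t dim) {
  assert(data.size() == num_points * dim);
  assert(centers.size() == num_centers * dim);
  const std::vector<float> pts_norms = row_norms_squared(data, num_points, dim);
  const std::vector<float> ctr_norms = row_norms_squared(centers, num_centers, dim);

  std::vector<std::uint32_t> closest(num_points, 0);
  for (std::size_t i = 0; i < num_points; ++i) {
    float best = std::numeric_limits<float>::max();
    for (std::size_t c = 0; c < num_centers; ++c) {
      // Dot product of point i and center c; computing all of these at once
      // is the matrix product that a GEMM call would replace.
      float dot = 0.0f;
      for (std::size_t d = 0; d < dim; ++d)
        dot += data[i * dim + d] * centers[c * dim + d];
      const float dist = pts_norms[i] + ctr_norms[c] - 2.0f * dot;
      if (dist < best) {
        best = dist;
        closest[i] = static_cast<std::uint32_t>(c);
      }
    }
  }
  return closest;
}
```

<p>Because the point norms are computed once (compare compute_vecs_l2sq above) and reused, only the dot products change between iterations as the centers move.</p>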
<p>Once all the pivots and closest_center have been computed, the results can be saved:</p>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">uint64_t</span> j <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> j <span class="token operator">&lt;</span> num_centers<span class="token punctuation">;</span> j<span class="token operator">++</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="2"></td><td><pre> std<span class="token double-colon punctuation">::</span><span class="token function">memcpy</span><span class="token punctuation">(</span>full_pivot_data<span class="token punctuation">.</span><span class="token function">get</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span> j <span class="token operator">*</span> dim <span class="token operator">+</span> chunk_offsets<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">,</span> </pre></td></tr><tr><td data-num="3"></td><td><pre> cur_pivot_data<span class="token punctuation">.</span><span class="token function">get</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span> j <span class="token operator">*</span> cur_chunk_size<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="4"></td><td><pre> cur_chunk_size <span class="token operator">*</span> <span class="token keyword">sizeof</span><span class="token punctuation">(</span><span class="token keyword">float</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// save the full-precision codebook</span></pre></td></tr><tr><td data-num="5"></td><td><pre><span 
class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="6"></td><td><pre>diskann<span class="token double-colon punctuation">::</span><span class="token generic-function"><span class="token function">save_bin</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">float</span><span class="token operator">></span></span></span><span class="token punctuation">(</span>pq_pivots_path<span class="token punctuation">.</span><span class="token function">c_str</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> full_pivot_data<span class="token punctuation">.</span><span class="token function">get</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span></pre></td></tr><tr><td data-num="7"></td><td><pre> <span class="token punctuation">(</span>size_t<span class="token punctuation">)</span> num_centers<span class="token punctuation">,</span> dim<span class="token punctuation">)</span><span class="token punctuation">;</span> </pre></td></tr><tr><td data-num="8"></td><td><pre>std<span class="token double-colon punctuation">::</span>string centroids_path <span class="token operator">=</span> pq_pivots_path <span class="token operator">+</span> <span class="token string">"_centroid.bin"</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="9"></td><td><pre>diskann<span class="token double-colon punctuation">::</span><span class="token generic-function"><span class="token function">save_bin</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">float</span><span class="token operator">></span></span></span><span class="token punctuation">(</span>centroids_path<span class="token punctuation">.</span><span class="token function">c_str</span><span class="token 
punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> centroid<span class="token punctuation">.</span><span class="token function">get</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token punctuation">(</span>size_t<span class="token punctuation">)</span> dim<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="10"></td><td><pre> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// the center vector of the point set</span></pre></td></tr><tr><td data-num="11"></td><td><pre>std<span class="token double-colon punctuation">::</span>string rearrangement_path <span class="token operator">=</span> pq_pivots_path <span class="token operator">+</span> <span class="token string">"_rearrangement_perm.bin"</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="12"></td><td><pre>diskann<span class="token double-colon punctuation">::</span><span class="token generic-function"><span class="token function">save_bin</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">uint32_t</span><span class="token operator">></span></span></span><span class="token punctuation">(</span>rearrangement_path<span class="token punctuation">.</span><span class="token function">c_str</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> rearrangement<span class="token punctuation">.</span><span class="token function">data</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span></pre></td></tr><tr><td data-num="13"></td><td><pre> rearrangement<span class="token punctuation">.</span><span class="token function">size</span><span class="token 
punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// the order in which the dimensions are rearranged</span></pre></td></tr><tr><td data-num="14"></td><td><pre>std<span class="token double-colon punctuation">::</span>string chunk_offsets_path <span class="token operator">=</span> pq_pivots_path <span class="token operator">+</span> <span class="token string">"_chunk_offsets.bin"</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="15"></td><td><pre>diskann<span class="token double-colon punctuation">::</span><span class="token generic-function"><span class="token function">save_bin</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">uint32_t</span><span class="token operator">></span></span></span><span class="token punctuation">(</span>chunk_offsets_path<span class="token punctuation">.</span><span class="token function">c_str</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> chunk_offsets<span class="token punctuation">.</span><span class="token function">data</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span></pre></td></tr><tr><td data-num="16"></td><td><pre> chunk_offsets<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// which original dimensions each PQ chunk covers, stored as starting offsets</span></pre></td></tr></table></figure><p>Back in the index-building function:</p>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token function">generate_pq_pivots</span><span class="token punctuation">(</span>train_data<span class="token punctuation">,</span> train_size<span class="token punctuation">,</span> <span class="token punctuation">(</span><span class="token keyword">uint32_t</span><span class="token punctuation">)</span> dim<span class="token punctuation">,</span> <span class="token number">256</span><span class="token punctuation">,</span></pre></td></tr><tr><td data-num="2"></td><td><pre> <span class="token punctuation">(</span><span class="token keyword">uint32_t</span><span class="token punctuation">)</span> num_pq_chunks<span class="token punctuation">,</span> NUM_KMEANS_REPS<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="3"></td><td><pre> pq_pivots_path<span class="token punctuation">,</span> make_zero_mean<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="4"></td><td><pre></pre></td></tr><tr><td data-num="5"></td><td><pre><span class="token generic-function"><span class="token function">generate_pq_data_from_pivots</span><span class="token generic class-name"><span class="token operator">&lt;</span>T<span class="token operator">></span></span></span><span class="token punctuation">(</span>data_file_to_use<span class="token punctuation">.</span><span class="token function">c_str</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token number">256</span><span class="token punctuation">,</span></pre></td></tr><tr><td data-num="6"></td><td><pre> <span class="token punctuation">(</span><span class="token keyword">uint32_t</span><span class="token punctuation">)</span> num_pq_chunks<span class="token punctuation">,</span> pq_pivots_path<span class="token 
punctuation">,</span></pre></td></tr><tr><td data-num="7"></td><td><pre> pq_compressed_vectors_path<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr></table></figure><p>generate_pq_data_from_pivots uses the pivots to PQ-compress the vectors. Note that the pivots above were trained on sampled data, while the compression here runs over data_file_to_use.</p>
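<p>To make the compression step concrete, here is a minimal sketch of encoding one vector against a saved codebook. It assumes the layout implied by the memcpy above (row j of the pivot matrix holds center j, and chunk i occupies the dimensions starting at chunk_offsets[i]), assumes chunk_offsets has one entry more than the number of chunks, and ignores centering and dimension rearrangement. pq_encode_sketch is a hypothetical name, not a DiskANN function.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: encode one vector as one byte per PQ chunk.
// pivots is row-major (num_centers x dim); chunk_offsets has num_chunks + 1
// entries and ends with dim (an assumption of this sketch).
std::vector<std::uint8_t> pq_encode_sketch(
    const std::vector<float>& vec, const std::vector<float>& pivots,
    std::size_t num_centers, const std::vector<std::uint32_t>& chunk_offsets) {
  const std::size_t dim = vec.size();
  assert(num_centers >= 1 && num_centers <= 256);  // ids must fit in a byte
  assert(pivots.size() == num_centers * dim);
  assert(chunk_offsets.size() >= 2 && chunk_offsets.back() == dim);

  const std::size_t num_chunks = chunk_offsets.size() - 1;
  std::vector<std::uint8_t> code(num_chunks, 0);
  for (std::size_t i = 0; i < num_chunks; ++i) {
    float best = std::numeric_limits<float>::max();
    for (std::size_t j = 0; j < num_centers; ++j) {
      // Squared distance restricted to the dimensions of chunk i.
      float dist = 0.0f;
      for (std::size_t d = chunk_offsets[i]; d < chunk_offsets[i + 1]; ++d) {
        const float diff = vec[d] - pivots[j * dim + d];
        dist += diff * diff;
      }
      if (dist < best) {
        best = dist;
        code[i] = static_cast<std::uint8_t>(j);
      }
    }
  }
  return code;
}
```

<p>With 256 centers each chunk id fits in a single byte, so a vector shrinks to num_pq_chunks bytes regardless of its original dimension.</p>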
<p>Building the index:</p>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre>diskann<span class="token double-colon punctuation">::</span><span class="token generic-function"><span class="token function">build_merged_vamana_index</span><span class="token generic class-name"><span class="token operator">&lt;</span>T<span class="token operator">></span></span></span><span class="token punctuation">(</span></pre></td></tr><tr><td data-num="2"></td><td><pre> data_file_to_use<span class="token punctuation">.</span><span class="token function">c_str</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> diskann<span class="token double-colon punctuation">::</span>Metric<span class="token double-colon punctuation">::</span>L2<span class="token punctuation">,</span> L<span class="token punctuation">,</span> R<span class="token punctuation">,</span> p_val<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="3"></td><td><pre> indexing_ram_budget<span class="token punctuation">,</span> mem_index_path<span class="token punctuation">,</span> medoids_path<span class="token punctuation">,</span> centroids_path<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="4"></td><td><pre><span class="token keyword">int</span> <span class="token function">build_merged_vamana_index</span><span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span>string base_file<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="5"></td><td><pre> diskann<span class="token double-colon punctuation">::</span>Metric compareMetric<span class="token punctuation">,</span> <span class="token keyword">unsigned</span> L<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="6"></td><td><pre> <span class="token keyword">unsigned</span> R<span class="token 
punctuation">,</span> <span class="token keyword">double</span> sampling_rate<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="7"></td><td><pre> <span class="token keyword">double</span> ram_budget<span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span>string mem_index_path<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="8"></td><td><pre> std<span class="token double-colon punctuation">::</span>string medoids_file<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="9"></td><td><pre> std<span class="token double-colon punctuation">::</span>string centroids_file<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="10"></td><td><pre>paras<span class="token punctuation">.</span><span class="token generic-function"><span class="token function">Set</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">unsigned</span><span class="token operator">></span></span></span><span class="token punctuation">(</span><span class="token string">"L"</span><span class="token punctuation">,</span> <span class="token punctuation">(</span><span class="token keyword">unsigned</span><span class="token punctuation">)</span> L<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="11"></td><td><pre>paras<span class="token punctuation">.</span><span class="token generic-function"><span class="token function">Set</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">unsigned</span><span class="token operator">></span></span></span><span class="token punctuation">(</span><span class="token string">"R"</span><span class="token punctuation">,</span> <span class="token punctuation">(</span><span class="token keyword">unsigned</span><span class="token punctuation">)</span> R<span class="token 
punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="12"></td><td><pre>paras<span class="token punctuation">.</span><span class="token generic-function"><span class="token function">Set</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">unsigned</span><span class="token operator">></span></span></span><span class="token punctuation">(</span><span class="token string">"C"</span><span class="token punctuation">,</span> <span class="token number">750</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// C is a rather curious parameter: it sets the size of the candidate pool used in the final edge pruning</span></pre></td></tr><tr><td data-num="13"></td><td><pre>paras<span class="token punctuation">.</span><span class="token generic-function"><span class="token function">Set</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">float</span><span class="token operator">></span></span></span><span class="token punctuation">(</span><span class="token string">"alpha"</span><span class="token punctuation">,</span> <span class="token number">1.2f</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// alpha is set here</span></pre></td></tr><tr><td data-num="14"></td><td><pre>paras<span class="token punctuation">.</span><span class="token generic-function"><span class="token function">Set</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">unsigned</span><span class="token operator">></span></span></span><span class="token punctuation">(</span><span class="token string">"num_rnds"</span><span class="token punctuation">,</span> <span class="token number">2</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="15"></td><td><pre>paras<span 
class="token punctuation">.</span><span class="token generic-function"><span class="token function">Set</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">bool</span><span class="token operator">></span></span></span><span class="token punctuation">(</span><span class="token string">"saturate_graph"</span><span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="16"></td><td><pre>paras<span class="token punctuation">.</span><span class="token generic-function"><span class="token function">Set</span><span class="token generic class-name"><span class="token operator">&lt;</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span></span></span><span class="token punctuation">(</span><span class="token string">"save_path"</span><span class="token punctuation">,</span> mem_index_path<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr></table></figure><p>At this point the data has to be partitioned according to the memory budget, which determines how many parts are needed:</p>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">int</span> num_parts <span class="token operator">=</span></pre></td></tr><tr><td data-num="2"></td><td><pre> <span class="token generic-function"><span class="token function">partition_with_ram_budget</span><span class="token generic class-name"><span class="token operator">&lt;</span>T<span class="token operator">></span></span></span><span class="token punctuation">(</span>base_file<span class="token punctuation">,</span> sampling_rate<span class="token punctuation">,</span> ram_budget<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="3"></td><td><pre> <span class="token number">2</span> <span class="token operator">*</span> R <span class="token operator">/</span> <span class="token number">3</span><span class="token punctuation">,</span> merged_index_prefix<span class="token punctuation">,</span> <span class="token number">2</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="4"></td><td><pre><span class="token comment">// note the k_base parameter: each vector is assigned to k_base clusters</span></pre></td></tr><tr><td data-num="5"></td><td><pre><span class="token keyword">template</span><span class="token operator">&lt;</span><span class="token keyword">typename</span> <span class="token class-name">T</span><span class="token operator">></span></pre></td></tr><tr><td data-num="6"></td><td><pre><span class="token keyword">int</span> <span class="token function">partition_with_ram_budget</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string data_file<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="7"></td><td><pre> <span class="token keyword">const</span> <span class="token keyword">double</span> sampling_rate<span 
class="token punctuation">,</span> <span class="token keyword">double</span> ram_budget<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="8"></td><td><pre> size_t graph_degree<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="9"></td><td><pre> <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string prefix_path<span class="token punctuation">,</span> size_t k_base<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr></table></figure><p>The partitioning is not random. The code starts with part=3, runs k-means clustering, and places each vector into the corresponding part. It also generates a shard_idmap file that records the ids of the points in each k-means cluster: <code>prefix_path + &quot;_subshard-&quot; + std::to_string(i) + &quot;_ids_uint32.bin&quot;</code>.</p>
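<p>The effect of k_base can be sketched as follows: every point is written into the id lists of its k_base nearest cluster centers, so with k_base = 2 neighbouring shards share points. This is only an in-memory illustration under that reading of the parameter. assign_to_shards_sketch is a hypothetical name, and the real code works on files and also has to respect the RAM budget.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: returns, for each center, the ids of the points that
// have this center among their k_base nearest centers.
std::vector<std::vector<std::uint32_t>> assign_to_shards_sketch(
    const std::vector<float>& data, std::size_t num_points,
    const std::vector<float>& centers, std::size_t num_centers,
    std::size_t dim, std::size_t k_base) {
  assert(data.size() == num_points * dim);
  assert(centers.size() == num_centers * dim);
  assert(k_base >= 1 && k_base <= num_centers);

  std::vector<std::vector<std::uint32_t>> shard_ids(num_centers);
  std::vector<float> dist(num_centers);
  std::vector<std::uint32_t> order(num_centers);
  for (std::size_t i = 0; i < num_points; ++i) {
    // Squared distance from point i to every center.
    for (std::size_t c = 0; c < num_centers; ++c) {
      dist[c] = 0.0f;
      for (std::size_t d = 0; d < dim; ++d) {
        const float diff = data[i * dim + d] - centers[c * dim + d];
        dist[c] += diff * diff;
      }
    }
    // Move the k_base nearest centers to the front of order.
    std::iota(order.begin(), order.end(), 0u);
    std::partial_sort(order.begin(), order.begin() + k_base, order.end(),
                      [&dist](std::uint32_t a, std::uint32_t b) {
                        return dist[a] < dist[b];
                      });
    for (std::size_t k = 0; k < k_base; ++k)
      shard_ids[order[k]].push_back(static_cast<std::uint32_t>(i));
  }
  return shard_ids;
}
```

<p>Each inner list plays the role of one _ids_uint32.bin file. The overlap between shards is what later lets the per-shard graphs be merged into one connected index.</p>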
<p>The data is then split out for each part:</p>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre>std<span class="token double-colon punctuation">::</span>string merged_index_prefix <span class="token operator">=</span> mem_index_path <span class="token operator">+</span> <span class="token string">"_tempFiles"</span><span class="token punctuation">;</span> </pre></td></tr><tr><td data-num="2"></td><td><pre>std<span class="token double-colon punctuation">::</span>string shard_base_file <span class="token operator">=</span></pre></td></tr><tr><td data-num="3"></td><td><pre> merged_index_prefix <span class="token operator">+</span> <span class="token string">"_subshard-"</span> <span class="token operator">+</span> std<span class="token double-colon punctuation">::</span><span class="token function">to_string</span><span class="token punctuation">(</span>p<span class="token punctuation">)</span> <span class="token operator">+</span> <span class="token string">".bin"</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="4"></td><td><pre></pre></td></tr><tr><td data-num="5"></td><td><pre>std<span class="token double-colon punctuation">::</span>string shard_ids_file <span class="token operator">=</span> merged_index_prefix <span class="token operator">+</span> <span class="token string">"_subshard-"</span> <span class="token operator">+</span></pre></td></tr><tr><td data-num="6"></td><td><pre>std<span class="token double-colon punctuation">::</span><span class="token function">to_string</span><span class="token punctuation">(</span>p<span class="token punctuation">)</span> <span class="token operator">+</span> <span class="token string">"_ids_uint32.bin"</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="7"></td><td><pre></pre></td></tr><tr><td data-num="8"></td><td><pre><span class="token generic-function"><span class="token function">retrieve_shard_data_from_ids</span><span class="token generic 
class-name"><span class="token operator">&lt;</span>T<span class="token operator">></span></span></span><span class="token punctuation">(</span>base_file<span class="token punctuation">,</span> shard_ids_file<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="9"></td><td><pre>shard_base_file<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="10"></td><td><pre><span class="token keyword">template</span><span class="token operator">&lt;</span><span class="token keyword">typename</span> <span class="token class-name">T</span><span class="token operator">></span></pre></td></tr><tr><td data-num="11"></td><td><pre><span class="token keyword">int</span> <span class="token function">retrieve_shard_data_from_ids</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string data_file<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="12"></td><td><pre> std<span class="token double-colon punctuation">::</span>string idmap_filename<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="13"></td><td><pre> std<span class="token double-colon punctuation">::</span>string data_filename<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="14"></td><td><pre>std<span class="token double-colon punctuation">::</span>string shard_index_file <span class="token operator">=</span></pre></td></tr><tr><td data-num="15"></td><td><pre> merged_index_prefix <span class="token operator">+</span> <span class="token string">"_subshard-"</span> <span class="token operator">+</span> std<span class="token double-colon punctuation">::</span><span class="token function">to_string</span><span class="token punctuation">(</span>p<span class="token punctuation">)</span> <span class="token operator">+</span> <span class="token 
string">"_mem.index"</span><span class="token punctuation">;</span></pre></td></tr></table></figure><p>Calling retrieve_shard_data_from_ids gives us the partially overlapping sampled data for this shard and writes it to shard_base_file.</p>
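<p>To make the gathering step concrete, here is a minimal in-memory sketch. It is not DiskANN's code: the real retrieve_shard_data_from_ids works on files, while the hypothetical helper <code>gather_rows</code> below only shows the core idea of copying the rows named by an id list out of a row-major matrix.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of DiskANN): copy the rows listed in `ids`
// out of a row-major matrix `base` (num_points x dim) into a new buffer.
// The same id may appear in several shards, which is where the overlap
// between shards comes from.
template <typename T>
std::vector<T> gather_rows(const std::vector<T> &base, std::size_t dim,
                           const std::vector<std::uint32_t> &ids) {
  assert(dim > 0 && base.size() % dim == 0);
  const std::size_t num_points = base.size() / dim;

  std::vector<T> shard;
  shard.reserve(ids.size() * dim);
  for (std::uint32_t id : ids) {
    assert(id < num_points);  // every id must refer to an existing row
    const auto first = base.begin() + static_cast<std::ptrdiff_t>(id * dim);
    shard.insert(shard.end(), first, first + static_cast<std::ptrdiff_t>(dim));
  }
  return shard;
}
```

<p>For example, <code>gather_rows&lt;float&gt;(base, 2, {2, 0})</code> returns row 2 followed by row 0 of a 2-dimensional dataset.</p>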
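<p>Stepping into the Index class constructor, we find that parameters such as frozen_pts can be ignored here. One thing worth noting is that DiskANN supports a dim that is not divisible by 8: it applies an alignment step, <code>_aligned_dim = ROUND_UP(_dim, 8);</code>.</p>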
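<p>As a small illustration of that alignment, the sketch below rounds a dimension up to the next multiple of 8 and zero-pads a vector to that length. The names <code>round_up</code> and <code>pad_row</code> are mine, and the zero padding is an assumption about how the extra slots are filled; zeros do not change L2 distances or inner products.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative stand-in for the ROUND_UP macro: round x up to the next
// multiple of `multiple`.
constexpr std::size_t round_up(std::size_t x, std::size_t multiple) {
  return ((x + multiple - 1) / multiple) * multiple;
}

// Hypothetical helper: copy one vector into a row whose length is a
// multiple of 8, leaving the trailing slots at zero.
template <typename T>
std::vector<T> pad_row(const std::vector<T> &v) {
  std::vector<T> row(round_up(v.size(), 8), T{});
  assert(row.size() >= v.size());
  std::copy(v.begin(), v.end(), row.begin());
  return row;
}
```

<p>With this layout, row <code>i</code> of the data starts at offset <code>i * aligned_dim</code>, which matches the <code>_data + _aligned_dim * node_id</code> pointer arithmetic that appears later in get_expanded_nodes.</p>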
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre>_u64 shard_base_dim<span class="token punctuation">,</span> shard_base_pts<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="2"></td><td><pre><span class="token function">get_bin_metadata</span><span class="token punctuation">(</span>shard_base_file<span class="token punctuation">,</span> shard_base_pts<span class="token punctuation">,</span> shard_base_dim<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="3"></td><td><pre>std<span class="token double-colon punctuation">::</span>unique_ptr<span class="token operator">&lt;</span>diskann<span class="token double-colon punctuation">::</span>Index<span class="token operator">&lt;</span>T<span class="token operator">>></span> _pvamanaIndex <span class="token operator">=</span></pre></td></tr><tr><td data-num="4"></td><td><pre> std<span class="token double-colon punctuation">::</span><span class="token generic-function"><span class="token function">unique_ptr</span><span class="token generic class-name"><span class="token operator">&lt;</span>diskann<span class="token double-colon punctuation">::</span>Index<span class="token operator">&lt;</span>T<span class="token operator">>></span></span></span><span class="token punctuation">(</span></pre></td></tr><tr><td data-num="5"></td><td><pre> <span class="token keyword">new</span> diskann<span class="token double-colon punctuation">::</span><span class="token generic-function"><span class="token function">Index</span><span class="token generic class-name"><span class="token operator">&lt;</span>T<span class="token operator">></span></span></span><span class="token punctuation">(</span>compareMetric<span class="token punctuation">,</span> shard_base_dim<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="6"></td><td><pre> shard_base_pts<span class="token 
punctuation">,</span> <span class="token boolean">false</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// TODO: Single?</span></pre></td></tr><tr><td data-num="7"></td><td><pre>_pvamanaIndex<span class="token operator">-></span><span class="token function">build</span><span class="token punctuation">(</span>shard_base_file<span class="token punctuation">.</span><span class="token function">c_str</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> shard_base_pts<span class="token punctuation">,</span> paras<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="8"></td><td><pre>_pvamanaIndex<span class="token operator">-></span><span class="token function">save</span><span class="token punctuation">(</span>shard_index_file<span class="token punctuation">.</span><span class="token function">c_str</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="9"></td><td><pre></pre></td></tr><tr><td data-num="10"></td><td><pre><span class="token keyword">template</span><span class="token operator">&lt;</span><span class="token keyword">typename</span> <span class="token class-name">T</span><span class="token punctuation">,</span> <span class="token keyword">typename</span> <span class="token class-name">TagT</span><span class="token operator">></span></pre></td></tr><tr><td data-num="11"></td><td><pre><span class="token class-name">Index</span><span class="token operator">&lt;</span>T<span class="token punctuation">,</span> TagT<span class="token operator">></span><span class="token double-colon punctuation">::</span><span class="token function">Index</span><span class="token 
punctuation">(</span>Metric m<span class="token punctuation">,</span> <span class="token keyword">const</span> size_t dim<span class="token punctuation">,</span> <span class="token keyword">const</span> size_t max_points<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="12"></td><td><pre> <span class="token keyword">const</span> <span class="token keyword">bool</span> dynamic_index<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="13"></td><td><pre> <span class="token keyword">const</span> <span class="token keyword">bool</span> enable_tags<span class="token punctuation">,</span> <span class="token keyword">const</span> <span class="token keyword">bool</span> support_eager_delete<span class="token punctuation">)</span> <span class="token comment">// all bool parameters default to false</span></pre></td></tr><tr><td data-num="14"></td><td><pre> <span class="token operator">:</span> <span class="token function">_dist_metric</span><span class="token punctuation">(</span>m<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">_dim</span><span class="token punctuation">(</span>dim<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">_max_points</span><span class="token punctuation">(</span>max_points<span class="token punctuation">)</span><span class="token punctuation">,</span></pre></td></tr><tr><td data-num="15"></td><td><pre><span class="token function">_dynamic_index</span><span class="token punctuation">(</span>dynamic_index<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">_enable_tags</span><span class="token punctuation">(</span>enable_tags<span class="token punctuation">)</span><span class="token punctuation">,</span></pre></td></tr><tr><td data-num="16"></td><td><pre><span class="token function">_support_eager_delete</span><span class="token 
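punctuation">(</span>support_eager_delete<span class="token punctuation">)</span></pre></td></tr></table></figure><p>build has two overloads here, yet the arguments at the call site appear to match neither of them. Setting a breakpoint in gdb shows that execution enters the build function below, which looked odd at first; the most likely explanation is that the trailing tags parameter has a default argument in the declaration:</p>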
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">void</span> <span class="token class-name">Index</span><span class="token operator">&lt;</span>T<span class="token punctuation">,</span> TagT<span class="token operator">></span><span class="token double-colon punctuation">::</span><span class="token function">build</span><span class="token punctuation">(</span><span class="token keyword">const</span> <span class="token keyword">char</span> <span class="token operator">*</span> filename<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="2"></td><td><pre> <span class="token keyword">const</span> size_t num_points_to_load<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="3"></td><td><pre> Parameters <span class="token operator">&amp;</span> parameters<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="4"></td><td><pre> <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator">&lt;</span>TagT<span class="token operator">></span> <span class="token operator">&amp;</span>tags<span class="token punctuation">)</span></pre></td></tr></table></figure><p>There is actually not much of interest inside build; most of its code performs sanity checks.</p>
<p>Next we step into the link function:</p>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token function">link</span><span class="token punctuation">(</span>parameters<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="2"></td><td><pre><span class="token keyword">void</span> <span class="token class-name">Index</span><span class="token operator">&lt;</span>T<span class="token punctuation">,</span> TagT<span class="token operator">></span><span class="token double-colon punctuation">::</span><span class="token function">link</span><span class="token punctuation">(</span>Parameters <span class="token operator">&amp;</span>parameters<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="3"></td><td><pre> </pre></td></tr><tr><td data-num="4"></td><td><pre><span class="token keyword">unsigned</span> num_threads <span class="token operator">=</span> parameters<span class="token punctuation">.</span><span class="token generic-function"><span class="token function">Get</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">unsigned</span><span class="token operator">></span></span></span><span class="token punctuation">(</span><span class="token string">"num_threads"</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// note: if num_threads is not specified, all available CPUs are used</span></pre></td></tr><tr><td data-num="5"></td><td><pre><span class="token keyword">if</span> <span class="token punctuation">(</span>num_threads <span class="token operator">!=</span> <span class="token number">0</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="6"></td><td><pre> <span class="token function">omp_set_num_threads</span><span class="token punctuation">(</span>num_threads<span class="token punctuation">)</span><span class="token 
punctuation">;</span></pre></td></tr></table></figure><p>The link step makes two passes over the data, i.e. it runs Vamana twice. It first prepares some parameters:</p>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre>_indexingQueueSize <span class="token operator">=</span> parameters<span class="token punctuation">.</span><span class="token generic-function"><span class="token function">Get</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">unsigned</span><span class="token operator">></span></span></span><span class="token punctuation">(</span><span class="token string">"L"</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// Search list size</span></pre></td></tr><tr><td data-num="2"></td><td><pre>_indexingRange <span class="token operator">=</span> parameters<span class="token punctuation">.</span><span class="token generic-function"><span class="token function">Get</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">unsigned</span><span class="token operator">></span></span></span><span class="token punctuation">(</span><span class="token string">"R"</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="3"></td><td><pre>_indexingMaxC <span class="token operator">=</span> parameters<span class="token punctuation">.</span><span class="token generic-function"><span class="token function">Get</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">unsigned</span><span class="token operator">></span></span></span><span class="token punctuation">(</span><span class="token string">"C"</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="4"></td><td><pre><span class="token keyword">const</span> <span class="token keyword">float</span> last_round_alpha <span class="token operator">=</span> 
parameters<span class="token punctuation">.</span><span class="token generic-function"><span class="token function">Get</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">float</span><span class="token operator">></span></span></span><span class="token punctuation">(</span><span class="token string">"alpha"</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="5"></td><td><pre><span class="token keyword">unsigned</span> L <span class="token operator">=</span> _indexingQueueSize<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="6"></td><td><pre></pre></td></tr><tr><td data-num="7"></td><td><pre><span class="token keyword">uint32_t</span> num_syncs <span class="token operator">=</span></pre></td></tr><tr><td data-num="8"></td><td><pre> <span class="token punctuation">(</span><span class="token keyword">unsigned</span><span class="token punctuation">)</span> <span class="token function">DIV_ROUND_UP</span><span class="token punctuation">(</span>_nd <span class="token operator">+</span> _num_frozen_pts<span class="token punctuation">,</span> <span class="token punctuation">(</span><span class="token number">64</span> <span class="token operator">*</span> <span class="token number">64</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="9"></td><td><pre><span class="token keyword">if</span> <span class="token punctuation">(</span>num_syncs <span class="token operator">&lt;</span> <span class="token number">40</span><span class="token punctuation">)</span> <span class="token comment">// split the points into chunks based on their count; the chunks are not processed in parallel with each other, but they make progress reporting easier</span></pre></td></tr><tr><td data-num="10"></td><td><pre> num_syncs <span class="token operator">=</span> <span class="token number">40</span><span class="token 
punctuation">;</span></pre></td></tr><tr><td data-num="11"></td><td><pre></pre></td></tr><tr><td data-num="12"></td><td><pre>std<span class="token double-colon punctuation">::</span>vector<span class="token operator">&lt;</span><span class="token keyword">unsigned</span><span class="token operator">></span> Lvec<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="13"></td><td><pre>Lvec<span class="token punctuation">.</span><span class="token function">push_back</span><span class="token punctuation">(</span>L<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="14"></td><td><pre>Lvec<span class="token punctuation">.</span><span class="token function">push_back</span><span class="token punctuation">(</span>L<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="15"></td><td><pre><span class="token keyword">const</span> <span class="token keyword">unsigned</span> NUM_RNDS <span class="token operator">=</span> <span class="token number">2</span><span class="token punctuation">;</span> <span class="token comment">// two rounds</span></pre></td></tr><tr><td data-num="16"></td><td><pre> </pre></td></tr><tr><td data-num="17"></td><td><pre></pre></td></tr><tr><td data-num="18"></td><td><pre>_indexingAlpha <span class="token operator">=</span> <span class="token number">1.0f</span><span class="token punctuation">;</span> <span class="token comment">// alpha parameter of the index</span></pre></td></tr><tr><td data-num="19"></td><td><pre></pre></td></tr><tr><td data-num="20"></td><td><pre>std<span class="token double-colon punctuation">::</span>vector<span class="token operator">&lt;</span><span class="token keyword">unsigned</span><span class="token operator">></span> visit_order<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="21"></td><td><pre>std<span class="token double-colon punctuation">::</span>vector<span class="token 
operator">&lt;</span>diskann<span class="token double-colon punctuation">::</span>Neighbor<span class="token operator">></span> pool<span class="token punctuation">,</span> tmp<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="22"></td><td><pre>tsl<span class="token double-colon punctuation">::</span>robin_set<span class="token operator">&lt;</span><span class="token keyword">unsigned</span><span class="token operator">></span> visited<span class="token punctuation">;</span></pre></td></tr></table></figure><p>Compute the entry point and set up some timing variables:</p>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre>_ep <span class="token operator">=</span> <span class="token function">calculate_entry_point</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="2"></td><td><pre><span class="token comment">// returns the point closest to the dataset centroid (itself a point of the dataset)</span></pre></td></tr><tr><td data-num="3"></td><td><pre></pre></td></tr><tr><td data-num="4"></td><td><pre><span class="token keyword">double</span> sync_time <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">,</span> total_sync_time <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="5"></td><td><pre><span class="token keyword">double</span> inter_time <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">,</span> total_inter_time <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="6"></td><td><pre>size_t inter_count <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">,</span> total_inter_count <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="7"></td><td><pre><span class="token comment">// timing variables</span></pre></td></tr></table></figure><p>There are several nested loops here and it is easy to get lost: the outermost loop is over the Vamana rounds, the middle one over the data chunks, and the innermost one over the nodes of each chunk:</p>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">uint32_t</span> rnd_no <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> rnd_no <span class="token operator">&lt;</span> NUM_RNDS<span class="token punctuation">;</span> rnd_no<span class="token operator">++</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span> <span class="token comment">// two rounds by default</span></pre></td></tr><tr><td data-num="2"></td><td><pre> <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">uint32_t</span> sync_num <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> sync_num <span class="token operator">&lt;</span> num_syncs<span class="token punctuation">;</span> sync_num<span class="token operator">++</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="3"></td><td><pre> <span class="token comment">// this for loop runs in parallel</span></pre></td></tr><tr><td data-num="4"></td><td><pre> <span class="token keyword">for</span> <span class="token punctuation">(</span>_s64 node_ctr <span class="token operator">=</span> <span class="token punctuation">(</span>_s64<span class="token punctuation">)</span> start_id<span class="token punctuation">;</span> node_ctr <span class="token operator">&lt;</span> <span class="token punctuation">(</span>_s64<span class="token punctuation">)</span> end_id<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="5"></td><td><pre> <span class="token operator">++</span>node_ctr<span class="token punctuation">)</span></pre></td></tr></table></figure><p>Greedy search is used to find candidate neighbors. There are many layers of calls here, and each layer renames its parameters:</p>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre>pool<span class="token punctuation">.</span><span class="token function">reserve</span><span class="token punctuation">(</span>L <span class="token operator">*</span> <span class="token number">2</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="2"></td><td><pre>visited<span class="token punctuation">.</span><span class="token function">reserve</span><span class="token punctuation">(</span>L <span class="token operator">*</span> <span class="token number">2</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="3"></td><td><pre><span class="token function">get_expanded_nodes</span><span class="token punctuation">(</span>node<span class="token punctuation">,</span> L<span class="token punctuation">,</span> init_ids<span class="token punctuation">,</span> pool<span class="token punctuation">,</span> visited<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="4"></td><td><pre></pre></td></tr><tr><td data-num="5"></td><td><pre><span class="token keyword">void</span> <span class="token class-name">Index</span><span class="token operator">&lt;</span>T<span class="token punctuation">,</span> TagT<span class="token operator">></span><span class="token double-colon punctuation">::</span><span class="token function">get_expanded_nodes</span><span class="token punctuation">(</span></pre></td></tr><tr><td data-num="6"></td><td><pre> <span class="token keyword">const</span> size_t node_id<span class="token punctuation">,</span> <span class="token keyword">const</span> <span class="token keyword">unsigned</span> Lindex<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="7"></td><td><pre> std<span class="token double-colon punctuation">::</span>vector<span 
class="token operator">&lt;</span><span class="token keyword">unsigned</span><span class="token operator">></span> init_ids<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="8"></td><td><pre> std<span class="token double-colon punctuation">::</span>vector<span class="token operator">&lt;</span>Neighbor<span class="token operator">></span> <span class="token operator">&amp;</span> expanded_nodes_info<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="9"></td><td><pre> tsl<span class="token double-colon punctuation">::</span>robin_set<span class="token operator">&lt;</span><span class="token keyword">unsigned</span><span class="token operator">></span> <span class="token operator">&amp;</span>expanded_nodes_ids<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="10"></td><td><pre> <span class="token keyword">const</span> T <span class="token operator">*</span>node_coords <span class="token operator">=</span> _data <span class="token operator">+</span> _aligned_dim <span class="token operator">*</span> node_id<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="11"></td><td><pre></pre></td></tr><tr><td data-num="12"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>init_ids<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="13"></td><td><pre> init_ids<span class="token punctuation">.</span><span class="token function">emplace_back</span><span class="token punctuation">(</span>_ep<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="14"></td><td><pre></pre></td></tr><tr><td 
data-num="15"></td><td><pre> std<span class="token double-colon punctuation">::</span>vector<span class="token operator">&lt;</span><span class="token keyword">unsigned</span><span class="token operator">></span> des<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="16"></td><td><pre> std<span class="token double-colon punctuation">::</span>vector<span class="token operator">&lt;</span>Neighbor<span class="token operator">></span> best_L_nodes<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="17"></td><td><pre> best_L_nodes<span class="token punctuation">.</span><span class="token function">resize</span><span class="token punctuation">(</span>Lindex <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="18"></td><td><pre> tsl<span class="token double-colon punctuation">::</span>robin_set<span class="token operator">&lt;</span><span class="token keyword">unsigned</span><span class="token operator">></span> inserted_into_pool_rs<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="19"></td><td><pre> boost<span class="token double-colon punctuation">::</span>dynamic_bitset<span class="token operator">&lt;</span><span class="token operator">></span> inserted_into_pool_bs<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="20"></td><td><pre></pre></td></tr><tr><td data-num="21"></td><td><pre> <span class="token function">iterate_to_fixed_point</span><span class="token punctuation">(</span>node_coords<span class="token punctuation">,</span> Lindex<span class="token punctuation">,</span> init_ids<span class="token punctuation">,</span> expanded_nodes_info<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="22"></td><td><pre> expanded_nodes_ids<span class="token punctuation">,</span> best_L_nodes<span class="token punctuation">,</span> 
des<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="23"></td><td><pre> inserted_into_pool_rs<span class="token punctuation">,</span> inserted_into_pool_bs<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="24"></td><td><pre></pre></td></tr><tr><td data-num="25"></td><td><pre><span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="26"></td><td><pre></pre></td></tr><tr><td data-num="27"></td><td><pre>std<span class="token double-colon punctuation">::</span>pair<span class="token operator">&lt;</span><span class="token keyword">uint32_t</span><span class="token punctuation">,</span> <span class="token keyword">uint32_t</span><span class="token operator">></span> <span class="token class-name">Index</span><span class="token operator">&lt;</span>T<span class="token punctuation">,</span> TagT<span class="token operator">></span><span class="token double-colon punctuation">::</span><span class="token function">iterate_to_fixed_point</span><span class="token punctuation">(</span></pre></td></tr><tr><td data-num="28"></td><td><pre> <span class="token keyword">const</span> T <span class="token operator">*</span>node_coords<span class="token punctuation">,</span> <span class="token keyword">const</span> <span class="token keyword">unsigned</span> Lsize<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="29"></td><td><pre> <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator">&lt;</span><span class="token keyword">unsigned</span><span class="token operator">></span> <span class="token operator">&amp;</span>init_ids<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="30"></td><td><pre> std<span class="token double-colon punctuation">::</span>vector<span class="token operator">&lt;</span>Neighbor<span class="token operator">></span> <span class="token 
operator">&amp;</span> expanded_nodes_info<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="31"></td><td><pre> tsl<span class="token double-colon punctuation">::</span>robin_set<span class="token operator">&lt;</span><span class="token keyword">unsigned</span><span class="token operator">></span> <span class="token operator">&amp;</span> expanded_nodes_ids<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="32"></td><td><pre> std<span class="token double-colon punctuation">::</span>vector<span class="token operator">&lt;</span>Neighbor<span class="token operator">></span> <span class="token operator">&amp;</span>best_L_nodes<span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator">&lt;</span><span class="token keyword">unsigned</span><span class="token operator">></span> <span class="token operator">&amp;</span>des<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="33"></td><td><pre> tsl<span class="token double-colon punctuation">::</span>robin_set<span class="token operator">&lt;</span><span class="token keyword">unsigned</span><span class="token operator">></span> <span class="token operator">&amp;</span>inserted_into_pool_rs<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="34"></td><td><pre> boost<span class="token double-colon punctuation">::</span>dynamic_bitset<span class="token operator">&lt;</span><span class="token operator">></span> <span class="token operator">&amp;</span>inserted_into_pool_bs<span class="token punctuation">,</span> <span class="token keyword">bool</span> ret_frozen<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="35"></td><td><pre> <span class="token keyword">bool</span> search_invocation<span class="token punctuation">)</span></pre></td></tr></table></figure><p>The code first takes the ep from init_ids (whose length is 1, because the center is used as the entry point and no other elements were added), places it into best_L_nodes,
and marks it as already inserted into the pool: <code>inserted_into_pool_bs[id] = 1</code>.</p>
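<p><strong>After setting a breakpoint in gdb, I found that Vamana here does not initialize a random graph at all; it builds the graph directly by insertion.</strong> Amazing!</p>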
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220609203248466.png" alt="image-20220609203248466" /></p>
<p>Here a greedy search is run on the graph, and then the edges are pruned:</p>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token function">get_expanded_nodes</span><span class="token punctuation">(</span>node<span class="token punctuation">,</span> L<span class="token punctuation">,</span> init_ids<span class="token punctuation">,</span> pool<span class="token punctuation">,</span> visited<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="2"></td><td><pre></pre></td></tr><tr><td data-num="3"></td><td><pre><span class="token comment">// add the node's existing neighbors</span></pre></td></tr><tr><td data-num="4"></td><td><pre></pre></td></tr><tr><td data-num="5"></td><td><pre><span class="token function">prune_neighbors</span><span class="token punctuation">(</span>node<span class="token punctuation">,</span> pool<span class="token punctuation">,</span> pruned_list<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr></table></figure><p>This part is too abstract to describe well in words; in short, it implements the algorithm from the paper.</p>
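<p>When it occludes candidates with the alpha factor, it does not apply alpha in a single pass; instead it iterates, starting from an initial factor and increasing it each round, which is an interesting detail.</p>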
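<p>A minimal sketch of that "growing alpha" idea, not the DiskANN source itself. The 2-D <code>Point</code> type, the function name <code>prune_with_growing_alpha</code>, and the growth step of 1.2 are assumptions made for illustration. Each candidate keeps an occlusion factor; a round only accepts candidates whose factor does not exceed the current alpha, and alpha is enlarged between rounds until the target alpha or the degree limit is reached.</p>

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { float x, y; };

struct Candidate {
    std::size_t id;
    float dist;  // distance from the node being pruned
};

inline float l2(const Point& a, const Point& b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}

// Hypothetical helper: picks at most max_degree neighbors for `node`,
// relaxing the occlusion threshold from 1.0 up to `alpha` round by round.
std::vector<std::size_t> prune_with_growing_alpha(
    const std::vector<Point>& pts, std::size_t node,
    const std::vector<std::size_t>& candidate_ids,
    std::size_t max_degree, float alpha) {
    std::vector<Candidate> pool;
    for (std::size_t id : candidate_ids)
        if (id != node) pool.push_back(Candidate{id, l2(pts[node], pts[id])});
    std::sort(pool.begin(), pool.end(),
              [](const Candidate& a, const Candidate& b) { return a.dist < b.dist; });

    std::vector<float> occlude(pool.size(), 0.0f);  // largest ratio seen so far
    std::vector<bool> taken(pool.size(), false);
    std::vector<std::size_t> result;

    // Assumed growth step: multiply the working alpha by 1.2 each round.
    for (float cur = 1.0f; cur <= alpha && result.size() < max_degree; cur *= 1.2f) {
        for (std::size_t i = 0; i < pool.size() && result.size() < max_degree; ++i) {
            if (taken[i] || occlude[i] > cur) continue;
            taken[i] = true;
            result.push_back(pool[i].id);
            // The accepted neighbor may occlude every farther candidate.
            for (std::size_t t = i + 1; t < pool.size(); ++t) {
                if (taken[t]) continue;
                const float d = l2(pts[pool[i].id], pts[pool[t].id]);
                if (d == 0.0f) { taken[t] = true; continue; }  // duplicate point
                occlude[t] = std::max(occlude[t], pool[t].dist / d);
            }
        }
    }
    return result;
}
```

<p>With a strict alpha of 1.0 a candidate that lies behind an accepted neighbor is dropped; a larger alpha lets later rounds admit it, which is how the relaxed rounds keep some longer edges.</p>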
<p>The indexes built per shard have now been saved; what remains is the merge:</p>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre>diskann<span class="token double-colon punctuation">::</span><span class="token function">merge_shards</span><span class="token punctuation">(</span>merged_index_prefix <span class="token operator">+</span> <span class="token string">"_subshard-"</span><span class="token punctuation">,</span> <span class="token string">"_mem.index"</span><span class="token punctuation">,</span></pre></td></tr><tr><td data-num="2"></td><td><pre> merged_index_prefix <span class="token operator">+</span> <span class="token string">"_subshard-"</span><span class="token punctuation">,</span> <span class="token string">"_ids_uint32.bin"</span><span class="token punctuation">,</span></pre></td></tr><tr><td data-num="3"></td><td><pre> num_parts<span class="token punctuation">,</span> R<span class="token punctuation">,</span> mem_index_path<span class="token punctuation">,</span> medoids_file<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="4"></td><td><pre> </pre></td></tr><tr><td data-num="5"></td><td><pre><span class="token keyword">int</span> <span class="token function">merge_shards</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&amp;</span>vamana_prefix<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="6"></td><td><pre> <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&amp;</span>vamana_suffix<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="7"></td><td><pre> <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&amp;</span>idmaps_prefix<span class="token 
punctuation">,</span></pre></td></tr><tr><td data-num="8"></td><td><pre> <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&amp;</span>idmaps_suffix<span class="token punctuation">,</span> <span class="token keyword">const</span> _u64 nshards<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="9"></td><td><pre> <span class="token keyword">unsigned</span> max_degree<span class="token punctuation">,</span> <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&amp;</span>output_vamana<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="10"></td><td><pre> <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&amp;</span>medoids_file<span class="token punctuation">)</span></pre></td></tr></table></figure> ]]></description>
</item>
<item>
<guid isPermalink="true">https://songlinlife.top/2022/cpp/CMake%E5%AD%A6%E4%B9%A0/</guid>
<title>Learning CMake</title>
<link>https://songlinlife.top/2022/cpp/CMake%E5%AD%A6%E4%B9%A0/</link>
<category term="cpp" scheme="https://songlinlife.top/categories/cpp/" />
<category term="cmake" scheme="https://songlinlife.top/tags/cmake/" />
<pubDate>Mon, 25 Apr 2022 10:43:39 +0800</pubDate>
<description><![CDATA[ <p>I plan to learn the basics of CMake, since it is basically unavoidable in the upcoming experiments and code...</p>
<h3 id="makefile"><a class="anchor" href="#makefile">#</a> Makefile</h3>
<p>Given the following directory:</p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220425105218653.png" alt="image-20220425105218653" /></p>
<p>answer.hpp and answer.cpp are both in the current directory, so no extra include or link configuration is needed.</p>
<p>The Makefile can simply be written as:</p>
<figure class="highlight makefile"><figcaption data-lang="makefile"></figcaption><table><tr><td data-num="1"></td><td><pre>CC <span class="token operator">:=</span> clang</pre></td></tr><tr><td data-num="2"></td><td><pre>CXX <span class="token operator">:=</span> clang++</pre></td></tr><tr><td data-num="3"></td><td><pre></pre></td></tr><tr><td data-num="4"></td><td><pre><span class="token builtin">.PHONY</span><span class="token punctuation">:</span> all</pre></td></tr><tr><td data-num="5"></td><td><pre><span class="token symbol">all</span><span class="token punctuation">:</span> answer</pre></td></tr><tr><td data-num="6"></td><td><pre></pre></td></tr><tr><td data-num="7"></td><td><pre><span class="token comment"># The answer.o object file is added here.</span></pre></td></tr><tr><td data-num="8"></td><td><pre>objects <span class="token operator">:=</span> main.o answer.o</pre></td></tr><tr><td data-num="9"></td><td><pre></pre></td></tr><tr><td data-num="10"></td><td><pre><span class="token symbol">answer</span><span class="token punctuation">:</span> <span class="token variable">$</span><span class="token punctuation">(</span>objects<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="11"></td><td><pre>	<span class="token variable">$</span><span class="token punctuation">(</span>CXX<span class="token punctuation">)</span> -o <span class="token variable">$@</span> <span class="token variable">$</span><span class="token punctuation">(</span>objects<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="12"></td><td><pre></pre></td></tr><tr><td data-num="13"></td><td><pre><span class="token comment">#</span></pre></td></tr><tr><td data-num="14"></td><td><pre><span class="token comment"># Make can infer that a .o object file depends on the .cpp file of the same name,</span></pre></td></tr><tr><td data-num="15"></td><td><pre><span class="token comment"># so main.cpp and answer.cpp need not be listed as prerequisites,</span></pre></td></tr><tr><td data-num="16"></td><td><pre><span class="token comment"># and no compile commands are needed either: Make knows to use the command</span></pre></td></tr><tr><td data-num="17"></td><td><pre><span class="token comment"># named by the CXX variable as the C++ compiler.</span></pre></td></tr><tr><td data-num="18"></td><td><pre><span class="token comment">#</span></pre></td></tr><tr><td data-num="19"></td><td><pre><span class="token comment"># Only the headers each object file depends on must be listed, so that</span></pre></td></tr><tr><td data-num="20"></td><td><pre><span class="token comment"># the object file is recompiled when a header changes.</span></pre></td></tr><tr><td data-num="21"></td><td><pre><span class="token comment">#</span></pre></td></tr><tr><td data-num="22"></td><td><pre><span class="token symbol">main.o</span><span class="token punctuation">:</span> answer.hpp</pre></td></tr><tr><td data-num="23"></td><td><pre><span class="token symbol">answer.o</span><span class="token punctuation">:</span> answer.hpp</pre></td></tr><tr><td data-num="24"></td><td><pre></pre></td></tr><tr><td data-num="25"></td><td><pre><span class="token builtin">.PHONY</span><span class="token punctuation">:</span> clean</pre></td></tr><tr><td data-num="26"></td><td><pre><span class="token symbol">clean</span><span class="token punctuation">:</span></pre></td></tr><tr><td data-num="27"></td><td><pre>	rm -f answer <span class="token variable">$</span><span class="token punctuation">(</span>objects<span class="token punctuation">)</span></pre></td></tr></table></figure><p>Note that header files are not compiled or linked on their own: the first thing the compiler does (in its preprocessing step) is expand each header into the source file that includes it, which is why headers need include guards (<code>#ifndef</code> / <code>#define</code>). So commenting out lines 22 and 23 still gives a working build; the only difference is that the object files are not recompiled after the header changes.</p>
<h3 id="cmake-简单三步走"><a class="anchor" href="#cmake-简单三步走">#</a> CMake in Three Simple Steps</h3>
<p>With the same directory and the same cpp files, CMakeLists.txt can be written as:</p>
<figure class="highlight cmake"><figcaption data-lang="CMake"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token comment"># Specify the minimum required CMake version</span></pre></td></tr><tr><td data-num="2"></td><td><pre><span class="token keyword">cmake_minimum_required</span><span class="token punctuation">(</span><span class="token property">VERSION</span> <span class="token number">3.9</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="3"></td><td><pre><span class="token comment"># Set the project name</span></pre></td></tr><tr><td data-num="4"></td><td><pre><span class="token keyword">project</span><span class="token punctuation">(</span>answer<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="5"></td><td><pre></pre></td></tr><tr><td data-num="6"></td><td><pre><span class="token comment">#[[</span></pre></td></tr><tr><td data-num="7"></td><td><pre>Add an executable target, similar to this in the original Makefile:</pre></td></tr><tr><td data-num="8"></td><td><pre></pre></td></tr><tr><td data-num="9"></td><td><pre> answer: main.o answer.o</pre></td></tr><tr><td data-num="10"></td><td><pre> main.o: main.cpp answer.hpp</pre></td></tr><tr><td data-num="11"></td><td><pre> answer.o: answer.cpp answer.hpp</pre></td></tr><tr><td data-num="12"></td><td><pre></pre></td></tr><tr><td data-num="13"></td><td><pre>CMake finds the header dependencies automatically, so they need not be listed;</pre></td></tr><tr><td data-num="14"></td><td><pre>when a header changes, the object files that depend on it are recompiled.</pre></td></tr><tr><td data-num="15"></td><td><pre><span class="token comment">#]]</span></pre></td></tr><tr><td data-num="16"></td><td><pre><span class="token keyword">add_executable</span><span class="token punctuation">(</span>answer main.cpp answer.cpp<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="17"></td><td><pre></pre></td></tr><tr><td data-num="18"></td><td><pre><span class="token comment">#[[</span></pre></td></tr><tr><td data-num="19"></td><td><pre>Build this project with the following commands:</pre></td></tr><tr><td data-num="20"></td><td><pre></pre></td></tr><tr><td data-num="21"></td><td><pre> cmake -B build <span class="token comment"># generate the build directory</span></pre></td></tr><tr><td data-num="22"></td><td><pre> cmake --build build <span class="token comment"># run the build</span></pre></td></tr><tr><td data-num="23"></td><td><pre> ./build/answer <span class="token comment"># run the answer program</span></pre></td></tr><tr><td data-num="24"></td><td><pre><span class="token comment">#]]</span></pre></td></tr></table></figure><h3 id="cmake-split-dir"><a class="anchor" href="#cmake-split-dir">#</a> CMake Split Dir</h3>
<p>The new multi-level directory layout:</p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220425122439766.png" alt="image-20220425122439766" /></p>
<p>main.cpp:</p>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">&lt;answer/answer.hpp></span></span></pre></td></tr></table></figure><p>As shown, main.cpp includes the answer library simply as <code>answer/answer.hpp</code>.</p>
<p>The top-level CMakeLists.txt:</p>
<figure class="highlight cmake"><figcaption data-lang="CMake"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">cmake_minimum_required</span><span class="token punctuation">(</span><span class="token property">VERSION</span> <span class="token number">3.9</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="2"></td><td><pre><span class="token keyword">project</span><span class="token punctuation">(</span>answer<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="3"></td><td><pre></pre></td></tr><tr><td data-num="4"></td><td><pre><span class="token comment"># Add the answer subdirectory</span></pre></td></tr><tr><td data-num="5"></td><td><pre><span class="token keyword">add_subdirectory</span><span class="token punctuation">(</span>answer<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="6"></td><td><pre><span class="token keyword">add_executable</span><span class="token punctuation">(</span>answer_app main.cpp<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="7"></td><td><pre><span class="token keyword">target_link_libraries</span><span class="token punctuation">(</span>answer_app libanswer<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="8"></td><td><pre></pre></td></tr><tr><td data-num="9"></td><td><pre><span class="token comment">#[[</span></pre></td></tr><tr><td data-num="10"></td><td><pre>Build this project with the following commands:</pre></td></tr><tr><td data-num="11"></td><td><pre></pre></td></tr><tr><td data-num="12"></td><td><pre> cmake -B build <span class="token comment"># generate the build directory</span></pre></td></tr><tr><td data-num="13"></td><td><pre> cmake --build build <span class="token comment"># run the build</span></pre></td></tr><tr><td data-num="14"></td><td><pre> ./build/answer_app <span class="token comment"># run the answer_app program</span></pre></td></tr><tr><td data-num="15"></td><td><pre><span class="token comment">#]]</span></pre></td></tr></table></figure><p>Because it adds the subdirectory, CMake also processes the CMakeLists.txt in that subdirectory.</p>
<p>Now the subdirectory's CMakeLists.txt:</p>
<figure class="highlight cmake"><figcaption data-lang="CMake"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">add_library</span><span class="token punctuation">(</span>libanswer <span class="token namespace">STATIC</span> answer.cpp<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="2"></td><td><pre></pre></td></tr><tr><td data-num="3"></td><td><pre><span class="token comment">#[[</span></pre></td></tr><tr><td data-num="4"></td><td><pre>message can print debug or error information; besides STATUS</pre></td></tr><tr><td data-num="5"></td><td><pre>there are also DEBUG, WARNING, SEND_ERROR, FATAL_ERROR, and others.</pre></td></tr><tr><td data-num="6"></td><td><pre><span class="token comment">#]]</span></pre></td></tr><tr><td data-num="7"></td><td><pre><span class="token keyword">message</span><span class="token punctuation">(</span>STATUS <span class="token string">"Current source dir: <span class="token interpolation"><span class="token punctuation">$&#123;</span><span class="token variable">CMAKE_CURRENT_SOURCE_DIR</span><span class="token punctuation">&#125;</span></span>"</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="8"></td><td><pre></pre></td></tr><tr><td data-num="9"></td><td><pre><span class="token comment">#[[</span></pre></td></tr><tr><td data-num="10"></td><td><pre>Add an include directory to the libanswer library target; <span class="token namespace">PUBLIC</span> makes</pre></td></tr><tr><td data-num="11"></td><td><pre>this include directory visible to external users.</pre></td></tr><tr><td data-num="12"></td><td><pre></pre></td></tr><tr><td data-num="13"></td><td><pre>When a target links libanswer, the include directory given here is</pre></td></tr><tr><td data-num="14"></td><td><pre>automatically added to that target's include path.</pre></td></tr><tr><td data-num="15"></td><td><pre><span class="token comment">#]]</span></pre></td></tr><tr><td data-num="16"></td><td><pre><span class="token keyword">target_include_directories</span><span class="token punctuation">(</span>libanswer <span class="token namespace">PUBLIC</span> <span class="token punctuation">$&#123;</span><span class="token variable">CMAKE_CURRENT_SOURCE_DIR</span><span class="token punctuation">&#125;</span>/include<span class="token punctuation">)</span></pre></td></tr></table></figure><p>A static library means the library's code is copied into the executable at link time. The statement <code>target_include_directories(libanswer PUBLIC $&#123;CMAKE_CURRENT_SOURCE_DIR&#125;/include)</code> means that when a target links libanswer, the include directory under the current source directory is added to that target's include path.</p>
<p>Passing <code>-DCMAKE_EXPORT_COMPILE_COMMANDS=1</code> makes CMake generate a <code>compile_commands.json</code> file in the build directory:</p>
<figure class="highlight bash"><figcaption data-lang="bash"></figcaption><table><tr><td data-num="1"></td><td><pre>cmake -DCMAKE_EXPORT_COMPILE_COMMANDS<span class="token operator">=</span><span class="token number">1</span> -B build</pre></td></tr></table></figure><p>Or add this to CMakeLists.txt:</p>
<figure class="highlight cmake"><figcaption data-lang="CMake"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">set</span><span class="token punctuation">(</span><span class="token variable">CMAKE_EXPORT_COMPILE_COMMANDS</span> ON<span class="token punctuation">)</span></pre></td></tr></table></figure><p>If the directory structure is changed to:</p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220425142550285.png" alt="image-20220425142550285" /></p>
<p>then the include in main.cpp must change to:</p>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">&lt;ans/answer.hpp></span></span></pre></td></tr></table></figure><p>Now I fully understand what this means. In the top-level CMakeLists.txt, <code>add_subdirectory(answer)</code> processes the CMakeLists.txt in the subdirectory. In the subdirectory, <code>add_library(libanswer STATIC answer.cpp)</code> builds a library named libanswer, and <code>target_include_directories(libanswer PUBLIC $&#123;CMAKE_CURRENT_SOURCE_DIR&#125;/include)</code> attaches the include directory to a target; the target here is the libanswer library named in the first argument, not the CMakeLists.txt file. Once the library exists, <code>target_link_libraries(answer_app libanswer)</code> links it into the final executable.</p>
<h4 id="使用系统的curl动态库"><a class="anchor" href="#使用系统的curl动态库">#</a> Using the system's curl shared library</h4>
<figure class="highlight cmake"><figcaption data-lang="CMake"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token comment">#[[</span></pre></td></tr><tr><td data-num="2"></td><td><pre>find_package locates the headers and library files of a third-party library installed</pre></td></tr><tr><td data-num="3"></td><td><pre>on the system and creates a library target named <span class="token inserted class-name">CURL::libcurl</span> to link against.</pre></td></tr><tr><td data-num="4"></td><td><pre><span class="token comment">#]]</span></pre></td></tr><tr><td data-num="5"></td><td><pre><span class="token keyword">find_package</span><span class="token punctuation">(</span>CURL REQUIRED<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="6"></td><td><pre></pre></td></tr><tr><td data-num="7"></td><td><pre><span class="token keyword">add_library</span><span class="token punctuation">(</span>libanswer <span class="token namespace">STATIC</span> answer.cpp<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="8"></td><td><pre></pre></td></tr><tr><td data-num="9"></td><td><pre><span class="token keyword">target_include_directories</span><span class="token punctuation">(</span>libanswer <span class="token namespace">PUBLIC</span> <span class="token punctuation">$&#123;</span><span class="token variable">CMAKE_CURRENT_SOURCE_DIR</span><span class="token punctuation">&#125;</span>/include<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="10"></td><td><pre></pre></td></tr><tr><td data-num="11"></td><td><pre><span class="token comment">#[[</span></pre></td></tr><tr><td data-num="12"></td><td><pre>Link libcurl into the libanswer library. With <span class="token namespace">PRIVATE</span> rather than <span class="token namespace">PUBLIC</span>,</pre></td></tr><tr><td data-num="13"></td><td><pre><span class="token inserted class-name">CURL::libcurl</span> is visible only to libanswer; the root-level main.cpp</pre></td></tr><tr><td data-num="14"></td><td><pre>cannot include curl's headers.</pre></td></tr><tr><td data-num="15"></td><td><pre><span class="token comment">#]]</span></pre></td></tr><tr><td data-num="16"></td><td><pre><span class="token keyword">target_link_libraries</span><span class="token punctuation">(</span>libanswer <span class="token namespace">PRIVATE</span> <span class="token inserted class-name">CURL::libcurl</span><span class="token punctuation">)</span></pre></td></tr></table></figure><h3 id="新的依赖关系"><a class="anchor" href="#新的依赖关系">#</a> The new dependency relationships</h3>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220425155727377.png" alt="image-20220425155727377" /></p>
<p>Dependency graph:</p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220425160052810.png" alt="image-20220425160052810" /></p>
<p>The CMake code involved:</p>
<p>In answer:</p>
<figure class="highlight cmake"><figcaption data-lang="CMake"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">add_library</span><span class="token punctuation">(</span>libanswer <span class="token namespace">STATIC</span> answer.cpp<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="2"></td><td><pre><span class="token keyword">target_include_directories</span><span class="token punctuation">(</span>libanswer <span class="token namespace">PUBLIC</span> <span class="token punctuation">$&#123;</span><span class="token variable">CMAKE_CURRENT_SOURCE_DIR</span><span class="token punctuation">&#125;</span>/include<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="3"></td><td><pre></pre></td></tr><tr><td data-num="4"></td><td><pre><span class="token comment"># libanswer now uses the API provided by the wolfram library directly and no longer needs to know about CURL</span></pre></td></tr><tr><td data-num="5"></td><td><pre><span class="token keyword">target_link_libraries</span><span class="token punctuation">(</span>libanswer <span class="token namespace">PRIVATE</span> wolfram<span class="token punctuation">)</span></pre></td></tr></table></figure><p>In wolfram:</p>
<figure class="highlight cmake"><figcaption data-lang="CMake"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">add_library</span><span class="token punctuation">(</span>wolfram <span class="token namespace">STATIC</span> alpha.cpp<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="2"></td><td><pre><span class="token keyword">target_include_directories</span><span class="token punctuation">(</span>wolfram <span class="token namespace">PUBLIC</span> <span class="token punctuation">$&#123;</span><span class="token variable">CMAKE_CURRENT_SOURCE_DIR</span><span class="token punctuation">&#125;</span>/include<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="3"></td><td><pre></pre></td></tr><tr><td data-num="4"></td><td><pre><span class="token comment"># wolfram depends on curl_wrapper as a PRIVATE dependency</span></pre></td></tr><tr><td data-num="5"></td><td><pre><span class="token keyword">target_link_libraries</span><span class="token punctuation">(</span>wolfram <span class="token namespace">PRIVATE</span> curl_wrapper<span class="token punctuation">)</span></pre></td></tr></table></figure><p>In curl_wrapper:</p>
<figure class="highlight cmake"><figcaption data-lang="CMake"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">find_package</span><span class="token punctuation">(</span>CURL REQUIRED<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="2"></td><td><pre></pre></td></tr><tr><td data-num="3"></td><td><pre><span class="token keyword">add_library</span><span class="token punctuation">(</span>curl_wrapper <span class="token namespace">STATIC</span> curl_wrapper.cpp<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="4"></td><td><pre><span class="token keyword">target_include_directories</span><span class="token punctuation">(</span>curl_wrapper</pre></td></tr><tr><td data-num="5"></td><td><pre> <span class="token namespace">PUBLIC</span> <span class="token punctuation">$&#123;</span><span class="token variable">CMAKE_CURRENT_SOURCE_DIR</span><span class="token punctuation">&#125;</span>/include<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="6"></td><td><pre></pre></td></tr><tr><td data-num="7"></td><td><pre><span class="token comment"># curl_wrapper depends on CURL::libcurl as a PRIVATE dependency</span></pre></td></tr><tr><td data-num="8"></td><td><pre><span class="token keyword">target_link_libraries</span><span class="token punctuation">(</span>curl_wrapper <span class="token namespace">PRIVATE</span> <span class="token inserted class-name">CURL::libcurl</span><span class="token punctuation">)</span></pre></td></tr></table></figure><h3 id="把answer改成header-only"><a class="anchor" href="#把answer改成header-only">#</a> Making answer header-only</h3>
<figure class="highlight cmake"><figcaption data-lang="CMake"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">set</span><span class="token punctuation">(</span>WOLFRAM_APPID</pre></td></tr><tr><td data-num="2"></td><td><pre> <span class="token string">""</span></pre></td></tr><tr><td data-num="3"></td><td><pre> <span class="token variable">CACHE</span> STRING <span class="token string">"WolframAlpha APPID"</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="4"></td><td><pre></pre></td></tr><tr><td data-num="5"></td><td><pre><span class="token keyword">if</span><span class="token punctuation">(</span>WOLFRAM_APPID <span class="token operator">STREQUAL</span> <span class="token string">""</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="6"></td><td><pre> <span class="token keyword">message</span><span class="token punctuation">(</span>SEND_ERROR <span class="token string">"WOLFRAM_APPID must not be empty"</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="7"></td><td><pre><span class="token keyword">endif</span><span class="token punctuation">(</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="8"></td><td><pre></pre></td></tr><tr><td data-num="9"></td><td><pre><span class="token comment">#[[</span></pre></td></tr><tr><td data-num="10"></td><td><pre>An <span class="token namespace">INTERFACE</span> target is generally used when there are no source files, for example</pre></td></tr><tr><td data-num="11"></td><td><pre>a header-only library, or simply to provide a set of target_xxx</pre></td></tr><tr><td data-num="12"></td><td><pre>settings in the abstract.</pre></td></tr><tr><td data-num="13"></td><td><pre></pre></td></tr><tr><td data-num="14"></td><td><pre>All later target_xxx calls on an <span class="token namespace">INTERFACE</span> target must also use</pre></td></tr><tr><td data-num="15"></td><td><pre><span class="token namespace">INTERFACE</span>; their effects apply directly to the targets that link this library.</pre></td></tr><tr><td data-num="16"></td><td><pre></pre></td></tr><tr><td data-num="17"></td><td><pre>This step changes libanswer from a <span class="token namespace">STATIC</span> target to an <span class="token namespace">INTERFACE</span></pre></td></tr><tr><td data-num="18"></td><td><pre>target, which does not affect the code in answer_app that uses it.</pre></td></tr><tr><td data-num="19"></td><td><pre><span class="token comment">#]]</span></pre></td></tr><tr><td data-num="20"></td><td><pre><span class="token keyword">add_library</span><span class="token punctuation">(</span>libanswer <span class="token namespace">INTERFACE</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="21"></td><td><pre><span class="token keyword">target_include_directories</span><span class="token punctuation">(</span>libanswer</pre></td></tr><tr><td data-num="22"></td><td><pre> <span class="token namespace">INTERFACE</span> <span class="token punctuation">$&#123;</span><span class="token variable">CMAKE_CURRENT_SOURCE_DIR</span><span class="token punctuation">&#125;</span>/include<span class="token punctuation">)</span></pre></td></tr><tr><td data-num="23"></td><td><pre><span class="token keyword">target_compile_definitions</span><span class="token punctuation">(</span>libanswer <span class="token namespace">INTERFACE</span> WOLFRAM_APPID=<span class="token string">"<span class="token interpolation"><span class="token punctuation">$&#123;</span><span class="token variable">WOLFRAM_APPID</span><span class="token punctuation">&#125;</span></span>"</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="24"></td><td><pre><span class="token keyword">target_link_libraries</span><span class="token punctuation">(</span>libanswer <span class="token namespace">INTERFACE</span> wolfram<span class="token punctuation">)</span></pre></td></tr></table></figure><p>This part:</p>
<figure class="highlight cmake"><figcaption data-lang="CMake"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">if</span><span class="token punctuation">(</span>WOLFRAM_APPID <span class="token operator">STREQUAL</span> <span class="token string">""</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="2"></td><td><pre> <span class="token keyword">message</span><span class="token punctuation">(</span>SEND_ERROR <span class="token string">"WOLFRAM_APPID must not be empty"</span><span class="token punctuation">)</span></pre></td></tr><tr><td data-num="3"></td><td><pre><span class="token keyword">endif</span><span class="token punctuation">(</span><span class="token punctuation">)</span></pre></td></tr></table></figure><p>checks that WOLFRAM_APPID was passed in. When configuring the build, run:</p>
<figure class="highlight cmake"><figcaption data-lang="CMake"></figcaption><table><tr><td data-num="1"></td><td><pre>cmake -B build -DWOLFRAM_APPID=ssdfk</pre></td></tr></table></figure><p>Note that the value must be passed with the <code>-D</code> flag.</p>
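<p>On the C++ side, the value set through <code>target_compile_definitions</code> arrives as an ordinary preprocessor macro. A minimal sketch of how a header-only library could consume it; the fallback value and the <code>make_query</code> helper are hypothetical, added only so the fragment compiles on its own:</p>

```cpp
#include <string>

// WOLFRAM_APPID normally comes from the build system, e.g.
//   cmake -B build -DWOLFRAM_APPID=ssdfk
// which CMake forwards to the compiler as a string-literal macro.
// The fallback below is an assumption so this sketch also builds standalone.
#ifndef WOLFRAM_APPID
#define WOLFRAM_APPID "demo-appid"
#endif

// Hypothetical helper (not part of the original project): formats a
// request description that embeds the configured application id.
inline std::string make_query(const std::string& question) {
    return std::string("appid=") + WOLFRAM_APPID + " input=" + question;
}
```

<p>Because libanswer is an INTERFACE target, the definition is applied to whichever target links it, so answer_app sees the macro without defining it again.</p>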
]]></description>
</item>
<item>
<guid isPermalink="true">https://songlinlife.top/2022/%E6%95%B0%E6%8D%AE%E5%BA%93/KGraph/</guid>
<title>ANNS: KGraph</title>
<link>https://songlinlife.top/2022/%E6%95%B0%E6%8D%AE%E5%BA%93/KGraph/</link>
<category term="数据库" scheme="https://songlinlife.top/categories/%E6%95%B0%E6%8D%AE%E5%BA%93/" />
<category term="ANNS" scheme="https://songlinlife.top/tags/ANNS/" />
<pubDate>Sun, 24 Apr 2022 16:50:10 +0800</pubDate>
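<description><![CDATA[ <p>This post walks through KGraph, that is, the NN-Descent algorithm, with reference to A Comprehensive Survey and Experimental Comparison of Graph-Based Approximate Nearest Neighbor Search.</p>
<p>NN-Descent is in fact a method for constructing an approximate K-NNG.</p>
<h3 id="作者的动机"><a class="anchor" href="#作者的动机">#</a> The authors' motivation</h3>
<p>Efficient K-NNG construction has long been an open problem, and no existing method is at once general, efficient, and scalable; the authors aim to address this.</p>
<p>General: the method works with any similarity oracle.</p>
<p>Scalable: the dataset can keep growing, which suits online scenarios.</p>
<p>Efficient, accurate, and easy to implement.</p>
<h3 id="算法拆解"><a class="anchor" href="#算法拆解">#</a> Breaking down the algorithm</h3>
<h4 id="原始算法"><a class="anchor" href="#原始算法">#</a> The original algorithm</h4>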
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424165850876.png" alt="image-20220424165850876" /></p>
<p>The core of the algorithm fits in one sentence: <strong>a neighbor of a neighbor is likely to be a neighbor</strong>.</p>
<p>The B list stores each node's K nearest neighbors, and the R list stores the nodes that have this node among their own K nearest neighbors (its reverse neighbors). Taking B[v] ∪ R[v] therefore gives a new set <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mover accent="true"><mi>B</mi><mo>^</mo></mover><mo stretchy="false">[</mo><mi>v</mi><mo stretchy="false">]</mo></mrow><annotation encoding="application/x-tex">\hat{B}[v]</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.19677em;vertical-align:-0.25em;"></span><span class="mord accent"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.9467699999999999em;"><span style="top:-3em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.05017em;">B</span></span></span><span style="top:-3.25233em;"><span class="pstrut" style="height:3em;"></span><span class="accent-body" style="left:-0.16666em;"><span class="mord">^</span></span></span></span></span></span></span><span class="mopen">[</span><span class="mord mathnormal" style="margin-right:0.03588em;">v</span><span class="mclose">]</span></span></span></span>, and on that set we apply the idea that a neighbor of a neighbor is likely to be a neighbor:</p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424170601406.png" alt="image-20220424170601406" /></p>
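<p>The algorithm stops when nothing can be improved any further, that is, when examining the neighbors of neighbors no longer produces any update to a neighbor list.</p>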
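<p>To make the loop concrete, here is a minimal sketch of the basic procedure described above. It is an illustration under assumptions, not the authors' or KGraph's code: the names <code>Point</code>, <code>Neighbor</code>, <code>dist2</code>, <code>update_nn</code> and <code>nn_descent</code> are hypothetical, squared Euclidean distance stands in for the similarity oracle, and the random initialization computes real distances instead of starting from infinity.</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

using Point = std::vector<double>;
using Neighbor = std::pair<double, std::size_t>;  // (distance, point index)

// Squared Euclidean distance; an assumed stand-in for the similarity oracle.
double dist2(const Point& a, const Point& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return s;
}

// Try to insert u into a sorted K-NN list; returns true if the list changed.
bool update_nn(std::vector<Neighbor>& list, std::size_t u, double d) {
    for (const Neighbor& n : list)
        if (n.second == u) return false;       // already a neighbor
    if (d >= list.back().first) return false;  // not better than the current worst
    list.back() = {d, u};                      // replace the worst entry
    std::sort(list.begin(), list.end());
    return true;
}

// Basic NN-Descent. Assumes 0 < K < pts.size().
std::vector<std::vector<Neighbor>> nn_descent(const std::vector<Point>& pts,
                                              std::size_t K, unsigned seed = 42) {
    const std::size_t n = pts.size();
    std::mt19937 rng(seed);
    std::vector<std::vector<Neighbor>> B(n);
    // Initialise B[v] with K random points other than v.
    for (std::size_t v = 0; v < n; ++v) {
        std::vector<std::size_t> ids;
        for (std::size_t u = 0; u < n; ++u)
            if (u != v) ids.push_back(u);
        std::shuffle(ids.begin(), ids.end(), rng);
        for (std::size_t i = 0; i < K; ++i) {
            const std::size_t u = ids[i];
            B[v].push_back({dist2(pts[v], pts[u]), u});
        }
        std::sort(B[v].begin(), B[v].end());
    }
    for (;;) {
        // general[v] holds B[v] together with the reverse neighbors R[v].
        std::vector<std::vector<std::size_t>> general(n);
        for (std::size_t v = 0; v < n; ++v)
            for (const Neighbor& nb : B[v]) {
                general[v].push_back(nb.second);
                general[nb.second].push_back(v);
            }
        std::size_t changes = 0;
        // A neighbor of a neighbor is a candidate neighbor.
        for (std::size_t v = 0; v < n; ++v)
            for (std::size_t u1 : general[v])
                for (std::size_t u2 : general[u1]) {
                    if (u2 == v) continue;
                    if (update_nn(B[v], u2, dist2(pts[v], pts[u2]))) ++changes;
                }
        if (changes == 0) return B;  // nothing improved: stop
    }
}
```

<p>Each successful update strictly lowers the worst distance in some list, so the loop terminates. The result is an approximation; nothing here guarantees that it equals the exact K-NNG.</p>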
<h4 id="优化后的算法"><a class="anchor" href="#优化后的算法">#</a> The optimized algorithm</h4>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424170816397.png" alt="image-20220424170816397" /></p>
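<p>The authors add several optimizations:</p>
<p>1) Two points u1 and u2 would otherwise be compared twice, so the constraint u1 &lt; u2 is added.</p>
<p>2) Local join: instead of updating v, update u1 and u2.</p>
<p>3) Add a sampling step.</p>
<p>4) Use two arrays, old and new.</p>
<p>5) Early termination.</p>
<h5 id="理解这里的条件"><a class="anchor" href="#理解这里的条件">#</a> Understanding the conditions here</h5>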
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424172347894.png" alt="image-20220424172347894" /></p>
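<p>Whenever B[v] is initialized or a new entry is inserted into it, that entry's flag is set to true. Because this is a local join, an intermediate point v is used to update the B lists of the neighbors on either side of it. For the entries of B[v] itself, a true flag therefore means the entry has not yet taken part in a join. If u1 and u2 are both in new, we require u1 &lt; u2 so that the pair is not computed twice. Once an entry has been processed its flag is set to false; here that happens at sampling time.</p>
<p>As for old: if both points are old, they have already been compared with each other, so there is no need to compare them again. But if one point is old and the other is new, they have definitely not been compared before!</p>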
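<p>The pairing rule can be sketched as a small helper that lists which pairs a local join around one node has to compare. This is only an illustration: <code>local_join_pairs</code>, <code>new_ids</code> and <code>old_ids</code> are hypothetical names, and sampling is left out.</p>

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Enumerate the pairs that a local join around one node must compare.
// new_ids: neighbors whose flag is still true (not yet joined).
// old_ids: neighbors whose flag is already false (joined in an earlier round).
std::vector<std::pair<std::size_t, std::size_t>> local_join_pairs(
    const std::vector<std::size_t>& new_ids,
    const std::vector<std::size_t>& old_ids) {
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (std::size_t u1 : new_ids) {
        for (std::size_t u2 : new_ids)
            if (u1 < u2) pairs.emplace_back(u1, u2);   // new-new: once per unordered pair
        for (std::size_t u2 : old_ids)
            if (u1 != u2) pairs.emplace_back(u1, u2);  // new-old: not compared yet
    }
    return pairs;  // old-old pairs are skipped on purpose
}
```

<p>For each returned pair the caller would compute the distance once and try to update both B[u1] and B[u2], which is what makes the join "local".</p>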
]]></description>
</item>
<item>
<guid isPermalink="true">https://songlinlife.top/2022/cpp/cpp%E5%A5%87%E6%B7%AB%E5%B7%A7%E6%8A%80/</guid>
<title>Clever C++ tricks</title>
<link>https://songlinlife.top/2022/cpp/cpp%E5%A5%87%E6%B7%AB%E5%B7%A7%E6%8A%80/</link>
<category term="cpp" scheme="https://songlinlife.top/categories/cpp/" />
<pubDate>Sat, 23 Apr 2022 17:26:49 +0800</pubDate>
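<description><![CDATA[ <p>A place to record assorted C++ tricks.</p>
<h3 id="nth_element-和-bind"><a class="anchor" href="#nth_element-和-bind">#</a> nth_element and bind</h3>
<p>nth_element works like the partition step of quicksort (a quickselect): it moves the element that belongs at a given position in sorted order into that position, with no larger elements before it and no smaller elements after it, without fully sorting the range. The call in this section uses it to put the median in place; <code>bind</code> builds the cmp comparison function.</p>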
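<p>To see the effect in isolation, here is a small self-contained sketch. The <code>Comparer</code> struct and the <code>median_point</code> helper are hypothetical stand-ins written for this example; only <code>std::nth_element</code>, <code>std::bind</code> and the placeholders come from the standard library.</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical comparer: orders points by one chosen coordinate.
struct Comparer {
    std::size_t idx;
    bool compare_idx(const std::vector<double>& a, const std::vector<double>& b) const {
        return a[idx] < b[idx];
    }
};

// Moves the median along coordinate idx to the middle position and returns it.
// Assumes pts is non-empty; pts is reordered in place.
std::vector<double> median_point(std::vector<std::vector<double>>& pts, std::size_t idx) {
    using std::placeholders::_1;
    using std::placeholders::_2;
    Comparer comp{idx};
    auto mid = pts.begin() + pts.size() / 2;
    // bind turns the member function into a two-argument comparison callable.
    std::nth_element(pts.begin(), mid, pts.end(),
                     std::bind(&Comparer::compare_idx, comp, _1, _2));
    return *mid;
}
```

<p>After the call, everything before the middle position is no greater than the median on that coordinate and everything after it is no smaller, but neither half is sorted.</p>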
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token keyword">using</span> std<span class="token double-colon punctuation">::</span>placeholders<span class="token double-colon punctuation">::</span>_1<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="2"></td><td><pre><span class="token keyword">using</span> std<span class="token double-colon punctuation">::</span>placeholders<span class="token double-colon punctuation">::</span>_2<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="3"></td><td><pre></pre></td></tr><tr><td data-num="4"></td><td><pre>std<span class="token double-colon punctuation">::</span><span class="token function">nth_element</span><span class="token punctuation">(</span>begin<span class="token punctuation">,</span> begin <span class="token operator">+</span> std<span class="token double-colon punctuation">::</span><span class="token function">distance</span><span class="token punctuation">(</span>begin<span class="token punctuation">,</span> end<span class="token punctuation">)</span> <span class="token operator">/</span> <span class="token number">2</span><span class="token punctuation">,</span></pre></td></tr><tr><td data-num="5"></td><td><pre> end<span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span><span class="token function">bind</span><span class="token punctuation">(</span><span class="token operator">&amp;</span>comparer<span class="token double-colon punctuation">::</span>compare_idx<span class="token punctuation">,</span> comp<span class="token punctuation">,</span> _1<span class="token punctuation">,</span> _2<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr></table></figure><h3 id="vector-创建多维固定数组"><a class="anchor" href="#vector-创建多维固定数组">#</a> Creating fixed-size multi-dimensional arrays with vector</h3>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre>std<span class="token double-colon punctuation">::</span>vector<span class="token operator">&lt;</span>std<span class="token double-colon punctuation">::</span>vector<span class="token operator">&lt;</span><span class="token keyword">int</span><span class="token operator">>></span> <span class="token function">a</span><span class="token punctuation">(</span><span class="token number">5</span><span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span><span class="token generic-function"><span class="token function">vector</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">int</span><span class="token operator">></span></span></span><span class="token punctuation">(</span><span class="token punctuation">(</span>size_t<span class="token punctuation">)</span><span class="token number">5</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// Create a 5*5 array (elements value-initialized to 0).</span></pre></td></tr><tr><td data-num="2"></td><td><pre>std<span class="token double-colon punctuation">::</span>vector<span class="token operator">&lt;</span>std<span class="token double-colon punctuation">::</span>vector<span class="token operator">&lt;</span><span class="token keyword">int</span><span class="token operator">>></span> <span class="token function">b</span><span class="token punctuation">(</span><span class="token number">5</span><span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span><span class="token generic-function"><span class="token function">vector</span><span class="token generic class-name"><span class="token operator">&lt;</span><span class="token keyword">int</span><span class="token operator">></span></span></span><span class="token punctuation">(</span><span class="token punctuation">(</span>size_t<span class="token punctuation">)</span><span class="token number">5</span><span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// Create a 5*5 array with every element initialized to 1.</span></pre></td></tr></table></figure> ]]></description>
</item>
<item>
<guid isPermalink="true">https://songlinlife.top/2022/%E6%95%B0%E6%8D%AE%E5%BA%93/%E6%A0%91%E6%9F%A5%E8%AF%A2%E7%AE%97%E6%B3%95/</guid>
<title>Tree algorithms for nearest neighbor search: KD tree, Ball tree, VP tree</title>
<link>https://songlinlife.top/2022/%E6%95%B0%E6%8D%AE%E5%BA%93/%E6%A0%91%E6%9F%A5%E8%AF%A2%E7%AE%97%E6%B3%95/</link>
<category term="数据库" scheme="https://songlinlife.top/categories/%E6%95%B0%E6%8D%AE%E5%BA%93/" />
<category term="ANNS" scheme="https://songlinlife.top/tags/ANNS/" />
<pubDate>Sat, 23 Apr 2022 12:02:49 +0800</pubDate>
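<description><![CDATA[ <p>This post looks at the assortment of tree-based search algorithms. My main interest is graph-based methods for nearest neighbor search, but many of them use some kind of tree as an auxiliary index, so it is worth getting a basic understanding of the tree-based algorithms too.</p>
<p>Please welcome our first <s>victim</s>.</p>
<h3 id="kd树"><a class="anchor" href="#kd树">#</a> KD tree</h3>
<p>The kd tree is one of the most widely used trees for nearest neighbor search, and its idea is very close to that of a binary search tree.</p>
<p>This picture says it all:</p>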
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220423164038307.png" alt="image-20220423164038307" /></p>
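<p>Start with the motivation: why build a KD tree at all? Trees are very efficient data structures. A lookup in a binary search tree costs only O(h), where h is the height, which is O(log n) when the tree is balanced. A brute-force nearest neighbor search, by contrast, costs O(n).</p>
<p>What is a KD tree? As in a binary search tree, data is compared against the current node: values greater than the node go to the right subtree and smaller values go to the left. The difference is that a KD tree compares along a different dimension at each level.</p>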
<p>For example, at the current node root we compare every data point (p <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>∈</mo><msup><mi>E</mi><mi>d</mi></msup></mrow><annotation encoding="application/x-tex">\in E^d</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5782em;vertical-align:-0.0391em;"></span><span class="mrel">∈</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:0.849108em;vertical-align:0em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.05764em;">E</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.849108em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">d</span></span></span></span></span></span></span></span></span></span></span>) on dimension 0. For a vector [1.1,3.4,...,0.1] the value at dim=0 is 1.1; if 1.1 is less than root(dim=0) the vector is inserted on the left, and if it is greater it is inserted on the right. One level further down the KD tree we compare on dim=1 and insert in the same way.</p>
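<p>One important optimization is to keep the number of points on the left and right of the KD tree roughly equal, so that the resulting tree is as balanced as possible. When building the tree we therefore pick the point at the <code>median</code> of the current point set along the current dimension as the base, and partition the rest by comparing against it.</p>
<p>The second optimization is the choice of dimension: prefer a dimension along which the data has large variance.</p>
<p>So the construction procedure is:</p>
<p>Suppose we already know the coordinates of n distinct points in k-dimensional space and want to build a k-D Tree from them. The steps are:</p>
<ol>
<li>If the current hyperrectangle contains only one point, return that point.</li>
<li>Choose a dimension and split the current hyperrectangle into two along it.</li>
<li>Choose the splitting point: take the median point along the high-variance dimension; points whose value on that dimension is smaller go into one hyperrectangle (the left subtree), and the rest go into the other (the right subtree).</li>
<li>Make the chosen point the root of this subtree, recursively build the left and right subtrees from the two hyperrectangles, and maintain the subtree information.</li>
</ol>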
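<p>Step 3 depends on picking a high-variance dimension, which the steps above leave abstract. One possible way to do it is sketched here; <code>max_variance_dim</code> is a hypothetical helper written for this post, not part of any referenced implementation.</p>

```cpp
#include <cstddef>
#include <vector>

// Returns the coordinate index along which the points have the largest variance.
// Assumes pts is non-empty and every point has the same number of coordinates.
std::size_t max_variance_dim(const std::vector<std::vector<double>>& pts) {
    const std::size_t dim = pts.front().size();
    const double n = static_cast<double>(pts.size());
    std::size_t best = 0;
    double best_var = -1.0;
    for (std::size_t d = 0; d < dim; ++d) {
        double mean = 0.0;
        for (const auto& p : pts) mean += p[d];
        mean /= n;
        double var = 0.0;
        for (const auto& p : pts) var += (p[d] - mean) * (p[d] - mean);
        var /= n;  // population variance along coordinate d
        if (var > best_var) {
            best_var = var;
            best = d;
        }
    }
    return best;
}
```

<p>The returned index would then be used as the comparison dimension when the median is selected for the current subtree. Computing it at every level costs an extra pass over the points per dimension, which is the price of the better split.</p>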
<p>The definition here follows: <span class="exturl" data-url="aHR0cHM6Ly9vaS13aWtpLm9yZy9kcy9rZHQv">https://oi-wiki.org/ds/kdt/</span></p>
<p>The implementation I read is: <span class="exturl" data-url="aHR0cHM6Ly9naXRodWIuY29tL2NydnMvS0RUcmVl">https://github.com/crvs/KDTree</span></p>
<p>The tree construction algorithm:</p>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre>KDNodePtr <span class="token class-name">KDTree</span><span class="token double-colon punctuation">::</span><span class="token function">make_tree</span><span class="token punctuation">(</span><span class="token keyword">const</span> pointIndexArr<span class="token double-colon punctuation">::</span>iterator <span class="token operator">&amp;</span>begin<span class="token punctuation">,</span> <span class="token comment">//</span></pre></td></tr><tr><td data-num="2"></td><td><pre> <span class="token keyword">const</span> pointIndexArr<span class="token double-colon punctuation">::</span>iterator <span class="token operator">&amp;</span>end<span class="token punctuation">,</span> <span class="token comment">//</span></pre></td></tr><tr><td data-num="3"></td><td><pre> <span class="token keyword">const</span> size_t <span class="token operator">&amp;</span>length<span class="token punctuation">,</span> <span class="token comment">//</span></pre></td></tr><tr><td data-num="4"></td><td><pre> <span class="token keyword">const</span> size_t <span class="token operator">&amp;</span>level <span class="token comment">//</span></pre></td></tr><tr><td data-num="5"></td><td><pre><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="6"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>begin <span class="token operator">==</span> end<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="7"></td><td><pre> <span class="token keyword">return</span> <span class="token function">NewKDNodePtr</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// empty tree</span></pre></td></tr><tr><td 
data-num="8"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="9"></td><td><pre></pre></td></tr><tr><td data-num="10"></td><td><pre> size_t dim <span class="token operator">=</span> begin<span class="token operator">-></span>first<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="11"></td><td><pre></pre></td></tr><tr><td data-num="12"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>length <span class="token operator">></span> <span class="token number">1</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="13"></td><td><pre> <span class="token function">sort_on_idx</span><span class="token punctuation">(</span>begin<span class="token punctuation">,</span> end<span class="token punctuation">,</span> level<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// like a quicksort partition: puts the median in its correct position, with smaller values on its left and larger values on its right</span></pre></td></tr><tr><td data-num="14"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="15"></td><td><pre></pre></td></tr><tr><td data-num="16"></td><td><pre> <span class="token keyword">auto</span> middle <span class="token operator">=</span> begin <span class="token operator">+</span> <span class="token punctuation">(</span>length <span class="token operator">/</span> <span class="token number">2</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="17"></td><td><pre></pre></td></tr><tr><td data-num="18"></td><td><pre> <span class="token keyword">auto</span> l_begin <span class="token operator">=</span> begin<span class="token 
punctuation">;</span></pre></td></tr><tr><td data-num="19"></td><td><pre> <span class="token keyword">auto</span> l_end <span class="token operator">=</span> middle<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="20"></td><td><pre> <span class="token keyword">auto</span> r_begin <span class="token operator">=</span> middle <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="21"></td><td><pre> <span class="token keyword">auto</span> r_end <span class="token operator">=</span> end<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="22"></td><td><pre></pre></td></tr><tr><td data-num="23"></td><td><pre> size_t l_len <span class="token operator">=</span> length <span class="token operator">/</span> <span class="token number">2</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="24"></td><td><pre> size_t r_len <span class="token operator">=</span> length <span class="token operator">-</span> l_len <span class="token operator">-</span> <span class="token number">1</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="25"></td><td><pre></pre></td></tr><tr><td data-num="26"></td><td><pre> KDNodePtr left<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="27"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>l_len <span class="token operator">></span> <span class="token number">0</span> <span class="token operator">&amp;&amp;</span> dim <span class="token operator">></span> <span class="token number">0</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="28"></td><td><pre> left <span class="token operator">=</span> <span class="token function">make_tree</span><span class="token punctuation">(</span>l_begin<span class="token punctuation">,</span> 
l_end<span class="token punctuation">,</span> l_len<span class="token punctuation">,</span> <span class="token punctuation">(</span>level <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">)</span> <span class="token operator">%</span> dim<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="29"></td><td><pre> <span class="token punctuation">&#125;</span> <span class="token keyword">else</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="30"></td><td><pre> left <span class="token operator">=</span> leaf<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="31"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="32"></td><td><pre> KDNodePtr right<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="33"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>r_len <span class="token operator">></span> <span class="token number">0</span> <span class="token operator">&amp;&amp;</span> dim <span class="token operator">></span> <span class="token number">0</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="34"></td><td><pre> right <span class="token operator">=</span> <span class="token function">make_tree</span><span class="token punctuation">(</span>r_begin<span class="token punctuation">,</span> r_end<span class="token punctuation">,</span> r_len<span class="token punctuation">,</span> <span class="token punctuation">(</span>level <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">)</span> <span class="token operator">%</span> dim<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="35"></td><td><pre> <span class="token 
punctuation">&#125;</span> <span class="token keyword">else</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="36"></td><td><pre> right <span class="token operator">=</span> leaf<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="37"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="38"></td><td><pre></pre></td></tr><tr><td data-num="39"></td><td><pre> <span class="token comment">// KDNode result = KDNode();</span></pre></td></tr><tr><td data-num="40"></td><td><pre> <span class="token keyword">return</span> std<span class="token double-colon punctuation">::</span><span class="token generic-function"><span class="token function">make_shared</span><span class="token generic class-name"><span class="token operator">&lt;</span> KDNode <span class="token operator">></span></span></span><span class="token punctuation">(</span><span class="token operator">*</span>middle<span class="token punctuation">,</span> left<span class="token punctuation">,</span> right<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="41"></td><td><pre><span class="token punctuation">&#125;</span></pre></td></tr></table></figure><p>This code does not use the second optimization: the splitting dim is simply incremented at each level, modulo the number of dimensions.</p>
<p>The query algorithm:</p>
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre>KDNodePtr <span class="token class-name">KDTree</span><span class="token double-colon punctuation">::</span><span class="token function">nearest_</span><span class="token punctuation">(</span> <span class="token comment">//</span></pre></td></tr><tr><td data-num="2"></td><td><pre> <span class="token keyword">const</span> KDNodePtr <span class="token operator">&amp;</span>branch<span class="token punctuation">,</span> <span class="token comment">//</span></pre></td></tr><tr><td data-num="3"></td><td><pre> <span class="token keyword">const</span> point_t <span class="token operator">&amp;</span>pt<span class="token punctuation">,</span> <span class="token comment">//</span></pre></td></tr><tr><td data-num="4"></td><td><pre> <span class="token keyword">const</span> size_t <span class="token operator">&amp;</span>level<span class="token punctuation">,</span> <span class="token comment">//</span></pre></td></tr><tr><td data-num="5"></td><td><pre> <span class="token keyword">const</span> KDNodePtr <span class="token operator">&amp;</span>best<span class="token punctuation">,</span> <span class="token comment">//</span></pre></td></tr><tr><td data-num="6"></td><td><pre> <span class="token keyword">const</span> <span class="token keyword">double</span> <span class="token operator">&amp;</span>best_dist <span class="token comment">//</span></pre></td></tr><tr><td data-num="7"></td><td><pre><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="8"></td><td><pre> <span class="token keyword">double</span> d<span class="token punctuation">,</span> dx<span class="token punctuation">,</span> dx2<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="9"></td><td><pre></pre></td></tr><tr><td data-num="10"></td><td><pre> <span class="token keyword">if</span> <span class="token 
punctuation">(</span><span class="token operator">!</span><span class="token keyword">bool</span><span class="token punctuation">(</span><span class="token operator">*</span>branch<span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span> <span class="token comment">// reached a leaf (an empty node)</span></pre></td></tr><tr><td data-num="11"></td><td><pre> <span class="token keyword">return</span> <span class="token function">NewKDNodePtr</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// basically, null</span></pre></td></tr><tr><td data-num="12"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="13"></td><td><pre></pre></td></tr><tr><td data-num="14"></td><td><pre> point_t <span class="token function">branch_pt</span><span class="token punctuation">(</span><span class="token operator">*</span>branch<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// works because an operator was overloaded on the node type earlier</span></pre></td></tr><tr><td data-num="15"></td><td><pre> size_t dim <span class="token operator">=</span> branch_pt<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="16"></td><td><pre></pre></td></tr><tr><td data-num="17"></td><td><pre> d <span class="token operator">=</span> <span class="token function">dist2</span><span class="token punctuation">(</span>branch_pt<span class="token punctuation">,</span> pt<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="18"></td><td><pre> dx <span class="token operator">=</span> branch_pt<span class="token punctuation">.</span><span class="token function">at</span><span 
class="token punctuation">(</span>level<span class="token punctuation">)</span> <span class="token operator">-</span> pt<span class="token punctuation">.</span><span class="token function">at</span><span class="token punctuation">(</span>level<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="19"></td><td><pre> dx2 <span class="token operator">=</span> dx <span class="token operator">*</span> dx<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="20"></td><td><pre></pre></td></tr><tr><td data-num="21"></td><td><pre> KDNodePtr best_l <span class="token operator">=</span> best<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="22"></td><td><pre> <span class="token keyword">double</span> best_dist_l <span class="token operator">=</span> best_dist<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="23"></td><td><pre></pre></td></tr><tr><td data-num="24"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>d <span class="token operator">&lt;</span> best_dist<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="25"></td><td><pre> best_dist_l <span class="token operator">=</span> d<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="26"></td><td><pre> best_l <span class="token operator">=</span> branch<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="27"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="28"></td><td><pre></pre></td></tr><tr><td data-num="29"></td><td><pre> size_t next_lv <span class="token operator">=</span> <span class="token punctuation">(</span>level <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">)</span> <span class="token operator">%</span> dim<span class="token 
punctuation">;</span></pre></td></tr><tr><td data-num="30"></td><td><pre> KDNodePtr section<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="31"></td><td><pre> KDNodePtr other<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="32"></td><td><pre></pre></td></tr><tr><td data-num="33"></td><td><pre> <span class="token comment">// select which branch makes sense to check</span></pre></td></tr><tr><td data-num="34"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>dx <span class="token operator">></span> <span class="token number">0</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="35"></td><td><pre> section <span class="token operator">=</span> branch<span class="token operator">-></span>left<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="36"></td><td><pre> other <span class="token operator">=</span> branch<span class="token operator">-></span>right<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="37"></td><td><pre> <span class="token punctuation">&#125;</span> <span class="token keyword">else</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="38"></td><td><pre> section <span class="token operator">=</span> branch<span class="token operator">-></span>right<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="39"></td><td><pre> other <span class="token operator">=</span> branch<span class="token operator">-></span>left<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="40"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="41"></td><td><pre></pre></td></tr><tr><td data-num="42"></td><td><pre> <span class="token comment">// keep nearest neighbor from further down the tree</span></pre></td></tr><tr><td data-num="43"></td><td><pre> KDNodePtr 
further <span class="token operator">=</span> <span class="token function">nearest_</span><span class="token punctuation">(</span>section<span class="token punctuation">,</span> pt<span class="token punctuation">,</span> next_lv<span class="token punctuation">,</span> best_l<span class="token punctuation">,</span> best_dist_l<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="44"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token operator">!</span>further<span class="token operator">-></span>x<span class="token punctuation">.</span><span class="token function">empty</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span> <span class="token comment">// best_l is actually a pointer, so this check is not really necessary.</span></pre></td></tr><tr><td data-num="45"></td><td><pre> <span class="token keyword">double</span> dl <span class="token operator">=</span> <span class="token function">dist2</span><span class="token punctuation">(</span>further<span class="token operator">-></span>x<span class="token punctuation">,</span> pt<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="46"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>dl <span class="token operator">&lt;</span> best_dist_l<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="47"></td><td><pre> best_dist_l <span class="token operator">=</span> dl<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="48"></td><td><pre> best_l <span class="token operator">=</span> further<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="49"></td><td><pre> <span class="token 
punctuation">&#125;</span></pre></td></tr><tr><td data-num="50"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="51"></td><td><pre> <span class="token comment">// only check the other branch if it makes sense to do so</span></pre></td></tr><tr><td data-num="52"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span>dx2 <span class="token operator">&lt;</span> best_dist_l<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span> <span class="token comment">// the other branch may also contain the nearest neighbor</span></pre></td></tr><tr><td data-num="53"></td><td><pre> further <span class="token operator">=</span> <span class="token function">nearest_</span><span class="token punctuation">(</span>other<span class="token punctuation">,</span> pt<span class="token punctuation">,</span> next_lv<span class="token punctuation">,</span> best_l<span class="token punctuation">,</span> best_dist_l<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="54"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token operator">!</span>further<span class="token operator">-></span>x<span class="token punctuation">.</span><span class="token function">empty</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="55"></td><td><pre> <span class="token keyword">double</span> dl <span class="token operator">=</span> <span class="token function">dist2</span><span class="token punctuation">(</span>further<span class="token operator">-></span>x<span class="token punctuation">,</span> pt<span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="56"></td><td><pre> <span class="token 
keyword">if</span> <span class="token punctuation">(</span>dl <span class="token operator">&lt;</span> best_dist_l<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="57"></td><td><pre> best_dist_l <span class="token operator">=</span> dl<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="58"></td><td><pre> best_l <span class="token operator">=</span> further<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="59"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="60"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="61"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="62"></td><td><pre></pre></td></tr><tr><td data-num="63"></td><td><pre> <span class="token keyword">return</span> best_l<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="64"></td><td><pre><span class="token punctuation">&#125;</span><span class="token punctuation">;</span></pre></td></tr></table></figure><p>The query follows much the same idea as a lookup in a binary search tree, so I won't go into more detail.</p>
<h3 id="ball-tree"><a class="anchor" href="#ball-tree">#</a> ball tree</h3>
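<p>The idea behind the ball tree is even simpler.</p>
<p>Take the centroid of the current point set as the root. Find the point p farthest from the centroid, then the point q farthest from p. Use p and q to split the point set into two clusters: each point joins the p cluster if it is closer to p and the q cluster if it is closer to q. Repeat recursively until the maximum depth is reached.</p>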
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220423175340044.png" alt="image-20220423175340044" /></p>
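<p>First the gray point is taken as the root, and points 3 and 9 are used to split the set into two clusters. Each cluster is then split recursively.</p>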
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220423175455261.png" alt="image-20220423175455261" /></p>
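<p>One level of the split described above can be sketched as follows. This is a minimal illustration under assumptions: <code>Point</code>, <code>dist2</code>, <code>farthest</code>, <code>Split</code> and <code>ball_split</code> are hypothetical names, squared Euclidean distance is assumed, and the recursion and depth limit are left out.</p>

```cpp
#include <cstddef>
#include <vector>

using Point = std::vector<double>;

// Squared Euclidean distance (assumed metric for this sketch).
double dist2(const Point& a, const Point& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return s;
}

// Index of the point in pts that is farthest from `from`.
std::size_t farthest(const std::vector<Point>& pts, const Point& from) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < pts.size(); ++i)
        if (dist2(pts[i], from) > dist2(pts[best], from)) best = i;
    return best;
}

struct Split {
    std::vector<Point> near_p;  // points closer to pivot p
    std::vector<Point> near_q;  // points closer to pivot q
};

// One level of the ball tree split. Assumes pts is non-empty.
Split ball_split(const std::vector<Point>& pts) {
    // Centroid of the current point set.
    Point c(pts.front().size(), 0.0);
    for (const Point& x : pts)
        for (std::size_t i = 0; i < c.size(); ++i) c[i] += x[i];
    for (double& v : c) v /= static_cast<double>(pts.size());

    const Point p = pts[farthest(pts, c)];  // farthest from the centroid
    const Point q = pts[farthest(pts, p)];  // farthest from p

    Split s;
    for (const Point& x : pts) {
        if (dist2(x, p) <= dist2(x, q)) s.near_p.push_back(x);
        else s.near_q.push_back(x);
    }
    return s;
}
```

<p>A full builder would call <code>ball_split</code> again on each half and stop at the chosen maximum depth.</p>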
<h3 id="vp-treee"><a class="anchor" href="#vp-treee">#</a> VP tree</h3>
<p>This article helped me a lot in understanding the VP tree: <span class="exturl" data-url="aHR0cDovL3N0ZXZlaGFub3YuY2EvYmxvZy8/aWQ9MTMw">http://stevehanov.ca/blog/?id=130</span></p>
<p>So, what is a VP tree?</p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220423194326540.png" alt="image-20220423194326540" /></p>
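<p>For each node p we choose a radius r. All points whose distance to p is less than r go into the left subtree, and all points whose distance is greater than r go into the right subtree. In the actual code, the remaining points are partitioned by their distance to p around the median: the points before the median form the left subtree and the points from the median onward form the right subtree, so r is the median distance.</p>
<p>When querying, we need to decide whether to search the left subtree or the right subtree.</p>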
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220423195255100.png" alt="image-20220423195255100" /></p>
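<p>If the distance from the query point <code>x</code> to the current node p is less than tau, add p to the result (we are looking for the k nearest neighbors). Once the result holds k points, tau is always the largest distance to <code>x</code> among the points in the result.</p>
<p>Because <code>x</code> lies inside the circle, we search the left subtree first. The blue points are the ones found in the left subtree, and tau has been updated accordingly.</p>
<p><em>If tau &gt; distance to shell, there may still be closer points outside the circle, so the right subtree has to be searched as well.</em></p>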
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220423195812891.png" alt="image-20220423195812891" /></p>
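<p>The pruning rule can be isolated into a small helper to make the two conditions easier to see. This is a sketch under assumptions: <code>Visit</code> and <code>sides_to_visit</code> are hypothetical names that do not appear in the implementation below, and the helper only mirrors the comparisons <code>dist - tau &lt;= threshold</code> and <code>dist + tau &gt;= threshold</code> that the recursive search relies on.</p>

```cpp
#include <cassert>

// Which side(s) of the vantage point's sphere can still hold a better match?
//   dist      : distance from the query to the vantage point p
//   threshold : radius r stored in the node
//   tau       : distance to the current k-th best candidate
struct Visit { bool inside; bool outside; };

inline Visit sides_to_visit(double dist, double threshold, double tau) {
    assert(dist >= 0.0 && threshold >= 0.0 && tau >= 0.0);
    Visit v;
    // The ball of radius tau around the query reaches into the sphere.
    v.inside = dist - tau <= threshold;
    // The ball of radius tau around the query reaches past the shell.
    v.outside = dist + tau >= threshold;
    return v;
}
```

<p>For example, with <code>dist = 2</code>, <code>threshold = 5</code> and <code>tau = 1</code> the distance to the shell is 3, which is larger than tau, so only the inside needs to be searched. With <code>tau = 4</code> the ball around the query crosses the shell and both sides have to be visited.</p>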
<figure class="highlight cpp"><figcaption data-lang="C++"></figcaption><table><tr><td data-num="1"></td><td><pre><span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">&lt;stdlib.h></span></span></pre></td></tr><tr><td data-num="2"></td><td><pre><span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">&lt;algorithm></span></span></pre></td></tr><tr><td data-num="3"></td><td><pre><span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">&lt;vector></span></span></pre></td></tr><tr><td data-num="4"></td><td><pre><span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">&lt;stdio.h></span></span></pre></td></tr><tr><td data-num="5"></td><td><pre><span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">&lt;queue></span></span></pre></td></tr><tr><td data-num="6"></td><td><pre><span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">&lt;limits></span></span></pre></td></tr><tr><td data-num="7"></td><td><pre></pre></td></tr><tr><td data-num="8"></td><td><pre><span class="token keyword">template</span><span class="token operator">&lt;</span><span class="token keyword">typename</span> <span class="token class-name">T</span><span class="token punctuation">,</span> <span class="token keyword">double</span> <span class="token punctuation">(</span><span class="token operator">*</span>distance<span class="token punctuation">)</span><span class="token punctuation">(</span> <span 
class="token keyword">const</span> T<span class="token operator">&amp;</span><span class="token punctuation">,</span> <span class="token keyword">const</span> T<span class="token operator">&amp;</span> <span class="token punctuation">)</span><span class="token operator">></span></pre></td></tr><tr><td data-num="9"></td><td><pre><span class="token keyword">class</span> <span class="token class-name">VpTree</span></pre></td></tr><tr><td data-num="10"></td><td><pre><span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="11"></td><td><pre><span class="token keyword">public</span><span class="token operator">:</span></pre></td></tr><tr><td data-num="12"></td><td><pre> <span class="token function">VpTree</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">:</span> <span class="token function">_root</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span><span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="13"></td><td><pre></pre></td></tr><tr><td data-num="14"></td><td><pre> <span class="token operator">~</span><span class="token function">VpTree</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="15"></td><td><pre> <span class="token keyword">delete</span> _root<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="16"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="17"></td><td><pre></pre></td></tr><tr><td data-num="18"></td><td><pre> <span class="token keyword">void</span> <span class="token function">create</span><span class="token punctuation">(</span> <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>vector<span 
class="token operator">&lt;</span>T<span class="token operator">></span><span class="token operator">&amp;</span> items <span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="19"></td><td><pre> <span class="token keyword">delete</span> _root<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="20"></td><td><pre> _items <span class="token operator">=</span> items<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="21"></td><td><pre> _root <span class="token operator">=</span> <span class="token function">buildFromPoints</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> items<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="22"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="23"></td><td><pre></pre></td></tr><tr><td data-num="24"></td><td><pre> <span class="token keyword">void</span> <span class="token function">search</span><span class="token punctuation">(</span> <span class="token keyword">const</span> T<span class="token operator">&amp;</span> target<span class="token punctuation">,</span> <span class="token keyword">int</span> k<span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator">&lt;</span>T<span class="token operator">></span><span class="token operator">*</span> results<span class="token punctuation">,</span> </pre></td></tr><tr><td data-num="25"></td><td><pre> std<span class="token double-colon punctuation">::</span>vector<span class="token operator">&lt;</span><span class="token keyword">double</span><span class="token operator">></span><span class="token operator">*</span> distances<span class="token punctuation">)</span> </pre></td></tr><tr><td data-num="26"></td><td><pre> <span 
class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="27"></td><td><pre> std<span class="token double-colon punctuation">::</span>priority_queue<span class="token operator">&lt;</span>HeapItem<span class="token operator">></span> heap<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="28"></td><td><pre></pre></td></tr><tr><td data-num="29"></td><td><pre> _tau <span class="token operator">=</span> std<span class="token double-colon punctuation">::</span>numeric_limits<span class="token operator">&lt;</span><span class="token keyword">double</span><span class="token operator">></span><span class="token double-colon punctuation">::</span><span class="token function">max</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="30"></td><td><pre> <span class="token function">search</span><span class="token punctuation">(</span> _root<span class="token punctuation">,</span> target<span class="token punctuation">,</span> k<span class="token punctuation">,</span> heap <span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="31"></td><td><pre></pre></td></tr><tr><td data-num="32"></td><td><pre> results<span class="token operator">-></span><span class="token function">clear</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> distances<span class="token operator">-></span><span class="token function">clear</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="33"></td><td><pre></pre></td></tr><tr><td data-num="34"></td><td><pre> <span class="token keyword">while</span><span class="token punctuation">(</span> <span class="token operator">!</span>heap<span class="token punctuation">.</span><span class="token function">empty</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span 
class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="35"></td><td><pre> results<span class="token operator">-></span><span class="token function">push_back</span><span class="token punctuation">(</span> _items<span class="token punctuation">[</span>heap<span class="token punctuation">.</span><span class="token function">top</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>index<span class="token punctuation">]</span> <span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="36"></td><td><pre> distances<span class="token operator">-></span><span class="token function">push_back</span><span class="token punctuation">(</span> heap<span class="token punctuation">.</span><span class="token function">top</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>dist <span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="37"></td><td><pre> heap<span class="token punctuation">.</span><span class="token function">pop</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="38"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="39"></td><td><pre></pre></td></tr><tr><td data-num="40"></td><td><pre> std<span class="token double-colon punctuation">::</span><span class="token function">reverse</span><span class="token punctuation">(</span> results<span class="token operator">-></span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> results<span class="token operator">-></span><span class="token 
function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="41"></td><td><pre> std<span class="token double-colon punctuation">::</span><span class="token function">reverse</span><span class="token punctuation">(</span> distances<span class="token operator">-></span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> distances<span class="token operator">-></span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="42"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="43"></td><td><pre></pre></td></tr><tr><td data-num="44"></td><td><pre><span class="token keyword">private</span><span class="token operator">:</span></pre></td></tr><tr><td data-num="45"></td><td><pre> std<span class="token double-colon punctuation">::</span>vector<span class="token operator">&lt;</span>T<span class="token operator">></span> _items<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="46"></td><td><pre> <span class="token keyword">double</span> _tau<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="47"></td><td><pre></pre></td></tr><tr><td data-num="48"></td><td><pre> <span class="token keyword">struct</span> <span class="token class-name">Node</span> </pre></td></tr><tr><td data-num="49"></td><td><pre> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="50"></td><td><pre> <span class="token keyword">int</span> index<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="51"></td><td><pre> <span class="token 
keyword">double</span> threshold<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="52"></td><td><pre> Node<span class="token operator">*</span> left<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="53"></td><td><pre> Node<span class="token operator">*</span> right<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="54"></td><td><pre></pre></td></tr><tr><td data-num="55"></td><td><pre> <span class="token function">Node</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">:</span></pre></td></tr><tr><td data-num="56"></td><td><pre> <span class="token function">index</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">threshold</span><span class="token punctuation">(</span><span class="token number">0.</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">left</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">right</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span><span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="57"></td><td><pre></pre></td></tr><tr><td data-num="58"></td><td><pre> <span class="token operator">~</span><span class="token function">Node</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="59"></td><td><pre> <span class="token keyword">delete</span> left<span class="token punctuation">;</span></pre></td></tr><tr><td 
data-num="60"></td><td><pre> <span class="token keyword">delete</span> right<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="61"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="62"></td><td><pre> <span class="token punctuation">&#125;</span><span class="token operator">*</span> _root<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="63"></td><td><pre></pre></td></tr><tr><td data-num="64"></td><td><pre> <span class="token keyword">struct</span> <span class="token class-name">HeapItem</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="65"></td><td><pre> <span class="token function">HeapItem</span><span class="token punctuation">(</span> <span class="token keyword">int</span> index<span class="token punctuation">,</span> <span class="token keyword">double</span> dist<span class="token punctuation">)</span> <span class="token operator">:</span></pre></td></tr><tr><td data-num="66"></td><td><pre> <span class="token function">index</span><span class="token punctuation">(</span>index<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">dist</span><span class="token punctuation">(</span>dist<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span><span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="67"></td><td><pre> <span class="token keyword">int</span> index<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="68"></td><td><pre> <span class="token keyword">double</span> dist<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="69"></td><td><pre> <span class="token keyword">bool</span> <span class="token keyword">operator</span><span class="token operator">&lt;</span><span class="token punctuation">(</span> <span class="token keyword">const</span> HeapItem<span class="token operator">&amp;</span> 
o <span class="token punctuation">)</span> <span class="token keyword">const</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="70"></td><td><pre> <span class="token keyword">return</span> dist <span class="token operator">&lt;</span> o<span class="token punctuation">.</span>dist<span class="token punctuation">;</span> </pre></td></tr><tr><td data-num="71"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="72"></td><td><pre> <span class="token punctuation">&#125;</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="73"></td><td><pre></pre></td></tr><tr><td data-num="74"></td><td><pre> <span class="token keyword">struct</span> <span class="token class-name">DistanceComparator</span></pre></td></tr><tr><td data-num="75"></td><td><pre> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="76"></td><td><pre> <span class="token keyword">const</span> T<span class="token operator">&amp;</span> item<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="77"></td><td><pre> <span class="token function">DistanceComparator</span><span class="token punctuation">(</span> <span class="token keyword">const</span> T<span class="token operator">&amp;</span> item <span class="token punctuation">)</span> <span class="token operator">:</span> <span class="token function">item</span><span class="token punctuation">(</span>item<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span><span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="78"></td><td><pre> <span class="token keyword">bool</span> <span class="token keyword">operator</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">(</span><span class="token keyword">const</span> T<span class="token operator">&amp;</span> a<span class="token punctuation">,</span> <span 
class="token keyword">const</span> T<span class="token operator">&amp;</span> b<span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="79"></td><td><pre> <span class="token keyword">return</span> <span class="token function">distance</span><span class="token punctuation">(</span> item<span class="token punctuation">,</span> a <span class="token punctuation">)</span> <span class="token operator">&lt;</span> <span class="token function">distance</span><span class="token punctuation">(</span> item<span class="token punctuation">,</span> b <span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="80"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="81"></td><td><pre> <span class="token punctuation">&#125;</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="82"></td><td><pre></pre></td></tr><tr><td data-num="83"></td><td><pre> Node<span class="token operator">*</span> <span class="token function">buildFromPoints</span><span class="token punctuation">(</span> <span class="token keyword">int</span> lower<span class="token punctuation">,</span> <span class="token keyword">int</span> upper <span class="token punctuation">)</span></pre></td></tr><tr><td data-num="84"></td><td><pre> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="85"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span> upper <span class="token operator">==</span> lower <span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="86"></td><td><pre> <span class="token keyword">return</span> <span class="token constant">NULL</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="87"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td 
data-num="88"></td><td><pre></pre></td></tr><tr><td data-num="89"></td><td><pre> Node<span class="token operator">*</span> node <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token function">Node</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="90"></td><td><pre> node<span class="token operator">-></span>index <span class="token operator">=</span> lower<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="91"></td><td><pre></pre></td></tr><tr><td data-num="92"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span> upper <span class="token operator">-</span> lower <span class="token operator">></span> <span class="token number">1</span> <span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="93"></td><td><pre></pre></td></tr><tr><td data-num="94"></td><td><pre> <span class="token comment">// choose an arbitrary point and move it to the start</span></pre></td></tr><tr><td data-num="95"></td><td><pre> <span class="token keyword">int</span> i <span class="token operator">=</span> <span class="token punctuation">(</span><span class="token keyword">int</span><span class="token punctuation">)</span><span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token keyword">double</span><span class="token punctuation">)</span><span class="token function">rand</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">/</span> RAND_MAX <span class="token operator">*</span> <span class="token punctuation">(</span>upper <span class="token operator">-</span> lower <span class="token operator">-</span> <span class="token number">1</span><span class="token punctuation">)</span> <span class="token punctuation">)</span> 
<span class="token operator">+</span> lower<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="96"></td><td><pre> std<span class="token double-colon punctuation">::</span><span class="token function">swap</span><span class="token punctuation">(</span> _items<span class="token punctuation">[</span>lower<span class="token punctuation">]</span><span class="token punctuation">,</span> _items<span class="token punctuation">[</span>i<span class="token punctuation">]</span> <span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="97"></td><td><pre></pre></td></tr><tr><td data-num="98"></td><td><pre> <span class="token keyword">int</span> median <span class="token operator">=</span> <span class="token punctuation">(</span> upper <span class="token operator">+</span> lower <span class="token punctuation">)</span> <span class="token operator">/</span> <span class="token number">2</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="99"></td><td><pre></pre></td></tr><tr><td data-num="100"></td><td><pre> <span class="token comment">// partition around the median distance</span></pre></td></tr><tr><td data-num="101"></td><td><pre> std<span class="token double-colon punctuation">::</span><span class="token function">nth_element</span><span class="token punctuation">(</span> </pre></td></tr><tr><td data-num="102"></td><td><pre> _items<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span> lower <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">,</span> </pre></td></tr><tr><td data-num="103"></td><td><pre> _items<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span 
class="token operator">+</span> median<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="104"></td><td><pre> _items<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">+</span> upper<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="105"></td><td><pre> <span class="token function">DistanceComparator</span><span class="token punctuation">(</span> _items<span class="token punctuation">[</span>lower<span class="token punctuation">]</span> <span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="106"></td><td><pre></pre></td></tr><tr><td data-num="107"></td><td><pre> <span class="token comment">// what was the median?</span></pre></td></tr><tr><td data-num="108"></td><td><pre> node<span class="token operator">-></span>threshold <span class="token operator">=</span> <span class="token function">distance</span><span class="token punctuation">(</span> _items<span class="token punctuation">[</span>lower<span class="token punctuation">]</span><span class="token punctuation">,</span> _items<span class="token punctuation">[</span>median<span class="token punctuation">]</span> <span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="109"></td><td><pre></pre></td></tr><tr><td data-num="110"></td><td><pre> node<span class="token operator">-></span>index <span class="token operator">=</span> lower<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="111"></td><td><pre> node<span class="token operator">-></span>left <span class="token operator">=</span> <span class="token function">buildFromPoints</span><span class="token punctuation">(</span> lower <span class="token operator">+</span> <span class="token number">1</span><span 
class="token punctuation">,</span> median <span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="112"></td><td><pre> node<span class="token operator">-></span>right <span class="token operator">=</span> <span class="token function">buildFromPoints</span><span class="token punctuation">(</span> median<span class="token punctuation">,</span> upper <span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="113"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="114"></td><td><pre></pre></td></tr><tr><td data-num="115"></td><td><pre> <span class="token keyword">return</span> node<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="116"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="117"></td><td><pre></pre></td></tr><tr><td data-num="118"></td><td><pre> <span class="token keyword">void</span> <span class="token function">search</span><span class="token punctuation">(</span> Node<span class="token operator">*</span> node<span class="token punctuation">,</span> <span class="token keyword">const</span> T<span class="token operator">&amp;</span> target<span class="token punctuation">,</span> <span class="token keyword">int</span> k<span class="token punctuation">,</span></pre></td></tr><tr><td data-num="119"></td><td><pre> std<span class="token double-colon punctuation">::</span>priority_queue<span class="token operator">&lt;</span>HeapItem<span class="token operator">></span><span class="token operator">&amp;</span> heap <span class="token punctuation">)</span></pre></td></tr><tr><td data-num="120"></td><td><pre> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="121"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span> node <span class="token operator">==</span> <span class="token constant">NULL</span> <span class="token punctuation">)</span> <span class="token 
keyword">return</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="122"></td><td><pre></pre></td></tr><tr><td data-num="123"></td><td><pre> <span class="token keyword">double</span> dist <span class="token operator">=</span> <span class="token function">distance</span><span class="token punctuation">(</span> _items<span class="token punctuation">[</span>node<span class="token operator">-></span>index<span class="token punctuation">]</span><span class="token punctuation">,</span> target <span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="124"></td><td><pre> <span class="token comment">//printf("dist=%g tau=%gn", dist, _tau );</span></pre></td></tr><tr><td data-num="125"></td><td><pre></pre></td></tr><tr><td data-num="126"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span> dist <span class="token operator">&lt;</span> _tau <span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="127"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span> heap<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">==</span> k <span class="token punctuation">)</span> heap<span class="token punctuation">.</span><span class="token function">pop</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="128"></td><td><pre> heap<span class="token punctuation">.</span><span class="token function">push</span><span class="token punctuation">(</span> <span class="token function">HeapItem</span><span class="token punctuation">(</span>node<span class="token operator">-></span>index<span class="token punctuation">,</span> dist<span 
class="token punctuation">)</span> <span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="129"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span> heap<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">==</span> k <span class="token punctuation">)</span> _tau <span class="token operator">=</span> heap<span class="token punctuation">.</span><span class="token function">top</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>dist<span class="token punctuation">;</span></pre></td></tr><tr><td data-num="130"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="131"></td><td><pre></pre></td></tr><tr><td data-num="132"></td><td><pre></pre></td></tr><tr><td data-num="133"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span> node<span class="token operator">-></span>left <span class="token operator">==</span> <span class="token constant">NULL</span> <span class="token operator">&amp;&amp;</span> node<span class="token operator">-></span>right <span class="token operator">==</span> <span class="token constant">NULL</span> <span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="134"></td><td><pre> <span class="token keyword">return</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="135"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="136"></td><td><pre></pre></td></tr><tr><td data-num="137"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span> dist <span class="token operator">&lt;</span> node<span class="token 
operator">-></span>threshold <span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="138"></td><td><pre> <span class="token function">search</span><span class="token punctuation">(</span> node<span class="token operator">-></span>left<span class="token punctuation">,</span> target<span class="token punctuation">,</span> k<span class="token punctuation">,</span> heap <span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="139"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span> dist <span class="token operator">+</span> _tau <span class="token operator">>=</span> node<span class="token operator">-></span>threshold <span class="token punctuation">)</span> <span class="token punctuation">&#123;</span> <span class="token comment">// 说明外部还有可能有节点。</span></pre></td></tr><tr><td data-num="140"></td><td><pre> <span class="token function">search</span><span class="token punctuation">(</span> node<span class="token operator">-></span>right<span class="token punctuation">,</span> target<span class="token punctuation">,</span> k<span class="token punctuation">,</span> heap <span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="141"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="142"></td><td><pre></pre></td></tr><tr><td data-num="143"></td><td><pre> <span class="token punctuation">&#125;</span> <span class="token keyword">else</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="144"></td><td><pre> <span class="token function">search</span><span class="token punctuation">(</span> node<span class="token operator">-></span>right<span class="token punctuation">,</span> target<span class="token punctuation">,</span> k<span class="token punctuation">,</span> heap <span class="token 
punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="145"></td><td><pre> <span class="token keyword">if</span> <span class="token punctuation">(</span> dist <span class="token operator">-</span> _tau <span class="token operator">&lt;=</span> node<span class="token operator">-></span>threshold <span class="token punctuation">)</span> <span class="token punctuation">&#123;</span></pre></td></tr><tr><td data-num="146"></td><td><pre> <span class="token function">search</span><span class="token punctuation">(</span> node<span class="token operator">-></span>left<span class="token punctuation">,</span> target<span class="token punctuation">,</span> k<span class="token punctuation">,</span> heap <span class="token punctuation">)</span><span class="token punctuation">;</span></pre></td></tr><tr><td data-num="147"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="148"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="149"></td><td><pre> <span class="token punctuation">&#125;</span></pre></td></tr><tr><td data-num="150"></td><td><pre><span class="token punctuation">&#125;</span><span class="token punctuation">;</span></pre></td></tr></table></figure><h3 id="brute-force"><a class="anchor" href="#brute-force">#</a> Brute Force</h3>
<p>Put plainly, this is just an exhaustive brute-force search.</p>
]]></description>
</item>
<item>
<guid isPermalink="true">https://songlinlife.top/2022/%E6%95%B0%E6%8D%AE%E5%BA%93/%E8%AE%BA%E6%96%87%E9%98%85%E8%AF%BB%EF%BC%9AFANNG/</guid>
<title>Paper Reading: FANNG</title>
<link>https://songlinlife.top/2022/%E6%95%B0%E6%8D%AE%E5%BA%93/%E8%AE%BA%E6%96%87%E9%98%85%E8%AF%BB%EF%BC%9AFANNG/</link>
<category term="数据库" scheme="https://songlinlife.top/categories/%E6%95%B0%E6%8D%AE%E5%BA%93/" />
<category term="ANNS" scheme="https://songlinlife.top/tags/ANNS/" />
<pubDate>Thu, 21 Apr 2022 15:36:17 +0800</pubDate>
<description><![CDATA[ <p>This post gives a short walkthrough of the FANNG algorithm (Fast Approximate Nearest Neighbour Graph).</p>
<h3 id="本文贡献"><a class="anchor" href="#本文贡献">#</a> Contributions of the paper</h3>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424121804922.png" alt="image-20220424121804922" /></p>
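<h3 id="fann算法拆解"><a class="anchor" href="#fann算法拆解">#</a> Breaking down the FANNG algorithm</h3>
<h4 id="理想的图结构"><a class="anchor" href="#理想的图结构">#</a> The ideal graph structure</h4>
<p>Start with this simple greedy algorithm. It made me ask whether we need to maintain a visit list to guarantee that the search never backtracks. Because the search path is monotonic (every hop moves strictly closer to the query), the search never turns back even without a visit list. The real use of a visit list shows up when the candidate is a set of nodes rather than a single node: a monotonic path still cannot backtrack in that case, but the same node can be reached from several candidates, and the visit list prunes those repeated visits.</p>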
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424122910794.png" alt="image-20220424122910794" /></p>
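<p>To make the point about the visit list concrete, here is a minimal Python sketch of greedy descent on a graph, written without any visited set. The names <code>greedy_search</code>, <code>points</code> and <code>edges</code> are illustrative assumptions rather than identifiers from the paper, and moving to the closest neighbour (instead of the first closer one) is a simplification chosen here. Since every hop strictly reduces the distance to the query, no vertex can appear twice on the path.</p>

```python
import math


def greedy_search(points, edges, start, query):
    """Hop to a strictly closer neighbour until no neighbour is closer.

    points: list of coordinate tuples (hypothetical toy data layout)
    edges:  dict mapping a vertex index to a list of neighbour indices
    Returns the final vertex and the path that was followed.
    """
    current = start
    path = [current]
    while True:
        best = current
        for nb in edges[current]:
            # Only strictly closer vertices are accepted, so the path is monotonic.
            if math.dist(points[nb], query) < math.dist(points[best], query):
                best = nb
        if best == current:
            return current, path
        current = best
        path.append(current)


# Toy example: four points on a line, each linked to its direct neighbours.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
edges = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
nearest, path = greedy_search(points, edges, start=0, query=(2.9, 0.0))
```

<p>In this toy run the path contains no repeated vertex even though vertex 1 links back to vertex 0, which is the behaviour the paragraph above describes.</p>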
<p>For this search to reach the query point from any starting vertex, the graph must guarantee that there is <code>always an edge that leads to a vertex which is closer to the query.</code> In other words, whenever the current vertex v is not yet the answer, it has a neighbour u with <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>d</mi><mi>i</mi><mi>s</mi><mo stretchy="false">(</mo><mi>Q</mi><mo separator="true">,</mo><msub><mi>P</mi><mi>u</mi></msub><mo stretchy="false">)</mo><mo>&lt;</mo><mi>d</mi><mi>i</mi><mi>s</mi><mo stretchy="false">(</mo><mi>Q</mi><mo separator="true">,</mo><msub><mi>P</mi><mi>v</mi></msub><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">dis(Q,P_u) &lt; dis(Q, P_v)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal">d</span><span class="mord mathnormal">i</span><span class="mord mathnormal">s</span><span class="mopen">(</span><span class="mord mathnormal">Q</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.151392em;"><span style="top:-2.5500000000000003em;margin-left:-0.13889em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">u</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">&lt;</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal">d</span><span class="mord mathnormal">i</span><span class="mord mathnormal">s</span><span class="mopen">(</span><span class="mord mathnormal">Q</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.151392em;"><span style="top:-2.5500000000000003em;margin-left:-0.13889em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.03588em;">v</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span>, and following the edge to u brings the search closer to Q. Every vertex must have at least one such edge (note the "at least"). This sentence is the heart of the paper and is worth rereading: it says exactly that a monotonic path exists between any two points, because having a monotonic path between any two points is the same as always having at least one edge that leads closer to the query. Such a graph is an MSNET.</p>
<p>Since a single such edge is enough, redundant edges can be pruned. Suppose <code>p1</code> and <code>p2</code> are joined by an edge <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>p</mi><mn>1</mn></msub><mo>→</mo><msub><mi>p</mi><mn>2</mn></msub></mrow><annotation encoding="application/x-tex">p_1 \rightarrow p_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.19444em;"></span><span class="mord"><span class="mord mathnormal">p</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">→</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.19444em;"></span><span class="mord"><span class="mord mathnormal">p</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span>. For any node p3 that is closer to p2 than to p1, the edge p1p3 is redundant: p1p2 can occlude p1p3. The monotonic-path condition still holds, because <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>δ</mi><mo stretchy="false">(</mo><mi>p</mi><mn>1</mn><mo separator="true">,</mo><mi>p</mi><mn>3</mn><mo stretchy="false">)</mo><mo>&gt;</mo><mi>δ</mi><mo stretchy="false">(</mo><mi>p</mi><mn>2</mn><mo separator="true">,</mo><mi>p</mi><mn>3</mn><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\delta(p1,p3) &gt; \delta(p2,p3)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.03785em;">δ</span><span class="mopen">(</span><span class="mord mathnormal">p</span><span class="mord">1</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord mathnormal">p</span><span class="mord">3</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.03785em;">δ</span><span class="mopen">(</span><span class="mord mathnormal">p</span><span class="mord">2</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord mathnormal">p</span><span class="mord">3</span><span class="mclose">)</span></span></span></span> means a monotonic path from p1 to p3 exists through p2. However, if <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>δ</mi><mo stretchy="false">(</mo><mi>p</mi><mn>1</mn><mo separator="true">,</mo><mi>p</mi><mn>2</mn><mo stretchy="false">)</mo><mo>&gt;</mo><mi>δ</mi><mo stretchy="false">(</mo><mi>p</mi><mn>1</mn><mo separator="true">,</mo><mi>p</mi><mn>3</mn><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\delta(p1,p2) &gt; \delta(p1,p3)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.03785em;">δ</span><span class="mopen">(</span><span class="mord mathnormal">p</span><span class="mord">1</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord mathnormal">p</span><span class="mord">2</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.03785em;">δ</span><span class="mopen">(</span><span class="mord mathnormal">p</span><span class="mord">1</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord mathnormal">p</span><span class="mord">3</span><span class="mclose">)</span></span></span></span>, going through p2 is a detour. The rule is therefore tightened: the edge p1p3 is treated as redundant only when both <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>δ</mi><mo stretchy="false">(</mo><mi>p</mi><mn>1</mn><mo separator="true">,</mo><mi>p</mi><mn>3</mn><mo stretchy="false">)</mo><mo>&gt;</mo><mi>δ</mi><mo stretchy="false">(</mo><mi>p</mi><mn>2</mn><mo separator="true">,</mo><mi>p</mi><mn>3</mn><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\delta(p1,p3) &gt; \delta(p2,p3)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.03785em;">δ</span><span class="mopen">(</span><span class="mord mathnormal">p</span><span class="mord">1</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord mathnormal">p</span><span class="mord">3</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.03785em;">δ</span><span class="mopen">(</span><span class="mord mathnormal">p</span><span class="mord">2</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord mathnormal">p</span><span class="mord">3</span><span class="mclose">)</span></span></span></span> and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>δ</mi><mo stretchy="false">(</mo><mi>p</mi><mn>1</mn><mo separator="true">,</mo><mi>p</mi><mn>2</mn><mo stretchy="false">)</mo><mo>&lt;</mo><mi>δ</mi><mo stretchy="false">(</mo><mi>p</mi><mn>1</mn><mo separator="true">,</mo><mi>p</mi><mn>3</mn><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\delta(p1,p2) &lt; \delta(p1,p3)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.03785em;">δ</span><span class="mopen">(</span><span class="mord mathnormal">p</span><span class="mord">1</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord mathnormal">p</span><span class="mord">2</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">&lt;</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.03785em;">δ</span><span class="mopen">(</span><span class="mord mathnormal">p</span><span class="mord">1</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord mathnormal">p</span><span class="mord">3</span><span class="mclose">)</span></span></span></span> hold:</p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424125504292.png" alt="image-20220424125504292" /></p>
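<p>The tightened occlusion rule is small enough to write down directly. The sketch below assumes Euclidean distance and plain coordinate tuples; the function name <code>occludes</code> is made up for illustration and does not come from the paper.</p>

```python
import math


def occludes(p1, p2, p3):
    """Return True if edge p1->p2 occludes edge p1->p3.

    Tightened rule from the text: p3 must be closer to p2 than to p1,
    and p2 must itself be closer to p1 than p3 is (no detour).
    """
    d12 = math.dist(p1, p2)
    d13 = math.dist(p1, p3)
    d23 = math.dist(p2, p3)
    return d13 > d23 and d12 < d13


# p3 lies "behind" p2 as seen from p1, so the long edge p1->p3 is redundant.
behind = occludes((0.0, 0.0), (1.0, 0.0), (2.0, 0.5))
# p3 is off to the side: p2 is no closer to it than p1 is, so nothing is occluded.
sideways = occludes((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
# Detour case: p2 is farther from p1 than p3 is, so the second condition fails.
detour = occludes((0.0, 0.0), (3.0, 0.0), (2.0, 0.0))
```

<p>The third call is the "detour" situation discussed above: the first inequality holds, but routing through the farther vertex p2 would be a step away from p3, so the edge is kept.</p>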
<h4 id="朴素构建"><a class="anchor" href="#朴素构建">#</a> Naive construction</h4>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424130543359.png" alt="image-20220424130543359" /></p>
<p>Note the first part highlighted in red, <code>sorted by distance(p_i, p_j)</code>. The sort matters: candidate edges are processed from shortest to longest, so an edge added later never has to <code>occlude</code> an edge that was added earlier.</p>
<p>Thinking about this so-called occlusion condition, I realised it is simply the RNG rule. Its edge-selection strategy is not just similar to the one used by NSG and HNSW, it is exactly the same.</p>
<p>Consider the following scenario: we want to decide whether a node q can be added to the neighbours of p. We check every node y already in the result list; if some y satisfies <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>d</mi><mi>i</mi><mi>s</mi><mo stretchy="false">(</mo><mi>p</mi><mo separator="true">,</mo><mi>q</mi><mo stretchy="false">)</mo><mo>&gt;</mo><mi>d</mi><mi>i</mi><mi>s</mi><mo stretchy="false">(</mo><mi>y</mi><mo separator="true">,</mo><mi>q</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">dis(p, q) &gt; dis(y, q)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal">d</span><span class="mord mathnormal">i</span><span class="mord mathnormal">s</span><span class="mopen">(</span><span class="mord mathnormal">p</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord mathnormal" style="margin-right:0.03588em;">q</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal">d</span><span class="mord mathnormal">i</span><span class="mord mathnormal">s</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.03588em;">y</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord mathnormal" style="margin-right:0.03588em;">q</span><span class="mclose">)</span></span></span></span>, the edge to q is occluded and q is rejected; otherwise q is added. Isn't that exactly the edge-selection strategy of HNSW and NSG? The paper takes a long detour to say something quite simple.</p>
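<p>The selection scenario described above can be sketched as a brute-force construction. This is a hedged illustration only: it assumes Euclidean distance, considers every other point as a candidate, and uses the hypothetical names <code>select_neighbours</code> and <code>build_graph</code>, none of which come from the paper's pseudocode.</p>

```python
import math


def select_neighbours(points, i):
    """Pick out-neighbours of point i with the occlusion (RNG-style) rule."""
    # Candidates are processed from nearest to farthest, as in the sorted loop.
    order = sorted(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )
    kept = []
    for j in order:
        d_ij = math.dist(points[i], points[j])
        # Edge i->j is occluded if some kept y is nearer to i than j is
        # and j is nearer to y than to i.
        occluded = any(
            math.dist(points[i], points[y]) < d_ij
            and math.dist(points[y], points[j]) < d_ij
            for y in kept
        )
        if not occluded:
            kept.append(j)
    return kept


def build_graph(points):
    """Apply the selection independently to every point."""
    return {i: select_neighbours(points, i) for i in range(len(points))}


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 1.0)]
graph = build_graph(points)
```

<p>Because shorter edges are handled first, the loop only ever checks a new candidate against edges that are already kept, which is the reason the sort is needed.</p>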
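<p>The authors also make a point of saying that, because of the occlusion rule, any two edges kept at the same vertex are at least 60° apart, so the edges of the graph are well spread out. I was rather speechless at this: isn't that just a property that follows from the RNG?</p>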
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424144923430.png" alt="image-20220424144923430" /></p>
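<p>The 60° claim can be checked with elementary triangle geometry. The derivation below is a sketch that assumes Euclidean distance and ignores ties between edge lengths; the labelling of the two kept edges is chosen here for illustration.</p>

```latex
\begin{aligned}
&\text{Let } p_1 \to p_2 \text{ and } p_1 \to p_3 \text{ both be kept, with } \delta(p_1,p_2) \le \delta(p_1,p_3). \\
&p_1 \to p_3 \text{ is not occluded by } p_1 \to p_2
  \;\Longrightarrow\; \delta(p_2,p_3) \ge \delta(p_1,p_3) \ge \delta(p_1,p_2). \\
&\text{So } p_2 p_3 \text{ is a longest side of the triangle } p_1 p_2 p_3,
  \text{ and the angle opposite it, } \angle p_2 p_1 p_3, \text{ is a largest angle.} \\
&\text{The three angles sum to } 180^\circ
  \;\Longrightarrow\; \angle p_2 p_1 p_3 \ge \tfrac{180^\circ}{3} = 60^\circ .
\end{aligned}
```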
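<h4 id="绝对最近邻保证"><a class="anchor" href="#绝对最近邻保证">#</a> Exact nearest-neighbour guarantee</h4>
<p>The graph built by the naive Algorithm 2 guarantees that, if the query point is itself a vertex of the graph, the search is certain to find its nearest neighbour. To find the nearest neighbour of a point q that is not in the graph, the authors give a relaxed condition:</p>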
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424151857822.png" alt="image-20220424151857822" /></p>
<p>The condition for occluding an edge is relaxed, but I have not worked out how this step is actually carried out.</p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424151953449.png" alt="image-20220424151953449" /></p>
<h4 id="回溯查询"><a class="anchor" href="#回溯查询">#</a> Backtracking search</h4>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424152548246.png" alt="image-20220424152548246" /></p>
<p>Here is how to turn this single-result search into a k-ANN search: replace n with a max-heap W.</p>
<p>Change lines 11-12 to:</p>
<p><span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>i</mi><mi>f</mi><mspace width="1em"/><mi>d</mi><mi>i</mi><mi>s</mi><mo stretchy="false">(</mo><mi>Q</mi><mo separator="true">,</mo><msub><mi>P</mi><mi>u</mi></msub><mo stretchy="false">)</mo><mo>&lt;</mo><mi>d</mi><mi>i</mi><mi>s</mi><mo stretchy="false">(</mo><mi>Q</mi><mo separator="true">,</mo><msub><mi>P</mi><mi>n</mi></msub><mo stretchy="false">)</mo><mspace width="1em"/><mi>o</mi><mi>r</mi><mspace width="1em"/><mi mathvariant="normal">∣</mi><mi>W</mi><mi mathvariant="normal">∣</mi><mo>&lt;</mo><mi>s</mi><mi>i</mi><mi>z</mi><mi>e</mi><mspace width="1em"/><mi>t</mi><mi>h</mi><mi>e</mi><mi>n</mi><mspace linebreak="newline"></mspace><mi>W</mi><mi mathvariant="normal">.</mi><mi>a</mi><mi>d</mi><mi>d</mi><mo stretchy="false">(</mo><msub><mi>P</mi><mi>u</mi></msub><mo stretchy="false">)</mo><mspace width="1em"/><mi>a</mi><mi>n</mi><mi>d</mi><mspace width="1em"/><mi>r</mi><mi>e</mi><mi>s</mi><mi>i</mi><mi>z</mi><mi>e</mi><mo stretchy="false">(</mo><mi>W</mi><mo separator="true">,</mo><mi>s</mi><mi>i</mi><mi>z</mi><mi>e</mi><mo stretchy="false">)</mo><mspace width="1em"/><mi>a</mi><mi>n</mi><mi>d</mi><mspace width="1em"/><mi>n</mi><mo>=</mo><mi>W</mi><mi mathvariant="normal">.</mi><mi>f</mi><mi>r</mi><mi>o</mi><mi>n</mi><mi>t</mi><mo stretchy="false">(</mo><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">if \quad dis(Q, P_u) &lt; dis( Q, P_n) \quad or \quad |W| &lt; size \quad then \\
W.add(P_u) \quad and \quad resize(W, size) \quad and \quad n =W.front()
</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal">i</span><span class="mord mathnormal" style="margin-right:0.10764em;">f</span><span class="mspace" style="margin-right:1em;"></span><span class="mord mathnormal">d</span><span class="mord mathnormal">i</span><span class="mord mathnormal">s</span><span class="mopen">(</span><span class="mord mathnormal">Q</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.151392em;"><span style="top:-2.5500000000000003em;margin-left:-0.13889em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">u</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">&lt;</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal">d</span><span class="mord mathnormal">i</span><span class="mord mathnormal">s</span><span class="mopen">(</span><span class="mord mathnormal">Q</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" 
style="height:0.151392em;"><span style="top:-2.5500000000000003em;margin-left:-0.13889em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">n</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mclose">)</span><span class="mspace" style="margin-right:1em;"></span><span class="mord mathnormal">o</span><span class="mord mathnormal" style="margin-right:0.02778em;">r</span><span class="mspace" style="margin-right:1em;"></span><span class="mord">∣</span><span class="mord mathnormal" style="margin-right:0.13889em;">W</span><span class="mord">∣</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">&lt;</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:0.69444em;vertical-align:0em;"></span><span class="mord mathnormal">s</span><span class="mord mathnormal">i</span><span class="mord mathnormal" style="margin-right:0.04398em;">z</span><span class="mord mathnormal">e</span><span class="mspace" style="margin-right:1em;"></span><span class="mord mathnormal">t</span><span class="mord mathnormal">h</span><span class="mord mathnormal">e</span><span class="mord mathnormal">n</span></span><span class="mspace newline"></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.13889em;">W</span><span class="mord">.</span><span class="mord mathnormal">a</span><span class="mord mathnormal">d</span><span class="mord mathnormal">d</span><span class="mopen">(</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span 
class="vlist" style="height:0.151392em;"><span style="top:-2.5500000000000003em;margin-left:-0.13889em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">u</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mclose">)</span><span class="mspace" style="margin-right:1em;"></span><span class="mord mathnormal">a</span><span class="mord mathnormal">n</span><span class="mord mathnormal">d</span><span class="mspace" style="margin-right:1em;"></span><span class="mord mathnormal" style="margin-right:0.02778em;">r</span><span class="mord mathnormal">e</span><span class="mord mathnormal">s</span><span class="mord mathnormal">i</span><span class="mord mathnormal" style="margin-right:0.04398em;">z</span><span class="mord mathnormal">e</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.13889em;">W</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord mathnormal">s</span><span class="mord mathnormal">i</span><span class="mord mathnormal" style="margin-right:0.04398em;">z</span><span class="mord mathnormal">e</span><span class="mclose">)</span><span class="mspace" style="margin-right:1em;"></span><span class="mord mathnormal">a</span><span class="mord mathnormal">n</span><span class="mord mathnormal">d</span><span class="mspace" style="margin-right:1em;"></span><span class="mord mathnormal">n</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.13889em;">W</span><span 
class="mord">.</span><span class="mord mathnormal" style="margin-right:0.10764em;">f</span><span class="mord mathnormal" style="margin-right:0.02778em;">r</span><span class="mord mathnormal">o</span><span class="mord mathnormal">n</span><span class="mord mathnormal">t</span><span class="mopen">(</span><span class="mclose">)</span></span></span></span></span></p>
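<p>Frankly, this kind of backtracking is rather inefficient. The authors say a parameter T can be passed in and applied at line 14, so that only edges with i&lt;T are considered.</p>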
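<p>The max-heap W used in the k-ANN variant can be sketched with Python's <code>heapq</code>, which is a min-heap, by storing negated distances. The class name <code>BoundedMaxHeap</code> and its methods are assumptions made for this example; the update rule mirrors the modified lines 11-12 (add when W is not full or when the new point beats the current worst, then trim back to <code>size</code>).</p>

```python
import heapq


class BoundedMaxHeap:
    """Keeps the `size` smallest distances seen so far; the worst is on top."""

    def __init__(self, size):
        self.size = size
        self._heap = []  # entries are (-distance, vertex)

    def __len__(self):
        return len(self._heap)

    def worst_distance(self):
        """Distance of the current worst result, i.e. dis(Q, P_n) in the text."""
        return -self._heap[0][0]

    def offer(self, distance, vertex):
        """Try to add a candidate; return True if it was accepted."""
        if len(self._heap) < self.size:
            heapq.heappush(self._heap, (-distance, vertex))
            return True
        if distance < self.worst_distance():
            # Pop the worst entry and push the new one in a single step.
            heapq.heapreplace(self._heap, (-distance, vertex))
            return True
        return False

    def sorted_items(self):
        """Results as (distance, vertex), nearest first."""
        return sorted((-d, v) for d, v in self._heap)


W = BoundedMaxHeap(size=2)
accepted = [W.offer(5.0, "a"), W.offer(3.0, "b"), W.offer(4.0, "c"), W.offer(9.0, "d")]
results = W.sorted_items()
```

<p>With a heap the worst current result is available in constant time, so the comparison against P_n stays cheap as the result set grows.</p>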
<h4 id="高效构建"><a class="anchor" href="#高效构建">#</a> Efficient construction</h4>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424154026998.png" alt="image-20220424154026998" /></p>
<p>I did not fully follow this part, but it does not seem to add much; search-and-select looks like the better option.</p>
<h3 id="实验结果"><a class="anchor" href="#实验结果">#</a> Experimental results</h3>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424154918142.png" alt="image-20220424154918142" /></p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424155016238.png" alt="image-20220424155016238" /></p>
<h3 id="结论"><a class="anchor" href="#结论">#</a> Conclusion</h3>
<p>The ideas in this paper are quite simple. Since it is also the oldest of these papers, I honestly did not get much out of it.</p>
]]></description>
</item>
<item>
<guid isPermalink="true">https://songlinlife.top/2022/%E6%95%B0%E6%8D%AE%E5%BA%93/%E8%AE%BA%E6%96%87%E9%98%85%E8%AF%BB%EF%BC%9ANSG/</guid>
<title>Paper Reading: NSG</title>
<link>https://songlinlife.top/2022/%E6%95%B0%E6%8D%AE%E5%BA%93/%E8%AE%BA%E6%96%87%E9%98%85%E8%AF%BB%EF%BC%9ANSG/</link>
<category term="数据库" scheme="https://songlinlife.top/categories/%E6%95%B0%E6%8D%AE%E5%BA%93/" />
<pubDate>Wed, 20 Apr 2022 19:53:27 +0800</pubDate>
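<description><![CDATA[ <h3 id="作者的动机"><a class="anchor" href="#作者的动机">#</a> The authors' motivation</h3>
<p>The authors have four goals:</p>
<p>(1) Guarantee the connectivity of the graph.</p>
<p>(2) Lower the average out-degree of the graph.</p>
<p>(3) Keep the search path as short as possible.</p>
<p>(4) Reduce the size of the index.</p>
<h3 id="贪婪算法"><a class="anchor" href="#贪婪算法">#</a> Greedy algorithm</h3>
<p>First, the greedy algorithm. Many methods, HNSW among them, use greedy search at query time. So how does it find the K nearest neighbours? It is given a query q and a starting point p (also called the seed).</p>
<p>1) First, put p into the candidate pool S, whose size is fixed at l.</p>
<p>2) Next, take from the pool a node that has not been checked yet (only unchecked nodes are expanded, which cuts off repeated paths).</p>
<p>3) If every node in the pool has already been checked, no node closer to q can be found and the search stops. Note that the filtering happens when S is resized: the neighbours are added to the pool directly, and the nodes farthest from q are then discarded.</p>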
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220420195404015.png" alt="image-20220420195404015" /></p>
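<p>The three steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the NSG reference implementation: distances are Euclidean, the pool is a plain sorted list, and the names <code>search_on_graph</code>, <code>points</code>, <code>edges</code>, <code>l</code> and <code>k</code> are chosen for this example.</p>

```python
import math


def search_on_graph(points, edges, seed, query, l, k):
    """Greedy search with a fixed-size candidate pool of at most l nodes."""
    pool = [(math.dist(points[seed], query), seed)]  # kept sorted by distance
    checked = set()
    while True:
        unchecked = [v for _, v in pool if v not in checked]
        if not unchecked:
            break  # every pooled node was expanded: nothing closer is reachable
        v = unchecked[0]  # closest unchecked node, because the pool is sorted
        checked.add(v)
        in_pool = {u for _, u in pool}
        for nb in edges[v]:
            if nb not in in_pool:
                pool.append((math.dist(points[nb], query), nb))
                in_pool.add(nb)
        # The filtering step: sort, then drop everything beyond the l closest.
        pool.sort()
        del pool[l:]
    return [v for _, v in pool[:k]]


# Toy example: six points on a line, each linked to its direct neighbours.
points = [(float(x), 0.0) for x in range(6)]
edges = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
top2 = search_on_graph(points, edges, seed=0, query=(4.2, 0.0), l=3, k=2)
```

<p>Each loop iteration performs one step and one distance computation per new neighbour, which is the quantity the cost estimate in the next section multiplies together.</p>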
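<h3 id="提高anns的性能"><a class="anchor" href="#提高anns的性能">#</a> Improving ANNS performance</h3>
<p>ANNS performance can be improved along four lines: 1) guarantee the connectivity of the graph; 2) lower the average out-degree; 3) shorten the search path; 4) reduce the size of the index.</p>
<p>The query cost of ANNS is essentially (number of steps) × (number of distance computations per step).</p>
<h3 id="基于图anns算法"><a class="anchor" href="#基于图anns算法">#</a> Graph-based ANNS algorithms</h3>
<h4 id="delaunay-graphs"><a class="anchor" href="#delaunay-graphs">#</a> Delaunay Graphs</h4>
<p>Here is a well-written article on the topic:</p>
<p>基于 Delaunay 图的快速最大内积搜索算法 (a fast maximum inner product search algorithm based on Delaunay graphs), an article by 张雨石 on Zhihu: <span class="exturl" data-url="aHR0cHM6Ly96aHVhbmxhbi56aGlodS5jb20vcC8xMzM1MjY2MzI=">https://zhuanlan.zhihu.com/p/133526632</span></p>
<p>First, define the Voronoi cell:</p>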
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturespicturesv2-eb1ee86662d4437f5aba26475f198883_720w.png" alt="img" /></p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesv2-4e1c91847c5b29272085044d9d4881d2_720w.jpg" alt="img" /></p>
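<p>Every point inside a Voronoi cell is closer to the element that defines the cell than to any other element. This is the defining property of a Voronoi cell.</p>
<p>Connecting the elements of adjacent Voronoi cells produces the Delaunay graph (DG). The DG is a monotonic graph; the following article explains why:</p>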
<p><span class="exturl" data-url="aHR0cHM6Ly93aGVuZXZlcjUyMjUuZ2l0aHViLmlvLzIwMjEvMDMvMTIvcHJveGltaXR5LWdyYXBoLW1vbm90b25pY2l0eS8=">https://whenever5225.github.io/2021/03/12/proximity-graph-monotonicity/</span></p>
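<p>In practice the DG performs poorly, because in high dimensions it easily becomes almost fully connected.</p>
<h3 id="mrng全华班怎么说"><a class="anchor" href="#mrng全华班怎么说">#</a> MRNG (an all-Chinese author team, by the way)</h3>
<p>RNG stands for Relative Neighborhood Graph.</p>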
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220420203621622.png" alt="image-20220420203621622" /></p>
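<p>The RNG is not a monotonic search network (MSNET, Monotonic Search Network), so extra edges have to be added to the RNG; the resulting graph is called a minimal MSNET.</p>
<p>Both FANNG and HNSW adopt the RNG edge-selection strategy.</p>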
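<p>For reference, the standard RNG definition (two points are linked unless some third point is closer to both of them than they are to each other) can be sketched by brute force. The function name <code>relative_neighbourhood_graph</code> and the cubic-time loop are choices made for this illustration; real indexes apply the rule to a short candidate list rather than to all points.</p>

```python
import math
from itertools import combinations


def relative_neighbourhood_graph(points):
    """Return the set of undirected RNG edges as (p, q) index pairs, p before q."""
    n = len(points)
    edges = set()
    for p, q in combinations(range(n), 2):
        d_pq = math.dist(points[p], points[q])
        # The edge is blocked if some r lies in the "lune" of p and q,
        # i.e. r is strictly closer to both endpoints than they are to each other.
        blocked = any(
            max(math.dist(points[p], points[r]), math.dist(points[q], points[r])) < d_pq
            for r in range(n)
            if r != p and r != q
        )
        if not blocked:
            edges.add((p, q))
    return edges


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 1.0)]
rng_edges = relative_neighbourhood_graph(points)
```

<p>On these four points the long edges (0, 2), (1, 3) and (2, 3) are each blocked by a point lying between their endpoints, leaving a sparse graph.</p>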
<h4 id="关于单调搜索图的定义"><a class="anchor" href="#关于单调搜索图的定义">#</a> Definition of a monotonic search graph</h4>
<h5 id="单调路径"><a class="anchor" href="#单调路径">#</a> Monotonic path</h5>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220420203921257.png" alt="image-20220420203921257" /></p>
<p>如果 q 和 p 之间的一条路径,并且这个路径满足 <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>d</mi><mi>i</mi><mi>s</mi><mo stretchy="false">(</mo><msub><mi>v</mi><mi>i</mi></msub><mo separator="true">,</mo><mi>q</mi><mo stretchy="false">)</mo><mo>&gt;</mo><mi>d</mi><mi>i</mi><mi>s</mi><mo stretchy="false">(</mo><msub><mi>v</mi><mrow><mi>i</mi><mo>+</mo><mn>1</mn></mrow></msub><mo separator="true">,</mo><mi>q</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">dis(v_i, q) &gt; dis(v_{i+1} , q)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal">d</span><span class="mord mathnormal">i</span><span class="mord mathnormal">s</span><span class="mopen">(</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em;">v</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.31166399999999994em;"><span style="top:-2.5500000000000003em;margin-left:-0.03588em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord mathnormal" style="margin-right:0.03588em;">q</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord 
mathnormal">d</span><span class="mord mathnormal">i</span><span class="mord mathnormal">s</span><span class="mopen">(</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em;">v</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.311664em;"><span style="top:-2.5500000000000003em;margin-left:-0.03588em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mbin mtight">+</span><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.208331em;"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord mathnormal" style="margin-right:0.03588em;">q</span><span class="mclose">)</span></span></span></span> ,那么说明这是一个单调路径。</p>
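<p>A minimal sketch of the monotonic-path condition above, assuming Euclidean distance and points stored as coordinate tuples. The function name <code>is_monotonic_path</code> is hypothetical and does not come from the paper.</p>

```python
import math


def is_monotonic_path(path, q):
    """Return True if every hop along `path` strictly decreases the distance to q.

    `path` is a list of coordinate tuples v_1, ..., v_k and `q` is the query point.
    """
    return all(math.dist(u, q) > math.dist(v, q) for u, v in zip(path, path[1:]))
```

<p>A path that walks straight toward q passes the check, while a path that steps away from q at any hop fails it.</p>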
<h5 id="单调搜索图"><a class="anchor" href="#单调搜索图">#</a> Monotonic search network</h5>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220420204310079.png" alt="image-20220420204310079" /></p>
<p>A graph G is an MSNET (monotonic search network) if and only if there is a monotonic path between every pair of its nodes.</p>
<h3 id="定理以及证明"><a class="anchor" href="#定理以及证明">#</a> Theorems and proofs</h3>
<h4 id="定理一"><a class="anchor" href="#定理一">#</a> Theorem 1</h4>
<h5 id="定义"><a class="anchor" href="#定义">#</a> Statement</h5>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220420205835027.png" alt="image-20220420205835027" /></p>
<p>When Algorithm 1, the greedy search without backtracking, is run on an MSNET, the path it finds from p to q is guaranteed to be a monotonic path.</p>
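<p>The greedy search without backtracking can be sketched as follows. This is a simplified single-candidate version written for illustration, not the paper's Algorithm 1 verbatim; the adjacency-dict layout and the name <code>greedy_search</code> are assumptions.</p>

```python
import math


def greedy_search(graph, points, start, q):
    """Walk from `start` toward q, always moving to the neighbor closest to q.

    graph:  dict mapping node id -> list of neighbor ids
    points: dict mapping node id -> coordinate tuple
    Stops when no neighbor is strictly closer to q than the current node,
    and returns the list of visited node ids (the search path).
    """
    path = [start]
    current = start
    while True:
        best = min(graph[current], key=lambda v: math.dist(points[v], q), default=None)
        # No neighbor improves on the current node: stop, never step backwards.
        if best is None or not math.dist(points[current], q) > math.dist(points[best], q):
            return path
        current = best
        path.append(current)
```

<p>Because each accepted hop strictly reduces the distance to q, the returned path is monotonic by construction; the theorem adds that on an MSNET this walk does not get stuck before reaching q.</p>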
<h5 id="证明"><a class="anchor" href="#证明">#</a> Proof</h5>
<p>对于 t 次迭代,如果满足这个<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="normal">∀</mi><mi>i</mi><mo>∈</mo><mo stretchy="false">{</mo><mn>1</mn><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><mi>t</mi><mo>−</mo><mn>1</mn><mo stretchy="false">}</mo><mo separator="true">,</mo><mi>δ</mi><mrow><mo fence="true">(</mo><msub><mi>v</mi><mi>i</mi></msub><mo separator="true">,</mo><mi>q</mi><mo fence="true">)</mo></mrow><mo>&gt;</mo><mi>δ</mi><mrow><mo fence="true">(</mo><msub><mi>v</mi><mrow><mi>i</mi><mo>+</mo><mn>1</mn></mrow></msub><mo separator="true">,</mo><mi>q</mi><mo fence="true">)</mo></mrow></mrow><annotation encoding="application/x-tex">\forall i \in\{1, \ldots, t-1\}, \delta\left(v_{i}, q\right)&gt;\delta\left(v_{i+1}, q\right)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.73354em;vertical-align:-0.0391em;"></span><span class="mord">∀</span><span class="mord mathnormal">i</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">∈</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mopen">{</span><span class="mord">1</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord mathnormal">t</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" 
style="height:1em;vertical-align:-0.25em;"></span><span class="mord">1</span><span class="mclose">}</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord mathnormal" style="margin-right:0.03785em;">δ</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="minner"><span class="mopen delimcenter" style="top:0em;">(</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em;">v</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.31166399999999994em;"><span style="top:-2.5500000000000003em;margin-left:-0.03588em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord mathnormal" style="margin-right:0.03588em;">q</span><span class="mclose delimcenter" style="top:0em;">)</span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.03785em;">δ</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="minner"><span class="mopen delimcenter" style="top:0em;">(</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em;">v</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" 
style="height:0.311664em;"><span style="top:-2.5500000000000003em;margin-left:-0.03588em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mbin mtight">+</span><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.208331em;"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord mathnormal" style="margin-right:0.03588em;">q</span><span class="mclose delimcenter" style="top:0em;">)</span></span></span></span></span> 条件,那么说明找到的路径就是一条单调路径。通过数学归纳法进行证明。</p>
<p>1)如果 t=1,因为 MSNET 的定义就是任意两个点都能够找到一个单调路径,所以对于 p 和 q 两个点(v1=p)。存在一个路径 <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>v</mi><mn>1</mn></msub><mo>→</mo><mi>r</mi><mo>→</mo><mi>q</mi></mrow><annotation encoding="application/x-tex">v_1 \rightarrow r \rightarrow q</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.58056em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em;">v</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:-0.03588em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">→</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:0.43056em;vertical-align:0em;"></span><span class="mord mathnormal" style="margin-right:0.02778em;">r</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">→</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.19444em;"></span><span class="mord mathnormal" style="margin-right:0.03588em;">q</span></span></span></span>,这个路径是单调路径有 <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>d</mi><mi>i</mi><mi>s</mi><mo 
stretchy="false">(</mo><msub><mi>v</mi><mn>1</mn></msub><mo separator="true">,</mo><mi>q</mi><mo stretchy="false">)</mo><mo>≥</mo><mi>d</mi><mi>i</mi><mi>s</mi><mo stretchy="false">(</mo><mi>r</mi><mo separator="true">,</mo><mi>q</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">dis(v_1, q) \geq dis(r, q)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal">d</span><span class="mord mathnormal">i</span><span class="mord mathnormal">s</span><span class="mopen">(</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em;">v</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:-0.03588em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord mathnormal" style="margin-right:0.03588em;">q</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">≥</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal">d</span><span class="mord mathnormal">i</span><span class="mord mathnormal">s</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.02778em;">r</span><span class="mpunct">,</span><span class="mspace" 
style="margin-right:0.16666666666666666em;"></span><span class="mord mathnormal" style="margin-right:0.03588em;">q</span><span class="mclose">)</span></span></span></span>。那么 r 一定是 v1 的邻居节点,因为算法是一个非回溯贪心算法,v2 一定是选择最接近 q 的 v1 邻居节点(这是算法的定义)。那么就有 <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>d</mi><mi>i</mi><mi>s</mi><mo stretchy="false">(</mo><msub><mi>v</mi><mn>2</mn></msub><mo separator="true">,</mo><mi>q</mi><mo stretchy="false">)</mo><mo>≤</mo><mi>d</mi><mi>i</mi><mi>s</mi><mo stretchy="false">(</mo><mi>r</mi><mo separator="true">,</mo><mi>q</mi><mo stretchy="false">)</mo><mo>≤</mo><mi>d</mi><mi>i</mi><mi>s</mi><mo stretchy="false">(</mo><mi>v</mi><mn>1</mn><mo separator="true">,</mo><mi>q</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">dis(v_2,q) \leq dis(r,q) \leq dis(v1, q)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal">d</span><span class="mord mathnormal">i</span><span class="mord mathnormal">s</span><span class="mopen">(</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em;">v</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:-0.03588em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord mathnormal" style="margin-right:0.03588em;">q</span><span 
class="mclose">)</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">≤</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal">d</span><span class="mord mathnormal">i</span><span class="mord mathnormal">s</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.02778em;">r</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord mathnormal" style="margin-right:0.03588em;">q</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">≤</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal">d</span><span class="mord mathnormal">i</span><span class="mord mathnormal">s</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.03588em;">v</span><span class="mord">1</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord mathnormal" style="margin-right:0.03588em;">q</span><span class="mclose">)</span></span></span></span> ,即说明了 t=1 时,非回溯贪心算法找到的是单调路径。</p>
<p>2)假设 t = m 时,<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>v</mi><mn>1</mn></msub><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msub><mi>v</mi><mi>m</mi></msub></mrow><annotation encoding="application/x-tex">v_1, \dots , v_m</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.19444em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em;">v</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:-0.03588em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em;">v</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.151392em;"><span style="top:-2.5500000000000003em;margin-left:-0.03588em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">m</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" 
style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span> 是一条单调路径。当 t = m+1 时,我们按照 t=1 进行分析,于是可以得出结论 t=m+1 也是单调路径。</p>
<h4 id="定理二"><a class="anchor" href="#定理二">#</a> Theorem 2</h4>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220421104228235.png" alt="image-20220421104228235" /></p>
<p>定理二指出,对于 MSNET,任意两点之间的路径长度为<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>O</mi><mrow><mo fence="true">(</mo><msup><mi>n</mi><mrow><mn>1</mn><mi mathvariant="normal">/</mi><mi>d</mi></mrow></msup><mi>log</mi><mo></mo><mrow><mo fence="true">(</mo><msup><mi>n</mi><mrow><mn>1</mn><mi mathvariant="normal">/</mi><mi>d</mi></mrow></msup><mo fence="true">)</mo></mrow><mi mathvariant="normal">/</mi><mi mathvariant="normal">Δ</mi><mi>r</mi><mo fence="true">)</mo></mrow></mrow><annotation encoding="application/x-tex">O\left(n^{1 / d} \log \left(n^{1 / d}\right) / \Delta r\right)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.2380099999999998em;vertical-align:-0.35001em;"></span><span class="mord mathnormal" style="margin-right:0.02778em;">O</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="minner"><span class="mopen delimcenter" style="top:0em;"><span class="delimsizing size1">(</span></span><span class="mord"><span class="mord mathnormal">n</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8879999999999999em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">1</span><span class="mord mtight">/</span><span class="mord mathnormal mtight">d</span></span></span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mop">lo<span style="margin-right:0.01389em;">g</span></span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="minner"><span class="mopen delimcenter" style="top:0em;"><span class="delimsizing size1">(</span></span><span class="mord"><span class="mord 
mathnormal">n</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8879999999999999em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">1</span><span class="mord mtight">/</span><span class="mord mathnormal mtight">d</span></span></span></span></span></span></span></span></span><span class="mclose delimcenter" style="top:0em;"><span class="delimsizing size1">)</span></span></span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord">/</span><span class="mord">Δ</span><span class="mord mathnormal" style="margin-right:0.02778em;">r</span><span class="mclose delimcenter" style="top:0em;"><span class="delimsizing size1">)</span></span></span></span></span></span>,d 是样本的维度,<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="normal">Δ</mi><mi>r</mi></mrow><annotation encoding="application/x-tex">\Delta r</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.68333em;vertical-align:0em;"></span><span class="mord">Δ</span><span class="mord mathnormal" style="margin-right:0.02778em;">r</span></span></span></span> 随着样本 n 的增大非常缓慢地减小。</p>
<h3 id="索引构建算法"><a class="anchor" href="#索引构建算法">#</a> Index construction algorithm</h3>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220421114639965.png" alt="image-20220421114639965" /></p>
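<p>This index construction looked impressive when I first read it, but looking at it again, its edge selection is essentially the same as HNSW's.</p>
<p>It also starts from the idea of building an RNG-like graph: a greedy search collects candidate neighbors (here the candidates are not the W of HNSW, but the set E of all nodes visited during the search), and an RNG-style edge selection rule then picks the neighbors.</p>
<p>Finally, a tree-build step guarantees that every node can be reached from the root node, so the graph stays connected when entered from the root. Every query q starts from the root node.</p>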
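<p>The reachability guarantee can be illustrated with a simplified sketch. It assumes each unreachable node is simply linked from the closest node that is already reachable; the paper's own tree-build step chooses the link differently, so treat this only as an illustration of the goal. The name <code>ensure_reachable</code> is hypothetical.</p>

```python
import math


def ensure_reachable(graph, points, root):
    """Add directed edges until every node can be reached from `root`.

    graph:  dict mapping node id -> list of out-neighbor ids (modified in place)
    points: dict mapping node id -> coordinate tuple
    """
    def reach(start, seen):
        # Iterative depth-first traversal that marks everything reachable from `start`.
        stack = [start]
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            stack.extend(graph[u])
        return seen

    seen = reach(root, set())
    for node in graph:
        if node not in seen:
            # Assumption: attach the node to its closest already-reachable node.
            anchor = min(seen, key=lambda v: math.dist(points[v], points[node]))
            graph[anchor].append(node)
            reach(node, seen)
    return graph
```

<p>After the call, a traversal from the root visits every node, which is the property the search relies on.</p>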
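<h3 id="实验部分"><a class="anchor" href="#实验部分">#</a> Experiments</h3>
<p>The authors note that, because not every algorithm supports parallel queries, all tests are run on a single thread.</p>
<p>The baseline algorithms the authors compare against:</p>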
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424162913419.png" alt="image-20220424162913419" /></p>
<p>Index construction time on each dataset:</p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424163507839.png" alt="image-20220424163507839" /></p>
<p>Results:</p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424163439964.png" alt="image-20220424163439964" /></p>
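<h3 id="问题"><a class="anchor" href="#问题">#</a> Open questions</h3>
<p>Why is incremental indexing not enabled for NSG in practice, when it could apparently insert a point by searching for it first?</p>
<h3 id="提供的启发"><a class="anchor" href="#提供的启发">#</a> Takeaways</h3>
<p>There are two ways to reduce search time: lower the average out-degree of the graph, or shorten the search path (the number of steps per query).</p>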
]]></description>
</item>
<item>
<guid isPermalink="true">https://songlinlife.top/2022/%E6%95%B0%E6%8D%AE%E5%BA%93/Efficient-and-robust-approximate-nearest-neighbor-search-using-Hierarchical-Navigable-Small-World-graphs/</guid>
<title>ANNS: HNSW Explained in Detail</title>
<link>https://songlinlife.top/2022/%E6%95%B0%E6%8D%AE%E5%BA%93/Efficient-and-robust-approximate-nearest-neighbor-search-using-Hierarchical-Navigable-Small-World-graphs/</link>
<category term="数据库" scheme="https://songlinlife.top/categories/%E6%95%B0%E6%8D%AE%E5%BA%93/" />
<category term="ANNS" scheme="https://songlinlife.top/tags/ANNS/" />
<pubDate>Tue, 19 Apr 2022 19:25:07 +0800</pubDate>
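<description><![CDATA[ <p>This was the first ANNS paper I read, and it felt like a door opening onto a new world. None of the ANNS papers I read afterwards was as exciting as this one.</p>
<h3 id="论文动机"><a class="anchor" href="#论文动机">#</a> Motivation</h3>
<h4 id="作者希望解决的问题"><a class="anchor" href="#作者希望解决的问题">#</a> The problem the authors want to solve</h4>
<p>K-nearest neighbor search (K-NNS) is in heavy demand in many scenarios. The naive approach scans the whole dataset to find the K nearest neighbors, which is far too slow. Better exact K-NNS algorithms have been proposed since, but because of the <code>curse of dimensionality</code> they remain efficient only on low-dimensional data. K-approximate nearest neighbor search (K-ANNS) was proposed to address this: it only returns approximately nearest neighbors. Its quality is measured by recall, i.e. (number of true nearest neighbors among the retrieved results) / K.</p>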
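<p>The recall measure can be written down in a few lines. This is a sketch assuming both result lists hold node ids and the ground truth is sorted by true distance; <code>recall_at_k</code> is a hypothetical helper name.</p>

```python
def recall_at_k(retrieved, ground_truth, k):
    """Fraction of the true k nearest neighbors that appear in the first k retrieved ids."""
    hits = set(retrieved[:k]).intersection(ground_truth[:k])
    return len(hits) / k
```

<p>For example, if three of the four true nearest neighbors are returned for K = 4, the recall is 0.75.</p>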
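<p>Many existing algorithms take other routes, such as tree-based methods (<a href="https://songlinlife.top/2022/%E6%A0%91%E6%9F%A5%E8%AF%A2%E7%AE%97%E6%B3%95/">tree search algorithms</a>), hashing-based methods, and product quantization (PQ) based methods. ANNS algorithms based on proximity graphs, however, have achieved better performance.</p>
<p>These graph algorithms still have some drawbacks:</p>
<p>1) The number of routing steps scales as a power law with the dataset size.</p>
<p>2) Global connectivity may be lost, so the search stays trapped inside one cluster.</p>
<h4 id="作者提出的解决办法"><a class="anchor" href="#作者提出的解决办法">#</a> The authors' solution</h4>
<p>Building on proximity graphs, the authors proposed NSW (I have not read the NSW paper). The core of NSW is to use greedy search to find the M nearest neighbors of a node p, and then take those M nodes as p's neighbors.</p>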
<p>NSW 的查询复杂度是 <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>l</mi><mi>o</mi><msup><mi>g</mi><mn>2</mn></msup><mo stretchy="false">(</mo><mi>S</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">log^2(S)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.064108em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.01968em;">l</span><span class="mord mathnormal">o</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em;">g</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141079999999999em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.05764em;">S</span><span class="mclose">)</span></span></span></span>,因为平均查询步数为 <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>l</mi><mi>o</mi><mi>g</mi><mo stretchy="false">(</mo><mi>S</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">log(S)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.01968em;">l</span><span class="mord mathnormal">o</span><span class="mord mathnormal" style="margin-right:0.03588em;">g</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.05764em;">S</span><span class="mclose">)</span></span></span></span>,而平均度数也为 <span class="katex"><span 
class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>l</mi><mi>o</mi><mi>g</mi><mo stretchy="false">(</mo><mi>S</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">log(S)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.01968em;">l</span><span class="mord mathnormal">o</span><span class="mord mathnormal" style="margin-right:0.03588em;">g</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.05764em;">S</span><span class="mclose">)</span></span></span></span>。这就导致总的 distance 计算次数为平方。</p>
<p>To lower this query complexity, HNSW introduces a multi-layer graph model.</p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220423215759395.png" alt="image-20220423215759395" /></p>
<h3 id="hnsw算法拆解"><a class="anchor" href="#hnsw算法拆解">#</a> Breaking down the HNSW algorithm</h3>
<p>As the authors put it in their introduction of HNSW, its structure is in effect similar to a skip list:</p>
<blockquote>
<p>The Hierarchical NSW idea is also very similar to a well-known 1D probabilistic skip list structure [27] and can be described using its terms.</p>
</blockquote>
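<h4 id="邻居选择策略"><a class="anchor" href="#邻居选择策略">#</a> Neighbor selection strategy</h4>
<p>HNSW is essentially a DG + RNG style graph.</p>
<p>A DG (Delaunay graph) is the triangulation shown below, in which the circumcircle of each triangle contains no other point. A DG gives ANNS very good search accuracy, but on high-dimensional data it degenerates into an almost fully connected graph; see this article on the <span class="exturl" data-url="aHR0cHM6Ly96aHVhbmxhbi56aGlodS5jb20vcC8xMzM1MjY2MzI=">DG graph</span>:</p>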
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220423222945245.png" alt="image-20220423222945245" /></p>
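<p>An RNG (relative neighborhood graph) requires that the lune shown below contain no other node: if 𝑥 and 𝑦 are connected by edge 𝑒 ∈ 𝐸, then ∀𝑧 ∈ 𝑉 , with 𝛿 (𝑥, 𝑦) &lt; 𝛿 (𝑥, 𝑧), or 𝛿 (𝑥, 𝑦) &lt; 𝛿 (𝑧, 𝑦). I found the RNG hard to understand at first. The lune is the intersection of two circles, so every point inside it is closer to both x and y than x and y are to each other. To see this, draw one circle centered at x and one centered at y, both with radius |xy|; their intersection is the lune, and every node in it is less than |xy| away from both x and y.</p>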
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220423223514215.png" alt="image-20220423223514215" /></p>
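<p>The lune condition can be checked directly. The sketch below assumes Euclidean distance and coordinate tuples; <code>in_lune</code> and <code>rng_edge_allowed</code> are hypothetical helper names.</p>

```python
import math


def in_lune(z, x, y):
    """True if z lies strictly inside both circles of radius |xy| centered at x and y."""
    d = math.dist(x, y)
    return d > math.dist(x, z) and d > math.dist(y, z)


def rng_edge_allowed(x, y, others):
    """An RNG keeps edge (x, y) only if no other point falls inside the lune of x and y."""
    return not any(in_lune(z, x, y) for z in others)
```

<p>A point near the midpoint of x and y blocks the edge, while points outside either circle do not.</p>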
<p>HNSW borrows this RNG idea in the form of heuristic edge selection, but it does not build a true RNG, because constructing an exact RNG naively costs <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>O</mi><mo stretchy="false">(</mo><mi mathvariant="normal">∣</mi><mi>S</mi><msup><mi mathvariant="normal">∣</mi><mn>3</mn></msup><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">O(|S|^3)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.064108em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.02778em;">O</span><span class="mopen">(</span><span class="mord">∣</span><span class="mord mathnormal" style="margin-right:0.05764em;">S</span><span class="mord"><span class="mord">∣</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141079999999999em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">3</span></span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span>.</p>
<p>The definition of the RNG given in the RNG paper:</p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424105155163.png" alt="image-20220424105155163" /></p>
<p>In other words, the lune contains no points.</p>
<p>So, let's first look at the edge selection strategy:</p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220423224657961.png" alt="image-20220423224657961" /></p>
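<p>extendCandidates is an augmentation step that also adds the neighbors of each candidate to the candidate set, because, as KGraph also points out, <em>a neighbor of a neighbor is likely to be a neighbor</em>.</p>
<p>The part to focus on is the core heuristic edge selection:</p>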
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220423225152676.png" alt="image-20220423225152676" /></p>
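<p><code>extract nearest element from W to q</code> means the loop considers candidates in increasing order of distance to q. What makes this RNG-like is the condition that no already-connected neighbor lies in the lune described above, rather than that no point at all lies in it, which takes a moment to digest. This edge selection strategy is in fact the same as NSG's.</p>
<p>The RNG requires that, for x and y to be connected, there is no node whose distances to x and to y are both smaller than the distance between x and y. Because we select edges in increasing order of distance, every previously selected node is already closer to q than the current node e. So if a previously selected node is also closer to e than q is, the RNG discard condition is triggered and e is dropped. HNSW calls this heuristic edge selection and NSG calls it MRNG, which is hard to keep a straight face about.</p>
<p><strong>One look is enough: certified nearest neighbor.</strong></p>
<p>The biggest benefit of this heuristic edge selection is that it inherits an RNG-like property, namely sparsity.</p>
<p>The authors say this lets a node connect to different clusters. That is in fact one of the RNG's characteristics: it stays connected. When nodes form clusters, the region between the boundary nodes of two clusters is likely to be sparse, i.e. the lune contains no other node, so those boundary nodes can be linked.</p>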
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220423232313593.png" alt="image-20220423232313593" /></p>
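<p>The heuristic edge selection described above can be sketched as follows. This is a simplified illustration that leaves out the optional extendCandidates and pruned-connection steps; it assumes Euclidean distance, and the name <code>select_neighbors_heuristic</code> with its argument layout is hypothetical.</p>

```python
import math


def select_neighbors_heuristic(q, candidates, m):
    """Pick at most m neighbors for q from `candidates` (coordinate tuples).

    Candidates are visited from nearest to farthest. A candidate e is kept only
    if no already-selected neighbor is strictly closer to e than q is, which is
    the RNG-style discard rule.
    """
    selected = []
    for e in sorted(candidates, key=lambda c: math.dist(c, q)):
        if len(selected) >= m:
            break
        if all(math.dist(e, r) >= math.dist(e, q) for r in selected):
            selected.append(e)
    return selected
```

<p>With two candidates sitting almost on top of each other, only the nearer one is kept, which leaves room for neighbors in other directions and keeps the graph sparse.</p>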
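<h4 id="layer搜索算法"><a class="anchor" href="#layer搜索算法">#</a> Layer search algorithm</h4>
<p>First comes the per-layer search. Because HNSW is hierarchical, a query actually proceeds from the top layer down, running a search in every layer.</p>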
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220423233156710.png" alt="image-20220423233156710" /></p>
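<p>This is really just greedy search in a different guise. Given a candidate list and a result list, each iteration takes the candidate c closest to q. If c is farther from q than f, the element of the result list farthest from q, a local optimum has been reached and the loop breaks. If there is still room for improvement, the search continues with c's neighbors.</p>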
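<p>The loop just described can be sketched with two heaps. This is an illustrative single-layer version that assumes Euclidean distance, an adjacency dict, and no more entry points than <code>ef</code>; the function name and argument layout are hypothetical.</p>

```python
import heapq
import math


def search_layer(graph, points, q, entry_points, ef):
    """Best-first search in one layer; returns up to ef (distance, node) pairs, nearest first."""
    visited = set(entry_points)
    # Min-heap of (distance, node): candidates that still need to be expanded.
    candidates = [(math.dist(points[p], q), p) for p in entry_points]
    heapq.heapify(candidates)
    # Max-heap through negated distances: the best ef results found so far.
    results = [(-d, p) for d, p in candidates]
    heapq.heapify(results)
    while candidates:
        d_c, c = heapq.heappop(candidates)
        if d_c > -results[0][0]:
            break  # the closest candidate is worse than the worst kept result
        for n in graph[c]:
            if n in visited:
                continue
            visited.add(n)
            d_n = math.dist(points[n], q)
            if ef > len(results) or -results[0][0] > d_n:
                heapq.heappush(candidates, (d_n, n))
                heapq.heappush(results, (-d_n, n))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the farthest result
    return sorted((-d, p) for d, p in results)
```

<p>With <code>ef=1</code> this reduces to plain greedy descent, which matches how the upper layers are searched.</p>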
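<p>Note that the greedy algorithm does not backtrack. The NSG paper proves that MRNG is an MSNET, and that the search path produced by greedy search without backtracking on it is always a monotonic path.</p>
<h4 id="插入算法"><a class="anchor" href="#插入算法">#</a> Insertion algorithm</h4>
<p>Because of HNSW's multi-layer structure, insertion also proceeds from the top layer down.</p>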
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424104002378.png" alt="image-20220424104002378" /></p>
<p>I do not think this algorithm has many pitfalls. The one thing to watch is that, after neighbors are selected, bidirectional edges are added, so a node's degree may end up larger than <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>M</mi><mrow><mi>m</mi><mi>a</mi><mi>x</mi></mrow></msub></mrow><annotation encoding="application/x-tex">M_{max}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.83333em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.10903em;">M</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.151392em;"><span style="top:-2.5500000000000003em;margin-left:-0.10903em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">m</span><span class="mord mathnormal mtight">a</span><span class="mord mathnormal mtight">x</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span>. A shrink operation is therefore needed.</p>
<p>The insertion procedure in summary. 1) Randomly draw a level l, then search from the current top level L of the graph, taking only the single nearest neighbor at each layer, until layer l is reached. 2) Use greedy search to retrieve candidate neighbors, select among them with the heuristic, and add <strong>bidirectional edges</strong> to the selected neighbors. 3) If a neighbor's degree exceeds <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>M</mi><mrow><mi>m</mi><mi>a</mi><mi>x</mi></mrow></msub></mrow><annotation encoding="application/x-tex">M_{max}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.83333em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.10903em;">M</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.151392em;"><span style="top:-2.5500000000000003em;margin-left:-0.10903em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">m</span><span class="mord mathnormal mtight">a</span><span class="mord mathnormal mtight">x</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span>, rerun neighbor selection for that neighbor and connect it to the selected candidates with <strong>one-directional edges</strong>.</p>
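<h4 id="总的查询算法"><a class="anchor" href="#总的查询算法">#</a> Overall search algorithm</h4>
<p>Layers above layer 0 are searched with ef=1; layer 0 is searched with the user-specified ef. It is just a layer-by-layer search, so there is nothing tricky here.</p>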
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424104910397.png" alt="image-20220424104910397" /></p>
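<p>The layered search can be sketched in a few lines of Python. This is my own minimal sketch, not the authors' code: I assume the index is a list <code>layers</code> of adjacency dicts with <code>layers[0]</code> as the base layer, and the function names <code>search_layer</code> and <code>knn_search</code> as well as the toy data are hypothetical.</p>

```python
import heapq
import math


def search_layer(points, layer, q, entry_points, ef):
    """Best-first search in one layer; returns up to ef node ids, nearest first."""
    visited = set(entry_points)
    candidates = [(math.dist(q, points[e]), e) for e in entry_points]  # min-heap
    heapq.heapify(candidates)
    results = [(-d, e) for d, e in candidates]  # max-heap via negated distance
    heapq.heapify(results)
    while candidates:
        d, c = heapq.heappop(candidates)
        if d > -results[0][0]:
            break  # the closest open candidate is worse than the worst kept result
        for n in layer.get(c, []):
            if n in visited:
                continue
            visited.add(n)
            dn = math.dist(q, points[n])
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, n))
                heapq.heappush(results, (-dn, n))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst result
    return [e for _, e in sorted((-nd, e) for nd, e in results)]


def knn_search(points, layers, entry_point, q, k, ef):
    """ef=1 on every layer above 0, the user-specified ef on layer 0."""
    ep = [entry_point]
    for layer in reversed(layers[1:]):
        ep = search_layer(points, layer, q, ep, ef=1)
    return search_layer(points, layers[0], q, ep, ef=ef)[:k]


# Toy index (hypothetical): 8 points on a line, a chain on layer 0, sparse upper layers.
points = {i: (float(i),) for i in range(8)}
layer0 = {i: [j for j in (i - 1, i + 1) if j in points] for i in points}
layers = [layer0, {0: [4], 4: [0]}, {0: []}]
found = knn_search(points, layers, entry_point=0, q=(5.2,), k=2, ef=3)
print(found)
```

<p>On this toy index the upper layers jump from node 0 to node 4 in one hop, and only the layer-0 search keeps more than one candidate alive.</p>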
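<h3 id="问题"><a class="anchor" href="#问题">#</a> Open questions</h3>
<p>If the number of layers is fixed and the random factor is fixed, how does that affect performance?</p>
<p>Why can greedy search be used on a graph built with the heuristic? And how can one prove that the greedy search converges?</p>
<h3 id="构建参数影响"><a class="anchor" href="#构建参数影响">#</a> Effect of the construction parameters</h3>
<p>Building the index involves quite a few parameters, such as <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>M</mi><mrow><mi>m</mi><mi>a</mi><mi>x</mi><mn>0</mn></mrow></msub></mrow><annotation encoding="application/x-tex">M_{max0}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.83333em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.10903em;">M</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:-0.10903em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">m</span><span class="mord mathnormal mtight">a</span><span class="mord mathnormal mtight">x</span><span class="mord mtight">0</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span>, efConstruction, <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>m</mi><mi>L</mi></msub></mrow><annotation encoding="application/x-tex">m_L</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.58056em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathnormal">m</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.32833099999999993em;"><span style="top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">L</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span> and M.</p>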
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>m</mi><mi>L</mi></msub></mrow><annotation encoding="application/x-tex">m_L</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.58056em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathnormal">m</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.32833099999999993em;"><span style="top:-2.5500000000000003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">L</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span> is usually set to <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>1</mn><mi mathvariant="normal">/</mi><mi>ln</mi><mo></mo><mi>M</mi></mrow><annotation encoding="application/x-tex">1/\ln{M}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord">1</span><span class="mord">/</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mop">ln</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.10903em;">M</span></span></span></span></span>.</p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424112057862.png" alt="image-20220424112057862" /></p>
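<p>A quick way to see what m_L = 1/ln M means: assuming the paper's level rule l = floor(-ln(unif(0,1)) * m_L), the probability that a node reaches level 1 or higher is exactly 1/M, so each layer holds roughly 1/M of the nodes of the layer below it. The small Python check below is a sketch; <code>random_level</code>, the seed and the sample size are my own choices.</p>

```python
import math
import random


def random_level(m_l, rng):
    """Draw a level as floor(-ln(u) * m_l) with u uniform in (0, 1]."""
    u = 1.0 - rng.random()  # rng.random() is in [0, 1), so u is never 0
    return int(-math.log(u) * m_l)


M = 16
m_l = 1.0 / math.log(M)
rng = random.Random(0)  # fixed seed, only to make the run repeatable
levels = [random_level(m_l, rng) for _ in range(100_000)]

# Share of nodes that live on layer 1 or above; it should be close to 1/M.
share_above_0 = sum(1 for lvl in levels if lvl >= 1) / len(levels)
print(share_above_0, 1 / M)
```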
<p>The authors also studied <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>M</mi><mrow><mi>m</mi><mi>a</mi><mi>x</mi><mn>0</mn></mrow></msub></mrow><annotation encoding="application/x-tex">M_{max0}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.83333em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.10903em;">M</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:-0.10903em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">m</span><span class="mord mathnormal mtight">a</span><span class="mord mathnormal mtight">x</span><span class="mord mtight">0</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span> and found that setting it to 2M works best.</p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424114751408.png" alt="image-20220424114751408" /></p>
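<p>efConstruction can be configured automatically using sample data.</p>
<p>The only parameter that really matters to the user is M, i.e. the node degree. The authors also ran experiments on the value of M: 12 or 20 is a reasonably good choice, and ann-benchmarks also uses 12 and 20:</p>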
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424115031521.png" alt="image-20220424115031521" /></p>
<h4 id="性能评估"><a class="anchor" href="#性能评估">#</a> Performance evaluation</h4>
<p>The authors evaluate the algorithm's performance from four angles:</p>
<h4 id="和baseline-nsw进行比较"><a class="anchor" href="#和baseline-nsw进行比较">#</a> Comparison with the baseline NSW</h4>
<p>On random data with dimension 4, HNSW beats NSW by a wide margin.</p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424115715624.png" alt="image-20220424115715624" /></p>
<h4 id="欧式空间内进行比较"><a class="anchor" href="#欧式空间内进行比较">#</a> Comparison in Euclidean space</h4>
<p>Algorithms compared:</p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424115905927.png" alt="image-20220424115905927" /></p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424115915647.png" alt="image-20220424115915647" /></p>
<p>Datasets used:</p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424115930890.png" alt="image-20220424115930890" /></p>
<p>Results:</p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424120053747.png" alt="image-20220424120053747" /></p>
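<p>On low-dimensional random data HNSW is only slightly better than annoy; in every other case it wins by a wide margin.</p>
<h4 id="非欧空间"><a class="anchor" href="#非欧空间">#</a> Non-Euclidean spaces</h4>
<p>I don't really understand this part yet; I'll come back to it later.</p>
<h4 id="和pq方法的对比"><a class="anchor" href="#和pq方法的对比">#</a> Comparison with PQ methods</h4>
<p>Parameter settings:</p>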
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424120413221.png" alt="image-20220424120413221" /></p>
<p>Results:</p>
<p><img data-src="https://image-2021-wu.oss-cn-beijing.aliyuncs.com/blogs/picturesimage-20220424120444921.png" alt="image-20220424120444921" /></p>
<p>HNSW again wins by a wide margin, but the biggest drawback of graph-based methods is that they use a lot of memory...</p>
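<p>To get a feel for the memory cost, here is a back-of-the-envelope Python estimate of the link storage per node. It is only a rough sketch built on assumptions that are mine, not the paper's: every link slot is a 4-byte id, every node reserves M_max0 = 2M slots on layer 0 and M slots on each upper layer it belongs to, and with m_L = 1/ln M a node sits on 1/(M-1) upper layers on average. The helper names and the 128-dimensional float32 vector are hypothetical.</p>

```python
def link_bytes_per_node(m, bytes_per_link=4):
    """Rough link storage per node under the assumptions stated above."""
    base_slots = 2 * m                  # M_max0 = 2M slots on layer 0
    upper_slots = m * (1.0 / (m - 1))   # M slots per upper layer, 1/(M-1) upper layers on average
    return bytes_per_link * (base_slots + upper_slots)


def vector_bytes(dim, bytes_per_value=4):
    """Storage for one uncompressed vector (float32 assumed)."""
    return dim * bytes_per_value


# Example: M = 16 and a 128-dimensional float32 vector (both assumed values).
links = link_bytes_per_node(16)
vector = vector_bytes(128)
print(round(links, 1), vector)
```

<p>Under these assumptions the links alone cost a bit over 130 bytes per node on top of the full, uncompressed vector, whereas PQ stores only a short code per vector.</p>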
]]></description>
</item>
</channel>
</rss>