<?xml version="1.0" encoding="US-ASCII"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
which is available here: http://xml.resource.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!-- One method to get references from the online citation libraries.
There has to be one entity for each item to be referenced.
An alternate method (rfc include) is described in the references. -->
<!ENTITY RFC2385 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2385.xml">
<!ENTITY RFC4271 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4271.xml">
<!ENTITY RFC4272 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4272.xml">
<!ENTITY RFC4456 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4456.xml">
<!ENTITY RFC4724 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4724.xml">
<!ENTITY RFC5065 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5065.xml">
<!ENTITY RFC5082 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5082.xml">
<!ENTITY RFC5925 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5925.xml">
<!ENTITY RFC7454 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7454.xml">
<!ENTITY RFC7911 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7911.xml">
<!ENTITY RFC8402 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8402.xml">
]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!-- used by XSLT processors -->
<!-- For a complete list and description of processing instructions (PIs),
please see http://xml.resource.org/authoring/README.html. -->
<!-- Below are generally applicable Processing Instructions (PIs) that most I-Ds might want to use.
(Here they are set differently than their defaults in xml2rfc v1.32) -->
<?rfc strict="yes" ?>
<!-- give errors regarding ID-nits and DTD validation -->
<!-- control the table of contents (ToC) -->
<?rfc toc="yes"?>
<!-- generate a ToC -->
<?rfc tocdepth="4"?>
<!-- the number of levels of subsections in ToC. default: 3 -->
<!-- control references -->
<?rfc symrefs="yes"?>
<!-- use symbolic references tags, i.e, [RFC2119] instead of [1] -->
<?rfc sortrefs="yes" ?>
<!-- sort the reference entries alphabetically -->
<!-- control vertical white space
(using these PIs as follows is recommended by the RFC Editor) -->
<?rfc compact="yes" ?>
<!-- do not start each main section on a new page -->
<?rfc subcompact="no" ?>
<!-- keep one blank line between list items -->
<!-- end of list of popular I-D processing instructions -->
<rfc category="info" docName="draft-szarecki-grow-abstract-nh-scaleout-peering-01" ipr="trust200902">
<!-- category values: std, bcp, info, exp, and historic
ipr values: trust200902, noModificationTrust200902, noDerivativesTrust200902,
or pre5378Trust200902
you can add the attributes updates="NNNN" and obsoletes="NNNN"
they will automatically be output with "(if approved)" -->
<!-- ***** FRONT MATTER ***** -->
<front>
<!-- The abbreviated title is used in the page header - it is only necessary if the
full title is longer than 39 characters -->
<title abbrev="Abstract NH in scale-out peering">Use of Abstract NH in Scale-Out peering architecture</title>
<!-- add 'role="editor"' below for the editors if appropriate -->
<!-- Another author who claims to be an editor -->
<author fullname="Rafal Jan Szarecki" initials="R.J." role="editor"
surname="Szarecki">
<organization>Juniper Networks Inc.</organization>
<address>
<postal>
<street>1133 Innovation Way</street>
<!-- Reorder these if your country does things differently -->
<city>Sunnyvale</city>
<region>CA</region>
<code>94089</code>
<country>US</country>
</postal>
<phone>+1(408)680-9604</phone>
<email>[email protected]</email>
<!-- uri and facsimile elements may also be added -->
</address>
</author>
<author fullname="Kaliraj Vairavakkalai" initials="K."
surname="Vairavakkalai">
<organization>Juniper Networks Inc.</organization>
<address>
<postal>
<street>1133 Innovation Way</street>
<!-- Reorder these if your country does things differently -->
<city>Sunnyvale</city>
<region>CA</region>
<code>94089</code>
<country>US</country>
</postal>
<phone>+1(408)936-8872</phone>
<email>[email protected]</email>
<!-- uri and facsimile elements may also be added -->
</address>
</author>
<author fullname="Natrajan Venkataraman" initials="N."
surname="Venkataraman">
<organization>Juniper Networks Inc.</organization>
<address>
<postal>
<street>1133 Innovation Way</street>
<!-- Reorder these if your country does things differently -->
<city>Sunnyvale</city>
<region>CA</region>
<code>94089</code>
<country>US</country>
</postal>
<phone>+1(408)936-6597</phone>
<email>[email protected]</email>
<!-- uri and facsimile elements may also be added -->
</address>
</author>
<author fullname="Mannan Venkatesan" initials="M."
surname="Venkatesan">
<organization>Comcast</organization>
<address>
<postal>
<street>1800 Bishops Gate Blvd</street>
<!-- Reorder these if your country does things differently -->
<city>Mount Laurel</city>
<region>NJ</region>
<code>08054</code>
<country>US</country>
</postal>
<phone>+1(856)792-2467</phone>
<email>[email protected]</email>
<!-- uri and facsimile elements may also be added -->
</address>
</author>
<date year="2019" />
<!-- If the month and year are both specified and are the current ones, xml2rfc will fill
in the current day for you. If only the current year is specified, xml2rfc will fill
in the current day and month for you. If the year is not the current one, it is
necessary to specify at least a month (xml2rfc assumes day="1" if not specified for the
purpose of calculating the expiry date). With drafts it is normally sufficient to
specify just the year. -->
<!-- Meta-data Declarations -->
<area>General</area>
<workgroup>Internet Engineering Task Force</workgroup>
<!-- WG name at the upperleft corner of the doc,
IETF is fine for individual submissions.
If this element is not present, the default is "Network Working Group",
which is used by the RFC Editor as a nod to the history of the IETF. -->
<keyword>template</keyword>
<!-- Keywords will be incorporated into HTML output
files in a meta tag but they have no effect on text or nroff
output. If you submit your draft to the RFC Editor, the
keywords will be used for the search engine. -->
<abstract>
<t>
Many large-scale service provider networks use some form of
scale-out architecture at peering sites. In such an architecture,
each participating Autonomous System (AS) deploys multiple
independent Autonomous System Border Routers (ASBRs) for peering,
and Equal Cost Multi-Path (ECMP) load balancing is used between
them. There are numerous benefits to this architecture, including
but not limited to N+1 redundancy and the ability to flexibly
increase capacity as needed. A cost of this architecture is an
increase in the amount of state in both the control and data planes.
This has negative consequences for network convergence time and
scale.
</t>
<t>
In this document we describe how to mitigate these negative
consequences through configuration of the routing protocols, both
BGP and IGP, to utilize what we term the "Abstract Next-Hop" (ANH).
Use of ANH allows us to both reduce the number of BGP paths in the
control plane and enable rapid path invalidation (hence, network
convergence and traffic restoration). We require no new protocol
features to achieve these benefits.
</t>
</abstract>
</front>
<middle>
<section title="Introduction">
<t>
Common to all large Internet networks are the requirements for
large aggregate bandwidth and low latency. As network sizes
and traffic volumes have increased, it has become common to use
scale-out architectures to satisfy these requirements. Use of
these techniques within individual networks is well-known. Here,
we explore a scale-out architecture for interconnecting different
Autonomous Systems (ASes).
</t>
<t>
Below, we show an example topology. Content is hosted within AS 2,
and consumers connect via the various ISP Metro ASes.
<figure align="center" anchor="xml_happy">
<artwork align="left">
<![CDATA[
+---------------+ +----------------+ +---------------+
| | | +-------+ |
| +------+ +-------+ AS 30 |
| +------+ | | ISP Metro |
| +------+ | /----+ |
| | | | //----+ |
| AS 2 | | AS 1 |// +---------------+
| Content | | ISP BackBone X/
| provider +------+ X\
| +------+ |\\ +---------------+
| | | | \\----+ |
| | | | \----+ AS 31 |
| +------+ | | ISP Metro |
| +------+ +-------+ |
| +------+ +-------+ |
+---------------+ +----------------+ +---------------+
]]>
</artwork>
</figure>
ASes 1 and 2 are connected at multiple, geographically diverse, sites.
Geographic diversity is required for reasons including resiliency,
minimization of latency, and minimization of cost associated with
long-distance data transmission.
</t>
<section title="Scale-Out peering">
<t>
The same trends that have driven the use of
scale-out architectures within ASes drive interest in using them
at peering sites. In such an architecture, each AS at the
peering site deploys multiple independent
Autonomous System Border Routers (ASBRs). Benefits that can be
realized include N+1 redundancy and the ability to flexibly
increase capacity as needed. The ASBRs are often connected to
the rest of their AS in a leaf-spine topology through core routers,
and augmented with a per-site pair of BGP route reflectors (RRs).
See for example SITE1 in <xref target="ref_diagram"/>, below.
</t>
<t>The fundamental requirements in this architecture are:
<list style="letters">
<t>Keep traffic on a path that has low latency.</t>
<t>Utilize all peering links that offer low latency.</t>
<t>In the event of failure, minimize the time needed to
restore service.</t>
</list>
</t>
<section title="Low latency">
<t>
BGP, the Border Gateway Protocol, does not directly carry delay
information. We make the general assumption in this document
that paths selected by the BGP best path algorithm <xref
target="RFC4271"/> will provide lower latency than those not
selected. This assumption is not guaranteed to be true, but
lacking special arrangements between peering ASes, it is what
the protocol is able to provide.
</t>
</section>
<section title="Utilization of all equal-cost paths">
<t>
In order to use all links between peering ASes that provide the
same BGP path cost to the destination prefix, at a minimum
BGP speakers need to be enabled for multipath operation.
Additionally, all AS ingress BGP speakers need to learn at
least all of the equal-cost best paths to the destination via multiple
ASBRs. If a full IBGP mesh is used, this happens naturally.
However, IBGP full meshes are uncommon in large networks and
are even more impractical in scale-out architectures due to the
high total number of ASBRs.
</t>
<t>
The well-known techniques to deal with full-mesh scale
challenges - Route Reflection <xref target="RFC4456"/> and
Confederations <xref target="RFC5065"/> - hide redundant paths,
as they advertise only a single selected path to their clients.
While this helps keep path and session scale manageable, it
makes BGP multipath unusable. We overcome this by using BGP
ADD-PATH <xref target="RFC7911"/> between the RR and its
clients (or among sub-ASes).
</t>
</section>
<section title="Summary">
<t>In summary, for a scale-out peering architecture:
<list style="symbols">
<t>BGP multipath needs to be enabled on all IBGP sessions inside the AS.</t>
<t>BGP multipath needs to be enabled on all EBGP sessions of each ASBR.</t>
<t>BGP ADD-PATH needs to be enabled on all IBGP sessions.
<list style="symbols">
<t>RRs need to be able to send multiple paths per prefix. The upper limit
depends on:
<list style="symbols">
<t>The maximum number of ASBRs per site (say N).</t>
<t>Possibly also on the maximum number of EBGP sessions held by a
single ASBR with a single peer AS (say M), depending on the
BGP next-hop attribute (BGP-NH) configuration.</t>
</list>
</t>
<t>
RR clients/ASBRs may need to be able to send multiple paths per
prefix if the BGP-NH configuration is "next hop unchanged". The
upper limit depends on the maximum number of EBGP sessions held
by a single ASBR with a single peer AS (say M).
</t>
</list>
</t>
</list>
</t>
<t>The following network diagram is used for reference in the
rest of this document:
<figure align="center" anchor="ref_diagram">
<artwork align="left"><![CDATA[
+------------------------------------------------------------------+
| AS 1 +--------------------+|
| +----------------------------------+ |+------+ SITE3 o--o ||
| | SITE1 +-------- Cost 10 -+------+|CR_3.1|--+ o-|RR| ||
| | o------o | | |+------+ | |Ro--o ||
| | O-|RR_1.1| | | |+------+ | o--o ||
| | |Ro------O | +--- Cost 10 --+|CR_3.K| |+-------+||
| | O------O +------+ | | |+---+--+ ||BR_3.N"|||
| | |CR_1.1|-------+- Cost 10 -+ | | |+-------+||
| | +------+ | | | +----+-----+---------+|
| | / / \ +------+ | | Cost 15 Cost 15 |
| | / / \ |CR_1.K|--Cost+ | +----+-----+---------+|
| | / | \ +------+ 10 | | |+---+--+ | SITE2 ||
| | / | \ / | \ | | +--+|CR_2.K| | o--o ||
| | / | \--X-\ / \ | | |+------+ | |RR|-o ||
| | / /--+--------/ X | | | |+------+ | o--oR| ||
| | / / | /-------/ \ \ | +----+|CR_2.1|<-+ o--o ||
| | / / | / \ \ | |+------+ ||
| | +------+ +------+ +------+ | |+------+ +-------+||
| | |BR_1.1| |BR_1.2|- - -|BR_1.N| | ||BR_2.1| |BR_2.N'|||
| | +X----X+ +-X---X+ +-X---X+ | |+-+--+-+ +-+---+-+||
| +---X----X----X---X--------X-----X-+ +--+--+-------+---+--+|
+-------X----X----X---X-------+------X----------+--+-------+---+---+
\ \ | \ | \----\ | | | |
BR_1.1 \ \ | \-----+----------\ \ | | \ |
^ \-\ \-+-----------+-------\ \ \ \ \ \ \
X BR_1.2 \ | | \ \ \ \ \ \ \
X ^ \ | / \ \ | \ \ | |
X X BR_1.N \ \ /------/ \ | | \ \ | |
X X ^ \ \ / \ | | \ \ | |
X X X | | | ^ ^ ^ | | | \ \ | |
X X X | | | | | | | | | \ \ | |
+---------+ +----+-+-+---+-+-+------------+-+-+--------X--X--+--+--+
| | | | | | | | | | | | \ \ | | |
| | | +-+-+-++ ++-+-+-+ +------+ +------+ | |
| | | |PR_2.1| |PR_2.2|- - - |PR_2.M| |PR_2.P+--+ |
| | | +------+ +------+ +------+ +--+---+.T| |
| | | +------+ |
| AS 3 | | AS 2 |
+---------+ +------------------------------------------------------+
|==================================================================|
|CR - Core Router |
|BR - ASBR and/or Customer Edge in AS1 |
|PR - ASBR in peering ASes |
|==================================================================|
]]></artwork>
</figure>
</t>
</section>
</section>
<section anchor="common_configs" title="Common BGP Deployment Configurations">
<section title="IBGP with Next-Hop Unchanged">
<t>
In one standard BGP configuration, an ASBR does not modify the
BGP-NH when it advertises an externally learned prefix into IBGP.
The BGP-NH therefore remains the IP address of an interface on the
external peering router. The strength of this technique is the
shorter time needed to restore connectivity, with all equal-cost
multipath (ECMP) paths in use and traffic kept on low-latency paths.
The drawback is an extremely large BGP Routing Information Base
(RIB), whose size is proportional to the number of inter-AS links.
</t>
<section title="Example">
<t>
Let's assume that in the network of <xref target="ref_diagram"/>,
all PR_2.x routers of AS2 advertise the same set of prefixes on all
sessions to AS1.
</t>
<t>
If BR_1.1-BR_1.N and BR_2.1-BR_2.N' each advertise only one path per
prefix to their respective RRs, then as a result of ADD-PATH
among RRs, BRs and CRs, the BRs and CRs at site 3 will learn
N+N' paths for each prefix learned from AS2. This is sufficient to
equally distribute load among all N ASBRs at site 1 (note the higher
IGP cost between site 2 and site 3).
</t>
<t>
However, when the interfaces over which BR_1.1-BR_1.N learned their
best paths become unavailable (say the interfaces to PR_2.1 in all
cases, as a result of the failure of PR_2.1), the route to the
BGP-NH - that is, the IP address of the PR_2.1 interface - is removed
from the IGP. BGP speakers at other sites (BR_3.x) will react by
temporarily directing traffic to site 2 (BR_2.1-BR_2.N'). This
switchover may happen in sub-second time, in a
prefix-scale-independent manner, thanks to techniques commonly
known as BGP PIC Edge <xref target="I-D.ietf-rtgwg-bgp-pic"/>. As a result, traffic follows a path other than
the lowest-cost path, even though the connection from site 1 to AS2 is not
entirely broken (the links to PR_2.2-PR_2.M remain operational).
<t>
Subsequently, all BR_1.x will update their RRs with a new best path
(say via PR_2.2) for each prefix (for example, 100,000 of them), triggering
global convergence. Such a convergence, for a large number of
prefixes, may take many minutes.
</t>
<t>
In the above example, BRs, RRs, and possibly CRs keep N+N' paths
per prefix (N from site 1, and N' from site 2). With N=N'=4,
this makes 8 paths per prefix.
</t>
<t>
The solution to sub-optimal routing right after the failure would
be to enable each BR to advertise multiple paths to its RRs, and
for them in turn to propagate those paths to all other RRs and hence BRs.
Each BR_1.x at site 1 will then advertise M paths (from
PR_2.1-PR_2.M), and RR_1.x will have N*M ECMP best paths and advertise
them to other sites (site 3). As a result, BGP speakers at other
sites (BR_3.x at site 3) are provided with N*M paths per prefix from
site 1 and N'*M' from site 2. Therefore, to achieve optimal routing
immediately after failure, a considerably higher scale of BGP paths
needs to be handled. If M=N=N'=M'=4, then for each prefix we have 16
best paths and 16 non-best, a total of 32. If AS2 advertises 100,000
prefixes, this becomes 3.2M paths.
</t>
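<t>
The path counts in this example can be reproduced with a short
calculation. The following Python sketch is illustrative only and is
not part of any implementation. The function name paths_per_prefix and
its parameters are hypothetical, and the sketch assumes that every ASBR
at a given site holds the same number of EBGP sessions with the peer AS.
</t>
<figure>
<artwork><![CDATA[
```python
def paths_per_prefix(asbrs_per_site, sessions_per_asbr, all_ebgp_paths):
    """Paths per prefix learned at a remote site (next-hop unchanged).

    asbrs_per_site:    ASBR count at each peering site, e.g. [N, N'].
    sessions_per_asbr: EBGP sessions per ASBR at each site, e.g. [M, M'].
    all_ebgp_paths:    True if each ASBR advertises every EBGP path to
                       its RRs; False if it advertises one best path.
    """
    if all_ebgp_paths:
        # Each site contributes N*M paths.
        pairs = zip(asbrs_per_site, sessions_per_asbr)
        return sum(n * m for n, m in pairs)
    # Each ASBR contributes exactly one path.
    return sum(asbrs_per_site)


# N = N' = M = M' = 4, as in the example text.
single_path = paths_per_prefix([4, 4], [4, 4], all_ebgp_paths=False)
multi_path = paths_per_prefix([4, 4], [4, 4], all_ebgp_paths=True)

# Total BGP paths if the peer AS advertises 100,000 prefixes.
total_paths = multi_path * 100_000
```
]]></artwork>
</figure>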
<t>
Although this solution provides a means of fast,
prefix-scale-independent traffic switchover, it does so only if an
ASBR external interface goes down, which triggers an IGP event. If
an EBGP session fails but the underlying interface remains up
(misconfiguration, software defect, etc.), recovery still requires
per-prefix withdrawal/update, which could take many minutes at high
scale.
</t>
</section>
</section>
<section anchor="next_hop_self" title="IBGP with Next-Hop-Self">
<t>
The other common technique is to modify the BGP-NH to "self" (a local
IP address, typically a loopback) when the BR advertises an
externally learned path into IBGP. This technique reduces
the number of paths per prefix while keeping optimal
forwarding - least cost and ECMP - in the failure case discussed
above (e.g. PR_2.1 node failure). In fact, because the BGP-NH IP
addresses seen by other BGP speakers do not change in response
to external failure events, and remain resolvable by the IGP, there is no
need to reprogram the Forwarding Information Base (FIB) at all.
Unfortunately, other failures, such as loss of all connectivity between a
single BR (say BR_1.1) and a peer AS (all PRs in AS2), would not be
handled quickly. As the BGP-NH advertised by BR_1.1 is not
changed and is reachable by the IGP, BGP speakers in AS1 (BRs, CRs)
will keep BR_1.1 as a feasible exit point until they receive BGP
withdrawals on a prefix-by-prefix basis. This is a global
convergence process that at high scale can take minutes, during
which time packets may be discarded or may loop.
</t>
</section>
</section>
</section>
<section title="The BGP Abstract Next-Hop">
<t>
The Abstract Next Hop (ANH) concept presented below does not require
any changes to the BGP protocol itself. It is an architectural approach to
network configuration that uses existing protocol capabilities
to achieve higher scale and faster routing convergence where
scale-out peering sites exist.
</t>
<t>
When a BGP speaker advertises a path to its IBGP peer, it sets
the BGP next-hop attribute (BGP-NH) to the ANH value. The ANH is simply an IP
address that identifies a BGP session or a set of BGP sessions.
The set of BGP sessions is defined by the operator in local
configuration, according to network design needs. For example, an
ANH might identify:
<list style="symbols">
<t>a set of BGP sessions with the same peer AS and handled by a given
single ASBR</t>
<t>a set of BGP sessions with the same peer AS and handled by one or
more ASBRs at a given site</t>
<t>a set of BGP sessions with any upstream provider AS</t>
<t>a set of BGP sessions with a given peer device and handled by one
or more ASBRs of the local AS</t>
</list>
A host route to the ANH is installed in the relevant RIB and
redistributed into the IGP. BGP maintains the ANH host route based
on the state of the associated group of BGP sessions:
<list style="symbols">
<t>As soon as all BGP sessions in the set go down, the ANH route
is removed.</t>
<t>When at least one BGP session in the set comes up, the ANH
route is created only after initial route convergence is
complete for the peer (End-of-RIB (EoR) <xref target="RFC4724"/> is
received).</t>
</list>
Taken together, these procedures ensure that as soon as the final
session in the set goes down, ingress routers will see the
associated ANH withdrawn from the IGP. Since the ANH is used to
resolve the associated BGP next hops, the ingress routers are
triggered to converge and send traffic via their alternate (new best)
route. The procedures also ensure that as soon as one session in the
set comes up and is synchronized (that is, the EoR is received), ingress
routers will see the ANH advertised in the IGP and will be able to
reconverge to use routes that are associated with that next hop.
</t>
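<t>
The ANH host route rules above can be modeled as a small state tracker.
The following Python sketch is illustrative only: the class AnhTracker,
its method names, and the address 192.0.2.1 are hypothetical, and no
particular implementation is implied. The sketch assumes one reading of
the rules, namely that the host route is present exactly when at least
one session in the set is established and has delivered its EoR.
</t>
<figure>
<artwork><![CDATA[
```python
from dataclasses import dataclass, field


@dataclass
class AnhTracker:
    """Tracks one ANH and the set of BGP sessions it represents."""
    address: str
    # Session name -> True once End-of-RIB has been received.
    synced: dict = field(default_factory=dict)

    def session_established(self, name):
        # Session is up, but initial convergence is not yet complete.
        self.synced[name] = False

    def eor_received(self, name):
        # Only meaningful for a session that is currently established.
        if name in self.synced:
            self.synced[name] = True

    def session_down(self, name):
        self.synced.pop(name, None)

    def host_route_present(self):
        # Present while at least one session is up and synchronized.
        return any(self.synced.values())


anh = AnhTracker("192.0.2.1")  # hypothetical documentation address
states = []
anh.session_established("PR_2.1")
states.append(anh.host_route_present())  # up, EoR not yet received
anh.eor_received("PR_2.1")
states.append(anh.host_route_present())  # synchronized
anh.session_established("PR_2.2")
anh.eor_received("PR_2.2")
anh.session_down("PR_2.1")
states.append(anh.host_route_present())  # one session remains
anh.session_down("PR_2.2")
states.append(anh.host_route_present())  # last session is gone
```
]]></artwork>
</figure>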
<t>
The ANH can be any IP address that the router is eligible to
advertise according to the local network's IP address management
scheme. More details are given in <xref target="nh_assign"/>.
</t>
</section>
<section title="Use of Abstract Next-Hop in scale-out peering design">
<t>
In traditional configurations as described in <xref target="common_configs"/>
the meaning of the BGP-NH is either:
<list style="symbols">
<t>An egress interface in the case of next-hop-unchanged configuration, or</t>
<t>An egress ASBR in the case of next-hop-self configuration.</t>
</list>
The meaning of the Abstract Next Hop is more context-dependent. This document
describes network configurations in which the BGP-NH identifies:
<list style="letters">
<t>An (egress ASBR, peer AS) pair. The ANH should be advertised into
the IGP if, and only if, the given egress ASBR has at least one
EBGP session in the ESTABLISHED state with the given peer AS, and the
EoR marker has been received on that session. We call this the
ASBR-Peer AS Abstract Next Hop (AP-ANH).</t>
<t>An (egress site in local AS, peer AS) pair, where a "site" may include
multiple ASBRs. The ANH should be advertised into the IGP if, and only
if, at least one ASBR of the given site has at least one EBGP session
in the ESTABLISHED state with the given peer AS, and the EoR marker has been
received on this session. We call this the Site-Peer AS Abstract Next Hop
(SP-ANH).</t>
</list>
Note that reachability of the ANH address in the IGP depends on EBGP
session state and not on inter-AS interface state, although of course
interface state may impact session state. How the IP route to the ANH
address is instantiated on an ASBR and inserted into the IGP on a
particular device is a matter of local implementation.
</t>
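<t>
The two advertisement conditions above can be expressed over a list of
EBGP sessions. The following Python sketch is illustrative only: the
Session record, its field names, and the function advertised_anhs are
hypothetical. It returns the (ASBR, peer AS) and (site, peer AS) pairs
whose ANH would be advertised into the IGP, and it says nothing about
which device performs the injection.
</t>
<figure>
<artwork><![CDATA[
```python
from collections import namedtuple

# One EBGP session as seen by the local AS (hypothetical fields).
Session = namedtuple(
    "Session", "site asbr peer_as established eor_received")


def advertised_anhs(sessions):
    """Return the AP-ANH and SP-ANH keys that should be in the IGP."""
    ap_anh, sp_anh = set(), set()
    for s in sessions:
        # A session counts only if ESTABLISHED and EoR was received.
        if s.established and s.eor_received:
            ap_anh.add((s.asbr, s.peer_as))
            sp_anh.add((s.site, s.peer_as))
    return ap_anh, sp_anh


sessions = [
    Session("SITE1", "BR_1.1", 2, True, True),
    Session("SITE1", "BR_1.1", 3, True, False),   # EoR not received
    Session("SITE1", "BR_1.2", 2, False, False),  # session is down
    Session("SITE2", "BR_2.1", 2, True, True),
]
ap_anh, sp_anh = advertised_anhs(sessions)
```
]]></artwork>
</figure>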
<section title="Egress ASBR-Peer AS Abstract Next Hop (AP-ANH)">
<t>
The AP-ANH is unique to an ASBR and its peer AS. For example, in
the network of <xref target="ref_diagram"/>, BR_1.1 would have two
AP-ANHs assigned - one for its peering with AS2 and the other for AS3.
Similarly, BR_1.2 would have two AP-ANHs, one per peer AS, with
values different from the AP-ANHs of BR_1.1, and so on. All AP-ANHs
are exported into the IGP by their ASBRs. Each ASBR advertises only
one path per prefix to its RR, with the BGP-NH set to the
appropriate AP-ANH. The RR will propagate it through the entire AS
by means of IBGP ADD-PATH. In consequence, the number of paths
learned per prefix is equal to the number of ASBRs servicing a given
peer AS. In the network of <xref target="ref_diagram"/>, for AS2
prefixes, this would be N+N' paths per prefix (N from site 1 plus N'
from site 2). This puts the scale requirements of this solution on par
with <xref target="next_hop_self">Next-Hop-Self</xref>. However, thanks
to the properties of the ANH, more failures are covered by
prefix-independent techniques, as withdrawal of the ANH from the
IGP makes the BGP-NH unresolvable.
</t>
<t>
Provided that all ASBRs in a given site (site 1 in <xref
target="ref_diagram"/>) receive the same routing information from
their peer AS (AS2) in non-faulty conditions, one could consider
setting the same ANH value on all ASBRs. However, failures
can create situations in which multiple ASBRs have a session in the
ESTABLISHED state with a given peer AS, but some prefixes are
learned from EBGP on only a subset of these ASBRs. To prevent
problems from arising in this situation, the per-ASBR AP-ANH needs
to be advertised into the IGP, and ASBRs need to set it as the BGP-NH
when advertising routes to the site's Route Reflectors. However, for
IBGP path advertisements propagated beyond the site (into the
RR mesh), the BGP-NH may be replaced by another ANH value, the
Site-Peer AS ANH.
</t>
</section>
<section title="The Site-Peer AS Abstract Next Hop (SP-ANH)">
<t>
The AP-ANH works at the ASBR level. From a given local AS
perspective, the number of ANHs is proportional to the number of
(ASBR, peer AS) pairs. With hundreds of
peer ASes, tens of sites and ~10 ASBRs per site, the number of
AP-ANHs may scale into the thousands. At the same time, it may not
be necessary or even desirable for every BGP speaker in the network
to have visibility of every path down to individual egress ASBR
granularity. With symmetrical multiplane backbone and/or
leaf-spine designs, it is sufficient that BGP speakers at other
sites know that a given site (site 1 in <xref
target="ref_diagram"/>) has at least one ASBR with an ESTABLISHED
session to the peer AS (AS2). For example, in the network of <xref
target="ref_diagram"/>, even if BR_3.1 has only one path with its BGP-NH
equal to the ANH of BR_1.1, BR_3.1 resolves the BGP-NH in the IGP and
spreads traffic among all CRs at site 3. Thus, traffic will be
delivered to CR_1.x at site 1. As long as CR_1.x has visibility of
all paths, traffic will be distributed equally to all site 1 ASBRs.
</t>
<t>
At the same time, when multiple paths are available on BGP
speakers, every change is propagated, with consequent transmission
and processing costs on all BGP speakers across the network. This
is true even if the route change doesn't impact the forwarding
plane. For example, in the network of <xref target="ref_diagram"/>,
even if BR_3.1 has N paths with BGP-NHs set to the ANHs of BR_1.1
through BR_1.N, BR_3.1 will resolve those BGP-NHs in the IGP and spread
traffic among all CRs of site 3. When one of the egress ASBRs (say
BR_1.2) loses its connectivity to the peer AS, the affected BGP routes
(those with BGP-NH equal to the AP-ANH of BR_1.2) are withdrawn from all BGP
speakers (e.g. BR_3.1) in the network. All BGP speakers perform path
selection and possibly update their forwarding data structures. Since
the actual forwarding paths do not change, all this work represents
unnecessary churn.
</t>
<t>
To avoid the above drawbacks, the RR of a given site (site 1 in
<xref target="ref_diagram"/>), when re-advertising a BGP path
learned from its ASBR client, modifies the BGP-NH to another
abstract value - the Site-Peer AS Abstract NH (SP-ANH). This value
is unique per (site, peer AS) pair, and is shared by all RRs of a
given site. With this modification, it is sufficient that
inter-site IBGP sessions carry only one path per prefix (no
ADD-PATH needed). Consequently, BGP RIB scale is reduced
significantly. This frees up memory, reduces the amount of data RRs
need to exchange, and mitigates churn. The BGP speakers at other
sites of AS 1 need to resolve the SP-ANH in order to build their local
FIBs. Therefore, the SP-ANH has to be present in the IGP: some
router(s) in the local site (RR, ASBR or CR) need to inject it into
the IGP. The choice of which role is responsible for SP-ANH
injection is discussed below. In any case, the SP-ANH should be
reachable in the IGP if, and only if, at least one AP-ANH (for
the same peer AS and an ASBR belonging to the given site) is reachable. <xref
target="figure3"/> illustrates routing information flow in a
network such as that of <xref target="ref_diagram"/>:
<figure align="center" anchor="figure3">
<artwork align="left">
<![CDATA[
+------------------------------------------------
| +----->IBGP to SITE2
| AS 1 | +--->IBGP to SITE3
/=============================\ | |
|a.a.a.a/a |----------------->| | SP-ANH
| as-path "^2 .*" | | | (SITE1&AS2)
| BGP-NH SP-ANH(SITE1&AS2)| | | IP/32 into IGP
\=============================/ | | ^
| | | |
| +-------------------------+-+------------+---+
/==============================\ o------o o-+-+--o |
|ADD-PATH | |RR_1.2| |RR_1.1| SITE1 |
|a.a.a.a/a | o------O o----X-O |
| as-path "^2 .*" | ^ ^ \ |
| BGP-NH AP-ANH(BR_1.1&AS2)| / / \ |
|a.a.a.a/a |--------------X-X---->| |
| as-path "^2 .*" | / | | |
| BGP-NH AP-ANH(BR_1.2&AS2)| / | | |
\==============================/ / | | |
/==============================\ / | \ |
|a.a.a.a/a | | | \ |
| as-path "^2 .*" |--------->/ | v |
| BGP-NH AP-ANH(BR_1.1&AS2)| / | +------+ |
\==============================/ / | |CR_1.1+--+ |
/==============================\ / / +--+---+.1+-+ |
|a.a.a.a/a |------X------->/ +-+----+X| |
| as-path "^2 .*" | / / +------+ |
| BGP-NH AP-ANH(BR_1.2&AS2)| +------+ +------+ +------+ |
\==============================/ |BR_1.1| |BR_1.2|- - -|BR_1.N| |
| | +------+ +------+ +------+ |
| | ^ ^ |
| | \ \ |
| +-------------X--X---------------------------+
/======================\--------------X--X---------------------------
|a.a.a.a/a | \ \
| as-path "^2 .*" |--------------->\ \---------\
\======================/ \ \
/======================\ \ \
|a.a.a.a/a |-------------------X----------->\
| as-path "^2 .*" |----------------+ +-X------------X-----------
\======================/ | | +X-----+ +--X---+ +
| AS 3 | | |PR_2.1| |PR_2.2|- - -|
| | | +------+ +------+ +
| | | AS 2
+-------------------+ +----a.a.a.a/a network-----
]]>
</artwork>
</figure>
</t>
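<t>
The next-hop rewrite described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the
Path class, the sp_anh() naming scheme and the function names are
hypothetical, a real SP-ANH is an IP address, and the RR's best-path
selection is reduced to "keep the first path seen per prefix".
</t>
<figure>
<artwork align="left">
<![CDATA[
```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Path:
    prefix: str
    next_hop: str  # BGP-NH carried in the UPDATE


def sp_anh(site: str, peer_as: int) -> str:
    # Hypothetical label; a real deployment would use an IP address.
    return f"SP-ANH({site}&AS{peer_as})"


def readvertise_inter_site(paths, site, peer_as):
    """Rewrite the BGP-NH to the SP-ANH and keep one path per prefix."""
    best = {}
    for p in paths:
        # After the rewrite all paths for a prefix share one next hop,
        # so a single path per prefix is enough between sites.
        best.setdefault(p.prefix, replace(p, next_hop=sp_anh(site, peer_as)))
    return list(best.values())


def sp_anh_reachable(ap_anh_reachable):
    """The SP-ANH is in the IGP if and only if at least one AP-ANH is."""
    return any(ap_anh_reachable.values())


# Two ASBRs of site 1 advertise the same prefix learned from AS2.
intra_site = [
    Path("a.a.a.a/a", "AP-ANH(BR_1.1&AS2)"),
    Path("a.a.a.a/a", "AP-ANH(BR_1.2&AS2)"),
]
inter_site = readvertise_inter_site(intra_site, "SITE1", 2)
print(inter_site)
```
]]>
</artwork>
</figure>
<t>
In this sketch the two intra-site ADD-PATH paths collapse into a single
inter-site path whose BGP-NH is the SP-ANH, which is why the inter-site
sessions do not need ADD-PATH.
</t>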
</section>
<section anchor="nh_assign" title="Assignment of Abstract Next Hops">
<t>
In the following subsections we provide more details of how
abstract next hops can be injected in several different common
network architectures.
</t>
<section title="Native IP Networks">
<t>
In this type of network, every router, including core routers, has full BGP routing
information and forwards each packet based on destination IP
lookup. Provided that all routers at an egress site receive
multiple paths with BGP-NH set to AP-ANH (and not SP-ANH), it is
up to the operator to decide which node - RR, ASBR or CR -
will inject the SP-ANH route into the IGP. One may argue that
injection of SP-ANH by ASBRs may be simpler, as it will be done by
the same procedure and policy as injection of AP-ANH. Others may
prefer injection at RR, as it limits the number of configuration
touch-points.
</t>
</section>
<section title="MPLS">
<section title="Identical BGP address space and paths received on all ASBRs">
<t>
In an MPLS network, since traffic is carried over LSP tunnels, the
SP-ANH needs to be injected into the IGP by a node that has the
ability to perform an IP lookup. This eliminates the RR, and
possibly CRs (in "BGP-free core" architectures). Instead, all ASBRs
are used to insert SP-ANH addresses into the IGP. In the case of
LDP-based networks, this is sufficient. The CR will create an ECMP
forwarding structure for labels of SP-ANH FEC coming from other
sites. In RSVP-TE based networks, ECMP needs to happen on the
ingress LSR and therefore, every BGP speaker needs to establish an
LSP to every ASBR, and the SP-ANH address needs to be part of the
FEC for its respective LSP. If SP-ANH is used as an RSVP
(signaling) destination, some other means (such as affinity
groups) needs to be used to ensure the desired 1:1 LSP to egress
ASBR mapping.
</t>
</section>
<section title="Different address space sets or paths received on different ASBRs">
<t>
In the case when the set of prefixes received from a given peer AS
by one ASBR is different from the set received by another one, a
combination of SP-ANH and MPLS-based load balancing on a CR may
lead to a situation where an IP packet will be directed to an ASBR
that lacks external routing information and hence can't forward
traffic directly out of the AS. Similarly, if path attributes for a
given prefix received by one ASBR are different from those received
by another, again packets can be directed to the "wrong" ASBR. In
this case the ASBR would use the IBGP route it learned from another
ASBR of the same site (via RR, with AP-ANH) and forward traffic
over an LSP to the "correct" ASBR. This extra hop constitutes a
sub-optimal traffic path through the network.
</t>
<t>
For example, in the network of <xref target="ref_diagram"/>, let's
assume that prefix P2 is advertised by AS2 to BR1.2-BR1.N but not to
BR1.1. BR3.1 has a BGP best route to P2 with its BGP-NH set to the
SP-ANH of (site 1, AS2). It resolves it by ECMP over N MPLS LSPs,
terminating on BR1.1-BR1.N. So, some packets are forwarded by BR3.1
over an LSP that runs via CR1.x and terminates on BR1.1. BR1.1 has no
external route to P2, but it has (N-1) IBGP routes to P2 with BGP-NHs
equal to the AP-ANHs of BR1.2-BR1.N. Therefore BR1.1 performs an IP
lookup and forwards these packets over LSPs that run via CR1.x and
terminate on BR1.2-BR1.N. Traffic is U-turned on BR1.1 and traverses
the CRs at site 1 twice.
</t>
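<t>
The share of traffic that takes this extra hop can be estimated with
simple arithmetic. The sketch below assumes equal-weight ECMP with one
LSP per ASBR and perfectly even hashing; the function name and the
choice of N = 4 are illustrative only.
</t>
<figure>
<artwork align="left">
<![CDATA[
```python
def uturn_fraction(n_asbrs: int, n_with_route: int) -> float:
    """Share of traffic landing on an ASBR that lacks the external route.

    Assumes equal-weight ECMP over one LSP per ASBR of the site.
    """
    if not 0 < n_with_route <= n_asbrs:
        raise ValueError("need 0 < n_with_route <= n_asbrs")
    return (n_asbrs - n_with_route) / n_asbrs


# N = 4 ASBRs at site 1; P2 is received by BR1.2-BR1.4 only.
print(uturn_fraction(4, 3))
```
]]>
</artwork>
</figure>
<t>
Under these assumptions, one LSP in N ends on BR1.1, so roughly 1/N of
the traffic towards P2 is U-turned.
</t>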
<t>
Such asymmetry may be considered acceptable by the provider, as
long as it's a transient condition. However, in the general case
such a situation could be persistent, as the result of intentional
configuration on the peer AS's ASBRs. Therefore the better solution
would be to insert the SP-ANH into the IGP on CRs. In this case,
CRs need to perform forwarding based on destination IP lookup.
Therefore CRs would have to be able to learn and handle large IP
routing and forwarding tables - at least all prefixes learned from
peer ASes by the local ASBRs.
</t>
</section>
</section>
<section title="SPRING">
<section title="Identical BGP address space and paths received on all ASBRs">
<t>
For SPRING based networks, we can take advantage of the unique
capability of Anycast-SID <xref target="RFC8402"/>. The ASBRs of a
single site allocate an Anycast-SID for each SP-ANH address. This
SID can be used as the only SID by an ingress BGP speaker or, if a
TE routed path is desired, depending on TE constraints, the TE
controller can provision a SPRING path with the Anycast-SID at the
end, instructing the CR to perform load balancing among connected
ASBRs.
</t>
</section>
<section title="Different address space sets or paths received on different ASBRs">
<t>
As in a classic MPLS environment, such a situation may lead
to suboptimal routing (redirecting from one ASBR to another), or
may require the CR (instead of the ASBR) to insert the SP-ANH into the
IGP and generate a Prefix-SID (or an Anycast-SID if there is more than
one CR) for it.
</t>
</section>
</section>
</section>
<section title="Localization of AP-ANH">
<t>The architecture described above reduces the number of BGP paths exchanged between sites of the local AS by means of the SP-ANH. Paths with the BGP next hop set to an AP-ANH are visible only to routers in the same site as the ASBRs advertising them. However, because routes to AP-ANHs are inserted into the IGP, in the general case they are visible to all nodes in the local AS and contribute to the scale of the IGP's LSDB. Further optimization is possible by limiting the reachability of each AP-ANH to the site where it originates. This can be achieved in multiple ways, for example by running an additional IGP instance internal to each site, or by running IS-IS Level 1 among all nodes of a single site and making the core routers L1/L2 systems.
</t>
<t>The benefit would be a smaller network-wide LSDB, and hence faster IGP convergence and lower resource requirements.
</t>
<t>Additionally, localization of AP-ANHs allows the IP addresses of AP-ANHs to be reused between sites. Although this practice is controversial, it may be beneficial in certain provisioning automation and zero-touch provisioning (ZTP) scenarios.
</t>
</section>
</section>
<section title="Worked Examples">
<t>
Below we illustrate the proposal by working through its operation
in the context of several different types of failures.
Here, we assume that each ASBR in a given site of the local AS (site
1 of AS1 in <xref target="ref_diagram"/>) that has an EBGP session
with the given peer AS (AS2 in <xref target="ref_diagram"/>)
receives from its peer routers (PR2.x) routes to exactly the same
address space on each session.
</t>
<section title="Failure of a proper subset of EBGP sessions with a given peer AS on a single ASBR">
<t><list style="symbols">
<t>The impacted ASBR keeps advertising the AP-ANH into the IGP, as at least one
session to the peer AS remains in the ESTABLISHED state.</t>
<t>The impacted ASBR may send UPDATEs to RRs; however, the BGP-NH remains equal to
the pre-failure AP-ANH.</t>
<t>The RRs may send UPDATEs to their clients (CRs, BRs) and to RRs in other sites;
however, the BGP-NH remains the same as its pre-failure value: AP-ANH and SP-ANH
respectively.</t>
<t>As the BGP-NHs do not change, there are no changes in forwarding data structures (FIB)
on any BGP speaker across the network, except possibly the ASBR that holds the
impacted session.</t>
</list></t>
</section>
<section title="Failure of a proper subset of EBGP sessions with a given peer AS on each ASBR of a given site">
<t><list style="symbols">
<t>The impacted ASBRs keep advertising the AP-ANH into the IGP, as at least one
session to the peer AS remains in the ESTABLISHED state on each ASBR.</t>
<t>The impacted ASBRs may send UPDATEs to RRs; however, the BGP-NH remains equal to
the pre-failure AP-ANH.</t>
<t>The RRs may send UPDATEs to their clients (CRs, BRs) and to RRs in other sites;
however, the BGP-NH remains the same as its pre-failure value: AP-ANH and SP-ANH
respectively.</t>
<t>As the BGP-NHs do not change, there are no changes in forwarding data structures (FIB)
on any BGP speaker across the network, except possibly the ASBRs that hold the
impacted sessions.</t>
</list></t>
</section>
<section title="Failure of all EBGP sessions with a given peer AS on a single ASBR; failure of a single ASBR">
<t><list style="symbols">
<t>The impacted ASBR stops advertising the AP-ANH into the IGP, as it has lost all
sessions with the given peer AS.</t>
<t>The SP-ANH is kept reachable in the IGP.</t>
<t>All other BGP speakers at the impacted site invalidate all paths with BGP-NH equal to
the AP-ANH. This may trigger prefix-independent FIB
data-structure patching/temporary fixing for sub-second traffic restoration.</t>
<t>The impacted ASBR sends WITHDRAWs to its RRs.</t>
<t>
Each RR:
<list style="symbols">
<t>Sends WITHDRAWs to its clients at the local site (CRs, BRs) for paths from
the impacted ASBR. As
these sessions support ADD-PATH, paths from other ASBRs will remain.
Other BGP speakers at this site have to modify their FIBs.</t>
<t>May send UPDATEs to RRs in other sites; however, the BGP-NH remains equal to
the pre-failure SP-ANH. As the BGP-NH does not change, there are no changes in
forwarding data structures (FIB) on any BGP speaker across the network,
except those at the impacted site.</t>
</list>
</t>
<t>Routing churn is in many cases confined to a single peering site, and does
not propagate across the network. FIB changes are limited to a single peering
site, and do not propagate across the network.</t>
</list></t>
</section>
<section title="Failure of all EBGP sessions with a given peer AS on all ASBRs">
<t><list style="symbols">
<t>Each ASBR stops advertising its AP-ANH into the IGP, as it has lost all sessions
with the given peer AS.</t>
<t>The SP-ANH is no longer reachable in the IGP, as none of the AP-ANHs is reachable.</t>
<t>All other BGP speakers across the network invalidate all paths with a BGP-NH
equal to the removed AP-ANH or SP-ANH. This may trigger prefix-independent FIB
data-structure patching/temporary fixing for sub-second traffic
restoration.</t>
<t>Each impacted ASBR sends WITHDRAWs to its RRs.</t>
<t>The RRs send WITHDRAWs to their clients at the local site (CRs, BRs) and RRs
in other sites for paths from the impacted ASBRs. As these sessions support
ADD-PATH, paths from ASBRs at other sites will remain. The BGP speakers across
the network may need to modify their FIBs.</t>
</list></t>
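<t>
The four failure cases of this section can be compared with a small
Python model. It is a sketch under stated assumptions: an ASBR
advertises its AP-ANH while it has at least one ESTABLISHED session
with the peer AS, the SP-ANH is reachable while any AP-ANH of the site
is reachable, and FIB changes local to the ASBR holding a failed
session are ignored. The names igp_state and fib_impact, and the
two-ASBR site, are hypothetical.
</t>
<figure>
<artwork align="left">
<![CDATA[
```python
def igp_state(established):
    """Map ASBR -> number of ESTABLISHED EBGP sessions with the peer AS
    to (set of ASBRs whose AP-ANH is reachable, SP-ANH reachable)."""
    ap = {asbr for asbr, n in established.items() if n > 0}
    return ap, bool(ap)


def fib_impact(before, after):
    """Report where next-hop reachability, and hence the FIB, changes."""
    ap_before, sp_before = igp_state(before)
    ap_after, sp_after = igp_state(after)
    return {
        "local_site": ap_before != ap_after,    # an AP-ANH came or went
        "other_sites": sp_before != sp_after,   # the SP-ANH came or went
    }


healthy = {"BR_1.1": 2, "BR_1.2": 2}
cases = {
    "subset on one ASBR": {"BR_1.1": 1, "BR_1.2": 2},
    "subset on each ASBR": {"BR_1.1": 1, "BR_1.2": 1},
    "all sessions on one ASBR": {"BR_1.1": 0, "BR_1.2": 2},
    "all sessions on all ASBRs": {"BR_1.1": 0, "BR_1.2": 0},
}
for name, after in cases.items():
    print(name, fib_impact(healthy, after))
```
]]>
</artwork>
</figure>
<t>
In this model only the last case changes SP-ANH reachability, so it is
the only one in which BGP speakers at other sites have to repair their
FIBs.
</t>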
</section>
</section>
<section anchor="Acknowledgements" title="Acknowledgements">
<t>
Valuable comments and suggestions on the solution covered by this
document were provided by John Scudder and Ron Bonica.
Special thanks to John Scudder, who also helped with editorial changes.
</t>
</section>
<!-- Possibly a 'Contributors' section ... -->
<section anchor="IANA" title="IANA Considerations">
<t>This memo includes no request to IANA.</t>
</section>
<section anchor="Security" title="Security Considerations">
<t>
Since this is a deployment architecture and not a protocol
modification, it doesn't introduce any new issues to the BGP
protocol itself. General BGP security considerations are discussed
in <xref target="RFC4271"/> and <xref target="RFC4272"/>; BGP
deployment best practices are documented in <xref
target="RFC7454"/>, and nothing in this proposal impedes their use.
Many of the practices recommended in that document are
self-evidently still applicable, for example the use of
cryptographic session protection methods such as <xref
target="RFC2385">TCP MD5</xref> or the <xref target="RFC5925">TCP
Authentication Option</xref>, and the <xref target="RFC5082">
Generalized TTL Security Mechanism</xref>. Since we propose a novel
use of IP addresses to assign ANHs, it's worth considering if
anything new is required to protect them. We conclude there isn't:
they fall into the existing category of "Prefixes Belonging to the
Local AS" discussed in Section 6.1.4 of <xref target="RFC7454"/>.
</t>
</section>
</middle>
<back>
<references title="Informative References">
<?rfc include="reference.I-D.ietf-rtgwg-bgp-pic"?>
&RFC2385;
&RFC4271;
&RFC4272;
&RFC4456;
&RFC4724;
&RFC5065;
&RFC5082;
&RFC5925;
&RFC7454;
&RFC7911;
&RFC8402;
</references>
<!-- Change Log
v00 2018-10-10 RJS Initial version
-->
</back>
</rfc>