Networking: BGP and BGP in DataCenter

iBGP rule:

A router adds its AS number to a route’s AS_PATH only when the route is sent to an EBGP neighbor. The AS number is not added to routes sent to an IBGP neighbor.

Why we need a full mesh in iBGP: Suppose the physical path is RTR1–>RTR2–>RTR3 and the iBGP session runs only between RTR1 and RTR3. The control plane is fine: when RTR1 advertises a prefix to RTR3, it sets itself as the next hop. The data plane is the problem. When RTR3 forwards a packet toward that prefix, it sees RTR1 as the next hop. RTR3 does a recursive lookup and finds that RTR1 is reached via RTR2, so it sends the packet to RTR2. RTR2 is not part of the iBGP session and has no entry for the prefix advertised by RTR1, so it drops the packet.

Now we have the full mesh. How to avoid loops in a full mesh:

Routes learned from an internal neighbor are never sent to another internal neighbor.
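
The two iBGP rules above (AS_PATH is prepended only toward EBGP neighbors, and IBGP-learned routes are never passed to another IBGP neighbor) can be sketched as follows. This is a minimal illustration, not a real BGP implementation; the `Route` class, the `advertise` function, and the AS numbers are hypothetical names chosen for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    prefix: str
    as_path: tuple = ()
    learned_from: str = "local"  # "local", "ebgp", or "ibgp" (hypothetical labels)

def advertise(route, local_as, neighbor_type):
    """Return the route as it would be sent to a neighbor, or None if it must not be sent."""
    if neighbor_type == "ibgp":
        if route.learned_from == "ibgp":
            return None   # never re-advertise an IBGP-learned route to an IBGP peer
        return route      # AS_PATH is left unchanged toward an IBGP neighbor
    # EBGP neighbor: the local AS number is prepended to the AS_PATH
    return Route(route.prefix, (local_as,) + route.as_path, route.learned_from)
```

For example, a route learned over IBGP with AS_PATH `(65002,)` is suppressed toward another IBGP peer, but goes to an EBGP peer with AS_PATH `(65001, 65002)` when the local AS is 65001.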

Path attributes:

Well known and mandatory:

Origin:
- IGP: the NLRI was learned from a protocol internal to the originating AS.
- Incomplete: routes that BGP learns through redistribution carry the Incomplete origin attribute because there is no way to determine the original source of the route.
- Preference: IGP > EGP > Incomplete (EGP is a legacy origin code that is rarely seen today).
Next-hop
AS-path

BGP Path selection:

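Next hop must be reachable; otherwise the route is not considered.
Weight (Cisco-specific, local to the router): prefer the route with the highest weight.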
Local-Preference: Prefer the route with higher local preference.
Locally originated: prefer the route that was originated locally on the router and injected into BGP with the network or aggregate statement or through redistribution. That is, prefer a route that was learned from an IGP or from a direct connection on the same router.
AS-PATH: prefer route with shortest AS PATH
ORIGIN: prefer route with IGP origin code over Incomplete origin code
MED: prefer the route with the lowest MED (MULTI_EXIT_DISC) value. By default, this comparison is done only if the AS number is the same for all the routes being considered.
eBGP > iBGP: prefer routes learned from an eBGP peer over routes learned from an iBGP peer.
Lowest IGP metric to the BGP next hop.
If the routes are still equal, are from the same neighboring AS, and BGP multipath is enabled with the maximum-paths statement, install all the equal-cost routes in the Loc-RIB.
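
The selection order above can be sketched as a sort key, where each step is only consulted if all earlier steps tie. This is a simplified illustration under stated assumptions: the `Path` fields and function names are hypothetical, MED is compared across all paths (the default described above compares it only between routes from the same neighboring AS), and the multipath step is left out.

```python
from dataclasses import dataclass

ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}  # lower rank is preferred

@dataclass
class Path:
    next_hop_reachable: bool = True
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    as_path: tuple = ()
    origin: str = "igp"
    med: int = 0
    is_ebgp: bool = True
    igp_metric: int = 0

def sort_key(p):
    # Tuples compare element by element, which mirrors the step-by-step tie-breaking.
    return (
        -p.weight,                 # highest weight first
        -p.local_pref,             # highest local preference
        not p.locally_originated,  # locally originated routes win
        len(p.as_path),            # shortest AS_PATH
        ORIGIN_RANK[p.origin],     # IGP over Incomplete
        p.med,                     # lowest MED (simplified: always compared)
        not p.is_ebgp,             # eBGP over iBGP
        p.igp_metric,              # lowest IGP metric to the next hop
    )

def best_path(paths):
    """Return the preferred path, or None if no path has a reachable next hop."""
    candidates = [p for p in paths if p.next_hop_reachable]
    return min(candidates, key=sort_key) if candidates else None
```

With this sketch, a path with local preference 200 and a three-hop AS_PATH beats a path with local preference 100 and a two-hop AS_PATH, because local preference is evaluated before AS_PATH length.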

Unlike IGPs, iBGP sessions often span multiple router hops; a router cannot establish an iBGP session unless it knows how to reach its peer. Therefore, one of the first steps in troubleshooting an IBGP session that stays in Active state (listening for a configured neighbor) is to look in the routing tables of both neighbors and see if they know how to find each other.

By default, an outgoing TCP session is sourced from the address of the outgoing physical interface. Suppose every router in Figure 2-27 originates its IBGP TCP session from a physical interface and targets its peer’s loopback, while the peer does the same in the opposite direction. The endpoints of the attempted TCP sessions never match, and the sessions do not come up.

When prefixes are specified with the network statement, BGP looks into the IP routing table. If the specified prefix is not in that table, BGP does not enter it into the BGP table. That is, BGP does not inject a prefix unless the router has a valid path to the destination.

By default, the IGP metric of the injected prefix becomes the MULTI_EXIT_DISC (MED) attribute of the advertised BGP route, which displays in the BGP table as “Metric.”

ATOMIC_AGGREGATE and AGGREGATOR Attributes

When aggregation is performed in a BGP-speaking router, the information that is lost is path detail.

The ATOMIC_AGGREGATE is a well-known discretionary attribute that alerts downstream routers that a loss of path information has occurred. Any time a BGP speaker summarizes more-specific routes into a less-specific aggregate, and path information is lost, the BGP speaker must attach the ATOMIC_AGGREGATE attribute to the aggregate route.

When the ATOMIC_AGGREGATE attribute is set, the BGP speaker has the option of also attaching the AGGREGATOR attribute. This optional transitive attribute provides information about where the aggregation was performed, by including the AS number and the IP address of the router that originated the aggregate route.

ATOMIC_AGGREGATE serves as a “tag” to remind you that the route is an aggregate; when examining a number of BGP routes, the ones that are aggregates might not be readily apparent to you, especially if you look at them far upstream of the aggregation point. AGGREGATOR, then, leads you back to the aggregation point.

MED:

When you advertise a prefix to multiple external peers in a neighboring AS, you may want to “tell” the neighboring AS which route to prefer; MED does this. MED is an optional, nontransitive attribute. When a BGP speaker learns a route from an external peer, it can pass the route’s MED to any IBGP peers. But a router cannot advertise a MED that was originated in a neighboring AS to a peer in another AS.

Route-Reflector

A client router in a route reflection cluster can peer with external neighbors, but the only internal neighbor it can peer with is a route reflector in its cluster or other clients in the cluster.

Rules:

If the route was learned from a nonclient IBGP peer, it is reflected to clients only.
If the route was learned from a client, it is reflected to all nonclients and clients, except for the originating client.
If the route was learned from an EBGP peer, it is sent to all clients and nonclients.

To prevent routing loops, route reflectors use two BGP path attributes: ORIGINATOR_ID and CLUSTER_LIST
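
The reflection rules and the two loop-prevention attributes above can be sketched as follows. This is a minimal illustration with hypothetical function and peer names. The loop check assumes the usual behaviour: a router discards a route whose ORIGINATOR_ID is its own router ID, and a route reflector discards a route whose CLUSTER_LIST already contains its own cluster ID.

```python
def reflect_targets(source, source_kind, clients, non_clients):
    """Return the set of IBGP peers a route reflector sends a route to.

    source_kind is "client", "non_client", or "ebgp" (hypothetical labels).
    """
    if source_kind == "non_client":
        targets = set(clients)                      # clients only
    elif source_kind in ("client", "ebgp"):
        targets = set(clients) | set(non_clients)   # everyone
    else:
        raise ValueError("unknown source kind: %r" % source_kind)
    targets.discard(source)  # never send the route back to the peer it came from
    return targets

def accept_reflected(originator_id, cluster_list, my_router_id, my_cluster_id):
    """Return False if the route has looped back to this router or cluster."""
    if originator_id == my_router_id:
        return False  # this router originated the route
    if my_cluster_id in cluster_list:
        return False  # the route already passed through this cluster
    return True
```

For example, a route received from client `c1` with clients `{c1, c2}` and nonclients `{n1}` is reflected to `{c2, n1}`.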

BGP in DataCenter

Why BGP vs OSPF/ISIS:

Less flooding with BGP
Easier to troubleshoot than a link-state database
Why not iBGP: loop avoidance in iBGP is tricky.

BGP ASN assignment:

Use eBGP everywhere.
Advertise the device loopbacks and the ToR prefixes.
Use a unique ASN for each S0.
Use the same ASN for all S1 devices in a pod (S0 and S1 together form a pod); each pod’s S1 tier has its own ASN.
All S2 devices share the same ASN (BigPOD).

BGP convergence in DC using BGP:

The optimisations below help BGP converge quickly in the DC. BGP is a path-vector protocol, and routing is done by rumour: each router tells its neighbours about a change, and they in turn tell theirs. Convergence is complete when the change has been installed in the RIB/FIB and traffic is taking the new path: https://blog.ipspace.net/2020/11/fast-failover-challenge/

Detect failure in BGP:
- Two types:
  - Loss of signal or loss of light (layer 1): On Arista, the interface goes down, which notifies the ASIC. The ASIC no longer considers that path viable for forwarding, and any next hops on that interface are marked disabled.
    - Keep in mind that most links in the DC are directly connected, so this is the common scenario.
    - The control plane is notified concurrently, so BGP tears down the session immediately; it doesn’t wait for the hold time to expire.
  - Layer 2 issue: We can use a protocol like BFD for faster detection of failures caused by layer 2 issues.
Fast rehash for BGP with ECMP: Fast Rehash is a forwarding construct where the next hop (it could be called differently) is not a single entry but an array of entries (ECMP bundle) downloaded into the forwarding hardware by the control plane. If one of them becomes unavailable (BFD DOWN, carrier loss, or interface down events), it is simply removed from the array and the hashing is updated accordingly, hence the name.
Process updates as soon as possible.
Propagate updates as soon as possible (MRAI timers):
- By default, BGP waits for the MRAI timer before sending an update to the neighbor.
- Set the MRAI timer to 0.
Propagate updates to everyone at the same time (peer groups).
Hold/keepalive timers don’t matter here, because failures are detected through link-down events or BFD rather than by hold-timer expiry.
Connect timer: if you lose a session, BGP waits X seconds before it goes to the connect state again. We should reduce it too.
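
The fast rehash idea described above can be sketched as an array of next hops that shrinks when a member fails. This is a toy model, not how any particular ASIC works: the `EcmpGroup` class is hypothetical, and CRC32 over the flow tuple stands in for the hardware hash.

```python
import zlib

class EcmpGroup:
    """Toy ECMP bundle: an array of next hops selected by a flow hash."""

    def __init__(self, next_hops):
        self.next_hops = list(next_hops)

    def remove(self, next_hop):
        # Called on BFD down, carrier loss, or interface down; no BGP update is needed.
        if next_hop in self.next_hops:
            self.next_hops.remove(next_hop)

    def pick(self, flow):
        """Map a flow (for example a 5-tuple) onto one of the remaining next hops."""
        if not self.next_hops:
            return None
        h = zlib.crc32(repr(flow).encode())
        return self.next_hops[h % len(self.next_hops)]
```

After `remove("spine2")` on a group of three spines, every flow hashes over the two remaining entries, so traffic moves off the failed path without waiting for the control plane to converge.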

ECMP in CLOS:

input value + hash algo + hash-seed = hash. We want all the devices in a tier (say S2) to use the same hash. This ensures that traffic still takes the same next hop when a link on a previous device fails and the flow arrives through a different device in the tier.

hash-offset: an offset applied to the result after the hash is calculated.
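
The relationship between input, algorithm, seed, and offset can be sketched as below. The details are assumptions made for illustration: CRC32 stands in for the hash algorithm, the seed is used as the CRC starting value, and the offset is modelled as a right shift that selects which bits of the hash are used. Real platforms implement these differently.

```python
import zlib

def ecmp_index(flow, seed, n_links, offset=0):
    """Pick a link index for a flow; the same inputs always give the same index."""
    data = repr(flow).encode()
    h = zlib.crc32(data, seed)          # input value + hash algo + hash-seed = hash
    return (h >> offset) % n_links      # offset applied after the hash is calculated

# Two devices in the same tier configured with the same seed pick the same link index.
flow = ("10.0.0.1", "10.0.1.1", 6, 12345, 443)
device_a = ecmp_index(flow, seed=7, n_links=4)
device_b = ecmp_index(flow, seed=7, n_links=4)
assert device_a == device_b
```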

How to make ECMP work in CLOS using eBGP:

multi-path as-path relaxed: During the BGP tie-breaking process, eBGP looks at the AS_PATH sequence. If two paths have the same AS_PATH length but different content, by default eBGP does not treat them as equal for multipath and moves on to the next tie-breaking criterion. To make eBGP compare only the length and ignore the content of the list, we have to use “multi-path as-path relaxed”. Arista has it enabled by default.
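
The effect of the relaxed setting can be sketched as a single comparison. This is a minimal illustration; the function name and the ASNs are hypothetical.

```python
def multipath_eligible(best_as_path, other_as_path, relaxed):
    """Decide whether two paths count as equal for ECMP as far as AS_PATH is concerned."""
    if relaxed:
        # Relaxed: only the AS_PATH length has to match.
        return len(best_as_path) == len(other_as_path)
    # Strict: the AS_PATH content has to match exactly.
    return tuple(best_as_path) == tuple(other_as_path)
```

In a CLOS fabric where each upstream device has its own ASN, the same prefix arrives with paths such as `(65001, 65100)` and `(65002, 65100)`. These are eligible for ECMP only when `relaxed` is true.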

BGP configuration without P2P IPs:

An interface without an IP address of its own is called an “unnumbered” interface. Such interfaces borrow the IP address from an interface that never fails: the loopback interface.

Routers can respond to ARPs on unnumbered interfaces with the receiving interface’s local MAC address because the interface has an IP address, even if borrowed.

BGP Unnumbered:

Every link in an IPv6 network is automatically assigned an IP address that is unique only to that link. Such an address is called the IPv6 link-local address (LLA).

Typically, an LLA is derived from the MAC address on the link.
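
A common way to derive the LLA from a MAC address is the modified EUI-64 method: flip the universal/local bit of the first octet, insert `ff:fe` in the middle, and place the result under `fe80::/64`. The sketch below assumes that method; the function name and the MAC address are made up for the example, and some operating systems generate the interface identifier differently.

```python
import ipaddress

def mac_to_link_local(mac):
    """Build an IPv6 link-local address from a MAC using modified EUI-64."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    octets[0] ^= 0x02                                   # flip the universal/local bit
    eui64 = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    return ipaddress.IPv6Address(b"\xfe\x80" + b"\x00" * 6 + eui64)

print(mac_to_link_local("00:1c:73:aa:bb:cc"))  # fe80::21c:73ff:feaa:bbcc
```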

The IPv6 LLA is used only to establish a TCP connection for starting a BGP session. Besides enabling IPv6 on a link, which is typically enabled automatically, and the enabling of the IPv6 router advertisement on the link, no other knowledge of IPv6 is expected of the operator.

Even though we now potentially can establish a BGP peering without requiring an interface IP address, advertising routes also requires a way to specify how to reach the router advertising the routes. In BGP, this is signalled explicitly in the route advertisement via the NEXTHOP attribute.

Except for getting the MAC address to put on the packet, the nexthop IP address is not used in the packet at all.

IPv6 RA has an option to carry the sender’s MAC address, as well.

Config for Arista: https://blog.ipspace.net/2024/03/arista-interface-ebgp/

Config for Junos: https://www.theasciiconstruct.com/post/junos-bgp-and-bgp-unnumbered/

Config for Cumulus Linux: https://docs.nvidia.com/networking-ethernet-software/cumulus-linux-37/Layer-3/Border-Gateway-Protocol-BGP/#bgp-unnumbered-interfaces

How traceroute Interacts with BGP Unnumbered Interfaces:

Every router or end host must have an IPv4 address to complete a traceroute of IPv4 addresses. In this case, the IPv4 address used is that of the loopback device.

Device maintenance in a Datacenter:

Option 1: AS Path prepend

If spine01 is going to be upgraded, you should ask all the leaves to ignore spine01 in their best path computation and send all traffic only to spine02 during this time to ensure a smooth traffic flow. Similarly, in the case of leaves with dual-attached servers, it would be useful for the spines to avoid sending traffic to the leaf undergoing the upgrade and use only the working leaf. In this fashion, routers can be upgraded, one box at a time, without causing unnecessary loss of traffic.

The most common and interoperable way to drain traffic is to force the routes to be advertised from the node with an additional ASN added to the advertisement, causing the AS_PATH length to increase in comparison to the node’s peers.
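
The drain-by-prepend idea can be sketched as follows. This is a minimal illustration with hypothetical function names and ASNs; it models only the AS_PATH length comparison, not the full best-path process.

```python
def advertised_path(as_path, local_as, extra_prepends=0):
    """AS_PATH as sent to an eBGP peer: one normal prepend plus any extra drain prepends."""
    return (local_as,) * (1 + extra_prepends) + tuple(as_path)

# spine01 (AS 65001, hypothetical) is being drained; spine02 (AS 65002) is not.
via_spine01 = advertised_path((65100,), 65001, extra_prepends=1)  # (65001, 65001, 65100)
via_spine02 = advertised_path((65100,), 65002)                    # (65002, 65100)

# A leaf that prefers the shortest AS_PATH now sends traffic only through spine02.
preferred = min([via_spine01, via_spine02], key=len)
assert preferred == via_spine02
```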

Option 2: graceful shutdown community

After a BGP speaker receives a route with the GRACEFUL_SHUTDOWN community set, it lowers the LOCAL_PREF attribute to 0, making it less preferable than the routes with the default LOCAL_PREF of 100.
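
The receiver-side behaviour can be sketched as an import policy. This is a minimal illustration with hypothetical function names; communities are modelled as `(high, low)` tuples, and `65535:0` is the well-known GRACEFUL_SHUTDOWN community value.

```python
GRACEFUL_SHUTDOWN = (65535, 0)  # well-known community 65535:0

def import_local_pref(local_pref, communities):
    """Return the LOCAL_PREF to use after applying the graceful shutdown policy."""
    if GRACEFUL_SHUTDOWN in communities:
        return 0           # least preferred, but still usable as a last resort
    return local_pref

# The route via the draining node loses to any route with the default LOCAL_PREF of 100.
draining = import_local_pref(100, {GRACEFUL_SHUTDOWN})
normal = import_local_pref(100, set())
assert draining < normal
```

Because the route is kept with a LOCAL_PREF of 0 rather than withdrawn, traffic keeps flowing over the old path until an alternative is installed.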

Option 3: increase MED

Connecting Servers to the network:

We have a few options depending on the server type. Subnets are usually assigned per rack; for 40 servers, you are looking at a /26 for IPv4 (64 addresses).

Servers with anycast: Multiple servers can provide the same service using anycast. For example, DNS servers can use anycast to provide services. Each server announces the VIP to the CLOS network.
- This anycast IP won’t be part of the ToR subnet for the servers (/26 for IPv4).
- We will use BGP on the host to advertise the anycast VIP:
  - We can use BGP unnumbered or dynamic BGP between the ToR and the server.
  - We can use the same ASN for all servers, or one ASN per rack: servers connected to the same ToR share a single ASN, whereas servers connected to different ToRs have different ASNs.
  - We will usually announce a default route to the anycast servers and accept only the anycast IP from them, using route-maps.
Servers with SVI: These servers connect to a VLAN SVI interface on the ToR, and the SVI is configured with the /26 subnet. We inject the /26 subnet into BGP on the ToR along with the ToR loopback IP.
Servers that need layer 2 connectivity using VLANs.
Servers running Kubernetes: In this case, we assign a subnet for the Kubernetes pods via the Kubernetes bridge interface. The server itself has a separate subnet connecting it to the ToR. The next hop for the pods is the ToR interface. We also peer kube-router with the leaf using eBGP.

Below is an example of anycast peering with servers:

show ipv6 bgp summary | grep SDN
  SDN_ANYCAST              fdbd:dc71:1:1::25 4 64512        2110654   2231366    0    0  219d22h Estab   1      0
  SDN_ANYCAST              fdbd:dc71:1:1::26 4 64512        2111082   2231321    0    0  219d22h Estab   1      0
  SDN_ANYCAST              fdbd:dc71:1:1::27 4 64512        2110846   2231469    0    0  219d22h Estab   1      0
  SDN_ANYCAST              fdbd:dc71:1:1::28 4 64512        2110820   2231505    0    0  219d22h Estab   1      0
  SDN_ANYCAST              fdbd:dc71:1:1::29 4 64512        2111089   2231657    0    0  219d22h Estab   1      0

Below is an example of an SVI interface:

show running-config interfaces vlan1000
interface Vlan1000
   mtu 9000
   no autostate
   ipv6 dhcp relay all-subnets
   ipv6 dhcp relay destination fdbd:dc00::10:8:8:36
   ipv6 address fdbd:dc71:1:1::1/64
   ipv6 nd managed-config-flag
   ipv6 nd prefix fdbd:dc71:1:1::/64 no-advertise
   ipv6 access-group BMC_SEC_V6 out

show running-config section router bgp | grep network
      network fdbd:dc71:1:1::/64
      network fdbd:dc71:98:109::/128

show running-config interfaces loopback 0
interface Loopback0
   ipv6 enable
   ipv6 address fdbd:dc71:98:109::/128