Configuration Guide
Intelligent Lossless Network Configuration Guide
Static Intelligent Routing

In an AI intelligent computing network with a 1:1 uplink-to-downlink bandwidth ratio, complex scenarios with multiple ingresses and multiple egresses are common. In such scenarios, the goal is to distribute traffic with the same destination evenly from multiple ingresses to multiple egresses while preserving packet order, that is, forwarding flow by flow. In real environments, however, traffic from multiple ingresses often converges on a single egress while the other egresses carry no traffic, resulting in unbalanced load sharing and network congestion.

To address this problem, Asterfusion proposed an intelligent routing algorithm that allocates VRFs based on source IP addresses, implements service isolation through VRFs, and forwards traffic based on policy routing. This eliminates the traditional hash process and solves the problem of uneven load.

Static Intelligent Routing Workflow

Routing Learning Process

Local leaf: After the tenant comes online, the local leaf uses the ARP-to-host function to generate host routes under the default VRF for the NICs of the directly connected downstream servers, generates host routes under the tenant VRF according to the IPs and tenant VRFs, and advertises them to the remote end with a tag attached.

Spine: The spine receives the host route with the specified tag, generates a normal route locally, and synchronizes it to the peer leaf.

Peer leaf: When the peer leaf receives host route information with the specified tag, it generates the host/prefix route locally under the corresponding tenant VRF (the default VRF route is imported through tag filtering). When the route status is active, the route is additionally assigned the tenant VRF route table hit information used for the policy route forwarding described below.

Forwarding Process

According to the flow forwarding path, the flow forwarding model can be classified into
local communication forwarding and cross-spine communication forwarding, as shown in the following figure. For the two types of communication forwarding, the device processes traffic as follows.

(1) Local communication

Forwarding path: traffic is sent from NIC1 of Server 1 to NIC1 of Server 16.

After Leaf 1 receives the traffic sent from NIC1 of Server 1, it assigns a VRF according to the source IP and queries the routing table of that VRF. It finds that the destination IP belongs to NIC1 of Server 16, hits the local routing table, and forwards the traffic directly to the destination NIC (NIC1 of Server 16) according to the next hop of the route.

(2) Cross-spine communication

Forwarding path: traffic is sent from NIC1 of Server 1 to NIC5 of Server 16.

After Leaf 1 receives the traffic sent from NIC1 of Server 1, it assigns a VRF according to the source IP and queries the routing table of that VRF. It finds that the destination IP belongs to NIC5 of Server 16, hits the remote routing table, and confirms that the destination IP is in the downlink IP list of the peer leaf. It then determines the corresponding policy route from the source IP and the tenant VRF, redirects the traffic to the next hop specified by that policy route, and sends it to the spine through the uplink interface corresponding to that next hop.

The spine receives the traffic and queries its routing table to forward it according to standard L3 logic.

Leaf 5 receives the traffic and forwards it to the destination NIC according to the default routing table.

Link Failure Handling

Uplink failure

The device handles uplink failure and recovery as follows.

If the link between Leaf 1 and Spine 1 fails, Leaf 1 senses the link failure, the policy route corresponding to the failed link becomes invalid, and the corresponding traffic goes through ECMP
(Spine 2 + Spine 3 + … + Spine n), while other traffic is still forwarded through its corresponding policy route.

Other leaf switches sense the change of the remote route: the number of corresponding next hops drops from n spines to (n-1) spines, which is fewer than the number of ports in the uplink list. The policy route therefore becomes invalid for that remote route, the next hop of the route changes to ECMP (Spine 2 + Spine 3 + … + Spine n), and other traffic is not affected.

After the link is restored, Leaf 1 redistributes the corresponding policy route, and the corresponding traffic is again forwarded according to the policy route.

When other leaf switches sense the change of the corresponding remote route, the number of next hops increases from (n-1) spines to n spines. Because the number of next hops is now greater than or equal to the number of ports in the uplink list, the policy route for the remote route takes effect again, and forwarding switches back from ECMP to the policy route.

Spine failure

The device handles spine failure and recovery as follows.

Spine failure: When a spine fails, all leaf devices detect that the number of next hops for all remote routes decreases from n spines to (n-1) spines. If the number of next hops is less than the configured number of ports in the uplink list, policy-based routing for remote routes becomes invalid, and traffic forwarding switches from policy-based routing to regular ECMP (equal-cost multi-path). Because traffic is forwarded using standard ECMP in this failure scenario, it is recommended to combine it with enhanced hash mechanisms.

Spine recovery: After a spine recovers, all leaf devices detect that the number of next hops for all remote routes increases from (n-1) spines back to n spines. If the number of next hops meets or exceeds the configured number of ports in the uplink list, policy-based routing for remote routes is restored. Traffic forwarding
then switches back from ECMP to policy-based routing.

Dynamic Intelligent Routing

Three mainstream technologies currently dominate network load balancing: flow-based ECMP (equal-cost multi-path) balancing, flowlet-based subflow balancing, and packet-based ECMP balancing.

Flow-based ECMP balancing: The most widely used load balancing algorithm; it hashes each flow on its 5-tuple. It performs well in scenarios with numerous flow connections and has the advantage of zero packet reordering. However, it suffers from hash collisions in flow-sparse environments (e.g., AI training), leading to suboptimal load distribution.

Flowlet-based subflow balancing: This technique relies on configuring the inter-flowlet time gap for load balancing. Accurate gap configuration becomes unfeasible if global path-level latency information is unavailable in the network.

Packet-based ECMP balancing: Theoretically optimal in balancing granularity, but in practice it causes severe packet reordering at the receiver.

Dynamic intelligent routing is a sensing-driven load balancing technology. By detecting path quality through the switches in the network, it dynamically adjusts local switch path selection to achieve flexible load balancing. It further supports dynamic weighted load balancing and introduces the dynamic WCMP (Weighted-Cost Multipath) algorithm by enhancing flow-based ECMP.

Given that data centers and carriers predominantly use BGP as the underlying routing protocol, dynamic intelligent routing extends BGP by defining a new extended community attribute. This attribute evaluates path quality based on multi-dimensional, high-precision metrics and is transmitted via BGP to guide traffic forwarding, improving overall load balancing efficiency and reducing application response time.

Path Quality Synchronization and Calculation

Based on long-term
observations in AI cluster networks, dynamic intelligent routing uses three critical parameters (bandwidth utilization, queue occupancy, and forwarding latency) as factors for comprehensive path quality assessment.

Bandwidth/queue utilization: Collected from ASIC hardware registers with hundred-millisecond accuracy. Results are announced via BGP at 1-second intervals (weighted averaging prioritizes recent data) to reduce control plane overhead.

Forwarding latency: Measured via HDC (High Delay Capture), an INT (In-band Network Telemetry) technology that captures packets exceeding user-defined latency thresholds. Switches extract the first 150 bytes of such packets together with metadata (ingress/egress ports, latency) and send them to collectors for high-precision latency analysis.

Path quality is propagated using the Path Bandwidth Extended Community attribute via BGP extensions. The synchronization logic is illustrated below.

When NIC1 communicates with NIC2, NIC2 first advertises its IP to Leaf2. Leaf2 then advertises NIC2's IP to the spine while appending the corresponding link quality (calculated as the link quality toward NIC2 multiplied by Leaf2's downlink weight). The spine subsequently advertises NIC2's IP to Leaf1, appending the corresponding link quality (calculated as the link quality toward Leaf2 multiplied by the spine's weight, plus the accumulated path metric already carried in the routing information). Finally, Leaf1 aggregates the path quality and generates routing instructions to guide traffic forwarding.

WCMP

WCMP (Weighted-Cost Multipath) forwards traffic proportionally across paths, with ECMP as the special case in which all weights are equal. In dynamic intelligent routing, WCMP dynamically adjusts route weights based on real-time path quality to achieve flexible load balancing.

As shown above, when two paths exist between NIC1 and NIC2, assume Leaf1 calculates the comprehensive quality of the red path to NIC2 as 38 and the green path as 80 through the path quality synchronization and computation
algorithms. The WCMP provisioning then assigns a 3:7 weight ratio (30%:70%) to these paths.

As global network traffic fluctuates, path quality changes dynamically. These updates are converted into path quality metrics and propagated via BGP to every leaf switch, where dynamic WCMP routes are generated to guide traffic forwarding.

Dynamic Routing Workflow

Route Learning Process

Local leaf: Generates default VRF routes for directly connected NICs (via ARP or BGP), tags the routes with path quality, and advertises them to remote devices.

Spine: Receives the tagged routes, accumulates local path quality, and synchronizes the routes to peer leaf switches.

Peer leaf: Imports the routes into the tenant VRF (filtered by tags). For multi-path destinations, it aggregates path quality and generates WCMP routes.

Traffic Forwarding Process

According to the flow forwarding path, the flow forwarding model can be classified into local communication forwarding and cross-spine communication forwarding, as shown in the following figure. For the two types of communication forwarding, the device processes traffic as follows.

(1) Local communication

Forwarding path: traffic is sent from NIC1 of Server 1 to NIC1 of Server 16.

After Leaf 1 receives the traffic sent from NIC1 of Server 1, it assigns a VRF based on the source IP and queries the routing table of that VRF. It finds that the destination IP belongs to NIC1 of Server 16, hits the local routing table, and forwards the traffic directly to the destination NIC (NIC1 of Server 16) according to the next hop of the route.

(2) Cross-spine communication

Forwarding path: traffic is sent from NIC1 of Server 1 to NIC5 of Server 16.

Leaf 1 assigns a VRF, matches the remote route, and distributes traffic to the spines according to the WCMP weights.

The spine forwards the traffic using standard L3 routing.

Leaf 5 delivers the traffic to the destination NIC via its default route table.
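The WCMP weighting used in this forwarding process can be illustrated with a minimal sketch. The guide does not state how path quality scores become weights or how a flow is mapped to a weighted path, so both steps here are assumptions: scores are scaled proportionally to integer weights summing to 10, and each flow's 5-tuple is hashed onto a slot table built from those weights. The function names `wcmp_weights` and `pick_path` are hypothetical and do not correspond to any device API.

```python
import hashlib


def wcmp_weights(path_quality, total=10):
    """Scale path-quality scores to integer weights that sum to `total`.

    Assumption: weights are proportional to quality, rounded with the
    largest-remainder method. The guide only gives the 38/80 -> 3:7 example.
    """
    quality_sum = sum(path_quality.values())
    if quality_sum <= 0:
        raise ValueError("at least one path needs a positive quality score")
    raw = {path: q * total / quality_sum for path, q in path_quality.items()}
    weights = {path: int(value) for path, value in raw.items()}
    leftover = total - sum(weights.values())
    # Hand the remaining units to the paths with the largest fractional part.
    by_remainder = sorted(raw, key=lambda p: raw[p] - weights[p], reverse=True)
    for path in by_remainder[:leftover]:
        weights[path] += 1
    return weights


def pick_path(flow, weights):
    """Map a flow 5-tuple to one path in proportion to the weights.

    The same flow always maps to the same path, so its packets stay in order.
    """
    slots = [path for path, w in sorted(weights.items()) for _ in range(w)]
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return slots[int.from_bytes(digest[:4], "big") % len(slots)]


weights = wcmp_weights({"red": 38, "green": 80})
print(weights)  # {'red': 3, 'green': 7}
```

With the qualities 38 and 80 from the WCMP example, proportional scaling gives 3.22 and 6.78, which round to the 3:7 ratio quoted in the text. Equal quality scores give equal weights, which is the ECMP special case.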
Link Failure Handling

Uplink failure

The device handles uplink failure and recovery as follows.

If the link between Leaf 1 and Spine 1 fails, Leaf 1 detects the failure and withdraws the corresponding routes. Traffic continues via WCMP (Spine 2 + Spine 3 + … + Spine n).

Other leaf switches detect the route withdrawal. Traffic destined for Leaf 1 continues via WCMP (Spine 2 + Spine 3 + … + Spine n).

After link recovery, Leaf 1 re-advertises the routes via BGP, restoring full-path WCMP for the affected traffic.

Other leaf switches relearn Leaf 1's routes, restoring full-path WCMP.

Spine failure

The device handles spine failure and recovery as follows.

When Spine 1 fails, all leaf switches detect the withdrawal of routes passing through Spine 1. Traffic continues via WCMP (Spine 2 + Spine 3 + … + Spine n).

After Spine 1 recovers, all leaf switches relearn the routes through Spine 1, restoring full-path WCMP.

Intelligent Routing Configuration

Intelligent Routing Default Setting

The default setting of intelligent routing is shown in the table below.

Table 1 Default setting of intelligent routing

| Parameter | Default value |
| --- | --- |
| router type | leaf |
| ai network mode | static |

Configure Intelligent Routing Mode

Notes:
Ensure router bgp is configured before enabling intelligent routing.
Leaf and spine devices must use the same routing mode.

Intelligent routing supports two modes:

Static mode: Assigns VRFs based on source IP for service isolation and uses policy-based routing for 1:1 forwarding.

Dynamic mode: Assigns VRFs based on source IP for service isolation, dynamically adjusts path weights using real-time path quality data from the switches, and combines this with WCMP for flexible load balancing.

Table 2 Configure intelligent routing mode

| Purpose | Command | Description |
| --- | --- | --- |
| Enter global configuration view | configure terminal | |
| Configure router bgp and enter the appropriate configuration view | router bgp asn | asn: local AS number |
| Disable the eBGP policy requirement | no bgp ebgp-requires-policy | |
| Return to global configuration view | exit | |
| Configure device roles | router type { leaf \| spine } | |
| Configure intelligent routing mode | ai network mode { static \| dynamic } | |

Configure Leaf Uplink

Notes:
Configure the intelligent routing mode before setting the leaf uplink.
The leaf's uplink BGP mode must match the spine's downlink BGP mode.

The BGP configuration supports two modes:

BGP link-local mode: The device automatically configures BGP neighbors using the IPv6 link-local addresses of the interfaces.

BGP normal mode: This mode directly uses the BGP configuration of the uplink interfaces. BGP neighbors must therefore be configured manually on the uplink interfaces before enabling this mode.

BGP link-local mode is simpler to configure and is recommended. The uplink mode must be consistent between spine and leaf devices. The following sections detail the uplink configuration steps for each mode.

BGP link-local mode

Table 3 Leaf uplink configuration in BGP link-local mode

| Purpose | Command | Description |
| --- | --- | --- |
| Enter global configuration view | configure terminal | |
| Configure the uplink interface for intelligent routing | ai network uplink list { interface-list \| default } | interface-list: interface list; each interface must be a physical Layer 3 port. default: top half of the front-panel ports |
| Configure uplink BGP mode for intelligent routing | ai network uplink bgp link local | |

BGP normal mode

Table 4 Leaf uplink configuration in BGP normal mode

| Purpose | Command | Description |
| --- | --- | --- |
| Enter global configuration view | configure terminal | |
| Configure router bgp and enter the appropriate configuration view | router bgp asn | asn: local AS number |
| Configure BGP neighbor | neighbor neighbor-ip remote-as asn | neighbor-ip: IP address of the BGP neighbor. asn: neighbor AS number |
| Return to global configuration view | exit | |
| Configure the uplink interface for intelligent routing | ai network uplink list { interface-list \| default } | interface-list: interface list; each interface must be a physical Layer 3 port. default: top half of the front-panel ports |
| Configure uplink BGP mode for intelligent routing | ai network uplink bgp link local | |

Configure Spine Downlink

Notes:
Configure the routing mode before spine downlink setup.
The spine's downlink BGP mode must match the leaf's uplink BGP mode.

The spine's downlink BGP mode is functionally identical to the leaf's uplink BGP mode. The following sections detail the downlink configuration steps for each mode.

BGP link-local mode

Table 5 Spine downlink configuration in BGP link-local mode

| Purpose | Command | Description |
| --- | --- | --- |
| Enter global configuration view | configure terminal | |
| Configure the downlink interface for intelligent routing | ai network downlink list { interface-list \| default } | interface-list: interface list; each interface must be a physical Layer 3 port. default: bottom half of the front-panel ports |
| Configure downlink BGP mode for intelligent routing | ai network downlink bgp link local | |

BGP normal mode

Table 6 Spine downlink configuration in BGP normal mode

| Purpose | Command | Description |
| --- | --- | --- |
| Enter global configuration view | configure terminal | |
| Configure router bgp and enter the appropriate configuration view | router bgp asn | asn: local AS number |
| Configure BGP neighbor | neighbor neighbor-ip remote-as asn | neighbor-ip: IP address of the BGP neighbor. asn: neighbor AS number |
| Return to global configuration view | exit | |
| Configure the downlink interface for intelligent routing | ai network downlink list { interface-list \| default } | interface-list: interface list; each interface must be a physical Layer 3 port. default: bottom half of the front-panel ports |
| Configure downlink BGP mode for intelligent routing | ai network downlink bgp link local | |

Configure Leaf Downlink

Notes:
Complete the leaf uplink configuration first.
Configure the leaf downlink within an intelligent routing instance.

The downlink routing supports two modes. ARP mode is simpler to configure, while BGP mode is suitable for scenarios that require dynamic route exchange between leaf devices and servers. ARP mode is therefore recommended in the absence of specific business requirements.

ARP mode: Converts ARP entries on the downlink interfaces into host routes and advertises them via the uplink BGP. The configured IP list must be in the same subnet as the downlink interface IP.

BGP mode: Requires BGP peering between the leaf device and the servers. The leaf advertises the BGP routes learned from the servers to other leaf devices via the uplink BGP.

The configuration steps for both modes follow.

ARP mode

Table 7 Leaf downlink configuration in ARP mode

| Purpose | Command | Description |
| --- | --- | --- |
| Enter global configuration view | configure terminal | |
| Create an intelligent routing downlink instance and enter the instance configuration view | ai network instance instance-id | instance-id: instance ID, range 1-1024 |
| Configure the downlink interface for intelligent routing | downlink list { interface-list \| default } | interface-list: interface list; each interface must be a physical Layer 3 port. default: default interface configuration, the bottom half of the front-panel ports |
| Configure the routing mode of the intelligent routing downlink and enter the corresponding configuration view | downlink mode arp | arp: ARP mode |
| Configure the intelligent routing downlink IP list | ip list ip-list | ip-list: interface routing list; only 32-bit (host) entries are supported |

BGP mode

Table 8 Leaf downlink configuration in BGP mode

| Purpose | Command | Description |
| --- | --- | --- |
| Enter global configuration view | configure terminal | |
| Enter router bgp configuration view | router bgp asn | asn: local AS number |
| Configure BGP neighbor | neighbor neighbor-ip remote-as asn | neighbor-ip: IP address of the BGP neighbor. asn: neighbor AS number |
| Create an intelligent routing downlink instance and enter the instance configuration view | ai network instance instance-id | instance-id: instance ID, range 1-1024 |
| Configure the downlink interface for intelligent routing | downlink list { interface-list \| default } | interface-list: interface list; each interface must be a physical Layer 3 port. default: default interface configuration, the bottom half of the front-panel ports |
| Configure the routing mode of the intelligent routing downlink and enter the corresponding configuration view | downlink mode bgp | bgp: BGP mode |
| Configure the intelligent routing downlink IP list | ip list interface-name ip-list | interface-name: the downlink interface. ip-list: interface route list; only 32-bit (host) entries are supported |

Display and Maintenance

Table 9 Intelligent routing display and maintenance

| Purpose | Command | Description |
| --- | --- | --- |
| Display the role type of the device in intelligent routing | show router type | |
| Display the intelligent routing summary configuration | show ai network summary | |
| Display the intelligent routing downlink instance configuration | show ai network instance { all \| instance-id } | instance-id: instance ID, range 1-1024 |

Configuration Examples

Network requirement

In intelligent computing scenarios, the RoCE network adopts a two-tier spine-leaf architecture to achieve high-bandwidth, lossless interconnectivity between GPU servers. As shown below, both the spine and leaf layers use CX532P-NT devices. The GPU NICs are configured in 16 groups of 8 NICs each. Leaf devices serve as gateways for the servers, while dynamic intelligent routing between the spine and leaf layers implements a Layer 3 WCMP network. This enables high-speed data forwarding while supporting redundant backup paths for forwarding routes.

Topology

The ASNs of Leaf1 through Leaf8 are in increasing order. The port numbering rule of each leaf is the same as that of Leaf1, and the spine
is connected to the leaves in order. The configuration of the two spines is the same; only the interfaces of Leaf1 and Spine1 are labelled in the figure. The interface IP configuration is shown in the following table.

Table 10 Interface IP address table

| Device name | Interface | IP address |
| --- | --- | --- |
| Leaf 1 | Ethernet 0/0 to 0/60 | IPv6 link-local |
| | Ethernet 0/64 | 21.10.0.1/24 |
| | Ethernet 0/68 | 21.11.0.1/24 |
| | Ethernet 0/72 | 21.12.0.1/24 |
| | Ethernet 0/76 | 21.13.0.1/24 |
| | Ethernet 0/80 | 21.14.0.1/24 |
| | Ethernet 0/84 | 21.15.0.1/24 |
| | Ethernet 0/88 | 21.16.0.1/24 |
| | Ethernet 0/92 | 21.17.0.1/24 |
| | Ethernet 0/96 | 21.18.0.1/24 |
| | Ethernet 0/100 | 21.19.0.1/24 |
| | Ethernet 0/104 | 21.20.0.1/24 |
| | Ethernet 0/108 | 21.21.0.1/24 |
| | Ethernet 0/112 | 21.22.0.1/24 |
| | Ethernet 0/116 | 21.23.0.1/24 |
| | Ethernet 0/120 | 21.24.0.1/24 |
| | Ethernet 0/124 | 21.25.0.1/24 |
| Leaf 8 | Ethernet 0/0 to 0/60 | IPv6 link-local |
| | Ethernet 0/64 | 28.10.0.1/24 |
| | Ethernet 0/68 | 28.11.0.1/24 |
| | Ethernet 0/72 | 28.12.0.1/24 |
| | Ethernet 0/76 | 28.13.0.1/24 |
| | Ethernet 0/80 | 28.14.0.1/24 |
| | Ethernet 0/84 | 28.15.0.1/24 |
| | Ethernet 0/88 | 28.16.0.1/24 |
| | Ethernet 0/92 | 28.17.0.1/24 |
| | Ethernet 0/96 | 28.18.0.1/24 |
| | Ethernet 0/100 | 28.19.0.1/24 |
| | Ethernet 0/104 | 28.20.0.1/24 |
| | Ethernet 0/108 | 28.21.0.1/24 |
| | Ethernet 0/112 | 28.22.0.1/24 |
| | Ethernet 0/116 | 28.23.0.1/24 |
| | Ethernet 0/120 | 28.24.0.1/24 |
| | Ethernet 0/124 | 28.25.0.1/24 |

Configuration roadmap

(1) Configure the interface IP, router bgp, intelligent routing mode, and router type of the device.
(2) Configure the intelligent routing uplink (interface list, BGP mode).
(3) Configure the intelligent routing downlink instance (interface list, routing mode, and IP list).

Procedure

The following configuration takes Leaf 1 and Spine 1 as examples.

Connect each device correctly, and configure the interface IP of each device as required (omitted).

Configure router bgp

    # Spine 1
    sonic# configure terminal
    sonic(config)# router bgp 65165
    sonic(config-router)# no bgp ebgp-requires-policy
    # Leaf 1
    sonic# configure terminal
    sonic(config)# router bgp 101
    sonic(config-router)# no bgp ebgp-requires-policy

Configure the router type of the device

    # Spine 1
    sonic# configure terminal
    sonic(config)# router type spine

Configure the intelligent routing mode

    sonic# configure terminal
    sonic(config)# ai network mode dynamic

Configure the spine downlink

    # Spine 1
    sonic# configure terminal
    sonic(config)# ai network downlink list 0/0 0/124
    sonic(config)# ai network downlink bgp link local

Configure the leaf uplink

    # Leaf 1
    sonic# configure terminal
    sonic(config)# ai network uplink list 0/0 0/60
    sonic(config)# ai network uplink bgp link local

Configure the intelligent routing downlink instance

    sonic# configure terminal
    sonic(config)# ai network instance 10
    sonic(config ai network 10)# downlink list 0/64 0/124
    sonic(config ai network 10)# downlink mode arp
    # Configured based on actual user IPs
    sonic(config ai network 10 arp)# ip list 21.10.0.2, 21.11.0.2, 21.12.0.2, 21.13.0.2, 21.14.0.2, 21.15.0.2, 21.16.0.2, 21.17.0.2, 21.18.0.2, 21.19.0.2, 21.20.0.2, 21.21.0.2, 21.22.0.2, 21.23.0.2, 21.24.0.2, 21.25.0.2

Verify the configuration

Confirm the router type of the device.

    sonic# show router type
    router type leaf

Confirm the intelligent routing summary configuration.

    sonic# show ai network summary
    ai network info
    ai network mode dynamic
    uplink
    uplink list 0/0 0/60
    use bgp link local true

Confirm the intelligent routing instance configuration.

    sonic# show ai network instance all
    ai network instance 10
    downlink list 0/64 0/124
    downlink mode arp
    downlink iplist 21.10.0.2,21.11.0.2,21.12.0.2,21.13.0.2,21.14.0.2,21.15.0.2,21.16.0.2
    21.17.0.2,21.18.0.2,21.19.0.2,21.20.0.2,21.21.0.2,21.22.0.2,21.23.0.2,21.24.0.2,21.25.0.2
    route table
    vrf prefix nexthop ifname weight local or remote
    neigh table
    ifname ip neigh mac family extern learn

Server 1 forwards traffic between its different NICs successfully without packet loss. The same-numbered NICs on Server 1 and Server 2 forward traffic to each other successfully without packet loss.
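The `ip list` entered for Leaf 1 in the procedure follows a regular pattern: one server NIC address per downlink subnet. A small helper can generate it for scripted deployments. This is a sketch under stated assumptions: the function name `downlink_ip_list` is hypothetical, the host address `.2` is taken from the Leaf 1 example (real values depend on the actual user IPs), and the first-octet pattern for Leaf 2 through Leaf 7 is inferred from the Leaf 1 (21.x) and Leaf 8 (28.x) rows of the interface table.

```python
def downlink_ip_list(leaf_index, first_subnet=10, count=16, host=2):
    """Build the comma-separated argument for `ip list` on one leaf.

    Assumed addressing: leaf N uses (20 + N).(first_subnet + i).0.0/24 on
    its i-th downlink port, and each server NIC takes host address `host`.
    """
    if not 1 <= leaf_index <= 8:
        raise ValueError("the example topology has Leaf 1 through Leaf 8")
    first_octet = 20 + leaf_index
    return ", ".join(
        f"{first_octet}.{first_subnet + i}.0.{host}" for i in range(count)
    )


# Reproduces the list entered for Leaf 1 in the procedure.
print("ip list " + downlink_ip_list(1))
```

Generating the list avoids transcription slips across 16 addresses, but the output should still be checked against the addresses actually assigned to the server NICs.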
