RoCE/RDMA/DCB what is it and how to configure it


Updated January 29th with new Priority Flow Control recommendations, adding the cluster heartbeat to priority 7 for Windows Server and Dell switches.

Updated May 26th 2018 with the HPE FlexFabric config.

You have probably heard these acronyms somewhere, so what are they, and are they the same thing? In short: yes and no.

RoCE stands for RDMA over Converged Ethernet; the RDMA part is Remote Direct Memory Access.

RDMA allows network data to be offloaded to the network card and placed directly into memory, bypassing the host's CPU and leaving the CPU free for the actual workload. With normal TCP processing, all network traffic goes through the CPU, and the load grows with link speed; the author's rough figure is that a saturated 10 Gbit link can consume a large share of a 12-core Intel Xeon v4 on packet processing alone.

Mellanox has a good explanation for RDMA here.
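As a quick sanity check on a Windows host, you can ask the driver whether an adapter is RDMA capable before doing any configuration (a minimal read-only sketch):

# List adapters and whether the driver reports RDMA (Network Direct) support
Get-NetAdapterRdma | Format-Table Name, InterfaceDescription, Enabled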

DCB stands for Data Center Bridging

DCB is a set of enhancements to the Ethernet protocol. Ethernet is a best-effort network that may drop packets when network devices are busy, causing retransmissions. DCB allows selected traffic to be carried losslessly: it eliminates loss due to queue overflow (Priority Flow Control) and allows bandwidth to be allocated per traffic class on a link (Enhanced Transmission Selection). In short, DCB lets different priorities of packets share the network predictably.
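On the Windows side, the current DCB state can be inspected with a few read-only cmdlets before changing anything (these become available once the Data-Center-Bridging feature installed further down is present; a small sketch):

# Which 802.1p priorities currently have Priority Flow Control enabled
Get-NetQosFlowControl

# The ETS traffic classes and their bandwidth reservations
Get-NetQosTrafficClass

# Whether the host is willing to accept DCB settings pushed from the switch (DCBX)
Get-NetQosDcbxSetting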

 

In this post I will cover how to enable RDMA and DCB in Windows for SMB, and on different switches. I will update with more switches as I read through different vendors' documentation, since the setup varies a lot from vendor to vendor.

In the last year Microsoft has started to recommend iWARP as the default RDMA solution for S2D, based on the fact that iWARP does not need DCB, PFC and ETS to work. In principle RoCE does not strictly require them either, but since RoCE communicates over UDP, flow control is needed to handle packet drops.

A DCB-free RoCE solution is coming in the future, but for any high-IOPS RDMA configuration today, DCB and PFC are needed, even for iWARP. Configuring DCB/PFC for iWARP is identical to RoCE, so the same configuration applies to both.
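If you are not sure which RDMA flavour a given NIC is running, many drivers expose the standardized *NetworkDirectTechnology advanced property; treat this as an optional check, since not every driver exposes the keyword, and the NIC1/NIC2 names are just the example adapter names used later in this post (Mellanox cards also have their own Get/Set-MlnxDriverCoreSetting cmdlets, shown in the comments at the end):

# Check which RDMA technology the driver reports (iWARP, RoCE or RoCEv2); the keyword is optional and not exposed by every driver
Get-NetAdapterAdvancedProperty -Name "NIC1","NIC2" -RegistryKeyword "*NetworkDirectTechnology" -ErrorAction SilentlyContinue | Format-Table Name, DisplayName, DisplayValue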

Switches and vendors covered in this post

Lenovo

NE2572 (CNOS)

Dell

N4000 series
Force 10 S4810p, S6000, S6000-on(FTOS)

Cisco

Nexus NX-OS

Mellanox

SN2100

HPE

FlexFabric 5700/5900

Quanta

LB8

 

How to configure Windows Server 2012, 2012 R2, 2016 and 2019 with RDMA and DCB

For SMB you will need to install the Windows feature Data-Center-Bridging:

Install-WindowsFeature -Name Data-Center-Bridging

Reboot the server and let's configure the DCB settings. SMB always uses priority 3; you can use another priority, but best practice is 3. Cluster heartbeat uses priority 7.

# Create QoS policies for SMB and Cluster traffic
New-NetQosPolicy "SMB" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3
New-NetQosPolicy "Cluster" -PriorityValue8021Action 7

# Turn on Flow Control for SMB and Cluster
Enable-NetQosFlowControl -Priority 3,7

# Make sure flow control is off for other traffic
Disable-NetQosFlowControl -Priority 0,1,2,4,5,6

#Disable DCBX so the host does not accept DCB settings from the switch
Set-NetQosDcbxSetting -Willing $false -Confirm:$false

# Apply a Quality of Service (QoS) policy to the target adapters
Enable-NetAdapterQos -InterfaceAlias "NIC1","NIC2"

# Give SMB Direct a minimum bandwidth of 50%
New-NetQosTrafficClass "SMB" -Priority 3 -BandwidthPercentage 50 -Algorithm ETS

#Give Cluster a minimum bandwidth of 1%
New-NetQosTrafficClass "Cluster" -Priority 7 -BandwidthPercentage 1 -Algorithm ETS

#Disable standard (802.3x) flow control on the physical NICs; PFC is used instead
Set-NetAdapterAdvancedProperty -Name "NIC1" -RegistryKeyword "*FlowControl" -RegistryValue 0
Set-NetAdapterAdvancedProperty -Name "NIC2" -RegistryKeyword "*FlowControl" -RegistryValue 0

#Enable QoS and RDMA on the NICs
Get-NetAdapterQos -Name "NIC1","NIC2" | Enable-NetAdapterQos
Get-NetAdapterRDMA -Name "NIC1","NIC2" | Enable-NetAdapterRDMA
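Before moving on to the team and switch configuration, it can be worth reading the settings back to confirm everything landed as intended (same example NIC names as above):

# Read back the QoS policies, PFC priorities and ETS classes created above
Get-NetQosPolicy
Get-NetQosFlowControl -Priority 3,7
Get-NetQosTrafficClass

# Confirm QoS and RDMA are enabled on the physical NICs
Get-NetAdapterQos -Name "NIC1","NIC2"
Get-NetAdapterRdma -Name "NIC1","NIC2" | Format-Table Name, Enabled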

After the QoS part is done, let's configure a network team or a virtual switch. For S2D you use a SET switch (Switch Embedded Teaming):

New-VMSwitch -Name S2DSwitch -NetAdapterName "NIC1","NIC2" -EnableEmbeddedTeaming $true -AllowManagementOS $false

Let's create some virtual network adapters and enable RDMA on them. Once RDMA is enabled, DCB will also be enabled for SMB.

Add-VMNetworkAdapter -SwitchName S2DSwitch -Name Management -ManagementOS
Add-VMNetworkAdapter -SwitchName S2DSwitch -Name SMB1 -ManagementOS
Add-VMNetworkAdapter -SwitchName S2DSwitch -Name SMB2 -ManagementOS

# Enable RDMA on the virtual network adapters just created
$smbNICs = Get-NetAdapter  -Name *SMB* | Sort-Object
$smbNICs | Enable-NetAdapterRDMA

#Let's find the physical NICs in the team
$physicaladapters = (Get-VMSwitch | Where-Object { $_.SwitchType -Eq "External" }).NetAdapterInterfaceDescriptions | ForEach-Object { Get-NetAdapter -InterfaceDescription $_ | Where-Object { $_.Status -ne "Disconnected" } }

#Map the SMB interfaces to the physical NICs
Set-VMNetworkAdapterTeamMapping -VMNetworkAdapterName $smbNICs[0].Name -ManagementOS -PhysicalNetAdapterName (Get-NetAdapter -InterfaceDescription $physicaladapters[0].InterfaceDescription).Name
Set-VMNetworkAdapterTeamMapping -VMNetworkAdapterName $smbNICs[1].Name -ManagementOS -PhysicalNetAdapterName (Get-NetAdapter -InterfaceDescription $physicaladapters[1].InterfaceDescription).Name
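To confirm that the SMB vNICs really got pinned to the physical team members, a quick read-back (assuming the vNIC and switch names used above):

# Show the ManagementOS vNICs created on the SET switch
Get-VMNetworkAdapter -ManagementOS | Format-Table Name, SwitchName

# Show how each SMB vNIC is mapped to a physical team member
Get-VMNetworkAdapterTeamMapping -ManagementOS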

To check if RDMA is enabled, you can run this command:

Get-SmbClientNetworkInterface | where RdmaCapable -EQ $true | ft FriendlyName
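If you also want to see RDMA actually being used at runtime, SMB Multichannel and the RDMA performance counters are good places to look (counter set names can vary slightly between drivers, so treat this as a sketch):

# SMB Multichannel should list the RDMA-capable interfaces
Get-SmbMultichannelConnection

# Watch RDMA throughput while copying a file over SMB; non-zero values confirm RDMA is in use
Get-Counter -Counter "\RDMA Activity(*)\RDMA Inbound Bytes/sec" -SampleInterval 2 -MaxSamples 3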

Now that DCB and RDMA are configured in Windows, let's move on to the switch setup.

 

This is the hard part: figuring out the correct setup for your switch. Most switch vendors support this.

Lenovo NE2572

Use the default port settings, and enable DCB on the switch in global configuration mode.

cee enable

cee ets priority-group pgid 3 priority 3
cee ets priority-group pgid 3 description "RoCEv2"
cee pfc priority 3 enable
cee pfc priority 3 description "RoCEv2"

cee ets priority-group pgid 7 priority 7
cee ets priority-group pgid 7 description "Cluster"
cee pfc priority 7 enable
cee pfc priority 7 description "Cluster"

cee ets priority-group pgid 0 description "Default"
cee ets priority-group pgid 0 priority 4 5 6

cee ets bandwidth-percentage 0 49 3 50 7 1

Dell N4000 series

Turn off flowcontrol on all interfaces.

Conf t

interface range tengigabitethernet 1/0/13,ten1/0/14,ten1/0/15,ten1/0/16,ten2/0/13,ten2/0/14,ten2/0/15,ten2/0/16

classofservice traffic-class-group 0 1
classofservice traffic-class-group 1 1
classofservice traffic-class-group 2 1
classofservice traffic-class-group 3 0
classofservice traffic-class-group 4 1
classofservice traffic-class-group 5 1
classofservice traffic-class-group 6 1
classofservice traffic-class-group 7 2
traffic-class-group max-bandwidth 49 50 1
traffic-class-group min-bandwidth 49 50 1
traffic-class-group weight 49 50 1

datacenter-bridging
priority-flow-control mode on
priority-flow-control priority 3 no-drop
priority-flow-control priority 7 no-drop
exit
exit

What this does: traffic class 3 is mapped into group 0 and traffic class 7 into group 2, with everything else in group 1, and max/min bandwidth and weight are set on the groups (0, 1 and 2), giving groups 0 and 1 roughly 50% each and group 2 1%. Then the DCB config is enabled on the interfaces with priority-flow-control mode on, and priority 3 no-drop (and priority 7 no-drop) makes those traffic classes lossless.

Dell Force 10 S4810p

Turn off flowcontrol on all interfaces.

dcb enable

dcb-map SMBDIRECT
 priority-group 0 bandwidth 50 pfc on
 priority-group 1 bandwidth 49 pfc off
 priority-group 2 bandwidth 1 pfc on
 priority-pgid 1 1 1 0 1 1 1 2
exit

interface TenGigabitEthernet 1/46
 description
 no ip address
 mtu 12000
 switchport
 spanning-tree pvst edge-port
 dcb-map SMBDIRECT
 no shutdown
exit

Dell Force 10 S6000, S6000-On(FTOS)

Turn off flowcontrol on all interfaces.

conf t

protocol lldp
advertise management-tlv system-capabilities system-description system-name
advertise interface-port-desc

dcb enable

dcb-map RDMA-dcb-map-profile
 priority-group 0 bandwidth 50 pfc on
 priority-group 1 bandwidth 50 pfc off
 priority-group 2 bandwidth 1 pfc on
 priority-pgid 1 1 1 0 1 1 1 2
exit

interface fortyGigE 1/5
description 
no ip address
mtu 9216
portmode hybrid
switchport
dcb-map RDMA-dcb-map-profile
no shutdown
exit

Cisco Nexus NX-OS

By default, PFC (Priority Flow Control) is enabled on Cisco Nexus switches. To explicitly force it on, do the following.

Note: no priority 7 class for the cluster heartbeat is configured here.

configure terminal 
interface ethernet 5/5 
priority-flow-control mode on 

switch(config)# class-map type qos c1
switch(config-cmap-qos)# match cos 3
switch(config-cmap-qos)# exit

switch(config)# policy-map type qos p1
switch(config-pmap-qos)# class type qos c1
switch(config-pmap-c-qos)# set qos-group 3
switch(config-pmap-c-qos)# exit
switch(config-pmap-qos)# exit

switch(config)# class-map type network-qos match-any c1
switch(config-cmap-nqos)# match qos-group 3
switch(config-cmap-nqos)# exit

switch(config)# policy-map type network-qos p1
switch(config-pmap-nqos)# class type network-qos c1
switch(config-pmap-nqos-c)# pause buffer-size 20000 pause-threshold 100 resume-threshold 1000 pfc-cos 3
switch(config-pmap-nqos-c)# exit
switch(config-pmap-nqos)# exit
switch(config)# system qos
switch(config-sys-qos)# service-policy type network-qos p1
exit

Cisco Nexus 3132, NX-OS 6.0(2)U6(1)

By default, PFC (Priority Flow Control) is enabled on Cisco Nexus switches. To explicitly force it on, do the following.

Note: no priority 7 class for the cluster heartbeat is configured here.

#Global settings

class-map type qos match-all RDMA
match cos 3
class-map type queuing RDMA
match qos-group 3
policy-map type qos QOS_MARKING
class RDMA
set qos-group 3
class class-default
policy-map type queuing QOS_QUEUEING
class type queuing RDMA
bandwidth percent 50
class type queuing class-default
bandwidth percent 50
class-map type network-qos RDMA
match qos-group 3
policy-map type network-qos QOS_NETWORK
class type network-qos RDMA
mtu 2240
pause no-drop
class type network-qos class-default
mtu 9216
system qos
service-policy type qos input QOS_MARKING
service-policy type queuing output QOS_QUEUEING
service-policy type network-qos QOS_NETWORK

#Port Specific settings
switchport mode trunk
#Set your vlans on next lines
switchport trunk native vlan 99
switchport trunk allowed vlan 99,2000,2050
spanning-tree port type edge
flowcontrol receive off
flowcontrol send off
no shutdown
priority-flow-control mode on

 

Mellanox SN2100

Note: no priority 7 class for the cluster heartbeat is configured here.

configure terminal
dcb priority-flow-control enable force
dcb priority-flow-control priority 3 enable

interface ethernet 1/1
dcb priority-flow-control mode on

dcb ets tc bandwidth 10 50 40 0

 

HPE FlexFabric 5700/5900 series

Note: no priority 7 class for the cluster heartbeat is configured here.

#Mapping 802.1p priority 3 to local precedence 1 (ETS group 1)
qos map-table dot1p-lp
 import 0 export 0  
 import 1 export 0  
 import 2 export 0  
 import 3 export 1 
 import 4 export 0  
 import 5 export 0  
 import 6 export 0  
 import 7 export 0 
 exit

#ETS configuration giving 50% lossless bandwidth to group 1 (priority 3), which is the default for SMB RDMA
interface ten-gigabitethernet 1/0/1 
 qos trust dot1p
 qos wrr be group 1 byte-count 15
 qos wrr af1 group 1 byte-count 15
 qos wrr af2 group sp
 qos wrr af3 group sp
 qos wrr af4 group sp
 qos wrr ef group sp
 qos wrr cs6 group sp
 qos wrr cs7 group sp

#Turning on PFC on the interfaces
interface ten-gigabitethernet 1/0/1  
 priority-flow-control auto
 priority-flow-control no-drop dot1p 3
 qos trust dot1p

#The next lines are not really needed unless you are really pushing your config and maxing out speeds.

#RoCEv1 QCN congestion config
qcn enable
qcn priority 3 auto
exit
interface Ten-GigabitEthernet1/0/10
 lldp tlv-enable dot1-tlv congestion-notification

#RoCEv2 ECN congestion config
qos wred queue table ROCEv2  
 queue 0 drop-level 0 low-limit 1000 high-limit 18000 discard-probability 25  
 queue 0 drop-level 1 low-limit 1000 high-limit 18000 discard-probability 50  
 queue 0 drop-level 2 low-limit 1000 high-limit 18000 discard-probability 75  
 queue 1 drop-level 0 low-limit 18000 high-limit 37000 discard-probability 1  
 queue 1 drop-level 1 low-limit 18000 high-limit 37000 discard-probability 5  
 queue 1 drop-level 2 low-limit 18000 high-limit 37000 discard-probability 10  
 queue 1 ecn 
exit
interface Ten-GigabitEthernet1/0/10
 qos wred apply ROCEv2

Quanta

This is the basics of how to enable it; I have not had the chance to test this out myself yet, so this will be updated, as the manual is not straightforward.

Note: no priority 7 class for the cluster heartbeat is configured here.

#To make sure PFC is enabled we can check the flow control mode
priority-flow-control mode on/auto (the default is auto, and it is enabled)

#Now we need to set no-drop for priority 3. The default is no-drop on 3,4,5,6
#First we clear all priority
no priority-flow-control priority

#Then we set on only priority 3
priority-flow-control priority 3 no-drop


#Now let's set the ETS queue bandwidth
#to enable
queue ets

#To set the bandwidth between SAN/LAN to 50/50, run
no queue ets weight

#Let's map the SAN group to priority 3
queue ets pg-mapping lan 0 1 2 4 5 6 7
queue ets pg-mapping san 3

#let's configure pfc for interface
interface 1/1
storm-control flowcontrol pfc

 

reference: https://jtpedersen.com/2017/06/rocerdmadcb-what-is-it-and-how-to-configure-it/


37 thoughts on “RoCE/RDMA/DCB what is it and how to configure it”

  1. Just a note, I believe your Dell Force 10 S4810 config is slightly off. You are marking Priority 4 not Priority 3 with the current command of “priority-pgid 1 1 1 1 0 1 1 1”

    1. To be honest I do not know. I have no experience with Juniper; they do say that it supports DCB and PFC, but there is no mention of RoCE, only FCoE. I think you will need to dig deep to find the correct info/config. You could of course ask Juniper, but you would need to do a lot of googling 🙂

      Let me know if you figure it out.

      Oh and remember to turn off DCBX as it’s not supported with S2D.

      JT

      1. Hi,

        We have configured Juniper QFX5100 to work with DCB and PFC, it only supports RoCEv1.
        To support RoCEv2 you need the 17.4 release and at least a QFX5110 model.
        I am working with JTAC on some PFC issues and I will ask them if EX4550 is supported for RoCEv1.

        NS

        1. Thanks for this 🙂

          Do you know how to change the settings in the OS to work with RoCEv1? If it's Mellanox cards, it's a registry key.

          Regards
          Jan-Tore Pedersen

          1. Yes we are using Mellanox cards.

            # The following RoCE modes are supported:
            # •RoCE V1 MAC based (legacy) : 1
            # •RoCE V2 IP based (routable) : 2

            # Check status on Mellanox NIC
            Get-MlnxDriverCoreSetting

            # Set RoceMode
            Set-MlnxDriverCoreSetting -RoceMode 1

          1. Hello

            I'm sorry, but I have no experience with Juniper; it's an OS I have glanced at and found that I'm not touching it 🙂

            As far as I can see it does not support RoCE and RDMA. It supports DCB and DCBX for FCoE and iSCSI, but not RoCE.

            I would recommend contacting Juniper for this.

            Regards
            Jan-Tore Pedersen

  2. Hello JT.
    I wanted to let you know that we found this blog post extremely helpful. I do have a question if you have a moment. We have Cisco Nexus 9000 series switches with NX-OS 7. My network admins said that the values you provided were not allowed. You posted this: pause buffer-size 20000 pause-threshold 100 resume-threshold 1000 pfc-cos 3, but they said the minimum values they could set were these: pause buffer-size 27456 pause-threshold 12480 resume-threshold 12480 pfc-cos 3. Can you help me out here? I want to make sure I have it right. We are setting up a Storage Spaces Direct cluster.
    Thanks
    -Matthew

    1. Thanks for the feedback.

      My guide is a baseline for how to set it up. The OS might change as new versions come out. There is a guide for the Nexus 3132 in the official MS docs and it does not have these settings, as it's a different NX-OS I believe. But what I recommend is using my baseline, doing a Google search for the latest NX-OS, and seeing what they put in there. I will update my post with the guidelines for the NX3132 switch.

      But if the minimum thresholds have changed, I do not see any reason not to use the new values. But always refer to the latest CLI guide for the OS you are running. If you get it to work, let me know and I'll update the blog post.

      JT

      1. To Patrick’s comment, in regards to the Dell Force 10 S4810p you said “turn off flowcontrol for all interfaces”. We have other servers plugged into other interfaces of the switches with flowcontrol enabled (flowcontrol rx on tx on). Will the configuration affect/conflict with these interfaces? The DCB map is not applied to those non-RDMA interfaces.

        -ken

  3. Hello,
    Does the HPE 5700 Support RDMA/ROCE?

    I don’t See IEEE 802.1Qaz Enhanced Transmission Selection (ETS) available on this switch,

    However, ECN, DCB and PFC are available.

    Thanks

    1. I am helping someone with a 5700 right now. I will update once I have been able to look into it. They are having RDMA issues, so I will let you know if it's working or not.

      Unless you have them, get something else. They are not too easy to configure. Some HPE Aruba or Dell S/Z series, Lenovo NE series.

      Regards
      Jan-Tore

      1. The basic config should be similar, but there is very little info out there on the FlexFabric DCB/PFC config. I don't have access to the HPE support site to check for more docs, and I have not gotten the complete config.

        But from what I could see, the specific config on the switch for DCB is this:

        priority-flow-control auto
        priority-flow-control no-drop dot1p 3
        qos trust dot1p

        But I would say that there might be some config missing; I can't confirm, as I don't have a full config example for the 5700. I did find one for the 5940, but I'm still a bit unsure about the HPE setup. I'll go through this guide and see what I figure out.

        http://manualzz.com/doc/32098665/rdma-over-converged-ethernet–roce–design-guide

  4. Hi JT, what a great information shared here. Thanks!

    But I have a question with regards to my setup. I have four 10GbE Mellanox ConnectX-3 Pro ports. Two of the ports are teamed together using SET. I have enabled PFC for group 3 for use by my Live Migration traffic.

    The other 2 ports are not teamed and are used for SMB traffic. I have also created PFC priority 3 for this. On Windows 2016, I have also enabled the same priority 3 / 99% weight, as both ports are solely used for SMB traffic. The problem I'm facing is that when I run Test-RDMA.ps1, it keeps showing me an error that the physical switch needs to be configured for PFC. I am lost and confused. Can you guide me on what I did wrong here? I have disabled VLAN tagging for those ports as well.

    Also, can I have 2 different PFC on same priority group 3 on my switches?

    Thanks in advance for sharing your knowledge.

  5. I’m a little bit confused. I have a Dell S4048-ON switch. I configured DCB on the switch with:

    dcb enable

    dcb-map SMBDIRECT
    priority-group 0 bandwidth 50 pfc on
    priority-group 1 bandwidth 49 pfc off
    priority-group 2 bandwidth 1 pfc on
    priority-pgid 1 1 1 0 1 1 1 2
    exit

    However I found another article about ROCE configuration over here: https://www.fredericstefani.com/configure-dell-s4048-switches-for-storage-spaces-direct/

    Indicating that you need to have the opposite configuration:

    dcb enable
    dcb-buffer-threshold RDMA
    priority 3 buffer-size 100 pause-threshold 50 resume-offset 35
    exit
    dcb-map RDMA
    priority-group 0 bandwidth 50 pfc off
    priority-group 3 bandwidth 50 pfc on
    priority-pgid 0 0 0 3 0 0 0 0
    exit

    Under the S2D node interface configure:
    dcb-policy buffer-threshold RDMA
    dcb-map RDMA

    I think in his configuration he forgot the PFC for the cluster heartbeats.

  6. I think you made a mistake on the priority-group 7 bandwidth 1 pfc on command, because you are now actually pausing the frames of the cluster heartbeats.

  7. Thanks for your article. Helped me a lot.

    My Dell N4000 has no traffic-class-group 7:
    “Value is out of range. The valid range is 0 to 6”

    As well as when configuring flow-control priority 7
    “Valid no-drop priorities are between 0 and 6”

    I probably have an outdated firmware but is this a major problem?

  8. Great article! It’s super helpful. I was looking at your powershell commands where you set up the priorities, and I noticed that you said that Cluster traffic uses priority 5, but in your PowerShell commands, you set it for priority 7. Is there a reason for this, or is it just a typo?

    Bryan

  9. Jan, I have been trying to reach you directly. I have some of my team unconvinced about Tip #10 (30 tips in 30 min). My team says DCB is useless for iWARP, since iWARP runs over TCP. Can you speak to why DCB would be recommended for iWARP?
    I presume it's as a fallback, but if you could shed any light on it, I would be glad to be part of that conversation with my team. Email me please if possible. I work for an S2D support team.
    Thank you, Louis

    1. Hello Louis

      It comes down to how TCP behaves when packets are lost: it retransmits. The usual cause of a lost packet is that the link is full, basically trying to send more data than the network can handle. TCP handles this with retransmits, but at some point losses just keep increasing until the system can recover; there is no flow control to limit how much data is being sent. With Microsoft also recommending reserving 1-2% for cluster traffic, DCB helps in a high-performance solution by telling the sender to lower the bandwidth being sent, instead of losing packets and generating retransmits that congest the network and drag performance down to a halt.

      For low-performance systems it's fine without it. But say you have a cluster averaging 600k to 1 million IOPS and you reboot a host: there will be a lot of normal IO and a lot of rebuild activity that could cause congestion on 25 Gbit links. So we do recommend it, and even Microsoft says it's recommended for high-performance iWARP clusters.
