There are many reasons an ExaNIC can drop frames, some more likely than others. As a rule, they fall into two categories:
- Software drops
- Hardware drops
The software drops category is out of scope for this article, so let's look at hardware drops. These occur when there is insufficient bandwidth from the NIC to the CPU. So, why might there be insufficient bandwidth from the NIC to the CPU? It could be any of the following:
- The PCIe slot is of the wrong generation (e.g. the card is plugged into a Gen 2 slot)
- The PCIe slot is of the wrong width (e.g. there is only x1 available instead of the required x8)
- The PCIe slot is not attached to a root port
- The card is plugged into a PCIe slot on the wrong NUMA node
- Power-saving settings are getting in the way, causing weird latency spikes
- On a card with many ports, e.g. an X40, it is possible for the combined ingress rate across all ports to exceed the available PCIe bandwidth
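Several of the checks above (link generation, link width, NUMA node, and aggregate ingress versus PCIe bandwidth) can be scripted. The following is a minimal sketch, assuming a Linux host that exposes the standard PCI sysfs attributes. The function names `read_pcie_link`, `link_gbps` and `diagnose` are illustrative and not part of any ExaNIC tool. The bandwidth figure accounts for line encoding only, so achievable throughput is lower once PCIe packet overhead is included.

```python
from pathlib import Path


def read_pcie_link(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Read negotiated and maximum link parameters for one PCI device."""
    dev = Path(sysfs_root) / bdf

    def read(name):
        return (dev / name).read_text().strip()

    def speed(name):
        # The file holds e.g. "8.0 GT/s PCIe" or "8 GT/s", depending on kernel.
        return float(read(name).split()[0])

    return {
        "speed_gt": speed("current_link_speed"),
        "width": int(read("current_link_width")),
        "max_speed_gt": speed("max_link_speed"),   # device capability
        "max_width": int(read("max_link_width")),  # device capability
        "numa_node": int(read("numa_node")),       # -1 if unknown
    }


def link_gbps(speed_gt, width):
    """Link bandwidth in Gbit/s after line encoding, before packet overhead."""
    # Gen 1/2 (2.5 and 5 GT/s) use 8b/10b encoding; Gen 3 and later use 128b/130b.
    encoding = 8 / 10 if speed_gt < 8 else 128 / 130
    return speed_gt * encoding * width


def diagnose(link, ingress_gbps, app_numa_node=None):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    if link["speed_gt"] < link["max_speed_gt"]:
        problems.append("link speed is below what the card supports")
    if link["width"] < link["max_width"]:
        problems.append("link width is below what the card supports")
    if app_numa_node is not None and link["numa_node"] not in (-1, app_numa_node):
        problems.append("card is on a different NUMA node from the application")
    available = link_gbps(link["speed_gt"], link["width"])
    if available < ingress_gbps:
        problems.append(
            f"ingress {ingress_gbps:.1f} Gbit/s exceeds link {available:.1f} Gbit/s"
        )
    return problems
```

For example, `diagnose(read_pcie_link("0000:03:00.0"), ingress_gbps=80.0)` would check a card at the hypothetical address `0000:03:00.0` against eight ports receiving at 10 Gbit/s each. The actual address can be found with `lspci`.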
These are all covered in the benchmarking guide, but there is a class of issues rarely, if ever, encountered on benchmark-only machines: memory bandwidth.
To be clear, the NIC does not touch main memory - the RX and TX regions live on the card itself. However, it does contend for L3 cache slots when sending frames to the host, which means the host will have to write evicted lines back out to main memory - causing memory bandwidth pressure. It may also be the case that there is L3 cache pressure but no memory bandwidth pressure, for example with multiple cards receiving at line rate. In this case, you'll see packet drops even though there is enough PCIe bandwidth for all of the cards.
To determine whether you're running into this, shut down non-essential applications (which may be consuming memory bandwidth) and turn off other ports. If hardware drops go away when you do this, you may have an L3 cache or memory bandwidth issue.
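The before-and-after comparison can be made less subjective by measuring the drop rate in each state. This is a rough sketch, not a definitive procedure: `read_drops` is a hypothetical callable that returns the current hardware drop counter for the port, and how that counter is obtained is left to the caller. The 10% threshold is likewise an arbitrary assumption.

```python
import time


def drop_rate(read_drops, interval_s=10.0, sleep=time.sleep):
    """Return drops per second over one sampling interval.

    read_drops is a caller-supplied function returning a cumulative counter.
    """
    before = read_drops()
    sleep(interval_s)
    after = read_drops()
    return (after - before) / interval_s


def contention_suspected(loaded_rate, quiet_rate, ratio=0.1):
    """True if drops mostly vanish once other load is removed.

    loaded_rate: drops/s with all applications and ports active.
    quiet_rate:  drops/s with non-essential load shut down.
    ratio:       assumed cut-off; quiet_rate must fall to this fraction or less.
    """
    if loaded_rate <= 0:
        return False  # nothing was being dropped in the first place
    return quiet_rate <= loaded_rate * ratio
```

Measure `drop_rate` once under normal load and once with the other applications and ports shut down, under the same incoming traffic, then pass both values to `contention_suspected`.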
It is possible to fix this using what Intel calls "Cache QoS Enforcement", the setup for which is outside the scope of this article.