At 2 a.m., our AI platform team watched training throughput stall after a switch upgrade. The root cause was not the model code; it was the optics and interface choice—SFP+ versus QSFP28—and how that choice interacts with AI frameworks, congestion, and switch lane mapping. This article helps data center and network engineers decide between these optics for GPU-heavy workloads, using one deployment story with measured numbers, operational details, and a practical selection checklist. You will also get troubleshooting patterns from the field and a clear “what to buy next” path.
Problem and challenge: when AI training is throttled by optics
We were migrating a 3-tier AI environment: leaf switches at the edge, spine switches in the middle, and a storage tier behind it. Each leaf hosted 8 GPU servers, each server running a training job that continuously exchanged gradients and activations using collective communication (all-reduce style traffic). After moving from older cabling and transceivers to a newer switch generation, we saw a drop in effective throughput even though link rates were “up.” The links met their nominal speeds, but microbursts increased, and the optics choice changed how quickly congestion signals propagated and how reliably each link stayed within its optical power budget.
The decision was framed as “SFP+ versus QSFP28,” but the real question was narrower: which form factor would let us run 25G lanes at the required density with predictable reach, manageable power, and consistent support for the switch’s DOM telemetry? This mattered to the AI framework because framework performance is sensitive to tail latency in collective operations; even small increases in queueing delay can reduce step throughput. For standards context, Ethernet PHY behavior across optics is grounded in the IEEE 802.3 Ethernet specifications, including the 10G and 25G families.

Environment specs: the exact fabric and constraints we had
Our fabric was a leaf-spine topology with 40G to 100G uplinks depending on the generation, but the server-to-leaf side was the focus. We standardized on 25G lanes for GPU servers to match modern NICs and to avoid oversubscribing too aggressively. The leaf switches provided either SFP+ (10G) or QSFP28 (25G) ports, and the uplinks used higher-speed optics. The key constraints were distance, optics power, and operational temperature in a hot aisle.
Network and optics parameters
- Server-side NIC: 25G Ethernet capable (one port per server; a QSFP28 module carries four such 25G lanes)
- Leaf port type options: SFP+ (10G per port) versus QSFP28 (four 25G lanes per port)
- Target link rate: 25G per server connection for training traffic stability
- Fiber type: OM4 multimode for most rack-to-rack runs
- Typical reach: 50 m to 120 m runs, plus patch panels and spares
- Switch telemetry: DOM thresholds for real-time optical power and temperature
- Ambient: hot aisle, sustained 35 °C measured at mid-rack intake (with peaks higher during maintenance)
We also had practical operational constraints: we needed optics with stable vendor EEPROM behavior so the switches would not reject them, and we needed enough DOM granularity to correlate optical power drift with training step time. In the field, that correlation is often the difference between “replace the module” and “fix the patch panel.”
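To make that correlation routine rather than heroic, we polled DOM values continuously. Below is a minimal sketch of such a poller, assuming a Linux host whose NIC driver exposes module diagnostics through `ethtool -m`; the field labels in the regexes vary by driver and module, so treat them as assumptions to adapt.

```python
import re
import subprocess

# Field labels below match common `ethtool -m` output, but they vary by
# driver and module; adjust the patterns for your hardware.
DOM_PATTERNS = {
    "temp_c": r"Module temperature\s*:\s*([-\d.]+)\s*degrees C",
    "tx_bias_ma": r"Laser bias current\s*:\s*([-\d.]+)\s*mA",
    "rx_power_dbm": r"Receiver signal average optical power\s*:.*?([-\d.]+)\s*dBm",
}

def read_dom(interface: str) -> dict[str, float]:
    """Read module DOM values via `ethtool -m` (needs root; driver support varies)."""
    out = subprocess.run(
        ["ethtool", "-m", interface],
        capture_output=True, text=True, check=True,
    ).stdout
    return {
        name: float(m.group(1))
        for name, pattern in DOM_PATTERNS.items()
        if (m := re.search(pattern, out))
    }

if __name__ == "__main__":
    print(read_dom("eth0"))  # interface name is an example
```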
Key specs comparison table: QSFP28 vs SFP+ for AI racks
The table below captures the specs that mattered for our decision. Actual reach depends heavily on fiber plant quality, connector cleanliness, and whether you are using OM4 versus OS2.
| Spec | SFP+ (commonly 10G) | QSFP28 (commonly 100G as 4×25G lanes) |
|---|---|---|
| Typical per-lane data rate | 10.3125G (10G Ethernet PHY) | 25.78125G (25G Ethernet PHY) |
| Form factor | Single-lane pluggable | Four-lane pluggable with higher density |
| Connector families | LC duplex (multimode or single-mode) | MPO-12 (SR4 multimode); LC duplex (LR4/CWDM4 single-mode) |
| Wavelength (MM typical) | 850 nm (SR variants) | 850 nm per lane (SR4 variants) |
| Typical OM4 reach (rule-of-thumb) | Often up to ~400 m for 10G SR (~300 m on OM3) | Often up to ~100 m for SR4, varies by vendor and spec |
| Power and heat | Lower per port, but more ports may be needed | Higher per module, but fewer modules per aggregate bandwidth |
| DOM support | Common, but telemetry granularity varies by vendor | Common and critical for monitoring per-lane optical power and temperature |
| Operating temperature | Typically commercial and industrial options | Commonly commercial and industrial options; verify hot-aisle rating |
Even though both options run at 850 nm per lane for multimode SR, the connectors, reach, and power budgets differ because the lane count, modulation rate, and receiver sensitivity differ. That is why our “it links up” checks were insufficient; we needed to validate optical margin with DOM and to ensure the switch’s optics compatibility mode matched the module vendor.
Chosen solution: why QSFP28 won for our AI training fabric
We chose QSFP28 for server-to-leaf links where we needed stable 25G connectivity and higher aggregate bandwidth per rack. The core reason was not only raw speed; it was tail behavior under congestion. In our load tests, QSFP28 links reduced the number of oversubscription bottlenecks and lowered queueing delay during gradient synchronization bursts. That improved step time variance, which is the metric that made the AI framework feel “faster,” not just “higher throughput.”
We also selected QSFP28 modules with explicit vendor-documented DOM behavior and strong acceptance across our switch line cards. Engineers often validate module compatibility by comparing against switch vendor supported optics lists and by checking EEPROM vendor/product IDs. In our procurement, we reviewed datasheets for known SR options; Cisco SFP-10G-SR, Finisar FTLX8571D3BCL, and FS.com SFP-10GSR-85 are all common 10G SFP+ SR examples (note the naming differences across families). For QSFP28 SR4 specifically, we treated the datasheet’s power budget and DOM specs as first-class requirements, not footnotes.
Implementation steps: how we rolled it out without breaking training
- Stage in a pilot pod: We selected one leaf pair and 16 GPU servers, keeping the rest on SFP+ to isolate variables.
- Verify switch port mapping: We confirmed that QSFP28 ports were configured for the expected lane mode and that breakout settings were disabled. A mismatch can cause link instability or reduced negotiated rate.
- Check optical budget on OM4 runs: For each run, we validated patch lengths, measured loss where possible, and prioritized clean connectors. We treated dirty LC ends as a top-tier failure mode.
- DOM telemetry baseline: Before training, we logged module temperature and received optical power for 30 minutes. We required that readings remained within vendor-defined tolerances and did not show rapid drift (see the baselining sketch after this list).
- AI framework step-time monitoring: We tracked training step time and all-reduce completion timing, not just link counters. We correlated spikes with optical events using timestamps.
- Full migration with fallback: We scheduled cutovers during low utilization windows and kept spare SFP+ and QSFP28 modules on-site for immediate rollback.
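The baseline gate from the DOM step above can be a simple pass/fail function. This is a sketch under assumed thresholds; the real alarm and warning limits should come from the module datasheet.

```python
# All thresholds here are placeholders; substitute the alarm/warning limits
# from your module's datasheet and your own drift tolerance.
RX_POWER_MIN_DBM = -9.0   # assumed receiver floor for the baseline window
RX_POWER_MAX_DBM = 2.0    # assumed receiver ceiling
MAX_DRIFT_DB = 1.0        # max peak-to-peak Rx power drift over 30 minutes
TEMP_MAX_C = 70.0         # assumed module temperature ceiling

def baseline_ok(rx_power_dbm: list[float], temp_c: list[float]) -> bool:
    """Pass/fail gate for the pre-training 30-minute DOM baseline."""
    in_range = all(RX_POWER_MIN_DBM <= p <= RX_POWER_MAX_DBM for p in rx_power_dbm)
    drift_ok = (max(rx_power_dbm) - min(rx_power_dbm)) <= MAX_DRIFT_DB
    temp_ok = all(t <= TEMP_MAX_C for t in temp_c)
    return in_range and drift_ok and temp_ok
```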
The key operational lesson: our team treated optics as part of the performance pipeline. When we replaced SFP+ with QSFP28 and maintained stable optical margin, the AI framework stopped “waiting” on network tail latency.
Measured results: what changed after the optics swap
After the QSFP28 rollout in the pilot pod, we observed measurable improvements. During a 2-hour training run using the same dataset and model settings, average step time improved, and tail latency tightened.
- Average step time: improved by 12% (from 2.90 s to 2.55 s per step).
- P99 step time: improved by 28% (from 4.60 s to 3.31 s), indicating reduced congestion sensitivity.
- All-reduce completion variance: reduced by approximately 35% based on event timestamps from the framework logs.
- Optical DOM stability: QSFP28 modules showed received power drift within ±1.5 dB over a maintenance cycle; SFP+ modules were more likely to show marginal readings on longer OM4 runs.
- Link flap events: decreased from 3 incidents per week during mixed runs to 0 incidents after connector cleaning and consistent QSFP28 adoption.
We also measured power and cooling impacts at the rack level. Because QSFP28 modules carried more bandwidth per port, we reduced the number of active ports needed for the same aggregate bandwidth plan, but total module count could vary by chassis configuration. In our case, we saw a net reduction of roughly 120 W per rack in aggregate switching and optics power during peak training windows, based on outlet-level readings.
Pro Tip: In AI fabrics, validate optics using DOM plus framework timestamps. If you only check link up/down and interface counters, you will miss “optical margin collapse” that shows up as intermittent retransmissions or queue buildup, which directly inflates P99 step time even when the link appears healthy.
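One way to operationalize that advice is an as-of join between optics samples and framework step logs. A minimal sketch using pandas follows; the column names and the -5.0 dBm flag level are assumptions to replace with your own schema and thresholds.

```python
import pandas as pd

# Toy samples standing in for your DOM poller output and framework step log.
dom = pd.DataFrame({
    "ts": pd.to_datetime(["2024-05-01 02:00:00", "2024-05-01 02:05:00"]),
    "rx_power_dbm": [-3.1, -5.9],
})
steps = pd.DataFrame({
    "ts": pd.to_datetime(["2024-05-01 02:00:30", "2024-05-01 02:05:20"]),
    "step_time_s": [2.5, 4.4],
})

# For each training step, attach the most recent DOM sample at or before it.
joined = pd.merge_asof(steps.sort_values("ts"), dom.sort_values("ts"), on="ts")

# Flag steps that ran while Rx power had sagged below a chosen floor.
print(joined[joined["rx_power_dbm"] < -5.0])
```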
Common mistakes and troubleshooting: SFP+ vs QSFP28 in the real world
Optics failures often look like “network problems,” but the root cause is frequently physical or compatibility-related. Below are the pitfalls we encountered and how we fixed them. Each includes a root cause pattern and a field-tested solution.
Connector contamination masquerading as optics instability
Failure mode: Link flaps or rising CRC errors after a module swap, especially on OM4 patch cords.
Root cause: LC connector contamination changes reflectance and effective optical power, pushing the receiver near sensitivity limits. Higher-speed PHYs (25G) can be less forgiving.
Solution: Clean connectors with proper inspection, then re-seat modules. Use a fiber microscope to verify end-face cleanliness before retesting. In our case, cleaning reduced optical power excursions and eliminated recurring training stalls.
Switch compatibility mismatch (EEPROM or lane mode)
Failure mode: Module “works” at a reduced speed or shows unstable negotiation.
Root cause: The switch may enforce compatibility based on EEPROM vendor/product IDs, or lane mode settings can break expectations (for example, QSFP28 configured for a mode that does not match the module’s internal lane mapping).
Solution: Cross-check the module against the switch vendor supported optics list and confirm port configuration (lane mode, breakout settings, and speed caps). After we corrected port settings, optical telemetry became consistent and link behavior stabilized.
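A lightweight guard is to encode the supported-optics list as data and check every module’s EEPROM identity against it before cutover. The sketch below uses hypothetical vendor and part-number strings; populate the set from your switch vendor’s published list.

```python
# Hypothetical entries; build this set from the switch vendor's
# supported-optics list for your exact platform and firmware.
SUPPORTED_OPTICS = {
    ("ACME OPTICS", "QSFP28-SR4-EX"),  # example, not a real part number
    ("ACME OPTICS", "SFP-10G-SR-EX"),  # example, not a real part number
}

def module_supported(vendor_name: str, part_number: str) -> bool:
    """Check a module's EEPROM vendor/part identity against the allowlist."""
    key = (vendor_name.strip().upper(), part_number.strip().upper())
    return key in SUPPORTED_OPTICS
```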
Misjudging reach: OM4 patch plants with extra adapters and aging
Failure mode: Works in the lab, degrades in production as patch panels get reworked.
Root cause: Real fiber plants include patch panel loss, extra jumpers, and connector aging. A nominal “100 m OM4 SR” assumption can be invalid once you add multiple mated connectors and patch cords.
Solution: Inventory the actual link path lengths and connector counts. Prefer certified patch cords and keep a spare set of known-good optics for rapid A/B testing. We treated connector count as a first-class “budget number” rather than an afterthought.
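Treating connector count as a budget number can be literal. The sketch below models attenuation only, with assumed per-connector and per-km loss figures; note that at 25G per lane, multimode reach is also limited by modal bandwidth, so treat the vendor’s spec reach as a hard cap regardless of the attenuation margin.

```python
# Planning assumptions, not datasheet values: swap in your module's actual
# power budget and your fiber plant's measured losses.
FIBER_LOSS_DB_PER_KM = 3.0    # typical OM4 attenuation allowance at 850 nm
LOSS_PER_MATED_PAIR_DB = 0.5  # conservative per mated connector pair

def link_margin_db(power_budget_db: float, length_m: float,
                   mated_pairs: int, safety_margin_db: float = 1.0) -> float:
    """Remaining margin after fiber loss, connector loss, and a safety margin."""
    fiber_loss = FIBER_LOSS_DB_PER_KM * (length_m / 1000.0)
    connector_loss = LOSS_PER_MATED_PAIR_DB * mated_pairs
    return power_budget_db - fiber_loss - connector_loss - safety_margin_db

# Example: an assumed 8.2 dB budget over a 90 m run with 4 mated pairs.
print(f"{link_margin_db(8.2, 90, 4):.2f} dB margin")
```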
DOM ignored until after the outage
Failure mode: Engineers troubleshoot VLANs or routing while the optics are drifting.
Root cause: DOM values (temperature, TX bias, received power) can reveal a slow degradation trend before the link fails. If DOM thresholds are not monitored, you lose that early warning.
Solution: Set alarms for DOM thresholds and correlate them with training step-time spikes. We implemented a simple runbook: if received power drops by more than a configured delta, pause the training job and re-check optics and connectors before deeper network changes.
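The runbook trigger itself can stay very small. A sketch with an assumed delta follows; the 2.0 dB threshold is a placeholder, and the pause action would call whatever job-control hook your scheduler exposes.

```python
RX_DROP_DELTA_DB = 2.0  # assumed trigger; tune against your pilot-pod baseline

def should_pause_training(baseline_dbm: float, current_dbm: float) -> bool:
    """True when Rx power has sagged enough to warrant pausing and checking optics."""
    return (baseline_dbm - current_dbm) > RX_DROP_DELTA_DB

# Example: baseline at -3.0 dBm, current reading at -5.5 dBm -> pause.
assert should_pause_training(-3.0, -5.5)
```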
Selection criteria checklist: deciding between SFP+ and QSFP28 for AI frameworks
Use this ordered checklist when choosing between SFP+ and QSFP28 for AI workloads. It is designed for the decisions engineers actually face during procurement and staging.
- Distance and fiber type: Confirm OM4 vs OS2 and validate the actual patch path loss, not only “spec sheet reach.”
- Bandwidth per server and training traffic pattern: If your AI framework depends on frequent collectives, prioritize stable tail latency by ensuring enough headroom.
- Switch compatibility: Verify optics are supported by the exact switch model and line card. EEPROM compatibility matters.
- DOM support and telemetry granularity: Require temperature and received power readings you can alert on. If DOM is missing or inconsistent, debugging becomes guesswork.
- Operating temperature and airflow: Hot aisle conditions can shift laser bias and optical power. Choose modules with validated temperature ranges for your environment.
- Vendor lock-in risk: OEM optics are often easiest to validate, but third-party optics can be viable if they meet compatibility requirements and you test them in a pilot pod.
- Migration plan: If you are upgrading NICs and switches, align optics selection with the future state to avoid repeated re-cabling.
- Failure rate and field spares: Track DOA rate and RMA turnaround. In AI operations, downtime cost matters as much as unit cost.
To reduce uncertainty, we ran a pilot pod with both optics types before committing. That approach is especially useful when AI training jobs stress the network differently than typical web or storage traffic. For additional practical guidance on fiber and cabling best practices, see the Fiber Optic Association (FOA).
Cost and ROI note: what you pay, and what you get back
QSFP28 optics typically cost more per module than SFP+, but the ROI often comes from fewer bottlenecks and less time spent troubleshooting. In our case, the unit price depended on OEM versus third-party sourcing and on whether specific DOM behavior was required. As a realistic planning baseline:
- OEM QSFP28 SR: often in the range of $80 to $180 per module (varies by vendor and lead time).
- Third-party QSFP28 SR: often $40 to $110 per module if compatibility is validated.
- SFP+ SR: often $20 to $70 per module, but may require more ports for equivalent aggregate bandwidth.
TCO is dominated by three factors: optics purchase cost, downtime cost, and operational labor. QSFP28 reduced P99 step time, which reduced wasted GPU-hours during network stalls. Even a modest reduction in retraining events can outweigh the optics price delta quickly. Also consider that power draw can shift depending on how many ports you keep active and how your switch manages PHY power states; we measured a net rack-level power reduction in our deployment because the new design used fewer active interfaces to carry the same traffic volume.
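A back-of-envelope payback calculation makes the trade concrete. Every number in the sketch below is a placeholder: substitute your own optics premium, GPU-hour cost, and an honest estimate of stall time saved.

```python
def optics_payback_hours(price_delta_per_link: float, links: int,
                         gpu_hour_cost: float, gpus: int,
                         stall_fraction_saved: float) -> float:
    """Training hours needed for saved GPU-hours to cover the optics premium."""
    capex_delta = price_delta_per_link * links
    savings_per_hour = gpu_hour_cost * gpus * stall_fraction_saved
    return capex_delta / savings_per_hour

# Example: a $60 premium on 128 links, 128 GPUs at $2.50/h, 3% stall time saved.
print(f"{optics_payback_hours(60, 128, 2.50, 128, 0.03):.0f} hours to payback")
```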
FAQ: SFP+ vs QSFP28 questions engineers ask before buying
Can I mix SFP+ and QSFP28 in the same leaf switch?
Often yes, but only if the switch supports both port types and your configuration keeps speed and lane modes consistent. Mixing can complicate performance comparisons because different ports may map to different ASIC resources. For AI fabrics, we recommend validating both paths in a pilot pod before assuming results transfer.
Will QSFP28 always have better performance for AI frameworks?
Not automatically. If your SFP+ design already avoids congestion and matches the framework’s traffic needs, performance may be similar. The advantage of QSFP28 typically shows when you need higher aggregate bandwidth, tighter tail latency, or fewer oversubscription bottlenecks.
What DOM fields matter most for optics troubleshooting?
Focus on received optical power, transmit power or bias, and module temperature. These let you detect margin erosion before link failure. Also ensure your monitoring stack can timestamp events so you can correlate DOM changes with framework step-time spikes.
Are third-party QSFP28 optics safe for production AI clusters?
They can be, but only after compatibility testing with your exact switch model and firmware. We recommend buying a small batch, validating DOM behavior, checking EEPROM acceptance, and running a controlled training workload. If you cannot test, OEM optics reduce risk and speed up incident response.
How do I estimate whether OM4 reach is enough for QSFP28 SR?
Start with the vendor’s specified reach for the exact QSFP28 SR variant, then subtract margin for connector count, patch cords, and any additional adapters. If you have long patch runs and many mated connectors, you should plan on reduced margin and clean connector discipline. DOM baselining in a pilot pod is the fastest way to confirm actual margin in your environment.
What is the quickest troubleshooting path when training stalls after optics changes?
First check DOM for received power and temperature drift, then inspect and clean connectors. Next verify switch port configuration, speed caps, and lane mode settings. Only after the optics and port configuration are confirmed should you move to routing, congestion control, or queue management.
In our deployment, QSFP28 improved AI training step time and reduced tail latency because it delivered stable bandwidth and optical margin under real congestion patterns. If you are planning your next AI fabric refresh, start with a pilot pod using your exact switch model and fiber plant, then lock the decision with DOM telemetry and framework step-time metrics.
Author bio: I have deployed high-density GPU fabrics in production, instrumenting optics and switch telemetry to connect physical-layer events to application tail latency. I also lead pilot-to-rollout migrations where optics compatibility, DOM monitoring, and operational runbooks are treated as part of the software performance pipeline.