In a live leaf-spine deployment, “link up but no traffic” is often blamed on optics, yet the root cause can be power delivery, DOM misreads, or fiber polarity errors. This article is for data center network and facilities engineers troubleshooting 800G links during migrations or high-density expansions. You will get a field-style case study, a decision checklist, and practical failure-mode fixes grounded in real rack and link constraints.
Problem / challenge: 800G links appear healthy but throughput collapses

We hit a classic failure pattern while rolling out a new spine pair to support 800G east-west traffic. The transceiver LEDs indicated link, the switch ports reported “up,” and link training completed without alarms, but application-layer throughput dropped to 5% of expected. Within minutes, multiple ports on both spines showed intermittent CRC/FEC-related drops, and the autoscaler triggered workload rescheduling. The team had to decide quickly whether to swap optics, reseat cables, or investigate power and thermal margins.
The environment was a two-tier leaf-spine topology: 48-port 800G-capable ToR switches feeding dual spine switches, with 64 uplinks per spine during the rollout window. Each 800G port used direct attach copper or short-reach fiber uplinks depending on the row; no coherent optics were needed at these distances. For this case, the affected links were short-reach optics inside the same room, routed through overhead trays and high-density patch panels. We also observed that the failures clustered by rack row rather than by specific switch SKU, suggesting a physical layer or power delivery coupling issue.
Environment specs: what mattered for 800G signal integrity
Before changing anything, we documented the exact electrical and optical constraints because 800G troubleshooting is rarely solved by “reseat and pray.” The switch platform used 800G ports that accept QSFP-DD style pluggables with vendor-specific firmware behavior. For the affected links, the optics were short-reach 850 nm class modules and the patching used 12-fiber MPO/MTP style trunks with a polarity strategy aligned to the vendor’s recommended lane mapping.
Cooling and power were also treated as first-class variables. The spines were under a hot-aisle containment retrofit, and rack intake temperatures were running near the upper end of the recommended band during peak load. The PDUs feeding the spine row were on separate breakers, but module supply rails can still be sensitive to transient droop during fan or CPU load steps. We logged inlet temperatures, fan RPM, and PDU outlet voltage during the failure window.
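For reference, here is a minimal sketch of the timestamped logging loop we used; the three reader functions are hypothetical stand-ins for whatever SNMP, Redfish, or vendor API your equipment exposes.

```python
import csv
import time
from datetime import datetime, timezone

def read_inlet_temp_c() -> float:
    return 27.5   # placeholder: replace with your SNMP/Redfish/vendor API call

def read_fan_rpm() -> int:
    return 9800   # placeholder

def read_pdu_outlet_voltage() -> float:
    return 207.9  # placeholder

def log_environment(path: str, samples: int = 6, interval_s: float = 10.0) -> None:
    """Append timestamped inlet temp, fan RPM, and PDU voltage to a CSV so
    they can later be correlated with port error counters."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        for _ in range(samples):
            writer.writerow([
                datetime.now(timezone.utc).isoformat(),
                read_inlet_temp_c(),
                read_fan_rpm(),
                read_pdu_outlet_voltage(),
            ])
            f.flush()
            time.sleep(interval_s)

log_environment("spine_row_env.csv")
```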
| Parameter | Typical 800G SR Optics Case | Why It Affects Troubleshooting |
|---|---|---|
| Nominal wavelength | 850 nm (short-reach) | Determines fiber type compatibility and attenuation profile |
| Connector / cable type | MPO/MTP trunks, 8 lanes per direction (MPO-16 typical for native SR8; 12-fiber trunks in breakout or legacy plant) | Polarity and lane mapping errors cause CRC/FEC spikes |
| Reach class | ~100 m class for OM4/OM5 with correct budget | Margin shrinks with patch panel losses and dirty connectors |
| DOM support | Digital optical monitoring with threshold alarms | Bad DOM readouts can mask failing optics or temperature drift |
| Operating temperature | Vendor specified module range (check datasheet) | Thermal stress can degrade laser bias and raise BER |
| Transceiver power | Module and host power budget within platform limits | Power droop can destabilize high-speed SerDes training |
For standards context, the physical layer behavior you see as CRC and FEC error patterns maps to Ethernet PHY error handling defined in IEEE 802.3 families for high-speed links, while optical module monitoring is governed by vendor implementation details and common transceiver management practices. See [Source: IEEE 802.3] and vendor transceiver datasheets for the exact DOM thresholds and optical/electrical requirements. For fiber cabling practices and polarity, align to [Source: ANSI/TIA-568] and manufacturer MPO polarity guidance.
Chosen solution & why: treat it as a system problem, not a swap problem
We used a structured approach: verify optics health via DOM, validate fiber polarity and cleanliness, then check power delivery and thermal coupling. A key decision was to avoid mass swapping 800G optics immediately. Instead, we triaged the highest-error ports first to identify whether the dominant symptom was optical power imbalance, lane mapping faults, or host-side instability.
DOM and telemetry triage on the failing ports
We pulled DOM telemetry from the switch (laser bias current, received optical power, temperature, and alarm flags). Multiple failing ports had received power values near the lower threshold and reported elevated temperature relative to healthy peers. On one spine, two ports showed inconsistent DOM readings across refresh cycles, which can happen when a module is marginally seated or the host cage contacts are oxidized. We reseated optics with a controlled procedure, then rechecked DOM and error counters.
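As a sketch of that comparison, assuming DOM values have already been scraped into a dictionary (port names, readings, and thresholds here are illustrative, not a vendor API):

```python
from statistics import median

# Example DOM snapshot per port; values are illustrative, not from a real device.
dom = {
    "Ethernet1/1": {"rx_power_dbm": -6.8, "temp_c": 52.0, "bias_ma": 7.1},
    "Ethernet1/2": {"rx_power_dbm": -2.1, "temp_c": 41.5, "bias_ma": 6.9},
    "Ethernet1/3": {"rx_power_dbm": -2.3, "temp_c": 42.0, "bias_ma": 7.0},
}

RX_LOW_WARN_DBM = -5.0   # assumed low-power warning threshold; use the datasheet value

def triage(dom_snapshot):
    """Flag ports whose rx power sits near the low threshold or whose
    temperature is well above the healthy-peer median."""
    temp_median = median(v["temp_c"] for v in dom_snapshot.values())
    suspects = []
    for port, v in dom_snapshot.items():
        reasons = []
        if v["rx_power_dbm"] <= RX_LOW_WARN_DBM:
            reasons.append("rx power near low threshold")
        if v["temp_c"] >= temp_median + 8.0:   # 8 C above peers: arbitrary triage cutoff
            reasons.append("hot relative to peers")
        if reasons:
            suspects.append((port, reasons))
    return suspects

for port, reasons in triage(dom):
    print(port, "->", ", ".join(reasons))
```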
Polarity and lane mapping verification at the MPO/MTP interface
Next, we verified polarity by mapping the MPO/MTP trunk polarity method used in the patch panels. A mismatch between “A-to-B” assumptions and the actual cassette polarity can produce deterministic lane swaps that manifest as CRC bursts and FEC recovery churn. We confirmed the cassette orientation against the vendor’s polarity diagram and validated that each patch cord used the correct keyed orientation.
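To make the check concrete, here is a small sketch of polarity validation against the TIA-568 MPO methods; the as-built mapping would come from a light-source probe test or the cassette datasheet, and the example values are hypothetical:

```python
# Expected fiber-position mappings for 12-fiber MPO trunks per TIA-568
# polarity methods (1-based positions, Tx-side position -> Rx-side position).
def expected_mapping(method: str, fibers: int = 12) -> dict:
    if method == "A":      # straight-through, key-up to key-down
        return {i: i for i in range(1, fibers + 1)}
    if method == "B":      # reversed, key-up to key-up
        return {i: fibers + 1 - i for i in range(1, fibers + 1)}
    if method == "C":      # pair-wise flipped
        return {i: i + 1 if i % 2 else i - 1 for i in range(1, fibers + 1)}
    raise ValueError(f"unknown polarity method: {method}")

def check_trunk(as_built: dict, method: str) -> list:
    """Compare the measured/documented mapping to the design assumption."""
    expect = expected_mapping(method)
    return [(pos, got, expect[pos]) for pos, got in as_built.items() if got != expect[pos]]

# Hypothetical probe test of a trunk the design assumed was Method B:
as_built = {i: i for i in range(1, 13)}   # measured straight-through (Method A behavior)
for pos, got, want in check_trunk(as_built, "B"):
    print(f"fiber {pos}: lands on position {got}, design expects {want}")
```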
Power and thermal margin checks during link training
Finally, we checked power delivery stability. Inrush events from other rack components during the migration window caused brief voltage sag, which can affect module drive circuits and the host’s SerDes training stability. We correlated the timestamps of CRC bursts with PDU outlet voltage logs and fan speed changes. When we adjusted load sequencing and improved airflow at the spine intake, error rates dropped even before the final fiber corrections.
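A minimal sketch of that correlation, with made-up timestamps and an assumed sag threshold standing in for your real PDU logs:

```python
from datetime import datetime, timedelta

# Illustrative inputs: timestamps of CRC/FEC error bursts (from counters) and
# PDU outlet voltage samples (timestamp, volts). Values are made up for the sketch.
bursts = [datetime(2026, 3, 2, 14, 5, 12), datetime(2026, 3, 2, 14, 9, 47)]
voltage = [
    (datetime(2026, 3, 2, 14, 5, 10), 197.4),   # sag
    (datetime(2026, 3, 2, 14, 7, 0), 208.1),
    (datetime(2026, 3, 2, 14, 9, 45), 196.8),   # sag
]

SAG_V = 200.0            # assumed sag threshold for a 208 V feed; set per your plant
WINDOW = timedelta(seconds=5)

def correlated_bursts(bursts, voltage, sag_v=SAG_V, window=WINDOW):
    """Return bursts that occur within `window` of a voltage sample below sag_v."""
    sags = [t for t, v in voltage if v < sag_v]
    return [b for b in bursts if any(abs(b - s) <= window for s in sags)]

for b in correlated_bursts(bursts, voltage):
    print("burst at", b, "coincides with a voltage sag")
```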
Pro Tip: In many 800G troubleshooting cases, the fastest discriminator is correlating DOM received power trend with error counter bursts. If DOM shows low received power but temperature looks normal, suspect fiber budget, connector contamination, or polarity. If DOM temperature spikes and received power fluctuates, suspect thermal airflow or marginal module seating before you start re-terminating cables.
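The same discriminator can be written down as a few lines of triage logic; the thresholds are placeholders, not datasheet values:

```python
def classify_fault(rx_power_dbm: float, temp_c: float,
                   rx_unstable: bool = False,
                   rx_low_dbm: float = -5.0, temp_high_c: float = 60.0) -> str:
    """Rough first-pass discriminator from the Pro Tip above.
    Thresholds are assumptions; use your module's datasheet values."""
    if rx_power_dbm <= rx_low_dbm and temp_c < temp_high_c:
        return "suspect fiber budget, connector contamination, or polarity"
    if temp_c >= temp_high_c and rx_unstable:
        return "suspect airflow or marginal module seating"
    return "no dominant signature; widen the telemetry window"

print(classify_fault(-6.5, 45.0))                    # low power, normal temp
print(classify_fault(-3.0, 66.0, rx_unstable=True))  # hot module, fluctuating power
```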
Implementation steps: from rack-level actions to measured results
We followed a tight change-control loop to prevent chasing multiple causes at once. The rollout plan used a rolling window: isolate one port group across the spine pair, apply a single category of changes, then measure for 30 minutes before moving to the next set. This reduced the risk of attributing improvements to the wrong intervention.
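Structurally, the loop looked like the sketch below; the apply and measurement functions are placeholders for manual procedures and counter polling, not a real automation API:

```python
import time

# One change category per soak window, per the change-control loop above.
INTERVENTIONS = ("clean_and_reseat", "fix_polarity", "sequence_load_and_airflow")

def apply_intervention(name: str, port_group: list) -> None:
    # Placeholder: each intervention is a manual, documented procedure.
    print(f"applying {name} to {port_group}")

def errors_per_minute(port_group: list) -> float:
    # Placeholder: poll CRC/FEC counters twice and divide by the interval.
    return 0.0

def rolling_window(port_group: list, soak_s: int = 30 * 60):
    """Apply one category of change, soak, measure, and stop at the first
    intervention that leaves the group clean, so cause and effect stay paired."""
    for name in INTERVENTIONS:
        apply_intervention(name, port_group)
        time.sleep(soak_s)                  # 30-minute measurement window
        epm = errors_per_minute(port_group)
        print(f"{name}: {epm:.2f} errors/min after soak")
        if epm == 0.0:
            return name
    return None
```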
Controlled reseat and cleaning of optical connectors
We removed and cleaned the MPO/MTP end faces using approved lint-free wipes and an inspection scope. For connectorized trunks, we focused on the end faces in the patch panel cassettes where repeated handling increases contamination risk. After cleaning, we reseated with consistent insertion force and verified that the connectors fully latched without spring-back.
Correct polarity and reorient patch panel cassettes
We reoriented the cassettes to match the intended polarity scheme and updated the labeling to prevent future swaps during maintenance windows. For determinism, we validated each trunk’s lane mapping against the switch vendor’s polarity requirement for the specific module type. After correction, CRC bursts became rare and FEC recovery stabilized.
Adjust load sequencing and airflow to protect thermal margins
We staged the workload so that no simultaneous fan curve step and spine CPU load step occurred during the link training window. In addition, we improved containment airflow at the spine intake by rebalancing blanking panels and sealing gaps in the hot-aisle boundary. After airflow adjustments, module temperatures stabilized and DOM alarms cleared.
Firmware and compatibility guardrails
We verified switch software compatibility with the transceiver vendor and checked for known issues affecting high-speed training behavior. Where applicable, we applied the maintenance update recommended by the optics vendor for DOM handling and link stability. We avoided mixing optics brands within the same port group during the test window to reduce variables.
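One lightweight way to enforce this guardrail is an explicit qualification allowlist checked before bring-up; the version strings and vendor names below are hypothetical:

```python
# Hypothetical compatibility allowlist: (switch OS version, optic vendor,
# optic firmware) tuples validated in the lab before production use.
QUALIFIED = {
    ("10.4.2", "vendorA", "2.1.7"),
    ("10.4.2", "vendorA", "2.1.9"),
    ("10.4.2", "vendorB", "1.0.3"),
}

def is_qualified(os_version: str, vendor: str, fw: str) -> bool:
    """Gate a port bring-up on the lab-qualified combination list."""
    return (os_version, vendor, fw) in QUALIFIED

print(is_qualified("10.4.2", "vendorA", "2.1.7"))  # True: proceed
print(is_qualified("10.4.2", "vendorB", "0.9.0"))  # False: block and escalate
```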
Measured results and lessons learned
After the first intervention round (DOM reseat and connector cleaning), the worst ports improved from 5% throughput to 70% throughput, but error counters still spiked intermittently. After polarity correction in the patch panel cassettes, the CRC and FEC recovery events dropped sharply, and throughput returned to 95% to 100% of expected for the tested uplinks. Following load sequencing and airflow tuning, module temperatures stabilized, and the remaining minor error increments disappeared during sustained traffic tests.
In operational terms, our MTTR for the impacted ports dropped from an initial “swap and guess” timeframe of multiple hours to a repeatable process of under 45 minutes per port group. The key lesson was that 800G troubleshooting must treat optics, cabling polarity, power delivery, and thermal behavior as one coupled system rather than isolated components. We also tightened acceptance testing: every new trunk is inspected end-to-end with recorded attenuation and polarity verification before being brought into production.
Common mistakes / troubleshooting pitfalls in 800G deployments
- Mistake: Swapping optics without checking DOM telemetry first.
  Root cause: If the dominant failure is polarity or connector contamination, a replacement module still sees the same lane mapping fault and shows CRC/FEC churn.
  Fix: Pull DOM immediately, compare received power and temperature against known-good ports, then validate polarity and cleanliness before mass replacement.
- Mistake: Assuming MPO polarity is “universal” across vendors and cassettes.
  Root cause: Polarity methods differ (TIA Methods A, B, and C, plus cassette-specific lane mappings). A mismatch yields deterministic lane swaps that look like random errors at the application layer.
  Fix: Use the switch vendor's and optics vendor's polarity diagram for the exact module type, verify cassette orientation, and update port labels.
- Mistake: Ignoring power droop during migration or fan curve transitions.
  Root cause: High-speed SerDes training and transceiver drive circuits can be sensitive to transient PSU rail dips, causing intermittent link instability and counter bursts.
  Fix: Correlate error timestamps with PDU voltage logs and host power events; stagger load changes and confirm that rack PDUs and breaker margins are adequate.
- Mistake: Overheating modules in high-density rows without verifying airflow paths.
  Root cause: Containment leaks or missing blanking panels can recirculate warm air, raising module temperature and degrading optical performance.
  Fix: Measure inlet temperature at the rack and compare it to the module operating range; seal bypasses and confirm fan directionality.
Selection criteria checklist for 800G troubleshooting readiness
When planning the next expansion or building a troubleshooting runbook, engineers should weigh these factors in order:
- Distance and fiber budget: Confirm the actual patch panel, trunk length, and insertion loss; verify against the optics datasheet budget, not just “reach class” (see the budget sketch after this list).
- Connector type and polarity strategy: Document MPO/MTP polarity method, cassette orientation, and lane mapping for the exact switch and module combination.
- Switch compatibility and firmware behavior: Validate optics vendor compatibility matrix and confirm software support for DOM and high-speed training.
- DOM support and thresholds: Ensure the platform exposes the specific DOM fields you need (temperature, laser bias, received power) and that alarm thresholds are understood.
- Operating temperature and airflow: Verify inlet temperature, module temperature headroom, and containment integrity; plan for worst-case seasonal conditions.
- Vendor lock-in risk: Third-party optics can work, but differences in DOM implementation and compliance margins can complicate fault isolation; standardize vendors within a fabric where possible.
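As promised above, here is a worked budget check for the first checklist item; every constant is an assumption to replace with datasheet and as-built values:

```python
# Insertion-loss check for a short-reach 850 nm channel. All numbers are
# placeholders; pull the real values from the optics datasheet and your
# as-built attenuation measurements.
FIBER_ATTEN_DB_PER_KM = 3.0     # assumed OM4 attenuation at 850 nm
TRUNK_LENGTH_M = 70.0
CONNECTOR_LOSS_DB = 0.35        # assumed loss per mated MPO pair
CONNECTOR_COUNT = 4             # two patch panels = four mated pairs in this path
DATASHEET_BUDGET_DB = 1.9       # assumed max channel insertion loss from the datasheet

total_loss = (FIBER_ATTEN_DB_PER_KM * TRUNK_LENGTH_M / 1000.0
              + CONNECTOR_LOSS_DB * CONNECTOR_COUNT)
margin = DATASHEET_BUDGET_DB - total_loss
print(f"channel loss {total_loss:.2f} dB, margin {margin:.2f} dB")
if margin < 0.5:                # assumed engineering margin floor
    print("insufficient margin: shorten the path or reduce mated pairs")
```

Note how the patch panels, not the fiber itself, consume most of the budget in this example; that is exactly the “margin shrinks with patch panel losses” effect from the specs table.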
Cost & ROI note: what it really costs to get stable 800G
In practice, 800G optics and associated cabling tend to be the largest direct cost, but downtime dominates ROI. Typical street pricing varies by vendor and volume; short-reach 850 nm class QSFP-DD optics may range from several hundred to over a thousand USD per module depending on brand and warranty. OEM optics often carry better documented compatibility and lower RMA friction, while third-party modules can reduce BOM cost but may increase time spent in 800G troubleshooting due to DOM or compatibility quirks.
For TCO, include labor for inspection (fiber scope time), cleaning supplies, and change control. A single day of reduced traffic during a migration window can cost far more than the difference between OEM and third-party optics, especially when upstream application SLAs are impacted. Measured MTTR improvements, like the sub-45-minute port group recovery we achieved, directly reduce operational risk and support costs.
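A back-of-envelope version of that comparison; all figures are assumptions to swap for your own pricing and SLA numbers:

```python
# Break-even between BOM savings and extra troubleshooting downtime.
# Every figure below is an assumption for illustration only.
oem_cost = 1100.0                   # USD per module, assumed
third_party_cost = 500.0            # USD per module, assumed
modules = 128                       # uplink count in the rollout window
downtime_cost_per_hour = 25_000.0   # assumed SLA/opportunity cost

bom_savings = (oem_cost - third_party_cost) * modules
break_even_hours = bom_savings / downtime_cost_per_hour
print(f"BOM savings: ${bom_savings:,.0f}")
print(f"extra troubleshooting downtime that erases the savings: {break_even_hours:.1f} h")
```

With these assumed numbers, roughly three hours of extra fault isolation wipes out the entire per-module savings, which is why compatibility quirks matter as much as unit price.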
FAQ
What are the first symptoms I should look for in 800G troubleshooting?
Start with port-level counters: CRC spikes, FEC recovery events, and any link flap behavior during traffic bursts. Then check DOM for received power and module temperature trends on the affected ports. If DOM shows low received power with stable temperature, suspect polarity, budget, or contamination before replacing optics.
Can wrong MPO polarity cause errors even if the link stays up?
Yes. With deterministic lane mapping errors, the link can remain “up” while the PHY repeatedly corrects corrupted symbols, leading to high error counters and throughput collapse. Correct polarity often fixes the issue without changing optics.
Do I need a fiber inspection scope for 800G deployments?
For high-density 800G, a scope materially improves troubleshooting speed. Dirty MPO end faces can create attenuation and mode coupling issues that show up as BER/FEC stress. Cleaning without inspection often wastes time because severe contamination or end-face damage goes undetected.
How do I tell whether the issue is thermal versus optical?
Compare DOM temperature trends across healthy and failing ports and correlate them with error bursts. If temperature rises and received power fluctuates, thermal or airflow problems are likely. If temperature is stable but received power is low, focus on optical budget, connector condition, and polarity.
Are third-party 800G optics safe to deploy in production?
They can be, but treat compatibility and DOM behavior as a first-class requirement. Validate with your switch vendor’s optics policy and run a burn-in with telemetry capture. If your platform relies heavily on DOM alarms, differences can slow fault isolation.
What is the most common root cause in “link up but no traffic” cases?
In many real deployments, the top causes are polarity mismatch, contaminated connectors, or marginal fiber budget after patch panel losses. Power and thermal issues also appear, but they often correlate with multiple ports across a row and align with load or airflow transitions.
Update date: 2026-05-03. For related planning guidance, see 800G cabling and polarity planning to standardize trunk polarity, patch panel cassettes, and acceptance testing before migration windows.
Author bio: I have deployed and troubleshot high-density 400G to 800G fabrics in leaf-spine data centers, focusing on power delivery, cooling constraints, and optical telemetry-driven MTTR reduction. I write runbooks that field teams can execute under outage pressure, with measurable acceptance tests and documented compatibility boundaries.