We are now firmly in the era of exascale computing, accompanied by all things AI. With the industry showing no signs of slowing down, it is crucial to address the exponential power demands of AI data centers, which can require hundreds of megawatts and tens of thousands of high-performance accelerators.
In this piece, Manja Thessin, market manager for market strategy and innovation at AFL – a leading provider of fiber optic solutions – highlights the company’s expertise in addressing the unique demands of AI and machine learning at exascale, where next-generation data centers require next-generation solutions.
Pillar 1: Power and energy infrastructure
We are at a pivotal moment where we see AI workloads driving unsustainable increases in data center power demand – this is no longer a future concern, it is happening now. With this rise comes pressing challenges, including grid instability, rising energy costs, and environmental impacts. Thessin highlights that many regions are already struggling to meet the power demands of hyperscale data centers:
“Here in the US, in Northern Virginia’s ‘data center alley,’ power procurement delays have become the number one bottleneck for new deployments. With this comes skyrocketing energy costs, and without innovations in cooling and energy efficiency, operational costs will soon become unfeasible.”
According to Thessin, we have reached an inflection point where the decisions we make today will determine whether AI can scale sustainably or be severely constrained by infrastructure limitations such as power availability, rising energy costs, and cooling challenges:
“AI data centers can consume up to a thousand times more power than traditional CPU-based data centers. Nearly all of that energy is converted into heat, so if we don’t adopt advanced cooling methods like liquid cooling, immersion systems, or a transition to renewable energy sources, the carbon footprint of AI could offset a lot of the progress made in other industries.”
Transforming power infrastructure to support AI
As AI workloads drive this insatiable surge in data center power consumption, modernizing the grid is essential. But what will it take to make AI truly viable at scale?
Thessin advocates for the development of smarter grids that can dynamically allocate resources, integrate renewable energy sources, and support localized power generation:
“Much of our current transmission infrastructure is decades old and wasn’t designed for the kind of dynamic, high-load requirements that come with AI-driven data centers. As power consumption continues to rise, we need to rethink how energy is generated and distributed.”
She emphasizes that the key to meeting AI’s power demand lies in innovation across multiple areas of energy infrastructure:
“For example, nuclear microreactors are being explored as a localized power solution for individual sites or data center clusters because they offer consistent and scalable energy output with minimal carbon emissions. There are also advancements in geothermal energy and large-scale battery storage – both being explored to help stabilize the energy supply while reducing reliance on fossil fuels.”
Both grid and data center operators need to be in on it
Beyond adopting innovations like advanced cooling systems, growing power demands require enhanced collaboration between private companies and utilities – both grid operators and data center providers must rethink their approach to energy efficiency. Reflecting on the current state of collaboration, Thessin emphasizes:
“The scale of power required means that operators can no longer treat electricity as an unlimited resource. Instead, they must work closely with utilities to plan capacity expansions and integrate renewable energy sources into the mix.”
Training AI models generates immense power surges, placing significant strain on the grid. To address this, Thessin offers some practical solutions for operators and owners alike to help mitigate these challenges:
1. Predictive management
To prevent bottlenecks at peak demand, Thessin suggests data centers adopt predictive management strategies that use machine learning to forecast power usage based on workload schedules and dynamically allocate resources:
“Some operators use AI-driven systems to schedule workloads across multiple regions based on energy availability and cost, which has already reduced energy consumption by as much as 20 percent.”
Predictive maintenance further enhances efficiency by leveraging telemetry data to detect potential failures in fiber networks and hardware. Innovations like vibration sensors on transformers help anticipate failures before they escalate, protecting valuable data center investments.
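To make the idea concrete, here is a minimal, purely illustrative Python sketch of the forecasting side of predictive management: it fits a simple regression to hypothetical telemetry (scheduled GPU-hours and ambient temperature against facility power draw) and flags upcoming intervals that might breach a contracted power budget. All figures, feature choices, and the 60 MW threshold are invented for illustration; a production system would rely on far richer telemetry and models.

```python
import numpy as np

# Hypothetical historical telemetry: scheduled GPU-hours, ambient temperature (C),
# and the facility power draw (MW) observed for each interval.
gpu_hours = np.array([40_000, 55_000, 62_000, 70_000, 48_000, 66_000])
ambient_c = np.array([18.0, 22.0, 25.0, 27.0, 20.0, 26.0])
power_mw  = np.array([38.0, 51.0, 58.0, 66.0, 45.0, 62.0])

# Fit a simple linear model: power ~ a*gpu_hours + b*ambient + c.
X = np.column_stack([gpu_hours, ambient_c, np.ones_like(ambient_c)])
coeffs, *_ = np.linalg.lstsq(X, power_mw, rcond=None)

def forecast_power(scheduled_gpu_hours: float, forecast_ambient_c: float) -> float:
    """Predict facility power draw (MW) for an upcoming interval."""
    a, b, c = coeffs
    return a * scheduled_gpu_hours + b * forecast_ambient_c + c

# Flag intervals that may breach a (hypothetical) 60 MW utility contract so that
# workloads can be deferred or shifted to another region in advance.
POWER_BUDGET_MW = 60.0
upcoming = [(72_000, 28.0), (50_000, 21.0), (68_000, 24.0)]
for hours, temp in upcoming:
    predicted = forecast_power(hours, temp)
    status = "DEFER / SHIFT REGION" if predicted > POWER_BUDGET_MW else "OK"
    print(f"{hours:>7} GPU-hours @ {temp:4.1f}C -> {predicted:5.1f} MW  [{status}]")
```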
2. Distributed computing
From an architectural perspective, distributed computing can be used to alleviate strain on the grid by spreading workloads across multiple regions, rather than concentrating all compute resources in a single location, as Thessin explains:
“This also improves overall efficiency: some bottlenecks can be alleviated while latency is reduced for end users of the compute network.”
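As an equally simplified sketch of the distributed approach, the snippet below scores candidate regions by a weighted combination of energy price and user latency and places a job only where the grid has headroom for it. The region names, figures, and weights are assumptions, not a description of any operator’s actual scheduler.

```python
from dataclasses import dataclass

@dataclass
class Region:
    name: str
    energy_price: float       # $/MWh (hypothetical)
    grid_headroom_mw: float   # spare grid capacity available to the site
    user_latency_ms: float    # latency to the job's primary user base

def place_job(regions: list[Region], job_power_mw: float,
              price_weight: float = 1.0, latency_weight: float = 2.0) -> Region:
    """Pick the feasible region with the lowest weighted cost."""
    feasible = [r for r in regions if r.grid_headroom_mw >= job_power_mw]
    if not feasible:
        raise RuntimeError("No region has enough grid headroom for this job")
    return min(feasible,
               key=lambda r: price_weight * r.energy_price
                             + latency_weight * r.user_latency_ms)

regions = [
    Region("us-east", energy_price=95.0, grid_headroom_mw=5.0, user_latency_ms=12.0),
    Region("us-central", energy_price=60.0, grid_headroom_mw=40.0, user_latency_ms=25.0),
    Region("nordics", energy_price=45.0, grid_headroom_mw=80.0, user_latency_ms=60.0),
]
print(place_job(regions, job_power_mw=25.0).name)  # -> "us-central" with these weights
```

Weighting latency more heavily than price reflects the trade-off Thessin alludes to: spreading load should ease strain on the grid without pushing compute too far from end users.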
3. Innovative hardware
Lastly, the latest generation of AI accelerators is designed with energy efficiency in mind. These specialized chips exploit sparsity and lower-precision arithmetic to reduce the energy required for training large models without sacrificing performance.
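The energy case for lower precision and sparsity is largely about moving and multiplying fewer bytes. The toy NumPy snippet below simply illustrates that arithmetic: the same weight tensor stored in FP16 occupies half the memory of FP32, and pruning half the weights (as in 2:4 structured sparsity) roughly halves the multiply-accumulate work. The tensor size and pruning ratio are arbitrary, and real accelerators implement these optimizations in hardware rather than in NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((4096, 4096)).astype(np.float32)

# Lower precision: the same tensor stored in FP16 halves the bytes that must be
# moved through memory and across the network on every access.
weights_fp16 = weights.astype(np.float16)
print(f"FP32 dense: {weights.nbytes / 2**20:.1f} MiB")
print(f"FP16 dense: {weights_fp16.nbytes / 2**20:.1f} MiB")

# Sparsity: if half the weights are pruned to zero, an accelerator that skips
# zero operands performs roughly half the multiply-accumulates for the layer.
mask = rng.random(weights.shape) < 0.5
pruned = np.where(mask, weights_fp16, np.float16(0.0))
active_fraction = np.count_nonzero(pruned) / pruned.size
print(f"Active weights after pruning: {active_fraction:.0%} of original MACs")
```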
Pillar 2: Network and connectivity
Training complex AI models requires sophisticated backend networks capable of handling traffic with low latency. Achieving near-perfect synchronicity across thousands of accelerators is a challenge the industry has yet to fully master, especially because this east-west traffic differs significantly from traditional north-south data center traffic patterns.
Moreover, existing infrastructure is often not equipped to meet the high bandwidth demands or manage the mammoth datasets – often reaching petabyte scale – necessary for AI training and inference. This can create bottlenecks, reduce system efficiency, and even undermine model accuracy if data integrity is lost.
Thessin emphasizes that none of this happens automatically or without significant investment, which itself remains a hurdle for the industry:
“Training large language models requires significant upfront investments in specialized hardware, accelerators, real estate, and power infrastructure. Maintaining and upgrading this infrastructure only adds to the financial burden.”
The exascale era
In exascale computing, where massive fleets of GPUs and CPUs power AI applications, large-scale training clusters may include over 100,000 accelerators, making ultra-high-density (UHD) fiber solutions essential. Thessin warns that without these high-density cables, which contain thousands of fibers each, the physical space required for cabling could increase by up to 300 percent, presenting significant challenges for cooling, airflow, and overall cable management:
“To address this, we now use UHD cables that offer enormous bandwidth capacity without expanding the physical footprint. This simplifies cable management, improves airflow, and cuts down pathway costs. It also helps future-proof the network by allowing incremental upgrades, where unused fibers can be activated as demand grows.”
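A rough back-of-envelope calculation shows why this matters at cluster scale. Every figure below (links per accelerator, fibers per link, cable capacities) is an illustrative assumption rather than an AFL specification, but the ratio that falls out is consistent with the “up to 300 percent” space increase cited above.

```python
# Back-of-envelope: cable count for a large training cluster (all figures assumed).
accelerators = 100_000        # accelerators in the cluster
links_per_accelerator = 4     # backend network links per accelerator
fibers_per_link = 8           # e.g., parallel-fiber optics per 800G link

total_fibers = accelerators * links_per_accelerator * fibers_per_link

conventional_cable = 864      # fibers per conventional high-count cable (typical)
uhd_cable = 3_456             # fibers per ultra-high-density cable (typical)

print(f"Total backend fibers : {total_fibers:,}")
print(f"Conventional cables  : {total_fibers / conventional_cable:,.0f}")
print(f"UHD cables           : {total_fibers / uhd_cable:,.0f}")
```

With these assumed numbers, conventional cabling would need roughly four times as many cables, and therefore roughly four times the pathway space, as the ultra-high-density alternative.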
Innovations in fiber technology continue to push the boundaries, reducing fiber diameters to pack even more capacity into the same space – a capability that is crucial for meeting the evolving demands of large-scale AI infrastructure. High-density connectivity solutions, like SN-MT and MMC connectors, further enhance efficiency: these small-form-factor, multi-fiber connectors maximize port density while maintaining the low insertion loss required for reliable performance in dense environments.
Autonomous vehicles, surgical robotics, financial trading, and Netflix: What do they all have in common?
For many of us, the real-life implications of these technologies can be difficult to grasp. It’s one thing to read about super-advanced fiber types, and quite another to understand how these innovations impact our daily lives.
Thessin seeks to bring the practical applications of AI technologies to life, explaining that advanced fiber technology is critical for three major reasons: ensuring high-speed connectivity for seamless GPU cluster communication, achieving low latency for real-time inference and synchronization, and optimizing white space to reduce cabling, improve airflow, and maximize rack space. In turn, these advancements help overcome physical constraints, latency issues, and cost challenges:
“One example use case is in autonomous vehicles, where vehicle-to-everything communication must occur in under two milliseconds to ensure safety-critical decisions are made in real-time.”
“Similarly, in healthcare, surgical robotics requires sub-fifty-microsecond latencies to enable precise haptic feedback. During a remote surgical procedure, even the slightest delay could compromise patient outcomes.”
“In financial services, high-frequency trading has relied on microsecond-level latencies for several years to execute trades faster than competitors, where every microsecond can mean millions of dollars in profit or loss.”
“Customer-facing applications, like AI-powered recommendation engines used by Netflix and Amazon, depend on sub-second response times to deliver personalized results instantly during user interactions. In gaming, AR and VR applications can significantly impact gameplay if latency is compromised in any way.”
Interestingly, AI is also being used to support its own training. In data center operations, AI handles network optimization, intelligent load balancing, and congestion management in the backend network, efficiently directing data flows across thousands of accelerators during synchronization to keep operations running smoothly.
Special delivery: Advanced transmission protocols
As we’ve seen, AI applications require extremely large amounts of data to be delivered between devices. But how can this data transfer be made more efficient to minimize latency and other performance issues?
The answer lies in advanced transmission protocols – next-generation networking protocols designed specifically for high-performance environments like AI data centers. These protocols are, in effect, smarter ways of delivering messages: they keep data moving quickly and reliably across distributed systems without bottlenecks or packet loss. Thessin explains:
“One example here is time-sensitive networking, or TSN, which prioritizes critical traffic while minimizing jitter or delay. This is really a must-have for synchronous training, where every accelerator must access updated model parameters simultaneously.
“Another example is RDMA (remote direct memory access) over Converged Ethernet, also known as RoCE or ‘Rocky.’ This technology allows direct memory-to-memory transfers between nodes, bypassing the CPU and avoiding extra processing overhead, which significantly reduces latency while increasing throughput.
“Finally, in-band network telemetry, or INT, provides real-time visibility into network performance metrics, such as packet loss or latency spikes. This allows operators to proactively identify issues before they impact operations.”
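As a simplified illustration of the visibility INT-style telemetry provides, the sketch below consumes hypothetical per-hop latency samples and flags any hop that drifts well above its recent baseline. The record format, switch names, and thresholds are invented; a real INT deployment embeds this metadata in packets and exports it to a collector.

```python
from collections import defaultdict, deque
from statistics import mean

# Hypothetical per-hop latency samples (switch_id, latency in microseconds),
# standing in for metadata a real INT deployment would stamp into packets.
samples = [
    ("leaf-12", 4.1), ("spine-03", 6.0), ("leaf-12", 4.3),
    ("spine-03", 6.2), ("leaf-12", 4.0), ("spine-03", 19.7),  # latency spike
]

WINDOW = 4          # samples of history to keep per hop
SPIKE_FACTOR = 2.0  # flag when latency exceeds 2x the recent average

history: dict[str, deque] = defaultdict(lambda: deque(maxlen=WINDOW))

for hop, latency_us in samples:
    recent = history[hop]
    if len(recent) >= 2 and latency_us > SPIKE_FACTOR * mean(recent):
        print(f"ALERT: {hop} latency {latency_us:.1f}us "
              f"vs recent average {mean(recent):.1f}us")
    recent.append(latency_us)
```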
Mastering cable management – it’s more than just housekeeping
The sheer volume of fiber optic cabling required for high-density AI deployments necessitates a systematic approach to cable management – anyone who’s seen an AI data center facility can attest to this.
Poor cable management can undermine network performance, leading to issues like macrobend loss or restricted airflow for cooling around racks. Advanced cable management involves not only the cabling system itself, but also reducing congestion and enabling quick troubleshooting. Thessin offers a glimpse of her expertise:
“Ultimately, it’s the systematic approach that helps keep everything organized, which includes proper labeling, routing paths, bend radius protection, and accessibility for moves and changes.”
Pillar 3: Workforce and collaboration
Last, but far from least, none of this advanced technology would exist without the people who keep the operational wheels turning. It’s essential, therefore, to secure the future of AI-enabled data centers by training and retaining a skilled workforce, including onboarding the next generation of data center professionals.
Many companies, including AFL, are focused on upskilling employees to design, deploy, and maintain AI infrastructure through comprehensive training programs and partnerships with educational institutions. Thessin adds:
“We’re rightfully seeing the rise of specialized certifications focused specifically on AI infrastructure. The key is developing a workforce that understands both the physical layer requirements, like advanced fiber optic technologies, as well as the unique demands of AI workloads, including expertise in areas like liquid cooling, nuclear power, and high-density connectivity.”
Come out, come out, wherever you are
So, where does this skilled workforce come from? Thessin believes that the key to discovering and training the right talent lies in collaboration between industry leaders, academic institutions, and government initiatives, forming a robust talent pipeline:
“You need collaboration between infrastructure providers like AFL for the physical network layer, hardware manufacturers for servers and networking equipment, cooling solution providers, power distribution specialists, and software and AI platform developers.”
With the diverse range of industry players and specialties required to build these resource-intensive environments, Thessin compares the resources that AI clusters rely on to a societal ecosystem that must function in harmony.
Looking back at the development of power infrastructure and the creation of power grids, it’s clear that those efforts brought together an entire industry to ensure the fair distribution of power.
In Thessin’s view, the same principle applies to intelligence. In this context, intelligence encompasses things like data collection, the infrastructure required to transmit and store data, and the artificial intelligence layer that now uses this data to make decisions:
“Today, we’re taking a distributed approach, from edge locations close to the end user to exascale facilities and clusters. It’s really an ecosystem – almost like a nervous system that connects all of these elements.”
Ethical and societal considerations in AI deployment
As we've delved into the demands and challenges of exascale computing for AI/ML data centers, we are confronted with a profound question: How do we ensure that the intelligence we create serves humanity, rather than the other way around?
To put this into perspective, we’re talking about fundamentally reshaping how we process data, including redefining the very nature of decision-making and shifting control from humans to artificial intelligence engines. Thessin wraps up this discussion with a powerful reminder:
“Beyond the infrastructure requirements, making this ecosystem functional means addressing the ethical and societal concerns that must be integral to the conversation. There is significant work ahead to ensure that we do this ethically and responsibly.”
It’s an exciting yet daunting moment in time – one where we are essentially writing the manual for AI use as we go. With so many different voices and perspectives contributing to the conversation, we must continue refining and defining a responsible approach to AI integration.