
Integrating AI into the Network

May 14, 2024

The network management solutions that ensure continuity.

network structure

By Tracy Collins, Vice President of Sales, Americas

Artificial Intelligence (AI) will profoundly impact society, changing how people work and live. From a network perspective, AI for IT operations (AIOps) will significantly enhance network operations and operational efficiency. Likewise, in infrastructure management, AI-powered automation will optimize resource allocation and free valuable human resources to focus on more strategic and creative initiatives. Nevertheless, AI is not a simple widget that companies can drop into their complex network environments and expect immediate results. Organizations will need to implement several network management solutions and techniques to maximize the capabilities of AI while maintaining network connectivity.

For some necessary context, let’s consider the roots of AI. Decades ago, Moore’s Law, which observes that the number of transistors on an integrated circuit doubles roughly every two years with a negligible increase in cost, converged with the maturing of Graphics Processing Units (GPUs), which handle a range of data center tasks such as scientific computation, machine learning algorithms, and large-scale data processing. This coalescence provided the computational backbone necessary for AI to become what it is today.

Fast forward several years, and large language models (LLMs) are now pervasive throughout GPU data centers. However, LLMs introduce challenges for the networking infrastructure that interconnects GPUs. Unlike traditional compute workloads, AI workloads – especially those involving LLMs as well as machine learning and deep learning algorithms – place considerable strain on Ethernet networking through what are known as “elephant flows”: massive, sustained data transfers. These elephant flows require significant computational power and can cause congestion and latency problems. Simply put, as AI adoption continues to surge, so too does the demand for computing resources, which inevitably leads to increased power consumption within data centers.
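To make the elephant flow problem more concrete, here is a minimal sketch of how an operator might flag flows whose average rate dominates a fabric link. The flow-record fields, the 400 GbE link capacity, and the 10% threshold are illustrative assumptions, not figures from any particular data center.

# Minimal sketch (not a production tool): flagging "elephant flows" from
# flow records so operators can see which transfers dominate a link.
# Field names, capacity, and threshold below are illustrative assumptions.
from dataclasses import dataclass

LINK_CAPACITY_BPS = 400e9   # assume a 400 GbE fabric link
ELEPHANT_SHARE = 0.10       # flag flows using more than 10% of link bandwidth

@dataclass
class FlowRecord:
    src: str
    dst: str
    bytes_sent: int
    duration_s: float

def elephant_flows(records: list[FlowRecord]) -> list[FlowRecord]:
    """Return flows whose average rate exceeds the elephant threshold."""
    flagged = []
    for r in records:
        if r.duration_s <= 0:
            continue
        avg_bps = (r.bytes_sent * 8) / r.duration_s
        if avg_bps >= ELEPHANT_SHARE * LINK_CAPACITY_BPS:
            flagged.append(r)
    return flagged

if __name__ == "__main__":
    # Hypothetical records: one GPU-to-GPU transfer, one small management flow.
    sample = [
        FlowRecord("gpu-node-01", "gpu-node-07", bytes_sent=900_000_000_000, duration_s=120),
        FlowRecord("mgmt-host", "gpu-node-07", bytes_sent=4_000_000, duration_s=120),
    ]
    for f in elephant_flows(sample):
        print(f"elephant flow: {f.src} -> {f.dst}")

In this toy example only the GPU-to-GPU transfer crosses the threshold, which is exactly the kind of flow that strains the interconnect described above.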

However, Ethernet advocates argue that the elephant flow issue is no longer a concern and that Ethernet remains a viable option for handling advanced LLM workloads. Another contender, High-Performance Computing (HPC) networking, typified by the low-latency InfiniBand interconnect, is held up by its proponents as the key to processing elephant flows efficiently. Cisco and NVIDIA are the two giants representing the sides of this dispute between established Ethernet infrastructure and InfiniBand technology. Interestingly, despite offering InfiniBand switches, NVIDIA (the leader in GPU technology) is hedging its bets by partnering with Cisco. Famous for its Ethernet switching solutions, Cisco has criticized InfiniBand, stating that it lacks the requisite scalability to meet the demands of GPU data centers.

So, what does this debate mean for network management within GPU data centers? Unlike Ethernet switches, InfiniBand switches typically do not have serial console management ports; instead, they expose Ethernet management ports. As such, enterprises need a way to bridge the gap and ensure continuous connectivity. One potential solution to this quandary is a robust fabric management solution that provides an independent overlay management network for both Ethernet and serial management. Such a network management solution would safeguard companies regardless of whether Ethernet or InfiniBand prevails. More importantly, it would address the evolving needs of GPU data centers amid the growth of AI.
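As a rough illustration of that bridging idea, the sketch below resolves a management path for each device over an independent management network, preferring the Ethernet management port and falling back to a serial line exposed through a console server. The hostnames, ports, and inventory format are hypothetical and not tied to any vendor’s product.

# Minimal sketch of resolving a management path over an overlay management
# network. The console-server-over-TCP pattern, addresses, and ports are
# assumptions for illustration only.
import socket

def tcp_reachable(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

def resolve_management_path(device: dict) -> str:
    """Prefer the Ethernet management port; fall back to the serial console path."""
    if tcp_reachable(device["mgmt_ip"], 22):
        return f"ssh {device['mgmt_ip']}"  # Ethernet management port
    if tcp_reachable(device["console_server"], device["console_port"]):
        return f"console {device['console_server']}:{device['console_port']}"  # serial overlay
    return "unreachable"

if __name__ == "__main__":
    # Hypothetical inventory mixing an Ethernet switch and an InfiniBand switch.
    inventory = [
        {"name": "eth-leaf-01", "mgmt_ip": "10.0.0.11", "console_server": "oob-gw-1", "console_port": 7001},
        {"name": "ib-spine-01", "mgmt_ip": "10.0.0.21", "console_server": "oob-gw-1", "console_port": 7002},
    ]
    for dev in inventory:
        print(dev["name"], "->", resolve_management_path(dev))

The design point is that the overlay network answers for every switch the same way, whichever interconnect technology wins out on the data plane.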

Likewise, by coupling robust fabric management with other network resilience solutions, like out-of-band (OOB) management, a business’s network engineers and system administrators will have everything they need to remediate connected network resources from anywhere, ensuring AI applications and tools remain operational in the face of networking challenges. Enterprises should also adopt a flexible, software-defined network control plane coupled with autonomous, secure, and remote access capabilities to perform tasks such as provisioning, orchestration, management, and remediation.
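The remediation workflow described above might look something like the following sketch, where the health probe and remediation step are placeholder functions standing in for whatever tooling an organization actually runs; the point is the control flow of falling back to the OOB path when the production network cannot help.

# Minimal sketch of in-band-first, OOB-fallback remediation. check_health and
# remediate are placeholders, not calls to any real product API.
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("remediation")

def check_health(device: str, path: str) -> bool:
    """Placeholder health probe; a real check might query SNMP, gNMI, or a vendor API."""
    log.info("probing %s via the %s path", device, path)
    return False  # pretend the probe fails so the fallback branch runs

def remediate(device: str, path: str) -> None:
    """Placeholder remediation, e.g. restarting a service or reapplying configuration."""
    log.info("remediating %s via the %s path", device, path)

def ensure_operational(device: str) -> None:
    """Check in-band health; if the check fails, remediate over the OOB path."""
    if check_health(device, "in-band"):
        log.info("%s healthy, nothing to do", device)
        return
    remediate(device, "out-of-band")
    if check_health(device, "out-of-band"):
        log.info("%s recovered via the OOB path", device)
    else:
        log.warning("%s still unhealthy; escalate to an engineer", device)

if __name__ == "__main__":
    ensure_operational("gpu-fabric-switch-03")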

Additionally, companies must ensure their AI-automated systems operate transparently and accountably. There is no question that businesses will usher in a new age of efficiency and productivity as AI takes a more prominent role in infrastructure management and IT operations, automating routine tasks, improving decision-making, and streamlining manual processes. Nevertheless, ethical and security risks will abound without built-in mechanisms for human oversight. It is paramount that businesses deploy central management software as part of their OOB management solution to establish human control over AI-driven operations, thereby ensuring automated processes align with organizational objectives and cybersecurity policies.

Central management software provides a comprehensive and intuitive interface that allows IT administrators to monitor, manage, and intervene in the network’s operations. It also gives administrators an aggregated view of various automated network management tasks, providing these human decision-makers the insights and control they need to correct course if AI is not in lockstep with established principles of security, accountability, and transparency. Within an OOB management framework, AI automation and central management software will synergize, enhancing operational capabilities and strengthening network resiliency.
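One way to picture that oversight loop is the sketch below, in which AI-proposed changes are screened against an assumed organizational policy and then held in a queue until an administrator reviews them. The action names and policy set are invented for illustration, not drawn from any specific central management product.

# Minimal sketch of a human-approval gate for AI-proposed network changes.
# The policy list and change format are illustrative assumptions.
from dataclasses import dataclass, field

APPROVED_ACTIONS = {"adjust_qos", "drain_link", "restart_service"}  # assumed org policy

@dataclass
class ProposedChange:
    device: str
    action: str
    reason: str
    approved: bool = False

@dataclass
class ChangeQueue:
    pending: list[ProposedChange] = field(default_factory=list)

    def submit(self, change: ProposedChange) -> bool:
        """AI automation submits a change; policy screens it before a human sees it."""
        if change.action not in APPROVED_ACTIONS:
            print(f"rejected by policy: {change.action} on {change.device}")
            return False
        self.pending.append(change)
        return True

    def review(self, approve_all: bool = False) -> list[ProposedChange]:
        """A human administrator reviews the aggregated queue and approves changes."""
        for change in self.pending:
            change.approved = approve_all  # real tooling would prompt per change
        applied = [c for c in self.pending if c.approved]
        self.pending = [c for c in self.pending if not c.approved]
        return applied

if __name__ == "__main__":
    queue = ChangeQueue()
    queue.submit(ProposedChange("leaf-12", "adjust_qos", "elephant flow congestion on port 7"))
    queue.submit(ProposedChange("leaf-12", "factory_reset", "anomaly score exceeded"))
    for change in queue.review(approve_all=True):
        print(f"applying {change.action} on {change.device}: {change.reason}")

Nothing reaches the network until a person signs off, which is the accountability mechanism the paragraph above calls for.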

Reading through this list of network management solutions and techniques may feel overwhelming, which is understandable – much goes into enabling AI while safeguarding network connectivity. Thankfully, just as there are AI integration experts, there are network management solution providers that can support enterprises in their journey to build a more robust network environment.

Tracy Collins

Tracy has over 25 years of experience in leadership positions in the IT and infrastructure industry. Prior to joining Opengear, Tracy led the Americas business for EkkoSense, the leading provider of AI/ML software that allows data center operators to operate more efficiently. Before EkkoSense, Tracy was the CEO of Alabama-based Simple Helix, a regional colocation data center operator and MSP. Tracy spent over 21 years with Vertiv in various leadership positions, including leading the global channel organization.

Tracy has an extensive background in sales leadership and channel development, with a strong track record of driving growth while improving profitability. Tracy holds a Bachelor of Science in Business Administration and a Master of Science in Management from the University of Alabama in Huntsville.

 
