Building a High-Performance Gen AI Setup with NVidia GPUs & KUBE by IG1

This guide explains how we set up a Gen AI platform using KUBE by IG1. It starts with installing the servers and NVidia GPUs and setting up the base software. We then configure KUBE by IG1 to manage virtual machines and ensure everything is connected properly. Next, we download and optimize the LLM, integrate it with a retrieval system that improves responses, and set up user-friendly interfaces for interacting with the AI. Finally, we test the system thoroughly, check its performance, and set up monitoring tools to keep it running smoothly. The result is a robust and efficient AI setup.

Layer 01: Hardware & Cloud Setup

Hardware & cloud infrastructure form the foundational layer of the Generative AI stack, providing the necessary computational power and flexibility for training and deploying AI models.

Physical Servers

Unpack and install servers and NVidia GPUs into data center racks. Connect power and networking, ensuring all components are secure and properly seated. This setup provides the foundation for the AI infrastructure.

Base System

Install IG1 AI OS, our in-house operating system based on Ubuntu Linux, on each server, update the system, and install the NVidia drivers and the CUDA toolkit. This step ensures the servers are ready for GPU-accelerated applications and provides a stable operating environment.

KUBE by IG1 for AI

Install KUBE by IG1 for AI to manage virtual machines and containers. Configure networking within KUBE, initialize the cluster, and verify its health. This step establishes the core infrastructure for managing and deploying AI applications.

Physical Servers

Unpacking and Initial Setup


Unpack the Hardware:

Carefully unpack the servers, NVidia GPUs, and other hardware components.


Rack the Servers: 


Install the servers into the designated racks in the data center.

Connect Power and Networking:

Connect the servers to power sources and the data center network.


Hardware Configuration


Install NVidia GPUs:

Physically install the NVidia GPUs into the servers according to the manufacturer’s instructions.

Verify Hardware Connections:

Ensure all connections are secure and components are properly seated.

Base System

Operating System Installation


Install the OS:

Install IG1 AI OS, an operating system specially designed for AI services, leveraging our deep expertise and capability in managing “plug and play” platforms for AI.

Update the System:

Run system updates to ensure all packages are up to date.

GPU Drivers and CUDA Installation


Install NVidia Drivers:

Install the latest NVidia drivers for the GPUs.

Install CUDA Toolkit:

The CUDA toolkit is already embedded in IG1 AI OS, so no separate installation is required.
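To confirm that the drivers and the embedded CUDA toolkit are actually usable from user space, a short sanity check helps; the sketch below assumes PyTorch is installed on the node, which is not necessarily part of the base image.

    # Sanity check for NVidia drivers and the CUDA runtime.
    # Assumes PyTorch is installed (pip install torch); nvidia-smi ships with the driver.
    import subprocess

    import torch

    # Driver level: nvidia-smi talks directly to the kernel driver.
    print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)

    # Runtime level: CUDA must also be visible to user-space frameworks.
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
    else:
        raise SystemExit("CUDA not available - check driver and toolkit installation")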

KUBE by IG1 for AI

Installation and Configuration


Install KUBE by IG1:

Follow the installation guide for KUBE by IG1 to set up the virtualization layer.

Configure Networking:

Set up networking within KUBE to ensure communication between nodes and external access.

Cluster Initialization


Initialize KUBE Cluster:

Initialize the KUBE cluster to create a control plane and add worker nodes.

Verify Cluster Health:

Check the health and status of the KUBE cluster to ensure all components are functioning correctly.
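As an illustration only, the health check can also be scripted against the Kubernetes API; this sketch uses the official kubernetes Python client and assumes a kubeconfig for the KUBE cluster is available on the machine running it.

    # Minimal cluster health check with the official Kubernetes Python client.
    # Assumes a valid kubeconfig (~/.kube/config) pointing at the KUBE cluster.
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    # Every node should report the Ready condition as "True".
    for node in v1.list_node().items:
        ready = next(
            (c.status for c in node.status.conditions if c.type == "Ready"), "Unknown"
        )
        print(f"{node.metadata.name}: Ready={ready}")

    # Pods stuck outside Running/Succeeded usually point to scheduling or image issues.
    for pod in v1.list_pod_for_all_namespaces().items:
        if pod.status.phase not in ("Running", "Succeeded"):
            print(f"{pod.metadata.namespace}/{pod.metadata.name}: {pod.status.phase}")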

Layer 02: Model Foundation LLM and RAG Deployment

AI applications rely on generative models, such as LLAMA3, Mistral, Deepseek, and StarCoder, which are models pre-trained on vast datasets to capture complex patterns and knowledge. These models serve as building blocks for various AI tasks, including natural language processing and image generation. To deploy and manage AI applications effectively, several services are needed to ensure the proper functioning of Large Language Models (LLMs). These services include quantization for resource optimization, inference servers for model execution, an API core for load balancing, and observability for data collection and trace management. By fine-tuning and optimizing these models on specific datasets, their performance and accuracy can be enhanced for specialized tasks. This foundational step enables developers to leverage sophisticated models, reducing the time and resources required to build AI applications from scratch.

LLM Model Setup

Download the LLM (Large Language Model) and perform quantization to optimize performance and reduce resource usage. This step ensures the AI model runs efficiently and is ready for integration with other components.

RAG (Retrieval-Augmented Generation) Setup

Integrate RAG components using a widely adopted framework (LlamaIndex in the example below) and deploy the RAG pipeline within KUBE. This step enhances the AI model with retrieval-augmented capabilities, providing more accurate and relevant responses.

LLM Model Setup

Download LLM:

Obtain the LLM from the appropriate source.
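If the model is distributed on the Hugging Face Hub, it can be pulled with the huggingface_hub library; the repository id and local path below are placeholders, and gated models additionally require an access token.

    # Download model weights from the Hugging Face Hub (placeholder repo id and path).
    from huggingface_hub import snapshot_download

    local_path = snapshot_download(
        repo_id="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder: use your model
        local_dir="/models/llama3-8b-instruct",
    )
    print("Model downloaded to", local_path)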

LLM Optimization:

Optimization consists of reducing resource usage by preparing and enhancing LLMs through a process called quantization, which increases inference performance without significantly compromising accuracy. Our quantization management services use the AWQ project, which provides an excellent balance of speed and accuracy.
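The exact quantization tooling is managed by IG1; purely as an illustration, the open-source AutoAWQ library applies AWQ quantization roughly as follows (paths and settings are placeholders).

    # Illustrative AWQ quantization with the AutoAWQ library (paths are placeholders).
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model_path = "/models/llama3-8b-instruct"
    quant_path = "/models/llama3-8b-instruct-awq"
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # 4-bit weight quantization: smaller memory footprint, faster inference.
    model.quantize(tokenizer, quant_config=quant_config)
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)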

LLM Inference Servers:

Much like database engines, LLM inference servers run LLMs for inference or embedding. IG1 installs and manages all the services necessary for the proper functioning of LLM models, relying on several inference server instances.
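The specific inference servers are not detailed here, but most common choices (vLLM, Hugging Face TGI, and similar) expose an OpenAI-compatible HTTP API, so a deployed model can be queried along these lines (URL and model name are placeholders).

    # Query an OpenAI-compatible inference server (e.g. vLLM or TGI); illustrative only.
    from openai import OpenAI

    client = OpenAI(base_url="http://inference.kube.local/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="llama3-8b-instruct-awq",  # placeholder model name
        messages=[{"role": "user", "content": "Summarize what an inference server does."}],
        max_tokens=128,
    )
    print(resp.choices[0].message.content)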

RAG (Retrieval-Augmented Generation) Setup

Integrate RAG Components:

Set up the necessary RAG components (example using the LlamaIndex framework):
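A minimal sketch of such a setup could look like the following; the data path and question are placeholders, and recent LlamaIndex versions import from llama_index.core (by default LlamaIndex expects an LLM and embedding model to be configured, for example via Settings, pointing at your local deployment).

    # Minimal RAG pipeline with LlamaIndex: index local documents, then query them.
    # Paths and the question are placeholders; configure Settings to use your local
    # LLM and embedding model instead of the defaults.
    from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

    documents = SimpleDirectoryReader("/data/knowledge-base").load_data()
    index = VectorStoreIndex.from_documents(documents)

    query_engine = index.as_query_engine()
    print(query_engine.query("What does our internal documentation say about GPU quotas?"))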

Deploy RAG Pipeline:

Deploy the RAG pipeline within the KUBE environment.

Layer 03: Integration, Orchestration & Deployment Tooling

This layer covers the critical processes of integrating, orchestrating, and deploying AI infrastructure to ensure seamless and efficient operations. As AI applications become increasingly complex and integral to business operations, it is essential to have a robust framework that supports the integration of various services, the orchestration of containerized applications, and the deployment of these applications with minimal friction.
By leveraging advanced tooling and best practices, organizations can achieve greater scalability, reliability, and performance for their AI systems. We will explore the key components and strategies required to build a resilient and scalable AI infrastructure that meets the evolving needs of modern enterprises.

Integration of AI Services

Integrate various AI services seamlessly to ensure efficient communication and operation. This includes:

The API Core acts as an LLM proxy, balancing the load across LLM inference server instances. LiteLLM, deployed in high availability, is used for this purpose. It offers broad support for LLM servers, robustness, and storage of usage information and API keys in PostgreSQL. LiteLLM also synchronizes its different instances and sends LLM usage information to our observability tools.
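In production the proxy is configured through its own configuration file; purely to illustrate the load-balancing idea, LiteLLM's Python Router can spread requests across two inference server instances (endpoints and model names below are placeholders).

    # Illustrative only: LiteLLM Router balancing one model alias across two backends.
    from litellm import Router

    router = Router(model_list=[
        {
            "model_name": "llama3-8b",  # alias exposed to applications
            "litellm_params": {
                "model": "openai/llama3-8b-instruct-awq",  # OpenAI-compatible backend
                "api_base": "http://inference-1.kube.local/v1",
                "api_key": "none",
            },
        },
        {
            "model_name": "llama3-8b",
            "litellm_params": {
                "model": "openai/llama3-8b-instruct-awq",
                "api_base": "http://inference-2.kube.local/v1",
                "api_key": "none",
            },
        },
    ])

    resp = router.completion(
        model="llama3-8b",
        messages=[{"role": "user", "content": "Ping from the API Core"}],
    )
    print(resp.choices[0].message.content)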

Observability & Traceability

Implement observability tools to gain insights into the behavior and performance of your AI applications:


The LLM observability layer collects usage data and execution traces, ensuring proper LLM management. IG1 efficiently manages LLM usage through a monitoring stack connected to the LLM orchestrator. Lago and OpenMeter collect the information, which is then transmitted to our central observability system, Sismology.
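The exact wiring between LiteLLM, Lago, OpenMeter, and Sismology is specific to our stack; as a generic sketch, LiteLLM can invoke a custom success callback with token usage for every call, which can then be forwarded to a metering endpoint (the URL below is a placeholder, not Lago's or OpenMeter's actual API).

    # Generic sketch: capture per-request LLM usage via a LiteLLM success callback
    # and forward it to a metering endpoint (placeholder URL).
    import litellm
    import requests


    def track_usage(kwargs, completion_response, start_time, end_time):
        usage = getattr(completion_response, "usage", None)
        event = {
            "model": kwargs.get("model"),
            "prompt_tokens": getattr(usage, "prompt_tokens", None),
            "completion_tokens": getattr(usage, "completion_tokens", None),
            "latency_s": (end_time - start_time).total_seconds(),
        }
        requests.post("http://metering.kube.local/events", json=event, timeout=5)


    litellm.success_callback = [track_usage]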

Layer 04: AI Applications

This layer represents the tangible end-user implementations of generative models, demonstrating their practical value. These applications, such as text, code, image, and video generation tools, leverage advanced AI to automate tasks, enhance productivity, and drive innovation across various domains. By showcasing real-world uses of AI, this section highlights how generative models can solve specific problems, streamline workflows, and create new opportunities. Without this layer, the benefits of advanced AI would remain theoretical, and users would not experience the transformative impact of these technologies in their daily lives.

GPT-like Prompting Interface

Install Hugging Face Web Interface:

Set up the Hugging Face web interface for model management and prompting.

API Setup

Deploy API Server:

Set up an API server to provide programmatic access to the LLM and RAG services.
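What this API server looks like depends on the stack; one possible sketch is a small FastAPI service that exposes a single endpoint and forwards prompts to the API Core (the LiteLLM URL, key, and model alias are placeholders).

    # Hypothetical minimal API server: one endpoint forwarding prompts to the API Core.
    # Run with: uvicorn app:app --host 0.0.0.0 --port 8000
    from fastapi import FastAPI
    from openai import OpenAI
    from pydantic import BaseModel

    app = FastAPI()
    llm = OpenAI(base_url="http://litellm.kube.local:4000/v1", api_key="sk-placeholder")


    class PromptRequest(BaseModel):
        prompt: str


    @app.post("/v1/generate")
    def generate(req: PromptRequest):
        resp = llm.chat.completions.create(
            model="llama3-8b",  # alias configured in the API Core
            messages=[{"role": "user", "content": req.prompt}],
        )
        return {"answer": resp.choices[0].message.content}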

RAG Interface

Configure RAG UI:

Implement a user interface for interacting with the RAG system.

Dev Copilot

Deploy Dev Copilot:

Set up a developer copilot connected to the LLM and RAG services so developers can converse with their codebase while coding.

Low Code LLM Applications Tool

Deploy Low Code Tool:

Install a low code tool for building LLM-based applications.

Inside Look: Gen AI Event at Iguana Solutions' Paris Office

Experience feedback: Gen AI implementation @Easybourse

Explore GenAI’s impact on professional services: from LLMs’ pros and cons to RAG’s benefits, challenges, and improvements, and its application at Iguana Solutions.

Consumer tools for LLMs bridge the gap between the LLM core API and practical applications. These tools empower developers to integrate generative models into real-world systems, augmenting them with contextual information using RAG or employing tool agents to build an army of LLMs. These tools are vital as they serve as interfaces between the AI platform and end-user applications. They offer critical capabilities such as user and model management interfaces, API key management, document interfaces for enriching RAG context, a comprehensive developer Copilot enabling developers to converse with their codebase for better coding, and a low-code interface for building applications effortlessly without coding. These plug-and-play services make it easier for developers and team members to incorporate AI into their daily routines.

“With our previous partner, our ability to grow had come to a halt. Opting for Iguana Solutions allowed us to multiply our overall performance by at least 4.”

Cyril Janssens

CTO, easybourse


Our Platforms for Gen AI

Revolutionize Your AI Capabilities with Plug-and-Play Gen AI Platforms

We offer innovative Gen AI platforms that make AI infrastructure effortless and powerful. Harnessing NVIDIA’s H100 and H200 GPUs, our solutions deliver top-tier performance for your AI needs.

Our platforms adapt seamlessly, scaling from small projects to extensive AI applications, providing flexible and reliable hosting. From custom design to deployment and ongoing support, we ensure smooth operation every step of the way. In today’s fast-paced AI world, a robust infrastructure is key. At Iguana Solutions, we’re not just providing technology; we’re your partner in unlocking the full potential of your AI initiatives. Explore how our Gen AI platforms can empower your organization to excel in the rapidly evolving realm of artificial intelligence.

Contact Us
