In the world of AI as it currently exists, engineers hear a lot of the “buzzy/hype” pieces around it. Very rarely do they hear about the benefits of AI on Kubernetes, as the conversation is primarily about GenAI (which isn’t the only form of AI).
Instead, they should be hearing about the benefits from a technical perspective. Things like:
- A smaller footprint with AI on Kubernetes.
- How AI works underneath the hood.
- What the benefits are from an ecosystem perspective.
And the overall engineering journey necessary to implement it, because as AI continues to grow, it will become part of an engineer’s (primarily a Platform Engineer’s) day-to-day life.
In this blog post, you’ll learn not only about the theory behind AI on Kubernetes, but how to implement it yourself right now.
Prerequisites
Have you heard of the GPU (graphics card) shortage due to AI? The reason is that running and building AI workloads (training data models with Machine Learning) generally requires powerful machines and powerful GPUs. Although the need for several GPUs is mitigated when running AI workloads on Kubernetes, powerful clusters are still necessary.
To run Kubeflow, you will need a cluster with the following minimum specs:
- A Kubernetes cluster running:
- Kubernetes v1.27 or above
- 32 GB of RAM recommended
- 16 CPU cores recommended
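If you want to sanity-check an existing cluster against these numbers before installing anything, a few standard kubectl commands are enough (output formats vary by Kubernetes version and provider):
# Confirm the cluster is reachable and the server is v1.27 or above.
kubectl version
kubectl get nodes
# Inspect allocatable CPU and memory per node to confirm you have roughly
# 16 cores and 32 GB of RAM available across the cluster.
kubectl describe nodes | grep -A 6 "Allocatable"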
What Is Kubeflow
Before diving into the thick of things, let’s briefly discuss some “AI” related concepts.
First, AI isn’t just ChatGPT or other Generative AI (GenAI) solutions. AI is the concept of taking data that a system was trained on (learned from) and expanding on it to formulate an output of its own. The data comes from data sets, which are, as the name suggests, sets of data. A data set could be anything from complex data to an Excel spreadsheet with 3 rows and 4 columns. Data sets are then fed to data models during training, and the trained data models are then fed into AI workloads.
💡 These explanations are high-level for good reason: these concepts take up entire books and blog posts in themselves. However, they should give you a good starting point.
To train the data models, you need specific software. For example, TensorFlow is a big name in the AI/ML space that gives you the ability to train models. Kubeflow takes various tools and software that already exist in the AI/ML space and makes them usable on Kubernetes.
Source: https://www.kubeflow.org/docs/started/introduction/
The goal of Kubeflow is to take existing tools and provide a straightforward way to deploy them on Kubernetes. Kubeflow offers both a standalone approach, where you deploy particular pieces of Kubeflow, and a method to deploy all of the tools available within Kubeflow.
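As a rough sketch of what that looks like with the kubeflow/manifests repository used later in this post, the difference between the two approaches is mostly which kustomization you build (the standalone path below is a placeholder; the exact directory depends on the component and release you check out):
# Full platform: build the top-level example kustomization and apply it
# (the same command used in the vanilla installation later in this post).
kustomize build example | kubectl apply -f -
# Standalone: build only the kustomization for a single component instead.
# The path is a placeholder; check the repository README for the real one.
kustomize build apps/<component>/upstream/<overlay> | kubectl apply -f -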
The overall idea is for engineers to have the ability to build and train models in an easier, more efficient fashion.
Kubeflow isn’t its own entity. It takes tools that already exist, puts them together in one place, and allows them to be used via Kubernetes.
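For example, once the Kubeflow Training Operator is installed, a distributed TensorFlow training job becomes just another Kubernetes object you apply with kubectl. The sketch below is illustrative only; the image and command are placeholders for your own training code, and the exact fields depend on the operator version you run.
# Submit a TFJob custom resource (provided by the Kubeflow Training Operator).
cat <<'EOF' | kubectl apply -f -
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: demo-training-job
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              # Placeholder image and entrypoint for your own training code.
              image: registry.example.com/my-tf-training:latest
              command: ["python", "train.py"]
EOF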
Tools Outside Of Kubeflow
There are a few different tools outside of Kubeflow. They aren’t Kubernetes-native and they’ll require you to learn “their way” of doing things, but it’s still good to understand the other options that exist:
- mlflow: https://mlflow.org/docs/latest/deployment/deploy-model-to-kubernetes/index.html
- TensorFlow (part of the Kubeflow ecosystem): https://www.tensorflow.org/tfx/serving/serving_kubernetes
- Ray: https://docs.ray.io/en/latest/
There’s also a stack called JARK, which consists of Jupyter, ArgoCD, Ray, and Kubernetes: https://aws.amazon.com/blogs/containers/deploy-generative-ai-models-on-amazon-eks/
How Do Kubeflow And Kubernetes Help Each Other
As mentioned in the previous section, the whole idea of Kubeflow isn’t to create more tools and software that you have to learn and manage. It’s to take the existing ML/AI-related tools and give you one location to use them all versus having to manage them all as single entities.
The primary stack you’ll see used with Kubeflow is:
- Istio
- Jupyter Notebooks
- PyTorch
- TensorFlow
- RStudio
Source: https://www.kubeflow.org/docs/started/architecture/
With the tools available, you will have:
- Data preparation
- Model training
- Experiments and Runs
- Prediction serving
- Pipelines
- The ability to test and write with Notebooks
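Under the hood, most of these capabilities are exposed as Kubernetes custom resources, so once Kubeflow is installed (covered below) you can discover them with ordinary kubectl commands; the exact list depends on which components you deployed:
# List the Kubeflow API groups registered in the cluster, such as notebooks
# and Katib experiments.
kubectl api-resources | grep kubeflow.org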
Here’s a quote from Kristina Devochko, CNCF Ambassador and senior engineer: “There are some interesting examples of the potential AI has in helping the environment and society as a whole. For example, projects that won the CloudNativeHacks hackathon, where AI was used to spread awareness and automatically detect and monitor deforestation on a global level. Or the usage of AI in projects for reducing food waste. However, it’s important to note that AI resource consumption is significant and increasing, not only when it comes to carbon but also water and electricity, so we must keep working on optimizing AI.”
With a quote like that from someone who’s deep in the Kubernetes and Sustainability space, it makes sense to utilize something like Kubernetes to decouple and simplify resources and workloads as much as possible.
How Kubeflow Helps Platform Engineering
Platform Engineering, at its core, is about making tools, services, and add-ons easier to use. If you’re an engineer or developer, you don’t want to learn all of the underlying capabilities. You want the ability to use them to get your job done, but you don’t have the bandwidth to become a master of them all. Platform Engineering makes using the tools more straightforward without requiring you to master each one.
Kubeflow helps Platform Engineering by being readily available on Platform Engineering’s underlying platform of choice, Kubernetes. With the ability to manage anything from containers to virtual machines to resources outside of Kubernetes, all with Kubernetes, adding AI capabilities is the icing on the cake.
Kubeflow Installation And Configuration
So far in this blog post, we’ve gone into the engineering details behind the “how and why” of Kubeflow. Let’s now dive into the hands-on portion and install Kubeflow.
You’ll see three sections below: one for AKS, one for EKS, and one for a vanilla installation that works on any Kubernetes cluster. There will be differences in how you install Kubeflow, as the underlying infrastructure it resides on (the cloud) requires particular resources to be installed or offers several different options. For example, on AWS there are a ton of configuration options for RDS, S3, Cognito, and more, based on how you want to use Kubeflow.
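Whichever path you choose, the steps below assume you have git, kubectl, and kustomize available on your workstation, along with access to the target cluster. A quick way to confirm the tooling is in place:
# Confirm the client-side tools used in the installation steps are installed.
git version
kubectl version --client
kustomize version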
AKS
Ensure that you have the prerequisites: https://azure.github.io/kubeflow-aks/main/docs/deployment-options/prerequisites/
- Clone the kubeflow-aks repo.
git clone --recurse-submodules https://github.com/Azure/kubeflow-aks.git
- cd into the repo.
cd kubeflow-aks
- cd into the manifests directory.
cd manifests/
- Check out the v1.7 branch and go back to the root directory.
git checkout v1.7-branch
cd ..
- Install Kubeflow.
cp -a deployments/vanilla manifests/vanilla
cd manifests/
while ! kustomize build vanilla | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
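The while loop retries because some resources depend on CRDs that are only created partway through the apply. Once it finishes, give the pods a few minutes and check that everything comes up before moving on:
# Verify the Kubeflow and Istio pods reach the Running state.
kubectl get pods -n kubeflow
kubectl get pods -n istio-system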
EKS
Ensure you have the proper prerequisites: https://awslabs.github.io/kubeflow-manifests/docs/deployment/prerequisites/
With EKS, there are a lot more options available, and instead of writing them all out in this blog post, you can see the installation configurations here: https://awslabs.github.io/kubeflow-manifests/docs/deployment/vanilla/guide/
Vanilla Installation
Aside from cloud-based installations, there’s a vanilla installation that (theoretically) works on any Kubernetes cluster.
- Clone the Kubeflow manifests repo.
git clone https://github.com/kubeflow/manifests.git
- Check out the latest release from inside the cloned manifests directory. For example, the commands below check out the v1.8 branch.
cd manifests
git checkout v1.8-branch
- From the manifests directory, run the following:
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 20; done
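As with the AKS install, the loop retries until every CRD-backed resource applies cleanly. While it runs (and after it finishes), you can check progress per namespace; the exact set of namespaces can vary slightly between releases:
# Core Kubeflow components.
kubectl get pods -n kubeflow
# Supporting pieces installed alongside Kubeflow.
kubectl get pods -n cert-manager
kubectl get pods -n istio-system
kubectl get pods -n auth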
- Once everything is installed, you can access Kubeflow by port forwarding the Istio ingress gateway and logging into the dashboard with the default credentials.
Default username: user@example.com
Default password: 12341234
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
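With the port-forward running, the dashboard is served through the Istio ingress gateway, so you can confirm the service exists and then open the UI in a browser:
# Confirm the ingress gateway service targeted by the port-forward exists.
kubectl get svc istio-ingressgateway -n istio-system
# With the port-forward active, the dashboard is available at
# http://localhost:8080 (log in with the default credentials above).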
Closing Thoughts
Kubeflow is currently the de facto standard for using ML and AI on Kubernetes with tools and software that already exist. Are there other options? Absolutely. The thing is, with other options you’ll be learning new tools and APIs versus the tools and APIs that you’re already using.
Kubeflow incorporates both new tools (like Katib and Model Registry) and software that has already existed in the AI/ML space (like PyTorch) and puts them in one stack, which is a good thing. It means you don’t have to reinvent the wheel by learning a ton of new tools and workflows. If you’re already in AI and ML, you’ll be well familiar with the existing toolset.