ML Code Examples

The following example machine learning scripts, each briefly explained, can be submitted to NuNet by URL through our Service Provider Dashboard (SPD). You can also deploy and test your own ML code.

Teaching a Machine to Recognize Images with PyTorch and the CIFAR-10 Image Set (Not Checkpointed)

This Python code is for training and testing a simple convolutional neural network (CNN) using PyTorch on the CIFAR-10 dataset.

https://gitlab.com/nunet/ml-on-gpu/ml-on-gpu-service/-/raw/develop/examples/pytorch/cifar-10.py

Check for GPU availability: The code checks if a GPU is available for faster computation. If it is, the code will utilize the GPU.
Load and preprocess the images: The code retrieves the CIFAR-10 dataset and preprocesses the images, preparing them for the machine learning process.
Display random images*: The code contains a function to display a few random images from the dataset that will be used for training.
Define a Convolutional Neural Network (CNN) architecture: The code creates a CNN architecture to guide the machine in learning how to recognize images.
Prepare the machine for training: The code configures the machine to follow the CNN architecture and determines the approach to learning from mistakes.
Train the machine: The code trains the machine using the images for 100 iterations (epochs). It keeps track of the machine's performance during training.
Test the trained machine: After training, the code evaluates how well the machine can identify images it has not seen before, using a test set of images.
Evaluate the machine's performance: The code calculates the overall accuracy of the machine in identifying the test images, as well as the accuracy for each category of images in the dataset.

However, this code does not include a checkpointing system to save the machine's learning progress. If training is interrupted, the process will have to start from the beginning.
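The evaluation step described above amounts to counting correct predictions overall and per category. Here is a minimal plain-Python sketch of that arithmetic, assuming the true and predicted labels are already available as lists. The function name `accuracy_report` and the sample labels are illustrative and are not taken from the script.

```python
from collections import Counter

def accuracy_report(true_labels, predicted_labels):
    """Return overall accuracy and a dict of per-class accuracy."""
    if not true_labels:
        raise ValueError("at least one label is required")
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    # How many samples of each class exist, and how many were predicted correctly.
    total_per_class = Counter(true_labels)
    correct_per_class = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    overall = sum(correct_per_class.values()) / len(true_labels)
    per_class = {
        cls: correct_per_class[cls] / count
        for cls, count in total_per_class.items()
    }
    return overall, per_class

# Illustrative labels only (not real CIFAR-10 results).
true_labels = ["cat", "cat", "dog", "ship", "ship", "ship"]
predicted = ["cat", "dog", "dog", "ship", "ship", "cat"]
overall, per_class = accuracy_report(true_labels, predicted)
```

Per-class accuracy is useful because a good overall score can hide a category that the model rarely gets right.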

Teaching a Machine to Recognize Images with PyTorch and the CIFAR-10 Image Set (Checkpointed)

This code is a modified version of the previous one. The main changes are related to adding functionality for checkpointing and resuming training from a saved state.

https://gitlab.com/nunet/ml-on-gpu/ml-on-gpu-service/-/raw/develop/examples/pytorch/cifar-10_checkpointed.py

Here's the explanation:

Imports and configurations: The code imports necessary libraries and sets up the device for using GPU or CPU, based on availability.
Data preparation: It loads and preprocesses the CIFAR10 dataset for training and testing purposes.
Visualizing images*: The code provides a function to display images from the dataset.
Defining the network architecture: The code defines a Convolutional Neural Network (CNN) with two convolutional layers, two pooling layers, and three fully connected layers.
Checkpoint and resume functionality: The code reads and writes the epoch count from a text file and saves the model weights in a file. If a checkpoint file exists, it resumes training from the last saved state.
Training the model: The training loop is modified to include saving the current model state after each epoch, and resuming from the saved state if needed.
Evaluating the model: The code tests the model's performance on the test dataset, calculating overall accuracy and accuracy per class.

Overall, this version of the code is designed for checkpointing and resuming the training process, making it more convenient when training is interrupted or needs to be paused.
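The resume logic described above can be sketched without any ML framework. This is a minimal illustration, assuming an epoch counter kept in a text file and the model state kept in a second file. The file names, the `train_one_epoch` stand-in, and the use of `pickle` are assumptions made for the example; the actual script saves PyTorch model weights.

```python
import os
import pickle
import tempfile

def load_checkpoint(epoch_path, state_path):
    """Return (next_epoch, state); start fresh if no checkpoint exists."""
    if os.path.exists(epoch_path) and os.path.exists(state_path):
        with open(epoch_path) as f:
            next_epoch = int(f.read().strip())
        with open(state_path, "rb") as f:
            state = pickle.load(f)
        return next_epoch, state
    return 0, {"weights": 0.0}

def save_checkpoint(epoch_path, state_path, next_epoch, state):
    """Persist the state first, then the epoch counter."""
    with open(state_path, "wb") as f:
        pickle.dump(state, f)
    with open(epoch_path, "w") as f:
        f.write(str(next_epoch))

def train_one_epoch(state):
    """Hypothetical stand-in for a real training epoch."""
    return {"weights": state["weights"] + 1.0}

def train(total_epochs, epoch_path, state_path, stop_after=None):
    """Train up to total_epochs, resuming from any saved checkpoint."""
    start_epoch, state = load_checkpoint(epoch_path, state_path)
    for epoch in range(start_epoch, total_epochs):
        if stop_after is not None and epoch >= stop_after:
            break  # simulate an interruption
        state = train_one_epoch(state)
        save_checkpoint(epoch_path, state_path, epoch + 1, state)
    return state

checkpoint_dir = tempfile.mkdtemp()
epoch_file = os.path.join(checkpoint_dir, "epoch.txt")
state_file = os.path.join(checkpoint_dir, "model.pkl")

train(5, epoch_file, state_file, stop_after=2)  # interrupted after 2 epochs
final_state = train(5, epoch_file, state_file)  # resumes at epoch 2
```

Saving after every epoch means an interruption costs at most one epoch of work.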

Building a User-friendly Chatbot with T5 and Gradio

This code sets up a chatbot using the Gradio library for the interface and the T5 large model from Hugging Face's Transformers library as the backbone.

https://gitlab.com/nunet/ml-on-gpu/ml-on-gpu-service/-/raw/develop/examples/pytorch/flan-t5-large_chatbot-multi-gpu.py

Here's an overview of the code:

Import libraries: Gradio, PyTorch, and Transformers are imported.
Hyperparameters: The code defines various hyperparameters for the model's text generation process, such as the maximum sequence and output lengths, number of beams for beam search, length penalty, and more.
Load the model and tokenizer: The pre-trained T5 large model and its corresponding tokenizer are loaded. The model is set up to run on GPU(s) if available.
Define the chatbot function: The chatbot() function takes user input, tokenizes it, feeds it to the T5 model, generates a response, and decodes the output tokens back into text.
Create a Gradio interface: The Gradio library is used to create a simple user interface for interacting with the chatbot. A text input box and a text output box are provided, along with a title and description.
Launch the Gradio interface: The Gradio interface is launched, and a shareable link is created.

This code sets up a chatbot using the T5 large model, providing an easy-to-use interface for users to ask questions and receive responses.

Chatbot that Remembers Conversations and Saves Them to a File

This code sets up a chatbot that can remember the conversation and save it to a file. It uses the T5 model from the Transformers library and Gradio for a user-friendly interface. The chatbot will work on your computer's GPU if available.

https://gitlab.com/nunet/ml-on-gpu/ml-on-gpu-service/-/raw/develop/examples/pytorch/flan-t5-large_chatbot-multi-gpu_checkpointed.py

Here's what the code does:

It imports the needed tools (Gradio, PyTorch, os, and Transformers).
The code sets some rules (hyperparameters) to help the model give better answers.
The T5 model and tokenizer are loaded to understand and process text. The model will use your computer's GPU if possible, making it work faster.
The code checks if a file named "conversation.txt" exists. If not, it creates one to save the conversation.
A chatbot function is created that opens the conversation file, reads the previous conversation, and adds new input and output to the file. It also processes the input and generates a response using the T5 model.
Using Gradio, a simple chat window is created for you to ask questions and see the answers. The chatbot will remember the conversation and display the updated conversation after each response.
Finally, the chat window is launched, and you can share it with others if you want.

This code helps you set up a chatbot that can remember and save conversations, making it fun and easy to interact with the T5 model while keeping track of the discussion.
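The conversation-saving behaviour described above can be sketched independently of the model. This is a minimal illustration, assuming the history is stored as plain text. The `generate_reply` function is a hypothetical stand-in for the T5 call, and the "User:" and "Bot:" line format is an assumption, not necessarily what the script writes.

```python
import os
import tempfile

def generate_reply(user_input):
    """Hypothetical stand-in for the T5 model call."""
    return "You said: " + user_input

def chatbot(user_input, history_path):
    """Append one exchange to the history file and return the full history."""
    # Create the file on first use so later reads never fail.
    if not os.path.exists(history_path):
        open(history_path, "w", encoding="utf-8").close()
    reply = generate_reply(user_input)
    with open(history_path, "a", encoding="utf-8") as f:
        f.write(f"User: {user_input}\nBot: {reply}\n")
    with open(history_path, encoding="utf-8") as f:
        return f.read()

history_file = os.path.join(tempfile.mkdtemp(), "conversation.txt")
chatbot("Hello", history_file)
history = chatbot("How are you?", history_file)
```

Because the history lives in a file rather than in memory, it survives a restart of the chatbot process.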

Simple Chatbot Using T5 Model and Gradio Interface

This code is similar to the previous two. It sets up a basic chatbot using the T5 large model from the Transformers library and Gradio for a user-friendly interface. The chatbot will run on your computer's GPU if available.

https://gitlab.com/nunet/ml-on-gpu/ml-on-gpu-service/-/raw/develop/examples/pytorch/flan-t5-large_chatbot.py

Here's what the code does:

It imports the required tools (Gradio, PyTorch, and Transformers).
The T5 model and tokenizer are loaded, which helps the chatbot understand and process text. The model will use your computer's GPU if possible, making it work faster.
A chatbot function is created that processes user input by tokenizing it and converting it into a PyTorch tensor. It then generates a response using the T5 model with specified settings.
The generated response is decoded, and any special tokens are removed before it's returned.
Using Gradio, a simple chat window is created for users to input text and see the chatbot's response.
Finally, the chat window is launched, allowing users to interact with the T5 model and see its responses.

This code helps you set up a simple and interactive chatbot, making it easy to use the T5 model to generate responses for user inputs.

Simple Chatbot Using GPT2 and Gradio

This chatbot is similar to the previous ones.

https://gitlab.com/nunet/ml-on-gpu/ml-on-gpu-service/-/raw/develop/examples/pytorch/gpt2-large_chatbot.py

Here's what it does:

This code creates a chatbot using a smart model called GPT-2 and a tool named Gradio that makes it easy to talk to the chatbot.
The code uses tools from the Transformers library to load the GPT-2 model and set it up.
A special token is added to help the model understand when a message starts and ends.
The code has a chatbot function that changes your message into a form the model understands, makes the model think of a reply, and then changes the reply back to normal text.
A simple Gradio chatbot interface is made, so you can ask questions and get answers from the chatbot.

Comparing this code with the previous ones, this one uses a model named GPT-2 instead of another model called T5. The way the model is loaded is a little different, but the overall idea of creating a chatbot using Gradio remains the same.

Training a Machine Learning Model with Dummy Data - Testing on multiple GPUs (Checkpointed)

This code uses nn.DataParallel to train the model on multiple GPUs, splitting each batch across the available devices to speed up training.

https://gitlab.com/nunet/ml-on-gpu/ml-on-gpu-service/-/raw/develop/examples/pytorch/multi-gpu-test_checkpointed.py

The code checks whether a GPU is available and displays information about the available devices.
Defines a PyTorch model to be trained with an example linear function.
Initializes the model and moves it to GPU(s) if available.
Defines the loss function and optimizer.
Generates dummy input data for the model.
Attempts to read the epoch count from a file; if the file doesn't exist, creates it with a default value of 0.
Defines two helper functions to save and load the epoch count to/from a file.
Defines a function to initialize the model's weights.
Tries to load the model's weights from the last saved checkpoint. If there are no saved checkpoints, initializes the model's weights.
Defines the total number of epochs to run and the interval for printing the loss.
Trains the model for a specified number of epochs, printing the loss every few epochs.
Saves the model's weights and epoch count to a file after each epoch.

Please note that the sole purpose of this code was to test simultaneous execution of the process on multiple GPUs.
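As a conceptual illustration of the data-parallel idea used in this example, the sketch below splits a batch into contiguous chunks, applies the same function to each chunk, and gathers the results in order. It is plain Python, not the PyTorch API. The helper names `split_batch` and `data_parallel_forward` are hypothetical, and the chunks run one after another here, whereas on real hardware each chunk would run on its own GPU.

```python
import math

def split_batch(batch, num_devices):
    """Split a batch into contiguous chunks, at most one per device."""
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    if not batch:
        return []
    chunk_size = math.ceil(len(batch) / num_devices)
    return [batch[i:i + chunk_size] for i in range(0, len(batch), chunk_size)]

def data_parallel_forward(model_fn, batch, num_devices):
    """Apply model_fn to each chunk and gather the outputs in order."""
    outputs = []
    for chunk in split_batch(batch, num_devices):
        # Each chunk stands for the share of the batch sent to one device.
        outputs.extend(model_fn(x) for x in chunk)
    return outputs

def linear(x):
    """Example linear function standing in for the model."""
    return 2.0 * x + 1.0

batch = [0.0, 1.0, 2.0, 3.0, 4.0]
chunks = split_batch(batch, 2)
outputs = data_parallel_forward(linear, batch, 2)
```

The gathered output is the same as running the whole batch on one device; only the work is divided.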

Train and Generate Text with PaLM Model using PyTorch (Not Checkpointed)

This code is a PyTorch implementation of the PaLM model for text generation. It was written by Phil Wang and is licensed under the MIT license.

https://gitlab.com/nunet/ml-on-gpu/ml-on-gpu-service/-/raw/develop/examples/pytorch/palm-rlhf.py

It trains and generates text by following the steps:

The code uses PyTorch and the Palm-rlhf-pytorch library to implement the PaLM model for training and inference.
The enwik8 dataset (a subset of Wikipedia) is used for training the model. It is downloaded with a curl command and saved locally in the data directory.
A TextSamplerDataset is defined to create train and validation datasets from the enwik8 dataset.
A PaLM model is instantiated with hyperparameters and moved to the device (GPU) using the accelerator.
An optimizer is created to optimize the PaLM model using the Adam optimization algorithm.
The training is done for a fixed number of batches, where the loss is calculated using PaLM model on training dataset, the gradients are accumulated and backpropagated, and the parameters are updated using the optimizer.
Validation loss is also calculated every few batches to check the performance of the model on the validation dataset.
The model is also used for generating text after every few batches.
After training, the model is used for inference by getting user input, generating text using the trained PaLM model, and printing the output.

In summary, this code trains a neural network using the PaLM architecture to generate text similar to a given dataset. The dataset used here is enwik8, which is a 100 MB dump of the English Wikipedia. The code defines a PaLM model and uses PyTorch DataLoader to feed the data to the model. It uses an accelerator library to distribute the computation across available devices, such as GPUs. Finally, it allows the user to test the generated model by inputting a sentence, and the model responds with a predicted output.
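To illustrate what a text sampler dataset of this kind does, here is a plain-Python sketch that serves random fixed-length windows from a long byte sequence. It is an assumption about the general technique, not a copy of the script's TextSamplerDataset, which works on PyTorch tensors. The class name `TextSampler`, the 90/10 split, and the sample text are all hypothetical.

```python
import random

class TextSampler:
    """Serve random fixed-length windows from a long byte sequence."""

    def __init__(self, data, seq_len, seed=None):
        if len(data) <= seq_len:
            raise ValueError("data must be longer than seq_len")
        self.data = data
        self.seq_len = seq_len
        self.rng = random.Random(seed)

    def __len__(self):
        # Number of non-overlapping windows, used as a nominal dataset size.
        return len(self.data) // self.seq_len

    def __getitem__(self, index):
        # The index is ignored: every access draws a fresh random window.
        start = self.rng.randrange(0, len(self.data) - self.seq_len)
        # One extra element so inputs and shifted targets share one slice.
        return self.data[start:start + self.seq_len + 1]

# Small stand-in corpus; the real script reads the enwik8 file instead.
text = b"NuNet runs machine learning jobs on community hardware. " * 4
split = int(len(text) * 0.9)
train_data, valid_data = text[:split], text[split:]

train_set = TextSampler(train_data, seq_len=16, seed=0)
window = train_set[0]
inputs, targets = window[:-1], window[1:]
```

Each window carries one extra element so that the targets are simply the inputs shifted by one position, which is the usual setup for next-token prediction.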

Train and Generate Text with PaLM Model using PyTorch (Checkpointed)

This code trains a PaLM language model on the enwik8 dataset. It trains the model on GPUs using PyTorch's DataParallel, which distributes the work across multiple GPUs. The code also implements checkpointing, which allows training to resume from the last checkpoint instead of restarting from the beginning.

https://gitlab.com/nunet/ml-on-gpu/ml-on-gpu-service/-/raw/develop/examples/pytorch/palm-rlhf_checkpointed.py

Revisiting the points with checkpointing:

This code uses a PyTorch implementation of the PaLM model and the Accelerator library for distributed computing on GPUs.
The code downloads the enwik8 dataset and divides it into training and validation sets.
The code uses TextSamplerDataset to load data into PyTorch DataLoader, which is then used for training.
It uses the Adam optimizer for training and also employs a learning rate scheduler.
The code implements checkpointing to save model weights and optimizer states at a defined interval and allows resuming from the last checkpoint.
The code trains the model for a defined number of batches and validates after a defined interval.
It generates a sample text after a defined interval and also provides an option for the user to enter a prompt for generating text.
The training and generation logs are displayed using the tqdm library.

In summary, this code trains a PaLM language model on the enwik8 dataset using PyTorch and the Accelerator library. It implements checkpointing to resume training from the last checkpoint and uses PyTorch's DataParallel for distributed computing on GPUs. The model is trained for a defined number of batches and generates sample text after a defined interval. The user can also enter a prompt for generating text. The training and generation logs are displayed using the tqdm library.

Fashion MNIST Image Classification using TensorFlow and Keras (Not Checkpointed)

This code implements image classification on the Fashion MNIST dataset with TensorFlow and Keras. The dataset consists of images of clothing items such as shirts, shoes, and trousers. The model is trained to classify these images into different categories.

https://gitlab.com/nunet/ml-on-gpu/ml-on-gpu-service/-/raw/develop/examples/tensorflow/fashion-mnist.py

Here's how it works:

The code imports the TensorFlow and Keras libraries.
The Fashion MNIST dataset is loaded using Keras.
The images are shown using matplotlib.
The images are normalized to the 0-1 range.
A sequential model is created using Keras with two dense layers.
The model is compiled using the Adam optimizer and sparse categorical crossentropy loss.
The model is trained using the training data.
The test loss and accuracy are evaluated using test data.
The model is used to predict the labels for the test data.
Functions are defined to plot the images and the predicted labels.
Plots are generated to show the predicted labels and true labels for some test images.
An individual image is selected and its label is predicted using the model.

In summary, this code trains a machine learning model to classify images of clothing from the Fashion-MNIST dataset. The code first loads and preprocesses the data, then builds and trains a sequential neural network model using the TensorFlow library. The trained model is used to make predictions on test data and visualize its performance through plots of images and their corresponding predicted and true labels. Finally, the model is used to predict the class of a single image.

Fashion MNIST Image Classification using TensorFlow and Keras (Checkpointed)

This code trains a neural network classifier on the Fashion-MNIST dataset and uses checkpointing to save and restore the model weights. Checkpointing allows training to be interrupted and resumed without losing progress. The checkpoint is saved after every epoch, and the number of epochs completed before the training was interrupted is recorded in a text file.

https://gitlab.com/nunet/ml-on-gpu/ml-on-gpu-service/-/raw/develop/examples/tensorflow/fashion-mnist_checkpointed.py

Here is a breakdown of the code:

The necessary libraries are imported.
The Fashion-MNIST dataset is loaded and preprocessed. The class names are also defined.
The first image in the training set is displayed using Matplotlib.
The images in the training set are normalized to values between 0 and 1.
The first 25 images in the training set are displayed using Matplotlib.
The neural network model is defined using Keras.
A checkpoint callback is created to save the weights after each epoch.
If the checkpoint file exists, the model weights are loaded, and training is resumed. Otherwise, a new checkpoint file is created.
A custom callback is defined to update the epoch counter in the text file at the end of each epoch.
The model is trained with the fit() method, using the checkpoint and counter callbacks.
The model is evaluated on the test set.
The predictions are computed for the test set.
Two functions are defined to display the predicted labels and confidence scores for each test image.
The predicted labels and confidence scores for two test images are displayed using Matplotlib.
The predicted labels and confidence scores for several test images are displayed using Matplotlib.
An individual test image is displayed, and its predicted label and confidence score are computed and displayed using Matplotlib.

In summary, this code trains a neural network classifier on the Fashion-MNIST dataset, using checkpointing to save and restore model weights and a custom callback to update the epoch counter. The predicted labels and confidence scores for test images are displayed using Matplotlib.
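Two of the steps above, normalizing pixel values and turning model scores into a predicted label with a confidence score, can be sketched in plain Python. This sketch assumes the model outputs one raw score (logit) per class and that a softmax converts them to probabilities; whether the script applies softmax inside or outside the model is not stated here. The class names are the standard Fashion-MNIST labels, and the scores are made up for illustration.

```python
import math

CLASS_NAMES = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits):
    """Return (class name, confidence) for one image's scores."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASS_NAMES[best], probs[best]

# Normalizing raw pixel values (0-255) to the 0-1 range.
raw_pixels = [0, 51, 255]
normalized = [p / 255.0 for p in raw_pixels]

# Illustrative scores for a single image (not real model output).
logits = [0.1, 0.2, 0.1, 0.0, 0.3, 0.1, 0.2, 0.1, 0.4, 3.0]
label, confidence = predict_label(logits)
```

The confidence is simply the probability assigned to the winning class, so it says how strongly the model prefers that class over the others, not whether the prediction is correct.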

Training a Convolutional Neural Network on CIFAR-10 dataset using PyTorch (for CPU-only machines)

Introduction: This code trains a convolutional neural network on the CIFAR-10 dataset using PyTorch. It includes loading and preprocessing the dataset, defining the neural network, training the model, and evaluating its performance on the test set.

https://gitlab.com/nunet/ml-on-gpu/ml-on-cpu-service/-/raw/develop/examples/cifar-10_cpu_checkpointed.py

How it works:

The code imports PyTorch and torchvision modules.
It still checks if a GPU is available and sets the device accordingly.
It normalizes the CIFAR-10 dataset using torchvision.transforms.
It loads the training and test data using torchvision.datasets.CIFAR10 and creates dataloaders for them using torch.utils.data.DataLoader.
It defines the class names for the CIFAR-10 dataset.
It defines a function to display an image from the dataset using matplotlib.
It displays a few random images from the training set and their labels.
It defines the neural network using nn.Module and initializes its weights using a custom function.
It checks if a checkpoint file exists and loads the model's weights from it if it does.
It defines the loss function, optimizer, and the number of epochs to train for.
It trains the model for the specified number of epochs using the training set and the defined optimizer and loss function.
It saves the model's weights and the current epoch count to a file after each epoch.
It evaluates the performance of the model on the test set and prints the accuracy.
It calculates the accuracy of the model for each class in the dataset and prints it.

Summary: This code trains a convolutional neural network on the CIFAR-10 dataset using PyTorch. It loads and preprocesses the data, defines the neural network, trains the model, and evaluates its performance on the test set. It also saves the model's weights and epoch count to a file after each epoch and calculates the accuracy of the model for each class in the dataset.

Building and Evaluating a Decision Tree Classifier on the Iris Dataset with scikit-learn (CPU-only)

The Iris dataset is a classic example in the field of machine learning used for classification tasks. In this code, we will use scikit-learn, which runs on the CPU, to build a Decision Tree Classifier on the Iris dataset. The code will train the classifier on 80% of the data and test it on the remaining 20% of the data. Finally, the code will evaluate the model's performance using accuracy, classification report, and confusion matrix.

https://gitlab.com/nunet/ml-on-gpu/ml-on-cpu-service/-/raw/develop/examples/cpu-ml-test-scikit-learn.py

What it does:

Load necessary libraries such as numpy, Scikit-learn's load_iris, train_test_split, DecisionTreeClassifier, accuracy_score, classification_report, and confusion_matrix.
Load the Iris dataset and separate input features (X) and output labels (y).
Split the dataset into train and test sets (80% training, 20% testing) using train_test_split.
Create a Decision Tree Classifier and fit it to the training data using DecisionTreeClassifier and fit methods.
Make predictions on the test set using predict method.
Evaluate the model's performance using accuracy_score, classification_report, and confusion_matrix methods.
Print the accuracy of the model on the test set.
Print the classification report, which shows precision, recall, f1-score, and support for each class.
Print the confusion matrix, which shows, for each true class, how many samples were assigned to each predicted class; correct predictions lie on the diagonal.

So in this code, we used scikit-learn, which runs on the CPU, to build a Decision Tree Classifier on the Iris dataset. The code split the dataset into 80% training and 20% testing sets, trained the classifier on the training set, and tested it on the test set. Finally, we evaluated the model's performance using accuracy, classification report, and confusion matrix. The accuracy of the model on the test set was printed, and the classification report and confusion matrix were shown to provide additional insights into the model's performance.
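To make the evaluation metrics concrete, the following plain-Python sketch computes a confusion matrix, accuracy, precision, and recall by hand. It follows the convention that rows are true classes and columns are predicted classes. The function names and the label lists are illustrative assumptions; the script itself relies on scikit-learn's own functions.

```python
def build_confusion_matrix(y_true, y_pred, num_classes):
    """Count samples per (true class, predicted class) pair."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def precision_recall(matrix, cls):
    """Precision and recall for one class, read off the matrix."""
    true_positives = matrix[cls][cls]
    predicted_as_cls = sum(row[cls] for row in matrix)  # column total
    actually_cls = sum(matrix[cls])                     # row total
    precision = true_positives / predicted_as_cls if predicted_as_cls else 0.0
    recall = true_positives / actually_cls if actually_cls else 0.0
    return precision, recall

# Illustrative labels for three classes (0, 1, 2); not real Iris results.
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 2, 2, 1, 2, 2]

matrix = build_confusion_matrix(y_true, y_pred, 3)
accuracy = sum(matrix[i][i] for i in range(3)) / len(y_true)
precision_1, recall_1 = precision_recall(matrix, 1)
```

Reading the matrix row by row shows where the classifier confuses one class with another, which a single accuracy number cannot reveal.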

To Understand ML Job Error Handling on NuNet

This ML code snippet raises an error, so it can be used to test the workflow and see how failed jobs are handled:

https://gitlab.com/-/snippets/2523096/raw/main/tensor-shape-pytorch-error.py

When you try to run this program, you will receive the following message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: The size of tensor a (5) must match the size of tensor b (3) at non-singleton dimension 0

This error occurs because we're trying to perform an operation (in this case, addition) on two tensors that don't have the same shape. In PyTorch, element-wise operations require the tensors to have the same shape or be broadcastable to a common shape. In this case, the shapes (5, 2) and (3, 2) are not compatible because the sizes along the first dimension do not match, and they can't be broadcasted to a common shape.
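The broadcasting rule mentioned above can be checked by hand: shapes are compared from the last dimension backwards, and each pair of sizes must either be equal or contain a 1. The sketch below implements that rule in plain Python as an illustration. The function name `broadcast_shape` is hypothetical and is not part of PyTorch.

```python
from itertools import zip_longest

def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise ValueError if incompatible."""
    result = []
    # Compare sizes from the last dimension backwards; missing dims count as 1.
    for a, b in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if a == b or b == 1:
            result.append(a)
        elif a == 1:
            result.append(b)
        else:
            raise ValueError(f"sizes {a} and {b} do not match and neither is 1")
    return tuple(reversed(result))

# Compatible: the second shape has size 1 where the sizes differ.
ok_shape = broadcast_shape((5, 2), (1, 2))

# Incompatible: 5 and 3 differ and neither is 1, as in the error above.
try:
    broadcast_shape((5, 2), (3, 2))
    mismatch_message = None
except ValueError as err:
    mismatch_message = str(err)
```

Running a check like this before an element-wise operation makes it easier to see which dimension is responsible for a shape mismatch.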

There are of course numerous other types of errors for other ML/Computational code that you might encounter while working with PyTorch (or any ML/Computational library for that matter), including but not limited to:

TypeError: This could happen when you pass arguments of the wrong type to a function. For example, passing a list where a tensor is expected.
ValueError: This could occur when you pass arguments of the correct type but incorrect value to a function. For example, passing negative integers to a function that expects positive integers.
IndexError: You may encounter this when you try to index a tensor using an invalid index.
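MemoryError: This occurs when the system runs out of memory, often when trying to allocate very large tensors. Note that PyTorch usually reports running out of GPU memory as a RuntimeError ("CUDA out of memory") rather than as a Python MemoryError.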
RuntimeError: This is a catch-all for various kinds of errors that can occur during execution. The tensor shape mismatch error we discussed earlier is a type of RuntimeError.

Here's an example of a code snippet that will give a TypeError:

import torch

# Create a list
list1 = [1, 2, 3, 4, 5]

# Try to perform a tensor operation on the list
result = torch.tanh(list1)

When you run this program, you'll receive a TypeError with the following message:

tanh(): argument 'input' (position 1) must be Tensor, not list

The error handling approach varies depending on the type of error. For this TypeError, you can handle it by converting the list to a tensor before performing the operation:

import torch

# Create a list
list1 = [1, 2, 3, 4, 5]

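# Convert the list to a floating-point tensor (older PyTorch versions do not support tanh on integer tensors)
tensor1 = torch.tensor(list1, dtype=torch.float32)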

# Perform the tensor operation
result = torch.tanh(tensor1)

From an ML developer's or researcher's perspective, it's always good practice to anticipate potential errors and handle them gracefully in the ML/Computational code.
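One way to handle errors gracefully is to wrap the job's entry point so that each failure is reported clearly and turned into an exit code. The sketch below uses only the standard library. The `run_job` wrapper, the two sample jobs, and the specific exit code values are assumptions for illustration; how NuNet interprets exit codes is not covered here.

```python
import sys
import traceback

def run_job(job, *args):
    """Run a job function and return a process-style exit code."""
    try:
        job(*args)
    except (TypeError, ValueError, IndexError) as err:
        # Likely a problem with how the job was called or with its input data.
        print(f"Input error: {type(err).__name__}: {err}", file=sys.stderr)
        return 2
    except MemoryError:
        print("Out of memory; try a smaller batch size.", file=sys.stderr)
        return 3
    except Exception:
        # Anything else, for example a RuntimeError raised by the ML library.
        traceback.print_exc()
        return 1
    return 0

def failing_job(values):
    """Hypothetical job that indexes past the end of its input."""
    return values[10]

def passing_job(values):
    """Hypothetical job that completes normally."""
    return sum(values)

failed_code = run_job(failing_job, [1, 2, 3])
passed_code = run_job(passing_job, [1, 2, 3])
```

In a real script, the last line would typically be `sys.exit(run_job(main))`, so that a failed job ends with a non-zero exit status.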

Deploying any GPU-based Python Project on NuNet

This tutorial will guide you through the process of deploying any GPU-based Python project on NuNet, which is our base platform that allows running machine learning or computational jobs. We will use a Python file URL and pip dependencies through the dashboard interface.

Please note that this tutorial assumes that your Python project is structured as a command-line-interface (CLI) based project with a requirements.txt file for specifying dependencies.

Prerequisites

Access to the service provider dashboard interface.
A GPU-based Python project hosted on a platform like GitLab or GitHub, with the main script and requirements.txt file accessible via URLs.

Steps

Prepare Your Python Script: Modify your main Python script to programmatically install dependencies from the requirements.txt file. The following sample code demonstrates how you might do this:

import subprocess
import os
import sys
import urllib.request

# Define the URL of your requirements.txt file
requirements_url = 'https://raw.githubusercontent.com/yourusername/yourrepo/master/requirements.txt'

# Download the requirements.txt file
urllib.request.urlretrieve(requirements_url, 'requirements.txt')

# Use pip to install the requirements with the same interpreter that runs this script
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-r', 'requirements.txt'])

# Rest of your code follows

# Define the home directory assuming your code saves all checkpoints/models/datasets in this location
home_dir = os.path.expanduser('~')

# Define the tar file name
tar_file = 'archive.tar.gz'

# Create a tar file of the home directory
subprocess.check_call(['tar', '-czf', tar_file, '-C', home_dir, '.'])

# Define the URL of your repository
repo_url = 'https://oauth2:{token}@gitlab.com/yourusername/yourrepo.git'.format(token=os.getenv('GITLAB_TOKEN'))

# Clone the repository
subprocess.check_call(['git', 'clone', repo_url])

# Move the tar file into the repository
subprocess.check_call(['mv', tar_file, 'yourrepo'])

# Change the current directory to the repository
os.chdir('yourrepo')

# Add the tar file to the repository
subprocess.check_call(['git', 'add', tar_file])

# Commit the changes
subprocess.check_call(['git', 'commit', '-m', 'Add tar file'])

# Push the changes
subprocess.check_call(['git', 'push'])

This script will:

Download and install the dependencies from the requirements.txt file.
Execute your main script's logic (which you would add after the "# Rest of your code follows" comment).
Tar the entire contents of the /home/$LOGNAME directory (assuming your code saves all checkpoints, models, and datasets at this location).
Push this tar file to the specified GitLab or GitHub repository.

Please replace the placeholders 'https://raw.githubusercontent.com/yourusername/yourrepo/master/requirements.txt' and 'https://oauth2:{token}@gitlab.com/yourusername/yourrepo.git' with the actual URLs of your requirements.txt file and GitLab or GitHub repository, respectively.

The {token} placeholder is filled in at run time with your Personal Access Token, which the script expects to find in an environment variable named GITLAB_TOKEN, so do not paste the token into the script itself. You can create this token temporarily and set it to expire around the same time your job is expected to finish.
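If the GITLAB_TOKEN variable is missing, `os.getenv` returns `None` and the clone URL ends up containing the text "None", which only fails later at the `git clone` step. A small guard can surface the problem earlier. This is a sketch, with the helper name `build_repo_url` and its parameters chosen for illustration:

```python
import os

def build_repo_url(repo_path, token_env="GITLAB_TOKEN"):
    """Build an authenticated GitLab clone URL from an environment variable."""
    token = os.getenv(token_env)
    if not token:
        # Without this check the URL would silently contain the text "None".
        raise RuntimeError(f"environment variable {token_env} is not set")
    return f"https://oauth2:{token}@gitlab.com/{repo_path}.git"
```

Calling `build_repo_url('yourusername/yourrepo')` before any long-running work lets the job stop immediately when the token is not configured.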

Navigate to the NuNet Dashboard: Launch the service provider dashboard via localhost:9991 in your preferred browser and navigate to the dashboard interface.
Enter the Python File URL: In the appropriate field, enter the URL of your modified Python script.
Specify the Dependencies: In the dependencies field, you might need to specify any dependencies that your script needs beyond those specified in the requirements.txt file. If all dependencies are included in the requirements.txt file, this field can be left blank.
Fill in the remaining relevant fields: Apart from the above two fields, also fill in the other fields as described in the service provider dashboard usage guidelines.
Execute the Job: Click the appropriate button to execute the job. The NuNet platform will download your script, install the necessary dependencies, and run the script.

Considerations

Complex Dependencies: If your project requires dependencies that cannot be installed with pip, you may need to find a workaround, such as including the installation process within your Python script.
Data Dependencies: If your project requires access to specific data files, you may need to modify your Python script to download or access these files.
Security: Only use this approach with trusted scripts and dependencies, as the platform will execute your script and install the specified packages without further confirmation.
Long-Running Processes: If your project initiates long-running processes, you'll need to manage and monitor these within the constraints of the NuNet platform.
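For the data dependency case above, one possible approach is to download each required file at the start of the script and skip the download when the file is already present. The sketch below uses only the standard library. The helper name `fetch_if_missing` and the ".part" suffix are assumptions for illustration.

```python
import os
import urllib.request
from pathlib import Path

def fetch_if_missing(url, destination):
    """Download url to destination unless the file is already there."""
    destination = Path(destination)
    if destination.exists():
        return False  # reuse the existing copy
    destination.parent.mkdir(parents=True, exist_ok=True)
    # Download to a temporary name first so a partial file is never reused.
    partial = destination.with_name(destination.name + ".part")
    urllib.request.urlretrieve(url, partial)
    os.replace(partial, destination)
    return True
```

The function returns True when a download took place and False when the existing file was reused, which makes it easy to log what happened on a resumed job.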

Following these steps should allow you to deploy and run a wide variety of GPU-based Python projects on the NuNet platform. While this method may not work for every project without some adjustments, it provides a flexible starting point for deploying projects on NuNet.


Last updated 2 years ago