Please read NuNet's Disclaimer before installing any software on your devices.
Download and install the Device Management Service (DMS): Install the DMS on your machine, which will enable you to connect to the NuNet platform and make your computing resources available for running ML jobs.
Download and install the Compute Provider Dashboard (CPD): Install the Compute Provider Dashboard on your local machine, making it accessible for you to manage and monitor the ML jobs assigned to your machine and receive NTX tokens for your contributions.
Onboard your machine: Please refer to our CPD and NuNet CLI user guides for complete details on how to onboard your machine.
Specify your Cardano address: During the DMS onboarding process, provide the Cardano address of the wallet connected to the Compute Provider Dashboard when using the NuNet CLI. This address will be used for receiving NTX token rewards for your deployment contributions. For Public Alpha we are using the PreProd Cardano Network.
Allocate your Resources: Completing the onboarding process makes your machine available to receive ML job requests while it is connected to the NuNet platform. Onboarding publishes the necessary information about your machine, such as the available CPU or GPU resources, RAM, and other hardware details, to NuNet's Distributed Hash Table (DHT). This information is used to match your machine with ML jobs that require the appropriate resources.
Create Cardano testnet wallet: Make sure you choose the testnet network when setting up your wallet (Eternl or Nami wallet). The specific Cardano testnet to use is the PreProd network.
Set up your wallet: Connect your Cardano wallet to the Compute Provider Dashboard at localhost:9992. Ensure that your wallet has enough tADA for transaction fees. This wallet will be used to receive NTX tokens on the PreProd Cardano Network as a reward for providing compute resources on NuNet. Make sure this wallet's address is the same one you specified in Step 3.1 when requesting your NTX tokens.
Receive ML jobs: Your machine will be automatically assigned ML jobs based on the resources needed and your machine's availability.
Receive NTX token rewards: Upon the successful completion of an ML job, you will receive NTX token rewards in your connected wallet. The amount of NTX tokens received will be based on the resources contributed and the terms set by the service provider.
Continue providing compute resources: Keep your machine connected to the NuNet platform and continue providing compute resources for ML jobs, earning NTX tokens as a reward for your contributions.
For a more comprehensive overview, you can always refer to our user guides in the Components Installation section.
This section describes the basic functionality of NuNet workflows.
Please read NuNet's Disclaimer before installing any software on your devices.
In order to effectively allocate resources for machine learning/computational tasks on the NuNet platform, it is essential to categorize the different resource types available. We have classified resource types into three main categories based on the capabilities of the machines and GPUs:
Low Resource Usage: This category represents low-end machines with low-end GPUs.
Moderate Resource Usage: This category represents medium-end machines with medium-end GPUs.
High Resource Usage: This category represents high-end machines with high-end GPUs.
The pseudocode provided outlines a function called `estimate_resource`, which is designed to estimate the resource parameters for different categories of machines based on their resource usage type. The function accepts a single input, `resource_usage`, which can take one of three possible values: "Low", "Moderate", or "High".
The function then checks the value of `resource_usage` and, depending on the category, sets the minimum and maximum values for the CPU, RAM, and GPU VRAM, as well as the estimated levels of GPU power and GPU usage. These values are assigned based on the specific resource usage category:
For "Low" resource usage, the function sets lower values for CPU, RAM, and GPU VRAM, as well as lower GPU power and usage levels. This category represents low-end machines with low-end GPUs.
For "Moderate" resource usage, the function sets medium values for CPU, RAM, and GPU VRAM, as well as medium GPU power and usage levels. This category represents medium-end machines with medium-end GPUs.
For "High" resource usage, the function sets higher values for CPU, RAM, and GPU VRAM, as well as higher GPU power and usage levels. This category represents high-end machines with high-end GPUs.
Finally, the function returns a dictionary containing all the estimated parameters for the given resource usage category. By using this function, the NuNet platform can estimate resource parameters for machines in different categories, helping to efficiently allocate resources for machine learning tasks based on the specific requirements of each job.
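To make this concrete, here is a minimal Go sketch of such a function. The structure follows the pseudocode described above, but the `ResourceEstimate` type and all numeric bounds are illustrative placeholders, not NuNet's actual values.

```go
// Illustrative Go translation of the estimate_resource pseudocode described above.
// All numeric bounds below are placeholders, not NuNet's actual figures.
package main

import "fmt"

// ResourceEstimate holds the estimated parameters for one resource-usage category.
type ResourceEstimate struct {
	CPUMinMHz, CPUMaxMHz int    // CPU range in MHz
	RAMMinMB, RAMMaxMB   int    // RAM range in MB
	VRAMMinMB, VRAMMaxMB int    // GPU VRAM range in MB
	GPUPower             string // estimated GPU power level
	GPUUsage             string // estimated GPU usage level
}

// estimateResource maps a resource-usage category ("Low", "Moderate", "High")
// to a set of estimated machine parameters, mirroring the pseudocode's behaviour.
func estimateResource(resourceUsage string) (ResourceEstimate, error) {
	switch resourceUsage {
	case "Low":
		return ResourceEstimate{2000, 4000, 2000, 4000, 2000, 4000, "low", "low"}, nil
	case "Moderate":
		return ResourceEstimate{4000, 8000, 4000, 8000, 4000, 8000, "medium", "medium"}, nil
	case "High":
		return ResourceEstimate{8000, 16000, 8000, 16000, 8000, 16000, "high", "high"}, nil
	default:
		return ResourceEstimate{}, fmt.Errorf("unknown resource usage category: %q", resourceUsage)
	}
}

func main() {
	est, err := estimateResource("Moderate")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", est)
}
```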
For each resource type, resource prices are calculated in the NuNet ML on GPU API. These functions help estimate the cost of using different types of machines for executing machine learning tasks. The calculation is based on an Estimated Static NTX value.
Estimated Static NTX is calculated using the user's input for the estimated execution time of the task and the chosen resource type (Low, Moderate, or High). The function `Calculate_Static_NTX_GPU(resource_usage)` calculates this value based on these inputs and returns the Estimated NTX.
This function works to estimate and calculate resource prices for various types of machines, ensuring that users are billed fairly based on the actual resources used during the execution of their machine learning tasks.
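As a rough illustration of how such a function could combine the two inputs, here is a hedged sketch extending the Go example above (it reuses the `fmt` import). The per-minute NTX rates are invented placeholders; the actual pricing formula used by the NuNet ML on GPU API is not reproduced in this document.

```go
// calculateStaticNTXGPU sketches a static NTX estimate from the user's
// estimated execution time and resource category. The rates are placeholders.
func calculateStaticNTXGPU(resourceUsage string, estimatedMinutes float64) (float64, error) {
	ratePerMinute := map[string]float64{ // placeholder NTX rates per minute
		"Low":      0.1,
		"Moderate": 0.25,
		"High":     0.5,
	}
	rate, ok := ratePerMinute[resourceUsage]
	if !ok {
		return 0, fmt.Errorf("unknown resource usage category: %q", resourceUsage)
	}
	return rate * estimatedMinutes, nil // Estimated Static NTX
}
```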
The pseudocode for "Reporting Task Results" describes two functions used in the NuNet ML on GPU API that allow users to monitor the progress of their machine learning tasks and access the results.
Upload_Compute_Job_Result() is a function that regularly updates the task's progress. It runs in a loop and performs the following steps every 2 minutes until the job is completed or the off-chain transaction of the Estimated Static NTX is done:
Wait for 2 minutes.
Save the machine learning log output as a file that can be appended with new information.
Upload the file to the cloud, making sure that only an authenticated user can access it.
This function allows users to keep track of their tasks and view intermediate results during the execution.
Send_Answer_Message() is a function that provides a unique link to the WebApp, which helps users track their tasks. It performs the following steps:
Retrieve a unique URL (permalink) for each machine learning job.
Send the permalink from the Device Management Service (DMS) to the WebApp as an answer message.
This function enables users to access their task's progress updates and results using the provided link.
In summary, these functions work together to ensure users can monitor their machine learning tasks' progress and access the results in a user-friendly and organized manner.
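The sketch below shows one way the two reporting functions could be structured in Go. Only the overall flow (a 2-minute loop, an appendable log file, a per-job permalink) follows the description above; `uploadToCloud`, `sendToWebApp`, the log path, and the permalink format are placeholder assumptions, not NuNet's actual implementation.

```go
// A simplified sketch of the reporting functions described above.
package main

import (
	"fmt"
	"os"
	"time"
)

const reportInterval = 2 * time.Minute

// drainLogs appends any buffered ML log lines to the open log file.
func drainLogs(f *os.File, logs <-chan string) {
	for {
		select {
		case line := <-logs:
			fmt.Fprintln(f, line)
		default:
			return
		}
	}
}

// uploadComputeJobResult waits 2 minutes, appends the latest ML log output to a
// file, and uploads it for authenticated access, repeating until the job is done.
func uploadComputeJobResult(jobID string, logs <-chan string, done <-chan struct{}) error {
	logPath := fmt.Sprintf("/tmp/%s-ml.log", jobID) // illustrative path
	for {
		select {
		case <-done:
			return nil // job completed or the Estimated Static NTX transaction settled
		case <-time.After(reportInterval):
			f, err := os.OpenFile(logPath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
			if err != nil {
				return err
			}
			drainLogs(f, logs)
			f.Close()
			if err := uploadToCloud(logPath); err != nil { // placeholder authenticated upload
				return err
			}
		}
	}
}

// sendAnswerMessage retrieves a unique permalink for the job and sends it from
// the DMS to the WebApp as an answer message.
func sendAnswerMessage(jobID string) error {
	permalink := fmt.Sprintf("https://example.invalid/jobs/%s", jobID) // placeholder permalink
	return sendToWebApp(permalink)
}

// Placeholder implementations so the sketch compiles.
func uploadToCloud(path string) error { fmt.Println("uploading", path); return nil }
func sendToWebApp(msg string) error   { fmt.Println("answer message:", msg); return nil }

func main() {
	done := make(chan struct{})
	close(done) // immediately done, so the example exits
	_ = uploadComputeJobResult("job-123", make(chan string), done)
	_ = sendAnswerMessage("job-123")
}
```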
Learn more about NuNet by reading our Whitepaper or Info Hub. Please read NuNet's Disclaimer before installing any software on your devices.
NuNet is building a globally decentralized computing framework that combines latent computing power of independently owned compute devices across the globe into a dynamic marketplace of compute resources, individually rewarded via a multi-chain tokenomic ecosystem based on the NuNet Utility Token (NTX).
Being a multi-sided platform, NuNet will support a variety of decentralized services that are defined by use cases. These use cases will be decided on by both the community and partnerships; they currently range from distributed machine learning jobs to decentralized Cardano nodes. The requirements of the use cases will drive the development of the core platform, ensuring that features which support actual use cases are prioritized and add utility to the network.
NuNet is part of the SingularityNET ecosystem and was originally intended to support the decentralized operation of their global AI marketplace. NuNet has also sought to integrate with other decentralized platforms; the diagram below shows some of the platforms we could integrate with as we build out our vision. As with the use cases, as partnerships are formed with other platforms, the core infrastructure will be developed to support interoperability.
PLEASE READ THIS DISCLAIMER CAREFULLY BEFORE INSTALLING OR USING NUNET PUBLIC ALPHA SOFTWARE. BY INSTALLING OR USING THE SOFTWARE, YOU ACKNOWLEDGE THAT YOU HAVE READ, UNDERSTOOD, AND AGREE TO BE BOUND BY THE TERMS AND CONDITIONS SET FORTH IN THIS DISCLAIMER. NuNet Public Alpha software is very early stage test software undergoing heavy development, testing and multiple security audits. It should only be installed on a computer that does not contain any sensitive or valuable data. It should not be installed on any computer that is used to make financial transactions of any sort. Do not install NuNet Public Alpha components on any computer that you are not willing to reformat / reload an operating system on.
By installing or using the Software, you acknowledge and agree that:
You are voluntarily participating in the testing of the Software and understand the potential risks associated with its use.
You will not install the Software on any computer that contains sensitive or valuable data or that is used for financial transactions or other critical purposes.
You accept the responsibility for any potential loss of data, system damage, or other issues that may arise from the installation or use of the Software.
You are willing and able to reformat or reload the operating system on the computer on which the Software is installed, if necessary.
NuNet DISCLAIMS ANY WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT, WITH RESPECT TO THE SOFTWARE AND ITS INSTALLATION AND USE. In no event shall NuNet be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, loss of use, data, or profits, or business interruption) arising from your installation or use of the Software, regardless of whether NuNet has been advised of the possibility of such damages.
By installing or using the Software, you agree to indemnify, defend, and hold harmless NuNet, its affiliates, officers, directors, employees, agents, licensors, and suppliers from and against any and all claims, losses, expenses, damages, and costs (including, but not limited to, direct, incidental, consequential, exemplary, and indirect damages), as well as reasonable attorneys' fees, resulting from or arising out of your installation or use of the Software, your violation of this Disclaimer, or your violation of any rights of any third party.
Please read NuNet's Disclaimer before installing any software on your devices.
This step-by-step guide will walk you through the process of setting up a new Eternl wallet as a Chrome extension, backing up your mnemonics, setting it to single address mode, connecting it to the preprod Ada network, and noting the receiving address.
Step 1: Install the Eternl Wallet Chrome Extension
Open your preferred web browser on your computer.
Navigate to the official Eternl Wallet website by visiting https://eternl.io/app/mainnet/welcome.
Follow the prompts on the page to install the Eternl Wallet.
Once the wallet is installed, you should see the Eternl Wallet icon on your browser's extension bar.
Click the icon to open the Eternl wallet.
Step 2: Connect to the preprod Ada network
On the main page, locate the "mainnet" option in purple at the bottom right corner.
Click on it and switch the network to select "Pre-Production Testnet".
Step 3: Set up a new wallet
Click on the Eternl Wallet icon in your browser to open the extension.
Select "Create a new wallet."
Enter a strong and unique password to encrypt your wallet. Make sure to remember this password, as it will be required to access your wallet.
Click "Next" and follow the on-screen instructions to complete the wallet creation process.
Step 4: Backup your mnemonics
After creating your wallet, you'll be presented with a 24-word mnemonic phrase. This phrase is essential for recovering your wallet in case you lose access to your device or need to restore your wallet on another device.
Write down the mnemonic (seed) phrase on a piece of paper and store it in a secure location, such as a safe deposit box. Alternatively, you can store it digitally in a password-protected file or encrypted storage medium.
Confirm that you've safely stored your mnemonic (seed) phrase by selecting "I have written it down" and clicking "Next."
You will be prompted to re-enter your mnemonic (seed) phrase.
Click "Save" to apply the changes.
Step 5: Note the receiving address
Click on the "Receive" tab within the Eternl Wallet extension.
You'll see your wallet's receiving address displayed as both a string of characters and a QR code. This is the address you'll use to receive funds on the preprod Ada network.
Copy the address by clicking on the copy button next to it or take a screenshot of the QR code.
Step 6: Set wallet to single address mode
Click on the gear icon next to the Eternl wallet name in the top right corner to access the settings menu.
Locate the "Single address mode" option and toggle this option to enable it (by default, it's disabled).
Enter your receiving address and click on "Save".
Step 7: Enable dApp connection
Toggle the dApp button shown on the Eternl screen.
Step 8: Add a Collateral Amount
Make sure you have at least 5 tADA
Select the gear icon on the right-side of your wallet
Select Collateral from the Settings menu
Set Collateral amount
Toggle to Enable Collateral
Your Eternl Wallet is now set up with the single address mode, connected to the preprod Ada network, and you have noted the receiving address. Remember to store your mnemonic phrase and password securely, as they will be required to access your wallet and recover it if necessary.
Avimanyu Bandyopadhyay¹ Santosh Kumar² Tewodros Kederalah³ Dagim Sisay⁴ Dr. Kabir Veitas⁵
1. Systems Scientist, NuNet [avimanyu.bandyopadhyay@nunet.io]
Researcher, GizmoQuest Computing Lab [avimanyu@gizmoquest.com]
PhD Scholar, Heritage Institute of Technology [avimanyu.bandyopadhyay@heritageit.edu.in]
2. Full Stack Developer, NuNet [santosh.kumar@nunet.io]
3. Software Developer, NuNet [tewodros@nunet.io]
4. Tech Lead, NuNet [dagim@nunet.io]
5. Chief Executive Officer, NuNet [kabir@nunet.io]
Corresponding author: Dr. Kabir Veitas, CEO, NuNet
The utilization of Graphics Processing Units (GPUs) has significantly enhanced the speed and efficiency of machine learning and deep learning applications. Docker containers, developed in a programming language called Go, are becoming more and more favored for guaranteeing the reproducibility and scalability of these applications. Nevertheless, Docker's built-in GPU compatibility is restricted to Nvidia GPUs, creating an obstacle to leveraging AMD and Intel GPUs. In this document, we introduce an innovative method, employing Go, to expand Docker's GPU compatibility to include AMD and Intel GPUs. This approach offers a vendor-neutral solution at the development level.
The rise of machine learning and deep learning applications in a variety of fields such as computer vision, natural language processing, and bioinformatics, among others, has necessitated the use of high-performance computing resources. Graphics Processing Units (GPUs), initially designed to accelerate graphics rendering for gaming, have emerged as powerful accelerators for these data-intensive applications due to their massively parallel architectures.
In the realm of high-performance computing, reproducibility and portability of applications are essential. Docker, a platform that leverages containerization technology, provides a solution to these challenges. Docker containers encapsulate applications along with their dependencies, allowing them to run uniformly across different computational environments. Moreover, Docker's lightweight nature compared to traditional virtual machines makes it a preferable choice for deploying scalable and efficient applications.
However, while Docker has built-in support for Nvidia GPUs, it lacks the same native support for AMD and Intel GPUs. This discrepancy limits the full exploitation of the diverse GPU hardware landscape, creating vendor lock-in and potentially hindering the scalability and versatility of GPU-accelerated applications.
Docker is primarily developed in the Go programming language. Go, with its simplicity, strong concurrency model, and powerful standard library, provides a unique blend of high-level and low-level programming capabilities. This makes it an ideal candidate for developing solutions that require detailed system-level control, such as the interaction with GPUs, while maintaining an accessible and maintainable codebase.
This paper presents a novel approach for extending Docker's GPU support to AMD and Intel GPUs using the Go programming language. By addressing this problem at the development level, we aim to contribute to the open-source Docker project and pave the way for truly vendor-agnostic, scalable, and efficient GPU-accelerated applications. This development-level solution contrasts with the existing system-level workarounds and has the potential to eliminate unnecessary complexity for end-users, promoting more widespread adoption of AMD and Intel GPUs in high-performance computing.
Several works have addressed the challenges associated with GPU-accelerated computing in containerized environments. The most prominent solution is the `--gpus` option provided by Docker, which offers native support for Nvidia GPUs. This feature leverages the Nvidia Container Toolkit, an open-source project that provides a command-line interface for Docker to recognize Nvidia GPUs and allocate necessary resources to containers.
However, the current support is vendor-specific, and while it works seamlessly for Nvidia GPUs, it does not provide an out-of-the-box solution for other GPU vendors like AMD and Intel. Thus, the existing solutions focus on system-level workarounds to enable the use of AMD GPUs with Docker. AMD provides a deep learning stack that uses ROCm, a platform that allows deep learning frameworks to run on AMD GPUs. ROCm-enabled Linux kernel and the ROCk driver, along with other required kernel modules, need to be installed on the host that runs Docker containers.
Despite these advances, the present solutions do not address the issue at the development level within Docker. They require users to perform additional system-level configurations which increase the complexity and could potentially discourage users from adopting non-Nvidia GPUs for their applications. Furthermore, these solutions do not provide a unified, vendor-agnostic way to leverage GPUs in Docker, limiting the flexibility and scalability of GPU-accelerated applications in a diverse hardware landscape. This highlights the need for a development-level solution that is integrated within Docker itself, ensuring ease of use and true vendor-agnosticism.
Our proposal applies a vendor-neutral approach to GPU support within Docker at the development level. The goal is to use the Go programming language, Docker's implementation language, to enable Docker to natively detect and manage not only Nvidia GPUs, but also AMD and Intel GPUs.
The first part of the proposal enables Docker to mount the `/dev/dri` device inside the container regardless of vendor. This can be achieved by modifying the `devices` parameter within the `hostConfig` structure of Docker's `containerd` module. The change is made in Docker's Go codebase, allowing Docker to detect and mount `/dev/dri` without explicit bind-mount instructions at run time.
The second part adds support for AMD and Intel GPUs to the `resources` parameter of the `hostConfig` structure. At present, Docker recognizes Nvidia GPUs via the `--gpus` flag, which internally modifies the `resources` parameter. We propose extending this support to AMD and Intel GPUs by allowing the `resources` parameter to identify these GPUs and allocate the necessary resources to containers. This requires extending Docker's Go codebase to incorporate the essential drivers and software stacks for these GPUs.
In summary, the proposal aims to provide a unified and seamless way to use GPUs within Docker, regardless of the manufacturer. The main benefit of this approach is that it operates at the development level, avoiding the need for intricate system-level configuration and offering a more user-friendly experience. Moreover, it lays the groundwork for supporting future GPUs from other manufacturers, improving the scalability and flexibility of GPU-accelerated applications in Docker.
Implementing the proposed solution requires changes to Docker's Go codebase, specifically to the `hostConfig` structure in the `containerd` module. The modifications are as follows:
1. Attaching the `/dev/dri` device: In the existing Docker implementation, the `/dev/dri` device must be explicitly bind-mounted into the container at run time. We modified the `devices` attribute in the `hostConfig` structure to include `/dev/dri` by default. This adjustment enables Docker to automatically attach the `/dev/dri` device inside the container, eliminating the need for explicit bind-mount commands.
2. Extending GPU support in the `resources` attribute: The `resources` attribute in the `hostConfig` structure is responsible for managing resources for Nvidia GPUs via the `--gpus` flag. We extended this mechanism to include support for AMD and Intel GPUs. The implementation required extending the Go codebase to incorporate the necessary drivers and software stacks for AMD and Intel GPUs, allowing Docker to recognize these GPUs and allocate the requisite resources to containers using the `--gpus` flag.
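While the change described above lives inside the Docker daemon's own codebase, its intended effect can be illustrated from the outside with the Docker Engine Go SDK: today a user (or a service such as the DMS) must add an explicit `/dev/dri` device mapping like the one below, which the proposal would make implicit and vendor-agnostic. The image name and cgroup permissions are illustrative choices, not prescribed by the proposal.

```go
// Explicitly mapping /dev/dri into a container via the Docker Engine Go SDK,
// i.e. the manual step the proposed daemon-level change aims to automate.
package main

import (
	"context"
	"log"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatal(err)
	}
	hostConfig := &container.HostConfig{
		Devices: []container.DeviceMapping{{
			PathOnHost:        "/dev/dri", // DRI render nodes used by AMD and Intel GPUs
			PathInContainer:   "/dev/dri",
			CgroupPermissions: "rwm",
		}},
	}
	resp, err := cli.ContainerCreate(context.Background(),
		&container.Config{Image: "ubuntu:22.04", Cmd: []string{"ls", "/dev/dri"}}, // illustrative image
		hostConfig, nil, nil, "dri-demo")
	if err != nil {
		log.Fatal(err)
	}
	log.Println("created container", resp.ID)
}
```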
These modifications were implemented in Go, following the language's conventions and best practices. Go's strong static typing and emphasis on clarity and simplicity were particularly helpful in preserving the readability and maintainability of the Docker codebase.
The proposed solution was implemented and tested successfully in Docker's Go codebase. The implementation details confirm the feasibility of the approach and its potential to change how Docker supports GPUs from different manufacturers.
The proposed methodology has effectively resolved the original challenge of delivering an intuitive, hassle-free experience for AMD and Intel GPU users interacting with Docker. By tweaking the Docker codebase, which is written in Go, Docker now has the capability to independently recognize and employ AMD and Intel GPUs, eliminating the necessity for explicit bind mounts or personalized scripts.
The benefits of this methodology include:
1. Simplicity: Users are exempted from the need to manually bind mount the GPU device or write personalized scripts for the container set-up.
2. Interoperability: The methodology is vendor-neutral and is compatible with Nvidia, AMD, and Intel GPUs, thereby enhancing Docker container compatibility across diverse systems.
3. Scalability: The methodology enables better hardware utilization in large, multi-GPU environments, such as those found in high-performance computing clusters.
Nonetheless, there remain opportunities for further enhancement and research:
1. Expanded Validation: While preliminary trials have yielded encouraging outcomes, exhaustive testing on diverse hardware and software setups is necessary to confirm resilience and compatibility.
2. Efficiency Enhancement: At present, the methodology prioritizes functionality over optimal performance. Prospective efforts could explore strategies to boost the efficiency of GPU-boosted applications operating within Docker containers.
3. Inclusion of Additional Devices: The current methodology concentrates on GPUs, but the approach could be broadened to incorporate other hardware devices that could profit from a similar approach, such as FPGAs or TPUs.
4. Integration with Orchestration Tools: An important future direction is to integrate the proposed methodology with container orchestration tools like Kubernetes, improving the scalability and manageability of GPU-accelerated workloads in distributed systems.
This study has paved the way for more advancements and investigations. By enhancing Docker's compatibility with AMD and Intel GPUs, we have made significant progress in making GPU-boosted computing more attainable and efficient for a broader set of users and applications. Anticipated future endeavors in this field hold the potential for more upgrades and breakthroughs.
The adoption of Docker container technology in machine learning and data science operations has been progressively prevalent, largely due to its reproducibility, portability, and scalability attributes. However, an obstacle arises from Docker's lack of broad-ranging GPU support, particularly problematic for users utilizing GPUs other than Nvidia's. In order to rectify this, we introduced a novel approach that adapts Docker's Go foundation to automatically detect and harness AMD and Intel GPUs, thereby eliminating the necessity for explicit bind mounts or bespoke scripts.
This innovative approach not only resolves the immediate predicament but also propels further investigation and progress in the realm of container-centric computing. It forms a solid foundation for more efficient hardware usage in expansive, multi-GPU environments, akin to those encountered in high-performance computing networks. Additionally, it encourages subsequent efforts towards performance refinement, aid for varied hardware entities, and amalgamation with container orchestration tools.
In summary, our scholarly pursuits signify a substantial stride in democratizing GPU-intensified computing, making it more approachable and effective for an expanded spectrum of users and use-cases. By augmenting Docker's compatibility with AMD and Intel GPUs, we have broadened the reach of container-based computing, thereby nurturing a more varied and inclusive computational research landscape.
1. Docker GitHub Repository (2023). Docker. Available at: https://github.com/docker/docker-ce (Accessed: June 28, 2023).
2. NVIDIA (2023). CUDA Toolkit Documentation. Available at: https://docs.nvidia.com/cuda/index.html (Accessed: June 28, 2023).
3. NVIDIA (2023). NVIDIA cuDNN. Available at: https://developer.nvidia.com/cudnn (Accessed: June 28, 2023).
4. Zaidi, S. (2023). Vendor-agnostic Setup for Running ML & DL Experiments with GPU Support. Towards AI. Available at: https://towardsai.net/p/machine-learning/how-to-setup-a-vendor-agnostic-gpu-accelerated-deep-learning-environment (Accessed: June 28, 2023).
5. AMD Community (2023). The AMD Deep Learning Stack Using Docker. Available at: https://community.amd.com/t5/blogs/the-amd-deep-learning-stack-using-docker/ba-p/416431 (Accessed: June 28, 2023).
Read our blog on the NuNet Public Alpha Testnet for more information on the Decentralized ML use case. Please read NuNet's Disclaimer before installing any software on your devices.
This is the description of the current NuNet architecture for testing the ML use case. The business goal of implementing this use case is to allow users to onboard both latent GPU and CPU resources to be used by service providers to run compute jobs (in this case, ML jobs) on the NuNet platform and be compensated in NTX (NuNet’s Utility Token) for the job. For Public Alpha we are using the PreProd Cardano Network.
These are the NuNet core elements for the Public Alpha:
Decentralized Components:
In NuNet, the term service provider refers to the individual or group that wants to run a job on NuNet’s decentralized community hardware.
The term compute provider refers to an individual in the community who has onboarded their devices onto the NuNet platform.
The term DMS refers to Device Management Service, which is essentially the NuNet platform itself. It is the lightweight peer-to-peer connection between services and providers.
Service Provider Dashboard is the application that service providers use to deploy jobs on NuNet and to lock funds on the smart contract.
Compute Provider Dashboard is the application that compute providers use to claim their tokens for work carried out on their devices.
Centralized Components:
NuNet Oracle and Haskell Server are components responsible for locking funds on the smart contract and validating results of compute jobs prior to the withdrawal transaction. For Public Alpha we implemented a smart contract on the PreProd Cardano Network.
The Control Server is responsible for any control functionality needed on NuNet. In this use case, it serves the captcha functionality.
NuNet Network Status displays real-time statistics about all computational processes executed on the network and their telemetry information.
Please read NuNet's Disclaimer before installing any software on your devices.
Download and install a testnet wallet: You can choose from Nami or Eternl. Make sure to select the testnet network option during the installation process.
Get testnet ADA: You can obtain testnet ADA from a testnet faucet, which is a service that provides free testnet tokens. You can learn more about this in the tADA faucet guide.
Explore the testnet: Once you have your testnet wallet set up and testnet ADA in your account, you can start exploring the testnet network. You can send and receive transactions, create test tokens, and interact with smart contracts.
Join a testnet community: You can join the Cardano testnet community on forums like Reddit or Discord to connect with other developers and users, ask questions, and share your experiences.
Test our service/compute provider dashboards: The Cardano testnet is an ideal environment to test the dashboards before deploying them on the mainnet. You can use the testnet to identify bugs, test functionalities, and optimize our applications' performance.
Keep your testnet wallet safe: Although testnet tokens have no real-world value, you should still keep your testnet wallet secure. Make sure to use a strong password and never share your private keys or seed phrases with anyone.
This page details our Decentralized ML Use Case on Cardano.
See the project's full description, including its engagement with external stakeholders, in our Cardano Catalyst Fund8 proposal on Project Catalyst. Read our blog on the NuNet Public Alpha Testnet for more information on the Decentralized ML use case. Please read NuNet's Disclaimer before installing any software on your devices.
Public Alpha is released with the ML use case, which allows users to run simple open source machine learning training on NuNet-onboarded computers and pay for the compute in NTX. Users can work with widely used machine learning libraries, such as TensorFlow, PyTorch, and scikit-learn, ensuring that they can effortlessly integrate their preferred tools and frameworks. Moreover, the platform provides the flexibility to run jobs on either CPUs or GPUs, catering to various computational needs and budget constraints.
Designed with a user-centric approach, the Service Provider Dashboard has a simple interface that allows users to easily submit their ML models, define resource usage based on their job requirements, and keep track of their job's progress in real-time. This level of transparency and control empowers users to manage their machine learning jobs effectively and efficiently, ultimately facilitating and accelerating the development & deployment of innovative AI solutions.
For Public Alpha we implemented a smart contract on the PreProd Cardano Network to lock service provider NTX funds and reward compute provider users for the use of their resources.
You need to choose one role to play in this use case: you can be a service provider that needs to run an ML job on NuNet's decentralized community hardware, or you can be a compute provider who has onboarded their devices onto the NuNet platform and will be compensated in NTX (NuNet's Utility Token) for running the ML job requested by a service provider.
During this testing phase you can contribute to NuNet's development by reporting bugs and suggesting improvements. Please refer to this documentation about the contribution guidelines:
You can also connect with us on Discord at:
Please read NuNet's Disclaimer before installing any software on your devices.
Download and install the Device Management Service (DMS): Install the DMS on your machine, which will enable you to request ML jobs by connecting you to compute provider machines that are onboarded on the NuNet platform.
Download and install the Service Provider Dashboard (SPD): Install the SPD on your local machine, making it accessible for you to submit and monitor your ML jobs. No sign-up or NuNet account is needed.
Obtain compute resources: Open your preferred browser and visit localhost:9991 to specify the type of compute resources you require (CPU or GPU) and the amount of resources needed for your ML job.
Define your ML job: Specify your ML model URL, and provide any additional dependencies required for the execution of the ML job.
Set up your wallet: Connect your Cardano wallet (Eternl or Nami wallet) to the Service Provider Dashboard. Ensure that your wallet has enough ADA and NTX tokens for running the ML job and covering transaction fees. For Public Alpha we are using the PreProd Cardano Network.
Set a budget: Determine the maximum amount of NTX tokens you are willing to spend on the ML job. This budget will be locked in the smart contract on the PreProd Cardano Network as a guarantee for the compute providers.
Submit your ML job: Review your job configuration and submit the ML job to the NuNet platform. The platform will automatically match your job with suitable compute providers based on the resources needed.
Monitor your job: Track the progress of your ML job through the Service Provider Dashboard. Log outputs are updated every 2 minutes once your job begins. Please allow around 5 minutes for the first log to appear when deploying new ML jobs.
Review the results: Once the ML job is completed, you can download the output data and review the results. The locked NTX tokens will be released and distributed as a reward to the compute providers who contributed resources to your job.
Repeat the process (optional): If you have more ML jobs to run, you can follow the same steps to execute them on the NuNet platform, utilizing its decentralized computing resources.
Please read NuNet's Disclaimer before installing any software on your devices.
Here's a step-by-step guide to setting up a new Nami Wallet (Chrome extension), setting it to the preprod Ada network, noting the receiving address, and backing up your mnemonics:
Step 1: Install the Nami Wallet browser extension
Open your preferred browser.
Visit the official Nami Wallet website.
Click on the "Download" or "Get Started" button (or similar, depending on the site's design).
You will be redirected to the official download page for the Nami Wallet extension. Click on the download button for your specific browser (Chrome, Firefox, etc.).
Confirm the installation by following your browser's prompts.
Step 2: Set up a new Nami Wallet
Click on the Nami Wallet icon in the top-right corner of your Chrome browser.
Read and accept the terms and conditions.
You will be presented with two options: "Create new wallet" and "Restore wallet." Click on "Create new wallet."
Set a strong password for your wallet and click "Next."
Step 3: Backup your mnemonics (recovery phrase)
The wallet will now generate a 24-word recovery phrase. Write it down and store it in a safe place, as you will need it to restore your wallet if needed. Click "Next."
Verify your recovery phrase by selecting the words in the correct order, and then click "Confirm."
For extra security, consider backing up your recovery phrase in multiple secure locations, such as on paper, in a password manager, or in an encrypted file stored on a secure device or cloud storage service.
Step 4: Switch to the preprod Ada network
Click on the round icon on the top-right corner of your Nami Wallet to access the settings.
Under "Network," click on the dropdown menu and select "Preprod."
Click "Save" to confirm your selection.
Step 5: Note the receiving address
Click on the "Receive" tab in your Nami Wallet.
Your receiving address will be displayed in the form of a QR code and a text address.
Click on the "Copy" button next to the text address to copy it to your clipboard.
Save the receiving address in a secure location, as you will need it to receive Ada on the preprod network.
Step 6: Add a Collateral Amount
Make sure you have at least 5 tADA
Select the round icon on the top right corner
Select Collateral from the drop-down menu
Add the collateral amount (around 5 tADA) in order to interact with the NTX smart contract on Cardano
You have now successfully set up a new Nami Wallet, connected it to the preprod Ada network, backed up your mnemonics, and noted the receiving address. You can now use this address to receive and send transactions on the preprod network.
Please read NuNet's Disclaimer before installing any software on your devices.
Here's a basic step-by-step guide on how to use a tADA (testnet ADA) faucet and obtain tADA tokens:
Go to the Cardano Developer Portal: The Cardano Developer Portal is the official resource for Cardano developers and contains information on how to use the Cardano testnet, including a list of tADA faucets.
Get tADA from a faucet: You can obtain tADA from the Cardano Developer Portal, which includes a testnet faucet. For more instructions on how to use one, follow the steps below.
Enter your testnet wallet address: Copy and paste your testnet wallet address into the field provided on the tADA faucet website. Make sure you are using a testnet wallet address and not a mainnet address, as the two are not interchangeable.
Solve the captcha: Some tADA faucets require you to solve a captcha or complete a task to prove you are a real person and not a bot.
Request tADA: Click on the "Request tADA" button on the faucet website. Your tADA tokens should be sent to your testnet wallet address within a few minutes.
Check your testnet wallet balance: Open your testnet wallet and check your balance to make sure the tADA tokens have been successfully deposited.
Remember, tADA tokens are not real ADA tokens and have no real-world value. They are only intended for testing purposes on the Cardano testnet. Additionally, some tADA faucets have limits on how many tokens you can request per day or per IP address, so be sure to check the faucet's rules and guidelines before requesting tokens.
Please read NuNet's Disclaimer before installing any software on your devices.
Before running the tests, consider the following:
If you are a compute provider, please make sure to backup any important data you might have before onboarding your machine as this is a testing phase.
We use centralized components (Oracle, Control Server, Stats Network) running on our servers. Check if any configuration changes are needed to use these components.
Use Eternl or Nami wallets for these tests. Make sure you have enabled single address mode if using Eternl.
Ensure you have mNTX and tADA in your wallet. To get mNTX on the PreProd Cardano network, follow the mNTX guide. For tADA, follow the tADA faucet guide.
For a basic outline on how to use a Cardano testnet, you can use the Cardano testnet guide. The Cardano address provided during DMS onboarding should match the wallet connected to the dashboard.
Note that compute providers will have to add a collateral amount of around 5 tADA to interact with the NTX smart contract when using the Compute Provider Dashboard (CPD).
To check if one machine can see the other, follow the instructions in the respective component documentation.
Please read NuNet's Disclaimer before installing any software on your devices.
A device management service or DMS is a program that helps users run various computational services, including machine learning (ML) jobs, on a compute provider machine, based on an NTX token request system on NuNet. In simple terms, it connects users who want to perform computational tasks to powerful CPU/GPU enabled computers that can handle these tasks. The purpose of the DMS is to connect users on NuNet, allow them to run any service (not only ML jobs) and be rewarded for it.
The NTX token is a digital cryptographic asset available on the Cardano and Ethereum blockchains as a smart contract. However, for the current use case of running machine learning jobs, only the Cardano blockchain is being used. Users request and allocate resources for computational jobs through a Service Provider Dashboard. Compute providers receive NTX tokens for these jobs through a Compute Provider Dashboard.
Please note that the dashboards are not components of NuNet's core architecture. Both have been developed for the current use case, which is to run ML jobs on compute providers' machines.
Here's a step-by-step explanation:
Users have computational services they want to run. These services often require a lot of computing power, which may not be available on their personal devices.
Compute provider machines are powerful computers designed to handle resource-intensive tasks like machine learning jobs.
The device management service acts as a bridge, connecting users with these compute provider machines.
Users specify resources and job requirements through a webapp interface, and request access to the compute provider machines by sending mNTX tokens. mNTX acts as a digital ticket, granting users access to the resources they need.
The device management service receives the job request after verifying the authenticity of the mNTX transaction through an Oracle.
Once received, the DMS allocates the necessary resources on the compute provider machine to run the user's job.
The user's job is executed on the provider's machine, and the results are sent back to the user.
In summary, a device management service simplifies the process of running machine learning jobs on powerful computers. Users can easily request access to these resources with NTX tokens, allowing them to complete their tasks efficiently and effectively.
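The sketch below is a conceptual Go model of the flow summarised above. Every type and function name in it (`JobRequest`, `Oracle`, `allocate`, `execute`) is an illustrative assumption rather than NuNet's actual API; only the sequence of steps follows the description.

```go
// A conceptual sketch of the DMS request flow: verify the mNTX payment via an
// Oracle, allocate resources on the compute provider, run the job, return results.
package main

import "fmt"

// JobRequest carries what the user specifies through the webapp interface.
type JobRequest struct {
	TxHash   string // hash of the mNTX payment transaction
	CPUMHz   int    // requested CPU
	RAMMB    int    // requested RAM
	UseGPU   bool   // whether a GPU is required
	ModelURL string // ML model the user wants to run
}

// Oracle verifies that the mNTX transaction backing the request is genuine.
type Oracle interface {
	VerifyTransaction(txHash string) (bool, error)
}

// handleJobRequest mirrors the steps above.
func handleJobRequest(o Oracle, req JobRequest) (string, error) {
	ok, err := o.VerifyTransaction(req.TxHash)
	if err != nil || !ok {
		return "", fmt.Errorf("mNTX transaction could not be verified: %v", err)
	}
	if err := allocate(req.CPUMHz, req.RAMMB, req.UseGPU); err != nil {
		return "", err
	}
	return execute(req.ModelURL)
}

// Placeholder implementations so the sketch compiles.
func allocate(cpuMHz, ramMB int, gpu bool) error { return nil }
func execute(modelURL string) (string, error)    { return "training finished", nil }

type fakeOracle struct{}

func (fakeOracle) VerifyTransaction(string) (bool, error) { return true, nil }

func main() {
	out, err := handleJobRequest(fakeOracle{}, JobRequest{
		TxHash: "abc123", CPUMHz: 4000, RAMMB: 8000, UseGPU: true,
		ModelURL: "https://example.invalid/model.py",
	})
	fmt.Println(out, err)
}
```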
Before going through the installation process, let's take a quick look at the system requirements and other things to keep in mind.
When using a VM or WSL, using Ubuntu 20.04 is highly recommended.
ML on GPU jobs on VMs are not supported
Install WSL through the Windows Store.
Install Ubuntu 20.04 through WSL
ML Jobs begun on Linux cannot be resumed on WSL
Make sure virtualization is enabled in the system BIOS
Though it is possible to run ML jobs on Windows machines with WSL, using Ubuntu 20.04 natively is highly recommended, because our development is based entirely around the Linux operating system. Also, the system requirements when using WSL increase by at least 25%.
If you are using a dual-boot machine, make sure you run the wsl --shutdown command before shutting down Windows and booting into Linux for ML jobs. Also, ensure your Windows machine is not in a hibernated state when you reboot into Linux.
We only require you to specify CPU (MHz × number of cores) and RAM, but your system must meet at least the following requirements before you decide to onboard it:
Minimum requirements (CPU-only machines):
CPU - 2 GHz
RAM - 4 GB
Free Disk Space - 10 GB
Internet Download/Upload Speed - 4 Mbps / 0.5 MBps
If the above CPU has 4 cores, your available CPU would be around 8000 MHz. So if you want to onboard half your CPU and RAM on NuNet, you can specify 4000 MHz CPU and 2000 MB RAM.
Recommended requirements (CPU-only machines):
CPU - 3.5 GHz
RAM - 8-16 GB
Free Disk Space - 20 GB
Internet Download/Upload Speed - 10 Mbps / 1.25 MBps
Minimum requirements (machines with NVIDIA GPU):
CPU - 3 GHz
RAM - 8 GB
NVIDIA GPU - 4 GB VRAM
Free Disk Space - 50 GB
Internet Download/Upload Speed - 50 Mbps
Recommended requirements (machines with NVIDIA GPU):
CPU - 4 GHz
RAM - 16-32 GB
NVIDIA GPU - 8-12 GB VRAM
Free Disk Space - 100 GB
Internet Download/Upload Speed - 100 Mbps
Here's a step by step process to install the device management service (DMS) on a compute provider machine:
Download the DMS package:
Download the latest version with this command (note: please use the second option for now until we fix the auto-linked latest version):
Install DMS:
DMS has some dependencies, but they'll be installed automatically during the installation process.
Open a terminal and navigate to the directory where you downloaded the DMS package (skip this step if you used the wget command above). Install the DMS with this command:
If the installation fails, try these commands instead:
If you see a "Permission denied" error, don't worry, it's just a notice. Proceed to the next step.
Check if DMS is running: Look for "/usr/bin/nunet-dms" in the output of this command:
Uninstall DMS (if needed):
To remove DMS, use this command:
To download and install a new DMS package, repeat steps 1 and 2.
Completely remove DMS (if needed):
To fully uninstall and stop DMS, use either of these commands:
or
Update DMS:
To update the DMS to the latest version, follow these steps in the given sequence:
a. Uninstall the current DMS (Step 3)
b. Download the latest DMS package (Step 1)
c. Install the new DMS package (Step 2)
Please read NuNet's Disclaimer before installing any software on your devices.
This manual provides instructions on how to use the NuNet Command Line Interface (CLI) to onboard a device, manage resources, wallets, and interact with peers.
Open the terminal and run the following command to access the CLI:
The CLI provides several commands and options for managing your device on the NuNet platform. The general syntax is:
Here's the complete list of the command line options that can be used with the CLI:
capacity: Display capacity of device resources
wallet: Get the current wallet address
onboard: Onboard the device to NuNet
info: Get information about the currently onboarded machine
onboard-gpu: Install NVIDIA GPU driver and container runtime
onboard-ml: Prepare the system for machine learning with GPU
resource-config: Change the configuration of the onboarded device
shell: Send commands to, and receive answers from, a VM instance via DMS
peer: Interact with currently visible peers
chat: Start, join, or list chat requests
log: Return the path of an archive containing all log files
Let's look into each of them and how they work.
Check the available resources on your device by running the following command:
If you don't have an existing wallet address, create a new one using either the Ethereum or Cardano blockchain. We currently recommend Cardano, as it is the primary blockchain for testing and will be the focus of our Public Alpha, but both are included because NuNet is a multi-chain protocol and will support many chains in the future:
For Cardano (we use this for our current testing):
If you are testing, do not use the newly created address when onboarding the device; instead, use the wallet you have set up on the Cardano PreProd network. If you do use the wallet new command in production, make sure you add the private key to a wallet and confirm you have access to it, as this is the wallet you will be claiming rewards to.
When we support other blockchains in the future, you would simply need to change the blockchain name when creating a wallet through the above command.
Onboard your NVIDIA or AMD GPU
Note: You can skip this step if you don't have a GPU on your compute provider machine. If you are using a mining operating system such as HiveOS, only the NVIDIA container runtime would be installed, as it comes bundled with both NVIDIA and AMD drivers preinstalled.
Install the NVIDIA/AMD GPU driver and container runtime with the following command:
For NVIDIA GPUs, the above command will work on both native Linux (Debian) and Windows Subsystem for Linux (WSL).
For AMD GPUs, the command will work only on native Linux (Debian), as there is currently no support on WSL.
NuNet's GPU onboarding also checks for Secure Boot if applicable, with the necessary messages to help the user. You can either choose to sign it with a machine owner key (MOK) if enabled, or keep it disabled in the BIOS.
After onboarding the GPU, it is recommended to reboot your machine with the following command:
Onboard your device to the NuNet platform using the following command:
Replace <memory in MB>, <cpu in MHz>, and <address> with the appropriate values based on your device's resources (noted in onboarding step 1) and your wallet address.
If you are running on a local network behind a router, use the -l flag; this will help with peer discovery. Do not use this flag if you are onboarding a machine with a public IP address, as it will scan the subnet you are on and could cause problems with your service provider.
For example:
For a machine with a local IP address use this syntax
For a machine with a public IP address use this syntax
The -C option is optional and allows deployment of a Cardano node. Your device must have at least 10,000 MB of memory and 6,000 MHz of compute capacity to be eligible.
The -l option is optional but important. Use -l when running the DMS on a local machine (e.g., a laptop or desktop computer) to enable advertisement and discovery on a local network address. Do not use -l when running the DMS on a machine from a datacenter.
Prepare the system for Machine Learning (For GPU machines only)
Prepare the system for machine learning with GPU. We include this step to reduce the time for starting jobs because of large-sized GPU based ML images of TensorFlow and PyTorch. To do this, use the following command:
The above command preloads (downloads) the latest ML on GPU images for training/inferencing/computing on NuNet.
Wait a few minutes for components to start and peers to be discovered.
Check your peer information and the peers your DMS is connected to by running the following commands:
You can lookup connected peers. To list visible peers, use the following command:
To see your own peer info, use:
If you see other peers in the list, congratulations! Your device is successfully onboarded to NuNet. If you only see your node, don't worry. It may take time to find other peers, especially if your device is behind symmetric NAT.
At any time after onboarding, you can also check how many resources have been allocated to NuNet with the following command:
To check your machine's full capacity, you can always use:
Sometimes, you may need to temporarily pause onboarding. You may want to do this if you need to use all of your device's resources, troubleshoot or perform maintenance tasks on your machine. The steps below provide a simple explanation of how to pause and unpause the DMS onboarding process using two commands.
To pause the DMS onboarding process, you can use the following command:
This command will temporarily stop the onboarding process.
After you've paused the onboarding process and completed any necessary tasks, you can resume the process with the following command:
This command will unpause and resume onboarding, allowing your device to once again find, and be seen by other peers on NuNet.
If you do not manually unpause the onboarding process, it will automatically resume after a reboot or when the device powers up after being shut down.
Remember to always use these commands responsibly and only when needed, as interrupting the onboarding process may lead to unexpected behavior or issues with NuNet's decentralized peer-to-peer communication system on the device.
You can also send commands and receive answers to a VM instance via DMS. To do that, use the following format:
The node-id can be obtained from the nunet peer list command. For example:
To start a chat with a peer, use the following format:
For example:
To list open chat requests:
To clear open chat requests:
To join a chat stream using the request ID:
The request-id mentioned above can be obtained from the nunet chat list command stated earlier.
You can return the path of an archive containing all NuNet log files. To run this command, use:
This should return the path to the archive containing the log files, such as /tmp/nunet-log/nunet-log.tar.
Get information about the currently onboarded machine with the following command:
Install the NVIDIA/AMD GPU driver and NVIDIA container runtime with the following command:
This command will work on both native Linux (Debian) (both NVIDIA and AMD GPUs) and WSL (NVIDIA GPUs only) machines. It also checks for Secure Boot if necessary.
Prepare the system for machine learning with GPU. We include this step to reduce the time for starting jobs because of large-sized GPU based ML images of TensorFlow and PyTorch. To do this, use the following command:
The above command preloads (downloads) the latest ML on GPU images for training/inferencing/computing on NuNet.
To use this option, add the -gpu or the --gpu-status flag to the available command like this:
This allows you to check NVIDIA/AMD GPU utilization, memory usage, free memory, temperature and power draw when the machine is idle or busy.
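For reference, the sketch below shows one generic way to gather similar statistics on an NVIDIA machine by shelling out to nvidia-smi. It is an independent illustration, not how the NuNet CLI implements this flag, and it covers NVIDIA only.

```go
// Query GPU utilization, memory, temperature, and power draw via nvidia-smi.
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=utilization.gpu,memory.used,memory.free,temperature.gpu,power.draw",
		"--format=csv,noheader,nounits").Output()
	if err != nil {
		fmt.Println("nvidia-smi not available:", err)
		return
	}
	fmt.Printf("util %%, used MiB, free MiB, temp C, power W:\n%s", out)
}
```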
To use this option, add it to the capacity command like this:
As a shorter alternative, you can also use -ct. To perform this check, the command leverages the NuNet PyTorch NVIDIA container used for onboard-ml.
To use this option, add it to the capacity command like this:
As a shorter alternative, you can also use -rh. To perform this check, the command leverages the NuNet PyTorch AMD container used for onboard-ml.
Please read NuNet's Disclaimer before installing any software on your devices.
The Compute Provider Dashboard (CPD) is a locally available interface that enables compute providers to monitor and manage their machines effectively. The current version of the dashboard facilitates receiving NTX tokens for computational jobs. In the future, it will provide basic telemetry data, such as resource utilization, job status, and system performance, allowing providers to optimize their services.
The dashboard interface aims to offer valuable insights into token management, machine performance and resource allocation. It would empower providers to make informed decisions about their services and ensure the best possible experience for their users.
Before you install the Compute Provider Dashboard, make sure:
you have already installed the device management service
you are using a desktop machine (servers are not currently supported)
To install the Compute Provider Dashboard on your system, follow these steps:
Step 1: Download the nunet-cpd-latest.deb package using the wget command. Open a terminal window and enter the following command:
This command will download the nunet-cpd-latest.deb package from the provided URL.
Step 2: Install the nunet-cpd-latest.deb package. After downloading the package, enter the following command in the terminal:
This command will update your system's package index and then install the nunet-cpd-latest.deb package. The -y flag automatically accepts the installation prompts.
Step 3: Access the management dashboard. Open your preferred internet browser and visit:
This URL will direct you to the Compute Provider Dashboard hosted on your local machine, where you can start using the service to manage your machine learning jobs and connect your NTX wallet.
Keep in mind that these installation instructions assume you are using a Debian-based Linux distribution, such as Ubuntu. The installation process may differ for other operating systems.
To use the device management service with the Cardano blockchain, follow these steps to connect your NTX wallet and integrate it with the management dashboard:
Step 1: Add your NTX wallet address, which is based on the Cardano blockchain. Your wallet address is a unique identifier that enables you to receive and send NTX tokens within the Cardano network.
Step 2: Select the Cardano blockchain by clicking on the corresponding radio button. This ensures that the device management service communicates with the appropriate blockchain network to facilitate the exchange of NTX tokens for machine learning job requests.
Step 3: Connect your wallet to the Compute Provider Dashboard by clicking the "Connect Wallet" button at the top right corner of your browser. This action will initiate a secure connection between your wallet and the dashboard, allowing for seamless token transactions.
Finally, click on "Submit" to confirm your wallet connection and complete the integration process. Once connected, you can use your NTX tokens to request resources and manage your machine learning jobs through the Compute Provider Dashboard.
If you experience issues while using the dashboard, you can open the inspect console in your browser to get more information about the error. Here's how to do it:
Open the dashboard in your web browser.
Right-click anywhere on the page and select "Inspect" from the context menu. Alternatively, you can use the keyboard shortcut Ctrl + Shift + I on Windows/Linux or Cmd + Option + I on Mac to open the inspect console.
The inspect console will open in a separate window or panel. Look for the "Console" tab, which should be near the top of the panel.
If there are any errors, they will be displayed in the console with a red message.
When you hit the /run/request-reward endpoint, you may encounter the following responses:
404 Not Found: This error occurs when there is no active job to claim.
102 Processing: This occurs when there is an active job that has not finished yet. Wait a while before requesting again.
500 Internal Server Error: This error occurs when the connection to the Oracle fails.
200 OK: This response indicates success. It includes the signature, oracle_message, and reward_type fields.
When you hit the /run/send-status endpoint, you may encounter the following responses:
400 Bad Request: This error occurs when the payload is not structured properly and cannot be read.
200 OK: This response indicates success. It includes the message "transaction status %s acknowledged", where %s is one of the transaction statuses sent from the webapp.
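For illustration, here is a hedged sketch of how a client might call these endpoints and handle the status codes above. The DMS port, HTTP methods, and payload fields shown here are assumptions for illustration, not the documented API.

```python
# Hypothetical sketch: calling the DMS reward endpoints and handling the
# status codes described above. The base URL, HTTP methods and payload
# fields are assumptions -- check the DMS API reference for the real ones.
import requests

DMS_BASE = "http://localhost:9999"  # assumed DMS address and port

def request_reward():
    resp = requests.get(f"{DMS_BASE}/run/request-reward")
    if resp.status_code == 404:
        print("No active job to claim.")
    elif resp.status_code == 102:
        print("Job still running; try again later.")
    elif resp.status_code == 500:
        print("Could not reach the Oracle.")
    elif resp.status_code == 200:
        data = resp.json()
        return data["signature"], data["oracle_message"], data["reward_type"]
    return None

def send_status(tx_status: str):
    # Assumed payload shape; the real field names may differ.
    resp = requests.post(f"{DMS_BASE}/run/send-status",
                         json={"status": tx_status})
    if resp.status_code == 400:
        print("Malformed payload.")
    elif resp.status_code == 200:
        print(resp.text)  # e.g. "transaction status success acknowledged"

if __name__ == "__main__":
    reward = request_reward()
    if reward:
        send_status("success")
```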
Please read NuNet's Disclaimer before installing any software on your devices.
Device Management Service (DMS): A service for both users and compute providers to manage their devices on the NuNet platform. Follow the corresponding user guide for installation and usage instructions. Our NuNet CLI guide would also be an essential reference.
Service Provider Dashboard (For End-Users Who Request Jobs with NTX): A web application for users to manage and monitor their ML jobs on the NuNet platform. Follow the corresponding user guide for installation and usage instructions.
Compute Provider Dashboard (For Compute Providers Who Receive Jobs for NTX): A web application for compute providers to manage their resources and monitor the jobs running on their devices. Follow the corresponding user guide for installation and usage instructions.
Note that both service providers (end-users) and compute providers must install the DMS on their local machines before installing either the Service Provider Dashboard (SPD) or the Compute Provider Dashboard (CPD). The DMS is an installation dependency for both applications.
Please read NuNet's Disclaimer before installing any software on your devices.
The Service Provider Dashboard (SPD) is an intuitive and accessible platform crafted specifically for a diverse range of users, including end-users, machine learning developers, and researchers. Its primary purpose is to streamline the process of requesting CPU/GPU machine learning jobs on compute provider machines, enabling users to focus on their projects without worrying about infrastructure complexities.
By harnessing the power of NTX tokens, users can seamlessly access cutting-edge, decentralized cloud-based computational resources for executing their machine learning projects. This innovative approach not only simplifies the process but also democratizes access to advanced computing capabilities.
The SPD is compatible with widely-used machine learning libraries, such as TensorFlow, PyTorch, and scikit-learn, ensuring that users can effortlessly integrate their preferred tools and frameworks. Moreover, the platform provides the flexibility to run jobs on either CPUs or GPUs, catering to various computational needs and budget constraints.
Designed with a user-centric approach, the SPD's simple interface allows users to easily submit their ML models, define resource usage based on their job requirements, and keep track of their job's progress in real time. This level of transparency and control empowers users to manage their machine learning jobs effectively and efficiently, ultimately facilitating and accelerating the development and deployment of innovative AI solutions.
Before you install the NuNet ML SPD, make sure:
you have already installed the device management service
you are using a desktop machine (servers are currently not supported)
To install the NuNet ML SPD on your system, follow these steps:
Step 1: Download the nunet-spd-latest.deb package using the wget command. Open a terminal window and enter the following command: (Please use the second option for now to get the latest version until we fix the latest-version links.)
This command will download the nunet-spd-latest.deb package from the provided URL.
Step 2: Install the nunet-spd-latest.deb package. After downloading the package, enter the following command in the terminal:
This command will update your system's package index and then install the nunet-spd-latest.deb package. The -y flag automatically accepts the installation prompts.
Step 3: Access the ML SPD. Open your preferred internet browser and visit:
This URL will direct you to the SPD hosted on your local machine, where you can start using the service to manage your machine learning jobs and connect your NTX wallet.
Keep in mind that these installation instructions assume you are using a Debian-based Linux distribution, such as Ubuntu. The installation process may differ for other operating systems.
To use the Device Management Service (DMS) with Cardano, make sure your service provider machine connects with at least two DHT peers on NuNet. You can check this by using the nunet peer list command a short while after you have onboarded your machine.
Once confirmed, you can ask to run an ML job. This involves connecting your NTX wallet and integrating it with the Service Provider Dashboard (SPD) through the following steps:
Step 1: Connect your wallet to the SPD by clicking the "Connect Wallet" button at the top right corner of your browser. This action will initiate a secure connection between your wallet and the dashboard, allowing for seamless token transactions.
Step 2: Access the SPD on the service provider machine and navigate to the first page of the UI at localhost:9991 in your preferred browser.
Step 3: Provide the ML Model URL by pasting the link to your Python program that contains the machine learning code.
Some examples that you can use for testing it out:
Step 4: Choose the type of compute resource you want to use for the job:
Select the CPU radio button if you want to run the job on a CPU-only machine. Currently supported libraries for CPU jobs are TensorFlow, PyTorch, and scikit-learn.
Select the GPU radio button if you want to run the job on a GPU machine. Then choose either the TensorFlow or PyTorch radio button to specify the library used in the program. For GPU jobs, only TensorFlow and PyTorch are currently supported.
Step 5: Specify the resource usage type: Low, Moderate, or High - depending upon the complexity of your job.
Step 6: If your job has any dependencies, enter their names by clicking the "+" symbol. If you are using our GPU-based example code mentioned in Step 3, note that you will need to specify a dependency named matplotlib. The CPU example does not require any dependencies.
Step 7: Enter the expected time for your job's completion and click 'Next' to proceed to the second page of the SPD's UI.
Step 8: When you connect your wallet, the NTX wallet address field will be auto-filled, based on the Cardano blockchain. Your wallet address is a unique identifier that enables you to send and receive NTX tokens within the Cardano network.
Step 9: Select the Cardano blockchain by clicking on the corresponding radio button. This ensures that the device management service communicates with the appropriate blockchain network to facilitate the exchange of NTX tokens for machine learning job requests.
Finally, click on "Submit" to confirm your wallet connection and complete the integration process. Once connected, your NTX tokens would be used to deploy the requested job by allocating it to a compute provider machine that matches the resource requirements. You can also monitor the progress of the machine learning job through the same dashboard.
Note: Currently, the WebSocket on the SPD does not have session management. This affects users who reload the page after making a deployment: they will not receive any response. This will be improved in the near future. For now, we recommend not reloading the page after deployment; instead, wait for some time. If you still do not receive any response after waiting for some time following a deployment, you can try the following:
Check the network connection: The first step is to ensure that the network connection is stable and working correctly. The user can try opening a different website to confirm that the issue is not with their internet connection.
Clear the browser cache: Sometimes, the browser cache can cause issues when loading web pages. Clearing the cache can help resolve this problem. The user can try clearing their browser cache and then reloading the page.
File a bug report: If the issue persists, the user should file a bug report about the problem. They can provide details about the issue and any error messages received, which will help in diagnosing and resolving the problem.
In general, it's always a good idea to document the steps taken and any error messages received when encountering issues with a web application. This information can be helpful when seeking support or troubleshooting the problem later on.
If you experience issues while using the dashboard, you can open the inspect console in your browser to get more information about the error. Here's how to do it:
Open the dashboard in your web browser.
Right-click anywhere on the page and select "Inspect" from the context menu. Alternatively, you can use the keyboard shortcut Ctrl + Shift + I on Windows/Linux or Cmd + Option + I on Mac to open the inspect console.
The inspect console will open in a separate window or panel. Look for the "Console" tab, which should be near the top of the panel.
If there are any errors, they will be displayed in the console with a red message.
When you hit the /run/request-service endpoint, you may encounter the following responses in the browser console:
400 Bad Request: This error occurs when the JSON payload received from the webapp cannot be parsed. This is unlikely to happen due to user error.
500 Internal Server Error: This error occurs when the DMS (Device Management Service) cannot find the libp2p public key. This usually means the compute provider has not been onboarded properly.
400 Bad Request: This error occurs when the estimated price received from the webapp is higher than expected. This is very unlikely to happen because the webapp guards against it.
404 Not Found: This error occurs when the DHT (Distributed Hash Table) does not have any peers with matching specs. After analyzing and filtering machines based on the constraints section of the payload, the DMS found no matching machine.
503 Service Unavailable: This error occurs when the DMS cannot connect to the Oracle for whatever reason.
500 Internal Server Error: This error occurs when a new service cannot run because the DMS is already running a service. Only one service is supported at the moment.
500 Internal Server Error: This error occurs when the DMS was not able to access the database.
200 OK: This response indicates success. It includes the compute_provider_addr, estimated_price, signature, and oracle_message fields.
In order to receive mNTX, you will need to be added to our whitelist; see the linked resources to learn more. Please read NuNet's Disclaimer before installing any software on your devices.
To get mNTX (mock NTX) on the PreProd Cardano network, follow these basic instructions for using the mNTX faucet:
Create a Cardano wallet: If you don't have a Cardano wallet already on the PreProd Cardano network, create one using Eternl or Nami wallet applications. Make sure to securely store your seed phrase, as it is crucial for recovering your wallet.
Find your wallet address: After creating your wallet, double-check that you are on the PreProd network and locate your Cardano wallet address. It typically starts with "addr_test1" followed by a long string of alphanumeric characters. In most wallets you can find it by clicking on Receive. Copy this address to use it in the next step.
Join and Whitelist: Head to the relevant testing stage and fill out the linked Google Form with your wallet address, Discord name, and GitLab name. We will use the wallet address provided to reward contributions on both Discord and GitLab. Please note it may take some hours to receive mNTX; this delay helps discourage spamming.
Access the NTX faucet: Visit the NTX faucet website provided by the NuNet team. Note that only whitelisted addresses will be able to receive tokens.
Connect your Cardano wallet: Click the "Connect Wallet" button, select the wallet you would like to use from the dropdown, then click the "Mint mNTX" button. (Note: please double-check that this is the wallet you entered in the Google Form and that it is on the Cardano PreProd testnet.)
Sign the transaction: When you click the "Mint mNTX" button, your wallet should open and ask you to sign a transaction. (Note: you should have some test ADA in your wallet before you do this; if you don't have any already, get some from a tADA faucet first.)
Faucet response: If all went well, you should see a transaction hash, which you can copy and check on a Cardano blockchain explorer. You may also receive a message saying you are not whitelisted; if so, please ensure you submitted the Google Form and received notification on Discord that you were successfully whitelisted. If you are still getting an error message, please ask for help on Discord.
Check your wallet balance: After submitting your request, wait for a few minutes and then check your wallet application to confirm the receipt of mNTX tokens. It might take some time for the transaction to be processed, depending on the network's congestion.
Use your mNTX tokens: You can now use the mNTX tokens in your wallet for testing or participating in the NuNet platform's services.
Keep in mind that the mNTX tokens received from the faucet are intended for testing and development purposes. They may not hold any real-world value outside of the test environment or PreProd Cardano network.
A section for the community to share useful commands and tools used for troubleshooting.
Please read NuNet's Disclaimer before installing any software on your devices.
NuNet Network Status is live, displaying real-time statistics about all computational processes executed on the network and their telemetry information. It tracks successful and failed processes, registered consumers, compute devices, and the amount of NuNet tokens paid by service providers to compute providers.
Dials on the right show available resources and current utilization percentages. Graphs on the left display resource availability over time.
The table at the bottom left displays the "heartbeats" (updated every minute) from currently active DMS nodes. You can expand each event to see its details.
The heartbeats list the onboarded RAM and CPU for each DMS, so they can be used to calculate the currently available resources.
RAM: Memory used by a process, measured in megabyte seconds (MBs). The metric is calculated by adding spot RAM usages sampled every second for the entire time of execution, indicating actual memory consumption.
CPU: Processor work used to complete a process, measured in MTicks (million ticks), showing CPU time used during execution.
ID: lists the peer ID for the heartbeat. This can be used to filter the dashboard to show events from a specific peer; see below.
The table at the bottom right displays telemetry metrics for each process running on NuNet. You can expand these events to reveal their details.
You can see the job status, the time it took to run, the RAM, CPU, and network used, and the amount of NTX paid.
Please read NuNet's Disclaimer before installing any software on your devices.
Rewards for Stage 2 testing stopped on May 17th at 6 pm UTC.
This page will detail the Stage 4 testing campaign.
Please read NuNet's Disclaimer before installing any software on your devices.
Describing our innovations based on novel algorithms and ideas from our developmental code.
This page will detail the Stage 1 testing campaign, which will run from April 28th to May 1st.
Please read NuNet's Disclaimer before installing any software on your devices.
Rewards for Stage 1 stopped at 4 pm UTC on May 1st!
Before requesting mNTX from our Faucet, please review all the steps in the mNTX faucet guide. Please read NuNet's Disclaimer before installing any software on your devices.
Testers must be whitelisted in order to receive mNTX. This sometimes takes a few hours, so please stay up to date on Discord to be notified when your request succeeds. If you have already requested mNTX and would like more, please fill in the application form again. If the whitelist is spammed, we will not reward testers.
Whitelist form: (please make sure you use your single tADA wallet address)
Faucet address:
Video instructions:
This page will detail the Stage 3 testing campaign, which will run from May 17th.
Please read NuNet's Disclaimer before installing any software on your devices.
Here's a list of some example Machine Learning code URLs (briefly explained) that you can submit to NuNet through our Service Provider Dashboard (SPD). You could also deploy and test your own ML code.
This Python code is for training and testing a simple convolutional neural network (CNN) using PyTorch on the CIFAR-10 dataset.
Check for GPU availability: The code checks if a GPU is available for faster computation. If it is, the code will utilize the GPU.
Load and preprocess the images: The code retrieves the CIFAR-10 dataset and preprocesses the images, preparing them for the machine learning process.
Display random images*: The code contains a function to display a few random images from the dataset that will be used for training.
Define a Convolutional Neural Network (CNN) architecture: The code creates a CNN architecture to guide the machine in learning how to recognize images.
Prepare the machine for training: The code configures the machine to follow the CNN architecture and determines the approach to learning from mistakes.
Train the machine: The code trains the machine using the images for 100 iterations (epochs). It keeps track of the machine's performance during training.
Test the trained machine: After training, the code evaluates how well the machine can identify images it has not seen before, using a test set of images.
Evaluate the machine's performance: The code calculates the overall accuracy of the machine in identifying the test images, as well as the accuracy for each category of images in the dataset.
However, this code does not include a checkpointing system to save the machine's learning progress. If training is interrupted, the process will have to start from the beginning.
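For orientation, here is a condensed sketch of this kind of training script. The layer sizes, batch size, and epoch count are illustrative rather than the exact values used in the example.

```python
# Condensed sketch of a CIFAR-10 CNN training script like the one described above.
# Hyperparameters and layer sizes are illustrative, not the exact example code.
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # use GPU if present

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root="./data", train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)
testset = torchvision.datasets.CIFAR10(root="./data", train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=64, shuffle=False)

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        x = torch.flatten(x, 1)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

net = Net().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

for epoch in range(2):  # the example described above trains for 100 epochs
    for inputs, labels in trainloader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(net(inputs), labels)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch + 1} done, last loss {loss.item():.3f}")

# Evaluate overall accuracy on the held-out test set
correct = total = 0
with torch.no_grad():
    for images, labels in testloader:
        images, labels = images.to(device), labels.to(device)
        predicted = net(images).argmax(dim=1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print(f"Test accuracy: {100 * correct / total:.1f}%")
```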
This code is a modified version of the previous one. The main changes are related to adding functionality for checkpointing and resuming training from a saved state.
Here's the explanation:
Imports and configurations: The code imports necessary libraries and sets up the device for using GPU or CPU, based on availability.
Data preparation: It loads and preprocesses the CIFAR10 dataset for training and testing purposes.
Visualizing images*: The code provides a function to display images from the dataset.
Defining the network architecture: The code defines a Convolutional Neural Network (CNN) with two convolutional layers, two pooling layers, and three fully connected layers.
Checkpoint and resume functionality: The code reads and writes the epoch count from a text file and saves the model weights in a file. If a checkpoint file exists, it resumes training from the last saved state.
Training the model: The training loop is modified to include saving the current model state after each epoch, and resuming from the saved state if needed.
Evaluating the model: The code tests the model's performance on the test dataset, calculating overall accuracy and accuracy per class.
Overall, this version of the code is designed for checkpointing and resuming the training process, making it more convenient when training is interrupted or needs to be paused.
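A sketch of the checkpoint/resume pattern described above, with illustrative file names; it can be combined with the training loop from the previous sketch.

```python
# Sketch of the checkpoint/resume pattern described above. File names are
# illustrative; the example code may use different ones.
import os
import torch

CKPT_PATH = "cifar_checkpoint.pth"
EPOCH_FILE = "epoch_count.txt"

def save_checkpoint(model, optimizer, epoch):
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, CKPT_PATH)
    with open(EPOCH_FILE, "w") as f:
        f.write(str(epoch))

def load_checkpoint(model, optimizer):
    """Return the epoch to resume from (0 if no checkpoint exists)."""
    if not os.path.exists(CKPT_PATH):
        return 0
    state = torch.load(CKPT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    with open(EPOCH_FILE) as f:
        return int(f.read().strip()) + 1

# Inside the training loop:
# start_epoch = load_checkpoint(net, optimizer)
# for epoch in range(start_epoch, num_epochs):
#     ...train one epoch...
#     save_checkpoint(net, optimizer, epoch)
```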
This code sets up a chatbot using the Gradio library for the interface and the T5 large model from Hugging Face's Transformers library as the backbone.
Here's an overview of the code:
Import libraries: Gradio, PyTorch, and Transformers are imported.
Hyperparameters: The code defines various hyperparameters for the model's text generation process, such as the maximum sequence and output lengths, number of beams for beam search, length penalty, and more.
Load the model and tokenizer: The pre-trained T5 large model and its corresponding tokenizer are loaded. The model is set up to run on GPU(s) if available.
Define the chatbot function: The chatbot() function takes user input, tokenizes it, feeds it to the T5 model, generates a response, and decodes the output tokens back into text.
Create a Gradio interface: The Gradio library is used to create a simple user interface for interacting with the chatbot. A text input box and a text output box are provided, along with a title and description.
Launch the Gradio interface: The Gradio interface is launched, and a shareable link is created.
This code sets up a chatbot using the T5 large model, providing an easy-to-use interface for users to ask questions and receive responses.
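A minimal sketch of a chatbot along these lines, assuming the t5-large checkpoint from Hugging Face and illustrative generation settings (not the exact hyperparameters of the example).

```python
# Minimal sketch of a T5-based Gradio chatbot like the one described above.
# The model name, generation settings and interface labels are illustrative.
import torch
import gradio as gr
from transformers import T5Tokenizer, T5ForConditionalGeneration

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large").to(device)

def chatbot(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True,
                       max_length=512).to(device)
    outputs = model.generate(**inputs, max_length=128, num_beams=4,
                             length_penalty=1.0, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

demo = gr.Interface(fn=chatbot, inputs="text", outputs="text",
                    title="T5 chatbot", description="Ask a question")
demo.launch(share=True)  # creates a shareable link, as in the example
```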
This code sets up a chatbot that can remember the conversation and save it to a file. It uses the T5 model from the Transformers library and Gradio for a user-friendly interface. The chatbot will work on your computer's GPU if available.
Here's what the code does:
It imports the needed tools (Gradio, PyTorch, os, and Transformers).
The code sets some rules (hyperparameters) to help the model give better answers.
The T5 model and tokenizer are loaded to understand and process text. The model will use your computer's GPU if possible, making it work faster.
The code checks if a file named "conversation.txt" exists. If not, it creates one to save the conversation.
A chatbot function is created that opens the conversation file, reads the previous conversation, and adds new input and output to the file. It also processes the input and generates a response using the T5 model.
Using Gradio, a simple chat window is created for you to ask questions and see the answers. The chatbot will remember the conversation and display the updated conversation after each response.
Finally, the chat window is launched, and you can share it with others if you want.
This code helps you set up a chatbot that can remember and save conversations, making it fun and easy to interact with the T5 model while keeping track of the discussion.
This code is similar to the previous two. It sets up a basic chatbot using the T5 large model from the Transformers library and Gradio for a user-friendly interface. The chatbot will run on your computer's GPU if available.
Here's what the code does:
It imports the required tools (Gradio, PyTorch, and Transformers).
The T5 model and tokenizer are loaded, which helps the chatbot understand and process text. The model will use your computer's GPU if possible, making it work faster.
A chatbot function is created that processes user input by tokenizing it and converting it into a PyTorch tensor. It then generates a response using the T5 model with specified settings.
The generated response is decoded, and any special tokens are removed before it's returned.
Using Gradio, a simple chat window is created for users to input text and see the chatbot's response.
Finally, the chat window is launched, allowing users to interact with the T5 model and see its responses.
This code helps you set up a simple and interactive chatbot, making it easy to use the T5 model to generate responses for user inputs.
This is a chatbot similar to the previous ones above.
Here's what it does:
This code creates a chatbot using a smart model called GPT-2 and a tool named Gradio that makes it easy to talk to the chatbot.
The code uses tools from the Transformers library to load the GPT-2 model and set it up.
A special token is added to help the model understand when a message starts and ends.
The code has a chatbot function that changes your message into a form the model understands, makes the model think of a reply, and then changes the reply back to normal text.
A simple Gradio chatbot interface is made, so you can ask questions and get answers from the chatbot.
Comparing this code with the previous ones, this one uses a model named GPT-2 instead of another model called T5. The way the model is loaded is a little different, but the overall idea of creating a chatbot using Gradio remains the same.
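For reference, a hedged sketch of a GPT-2 chatbot along these lines; the model name, sampling settings, and the "<|pad|>" token are illustrative assumptions rather than the exact example code.

```python
# Sketch of a GPT-2 Gradio chatbot: a special padding token is added and the
# embedding table resized before generation. Settings are illustrative.
import torch
import gradio as gr
from transformers import GPT2Tokenizer, GPT2LMHeadModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({"pad_token": "<|pad|>"})  # marks message boundaries
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
model.resize_token_embeddings(len(tokenizer))  # account for the new token

def chatbot(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_length=100, do_sample=True,
                             top_p=0.9, pad_token_id=tokenizer.pad_token_id)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

gr.Interface(fn=chatbot, inputs="text", outputs="text",
             title="GPT-2 chatbot").launch()
```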
This code uses nn.DataParallel to enable the model to be trained on multiple GPUs, thereby accelerating the training process through distributed computing.
The code checks whether a GPU is available and displays information about the available devices.
Defines a PyTorch model to be trained with an example linear function.
Initializes the model and moves it to GPU(s) if available.
Defines the loss function and optimizer.
Generates dummy input data for the model.
Attempts to read the epoch count from a file; if the file doesn't exist, it creates one with a default value of 0.
Defines two helper functions to save and load the epoch count to/from a file.
Defines a function to initialize the model's weights.
Tries to load the model's weights from the last saved checkpoint. If there are no saved checkpoints, initializes the model's weights.
Defines the total number of epochs to run and the interval for printing the loss.
Trains the model for a specified number of epochs, printing the loss every few epochs.
Saves the model's weights and epoch count to a file after each epoch.
Please note that the sole purpose of this code was to test simultaneous execution of the process on multiple GPUs.
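A hedged sketch of such a multi-GPU test script, using nn.DataParallel with an illustrative linear model, dummy data, and file names (not the exact example code).

```python
# Sketch of a multi-GPU test script like the one described above, using
# nn.DataParallel. The linear model, dummy data and file names are illustrative.
import os
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using {torch.cuda.device_count()} GPU(s)" if torch.cuda.is_available()
      else "Running on CPU")

model = nn.Linear(10, 1)               # example linear function y = Wx + b
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)     # replicate across all visible GPUs
model = model.to(device)

criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Dummy input/target data
x = torch.randn(256, 10, device=device)
y = torch.randn(256, 1, device=device)

# Resume the epoch counter from a file if it exists
epoch_file = "epoch.txt"
start = int(open(epoch_file).read()) if os.path.exists(epoch_file) else 0

for epoch in range(start, start + 50):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        print(f"epoch {epoch}: loss {loss.item():.4f}")
    # Save weights and epoch count after every epoch
    torch.save(model.state_dict(), "dp_checkpoint.pth")
    with open(epoch_file, "w") as f:
        f.write(str(epoch + 1))
```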
This code is a PyTorch implementation of the PaLM model for text generation. The code is written by Phil Wang and licensed under MIT license.
It trains and generates text by following the steps:
The code uses PyTorch and the palm-rlhf-pytorch library to implement the PaLM model for training and inference.
The enwik8 dataset (a subset of Wikipedia) is used for training the model; it is downloaded using the curl command and saved locally in the data directory.
A TextSamplerDataset is defined to create train and validation datasets from the enwik8 dataset.
A PaLM model is instantiated with hyperparameters and moved to the device (GPU) using accelerator.
An optimizer is created to optimize the PaLM model using the Adam optimization algorithm.
The training is done for a fixed number of batches, where the loss is calculated using PaLM model on training dataset, the gradients are accumulated and backpropagated, and the parameters are updated using the optimizer.
Validation loss is also calculated every few batches to check the performance of the model on the validation dataset.
The model is also used for generating text after every few batches.
After training, the model is used for inference by getting user input, generating text using the trained PaLM model, and printing the output.
Summarizing, this code trains a neural network using the PaLM architecture to generate text similar to a given dataset. The dataset used here is enwik8, which is a 100 MB dump of the English Wikipedia. The code defines a PaLM model and uses PyTorch DataLoader to feed the data to the model. It uses an accelerator library to distribute the computation across available devices, such as GPUs. Finally, it allows the user to test the generated model by inputting a sentence, and the model responds with a predicted output.
This code is a language model, PaLM, trained on enwik8 dataset. It trains the model on GPU by using PyTorch's DataParallel library, allowing distributed computing on GPUs. Additionally, the code also implements checkpointing which allows resuming from the last checkpoint instead of restarting the training from the beginning.
Revisiting the points with checkpointing:
This code uses PyTorch's PaLM model and Accelerator library for distributed computing on GPUs.
The code downloads the enwik8 dataset and divides it into training and validation sets.
The code uses TextSamplerDataset to load data into PyTorch DataLoader, which is then used for training.
It uses the Adam optimizer for training and also employs learning rate scheduler.
The code implements checkpointing to save model weights and optimizer states at a defined interval and allows resuming from the last checkpoint.
The code trains the model for a defined number of batches and validates after a defined interval.
It generates a sample text after a defined interval and also provides an option for the user to enter a prompt for generating text.
The training and generation logs are displayed using the tqdm library.
Summarizing, this code trains a PaLM language model on the enwik8 dataset using PyTorch and Accelerator libraries. It implements checkpointing to resume training from the last checkpoint and uses PyTorch's DataParallel library for distributed computing on GPUs. The model is trained for a defined number of batches and generates sample text after a defined interval. The user can also enter a prompt for generating text. The training and generation logs are displayed using the tqdm library.
This code is an implementation of image classification using Fashion MNIST dataset with TensorFlow and Keras. The dataset consists of images of clothing items such as shirts, shoes, trousers, etc. The model is trained to classify these images into different categories.
Here's how it works:
The code imports TensorFlow and Keras libraries.
Fashion MNIST dataset is loaded using Keras.
The images are shown using matplotlib.
The images are normalized to 0-1 range.
A sequential model is created using Keras with two dense layers.
The model is compiled using adam optimizer and sparse categorical crossentropy loss.
The model is trained using the training data.
The test loss and accuracy are evaluated using test data.
The model is used to predict the labels for the test data.
Functions are defined to plot the images and the predicted labels.
Plots are generated to show the predicted labels and true labels for some test images.
An individual image is selected and its label is predicted using the model.
In summary, this code trains a machine learning model to classify images of clothing from the Fashion-MNIST dataset. The code first loads and preprocesses the data, then builds and trains a sequential neural network model using the TensorFlow library. The trained model is used to make predictions on test data and visualize its performance through plots of images and their corresponding predicted and true labels. Finally, the model is used to predict the class of a single image.
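A condensed sketch of this kind of Fashion-MNIST script; the layer sizes and epoch count are illustrative rather than the exact example code.

```python
# Condensed sketch of a Fashion-MNIST classifier like the one described above.
import tensorflow as tf

(train_images, train_labels), (test_images, test_labels) = \
    tf.keras.datasets.fashion_mnist.load_data()

# Normalize pixel values to the 0-1 range
train_images, test_images = train_images / 255.0, test_images / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])

model.fit(train_images, train_labels, epochs=10)
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
print("Test accuracy:", test_acc)

# Predict the class of a single test image
probability_model = tf.keras.Sequential([model, tf.keras.layers.Softmax()])
print("Predicted class:", probability_model(test_images[:1]).numpy().argmax())
```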
This code trains a neural network classifier on the Fashion-MNIST dataset and uses checkpointing to save and restore the model weights. Checkpointing allows training to be interrupted and resumed without losing progress. The checkpoint is saved after every epoch, and the number of epochs completed before the training was interrupted is recorded in a text file.
Here is a breakdown of the code:
The necessary libraries are imported.
The Fashion-MNIST dataset is loaded and preprocessed. The class names are also defined.
The first image in the training set is displayed using Matplotlib.
The images in the training set are normalized to values between 0 and 1.
The first 25 images in the training set are displayed using Matplotlib.
The neural network model is defined using Keras.
A checkpoint callback is created to save the weights after each epoch.
If the checkpoint file exists, the model weights are loaded, and training is resumed. Otherwise, a new checkpoint file is created.
A custom callback is defined to update the epoch counter in the text file at the end of each epoch.
The model is trained with the fit() method, using the checkpoint and counter callbacks.
The model is evaluated on the test set.
The predictions are computed for the test set.
Two functions are defined to display the predicted labels and confidence scores for each test image.
The predicted labels and confidence scores for two test images are displayed using Matplotlib.
The predicted labels and confidence scores for several test images are displayed using Matplotlib.
An individual test image is displayed, and its predicted label and confidence score are computed and displayed using Matplotlib.
In summary, this code trains a neural network classifier on the Fashion-MNIST dataset, using checkpointing to save and restore model weights and a custom callback to update the epoch counter. The predicted labels and confidence scores for test images are displayed using Matplotlib.
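A self-contained sketch of the checkpointing pattern described above, using Keras callbacks; the model, file names, and epoch count are illustrative rather than the exact example code.

```python
# Sketch of per-epoch checkpointing with Keras: a ModelCheckpoint saves the
# weights each epoch, and a small callback records the epoch counter in a
# text file so training can resume. File names are illustrative.
import os
import tensorflow as tf

(train_images, train_labels), _ = tf.keras.datasets.fashion_mnist.load_data()
train_images = train_images / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])

ckpt_path = "fashion_ckpt.weights.h5"
counter_file = "epochs_done.txt"

# Save the weights after every epoch
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath=ckpt_path, save_weights_only=True, verbose=1)

# Record how many epochs have completed so training can resume later
class EpochCounter(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        with open(counter_file, "w") as f:
            f.write(str(epoch + 1))

initial_epoch = 0
if os.path.exists(ckpt_path) and os.path.exists(counter_file):
    model.load_weights(ckpt_path)                 # resume from saved weights
    initial_epoch = int(open(counter_file).read())

model.fit(train_images, train_labels, epochs=10,
          initial_epoch=initial_epoch,
          callbacks=[checkpoint_cb, EpochCounter()])
```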
Introduction: This code trains a convolutional neural network on the CIFAR-10 dataset using PyTorch. It includes loading and preprocessing the dataset, defining the neural network, training the model, and evaluating its performance on the test set.
How it works:
The code imports PyTorch and torchvision modules.
It still checks if a GPU is available and sets the device accordingly.
It normalizes the CIFAR-10 dataset using torchvision.transforms.
It loads the training and test data using torchvision.datasets.CIFAR10 and creates dataloaders for them using torch.utils.data.DataLoader.
It defines the class names for the CIFAR-10 dataset.
It defines a function to display an image from the dataset using matplotlib.
It displays a few random images from the training set and their labels.
It defines the neural network using nn.Module and initializes its weights using a custom function.
It checks if a checkpoint file exists and loads the model's weights from it if it does.
It defines the loss function, optimizer, and the number of epochs to train for.
It trains the model for the specified number of epochs using the training set and the defined optimizer and loss function.
It saves the model's weights and the current epoch count to a file after each epoch.
It evaluates the performance of the model on the test set and prints the accuracy.
It calculates the accuracy of the model for each class in the dataset and prints it.
Summary: This code trains a convolutional neural network on the CIFAR-10 dataset using PyTorch. It loads and preprocesses the data, defines the neural network, trains the model, and evaluates its performance on the test set. It also saves the model's weights and epoch count to a file after each epoch and calculates the accuracy of the model for each class in the dataset.
The Iris dataset is a classic example in the field of machine learning used for classification tasks. In this code, we will use the scikit-learn CPU-only library to build a Decision Tree Classifier on the Iris dataset. The code will train the classifier on 80% of the data and test it on the remaining 20% of the data. Finally, the code will evaluate the model's performance using accuracy, classification report, and confusion matrix.
What it does:
Load necessary libraries such as numpy, Scikit-learn's load_iris, train_test_split, DecisionTreeClassifier, accuracy_score, classification_report, and confusion_matrix.
Load the Iris dataset and separate input features (X) and output labels (y).
Split the dataset into train and test sets (80% training, 20% testing) using train_test_split.
Create a Decision Tree Classifier and fit it to the training data using DecisionTreeClassifier and fit methods.
Make predictions on the test set using predict method.
Evaluate the model's performance using accuracy_score, classification_report, and confusion_matrix methods.
Print the accuracy of the model on the test set.
Print the classification report, which shows precision, recall, f1-score, and support for each class.
Print the confusion matrix, which shows the number of true positives, false positives, true negatives, and false negatives for each class.
So in this code, we used the CPU-only Scikit-learn library to build a Decision Tree Classifier on the Iris dataset. The code split the dataset into 80% training and 20% testing sets, trained the classifier on the training set, and tested it on the test set. Finally, we evaluated the model's performance using accuracy, classification report, and confusion matrix. The accuracy of the model on the test set was printed, and the classification report and confusion matrix were shown to provide additional insights into the model's performance.
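A condensed sketch of this kind of scikit-learn workflow; the random seed is illustrative.

```python
# Condensed sketch of the Iris Decision Tree example described above.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # 80/20 split

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```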
This ML code snippet will give an error. So it can be used for testing the workflow to understand how we handle failed jobs:
When you try to run this program, you will receive the following message:
This error occurs because we're trying to perform an operation (in this case, addition) on two tensors that don't have the same shape. In PyTorch, element-wise operations require the tensors to have the same shape or be broadcastable to a common shape. In this case, the shapes (5, 2) and (3, 2) are not compatible because the sizes along the first dimension do not match, and they can't be broadcast to a common shape.
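The original snippet is not reproduced here, but the following minimal sketch triggers the same class of error and can likewise be used to test how failed jobs are handled.

```python
# Minimal sketch reproducing the shape-mismatch error described above:
# adding tensors of shapes (5, 2) and (3, 2) raises a RuntimeError.
import torch

a = torch.randn(5, 2)
b = torch.randn(3, 2)
c = a + b  # raises RuntimeError: sizes along dimension 0 (5 vs 3) cannot be broadcast
```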
There are of course numerous other types of errors for other ML/Computational code that you might encounter while working with PyTorch (or any ML/Computational library for that matter), including but not limited to:
TypeError: This could happen when you pass arguments of the wrong type to a function. For example, passing a list where a tensor is expected.
ValueError: This could occur when you pass arguments of the correct type but incorrect value to a function. For example, passing negative integers to a function that expects positive integers.
IndexError: You may encounter this when you try to index a tensor using an invalid index.
MemoryError: This occurs when the system runs out of memory, often when trying to allocate very large tensors.
RuntimeError: This is a catch-all for various kinds of errors that can occur during execution. The tensor shape mismatch error we discussed earlier is a type of RuntimeError.
Here's an example of a code snippet that will give a TypeError:
When you run this program, you'll receive a TypeError with the following message:
tanh(): argument 'input' (position 1) must be Tensor, not list
The error handling approach varies depending on the type of error. For this TypeError, you can handle it by converting the list to a tensor before performing the operation:
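The original snippets are not reproduced here; the following sketch illustrates both the failing call and the fix under that assumption.

```python
# Sketch of the TypeError described above and the fix of converting the list
# to a tensor before calling the operation.
import torch

data = [0.1, 0.5, 0.9]

# This raises: TypeError: tanh(): argument 'input' (position 1) must be Tensor, not list
# result = torch.tanh(data)

# Handle it by converting the list to a tensor first:
result = torch.tanh(torch.tensor(data))
print(result)
```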
From an ML developer's or researcher's perspective, it's always good practice to anticipate potential errors and handle them gracefully in the ML/Computational code.
This tutorial will guide you through the process of deploying any GPU-based Python project on NuNet, which is our base platform that allows running machine learning or computational jobs. We will use a Python file URL and pip dependencies through the dashboard interface.
Please note that this tutorial assumes that your Python project is structured as a command-line-interface (CLI) based project with a requirements.txt file for specifying dependencies.
A GPU-based Python project hosted on a platform like GitLab or GitHub, with the main script and requirements.txt file accessible via URLs.
Prepare Your Python Script: Modify your main Python script to programmatically install dependencies from the requirements.txt file. The following sample code demonstrates how you might do this:
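The original sample code is not reproduced here; the sketch below illustrates one possible wrapper. The repository URLs are the placeholders discussed below, GITLAB_TOKEN is read from the environment, and the helper names and paths are illustrative.

```python
# Possible sketch of a wrapper script that installs dependencies, runs your
# job, then archives and pushes the results. URLs are placeholders; adapt
# the paths and git commands to your own project.
import os
import subprocess
import sys
import urllib.request

REQUIREMENTS_URL = "https://raw.githubusercontent.com/yourusername/yourrepo/master/requirements.txt"
RESULTS_REPO = "https://oauth2:{token}@gitlab.com/yourusername/yourrepo.git"

def install_requirements():
    """Download requirements.txt and install it with pip."""
    urllib.request.urlretrieve(REQUIREMENTS_URL, "requirements.txt")
    subprocess.check_call([sys.executable, "-m", "pip", "install",
                           "-r", "requirements.txt"])

def push_results():
    """Tar the home directory and push the archive to the results repository."""
    home = os.path.expanduser("~")  # /home/$LOGNAME inside the container
    subprocess.check_call(["tar", "-czf", "/tmp/results.tar.gz", home])
    repo = RESULTS_REPO.format(token=os.environ["GITLAB_TOKEN"])
    subprocess.check_call(["git", "clone", repo, "/tmp/results_repo"])
    subprocess.check_call(["cp", "/tmp/results.tar.gz", "/tmp/results_repo/"])
    subprocess.check_call(["git", "-C", "/tmp/results_repo", "config", "user.email", "job@example.com"])
    subprocess.check_call(["git", "-C", "/tmp/results_repo", "config", "user.name", "nunet-job"])
    subprocess.check_call(["git", "-C", "/tmp/results_repo", "add", "results.tar.gz"])
    subprocess.check_call(["git", "-C", "/tmp/results_repo", "commit", "-m", "Add job results"])
    subprocess.check_call(["git", "-C", "/tmp/results_repo", "push"])

if __name__ == "__main__":
    install_requirements()

    # Rest of your code follows
    # ... your training / inference logic ...

    push_results()
```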
This script will:
Download and install the dependencies from the requirements.txt file.
Execute your main script's logic (which you would add after the "# Rest of your code follows" comment).
Tar the entire contents of the /home/$LOGNAME directory (assuming your code saves all checkpoints, models, and datasets at this location).
Push this tar file to the specified GitLab or GitHub repository.
Please replace the placeholders 'https://raw.githubusercontent.com/yourusername/yourrepo/master/requirements.txt' and 'https://oauth2:{token}@gitlab.com/yourusername/yourrepo.git' with the actual URLs of your requirements.txt file and your GitLab or GitHub repository, respectively.
Remember that {token} should be replaced with your Personal Access Token. This script expects the Personal Access Token to be stored in an environment variable named GITLAB_TOKEN. You can create this token temporarily and set it to expire around the same time your job is expected to finish.
Navigate to the NuNet Dashboard: Launch the Service Provider Dashboard at localhost:9991 in your preferred browser and navigate to the dashboard interface.
Enter the Python File URL: In the appropriate field, enter the URL of your modified Python script.
Specify the Dependencies: In the dependencies field, specify any dependencies that your script needs beyond those listed in the requirements.txt file. If all dependencies are included in the requirements.txt file, this field can be left blank.
Execute the Job: Click the appropriate button to execute the job. The NuNet platform will download your script, install the necessary dependencies, and run the script.
Complex Dependencies: If your project requires dependencies that cannot be installed with pip, you may need to find a workaround, such as including the installation process within your Python script.
Data Dependencies: If your project requires access to specific data files, you may need to modify your Python script to download or access these files.
Security: Only use this approach with trusted scripts and dependencies, as the platform will execute your script and install the specified packages without further confirmation.
Long-Running Processes: If your project initiates long-running processes, you'll need to manage and monitor these within the constraints of the NuNet platform.
Following these steps should allow you to deploy and run a wide variety of GPU-based Python projects on the NuNet platform. While this method may not work for every project without some adjustments, it provides a flexible starting point for deploying projects on NuNet.
Trials of the proposed solution were successfully conducted on hardware with AMD GPUs (/dev/dri with /dev/kfd) to validate its accuracy and effectiveness. The trials involved running GPU-accelerated software within Docker containers and verifying their operation and resource consumption through NuNet's Device Management Service. The findings indicate that the proposed solution can successfully enable Docker to support AMD and Intel GPUs in a seamless and user-friendly manner.
Skip doing an for the new Ubuntu VM as it might not add the user with administrative privileges.
Enable when installing the VM (VirtualBox only).
Always before booting the VM.
if on VirtualBox (recommended)
if on VMware (recommended)
Install the (Windows 10 only)
Enable
If it's not running, with the error messages. Here are the .
Make sure you back up the mnemonic and wallet address for safekeeping. Do not share them with anyone. This is the same wallet address that you would be providing on the .
Additionally, you can visit the NuNet Network Status page for real-time statistics about computational processes executed on the network, telemetry information, and more.
CPU -
GPU -
You may want to check for symmetric NAT on your connection, as it may have an impact on how many peers you can connect to or how quickly you connect to them. If you have a web browser on your machine, you can use this tool to check: just open the URL and it will tell you immediately.
ml_job.py is the job that runs inside a NuNet ML container. It is based on an ML model URL that was submitted through the Service Provider Dashboard.
Instructions:
Paste this URL into the web app to run its machine learning script:
Please report any errors or problems you encounter while completing this process on the GitLab issue submission page (with the exception of the issue stated below):
Access to the interface.
Fill in the remaining relevant fields: apart from the above two fields, also fill in the other fields as described in the user guide.
| Category | Contribution type |
| --- | --- |
|  | Major bugs discovered |
|  | Minor bugs discovered |
| Test case creation/improvement | New test cases created |
|  | Significant improvements to existing cases |
| Bug fixing/code contributions | Critical bugs fixed |
|  | Major bugs fixed |
|  | Minor bugs fixed |
|  | Code improvements |
| Community support/engagement | Helpful answers on forums |
|  | Knowledge-sharing (tutorials, articles) |
|  | Active moderation/ambassadorship |
| Quality feedback/suggestions | High-impact suggestions |
|  | Moderate-impact suggestions |
|  | Low-impact suggestions |
| Category | Contribution type |
| --- | --- |
|  | Major bugs discovered |
|  | Minor bugs discovered |
| Test case creation/improvement | New test cases created |
|  | Significant improvements to existing cases |
| Bug fixing/code contributions | Critical bugs fixed |
|  | Major bugs fixed |
|  | Minor bugs fixed |
|  | Code improvements |
| Community support/engagement | Helpful answers on forums |
|  | Knowledge-sharing (tutorials, articles) |
|  | Active moderation/ambassadorship |
| Quality feedback/suggestions | High-impact suggestions |
|  | Moderate-impact suggestions |
|  | Low-impact suggestions |
| Category | Contribution type |
| --- | --- |
|  | Major bugs discovered |
|  | Minor bugs discovered |
| Test case creation/improvement | New test cases created |
|  | Significant improvements to existing cases |
| Bug fixing/code contributions | Critical bugs fixed |
|  | Major bugs fixed |
|  | Minor bugs fixed |
|  | Code improvements |
| Community support/engagement | Helpful answers on forums |
|  | Knowledge-sharing (tutorials, articles) |
|  | Active moderation/ambassadorship |
| Quality feedback/suggestions | High-impact suggestions |
|  | Moderate-impact suggestions |
|  | Low-impact suggestions |
| Category | Contribution type |
| --- | --- |
|  | Major bugs discovered |
|  | Minor bugs discovered |
| Test case creation/improvement | New test cases created |
|  | Significant improvements to existing cases |
| Bug fixing/code contributions | Critical bugs fixed |
|  | Major bugs fixed |
|  | Minor bugs fixed |
|  | Code improvements |
| Community support/engagement | Helpful answers on forums |
|  | Knowledge-sharing (tutorials, articles) |
|  | Active moderation/ambassadorship |
| Quality feedback/suggestions | High-impact suggestions |
|  | Moderate-impact suggestions |
|  | Low-impact suggestions |