This package is responsible for starting the whole application. It also contains core functionality of DMS:
Onboarding compute provider devices
Job orchestration and management
Resource management
Actor implementation for each node
Here is a quick overview of the contents of this package:
README: Current file, which is aimed at developers who wish to use and modify the DMS functionality.
dms: This file contains code to initialize the DMS by loading configuration, starting the REST API server, etc.
init: This file creates a new logger instance.
sanity_check (proposed): This file defines a method for performing a consistency check before starting the DMS. Note that the functionality of this method needs to be developed as per the refactored DMS design.
Subpackages
jobs: Deals with the management of local jobs on the machine.
node: Contains the implementation of Node as an actor.
onboarding: Code related to onboarding of compute provider machines to the network.
orchestrator: Contains job orchestration logic.
resources: Deals with the management of resources on the machine.
proposed: All files with the *_test.go naming convention contain unit tests for the corresponding implementation.
The class diagram for the dms package is shown below.
Source file
Rendered from source file
TBD
Note: the functionality of DMS is currently being developed. See the proposed section for the suggested design of interfaces and methods.
Supervision
TBD as per proposed implementation
Supervisor
SupervisorStrategy
Statistics
TBD
Note: the functionality of DMS is currently being developed. See the proposed section for the suggested data types.
proposed
Refer to the *_test.go files for unit tests of the different functionalities.
List of issues
All issues that are related to the implementation of the dms package can be found below. These include any proposals for modifications to the package or new functionality needed to cover the requirements of other packages.
Interfaces & Methods
proposed
Capability_interface
add method will combine the capabilities of two nodes. Example usage: when two jobs have to be run on a single machine, the capability requirements of each will need to be combined.
subtract method will subtract one capability from another. Example usage: when resources are locked for a job, the available capability of a machine will need to be reduced.
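As a minimal sketch of these proposed semantics (all type and field names here are hypothetical, since the Capability model is still being designed), combining and subtracting capabilities could look like:

```go
// Capability is a hypothetical, simplified capability vector; the real
// proposed model also covers connectivity, price and time information.
type Capability struct {
	CPUCores int
	RAMMB    int
	DiskMB   int
}

// Add combines the capability requirements of two jobs, e.g. when both
// must run on a single machine.
func (c Capability) Add(other Capability) Capability {
	return Capability{
		CPUCores: c.CPUCores + other.CPUCores,
		RAMMB:    c.RAMMB + other.RAMMB,
		DiskMB:   c.DiskMB + other.DiskMB,
	}
}

// Subtract reduces the available capability of a machine, e.g. when
// resources are locked for a job.
func (c Capability) Subtract(other Capability) Capability {
	return Capability{
		CPUCores: c.CPUCores - other.CPUCores,
		RAMMB:    c.RAMMB - other.RAMMB,
		DiskMB:   c.DiskMB - other.DiskMB,
	}
}
```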
Data types
proposed
dms.Capability
The Capability struct will capture all the relevant data that defines the capability of a node to perform a job. At the same time, it will be used to define the capability requirements that a job demands from a node.
An initial data model for Capability is defined below.
proposed
dms.Connectivity
type Connectivity struct {
}
proposed
dms.PriceInformation
proposed
dms.TimeInformation
type TimeInformation struct {
	// Units holds the units of time, e.g. hours, days, weeks
	Units string
}
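Since the PriceInformation model is not yet specified and Capability itself is still proposed, the following is only a hedged sketch of how these component types might compose (all fields below are hypothetical):

```go
// PriceInformation might capture price-related requirements or offers;
// both fields are hypothetical placeholders.
type PriceInformation struct {
	Currency string
	MaxPrice float64
}

// Capability could then aggregate the component types defined above.
type Capability struct {
	Connectivity Connectivity
	Price        PriceInformation
	Time         TimeInformation
}
```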
The DMS Behaviors are a set of functionalities, organized in a hierarchical namespace, that the DMS performs when requested by an actor that has the necessary capabilities. Since capabilities are hierarchical, they can apply either as an exact match to a behavior or be implied by a top-level capability. For example, the /dms/node/peers/ping behavior can be accessed by an actor holding either the specific /dms/node/peers/ping capability or the /dms/node capability, which implies everything below it.
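As a minimal illustration of this implication rule (a sketch of the semantics described above, not the DMS's actual token-verification code), hierarchical matching boils down to a path-prefix check:

```go
package main

import (
	"fmt"
	"strings"
)

// implies reports whether a granted capability namespace covers a
// requested behavior path, either by exact match or because the
// behavior sits below the capability in the hierarchy.
func implies(capability, behavior string) bool {
	if capability == behavior {
		return true
	}
	return strings.HasPrefix(behavior, capability+"/")
}

func main() {
	fmt.Println(implies("/dms/node", "/dms/node/peers/ping"))            // true
	fmt.Println(implies("/dms/node/peers/ping", "/dms/node/peers/ping")) // true
	fmt.Println(implies("/dms/node/peers", "/dms/deployment/bid"))       // false
}
```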
DMS Node Namespace: /dms/node
Description: Everything related to the management of a DMS node. Any capability implied by this namespace will be able to access all the behaviors below it. These should only be granted to the controller/user of the DMS. Normally, this is done by anchoring the DID of the controlling user to the root anchor of the DMS, which allows the controller unlimited root access.
For fine-grained control, the following capabilities can be used:
The following Peer capabilities are directly associated with peer behaviors under the same path. They allow the controller of a DMS to request it to perform these actions. For example, when the PeerSelf behavior is invoked by the controller, it's the peer address of the DMS node that is returned.
PeerPingBehavior: /dms/node/peers/ping
Description: Ping a peer to check if it is alive.
PeersListBehavior: /dms/node/peers/list
Description: List peers visible to the node.
PeerSelfBehavior: /dms/node/peers/self
Description: Get the peer id and listening address of the node.
PeerDHTBehavior: /dms/node/peers/dht
Description: Get the peers in DHT of the node along with their DHT parameters.
PeerConnectBehavior: /dms/node/peers/connect
Description: Connect to a peer.
PeerScoreBehavior: /dms/node/peers/score
Description: Get the libp2p pubsub peer score of peers.
The following capabilities deal with onboarding the DMS node as a compute provider on the network. Onboarding a node involves setting a specific amount of compute resources for the node to allocate to incoming jobs.
OnboardBehavior: /dms/node/onboarding/onboard
Description: Onboard the node as a compute provider.
OffboardBehavior: /dms/node/onboarding/offboard
Description: Offboard the node as a compute provider.
OnboardStatusBehavior: /dms/node/onboarding/status
Description: Get the onboarding status. Whether the node is onboarded or not and errors if any.
NewDeploymentBehavior: /dms/node/deployment/new
Description: This node behavior is invoked by the controller to start a new deployment on the node. It takes an ensemble config as input and returns the deployment id.
DeploymentListBehavior: /dms/node/deployment/list
Description: List all the deployments orchestrated by the node.
DeploymentLogsBehavior: /dms/node/deployment/logs
Description: Get the logs of a particular deployment.
DeploymentStatusBehavior: /dms/node/deployment/status
Description: Get the status of a deployment.
DeploymentManifestBehavior: /dms/node/deployment/manifest
Description: Get the manifest of a deployment.
DeploymentShutdownBehavior: /dms/node/deployment/shutdown
Description: Shutdown a deployment.
ResourcesAllocatedBehavior: /dms/node/resources/allocated
Description: The behavior returns the amount of resources allocated to Allocations running on the node. Allocated resources should always be less than or equal to the onboarded resources.
ResourcesFreeBehavior: /dms/node/resources/free
Description: The behavior returns the amount of resources that are free to be allocated on the node. Free resources should always be less than or equal to the onboarded resources.
ResourcesOnboardedBehavior: /dms/node/resources/onboarded
Description: The behavior returns the amount of resources the node is onboarded with.
HardwareSpecBehavior: /dms/node/hardware/spec
Description: The behavior returns the hardware resource specification of the machine.
HardwareUsageBehavior: /dms/node/hardware/usage
Description: The behavior returns the full resource usage on the machine including usage by other processes.
LoggerConfigBehavior: /dms/node/logger/config
Description: Configure the logger/observability config of the node.
The following capabilities are associated with the deployment of jobs on a DMS node. These capabilities and behaviors allow the controller to deploy services on nodes, list the deployed ensembles, get the logs from deployed allocations, get the status of the deployment, get the manifest of the deployment, and shut down the deployment.
During regular use, it's recommended that compute providers delegate the /dms/deployment capability to orchestrators.
BidRequestBehavior: /dms/deployment/request
Description: The behavior and capability that will need to be invoked by an orchestrator and delegated from a compute provider to the orchestrator. It allows the orchestrator to request a bid from the compute provider for a specific ensemble.
BidReplyBehavior: /dms/deployment/bid
Description: The behavior and capability that will need to be invoked by a compute provider and delegated from an orchestrator to the compute provider. It allows the compute provider to reply to a bid request from an orchestrator.
CommitDeploymentBehavior: /dms/deployment/commit
Description: The associated behavior with this capability allows an orchestrator to temporarily commit the resources the provider bid on until full allocation.
AllocationDeploymentBehavior: /dms/deployment/allocate
Description: The associated behavior with this capability allows an orchestrator to allocate the resources the provider bid on after having committed it temporarily.
RevertDeploymentBehavior: /dms/deployment/revert
Description: The associated behavior with this capability allows an orchestrator to revert any commit or allocation done during a deployment.
Capability behaviors allow remote nodes to configure capability tokens on the node. The receiver node needs to have delegated the /dms/cap capability to the invoking node.
CapListBehavior: /dms/cap/list
Description: The behavior and associated capability allow getting a list of all the capabilities another node has. The capability should be delegated to the node that needs to get the list of capabilities.
CapAnchorBehavior: /dms/cap/anchor
Description: Allows anchoring capability tokens on another node.
The following capabilities are associated with public behaviors that can be invoked by any actor on the network. These capabilities are normally granted, via the /public capability, to all actors on the network that are KYC'd by NuNet. However, some nodes may choose to restrict these capabilities to specific actors and may not reply to invocations.
PublicHelloBehavior: /public/hello
Description: A public hello behavior where any actor can invoke it on a specific node/actor and get a hello message back if public capability has been granted.
PublicStatusBehavior: /public/status
Description: Invoking this behavior on a node will cause it to reply with its total resource amount it has on the machine along with an error message if any.
BroadcastHelloBehavior: /broadcast/hello
Description: A public hello broadcast in which any actor/node that receives it will reply with a hello message along with its DID.
Allocation capabilities are normally granted to orchestrators once a deployment starts running, to allow the orchestrator to manage the allocations it deployed. These capabilities are normally granted temporarily, since the allocations themselves are ephemeral and live only for the duration of the deployment.
AllocationStartBehavior: /dms/allocation/start
Description: Start an allocation after a deployment.
AllocationRestartBehavior: /dms/allocation/restart
Description: Restart an allocation after a deployment has been started.
RegisterHealthcheckBehavior: /dms/actor/healthcheck/register
Description: Register a new healthcheck mechanism for an allocation.
These too are associated with allocations and are granted to orchestrators once a deployment starts running. These capabilities allow the orchestrator to manage the subnet of the allocations it deployed, so that allocations can communicate over an IP layer on top of the p2p network.
SubnetAddPeerBehavior: /dms/allocation/subnet/add-peer
Description: Add a peer to a subnet.
SubnetRemovePeerBehavior: /dms/allocation/subnet/remove-peer
Description: Remove a peer from a subnet.
SubnetAcceptPeerBehavior: /dms/allocation/subnet/accept-peer
Description: Accept a peer in a subnet.
SubnetMapPortBehavior: /dms/allocation/subnet/map-port
Description: Map a port in a subnet. The mapping will be between the subnet ip and the port on the executor.
SubnetUnmapPortBehavior: /dms/allocation/subnet/unmap-port
Description: Unmap a port in a subnet.
SubnetDNSAddRecordsBehavior: /dms/allocation/subnet/dns/add-records
Description: Add DNS records to a subnet. Normally these records identify the allocations within the subnet. Each Allocation can have a dns_name parameter that can be used to identify the allocation; if it is not provided, the allocation name is used instead. DNS names have a .internal suffix but can be used without it, since the resolver within the executor adds it automatically if it supports it.
SubnetDNSRemoveRecordBehavior: /dms/allocation/subnet/dns/remove-record
Description: Remove a DNS record from a subnet.
Allocation Ensemble Capabilities are dynamic type namespaces that are created when an ensemble is deployed on a node. These capabilities are granted to orchestrators once a deployment starts running to allow the orchestrator to manage the allocations it deployed. These capabilities are normally granted temporarily and live only as long as the ensemble.
EnsembleNamespace: /dms/ensemble/%s
Description: A dynamic namespace that allows the controller to interact with ensembles on the node. The %s will be replaced by the ensemble id once the deployment is running.
AllocationLogsBehavior: /dms/ensemble/%s/allocation/logs
Description: Get the logs of an allocation in an ensemble.
AllocationShutdownBehavior: /dms/ensemble/%s/allocation/shutdown
Description: Shutdown an allocation in an ensemble.
SubnetCreateBehavior:
DynamicTemplate: /dms/ensemble/%s/node/subnet/create
Static: /dms/node/subnet/create
Description: Create a new subnet for an ensemble. This request is supposed to be received by the node of the compute provider and created for the allocations it creates for the ensemble.
SubnetDestroyBehavior:
DynamicTemplate: /dms/ensemble/%s/node/subnet/destroy
Static: /dms/node/subnet/destroy
Description: Destroy a subnet for an ensemble. This request is supposed to be received by the node of the compute provider.
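To make the dynamic templates concrete, here is a trivial illustration of instantiating them (the ensemble id is hypothetical; in practice it is the identifier returned by /dms/node/deployment/new):

```go
package main

import "fmt"

func main() {
	// Hypothetical ensemble id returned by /dms/node/deployment/new.
	ensembleID := "3f1c2d84-9a7b-4e61-8c2f-5d0e6a1b2c3d"

	// The %s placeholder in the dynamic templates is filled with the
	// ensemble id once the deployment is running.
	fmt.Printf("/dms/ensemble/%s/allocation/logs\n", ensembleID)
	fmt.Printf("/dms/ensemble/%s/node/subnet/create\n", ensembleID)
}
```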
This whole package is in proposed status and therefore documentation is missing, save for the proposed functionality part.
TBD
TBD
Source
Rendered from source file
TBD
TBD
TBD
List of issues
All issues that are filed in GitLab related to the implementation of the dms/orchestrator package can be found below. These include any proposals for modifications to the package or new functionality needed to cover the requirements of other packages.
Proposed functionalities
TBD
Data types
proposed
LocalNetworkTopology
More complex deployments may need a data structure that considers the local network topology of a node/DMS, i.e. for reasoning about the speed of connection (as well as capabilities) between neighbors.
Related research blogs
TBD
This file explains the onboarding functionality of the Device Management Service (DMS). This functionality is catered towards compute providers who wish to provide their hardware resources to NuNet for running computational tasks, as well as developers who are contributing to platform development.
Here is a quick overview of the contents of this directory:
: Current file, aimed at developers who wish to modify the onboarding functionality and build on top of it.
: This is the main file where the code for the onboarding functionality exists.
: This file houses functions to generate Cardano wallet addresses along with their private keys.
: This file houses functions to test the address generation functions defined above.
: This file houses functions to get the total capacity of the machine being onboarded.
: This file initializes the loggers associated with the onboarding package.
All the tests for the onboarding package can be found in the file.
The class diagram for the onboarding package is shown below.
Source file
Rendered from source file
Onboard
signature: Onboard(ctx context.Context, config types.OnboardingConfig) error
input #1: Context object
input #2: types.OnboardingConfig
output (error): Error message
The Onboard function executes the onboarding process for a compute provider based on the configuration provided.
signature: Offboard(ctx context.Context) error
input #1: Context object
output: None
output (error): Error message
Offboard removes the resources onboarded to NuNet.
signature: IsOnboarded(ctx context.Context) (bool, error)
input #1: Context object
output #1: bool
output #2: error
IsOnboarded checks if the compute provider is onboarded.
signature: Info(ctx context.Context) (types.OnboardingConfig, error)
input #1: Context object
output #1: types.OnboardingConfig
output #2: error
Info returns the configuration of the onboarding process.
types.OnboardingConfig: Holds the configuration for onboarding a compute provider.
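A hedged usage sketch of these four functions follows; the Onboarder interface and OnboardingConfig placeholder are illustrative wrappers around the documented signatures, not the package's actual API shape (package and import boilerplate is omitted):

```go
// OnboardingConfig stands in for types.OnboardingConfig.
type OnboardingConfig struct{}

// Onboarder is an illustrative wrapper around the documented signatures.
type Onboarder interface {
	Onboard(ctx context.Context, config OnboardingConfig) error
	Offboard(ctx context.Context) error
	IsOnboarded(ctx context.Context) (bool, error)
	Info(ctx context.Context) (OnboardingConfig, error)
}

// ensureOnboarded shows the expected call order: check the onboarding
// status first, then onboard with the provided configuration if needed.
func ensureOnboarded(ctx context.Context, o Onboarder, cfg OnboardingConfig) error {
	onboarded, err := o.IsOnboarded(ctx)
	if err != nil {
		return err
	}
	if onboarded {
		return nil
	}
	return o.Onboard(ctx, cfg)
}
```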
List of issues
All issues that are related to the implementation of the dms package can be found below. These include any proposals for modifications to the package or new functionality needed to cover the requirements of other packages.
proposed
Description: This package is responsible for the creation of a Node object, which is the main actor residing on the machine as long as DMS is running. The Node gets created when the DMS is onboarded.
The Node is responsible for:
Communicating with other actors (nodes and allocations) via messages. This includes sending bid requests, bids, invocations, job status, etc.
Checking used and free resources before creating allocations
Continuous monitoring of the machine
Here is a quick overview of the contents of this package:
The class diagram for the node package is shown below.
Source file
Rendered from source file
TBD
TBD
proposed
Refer to the *_test.go files for unit tests of the different functionalities.
List of issues
All issues that are related to the implementation of the dms package can be found below. These include any proposals for modifications to the package or new functionality needed to cover the requirements of other packages.
Interfaces & Methods
proposed
Node_interface
getAllocation method retrieves an Allocation on the machine based on the provided AllocationID.
checkAllocationStatus method will retrieve the status of an Allocation.
routeToAllocation method will route a message to the Allocation of the job that is running on the machine.
benchmarkCapability method will perform machine benchmarking.
setRegisteredCapability method will record the benchmarked Capability of the machine in a persistent data store for retrieval and usage (mostly in job orchestration functionality).
getRegisteredCapability method will retrieve the benchmarked Capability of the machine from the persistent data store.
setAvailableCapability method changes the available capability of the machine when resources are locked.
getAvailableCapability method will return the currently available capability of the node.
lockCapability method will lock a certain amount of resources for a job. This can happen during bid submission, but it must happen once the job is accepted and before invocation.
getLockedCapabilities method retrieves the locked capabilities of the machine.
setPreferences method sets the preferences of a node as a dms.orchestrator.CapabilityComparator.
getPreferences method retrieves the node preferences as a dms.orchestrator.CapabilityComparator.
getRegisteredBids method retrieves the list of bids received for a job.
startAllocation method will create an allocation based on the invocation received.
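A hypothetical Go rendering of this proposed interface; the method signatures below are inferred from the descriptions above, and the placeholder types stand in for the proposed data types:

```go
// Placeholder types for illustration; the real definitions are part of
// the proposed design.
type (
	AllocationID         string
	Allocation           struct{}
	AllocationStatus     string
	Message              struct{}
	Capability           struct{}
	CapabilityComparator struct{}
	Bid                  struct{}
	Invocation           struct{}
)

// Node is a hypothetical sketch, not the canonical interface.
type Node interface {
	GetAllocation(id AllocationID) (*Allocation, error)
	CheckAllocationStatus(id AllocationID) (AllocationStatus, error)
	RouteToAllocation(id AllocationID, msg Message) error

	BenchmarkCapability() (Capability, error)
	SetRegisteredCapability(c Capability) error
	GetRegisteredCapability() (Capability, error)
	SetAvailableCapability(c Capability) error
	GetAvailableCapability() (Capability, error)
	LockCapability(jobID string, c Capability) error
	GetLockedCapabilities() ([]Capability, error)

	SetPreferences(p CapabilityComparator) error
	GetPreferences() (CapabilityComparator, error)

	GetRegisteredBids(jobID string) ([]Bid, error)
	StartAllocation(inv Invocation) (AllocationID, error)
}
```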
Data types
proposed
dms.node.Node
An initial data model for Node is defined below.
proposed
dms.node.NodeID
The hardware package is responsible for handling the hardware related functionalities of the DMS.
Here is a quick overview of the contents of this package:
cpu: This package contains the functionality related to the CPU of the device.
ram.go: This file contains the functionality related to the RAM.
disk.go: This file contains the functionality related to the Disk.
gpu: This package contains the functionality related to the GPU of the device.
GetMachineResources()
signature: GetMachineResources() (types.MachineResources, error)
input: None
output: types.MachineResources
output(error): error
GetCPU()
signature: GetCPU() (types.CPU, error)
input: None
output: types.CPU
output(error): error
GetRAM()
signature: GetRAM() (types.RAM, error)
input: None
output: types.RAM
output(error): error
GetDisk()
signature: GetDisk() (types.Disk, error)
input: None
output: types.Disk
output(error): error
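A hedged usage sketch of these functions (the import path is an assumption about the repository layout, not a confirmed module path):

```go
package main

import (
	"fmt"
	"log"

	// Assumed import path for illustration; adjust to the actual
	// module layout of the DMS repository.
	"gitlab.com/nunet/device-management-service/dms/hardware"
)

func main() {
	// Query the full machine specification first.
	machine, err := hardware.GetMachineResources()
	if err != nil {
		log.Fatalf("reading machine resources: %v", err)
	}
	fmt.Printf("machine: %+v\n", machine)

	// Individual components can also be queried directly.
	cpu, err := hardware.GetCPU()
	if err != nil {
		log.Fatalf("reading CPU details: %v", err)
	}
	fmt.Printf("cpu: %+v\n", cpu)
}
```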
The hardware types can be found in the types package.
The tests can be found in the *_test.go
files in the respective packages.
The resources package deals with resource management for the machine. This includes the calculation of available resources for new jobs or bid requests.
Here is a quick overview of the contents of this package:
README: Current file which is aimed towards developers who wish to use and modify the DMS functionality.
init: Contains the initialization of the package.
resource_manager: Contains the resource manager, which is responsible for managing the resources of the DMS.
usage_monitor: Contains the implementation of the UsageMonitor interface.
store: Contains the implementation of the store for the resource manager.
All files with *_test.go contain unit tests for the corresponding functionality.
The class diagram for the resources package is shown below.
Source file
Rendered from source file
Manager Interface
The interface methods are explained below.
AllocateResources
signature: AllocateResources(context.Context, ResourceAllocation) error
input #1: Context
input #2: ResourceAllocation
output (error): Error message
AllocateResources allocates the resources to the job.
DeallocateResources
signature: DeallocateResources(context.Context, string) error
input #1: Context
input #2: string (job identifier)
output (error): Error message
DeallocateResources deallocates the resources from the job.
GetTotalAllocation
signature: GetTotalAllocation() (Resources, error)
input: None
output: Resources
output (error): Error message
GetTotalAllocation returns the total resources allocated to the jobs.
GetFreeResources
signature: GetFreeResources() (FreeResources, error)
input: None
output: FreeResources
output (error): Error message
GetFreeResources returns the available resources in the allocation pool.
GetOnboardedResources
signature: GetOnboardedResources(context.Context) (OnboardedResources, error)
input: Context
output: OnboardedResources
output (error): Error message
GetOnboardedResources returns the resources onboarded to the DMS.
UpdateOnboardedResources
signature: UpdateOnboardedResources(context.Context, OnboardedResources) error
input: Context
input: OnboardedResources
output (error): Error message
UpdateOnboardedResources updates the resources onboarded to the DMS.
UsageMonitor
signature: UsageMonitor() types.UsageMonitor
input: None
output: types.UsageMonitor instance
output (error): None
UsageMonitor returns the types.UsageMonitor instance.
This interface defines methods to monitor the system usage. The methods are explained below.
GetUsage
signature: GetUsage(context.Context) (types.Resource, error)
input: Context
output: types.Resource
output (error): Error message
GetUsage returns the resources currently used by the machine.
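Consolidating the documented signatures, a hedged Go sketch of the two interfaces follows (placeholder types stand in for the ones defined in the types package; package and import boilerplate is omitted):

```go
// Placeholder types standing in for the ones in the types package.
type (
	Resources          struct{}
	FreeResources      struct{}
	OnboardedResources struct{}
	ResourceAllocation struct{}
)

// UsageMonitor reports the machine's overall resource usage.
type UsageMonitor interface {
	GetUsage(ctx context.Context) (Resources, error)
}

// Manager ties allocation accounting to usage monitoring.
type Manager interface {
	AllocateResources(ctx context.Context, alloc ResourceAllocation) error
	DeallocateResources(ctx context.Context, jobID string) error
	GetTotalAllocation() (Resources, error)
	GetFreeResources() (FreeResources, error)
	GetOnboardedResources(ctx context.Context) (OnboardedResources, error)
	UpdateOnboardedResources(ctx context.Context, onboarded OnboardedResources) error
	UsageMonitor() UsageMonitor
}
```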
types.Resources: resources defined for the machine.
types.AvailableResources: resources onboarded to NuNet.
types.FreeResources: resources currently available for new jobs.
types.ResourceAllocation: resources allocated to a job.
types.MachineResources: resources available on the machine.
types.GPUVendor: GPU vendors available on the machine.
types.GPU: GPU details.
types.GPUs: A slice of GPU.
types.CPU: CPU details.
types.RAM: RAM details.
types.Disk: Disk details.
types.NetworkInfo: Network details.
In NuNet, compute workloads are structured as compute ensembles. Here, we discuss how an ensemble can be created, deployed, and supervised in the NuNet network.
An ensemble is a collection of logical nodes and allocations. Nodes represent the hardware where the compute workloads run. Allocations are the individual compute jobs that comprise the workload. Each allocation is assigned to a node, and a node can have multiple allocations assigned to it.
All allocations in the ensemble are assigned a private IP address in the 10/8 range and are connected with a virtual private network, implemented using IP over libp2p. All allocations can reach each other through the VPN. Allocation IP addresses can be discovered internally in the ensemble using DNS: each allocation has a name and a DNS name, which by default is just the allocation name in the .internal domain.
Allocation and Node names within an ensemble must be unique. The ensemble as a whole has a globally unique ID (a random UUID).
In order to deploy an ensemble, the user must specify its structure and constraints; this is done with a YAML file encoding the ensemble configuration data structure; the fields of the configuration structure are described in detail in this reference.
Fundamentally the ensemble configuration has the following structure:
A map of allocations, mapping allocation names to configuration for individual allocations.
A map of nodes, mapping node names to configuration for individual nodes.
A list of edges between nodes, encoding specific logical edge constraints.
There are additional fields in the data structure which allow us to include ssh keys and scripts in the configuration, as well as supervision strategy policies.
An allocation's configuration has the following structure:
The name of the allocation executor; this is the environment in which the actual compute job is executed. We currently support Docker and Firecracker VMs, but we plan to also support WASM and generally any sandbox/VM that makes sense for users.
The resources required to run the allocation, such as memory, CPU cores, GPUs, and so on.
The execution details, which encode the executor-specific configuration of the allocation.
The DNS name for internal name resolution of the allocation. This can be omitted, in which case the allocation's name becomes the DNS name.
The list of ssh keys to drop in the allocation, so that administrators can ssh into the allocation.
The list of scripts to execute during provisioning, in execution order.
Finally, the user can also specify the application-specific health check to be performed by the supervisor, so that the health of the application can be ascertained and failures detected.
A node's configuration has the following structure:
The list of allocations that are assigned to the node.
The configuration for mapping public ports to ports in allocations.
The location constraints for the node.
An optional field for explicitly specifying the peer on which the node should be assigned, allowing users and organizations to bring their own nodes into the mix, for instance for hosting sensitive data.
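As a rough illustration of the shape described above, a Go mirror of the configuration might look as follows (all field names and tags are hypothetical; the canonical schema is defined by the ensemble configuration reference):

```go
// EnsembleConfig is a hypothetical mirror of the YAML configuration.
type EnsembleConfig struct {
	Allocations map[string]AllocationConfig `yaml:"allocations"`
	Nodes       map[string]NodeConfig       `yaml:"nodes"`
	Edges       []EdgeConstraint            `yaml:"edges"`
	SSHKeys     []string                    `yaml:"ssh_keys"`
	Scripts     []string                    `yaml:"scripts"`
}

type AllocationConfig struct {
	Executor    string            `yaml:"executor"` // e.g. "docker" or "firecracker"
	Resources   ResourceSpec      `yaml:"resources"`
	Execution   map[string]string `yaml:"execution"` // executor-specific details
	DNSName     string            `yaml:"dns_name"`  // defaults to the allocation name
	SSHKeys     []string          `yaml:"ssh_keys"`
	Provision   []string          `yaml:"provision"` // scripts, in execution order
	HealthCheck string            `yaml:"health_check"`
}

type NodeConfig struct {
	Allocations []string      `yaml:"allocations"`
	Ports       []PortMapping `yaml:"ports"`
	Location    LocationSpec  `yaml:"location"`
	Peer        string        `yaml:"peer"` // optional explicit peer pinning
}

// Empty placeholders; the real specs carry detailed fields.
type (
	ResourceSpec   struct{}
	PortMapping    struct{}
	LocationSpec   struct{}
	EdgeConstraint struct{}
)
```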
In the near future, we also plan to support directly parsing Kubernetes job description files. We also plan to provide a declarative format for specifying large ensembles, so that it is possible to succinctly describe a 10k GPU ensemble for training an LLM, and so on.
It is worth reiterating that ensembles carry with them the constraints specified by the user. This allows the user to have fine-grained control of their ensemble deployment and ensure that certain requirements are met.
In DMS v0.5 we support the following constraints:
Resources for an allocation, such as memory, core count, gpu details, and so on.
Location for nodes; the user can specify the region, city, etc., all the way to choosing a particular ISP. Location constraints can also be negative, so that a node will not be deployed in certain locations, e.g. because of regulatory considerations such as GDPR.
Edge Constraints, which specify the relationship between nodes in the ensemble in terms of available bandwidth and round trip time.
In subsequent releases we plan to add additional constraints (e.g. existence of a contract, price range, explicit datacenter placement, energy sources and so on) and generalize the constraint expression language as graphs.
Given an ensemble specification, the core functionality of the NuNet network is to find and assign peers to nodes such that the constraints of the ensemble are satisfied. The system treats the deployment as a constraint satisfaction problem over permutations of available peers (compute nodes) on which the user is authorized to deploy. The process of deploying an ensemble is called orchestration. In the following, we summarize how deployment orchestration is performed.
Ensemble deployment is initiated by a user invoking the /dms/node/deployment/new behavior on a node which is willing to run an orchestrator for them; this can be just the user's private DMS running on their laptop. The node accepting the invocation creates the orchestrator actor inside its process space, initiates the deployment orchestration, and returns the ensemble identifier to the user. The user can use this identifier to poll the status of the deployment and control the ensemble through the orchestrator actor. The user also specifies a timeout on how long the deployment process may take before declaring failure. This is simply the expiration on the message that invokes /dms/node/deployment/new.
The orchestrator then proceeds to request bids for each node in the ensemble. This is accomplished by broadcasting a message to the /dms/deployment/request behavior in the /nunet/deployment broadcast topic. The deployment request contains a mapping of node names in the ensemble to their aggregate (over all allocations to be assigned to the node) resource constraints, together with location and other constraints that can restrict the search space.
In order for this to proceed, the orchestrator must have the appropriate capabilities; only provider nodes that accept the user's capabilities will respond to the broadcast message. The response to the bid request is a bid for a node in the ensemble, sent as a message to the /dms/deployment/bid behavior on the orchestrator. This also implies that the nodes that submit such bids must have appropriate capabilities accepted by the orchestrator.
Given the appropriate capabilities, the orchestrator collects bids until it has a sufficient number of them or a timeout expires; the timeout ensures prompt progress in the deployment. If the orchestrator doesn't have bids for all nodes, it rebroadcasts its bid request, excluding peers that have already submitted a bid. This continues until there are bids for all nodes or the deployment times out, at which point a deployment failure is declared.
Note that in the case of node pinning, where a specific peer is assigned to an ensemble node in advance (i.e. when a user brings their own nodes into the ensemble), bid requests are not broadcast but rather invoked directly on the peer.
Next, the orchestrator generates permutations of assignments of peers to nodes and evaluates the constraints. Some constraints can be rejected directly without measurement; for instance, round trip latency constraints can be rejected using speed-of-light calculations that provide a lower bound on physically realizable latency. We plan to do the same with bandwidth constraints, given the node's measured link capacity and the throughput bound equation that governs TCP's behavior given bottleneck bandwidth and RTT.
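For example, a speed-of-light rejection test could look like the following sketch (constants and coordinates are illustrative; real fiber paths are longer than the great-circle distance, so this only rejects assignments that are certainly infeasible):

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// greatCircleKm returns the great-circle distance between two points
// given as latitude/longitude in degrees (haversine formula).
func greatCircleKm(lat1, lon1, lat2, lon2 float64) float64 {
	const earthRadiusKm = 6371.0
	rad := math.Pi / 180.0
	dLat := (lat2 - lat1) * rad
	dLon := (lon2 - lon1) * rad
	a := math.Sin(dLat/2)*math.Sin(dLat/2) +
		math.Cos(lat1*rad)*math.Cos(lat2*rad)*math.Sin(dLon/2)*math.Sin(dLon/2)
	return 2 * earthRadiusKm * math.Asin(math.Sqrt(a))
}

// minRTT is a physical lower bound on round trip time: light in fiber
// travels at roughly 2/3 of c (~200 km/ms), and the signal must cover
// the distance twice.
func minRTT(distanceKm float64) time.Duration {
	const fiberSpeedKmPerMs = 200.0
	ms := 2 * distanceKm / fiberSpeedKmPerMs
	return time.Duration(ms * float64(time.Millisecond))
}

func main() {
	// Example: Amsterdam to New York; any RTT constraint tighter than
	// this bound can be rejected without measurement.
	d := greatCircleKm(52.37, 4.90, 40.71, -74.01)
	fmt.Printf("distance: %.0f km, RTT lower bound: %v\n", d, minRTT(d))
}
```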
Once a candidate assignment is deemed viable, the orchestrator proceeds to measure specific constraints for satisfiability. This involves measuring round trip time and bandwidth between node pairs, and is accomplished by invoking the /dms/deployment/constraint/edge behavior.
If a candidate assignment satisfies the constraints, the orchestrator proceeds with committing and provisioning the deployment. This is done with a two-phase commit process: first the orchestrator sends a commit message to all peers to ensure that the resources are still available (nodes don't lock resources when submitting a bid), by invoking the /dms/deployment/commit behavior. If any node fails to commit, the candidate deployment is reverted and the orchestrator starts anew; the revert happens with the /dms/deployment/revert behavior.
If all nodes successfully commit, the orchestrator proceeds to provision the deployment by sending allocation details to the relevant nodes and creating the VPN. This is initiated by invoking the /dms/deployment/allocate behavior on the provider nodes, which creates a new allocation actor. Subsequently, the orchestrator assigns IP addresses to allocations and creates the VPN (what we call the subnet) by invoking the appropriate behaviors on the allocation actors, and then starts the allocations. Once all nodes provision, the deployment is considered running and enters supervision.
The deployment will keep running until the user shuts it down, as long as the user's agreement with the provider is active; in the near future we will also support explicitly specifying durations for running ensembles, and the ability to modify running ensembles in order to support mechanisms like auto scaling.
TODO
In order to discuss the authorization flow for deployment in the NuNet network, we need to distinguish certain actors in the system over the course of an ensemble's lifetime.
Specifically, we introduce the following notation:
Let's call U the user as an actor.
Let's call O the orchestrator, which is an actor living inside a DMS instance (node) for which the user is authorized to initiate a deployment. We call the node where the orchestrator runs N_o. Note that the DID of the orchestrator actor will be the same as the DID of the node on which it runs, but it will have an ephemeral actor ID.
Let's call P_i the set of compute providers that are willing to accept deployment requests from U.
Let's call N_{P_i,j} the DMS nodes controlled by the providers that are willing to accept deployments from users.
And finally, let's call A_i the allocation actor for each running allocation. The DID of each allocation actor will be the same as the DID of the node on which the allocation is running, but it will have an ephemeral actor ID.
Also note that we have certain identifiers pertaining to these actors; let's define the following notation:
DID(x) is the DID of actor x; in general this is the DID that identifies the node on which the actor is running.
ID(x) is the ID of actor x; this is generally ephemeral, except for node root actors, which have persistent identities matching their DID.
Peer(x) is the peer ID of a node/actor x.
Root(x) is the DID of the root anchor of trust for the node/actor x.
Using the notation above, we can enumerate the behavior namespaces and requisite capabilities for deployment of an ensemble:
Invocations from U to N_o are in the /dms/node/deployment namespace.
Invocations from O to N_{P_i,j} for deployment bids: broadcast /dms/deployment/request via the /nunet/deployment topic, or unicast /dms/deployment/request for pinned ensemble nodes.
Messages from N_{P_i,j} to O: /dms/deployment/bid as the reply to a bid request.
Invocations from O to N_{P_i,j} for deployment control are in the /dms/deployment namespace.
Invocations from O to A_i are in the /dms/allocation namespace and are dynamically granted programmatically.
Invocations from O to N_{P_i,j} for allocation control are in the dynamic /dms/ensemble/<ensemble-id> namespace and are dynamically granted programmatically.
This creates the following structure:
U must be authorized with the /dms/node/deployment capability in N_o.
N_o must be authorized with the /dms/deployment capability in N_{P_i,j} so that the orchestrator can make the appropriate invocations.
N_{P_i,j} must be authorized with the /dms/deployment/bid capability on N_o so that it can submit bids to the orchestrator.
Note that the decentralized structure and fine-grained capability model of the NuActor system allows for very tight access control. This ensures that:
Orchestrators can only run on DMS instances where the user is authorized to initiate deployment.
Bid requests will only be accepted by provider DMS instances where the user is authorized to deploy.
Bids will only be accepted by provider DMS instances whom the user has authorized.
In the following we examine common functional scenarios on how to set up the system so that deployments are properly authorized.
TODO
TODO
TODO
TODO
The orchestrator is responsible for job scheduling and management (manages jobs on other DMSs).
A key distinction to note is the option of two types of orchestration mechanisms: push and pull. Broadly speaking, pull orchestration works on the premise that resource providers bid for jobs available in the network, while push orchestration works when a job is pushed directly to a known resource provider, constituting a more centralized orchestration. push orchestration develops on the idea that users choose from the available providers and their resources. However, given the decentralized and open nature of the platform, it may be required to engage the providers to get their current (latest) state and preferences. This leads to an overlap with the pull orchestration approach.
The default setting is to use pull-based orchestration, which is developed in the present proposed specification.
proposed
Job Orchestration
The proposed lifecycle of a job on the NuNet platform consists of various operations, from job posting to settlement of the contract. Below is a brief explanation of the steps involved in job orchestration:
Job Posting: The user posts a job request to the DMS. The job request is validated and a Nunet job is created in the DMS.
Search and Match:
a. The service provider DMS requests bids from other nodes in the network.
b. The DMS on the compute provider compares the capability of the available resources against the job requirements. If all the requirements are met, it then decides whether to submit a bid.
c. The received bids are assessed and the best bid is selected.
Job Request: In case the shortlisted compute provider has not locked the resources while submitting the bid, the job request workflow is executed. This requires the compute provider DMS to lock the necessary resources required for the job and re-submit the bid. Note that at this stage the compute provider can still decline the job request.
Contract Closure: The service provider and the shortlisted compute provider verify that the counterparty is a verified entity approved by NuNet Solutions to participate in the network. This is an important step to establish trust before any work is performed.
If the job does not require any payment (Volunteer Compute), a contract is generated by both the Service Provider and Compute Provider DMS. This is then verified by the Contract-Database. Otherwise, proof of contract needs to be received from the Contract-Database before work starts.
Invocation and Allocation: When the contract closure workflow is completed, both the service provider and the compute provider DMS have an agreement and proof of contract with them. The service provider DMS will then send an invocation to the compute provider DMS, which results in a job allocation being created. An Allocation can be understood as an execution space/environment on actual hardware that enables a job to be executed.
Job Execution: Once the allocation is created, job execution starts on the compute provider machine.
Contract Settlement: After the job is completed, the service provider DMS verifies the work done. If the work is correct, the Contract-Database makes the necessary transactions to settle the contract.
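As an illustration of the bid assessment in step 2c, here is one hypothetical way to score and select bids; the real selection criteria are part of the proposed design and may weigh price, time, and other factors differently:

```go
// Bid is a hypothetical shape for a received bid; the proposed
// dms.orchestrator.Bid carries price and time information.
type Bid struct {
	Provider string
	Price    float64 // quoted cost for the job
	Hours    float64 // estimated completion time
}

// selectBestBid picks the bid with the lowest weighted score; the
// weighting here is arbitrary and purely illustrative.
func selectBestBid(bids []Bid) (Bid, bool) {
	if len(bids) == 0 {
		return Bid{}, false
	}
	score := func(b Bid) float64 { return b.Price + 0.5*b.Hours }
	best := bids[0]
	for _, b := range bids[1:] {
		if score(b) < score(best) {
			best = b
		}
	}
	return best, true
}
```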
Here is a quick overview of the contents of this directory:
: Current file, aimed at developers who wish to use and modify the orchestrator functionality.
: Directory containing package specifications, including the package class diagram.
: Defines and implements interfaces of graph logic for network topology awareness (proposed).
Subpackages
Source
Rendered from source file
TBD
TBD
TBD
List of issues
All issues that are related to the implementation of the dms package can be found below. These include any proposals for modifications to the package or new functionality needed to cover the requirements of other packages.
Interfaces & Methods
proposed
Orchestrator interface
publishBidRequest: sends a request for bids to the network for a particular job. This will depend on the network package for propagation of the request to other nodes in the network.
compareCapability: compares two capabilities and returns a CapabilityComparison object. Expected usage is to compare the capability required by a job with the available capability of a node.
acceptJob: looks at the comparison between the capabilities and preferences of a node, in the form of a CapabilityComparator object, and decides whether to accept a job or not.
sendBid: sends a bid to the node that propagated the BidRequest.
selectBestBid: looks at all the bids received and selects the best one.
sendJobRequest: sends a job request to the shortlisted node whose bid was selected. The compute provider node needs to accept the job request and lock its resources for the job. In case resources are already locked while submitting the bid, this step may be skipped.
sendInvocation: sends an invocation request (as a message) to the node that accepted the job. This message should have all the necessary information to start an Allocation for the job.
orchestrateJob: this will be called when a job is received via the postJob endpoint. It will start the orchestration process. It is also possible that this method could be called via a timer for jobs scheduled in the future.
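A hypothetical Go rendering of this proposed interface, with signatures inferred from the descriptions above (all placeholder types stand in for the proposed data types below):

```go
// Placeholder types for illustration.
type (
	BidRequest           struct{}
	Bid                  struct{}
	Capability           struct{}
	CapabilityComparator struct{}
	CapabilityComparison struct{}
	Invocation           struct{}
	Job                  struct{}
	NodeID               string
)

// Orchestrator is a hedged sketch, not the canonical interface.
type Orchestrator interface {
	PublishBidRequest(req BidRequest) error
	CompareCapability(required, available Capability) CapabilityComparison
	AcceptJob(cmp CapabilityComparator) bool
	SendBid(bid Bid, to NodeID) error
	SelectBestBid(bids []Bid) (Bid, error)
	SendJobRequest(to NodeID) error
	SendInvocation(inv Invocation, to NodeID) error
	OrchestrateJob(job Job) error
}
```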
proposed
Actor interface
sendMessage: sends a message to another actor (Node / Allocation).
processMessage: processes the message received and decides on what action to take.
proposed
Mailbox interface
receiveMessage: receives a message from another Node and converts it into a telemetry.Message object.
handleMessage: processes the message received.
triggerBehavior: this is where the actions taken by the actor based on the message received will be defined.
getKnownTopics: retrieves the gossipsub topics known to the node.
getSubscribedTopics: retrieves the gossipsub topics the node is subscribed to.
subscribeToTopic: subscribes to a gossipsub topic.
unsubscribeFromTopic: unsubscribes from a gossipsub topic.
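Hedged sketches of the proposed Actor and Mailbox interfaces, with names and signatures inferred from the descriptions above:

```go
// Placeholder types for illustration.
type (
	ActorID string
	Message struct{} // stands in for telemetry.Message
)

// Actor is a hypothetical sketch of the proposed interface.
type Actor interface {
	SendMessage(to ActorID, msg Message) error
	ProcessMessage(msg Message) error
}

// Mailbox is a hypothetical sketch of the proposed interface.
type Mailbox interface {
	ReceiveMessage(raw []byte) (Message, error)
	HandleMessage(msg Message) error
	TriggerBehavior(msg Message) error
	GetKnownTopics() ([]string, error)
	GetSubscribedTopics() ([]string, error)
	SubscribeToTopic(topic string) error
	UnsubscribeFromTopic(topic string) error
}
```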
proposed
Other methods
Methods for job request functionality: a. check whether resources are locked, b. lock resources, c. accept job request.
Methods for contract closure: a. validate the other node as a registered entity, b. generate contract, c. KYC validation.
Methods for job execution: a. handle job updates.
Methods for contract settlement: a. job verification.
Note that the above methods are not an exhaustive list; they are to be considered suggestions. The developer implementing the orchestrator functionality is free to make modifications as necessary.
Data types
proposed
dms.orchestrator.Actor: An Actor has an identifier and a mailbox to send/receive messages.
proposed
dms.orchestrator.Bid: Consists of information sent by the compute provider node to the requestor node as a bid for the job broadcast to the network.
proposed
dms.orchestrator.BidRequest: A bid request is a message sent by a node to the network to request bids.
proposed
dms.orchestrator.PriceBid: Contains price-related information of the bid.
proposed
dms.orchestrator.TimeBid: Contains time-related information of the bid.
proposed
dms.orchestrator.CapabilityComparator: Preferences of the node which have an influence on the comparison operation.
TBD
proposed
dms.orchestrator.CapabilityComparison: Result of the comparison operation.
TBD
proposed
dms.orchestrator.Invocation: An invocation is a message sent by the orchestrator to the node that accepted the job. It contains the job details and the contract.
proposed
dms.orchestrator.Mailbox: A mailbox is a communication channel between two actors. It uses the network package functionality to send and receive messages.
proposed
Other data types
Data types related to allocation, contract settlement, job updates, etc. are currently omitted. These should be added as applicable during implementation.
Orchestration steps research blogs
The orchestrator functionality of DMS is being developed based on the research done in the following blogs:
See the related research blogs section for more details on this topic.
Note: the functionality of DMS is currently being developed. See the proposed section for the suggested design of interfaces and methods.
Note: the functionality of DMS is currently being developed. See the proposed section for the suggested data types.