functional_tests

Last updated: 2024-12-12 01:10:10.668602 File source: link on GitLab

Tests each API call defined by the NuNet Open API of the version being tested. The goal of this stage is to ensure that released versions of the platform fully correspond to the released Open APIs, which the core team, community developers and app integrators will use to build further.

Implemented: https://gitlab.com/nunet/nunet-infra/-/blob/develop/ci/templates/Jobs/Functional-Tests.gitlab-ci.yml

Pre-requisites

  • python 3.11+

    • older Python versions might work, though

  • python-venv

  • DMS:

    • a native installation, either directly on the host or using Docker

    • using the project dms-on-lxd

Usage

For detailed instructions on setting up and running the functional tests, please refer to the Quick Setup Guide which provides step-by-step instructions for:

  • Setting up the LXD environment

  • Running standalone and distributed tests

  • Common test scenarios and examples

  • Environment cleanup

Feature environment

This section documents the development guidelines of functional tests targeting the feature environment.

Introduction

The feature environment is a particular instance of an isolated network environment with multiple DMS instances deployed. It uses the project dms-on-lxd to manage the virtual machines and the network hosting the DMS nodes. A full explanation of the feature environment architecture can be found in the feature environment architecture documentation.

There are conceptually two types of tests that use the feature environment: standalone and distributed.

Standalone tests are the subset of functional tests that don't explicitly test network integration, while distributed tests aim to produce particular outcomes when interacting with multiple DMS nodes in coordination.

Standalone tests will test things like hardware support, OS support and system resource footprint, to name a few. They try to answer questions like "can this particular ARM CPU run all the functionality provided by the DMS interface?", "can DMS be deployed on Ubuntu (24.04, 22.04, 20.04), Debian (Bookworm, Bullseye), Arch, etc.?", "are the minimum requirements for running DMS valid in practice?"...

Distributed tests will test things like peer-to-peer functionality, graph traversal and so forth. They try to answer questions like "can each DMS node in the graph see every other node?", "how long does it take for a node to become visible to other nodes when joining the network?", "given multiple DMS nodes, can I successfully send files and messages from each node to another?", "given three DMS nodes, where A can only communicate with B through C, can I successfully interact with B from A?"...

With this distinction in mind, we can explore the interfaces of the feature environment and how they relate to the implementation of the functional tests.

Standalone API tests

The standalone API tests are structured to communicate with port 9999 on localhost over HTTP. This means they can be used as-is against remote nodes by leveraging SSH tunnelling.
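Concretely, each step implementation just issues plain HTTP requests to that local port. A minimal sketch using only the standard library (the helper names and any endpoint paths are illustrative; consult the Open API spec of the release under test for the real routes):

```python
import urllib.error
import urllib.request

DMS_PORT = 9999  # the port the tests expect, locally or through an SSH tunnel

def api_url(path, host="localhost", port=DMS_PORT):
    """Build the URL a functional test targets."""
    return f"http://{host}:{port}{path}"

def probe(path):
    """Return (status, body), or (None, reason) when DMS is unreachable."""
    try:
        with urllib.request.urlopen(api_url(path), timeout=5) as resp:
            return resp.status, resp.read().decode()
    except (urllib.error.URLError, OSError) as exc:
        return None, str(exc)
```

With the SSH tunnel described below in place, `probe` behaves identically whether DMS runs locally or on a remote VM.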

Let's use the feature set described in device_api.feature as an example. Given a DMS installed locally, we can simply run:

behave features/device-management-service/api-tests/device_api.feature

However, in the context of the feature environment, the machine that runs the tests and those that actually execute the required commands and queries are different. Therefore we need to tunnel port 9999 to where we are running behave.

First we have to make sure that nothing is bound to port 9999. For this we can use lsof to check for programs listening on that port:

sudo lsof -i :9999

This command should produce no output if nothing is listening on port 9999. If something is, that program should be stopped before attempting to create the tunnel.

Once we have made sure port 9999 is free, we can open the tunnel:

ssh-keygen -R $VM_IPV4
nohup ssh \
    -4 -N -L 9999:localhost:9999 \
    -o IdentitiesOnly=yes \
    -o StrictHostKeyChecking=no \
    -i $PROJECT_DIR/infrastructure/dms-on-lxd/lxd-key \
    root@$VM_IPV4 &
tunnel_pid=$!

Where PROJECT_DIR is the root of this project and VM_IPV4 is the IP of the target virtual machine against which we want to run the API tests.

The first command uses ssh-keygen -R to remove any stale entry for $VM_IPV4 from the known_hosts file, so that ssh won't complain about a changed host key and refuse to open the tunnel. Since the target virtual machines are ephemeral, this can happen often. Using ssh-keygen here is safe because we are the ones provisioning the virtual machines, so the man-in-the-middle warnings are known to be false alarms.

The second command uses ssh to create an IPv4 tunnel on port 9999. nohup combined with & is a shell idiom that runs ssh in the background, immune to hangups, freeing your terminal for further commands. For more information see this stackoverflow answer.

The last command saves the process id of the tunnel in the variable tunnel_pid so it can be later used to destroy the tunnel.
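The same sequence can also be driven from Python, e.g. from a test harness. A sketch mirroring the shell commands above (function names are illustrative; the key path and IP come from dms-on-lxd):

```python
import subprocess

def forget_host(vm_ip):
    """Equivalent of `ssh-keygen -R $VM_IPV4`: drop the stale known_hosts entry."""
    subprocess.run(["ssh-keygen", "-R", vm_ip], check=False, capture_output=True)

def build_tunnel_cmd(vm_ip, key_path, port=9999):
    """Argument list matching the shell example above."""
    return [
        "ssh", "-4", "-N",
        "-L", f"{port}:localhost:{port}",
        "-o", "IdentitiesOnly=yes",
        "-o", "StrictHostKeyChecking=no",
        "-i", key_path,
        f"root@{vm_ip}",
    ]

def open_tunnel(vm_ip, key_path, port=9999):
    """Start the tunnel in the background; call .terminate() on the result
    when done, the equivalent of `kill $tunnel_pid`."""
    forget_host(vm_ip)
    return subprocess.Popen(build_tunnel_cmd(vm_ip, key_path, port))
```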

Now we can just run behave again; it will connect to the local port 9999, and the connection will be forwarded to the target host.

Once we are done, we can close the tunnel using the process ID we saved before:

kill $tunnel_pid

Standalone CLI tests

CLI tests don't have the same flexibility as HTTP-based API tests. The commands must be piped to the remote host using ssh directly; at least there is no known way to forward these commands transparently while running the Python runtime locally.

Therefore the tests need to be refactored so that, given a list of IPv4 addresses, a username (defaulting to root) and a key, they run the necessary CLI commands over SSH, and otherwise run them locally.

The proposed way to do this is to run behave passing this information using -D:

behave \
  -D ssh_host=$target_ip \
  -D ssh_key_file=$PROJECT_DIR/infrastructure/dms-on-lxd/lxd-key \
  -D ssh_username=root \
  features/device-management-service/cli-tests/nunet_cli.feature

Note that this command uses files produced by the dms-on-lxd project. It assumes that the target IP which will run the commands is stored in the variable $target_ip. For more information about lxd-key and other files produced by dms-on-lxd, see the dms-on-lxd documentation.

How exactly we implement this is up for debate, but there is a proof of concept that can be used as an example. For more information refer to the feature POC.
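One possible shape for that dispatch, as a sketch (not the POC code): a helper reads behave's userdata (`context.config.userdata`, which holds the `-D key=value` pairs) and returns a runner that executes CLI commands either locally or over SSH. Function names are illustrative:

```python
import shlex
import subprocess

def ssh_argv(cmd, host, user="root", key=None):
    """Build the ssh argument list that runs `cmd` on the remote host."""
    argv = ["ssh", "-o", "IdentitiesOnly=yes", "-o", "StrictHostKeyChecking=no"]
    if key:
        argv += ["-i", key]
    return argv + [f"{user}@{host}", cmd]

def build_runner(userdata):
    """Return a callable running a CLI command locally, or over SSH
    when ssh_host was passed via -D."""
    host = userdata.get("ssh_host")
    if not host:
        return lambda cmd: subprocess.run(
            shlex.split(cmd), capture_output=True, text=True)
    user = userdata.get("ssh_username", "root")
    key = userdata.get("ssh_key_file")
    return lambda cmd: subprocess.run(
        ssh_argv(cmd, host, user, key), capture_output=True, text=True)

def before_all(context):
    # behave hook in features/environment.py; step definitions then call
    # context.run_cli("nunet ...") without caring where it executes.
    context.run_cli = build_runner(context.config.userdata)
```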

Distributed tests

To compose tests that require a certain level of coordination, the proposed approach is to implement the Gherkin features in Python, delegating to the behave framework and the Python implementation the responsibility of coordinating these interactions and hiding them behind high-level functionality descriptions.

For this, take the Feature POC and its implementation as an example.

In it there are general features described, but each scenario is run on all nodes before moving to the next. This way, we can test that all nodes can onboard, that all nodes can see each other in the peers list, that all nodes can send messages to all other nodes, and that all of them can offboard, either using the CLI over SSH or the API using sshtunnel.

The code won't be repeated here, to avoid it becoming obsolete when we either change the POC code or remove it altogether.
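Still, the shape of one such check, full peer visibility, can be sketched independently of the POC as a pure function over each node's peer list (node identifiers and data shapes are illustrative):

```python
def missing_peer_links(peer_tables):
    """peer_tables maps node id -> set of peer ids that node can see.
    Return the (node, peer) pairs where visibility is missing; an empty
    result means every node sees every other node."""
    nodes = set(peer_tables)
    return sorted(
        (node, other)
        for node in nodes
        for other in nodes - {node}
        if other not in peer_tables[node]
    )
```

A step definition would populate `peer_tables` by querying each node's peers (via the CLI over SSH or the API over a tunnel) and assert that the returned list is empty.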

SP and CP

It's not hard to imagine extending the POC to the scenario of a service provider and a compute provider.

Let's imagine the service provider has a workload that requires GPU offloading but has no GPU, while the compute provider has a GPU available for such workloads. In this scenario, we can add a preparation step in behave that queries the remote hosts for resources, for instance using lspci over ssh, to identify which machines can serve as the service provider and the compute provider.

Doing this, we can have a test that describes exactly that scenario, and implement the feature test so that the elected service provider (with or without a GPU) posts a job specifically targeted at the node with GPU capabilities, which serves as the compute provider.
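A sketch of that election step, assuming the `lspci` output of each host has already been collected over SSH (host names and the function name are illustrative; `lspci` reports GPUs as "VGA compatible controller" or "3D controller"):

```python
def elect_providers(lspci_by_host):
    """Given {host: lspci output}, return (service_provider, compute_provider),
    where the compute provider must have a GPU and the service provider may be
    any other host. Return None when no valid pairing exists."""
    gpu_hosts = [
        host for host, out in lspci_by_host.items()
        if "VGA compatible controller" in out or "3D controller" in out
    ]
    if not gpu_hosts:
        return None
    compute_provider = gpu_hosts[0]
    service_provider = next(
        (h for h in lspci_by_host if h != compute_provider), None)
    if service_provider is None:
        return None
    return service_provider, compute_provider
```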
