Introduction
The Vollo Trees SDK is designed for low latency streaming inference of decision tree models on FPGA platforms.
🔎 See Vollo for the parent product designed for low latency streaming inference on neural networks.
You can estimate the latency of your model using the Vollo Trees SDK without needing an FPGA or a Vollo license; see Getting Started for details.
This document outlines the following:
- Installation
- Key features of Vollo Trees
- Steps to get started with Vollo Trees
- The Vollo Trees Compiler API
- The Vollo Runtime API
- Hardware requirements and setup
Installation
The latest SDK is available for download from https://github.com/MyrtleSoftware/vollo-trees-sdk/releases.
Download the vollo-trees-sdk-<version>.run self-extractable archive and execute it to extract the Vollo Trees SDK contents to the current directory.
chmod +x vollo-trees-sdk-<version>.run
./vollo-trees-sdk-<version>.run
Key Features
Vollo Trees accelerates machine learning inference for low latency streaming models typically found in financial trading or fraud detection systems, such as:
- Market predictions
- Risk analysis
- Anomaly detection
- Portfolio optimisation
Key characteristics of Vollo Trees are:
- Low latency inference of decision tree models, typically between 1.9μs and 2.4μs.
- High density processing in a 1U server form factor suitable for co-located server deployment.
- Compiles decision tree models for use on the accelerator.
Getting Started
You can get started with evaluating your decision tree model's performance on Vollo Trees using the Vollo Trees compiler, which doesn't require an FPGA accelerator.
When you are ready, you can run inferences with your model on a Vollo FPGA accelerator using an evaluation license.
Performance estimation and model design with the Vollo Trees compiler
You can use the Vollo Trees compiler to compile and estimate the performance of your model in an ML user's environment without any accelerator.
The Vollo Trees compiler execution time is typically on the order of seconds, enabling fast model iteration for tuning models to meet a latency target.
To estimate performance of your model with the Vollo Trees SDK:

1. Download and extract the Vollo Trees SDK.

2. Install the Vollo Trees compiler Python libraries.

3. Compile your model using the Vollo Trees compiler and evaluate the compiled program on inference data to generate an estimate of the compute latency that will be achieved with Vollo Trees. See the Vollo Trees compiler Example for a fully worked example of this, including performance estimation; a minimal sketch also follows this list.

4. Iterate on your model architecture to meet your combined latency and accuracy requirements.
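As a rough sketch of steps 3 and 4 (using the compiler API described in the Vollo Trees Compiler chapter below; the model file name and feature count here are placeholders):

import numpy as np
import vollo_trees_compiler as vtc

# Lower an ONNX TreeEnsembleRegressor to the compiler's intermediate representation
forest = vtc.Forest.from_onnx("my_model.onnx")  # placeholder ONNX file

# Compile for a 128-unit IA-420F configuration
config = vtc.Config.ia420f_u128()
program = forest.to_program_bf16(config)

# Evaluate on one (flattened) input to get an output and a cycle count estimate
n_features = 256  # placeholder: must match the model's number of input features
out, est_cycles = program.eval_with_cycle_estimate(np.random.rand(n_features))
print(out, est_cycles)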
Validating inference performance using the Vollo Trees FPGA accelerator
When you are ready to run inferences with your models on a Vollo Trees accelerator, you will need a compatible FPGA based PCIe accelerator card and a Vollo Trees license.
Evaluation licenses can be provided free of charge by contacting vollo@myrtle.ai.
To validate inference performance on Vollo Trees:
1. Compile your model and save it as a .vollo program file using the Vollo Trees compiler. See the Vollo Trees compiler Example for a fully worked example.

2. Run and benchmark your model on the accelerator using the Vollo runtime C example. Make sure to pass the example application the path to your saved .vollo program when you invoke it on the command line.
Note that the Vollo Trees SDK includes prebuilt FPGA bitstreams for selected PCIe accelerator cards, so no FPGA compilation or configuration is required after initial accelerator setup. As a result, loading user models to run on Vollo takes under a second, enabling fast iteration and evaluation of different models.
Vollo Trees Compiler
The vollo-trees-compiler Python library can compile an ONNX TreeEnsembleRegressor model to a Vollo program (.vollo file). It also provides functionality to estimate the performance of the Vollo program.
The Vollo Runtime section describes how to run a Vollo program on a Vollo accelerator.
API Reference
This chapter walks through examples of how to use the Vollo Trees compiler that should cover the most commonly used parts of the API.
A more complete API reference can be found here.
Installation
Set up the Vollo environment variables by sourcing setup.sh in bash.
Install the wheel file for the Vollo Trees compiler library. It's recommended that you install this into a virtual environment.
Note: the packaged wheel only supports Python 3.7 or greater.
python3 -m venv vollo-venv
source vollo-venv/bin/activate
pip install --upgrade pip
pip install "$VOLLO_SDK"/python/*.whl
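To confirm the installation succeeded, you can import the library from the virtual environment (a minimal smoke test; vtc.Config and vtc.Forest are the classes used throughout this document):

import vollo_trees_compiler as vtc
# A successful import confirms the wheel is installed in the active environment
print(vtc.Config, vtc.Forest)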
Supported Models
The vollo-trees-compiler converts ONNX models containing a TreeEnsembleRegressor node into Vollo programs.
The skl2onnx and onnxmltools Python libraries provide functionality for converting decision tree regressors from various machine learning libraries into ONNX.
For example, to convert a scikit-learn RandomForestRegressor into a Vollo program:
import numpy as np
import vollo_trees_compiler as vtc
from sklearn.ensemble import RandomForestRegressor
from skl2onnx.common.data_types import FloatTensorType
from skl2onnx import convert_sklearn
n_estimators = 256
max_depth = 8
n_features = 256
n_samples = 2**max_depth
X = np.random.rand(n_samples, n_features)
y = np.random.rand(n_samples)
random_forest = RandomForestRegressor(
    n_estimators=n_estimators, max_depth=max_depth
)
# Fit some given data X, y
random_forest.fit(X, y)

# Convert the model to ONNX
onnx_model = convert_sklearn(
    random_forest,
    initial_types=initial_type,
    target_opset=12,
)
with open("sklearn_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
config = vtc.Config.ia420f_u128()
forest = vtc.Forest.from_onnx("sklearn_model.onnx")
program = forest.to_program_bf16(config)
program.save("sklearn_model.vollo")
See the sklearn-onnx documentation for details on converting from LightGBM, XGBoost and CatBoost to ONNX.
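For instance, a LightGBM regressor might be converted along the following lines. This is a sketch using onnxmltools' convert_lightgbm; consult the onnxmltools documentation for the exact options your version supports:

import numpy as np
from lightgbm import LGBMRegressor
from onnxmltools import convert_lightgbm
from onnxmltools.convert.common.data_types import FloatTensorType

n_features = 128
X = np.random.rand(512, n_features)
y = np.random.rand(512)

# Train a LightGBM regressor on some given data X, y
model = LGBMRegressor(n_estimators=100, max_depth=6)
model.fit(X, y)

# Convert to an ONNX model containing a TreeEnsembleRegressor node
onnx_model = convert_lightgbm(
    model,
    initial_types=[("input", FloatTensorType([1, n_features]))],
    target_opset=12,
)
with open("lightgbm_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())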
Example
The Vollo Trees compiler expects an ONNX model with a TreeEnsembleRegressor node as input.
import numpy as np
import vollo_trees_compiler as vtc
from sklearn.ensemble import RandomForestRegressor
from skl2onnx.common.data_types import FloatTensorType
from skl2onnx import convert_sklearn
n_estimators = 256
max_depth = 8
n_features = 256
n_samples = 2**max_depth
X = np.random.rand(n_samples, n_features)
y = np.random.rand(n_samples)
random_forest = RandomForestRegressor(
    n_estimators=n_estimators, max_depth=max_depth
)
# Fit some given data X, y
random_forest.fit(X, y)

# Convert the model to ONNX
initial_type = [("input", FloatTensorType([1, n_features]))]
onnx_model = convert_sklearn(
    random_forest,
    initial_types=initial_type,
    target_opset=12,
)
with open("sklearn_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
The first stage of compiling a model is to lower it to a vollo_trees_compiler.Forest. This is the Vollo Trees compiler's intermediate representation for decision tree ensembles.
model_path = "sklearn_model.onnx"
forest = vtc.Forest.from_onnx(model_path)
The Forest can be compiled to a Vollo program given a vollo_trees_compiler.Config accelerator configuration.
config = vtc.Config.ia420f_u128()
program_bf16 = forest.to_program_bf16(config)
Save the program to a file so that it can be used for inference by the Vollo runtime.
program_bf16.save('example.vollo')
Simulation
The Vollo Trees compiler can evaluate a program on a given input. This can be used to:
- Estimate the performance of a model: a cycle count can optionally be returned with the evaluation output.
- Verify the correctness of the compilation stages, including the effect of quantisation.
A vollo_trees_compiler.Forest can instead be converted to an f32 program. This way, the comparators and inputs will not be quantized to bf16, which can be useful for testing against other inference measures (e.g. onnxruntime). Note however that the f32 program cannot be used with the Vollo runtime.
The program can then be evaluated on an input to determine the output value and estimate the cycle count.
program_f32 = forest.to_program_f32(config)
# Note that even though the ONNX model expects a multi-dimensional input,
# Program evaluation (and the vollo-runtime) expects a flattened input
input = np.random.rand(n_features)
out_value, est_cycle_count = program_f32.eval_with_cycle_estimate(input)
print(f"f32 Program output: {out_value}")
print(f"Estimated cycle count: {est_cycle_count}")
You can also obtain a pessimistic cycle estimate which assumes that the maximum depth branch is taken in each tree.
pessimistic_estimate = program_f32.pessimistic_cycle_estimate()
print(f"Pessimistic cycle estimate: {pessimistic_estimate}")
This evaluation can also be performed on the bf16 quantized version of the program.
Note that there will be some discrepancy between the estimated cycle count and the true cycle count.
Also note that this estimate does not model the latency of the communication between the host and the Vollo accelerator. The single-decision-t1-d1-f32 benchmark gives the round-trip latency for the smallest possible program. This can be added to the cycle count estimate (accounting for the FPGA clock rate of 400 MHz) to give an estimate for the overall latency of the model.
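For example, a rough overall latency estimate might be computed as follows (a sketch: the round-trip figure below is the single-decision-t1-d1-f32 benchmark value from the Benchmarks section and will vary with your system and buffer API):

# At a 400 MHz clock, the accelerator runs 400 cycles per microsecond
CLOCK_MHZ = 400
round_trip_us = 1.8  # measured round-trip latency of the smallest possible program

# est_cycle_count comes from eval_with_cycle_estimate above
compute_us = est_cycle_count / CLOCK_MHZ
total_us = compute_us + round_trip_us
print(f"Estimated overall latency: {total_us:.2f} us")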
Benchmarks
This section provides benchmarks for the Vollo Trees accelerator for a variety of decision tree models.
Performance figures are given for two configurations of the Vollo accelerator: a 256-unit configuration provided for the IA-840F accelerator card, and a 128-unit configuration provided for the IA-420F accelerator card. If you require a different configuration, please contact us at vollo@myrtle.ai.
All these performance numbers can be measured using the vollo-trees-sdk with the correct accelerator card by running the provided benchmark script.
IA-840F: 256 units
Raw buffer API
This is using buffers allocated with vollo_rt_get_raw_buffer, which lets the runtime skip the IO copy.
model | num trees | max depth | input features | fully populated | mean latency (us) | 99th percentile latency (us) |
---|---|---|---|---|---|---|
single-decision-t1-d1-f32 | 1 | 1 | 32 | No | 1.7 | 1.9 |
example-t1000-d5-f128 | 1000 | 5 | 128 | No | 1.9 | 2.0 |
example-t1000-d5-f128-full | 1000 | 5 | 128 | Yes | 1.8 | 2.0 |
example-t1000-d8-f128 | 1000 | 8 | 128 | No | 1.9 | 2.1 |
example-t1000-d8-f128-full | 1000 | 8 | 128 | Yes | 1.9 | 2.1 |
example-t512-d8-f512 | 512 | 8 | 512 | No | 1.9 | 2.1 |
example-t1024-d8-f1024 | 1024 | 8 | 1024 | No | 2.0 | 2.2 |
example-t4096-d8-f1024-full | 4096 | 8 | 1024 | Yes | 2.2 | 2.3 |
User buffers
model | num trees | max depth | input features | fully populated | mean latency (us) | 99th percentile latency (us) |
---|---|---|---|---|---|---|
single-decision-t1-d1-f32 | 1 | 1 | 32 | No | 1.8 | 2.0 |
example-t1000-d5-f128 | 1000 | 5 | 128 | No | 1.9 | 2.1 |
example-t1000-d5-f128-full | 1000 | 5 | 128 | Yes | 1.9 | 2.1 |
example-t1000-d8-f128 | 1000 | 8 | 128 | No | 2.0 | 2.2 |
example-t1000-d8-f128-full | 1000 | 8 | 128 | Yes | 2.0 | 2.1 |
example-t512-d8-f512 | 512 | 8 | 512 | No | 2.1 | 2.3 |
example-t1024-d8-f1024 | 1024 | 8 | 1024 | No | 2.3 | 2.5 |
example-t4096-d8-f1024-full | 4096 | 8 | 1024 | Yes | 2.5 | 2.7 |
IA-420F: 128 units
Raw buffer API
model | num trees | max depth | input features | fully populated | mean latency (us) | 99th percentile latency (us) |
---|---|---|---|---|---|---|
single-decision-t1-d1-f32 | 1 | 1 | 32 | No | 1.7 | 1.9 |
example-t1000-d5-f128 | 1000 | 5 | 128 | No | 1.9 | 2.1 |
example-t1000-d5-f128-full | 1000 | 5 | 128 | Yes | 1.9 | 2.1 |
example-t1000-d8-f128 | 1000 | 8 | 128 | No | 1.9 | 2.1 |
example-t1000-d8-f128-full | 1000 | 8 | 128 | Yes | 1.9 | 2.1 |
example-t512-d8-f512 | 512 | 8 | 512 | No | 1.9 | 2.1 |
example-t1024-d8-f1024 | 1024 | 8 | 1024 | No | 2.0 | 2.2 |
example-t4096-d8-f1024-full | 4096 | 8 | 1024 | Yes | 2.5 | 2.6 |
User buffers
model | num trees | max depth | input features | fully populated | mean latency (us) | 99th percentile latency (us) |
---|---|---|---|---|---|---|
single-decision-t1-d1-f32 | 1 | 1 | 32 | No | 1.8 | 1.9 |
example-t1000-d5-f128 | 1000 | 5 | 128 | No | 2.0 | 2.2 |
example-t1000-d5-f128-full | 1000 | 5 | 128 | Yes | 2.0 | 2.1 |
example-t1000-d8-f128 | 1000 | 8 | 128 | No | 2.1 | 2.2 |
example-t1000-d8-f128-full | 1000 | 8 | 128 | Yes | 2.0 | 2.2 |
example-t512-d8-f512 | 512 | 8 | 512 | No | 2.1 | 2.3 |
example-t1024-d8-f1024 | 1024 | 8 | 1024 | No | 2.3 | 2.5 |
example-t4096-d8-f1024-full | 4096 | 8 | 1024 | Yes | 2.8 | 3.0 |
Setting up the Vollo accelerator
This section describes how to program your accelerator card with the Vollo Trees accelerator bitstream upon first use, and how to reprogram it with updated versions. It also describes how to obtain a Vollo license, which you will need in order to use the Vollo accelerator.
Environment Variable Setup
The initial setup instructions should be run in the Vollo SDK directory.
cd vollo-trees-sdk-<VERSION>
When using Vollo, you should also have the setup.sh script sourced in bash to set up the environment variables used by Vollo:
source setup.sh
System Requirements
CPU Requirements
The minimum CPU specification for the system is shown below.
- Single socket, 6-core Intel Xeon CPU at 2.0 GHz, an equivalent AMD processor, or better.
- 8 GB RAM
Accelerator Card Requirements
The SDK runs on a server CPU with PCIe FPGA accelerator cards. It currently supports the following accelerator cards:
Accelerator Card | FPGA |
---|---|
BittWare IA-420f | Intel Agilex AGF014 |
BittWare IA-840f | Intel Agilex AGF027 |
Operating System Requirements
Vollo is compatible with Ubuntu 20.04 and later.
Programming the FPGA
Programming the FPGA via JTAG
If your FPGA is not already programmed with the Vollo accelerator or the Vollo Trees accelerator then please follow these instructions to load the bitstream into the accelerator card's flash memory.
This requires a USB cable to be connected to the accelerator card and Quartus programmer to be installed on the system so that the device can be programmed over JTAG.
If the FPGA card already has a Vollo accelerator or Vollo Trees accelerator bitstream, it can be updated over PCIe by following the steps in the section Programming the FPGA via PCIe below. Note that you only need to update the bitstream if updating to an incompatible version of the Vollo Trees SDK. Programming over PCIe is faster than programming over JTAG, and does not require a USB programming cable or for the Quartus Programmer to be installed.
1. Download and install the latest Quartus Programmer:

   - Navigate to https://www.intel.com/content/www/us/en/software-kit/782411/intel-quartus-prime-pro-edition-design-software-version-23-2-for-linux.html.
   - Select Additional Software and scroll down to find the Programmer.
   - Follow the instructions for installation.

2. Add the Quartus Programmer to your path:

   export QUARTUS_DIR=<path to qprogrammer install>
   export PATH=$QUARTUS_DIR/qprogrammer/quartus/bin:$PATH

3. Start the JTAG daemon:

   sudo killall jtagd
   sudo jtagd

4. Run jtagconfig from the Quartus install; you should see the device(s):

   $ jtagconfig
   1) IA-840F [1-5.2]
      0341B0DD   AGFB027R25A(.|R0)

5. Navigate to the directory containing the jic file:

   cd <vollo-trees-sdk>/bitstream

6. Set the JTAG clock frequency of the device you want to program to 16 MHz. Specify the device by providing the name returned by jtagconfig:

   jtagconfig --setparam "IA-840F [1-5.2]" JtagClock 16M

7. Start the programming operation on the chosen device. This takes around 20 minutes. For the IA-840F:

   quartus_pgm -c "IA-840F [1-5.2]" -m JTAG -o "ipv;vollo-ia840f-u256d8192.jic"

   Or for the IA-420F:

   quartus_pgm -c "IA-420F [1-5.2]" -m JTAG -o "ipv;vollo-ia420f-u128d8192.jic"

8. Go back to step 6 and program any other devices.

9. Power off the system and start it back up. The bitstream will now be loaded onto the FPGA.

   ⚠️ For the configuration process to be triggered, the board has to register the power being off. It is recommended to turn the power off and then wait a few seconds before turning the power back on to ensure this happens.

10. Check that a Vollo bitstream is loaded:

    $ lspci -d 1ed9:766f
    51:00.0 Processing accelerators: Myrtle.ai Device 766f (rev 01)

    Check that the correct Vollo bitstream is loaded:

    cd <vollo-trees-sdk>
    bin/vollo-tool bitstream-check bitstream/<bitstream-name>.json
Programming the FPGA via PCIe
NOTE: this can only be done with an FPGA that is already programmed with a Vollo bitstream.
1. Load the kernel driver:

   sudo ./load-kernel-driver.sh

2. Check the current bitstream information:

   bin/vollo-tool bitstream-info

3. Check that the device is set up for remote system updates by running the command below, with <device index> representing the index of the device you want to update, in the order shown by the previous command, starting from 0. It should print a json string to the terminal showing the device status.

   bin/vollo-tool fpga-config rsu-status <device index>

4. Update the USER_IMAGE partition of the flash with the new bitstream image contained in the rpd archive. This should take around 5 minutes. Do not interrupt this process until it completes.

   sudo ./load-kernel-driver.sh
   bin/vollo-tool fpga-config overwrite-partition <device index> <.rpd.tar.gz file> USER_IMAGE

5. Repeat step 4 for any other devices you wish to update.

6. Power off the system and start it back up.

   ⚠️ For the configuration process to be triggered, the board has to register the power being off. It is recommended to turn the power off and then wait a few seconds before turning the power back on to ensure this happens.

7. Repeat steps 1, 2 and 3. The bitstream-info command should show that the updated bitstream has been loaded (e.g. a newer release date), and the output of the rsu-status command should show all zeroes for the error_code and failing_image_address fields.

8. Check that the correct Vollo bitstream is loaded:

   sudo ./load-kernel-driver.sh
   bin/vollo-tool bitstream-check bitstream/<bitstream-name>.json
Licensing
Vollo is licensed on a per-device basis.
Redeeming licenses with vollo-tool
You will receive a purchase-token with your Vollo purchase. The purchase-token can be used to redeem Vollo licenses for a set number of devices.
To see the number of credits (i.e. the number of devices which can be redeemed) on your purchase-token, run:
bin/vollo-tool license num-remaining-devices -t <purchase-token>
To redeem devices on your purchase token:
1. Load the kernel driver if you haven't already done so:

   sudo ./load-kernel-driver.sh

2. Run vollo-tool device-ids. This will enumerate all Vollo accelerators and output their device IDs.

   bin/vollo-tool device-ids | tee vollo.devices

3. Run vollo-tool license redeem-device, passing the device IDs you wish to generate licenses for. This will print a breakdown of which devices will consume credits on the purchase-token.

   bin/vollo-tool license redeem-device -t <purchase-token> --device-ids <device IDs>

   Alternatively you can pass the vollo.devices output from the previous step if you wish to redeem licenses for all devices.

   bin/vollo-tool license redeem-device -t <purchase-token> --device-id-file <device ID file>

4. When you have confirmed which devices will consume credits on the purchase-token, run vollo-tool license redeem-device --consume-credits to generate the licenses. The licenses will be printed to stdout.

   bin/vollo-tool license redeem-device -t <purchase-token> --device-ids <device IDs> --consume-credits | tee vollo.lic
The licenses redeemed on a purchase token can be viewed at any time by running vollo-tool license view-licenses:
bin/vollo-tool license view-licenses -t <purchase-token> | tee vollo.lic
Installing a license
1. Set the license file location in the environment variable MYRTLE_LICENSE:

   export MYRTLE_LICENSE=<license file>

2. Check that the license for your device(s) is being recognised:

   bin/vollo-tool license-check

   If successful, the output should look like this:

   Ok: found 2 devices with valid licenses
Running an example
The Vollo Trees SDK contains a trivial program for each accelerator to check if the accelerator is working.
1. Ensure you have run the setup steps:

   cd <vollo-trees-sdk>
   sudo ./load-kernel-driver.sh
   source setup.sh
   export MYRTLE_LICENSE=<your-license-file>

2. Compile the C runtime example:

   (cd example; make)

3. Run the example.

   For a block-size 64 accelerator such as vollo-ia840f-u256d8192.jic:

   ./example/vollo-example example/single-decision-u256.vollo

   For a block-size 32 accelerator such as vollo-ia420f-u128d8192.jic:

   ./example/vollo-example example/single-decision-u128.vollo

   You should see an output similar to the following:

   Using program: "example/single-decision-u256.vollo"
   Using vollo-rt version: 20.0.0
   Using Vollo accelerator with 256 tree unit(s)
   Program metadata for model 0:
     1 input with shape: [32]
     1 output with shape: [1]
   Starting 10000 inferences
   Ran 10000 inferences in 0.018070 s with:
     mean latency of 1.790225 us
     99% latency of 1.942000 us
     throughput of 553402.910468 inf/s
   Done
Running the benchmark
The release comes with a benchmark script that can be used to measure the performance of the accelerator for a variety of models. The script uses the Vollo Trees compiler to compile the models for your accelerator and then runs the models on the accelerator to measure the performance.
1. Install the script dependencies:

   sudo apt install python3-venv jq

   Note: the compiler requires Python 3.7 or later.

2. Ensure you have run the setup steps:

   cd <vollo-trees-sdk>
   sudo ./load-kernel-driver.sh
   source setup.sh
   export MYRTLE_LICENSE=<your-license-file>

3. Run the benchmark:

   $VOLLO_TREES_SDK/example/benchmark.sh

4. Cross-reference your numbers with those in the Benchmarks section of the documentation.
Vollo Runtime
The Vollo runtime provides a low latency asynchronous inference API for timing critical inference requests on the Vollo accelerator.
A couple of example C programs that use the Vollo runtime API are included in the installation in the example/ directory.
In order to use the Vollo runtime you need to have an accelerator set up:

- A programmed FPGA
- A loaded kernel driver and an installed license
- An environment set up with source setup.sh
Python API
The Vollo SDK includes Python bindings for the Vollo runtime. These can be more convenient than the C API for e.g. testing Vollo against PyTorch models.
The API for the Python bindings can be found here.
A small example of using the Python bindings is provided here.
C API
The Vollo runtime API is a C API with simple types and functions, making it straightforward to use from any language with a C FFI.
- Header file: $VOLLO_SDK/include/vollo-rt.h
- Dynamic library: $VOLLO_SDK/lib/libvollo_rt.so
- Static library: $VOLLO_SDK/lib/libvollo_rt.a
It links against GLIBC (version 2.27 or later); contact us if you have other requirements.
To compile against Vollo RT with a standard C compiler, you can use the following flags:
-I $VOLLO_SDK/include -L $VOLLO_SDK/lib -lvollo_rt
These are the main steps (in order) a program using vollo_rt will follow:

- Initialise the Vollo runtime using vollo_rt_init
- Add Vollo accelerators to the runtime using vollo_rt_add_accelerator (Note: the current release only supports one accelerator)
- Load a Vollo program onto the Vollo accelerators with vollo_rt_load_program
- Optionally, inspect the metadata about the models in the program using API calls such as vollo_rt_num_models and vollo_rt_model_num_inputs
- Queue and run inference jobs by first calling vollo_rt_add_job_bf16 (or vollo_rt_add_job_fp32) and then polling in a loop for their completion using vollo_rt_poll. You can queue several jobs before calling vollo_rt_poll or add extra jobs at any point.
- Finally, call vollo_rt_destroy to release resources.
The API is designed to explicitly return errors when it can to let the user handle them as they see fit. The metadata functions will instead error out themselves if any of the documented pre-conditions they rely on aren't met. Any other crash is considered a bug and we would be very grateful if you could tell us about it.
Initialisation
A vollo context is created by calling vollo_rt_init.
Add an accelerator by using the vollo_rt_add_accelerator function.
/**
* Initialise the vollo-rt context. This must be called before any other vollo-rt functions.
*
* Logging level can be configured by setting the environment variable `VOLLO_RT_LOG` to one of:
* "error", "warn", "info", "debug", or "trace"
*/
vollo_rt_error_t vollo_rt_init(vollo_rt_context_t* context_ptr);
/**
* Destroy vollo-rt context, releasing its associated resources.
*/
void vollo_rt_destroy(vollo_rt_context_t vollo);
/**
* Add an accelerator.
* The accelerator is specified by its index. The index refers to an accelerator in the sorted list
* of PCI addresses. This should be called after `vollo_rt_init` but before `vollo_rt_load_program`
*/
vollo_rt_error_t vollo_rt_add_accelerator(vollo_rt_context_t vollo, size_t accelerator_index);
Loading a program
A program is loaded onto the Vollo accelerator using the vollo_rt_load_program function.
/**
* Load a program onto the Vollo accelerators.
* This should be called after `vollo_rt_add_accelerator`
*
* A Vollo program is generated by the Vollo compiler, it is typically named
* "<program_name>.vollo".
* The program is intended for a specific hardware config (number of accelerators,
* cores and other configuration options), this function will return an
* error if any accelerator configuration is incompatible with the program.
* Once loaded, the program provides inference for several models concurrently.
*
* Note: This should only be called once per `vollo_rt_context_t`, as such if
* a program needs to be changed or reset, first `vollo_rt_destroy` the current
* context, then start a new context with `vollo_rt_init`.
*/
vollo_rt_error_t vollo_rt_load_program(vollo_rt_context_t vollo, const char* program_path);
Model metadata
Once a program is loaded, it provides inference for one or more models. Metadata about a model is obtained with the vollo_rt_model_* functions.
Each model can have multiple distinct inputs and outputs. Each input and each output has a multi-dimensional shape associated with it. All of the metadata is defined by the program as supplied by the Vollo compiler. All the shapes are statically defined.
Some models can be compiled to stream statefully over a dimension; that dimension is then erased from the inference shape, but its position can be recovered from the model metadata.
/**
* Inspect the number of models in the program loaded onto the vollo.
*
* Programs can contain multiple models, a `model_index` is used to select a
* specific model
*/
size_t vollo_rt_num_models(vollo_rt_context_t vollo);
/**
* Get the number of inputs of a model
*
* Each input has its own distinct shape
*
* Requirements (panics otherwise):
* - a program was loaded with `vollo_rt_load_program`
* - `model_index < vollo_rt_num_models`
*/
size_t vollo_rt_model_num_inputs(vollo_rt_context_t vollo, size_t model_index);
/**
* Get the number of outputs of a model
*
* Each output has its own distinct shape
*
* Requirements (panics otherwise):
* - a program was loaded with `vollo_rt_load_program`
* - `model_index < vollo_rt_num_models`
*/
size_t vollo_rt_model_num_outputs(vollo_rt_context_t vollo, size_t model_index);
/**
* Get the shape for input at a given index
*
* The return value is a 0 terminated array of dims containing the input shape
* The value lives for as long as the model
*
* Requirements (panics otherwise):
* - a program was loaded with `vollo_rt_load_program`
* - `model_index < vollo_rt_num_models`
* - `input_index < vollo_rt_model_num_inputs`
*/
const size_t* vollo_rt_model_input_shape(
vollo_rt_context_t vollo, size_t model_index, size_t input_index);
/**
* Get the shape for output at a given index
*
* The return value is a 0 terminated array of dims containing the output shape
* The value lives for as long as the model
*
* Requirements (panics otherwise):
* - a program was loaded with `vollo_rt_load_program`
* - `model_index < vollo_rt_num_models`
* - `output_index < vollo_rt_model_num_outputs`
*/
const size_t* vollo_rt_model_output_shape(
vollo_rt_context_t vollo, size_t model_index, size_t output_index);
/**
* Get the number of elements for input at a given index
*
* This is simply the product of the dimensions returned by `vollo_rt_model_input_shape`,
* it is provided to make it easier to allocate the correct number of elements.
*
* Requirements (panics otherwise):
* - a program was loaded with `vollo_rt_load_program`
* - `model_index < vollo_rt_num_models`
* - `input_index < vollo_rt_model_num_inputs`
*/
size_t vollo_rt_model_input_num_elements(
vollo_rt_context_t vollo, size_t model_index, size_t input_index);
/**
* Get the number of elements for output at a given index
*
* This is simply the product of the dimensions returned by `vollo_rt_model_output_shape`,
* it is provided to make it easier to allocate the correct number of elements.
*
* Requirements (panics otherwise):
* - a program was loaded with `vollo_rt_load_program`
* - `model_index < vollo_rt_num_models`
* - `output_index < vollo_rt_model_num_outputs`
*/
size_t vollo_rt_model_output_num_elements(
vollo_rt_context_t vollo, size_t model_index, size_t output_index);
/**
* In a streaming model, the streaming dimension is not part of the shape.
*
* - It returns -1 when there is no streaming dimension
* - It otherwise returns the dim index
* For example, for a shape `(a, b, c)` and streaming dim index 1, the full shape is:
* `(a, streaming_dim, b, c)`
*
* Requirements (panics otherwise):
* - a program was loaded with `vollo_rt_load_program`
* - `model_index < vollo_rt_num_models`
* - `input_index < vollo_rt_model_num_inputs`
*/
int vollo_rt_model_input_streaming_dim(
vollo_rt_context_t vollo, size_t model_index, size_t input_index);
/**
* In a streaming model, the streaming dimension is not part of the shape.
*
* - It returns -1 when there is no streaming dimension
* - It otherwise returns the dim index
* For example, for a shape `(a, b, c)` and streaming dim index 1, the full shape is:
* `(a, streaming_dim, b, c)`
*
* Requirements (panics otherwise):
* - a program was loaded with `vollo_rt_load_program`
* - `model_index < vollo_rt_num_models`
* - `output_index < vollo_rt_model_num_outputs`
*/
int vollo_rt_model_output_streaming_dim(
vollo_rt_context_t vollo, size_t model_index, size_t output_index);
Running inference
The interface returns results asynchronously so that inference requests can be made as fast as the system can support, without blocking on output data being returned. This way, it also supports running multiple requests concurrently.
Before any compute is started, a job with associated input and output buffers needs to be registered with the runtime using one of vollo_rt_add_job_bf16 or vollo_rt_add_job_fp32.
The bf16 variant uses bfloat16, which is effectively a cropped version of the single precision floating point format fp32 (same exponent, smaller mantissa).
Note: do NOT use C floating point literals for bf16, as it is simply a uint16_t in the API.
An fp32 variant is also provided despite the Vollo accelerator expecting its inputs and outputs to be in bf16. If you are working with fp32, prefer this variant over the bf16 one, as it is able to make the conversion while copying to/from DMA buffers, avoiding an extra copy.
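To illustrate the format (a sketch, not part of the Vollo API): a truncating fp32-to-bf16 conversion simply keeps the upper 16 bits of the fp32 bit pattern:

import numpy as np

# bf16 keeps the sign bit, the full fp32 exponent and the top 7 mantissa bits
x = np.float32(1.5)  # fp32 bit pattern 0x3FC00000
bits = np.frombuffer(x.tobytes(), dtype=np.uint16)
bf16_bits = bits[1]  # high half-word on a little-endian machine
print(hex(bf16_bits))  # 0x3fc0, the bf16 encoding of 1.5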
/**
* Sets up a computation on the vollo accelerator where the inputs and outputs are in brain-float 16
* format.
*
* Note: The computation is only started on the next call to vollo_rt_poll. This way it is possible
* to set up several computations that are kicked off at the same time.
*
* - vollo:
* the context that the computation should be run on
* - model_index:
* the model to run
* - user_ctx:
* a user context that will be returned on completion. This can be used to disambiguate when
* multiple models are running concurrently.
* NOTE: the jobs for a single model are guaranteed to come back in order, but the jobs for
* different models are not.
* - input_data:
* a pointer to the start of an array with pointers to the start of the data to each input the
* number of inputs is given by `vollo_rt_model_num_inputs` each input length is the product of
* the shape given by `vollo_rt_model_input_shape`
* (or more convenient: `vollo_rt_model_input_num_elements`)
* lifetime:
* - The outer array only needs to live until `vollo_rt_add_job_bf16` returns
* - The input buffers need to live until `vollo_rt_poll` returns with the completion for
* this job
* - output_data:
* a pointer to the start of an array with pointers to the start of the data to each output
* buffer the number of outputs is given by `vollo_rt_model_num_outputs` each output length is
* the product of the shape given by `vollo_rt_model_output_shape`
* (or more convenient: `vollo_rt_model_output_num_elements`)
* lifetime:
* - The outer array only needs to live until `vollo_rt_add_job_bf16` returns
* - The output buffers need to live until `vollo_rt_poll` returns with the completion for
* this job
*/
vollo_rt_error_t vollo_rt_add_job_bf16(
vollo_rt_context_t vollo,
size_t model_index,
uint64_t user_ctx,
const bf16* const* input_data,
bf16* const* output_data);
vollo_rt_error_t vollo_rt_add_job_fp32(
vollo_rt_context_t vollo,
size_t model_index,
uint64_t user_ctx,
const float* const* input_data,
float* const* output_data);
To actually start and later complete an inference you must use the vollo_rt_poll function multiple times. It is typically called in a loop (with a timeout) until some or all of the jobs are completed.
/**
* Poll the vollo accelerator for completion.
*
* Note: Polling also initiates transfers for new jobs, so poll must be called
* before any progress on these new jobs can be made.
*
* num_completed: out: the number of completed user_ctx returned
* returned_user_ctx: buffer for the returned user_ctx of completed jobs, this will only be
* valid until the next call to vollo_rt_poll.
*/
vollo_rt_error_t vollo_rt_poll(
vollo_rt_context_t vollo, size_t* num_completed, const uint64_t** returned_user_ctx);
Vollo RT Example
The full code for this example can be found in example/single-decision.c. Here we will work through it step by step.
First we need to get hold of a Vollo RT context:
//////////////////////////////////////////////////
// Init
vollo_rt_context_t ctx;
EXIT_ON_ERROR(vollo_rt_init(&ctx));
Note: throughout this example we use EXIT_ON_ERROR; it is just a convenient way to handle errors.
Then we need to add accelerators. The accelerator_index refers to the index of the Vollo accelerator in the sorted list of PCI addresses; simply use 0 if you have a single accelerator, or just want to use the first one.
//////////////////////////////////////////////////
// Add accelerators
size_t accelerator_index = 0;
EXIT_ON_ERROR(vollo_rt_add_accelerator(ctx, accelerator_index));
This step will check the accelerator license and make sure the bitstream is the correct version and compatible with this version of the runtime.
Then we load a program:
//////////////////////////////////////////////////
// Load program
// Program for a block_size 64 accelerator
const char* vollo_program_path = "./single-decision-u256.vollo";
EXIT_ON_ERROR(vollo_rt_load_program(ctx, vollo_program_path));
Here we're using a relative path (in the example directory) to one of the example Vollo programs, a program that computes a simple single decision with an input of size 32.
The program is specifically for a 256-unit version of the accelerator, such as the default configuration for the IA-840F FPGA.
Then we setup some inputs and outputs for a single inference:
//////////////////////////////////////////////////
// Setup inputs and outputs
size_t model_index = 0;
// Assert model only has a single input and a single output tensor
assert(vollo_rt_model_num_inputs(ctx, model_index) == 1);
assert(vollo_rt_model_num_outputs(ctx, model_index) == 1);
assert(vollo_rt_model_input_num_elements(ctx, model_index, 0) == 32);
assert(vollo_rt_model_output_num_elements(ctx, model_index, 0) == 1);
float input_tensor[32];
float output_tensor[1];
for (size_t i = 0; i < 32; i++) {
input_tensor[i] = 3.0;
}
We check that the program metadata matches our expectations and we setup an input and output buffer.
Then we run a single inference:
//////////////////////////////////////////////////
// Run an inference
single_shot_inference(ctx, input_tensor, output_tensor);
Where we define a convenience function to run this type of simple synchronous inference on top of the asynchronous Vollo RT API:
// A small wrapper around the asynchronous Vollo RT API to block on a single inference
// This assumes a single model with a single input and output tensor
static void single_shot_inference(vollo_rt_context_t ctx, const float* input, float* output) {
size_t model_index = 0;
const float* inputs[1] = {input};
float* outputs[1] = {output};
// user_ctx is not needed when doing single shot inferences
// it can be used when doing multiple jobs concurrently to keep track of which jobs completed
uint64_t user_ctx = 0;
// Register a new job
EXIT_ON_ERROR(vollo_rt_add_job_fp32(ctx, model_index, user_ctx, inputs, outputs));
// Poll until completion
size_t num_completed = 0;
const uint64_t* completed_buffer = NULL;
size_t poll_count = 0;
while (num_completed == 0) {
EXIT_ON_ERROR(vollo_rt_poll(ctx, &num_completed, &completed_buffer));
poll_count++;
if (poll_count > 1000000) {
EXIT_ON_ERROR("Timed out while polling");
}
}
}
This function does two things: first it registers a new job with the Vollo RT context, then it polls in a loop until that job is complete.
For a more thorough overview of how to use this asynchronous API to run multiple jobs concurrently, take a look at example/example.c.
And finally we print out the newly obtained result and cleanup the Vollo RT context:
//////////////////////////////////////////////////
// Print outputs
printf("Output value: [%.1f]\n", output_tensor[0]);
//////////////////////////////////////////////////
// Release resources / Cleanup
vollo_rt_destroy(ctx);
Vollo RT Python Example
The Vollo RT Python bindings are provided for convenience; the runtime performance of this API is not a priority.
Here is a minimal way to use the Vollo RT Python bindings:
import vollo_rt
import torch
import os
with vollo_rt.VolloRTContext() as ctx:
    ctx.add_accelerator(0)

    if ctx.accelerator_num_cores(0) == 128:
        ctx.load_program(f"{os.environ['VOLLO_TREES_SDK']}/example/single-decision-u128.vollo")
    else:
        ctx.load_program(f"{os.environ['VOLLO_TREES_SDK']}/example/single-decision-u256.vollo")

    input = torch.rand(*ctx.model_input_shape()).bfloat16()
    output = ctx.run(input)

    torch.testing.assert_close(input, output)
    print("Success!")
Versions
Version Compatibility
The Vollo Trees SDK follows a semantic versioning scheme. If the left-most non-zero component (major/minor/patch) of the version number is unchanged between two releases of the Vollo Trees SDK, it should be possible to update to the newer version without modifying your existing code, e.g. updating from 0.1.0 to 0.1.1.
Additionally, the FPGA bitstream is stable between patch releases, so you do not have to reprogram the FPGA with an updated Vollo Trees bitstream unless the minor or major version numbers have changed.
Documentation for Previous Versions
Documentation for previous versions of the Vollo Trees SDK can be found in this listing:
Release Notes
0.1.0
- First release of the decision tree accelerator.