Tensors

augpy’s CudaTensor class is a backwards-compatible extension to the DLPack specification. This allows trivial conversion to and from DLPack tensors and thus the exchange of tensors between frameworks.

Currently, only GPU tensors are supported.

Data types

CudaTensors can have the following data types, defined as DLDataType.

Note

Only scalar data types are supported, so lanes is always 1.

const DLDataType augpy::dldtype_int8 = {kDLInt, 8, 1}

8 bit signed integer.

const DLDataType augpy::dldtype_uint8 = {kDLUInt, 8, 1}

8 bit unsigned integer.

const DLDataType augpy::dldtype_int16 = {kDLInt, 16, 1}

16 bit signed integer.

const DLDataType augpy::dldtype_uint16 = {kDLUInt, 16, 1}

16 bit unsigned integer.

const DLDataType augpy::dldtype_int32 = {kDLInt, 32, 1}

32 bit signed integer.

const DLDataType augpy::dldtype_uint32 = {kDLUInt, 32, 1}

32 bit unsigned integer.

const DLDataType augpy::dldtype_int64 = {kDLInt, 64, 1}

64 bit signed integer.

const DLDataType augpy::dldtype_uint64 = {kDLUInt, 64, 1}

64 bit unsigned integer.

const DLDataType augpy::dldtype_float16 = {kDLFloat, 16, 1}

16 bit (half precision) float.

Note

Not yet supported.

const DLDataType augpy::dldtype_float32 = {kDLFloat, 32, 1}

32 bit (single precision) float.

const DLDataType augpy::dldtype_float64 = {kDLFloat, 64, 1}

64 bit (double precision) float.

template<typename scalar_t>
DLDataType augpy::get_dldatatype()

Returns the corresponding DLDataType for type scalar_t.

Template Parameters
  • scalar_t: input type

bool augpy::dldatatype_equals(DLDataType t1, DLDataType t2)

Returns true if both data types are the same.
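
A minimal sketch of how these helpers might be combined; it assumes get_dldatatype is specialized for the standard fixed-width integer types and that augpy’s headers are on the include path:

#include <cstdint>

// Look up the DLDataType for a C++ type and compare it against the predefined constants.
void dtype_example() {
    DLDataType dt = augpy::get_dldatatype<std::uint8_t>();
    bool same = augpy::dldatatype_equals(dt, augpy::dldtype_uint8);    // expected: true
    bool differ = augpy::dldatatype_equals(augpy::dldtype_uint8,
                                           augpy::dldtype_int8);      // expected: false, type code differs
    (void)same; (void)differ;
}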

CudaTensor

DLTENSOR_MAX_NDIM

Maximum number of dimensions a CudaTensor can have. Currently 6.

struct augpy::CudaTensor : public DLManagedTensor

Augpy’s tensor class. It is a backwards-compatible extension to the DLPack specification.

See DLPack for the full documentation.

It supports all the usual operations you would expect from a full-featured tensor class, like complex indexing and slicing.

Copy, math, and comparison operations are provided as separate functions to call on tensors.

Public Functions

CudaTensor(int64_t *shape, int ndim, DLDataType dtype, int device_id)

Create a new tensor with the given shape, dtype, on a specific device.

Parameters
  • shape: Pointer to a shape array

  • ndim: number of dimensions, i.e., length of the shape array

  • dtype: data type of the new tensor

  • device_id: Cuda GPU device id where tensor memory is allocated

CudaTensor(std::vector<int64_t> shape, DLDataType dtype, int device_id)

Alias for CudaTensor(int64_t*, int, DLDataType, int) called with shape.data() and shape.size().
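
A minimal sketch of the vector overload; heap allocation with new and device 0 are assumptions for illustration:

#include <cstdint>
#include <vector>

// Allocate an uninitialized 3x224x224 uint8 tensor on GPU 0.
augpy::CudaTensor* make_image_tensor() {
    std::vector<int64_t> shape = {3, 224, 224};
    return new augpy::CudaTensor(shape, augpy::dldtype_uint8, /*device_id=*/0);
}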

CudaTensor(CudaTensor *parent, int ndim, int64_t *shape)

Create a new tensor that borrows memory from a parent tensor, but has a different shape.

Parameters
  • parent: Parent tensor to borrow memory from

  • ndim: number of dimensions of new tensor

  • shape: shape of new tensor, array of length ndim

CudaTensor(CudaTensor *parent, int ndim, int64_t *shape, int64_t *strides, int64_t byte_offset)

Create a new tensor that borrows memory from a parent tensor, but has a different shape, may be strided, and may start at a different offset.

Parameters
  • parent: Parent tensor to borrow memory from

  • ndim: number of dimensions of new tensor

  • shape: shape of new tensor, array of length ndim

  • strides: stride distances of the tensor, array of length ndim

  • byte_offset: start position in parent memory in bytes
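
A sketch of a strided, offset view: selecting the odd-numbered rows of a contiguous (8, 16) uint8 parent. Strides are given in elements (see DLTensor::strides below), the offset in bytes; it is assumed the constructor copies the shape and stride arrays:

#include <cstdint>

// View rows 1, 3, 5, 7 of the parent without copying any data.
augpy::CudaTensor* odd_rows(augpy::CudaTensor* parent) {
    int64_t shape[2]   = {4, 16};  // 4 rows of 16 columns
    int64_t strides[2] = {32, 1};  // skip one full row (16 elements) between output rows
    int64_t byte_offset = 16;      // start at row 1: 16 uint8 elements = 16 bytes
    return new augpy::CudaTensor(parent, 2, shape, strides, byte_offset);
}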

CudaTensor(CudaTensor *parent)

Create an exact copy of the parent tensor, borrowing its memory.

CudaTensor(DLManagedTensor *parent)

Wrap a DLManagedTensor inside a CudaTensor, borrowing its memory.

~CudaTensor () noexcept(false)

Delete this CudaTensor. Calls the DLManagedTensor::deleter function if DLManagedTensor::manager_ctx is also set.

The managed_allocation will be marked as orphaned/ready for reuse if this tensor is the last remaining tensor that references it.

void *ptr()

Return a pointer to the first element in this tensor. Resolves DLTensor::byte_offset.

void record()

Mark this tensor to be in use by calling CudaEvent::record on its event.

cudaEvent_t get_event()

Return the CUDA event used by record() to mark this tensor as in use.

bool is_contiguous()

Returns true if the tensor is contiguous, i.e., elements are located next to each other in memory and dimensions are not reversed.

CudaTensor *index(ssize_t i)

Index this tensor in the first dimension at index i. Behaves like numpy indexing, i.e., negative values index from the back, with -1 referring to the last element.

CudaTensor *slice_simple(py::slice slice)

Slice this tensor in the first dimension. Behaves like numpy slicing, i.e., start, stop, and step may be negative.

CudaTensor *slice_complex(py::tuple slices)

Slice this tensor in up to DLTensor::ndim dimensions. Behaves like numpy slicing, i.e., start, stop, and step may be negative.

void setitem_index(ssize_t index, CudaTensor *src)

Read items from src and write them into this tensor at positions referenced by an index.

void setitem_simple(py::slice slice, CudaTensor *src)

Read items from src and write them into this tensor at positions referenced by a slice.

void setitem_complex(py::tuple slices, CudaTensor *src)

Read items from src and write them into this tensor at positions referenced by a number of slices.

CudaTensor *fill_index(ssize_t index, double scalar)

Fill this tensor with the given scalar value at positions referenced by an index. Supports broadcasting.

CudaTensor *fill_simple(py::slice slice, double scalar)

Fill this tensor with the given scalar value at positions referenced by a slice. Supports broadcasting.

CudaTensor *fill_complex(py::tuple slices, double scalar)

Fill this tensor with the given scalar value at positions referenced by a number of slices. Supports broadcasting.

CudaTensor *reshape(std::vector<int64_t> shape)

Returns a new tensor with the given shape that borrows memory from this tensor. Number of elements cannot change and this tensor must be contiguous.
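
A short sketch combining index and reshape; the batch layout is assumed and the lifetime of the intermediate view is glossed over:

// Take the first image of a contiguous (N, 3, 224, 224) batch and flatten it.
// Both results borrow memory from the batch tensor.
augpy::CudaTensor* first_image_flat(augpy::CudaTensor* batch) {
    augpy::CudaTensor* first = batch->index(0);   // shape (3, 224, 224)
    return first->reshape({3 * 224 * 224});       // shape (150528,)
}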

std::string repr()

Returns a string representation of this tensor, e.g., <CudaTensor shape=(1, 2, 3), device=0, dtype=uint8>.

py::tuple pyshape()

Returns the shape of this tensor as a Python tuple.

py::tuple pystrides()

Returns the strides of this tensor as a Python tuple.

CudaTensor *augpy::copy(CudaTensor *src, CudaTensor *dst, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 0)

Copy src into dst. Supports broadcasting.

Return

dst

CudaTensor *augpy::fill(double scalar, CudaTensor *dst, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 0)

Fill dst with the given scalar value.

Return

dst

CudaTensor *augpy::cast_tensor(CudaTensor *tensor, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 0)

Read values from tensor, cast them to the data type of out and store them there. tensor and out must have the same shape.

CudaTensor *augpy::cast_type(CudaTensor *tensor, DLDataType dtype, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 0)

Create a new tensor with values from tensor cast to the given data type dtype.

CudaTensor *augpy::empty_like(CudaTensor *tensor)

Create a new empty tensor with the same shape and dtype on the same device.

typedef array<int64_t, DLTENSOR_MAX_NDIM> augpy::ndim_array

int64_t array of length DLTENSOR_MAX_NDIM. Used to store shapes or strides.

Tensor Math

For these functions, the output parameter out is optional. If out is NULL, a new tensor of the appropriate size is created and returned. If out is not NULL, the given tensor is used as the output and NULL is returned.

For basic math functions, all inputs and the output tensor must have the same data type.

For comparison functions, uint8 is used as the result type. A value of 1 means the condition is fulfilled, otherwise it is 0.

Unless otherwise stated, all functions support all data types and broadcasting, and work with strided tensors.

The blocks_per_sm and num_threads parameters control the kernel launch configuration. The defaults are probably fine, but they can be tuned for extra speed on specific hardware.
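
A minimal sketch of the two calling conventions for the out parameter, using add_tensor and empty_like:

// out == NULL: a new result tensor is allocated and returned.
// out != NULL: the result is written into out and NULL is returned.
void math_example(augpy::CudaTensor* a, augpy::CudaTensor* b) {
    augpy::CudaTensor* sum = augpy::add_tensor(a, b, NULL);

    augpy::CudaTensor* out = augpy::empty_like(sum);
    augpy::add_tensor(a, b, out);
}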

CudaTensor *augpy::add_scalar(CudaTensor *tensor, double scalar, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)

Add a scalar value to a tensor.

CudaTensor *augpy::add_tensor(CudaTensor *tensor1, CudaTensor *tensor2, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)

Add tensor2 to tensor1.

CudaTensor *augpy::sub_scalar(CudaTensor *tensor, double scalar, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)

Subtract a scalar value from a tensor.

CudaTensor *augpy::rsub_scalar(CudaTensor *tensor, double scalar, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)

Subtract a tensor from a scalar value.

CudaTensor *augpy::sub_tensor(CudaTensor *tensor1, CudaTensor *tensor2, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)

Subtract tensor2 from tensor1.

CudaTensor *augpy::mul_scalar(CudaTensor *tensor, double scalar, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)

Multiply a tensor by a scalar value.

CudaTensor *augpy::mul_tensor(CudaTensor *tensor1, CudaTensor *tensor2, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)

Multiply tensor1 by tensor2.

CudaTensor *augpy::div_scalar(CudaTensor *tensor, double scalar, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)

Divide a tensor by a scalar value.

CudaTensor *augpy::rdiv_scalar(CudaTensor *tensor, double scalar, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)

Divide a scalar value by a tensor.

CudaTensor *augpy::div_tensor(CudaTensor *tensor1, CudaTensor *tensor2, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)

Divide tensor1 by tensor2.

CudaTensor *augpy::fma(double scalar, CudaTensor *tensor1, CudaTensor *tensor2, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 0)

Compute a fused multiply-add on a scalar and two tensors, i.e., \(r = s \cdot t_1 + t_2\).

If tensor1 has an unsigned integer data type, then tensor2 must have the signed version of the same type, e.g., a uint8 tensor must be paired with an int8 tensor.

CudaTensor *augpy::gemm(CudaTensor *A, CudaTensor *B, CudaTensor *C, double alpha, double beta)

Uses cuBLAS to calculate the matrix product of two 2D tensors. More specifically, it calculates

\[ C = A \times (\alpha \cdot B) + \beta \cdot C \]

Only float and double data types are supported and all tensors must have the same data type. All tensors must be contiguous.

Returns a new tensor if C is NULL, otherwise C is returned.
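
A sketch of a plain matrix product via gemm; float32 inputs and the use of a NULL C together with beta = 0 are assumptions for illustration:

// A: (m, k), B: (k, n), both contiguous float32.
// With C == NULL and beta == 0 this returns a new (m, n) tensor holding A x B.
augpy::CudaTensor* matmul(augpy::CudaTensor* A, augpy::CudaTensor* B) {
    return augpy::gemm(A, B, NULL, /*alpha=*/1.0, /*beta=*/0.0);
}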

CudaTensor *augpy::lt_scalar(CudaTensor *tensor, double scalar, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)

tensor < scalar.

CudaTensor *augpy::lt_tensor(CudaTensor *tensor1, CudaTensor *tensor2, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)

tensor1 < tensor2.

CudaTensor *augpy::le_scalar(CudaTensor *tensor, double scalar, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)

tensor <= scalar.

CudaTensor *augpy::le_tensor(CudaTensor *tensor1, CudaTensor *tensor2, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)

tensor1 <= tensor2.

CudaTensor *augpy::gt_scalar(CudaTensor *tensor, double scalar, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)

tensor > scalar.

CudaTensor *augpy::ge_scalar(CudaTensor *tensor, double scalar, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)

tensor >= scalar.

CudaTensor *augpy::eq_scalar(CudaTensor *tensor, double scalar, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)

tensor == scalar.

CudaTensor *augpy::eq_tensor(CudaTensor *tensor1, CudaTensor *tensor2, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)

tensor1 == tensor2.
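
A small sketch of a comparison producing a uint8 mask:

// mask is a new uint8 tensor: 1 where image < 128, 0 elsewhere.
augpy::CudaTensor* dark_pixels(augpy::CudaTensor* image) {
    return augpy::lt_scalar(image, 128.0, NULL);
}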

Tensor Management

Converting from and to arrays or tensors, exporting augpy’s CudaTensors to other frameworks, and importing existing tensors from other frameworks without copying.

py::array *augpy::tensor_to_array1(CudaTensor *tensor)

Copy a given tensor to a new numpy array. This initiates an asynchronous copy from device to host memory.

py::array *augpy::tensor_to_array2(CudaTensor *tensor, py::buffer *array)

Copy a given tensor to a numpy array created from the given buffer array. This initiates an asynchronous copy from device to host memory.

CudaTensor *augpy::array_to_tensor1(py::buffer *array, int device_id)

Copy a Python buffer into a new tensor on the specified GPU device. This initiates an asynchronous copy from host to device memory.

CudaTensor *augpy::array_to_tensor2(py::buffer *array, CudaTensor *tensor)

Copy a Python buffer into the given tensor. This initiates an asynchronous copy from host to device memory.
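
A sketch of a host-device round trip; it assumes this runs inside an augpy extension call with pybind11 available, and it glosses over ownership of the intermediate tensor:

#include <pybind11/numpy.h>
namespace py = pybind11;

// Copy a host buffer (e.g. a NumPy array) to GPU 0 and back into a new NumPy array.
// Both copies are asynchronous; depending on augpy's stream handling an explicit
// synchronization may be required before the returned array is safe to read.
py::array* roundtrip(py::buffer* array) {
    augpy::CudaTensor* tensor = augpy::array_to_tensor1(array, /*device_id=*/0);
    return augpy::tensor_to_array1(tensor);
}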

CudaTensor *augpy::import_dltensor(py::capsule *tensor_capsule, const char *name)

Import a GPU tensor from another library into augpy.

Note

This requires explicit synchronization if augpy or the interfacing library is running operations on streams other than the default_stream.

Parameters
  • tensor_capsule: a Python capsule object that contains a DLManagedTensor

  • name: name under which the tensor is stored in the capsule, e.g., "dltensor" for PyTorch

py::capsule *augpy::export_dltensor(py::object *pytensor, std::string *name, bool destruct)

Export a GPU tensor to be used by another library.

Note

This requires explicit synchronization if augpy or the interfacing library is running operations on streams other than the default_stream.

Parameters
  • pytensor: Python-wrapped CudaTensor

  • name: name under which the tensor is stored in the returned capsule, e.g., "dltensor" for PyTorch

  • destruct: if true, add a destructor to the capsule which will delete the tensor when the capsule is deleted; only set to false if you know what you’re doing
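
A sketch of the import/export pair; the capsule name "dltensor" for PyTorch is taken from the parameter descriptions above, everything else is illustrative:

#include <pybind11/pybind11.h>
#include <string>
namespace py = pybind11;

// Import a tensor exported by another framework, e.g. via torch.utils.dlpack.to_dlpack(t).
// Per the notes above, synchronize explicitly if either side uses a non-default stream.
augpy::CudaTensor* from_capsule(py::capsule* cap) {
    return augpy::import_dltensor(cap, "dltensor");
}

// Export an augpy tensor for consumption by another framework.
py::capsule* to_capsule(py::object* pytensor) {
    std::string name = "dltensor";
    return augpy::export_dltensor(pytensor, &name, /*destruct=*/true);
}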

Utility Functions

Helper functions for writing functions that operate on tensors.

bool augpy::array_equals(int dim0, int ndim, int64_t *array1, int64_t *array2)

Returns true if array1[dim] == array2[dim] for all dimensions from dim0 to ndim-1.

void augpy::assert_contiguous(CudaTensor *t)

Throws std::invalid_argument if t is NULL or not contiguous.

size_t augpy::numel(CudaTensor *tensor)

Returns the number of elements in the tensor.

size_t augpy::numel(DLTensor *tensor)

Returns the number of elements in the tensor.

size_t augpy::numel(DLTensor &tensor)

Returns the number of elements in the tensor.

size_t augpy::numel(py::buffer_info &array)

Returns the number of elements in the array.

template<typename scalar_t>
size_t augpy::numel(scalar_t *shape, size_t ndim)

Returns the number of elements in the tensor with the given shape.

template<typename scalar_t>
size_t augpy::numel(std::vector<scalar_t> &shape)

Returns the number of elements in the tensor with the given shape.

size_t augpy::numbytes(CudaTensor *tensor)

Returns the number of bytes occupied by this tensor.

size_t augpy::numbytes(DLTensor *tensor)

Returns the number of bytes occupied by this tensor.

size_t augpy::numbytes(py::buffer_info &array)

Returns the number of bytes occupied by this tensor.

bool augpy::check_contiguous(CudaTensor *tensor)

Returns true if the tensor is contiguous.

bool augpy::check_contiguous(DLTensor *tensor)

Returns true if the tensor is contiguous.

bool augpy::check_contiguous(py::buffer_info &array)

Returns true if the array is contiguous.

void augpy::check_tensor(CudaTensor *tensor, size_t min_size, bool contiguous)

Check whether tensor is not NULL, has at least a minimum size in bytes, and is contiguous.

Parameters
  • tensor: tensor to check

  • min_size: check whether numbytes(tensor) >= min_size

  • contiguous: if true, additionally check whether tensor is contiguous

void augpy::check_same_device(DLTensor t1, DLTensor t2)

Check whether t1 and t2 are located on the same GPU device. If not, raise std::invalid_argument.

void augpy::check_same_dtype_device(DLTensor t1, DLTensor t2)

Check whether t1 and t2 have the same dtype and are located on the same GPU device. If not, raise std::invalid_argument.

void augpy::check_same_dtype_device_shape(DLTensor t1, DLTensor t2)

Check whether t1 and t2 have the same dtype, are located on the same GPU device, and have the same shape. If not, raise std::invalid_argument.

void augpy::calc_threads(unsigned int &threads, int device_id)

If threads == 0, set threads to cores_per_sm(device_id).

void augpy::calc_blocks_values_1d(DLTensor t, unsigned int &num_blocks, size_t &num, unsigned int &values_per_thread, unsigned int threads, unsigned int blocks_per_sm)

Use heuristics to calculate how many blocks and values per thread to use for the given 1D tensor.

Values per thread \(v\) is calculated based on the number of elements in the tensor t, the number of SMs on the device \(N_{sm}\), the number of blocks per sm \(B_{sm}\), and the number of threads per block \(N_{t}\):

\[ v = \left\lceil \frac{numel(t)}{N_{sm} \cdot B_{sm} \cdot N_t} \right\rceil \]

The number of blocks \(B\) is then calculated like this:

\[ B = \left\lceil \frac{\lceil numel(t) / v \rceil}{N_t}\right\rceil \]

Parameters
  • t: input tensor to operate on

  • num_blocks: output value, number of blocks in the grid

  • num: output value, number of elements in t, i.e., numel(t)

  • values_per_thread: input/output value, if >0 specifies the values per thread to use, otherwise will hold the calculated value

  • threads: number of threads in each block

  • blocks_per_sm: how many blocks to generate per SM on the device; defaults to BLOCKS_PER_SM
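
As a purely illustrative example, assume a hypothetical device with 80 SMs, blocks_per_sm = 8, and 512 threads per block, and a 1D tensor with \(10^6\) elements. The heuristic then gives

\[ v = \left\lceil \frac{10^6}{80 \cdot 8 \cdot 512} \right\rceil = 4, \qquad B = \left\lceil \frac{\left\lceil 10^6 / 4 \right\rceil}{512} \right\rceil = 489 \]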

void augpy::calc_blocks_values_nd(DLTensor t, dim3 &grid, size_t &count, unsigned int &values_per_thread, unsigned int threads, unsigned int blocks_per_sm)

Similar to calc_blocks_values_1d, but for ND tensors. Use heuristics to calculate the size of the block grid and values per thread to use for the given ND tensor.

For ND tensors, values per thread only applies to the first dimension. The same heuristics are used, but \(v\) cannot exceed the size of the first dimension \(s_0\), so \(v' = min(v, s_0)\). grid.x is therefore \(\left\lceil \frac{s_0}{v'} \right\rceil\) and grid.y is \(\left\lceil \frac{numel(t)}{s_0\cdot N_t} \right\rceil\).

Parameters
  • t: input tensor to operate on

  • grid: output value, the block grid used for the kernel launch; grid.x will hold the number of iterations in the first dimension, grid.y the number of blocks required for the remaining dimensions

  • count: output value, number of elements in t starting with the second dimension, i.e., numel(t) / t.shape[0]

  • values_per_thread: input/output value, if >0 specifies the values per thread to use, otherwise will hold the calculated value

  • threads: number of threads in each block

  • blocks_per_sm: how many blocks to generate per SM on the device; defaults to BLOCKS_PER_SM

void augpy::calculate_contiguous_strides(DLTensor t, ndim_array &contiguous_strides)

Calculate the strides in number of elements the given tensor t would have if it was contiguous.

bool augpy::calculate_broadcast_strides(DLTensor t_src, DLTensor t_dst, ndim_array &src_strides, const int t_src_index)

If possible, calculate the strides that are needed to broadcast t_src to t_dst.

Broadcasting is possible if t_src.ndim <= t_dst.ndim and every dimension up to t_src.ndim either has the same size in both tensors or has size 1 in t_src. The stride in a broadcast dimension is zero.

Return

true if broadcasting was used

Parameters
  • t_src: source tensor to broadcast

  • t_dst: target tensor to broadcast to

  • src_strides: output value, strides required to broadcast

  • t_src_index: only used for error message formatting, index of t_src in the function call

Exceptions
  • std::invalid_argument: if broadcasting not possible
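
As an illustration of the rule above (assuming the non-broadcast strides are taken from t_src unchanged): broadcasting a contiguous source of shape \((3, 1, 5)\), with element strides \((5, 5, 1)\), to a destination of shape \((3, 4, 5)\) would yield

\[ \text{src\_strides} = (5, 0, 1) \]

so every access along the broadcast dimension reads the same source element.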

bool augpy::calculate_broadcast_output_shape(DLTensor t1, DLTensor t2, int &ndim, int64_t *shape)

If possible, calculate the output shape when broadcasting tensors t1 and t2 together. The output shape has max(t1.ndim, t2.ndim) dimensions. To broadcast, both tensors must either have the same size in a dimension, or one of them must have size 1 in that dimension; the other tensor then determines the size of that dimension in the output.

Parameters
  • t1: first tensor

  • t2: second tensor

  • ndim: output value, number of dimensions in output

  • shape: output value, shape array, must have at least length max(t1.ndim, t2.ndim)

bool augpy::calculate_broadcast_output_shape(DLTensor t1, DLTensor t2, int &ndim, ndim_array &shape)

Alias for calculate_broadcast_output_shape(DLTensor, DLTensor, int&, int64_t*).
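
For example, two tensors that match in every dimension except where one of them has size 1 broadcast as follows:

\[ (4, 1, 3) \text{ and } (1, 5, 3) \;\rightarrow\; (4, 5, 3) \]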

CudaTensor *augpy::create_output_tensor(CudaTensor **tensors, int n_tensors, bool allow_null)

For all tensors in the given array, get the maximum size in each dimension and create a new tensor of that shape.

Does not check whether tensors are broadcastable.

Parameters
  • tensors: tensors that will be broadcast to output

  • n_tensors: number of tensors in array

  • allow_null: if true, allow tensors to be NULL, otherwise throw std::invalid_argument

void augpy::coalesce_dimensions(std::vector<DLTensor> &tensors)

Manipulate the shapes and strides of the given tensors to coalesce (remove) unnecessary dimensions, thus simplifying the tensors.

A dimension can be coalesced if all tensors are either contiguous in that dimension or have fewer dimensions.

Warning

Output is only valid if tensors have strides produced by calculate_broadcast_strides. Tensors must either have the same shape, or be broadcast and thus appear non-contiguous.
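
As an illustration (assuming the usual dimension-merging behaviour): two fully contiguous tensors of the same shape can be collapsed into a single dimension, whereas a broadcast (zero-stride) dimension cannot be merged with its neighbours:

\[ (2, 3, 4),\ (2, 3, 4) \;\rightarrow\; (24),\ (24) \]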

DLPack

This is the documentation of the common DLPack header. To quote the readme:

DLPack is an open in-memory tensor structure for sharing tensors among frameworks. DLPack enables

  • Easier sharing of operators between deep learning frameworks.

  • Easier wrapping of vendor level operator implementations, allowing collaboration when introducing new devices/ops.

  • Quick swapping of backend implementations, like different versions of BLAS.

  • For final users, this could bring more operators, and the possibility of mixing usage between frameworks.

Please refer to their GitHub repository for more details.

The common header of DLPack.

Copyright (c) 2017 by Contributors

Defines

DLPACK_EXTERN_C
DLPACK_VERSION

The current version of dlpack.

DLPACK_DLL

DLPACK_DLL prefix for Windows.

Typedefs

typedef struct DLManagedTensor DLManagedTensor

C Tensor object that manages the memory of a DLTensor. This data structure is intended to facilitate the borrowing of a DLTensor by another framework. It is not meant to transfer the tensor. When the borrowing framework no longer needs the tensor, it should call the deleter to notify the host that the resource is no longer needed.

Enums

enum DLDeviceType

The device type in DLContext.

Values:

enumerator kDLCPU = 1

CPU device.

enumerator kDLGPU = 2

CUDA GPU device.

enumerator kDLCPUPinned = 3

Pinned CUDA GPU device by cudaMallocHost.

Note

kDLCPUPinned = kDLCPU | kDLGPU

enumerator kDLOpenCL = 4

OpenCL devices.

enumerator kDLVulkan = 7

Vulkan buffer for next generation graphics.

enumerator kDLMetal = 8

Metal for Apple GPU.

enumerator kDLVPI = 9

Verilog simulator buffer.

enumerator kDLROCM = 10

ROCm GPUs for AMD GPUs.

enumerator kDLExtDev = 12

Reserved extension device type, used to quickly test extension devices. The semantics can differ depending on the implementation.

enum DLDataTypeCode

The type code options of DLDataType.

Values:

enumerator kDLInt = 0U
enumerator kDLUInt = 1U
enumerator kDLFloat = 2U
struct DLContext
#include <dlpack.h>

A Device context for Tensor and operator.

Public Members

DLDeviceType device_type

The device type used in the device.

int device_id

The device index.

struct DLDataType
#include <dlpack.h>

The data type the tensor can hold.

Examples

  • float: type_code = 2, bits = 32, lanes=1

  • float4 (vectorized 4x float): type_code = 2, bits = 32, lanes=4

  • int8: type_code = 0, bits = 8, lanes=1

Public Members

uint8_t code

Type code of base types. We keep it uint8_t instead of DLDataTypeCode for minimal memory footprint, but the value should be one of DLDataTypeCode enum values.

uint8_t bits

Number of bits, common choices are 8, 16, 32.

uint16_t lanes

Number of lanes in the type, used for vector types.

struct DLTensor
#include <dlpack.h>

Plain C Tensor object, does not manage memory.

Public Members

void *data

The opaque data pointer points to the allocated data. This will be a CUDA device pointer or a cl_mem handle in OpenCL. This pointer is always aligned to 256 bytes, as in CUDA.

For given DLTensor, the size of memory required to store the contents of data is calculated as follows:

static inline size_t GetDataSize(const DLTensor* t) {
  size_t size = 1;
  // multiply the sizes of all dimensions
  for (int i = 0; i < t->ndim; ++i) {
    size *= t->shape[i];
  }
  // round the element size (in bits) up to whole bytes
  size *= (t->dtype.bits * t->dtype.lanes + 7) / 8;
  return size;
}

DLContext ctx

The device context of the tensor.

int ndim

Number of dimensions.

DLDataType dtype

The data type of the pointer.

int64_t *shape

The shape of the tensor.

int64_t *strides

Strides of the tensor (in number of elements, not bytes). Can be NULL, indicating the tensor is compact and row-major.

uint64_t byte_offset

The offset in bytes from the data pointer to the first element of the tensor.

struct DLManagedTensor
#include <dlpack.h>

C Tensor object that manages the memory of a DLTensor. This data structure is intended to facilitate the borrowing of a DLTensor by another framework. It is not meant to transfer the tensor. When the borrowing framework no longer needs the tensor, it should call the deleter to notify the host that the resource is no longer needed.

Subclassed by augpy::CudaTensor

Public Members

DLTensor dl_tensor

DLTensor which is being memory managed.

void *manager_ctx

The context of the original host framework in which this DLManagedTensor is used. It can also be NULL.

void (*deleter)(struct DLManagedTensor *self)

Destructor signature void (*)(void*) - this should be called to destruct the manager_ctx which holds the DLManagedTensor. It can be NULL if there is no way for the caller to provide a reasonable destructor. The destructor deletes the argument self as well.