Tensors
augpy’s CudaTensor class is a backwards-compatible extension of the DLPack specification.
This allows trivial conversion to and from DLPack tensors, and thus the exchange of tensors between frameworks.
Currently, only GPU tensors are supported.
Data types
CudaTensors can have the following data types, defined as DLDataType.
Note
Only scalar data types are supported, so lanes is always 1.
- const DLDataType augpy::dldtype_int8 = {kDLInt, 8, 1}
  8 bit signed integer.
- const DLDataType augpy::dldtype_uint8 = {kDLUInt, 8, 1}
  8 bit unsigned integer.
- const DLDataType augpy::dldtype_int16 = {kDLInt, 16, 1}
  16 bit signed integer.
- const DLDataType augpy::dldtype_uint16 = {kDLUInt, 16, 1}
  16 bit unsigned integer.
- const DLDataType augpy::dldtype_int32 = {kDLInt, 32, 1}
  32 bit signed integer.
- const DLDataType augpy::dldtype_uint32 = {kDLUInt, 32, 1}
  32 bit unsigned integer.
- const DLDataType augpy::dldtype_int64 = {kDLInt, 64, 1}
  64 bit signed integer.
- const DLDataType augpy::dldtype_uint64 = {kDLUInt, 64, 1}
  64 bit unsigned integer.
- const DLDataType augpy::dldtype_float16 = {kDLFloat, 16, 1}
  16 bit (half precision) float.
  Note: not yet supported.
- const DLDataType augpy::dldtype_float32 = {kDLFloat, 32, 1}
  32 bit (single precision) float.
- const DLDataType augpy::dldtype_float64 = {kDLFloat, 64, 1}
  64 bit (double precision) float.
- template<typename scalar_t> DLDataType augpy::get_dldatatype()
  Returns the corresponding DLDataType for type scalar_t.
  Template Parameters:
  scalar_t: input type
- bool augpy::dldatatype_equals(DLDataType t1, DLDataType t2)
  Returns true if both data types are the same.
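For illustration, a minimal sketch of how these helpers fit together, assuming the declarations above are available; the header name "augpy/tensor.h" is a placeholder for whatever header your build exposes:

    // Hypothetical include path; use whichever header declares the symbols above.
    #include "augpy/tensor.h"
    #include <cassert>

    int main() {
        // Look up the DLDataType that corresponds to a C++ scalar type.
        DLDataType dt = augpy::get_dldatatype<float>();

        // It should match the predefined 32 bit float constant.
        assert(augpy::dldatatype_equals(dt, augpy::dldtype_float32));

        // DLDataType is a plain struct: code, bits, lanes.
        assert(dt.code == kDLFloat && dt.bits == 32 && dt.lanes == 1);
        return 0;
    }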
CudaTensor
- DLTENSOR_MAX_NDIM
  Maximum number of dimensions a CudaTensor can have. Currently 6.
- struct augpy::CudaTensor : public DLManagedTensor
  augpy’s tensor class. It is a backwards-compatible extension of the DLPack specification. See DLPack for the full documentation.
  It supports all the usual operations you would expect from a full-featured tensor class, like complex indexing and slicing. Copy, math, and comparison operations are provided as separate functions to call on tensors.
  Public Functions
- CudaTensor(int64_t *shape, int ndim, DLDataType dtype, int device_id)
  Create a new tensor with the given shape and dtype on a specific device.
  Parameters:
  shape: pointer to a shape array
  ndim: number of dimensions, i.e., length of the shape array
  dtype: data type of the new tensor
  device_id: CUDA GPU device id where tensor memory is allocated
- CudaTensor(std::vector<int64_t> shape, DLDataType dtype, int device_id)
  Alias for CudaTensor(int64_t*, int, DLDataType, int) called with shape.data() and shape.size().
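As a quick illustration, a sketch using the two constructors above (the header name is hypothetical):

    #include "augpy/tensor.h"  // hypothetical header; adjust to your build

    int main() {
        // A 480x640 3-channel uint8 image tensor on GPU 0,
        // using the std::vector overload.
        augpy::CudaTensor image({480, 640, 3}, augpy::dldtype_uint8, 0);

        // The equivalent call using the raw-pointer constructor.
        int64_t shape[3] = {480, 640, 3};
        augpy::CudaTensor image2(shape, 3, augpy::dldtype_uint8, 0);
        return 0;
    }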
- CudaTensor(CudaTensor *parent, int ndim, int64_t *shape)
  Create a new tensor that borrows memory from a parent tensor, but has a different shape.
  Parameters:
  parent: parent tensor to borrow memory from
  ndim: number of dimensions of the new tensor
  shape: shape of the new tensor, array of length ndim
- CudaTensor(CudaTensor *parent, int ndim, int64_t *shape, int64_t *strides, int64_t byte_offset)
  Create a new tensor that borrows memory from a parent tensor, but has a different shape, may stride, and may start at a different offset.
  Parameters:
  parent: parent tensor to borrow memory from
  ndim: number of dimensions of the new tensor
  shape: shape of the new tensor, array of length ndim
  strides: stride distances of the tensor, array of length ndim
  byte_offset: start position in parent memory in bytes
- CudaTensor(CudaTensor *parent)
  Create an exact copy of the parent tensor, borrowing its memory.
- CudaTensor(DLManagedTensor *parent)
  Wrap a DLManagedTensor inside a CudaTensor, borrowing its memory.
- ~CudaTensor() noexcept(false)
  Delete this CudaTensor. Calls the DLManagedTensor::deleter function if DLManagedTensor::manager_ctx is also set.
  The managed_allocation will be marked as orphaned/ready for reuse if this tensor is the last remaining tensor that references it.
- void *ptr()
  Return a pointer to the first element in this tensor. Resolves DLTensor::byte_offset.
- void record()
  Mark this tensor as in use by calling CudaEvent::record on its event.
- cudaEvent_t get_event()
  Return the CUDA event used to record.
- bool is_contiguous()
  Returns true if the tensor is contiguous, i.e., elements are located next to each other in memory and no dimensions are reversed.
- CudaTensor *index(ssize_t i)
  Index this tensor in the first dimension at index i. Behaves like numpy indexing, i.e., indexes from the back if i is negative, where -1 refers to the last element.
- CudaTensor *slice_simple(py::slice slice)
  Slice this tensor in the first dimension. Behaves like numpy slicing, i.e., start, stop, and step may be negative.
- CudaTensor *slice_complex(py::tuple slices)
  Slice this tensor in up to DLTensor::ndim dimensions. Behaves like numpy slicing, i.e., start, stop, and step may be negative.
- void setitem_index(ssize_t index, CudaTensor *src)
  Read items from src and write them into this tensor at positions referenced by an index.
- void setitem_simple(py::slice slice, CudaTensor *src)
  Read items from src and write them into this tensor at positions referenced by a slice.
- void setitem_complex(py::tuple slices, CudaTensor *src)
  Read items from src and write them into this tensor at positions referenced by a number of slices.
- CudaTensor *fill_index(ssize_t index, double scalar)
  Fill this tensor with the given scalar value at positions referenced by an index. Supports broadcasting.
- CudaTensor *fill_simple(py::slice slice, double scalar)
  Fill this tensor with the given scalar value at positions referenced by a slice. Supports broadcasting.
- CudaTensor *fill_complex(py::tuple slices, double scalar)
  Fill this tensor with the given scalar value at positions referenced by a number of slices. Supports broadcasting.
- CudaTensor *reshape(std::vector<int64_t> shape)
  Returns a new tensor with the given shape that borrows memory from this tensor. The number of elements cannot change and this tensor must be contiguous.
- std::string repr()
  Returns a string representation of this tensor, e.g., <CudaTensor shape=(1, 2, 3), device=0, dtype=uint8>.
- py::tuple pyshape()
  Returns the shape of this tensor as a Python tuple.
- py::tuple pystrides()
  Returns the strides of this tensor as a Python tuple.
- CudaTensor *augpy::copy(CudaTensor *src, CudaTensor *dst, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 0)
  Copy src into dst. Supports broadcasting.
  Returns: dst
- CudaTensor *augpy::fill(double scalar, CudaTensor *dst, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 0)
  Fill dst with the given scalar value.
  Returns: dst
- CudaTensor *augpy::cast_tensor(CudaTensor *tensor, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 0)
  Read values from tensor, cast them to the data type of out, and store them there. tensor and out must have the same shape.
- CudaTensor *augpy::cast_type(CudaTensor *tensor, DLDataType dtype, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 0)
  Create a new tensor with values from tensor cast to the given data type dtype.
- CudaTensor *augpy::empty_like(CudaTensor *tensor)
  Create a new empty tensor with the same shape and dtype on the same device.
- typedef array<int64_t, DLTENSOR_MAX_NDIM> augpy::ndim_array
  int64 array of length DLTENSOR_MAX_NDIM. Used to store shape or strides.
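A short sketch of how cast_type, empty_like, and copy compose, with the signatures documented above (header name hypothetical; ownership of the returned tensors is left to the caller and omitted here):

    #include "augpy/tensor.h"  // hypothetical header; adjust to your build

    void convert_image(augpy::CudaTensor *image_u8) {
        // Convert a uint8 image to float32; cast_type allocates the output.
        augpy::CudaTensor *image_f32 =
            augpy::cast_type(image_u8, augpy::dldtype_float32);

        // Allocate an uninitialized tensor with the same shape, dtype, and
        // device, then copy the float data into it.
        augpy::CudaTensor *buffer = augpy::empty_like(image_f32);
        augpy::copy(image_f32, buffer);
    }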
Tensor Math
For these functions, the result parameter is optional. If result is NULL, a new tensor of appropriate size is created and returned. If result is not NULL, the given tensor is used as output and NULL is returned.
For basic math functions, all inputs and the result tensor must have the same data type. For comparison functions, uint8 is used as the result type. A value of 1 means the condition is fulfilled, otherwise it is 0.
Unless otherwise stated, all functions support all data types and broadcasting, and work with strided tensors.
The blocks_per_sm and num_threads parameters control the kernel launch parameters. The defaults are probably fine, but they can be tuned for some extra speed on specific hardware.
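The optional-result convention looks like this in practice (a sketch; header name hypothetical, kernel launch parameters left at their defaults):

    #include "augpy/tensor.h"  // hypothetical header; adjust to your build

    void scale_and_accumulate(augpy::CudaTensor *x, augpy::CudaTensor *y) {
        // result == NULL: a new tensor is allocated and returned.
        augpy::CudaTensor *scaled = augpy::mul_scalar(x, 2.0, NULL);

        // result != NULL: output is written into y and NULL is returned.
        augpy::add_tensor(scaled, y, y);
    }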
- CudaTensor *augpy::add_scalar(CudaTensor *tensor, double scalar, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)
  Add a scalar value to a tensor.
- CudaTensor *augpy::add_tensor(CudaTensor *tensor1, CudaTensor *tensor2, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)
  Add tensor2 to tensor1.
- CudaTensor *augpy::sub_scalar(CudaTensor *tensor, double scalar, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)
  Subtract a scalar value from a tensor.
- CudaTensor *augpy::rsub_scalar(CudaTensor *tensor, double scalar, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)
  Subtract a tensor from a scalar value.
- CudaTensor *augpy::sub_tensor(CudaTensor *tensor1, CudaTensor *tensor2, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)
  Subtract tensor2 from tensor1.
- CudaTensor *augpy::mul_scalar(CudaTensor *tensor, double scalar, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)
  Multiply a tensor by a scalar value.
- CudaTensor *augpy::mul_tensor(CudaTensor *tensor1, CudaTensor *tensor2, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)
  Multiply tensor1 by tensor2.
- CudaTensor *augpy::div_scalar(CudaTensor *tensor, double scalar, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)
  Divide a tensor by a scalar value.
- CudaTensor *augpy::rdiv_scalar(CudaTensor *tensor, double scalar, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)
  Divide a scalar value by a tensor.
- CudaTensor *augpy::div_tensor(CudaTensor *tensor1, CudaTensor *tensor2, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)
  Divide tensor1 by tensor2.
- CudaTensor *augpy::fma(double scalar, CudaTensor *tensor1, CudaTensor *tensor2, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 0)
  Compute a fused multiply-add on a scalar and two tensors, i.e., \(r = s \cdot t_1 \cdot t_2\).
  If tensor1 has an unsigned integer data type, then tensor2 must have the signed version of the same type, e.g., a uint8 tensor must be paired with an int8 tensor.
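For instance, the unsigned/signed pairing rule means a uint8 image can be combined with an int8 delta tensor (a sketch under the signature above; header name and output dtype behavior are assumptions):

    #include "augpy/tensor.h"  // hypothetical header; adjust to your build

    void blend(augpy::CudaTensor *img_u8, augpy::CudaTensor *delta_i8,
               augpy::CudaTensor *out_u8) {
        // tensor1 is uint8, so tensor2 must be int8 (the signed version of
        // the same width). The result is written into out_u8; the output
        // dtype is assumed to match tensor1 here.
        augpy::fma(0.5, img_u8, delta_i8, out_u8);
    }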
- CudaTensor *augpy::gemm(CudaTensor *A, CudaTensor *B, CudaTensor *C, double alpha, double beta)
  Uses cuBLAS to calculate the matrix multiplication of two 2D tensors. More specifically, calculates
  \[ C = A \times (\alpha \cdot B) + \beta \cdot C \]
  Only float and double data types are supported, and all tensors must have the same data type. All tensors must be contiguous.
  Returns a new tensor if C is NULL, otherwise C is returned.
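A sketch of a plain matrix product built on gemm (header name hypothetical; the treatment of beta with C == NULL is an assumption):

    #include "augpy/tensor.h"  // hypothetical header; adjust to your build

    augpy::CudaTensor *matmul(augpy::CudaTensor *X, augpy::CudaTensor *W) {
        // X: (n, k), W: (k, m), both contiguous float32 on the same device.
        // With C == NULL, alpha = 1, beta = 0 this reduces to X * W and a
        // new (n, m) tensor is returned.
        return augpy::gemm(X, W, NULL, 1.0, 0.0);
    }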
- CudaTensor *augpy::lt_scalar(CudaTensor *tensor, double scalar, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)
  tensor < scalar.
- CudaTensor *augpy::lt_tensor(CudaTensor *tensor1, CudaTensor *tensor2, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)
  tensor1 < tensor2.
- CudaTensor *augpy::le_scalar(CudaTensor *tensor, double scalar, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)
  tensor <= scalar.
- CudaTensor *augpy::le_tensor(CudaTensor *tensor1, CudaTensor *tensor2, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)
  tensor1 <= tensor2.
- CudaTensor *augpy::gt_scalar(CudaTensor *tensor, double scalar, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)
  tensor > scalar.
- CudaTensor *augpy::ge_scalar(CudaTensor *tensor, double scalar, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)
  tensor >= scalar.
- CudaTensor *augpy::eq_scalar(CudaTensor *tensor, double scalar, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)
  tensor == scalar.
- CudaTensor *augpy::eq_tensor(CudaTensor *tensor1, CudaTensor *tensor2, CudaTensor *out, unsigned int blocks_per_sm = BLOCKS_PER_SM, unsigned int num_threads = 512)
  tensor1 == tensor2.
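Since comparison results use uint8 (1 where the condition holds, 0 elsewhere, as described in the introduction above), a thresholding mask is a one-liner (sketch; header name hypothetical):

    #include "augpy/tensor.h"  // hypothetical header; adjust to your build

    augpy::CudaTensor *threshold_mask(augpy::CudaTensor *img) {
        // out == NULL, so a new uint8 mask tensor is allocated and returned.
        return augpy::gt_scalar(img, 127.0, NULL);
    }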
Tensor Management
Converting from and to arrays or tensors, exporting augpy’s CudaTensors to other frameworks, and importing existing tensors from other frameworks without copying.
- py::array *augpy::tensor_to_array1(CudaTensor *tensor)
  Copy a given tensor to a new numpy array. This initiates an asynchronous copy from device to host memory.
- py::array *augpy::tensor_to_array2(CudaTensor *tensor, py::buffer *array)
  Copy a given tensor to a numpy array created from the given buffer array. This initiates an asynchronous copy from device to host memory.
- CudaTensor *augpy::array_to_tensor1(py::buffer *array, int device_id)
  Copy a Python buffer into a new tensor on the specified GPU device. This initiates an asynchronous copy from host to device memory.
- CudaTensor *augpy::array_to_tensor2(py::buffer *array, CudaTensor *tensor)
  Copy a Python buffer into the given tensor. This initiates an asynchronous copy from host to device memory.
- CudaTensor *augpy::import_dltensor(py::capsule *tensor_capsule, const char *name)
  Import a GPU tensor from another library into augpy.
  Note: This requires explicit synchronization if augpy or the interfacing library is running operations on streams other than the default_stream.
  Parameters:
  tensor_capsule: a Python capsule object that contains a DLManagedTensor
  name: name under which the tensor is stored in the capsule, e.g., "dltensor" for PyTorch
- py::capsule *augpy::export_dltensor(py::object *pytensor, std::string *name, bool destruct)
  Export a GPU tensor to be used by another library.
  Note: This requires explicit synchronization if augpy or the interfacing library is running operations on streams other than the default_stream.
  Parameters:
  pytensor: Python-wrapped CudaTensor
  name: name under which the tensor is stored in the returned capsule, e.g., "dltensor" for PyTorch
  destruct: if true, add a destructor to the capsule which deletes the tensor when the capsule is deleted; only set to false if you know what you’re doing
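A hedged sketch of importing a PyTorch CUDA tensor via DLPack from C++ with an embedded interpreter; the augpy header name is a placeholder and error handling is omitted:

    #include <pybind11/embed.h>
    #include "augpy/tensor.h"  // hypothetical header; adjust to your build

    namespace py = pybind11;

    void import_from_torch() {
        py::scoped_interpreter guard;

        // Create a CUDA tensor in PyTorch and export it as a DLPack capsule.
        py::object torch = py::module_::import("torch");
        py::object dlpack = py::module_::import("torch.utils.dlpack");
        py::object t = torch.attr("ones")(2, 3, py::arg("device") = "cuda:0");
        py::capsule cap = dlpack.attr("to_dlpack")(t).cast<py::capsule>();

        // Wrap the capsule's DLManagedTensor as an augpy CudaTensor without
        // copying; PyTorch stores it under the name "dltensor". Synchronize
        // explicitly if either side uses non-default streams (see the notes above).
        augpy::CudaTensor *tensor = augpy::import_dltensor(&cap, "dltensor");
        (void)tensor;
    }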
Utility Functions
Helper functions for writing code that operates on tensors.
- bool augpy::array_equals(int dim0, int ndim, int64_t *array1, int64_t *array2)
  Returns true if array1[dim] == array2[dim] for all dimensions from dim0 to ndim-1.
- void augpy::assert_contiguous(CudaTensor *t)
  Throws std::invalid_argument if t is NULL or not contiguous.
- size_t augpy::numel(CudaTensor *tensor)
  Returns the number of elements in the tensor.
- size_t augpy::numel(py::buffer_info &array)
  Returns the number of elements in the array.
- template<typename scalar_t> size_t augpy::numel(scalar_t *shape, size_t ndim)
  Returns the number of elements in a tensor with the given shape.
- template<typename scalar_t> size_t augpy::numel(std::vector<scalar_t> &shape)
  Returns the number of elements in a tensor with the given shape.
- size_t augpy::numbytes(CudaTensor *tensor)
  Returns the number of bytes occupied by this tensor.
- size_t augpy::numbytes(py::buffer_info &array)
  Returns the number of bytes occupied by this array.
- bool augpy::check_contiguous(CudaTensor *tensor)
  Returns true if the tensor is contiguous.
- bool augpy::check_contiguous(py::buffer_info &array)
  Returns true if the array is contiguous.
- void augpy::check_tensor(CudaTensor *tensor, size_t min_size, bool contiguous)
  Check whether tensor is not NULL, has at least a minimum size in bytes, and is contiguous.
  Parameters:
  tensor: tensor to check
  min_size: check whether numbytes(tensor) >= min_size
  contiguous: if true, also check whether the tensor is contiguous
- void augpy::check_same_device(DLTensor t1, DLTensor t2)
  Check whether t1 and t2 are located on the same GPU device. If not, raises std::invalid_argument.
- void augpy::check_same_dtype_device(DLTensor t1, DLTensor t2)
  Check whether t1 and t2 have the same dtype and are located on the same GPU device. If not, raises std::invalid_argument.
- void augpy::check_same_dtype_device_shape(DLTensor t1, DLTensor t2)
  Check whether t1 and t2 have the same dtype, are located on the same GPU device, and have the same shape. If not, raises std::invalid_argument.
- void augpy::calc_threads(unsigned int &threads, int device_id)
  If threads == 0, set threads to cores_per_sm(device_id).
- void augpy::calc_blocks_values_1d(DLTensor t, unsigned int &num_blocks, size_t &num, unsigned int &values_per_thread, unsigned int threads, unsigned int blocks_per_sm)
  Use heuristics to calculate how many blocks and values per thread to use for the given 1D tensor.
  Values per thread \(v\) is calculated from the number of elements in the tensor t, the number of SMs on the device \(N_{sm}\), the number of blocks per SM \(B_{sm}\), and the number of threads per block \(N_t\):
  \[ v = \left\lceil \frac{numel(t)}{N_{sm} \cdot B_{sm} \cdot N_t} \right\rceil \]
  The number of blocks \(B\) is then calculated like this:
  \[ B = \left\lceil \frac{\lceil numel(t) / v \rceil}{N_t} \right\rceil \]
  Parameters:
  t: input tensor to operate on
  num_blocks: output value, number of blocks in the grid
  num: output value, number of elements in t, i.e., numel(t)
  values_per_thread: input/output value; if >0, specifies the values per thread to use, otherwise will hold the calculated value
  threads: number of threads in each block
  blocks_per_sm: how many blocks to generate per SM on the device; defaults to BLOCKS_PER_SM
- void augpy::calc_blocks_values_nd(DLTensor t, dim3 &grid, size_t &count, unsigned int &values_per_thread, unsigned int threads, unsigned int blocks_per_sm)
  Similar to calc_blocks_values_1d, but for ND tensors. Use heuristics to calculate the size of the block grid and values per thread to use for the given ND tensor.
  For ND tensors, values per thread only applies to the first dimension. The same heuristics are used, but \(v\) cannot exceed the size of the first dimension \(s_0\), so \(v' = \min(v, s_0)\). grid.x is therefore \(\left\lceil \frac{s_0}{v'} \right\rceil\) and grid.y is \(\left\lceil \frac{numel(t)}{s_0 \cdot N_t} \right\rceil\).
  Parameters:
  t: input tensor to operate on
  grid: output value, the block grid used for the kernel launch; grid.x will hold the number of iterations in the first dimension, grid.y the number of blocks required for the remaining dimensions
  count: output value, number of elements in t starting with the second dimension, i.e., numel(t) / t.shape[0]
  values_per_thread: input/output value; if >0, specifies the values per thread to use, otherwise will hold the calculated value
  threads: number of threads in each block
  blocks_per_sm: how many blocks to generate per SM on the device; defaults to BLOCKS_PER_SM
- void augpy::calculate_contiguous_strides(DLTensor t, ndim_array &contiguous_strides)
  Calculate the strides, in number of elements, that the given tensor t would have if it were contiguous.
- bool augpy::calculate_broadcast_strides(DLTensor t_src, DLTensor t_dst, ndim_array &src_strides, const int t_src_index)
  If possible, calculate the strides that are needed to broadcast t_src to t_dst.
  Broadcasting is possible if t_src.ndim <= t_dst.ndim and every dimension up to t_src.ndim either has the same size or t_src has size 1. The stride in a broadcast dimension is zero.
  Returns: true if broadcasting was used
  Parameters:
  t_src: source tensor to broadcast
  t_dst: target tensor to broadcast to
  src_strides: output value, strides required to broadcast
  t_src_index: only used for error message formatting, index of t_src in the function call
  Exceptions:
  std::invalid_argument: if broadcasting is not possible
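For example (illustrative shapes, not from the source): broadcasting a contiguous source of shape (1, 3) to a target of shape (4, 3) yields source strides (0, 1); the zero stride in the first dimension means all four target rows read the same single source row.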
- bool augpy::calculate_broadcast_output_shape(DLTensor t1, DLTensor t2, int &ndim, int64_t *shape)
  If possible, calculate the output shape when broadcasting tensors t1 and t2 together. The output shape will have max(t1.ndim, t2.ndim) dimensions. To broadcast, both tensors must either match the size of a dimension, or one of them must have size 1. The other tensor then determines the size of the dimension.
  Parameters:
  t1: first tensor
  t2: second tensor
  ndim: output value, number of dimensions in output
  shape: output value, shape array, must have at least length max(t1.ndim, t2.ndim)
- bool augpy::calculate_broadcast_output_shape(DLTensor t1, DLTensor t2, int &ndim, ndim_array &shape)
  Alias for calculate_broadcast_output_shape(DLTensor, DLTensor, int&, int64_t*).
- CudaTensor *augpy::create_output_tensor(CudaTensor **tensors, int n_tensors, bool allow_null)
  For all tensors in the given array, get the maximum size in each dimension and create a new tensor of that shape.
  Does not check whether tensors are broadcastable.
  Parameters:
  tensors: tensors that will be broadcast to output
  n_tensors: number of tensors in the array
  allow_null: if true, allow tensors to be NULL, otherwise throw std::invalid_argument
- void augpy::coalesce_dimensions(std::vector<DLTensor> &tensors)
  Manipulate the shapes and strides of the given tensors to coalesce (remove) unnecessary dimensions, thus simplifying the tensors.
  A dimension can be coalesced if all tensors are either contiguous in that dimension or have fewer dimensions.
  Warning: Output is only valid if tensors have strides produced by calculate_broadcast_strides. Tensors must either have the same shape, or be broadcast and thus appear non-contiguous.
DLPack
This is the documentation of the common DLPack header. To quote the readme:
DLPack is an open in-memory tensor structure for sharing tensors among frameworks. DLPack enables
Easier sharing of operators between deep learning frameworks.
Easier wrapping of vendor level operator implementations, allowing collaboration when introducing new devices/ops.
Quick swapping of backend implementations, like different versions of BLAS.
For final users, this could bring more operators, and the possibility of mixing usage between frameworks.
Please refer to their GitHub for more details.
The common header of DLPack.
Copyright (c) 2017 by Contributors
Defines
- DLPACK_EXTERN_C
- DLPACK_VERSION
  The current version of dlpack.
- DLPACK_DLL
  DLPACK_DLL prefix for windows.
Typedefs
- typedef struct DLManagedTensor DLManagedTensor
  C Tensor object, manages the memory of a DLTensor. This data structure is intended to facilitate the borrowing of a DLTensor by another framework. It is not meant to transfer the tensor. When the borrowing framework doesn’t need the tensor, it should call the deleter to notify the host that the resource is no longer needed.
Enums
- enum DLDeviceType
  The device type in DLContext.
  Values:
  - enumerator kDLCPU = 1
    CPU device.
  - enumerator kDLGPU = 2
    CUDA GPU device.
  - enumerator kDLCPUPinned = 3
    Pinned CUDA GPU device by cudaMallocHost.
    Note: kDLCPUPinned = kDLCPU | kDLGPU
  - enumerator kDLOpenCL = 4
    OpenCL devices.
  - enumerator kDLVulkan = 7
    Vulkan buffer for next generation graphics.
  - enumerator kDLMetal = 8
    Metal for Apple GPU.
  - enumerator kDLVPI = 9
    Verilog simulator buffer.
  - enumerator kDLROCM = 10
    ROCm GPUs for AMD GPUs.
  - enumerator kDLExtDev = 12
    Reserved extension device type, used to quickly test extension devices. The semantics can differ depending on the implementation.
- enum DLDataTypeCode
  The type code options of DLDataType.
  Values:
  - enumerator kDLInt = 0U
  - enumerator kDLUInt = 1U
  - enumerator kDLFloat = 2U
- struct DLContext
  #include <dlpack.h>
  A Device context for Tensor and operator.
  Public Members
  - DLDeviceType device_type
    The device type used in the device.
  - int device_id
    The device index.
- struct DLDataType
  #include <dlpack.h>
  The data type the tensor can hold.
  Examples:
  float: type_code = 2, bits = 32, lanes = 1
  float4 (vectorized 4x float): type_code = 2, bits = 32, lanes = 4
  int8: type_code = 0, bits = 8, lanes = 1
  Public Members
  - uint8_t code
    Type code of base types. We keep it uint8_t instead of DLDataTypeCode for minimal memory footprint, but the value should be one of the DLDataTypeCode enum values.
  - uint8_t bits
    Number of bits, common choices are 8, 16, 32.
  - uint16_t lanes
    Number of lanes in the type, used for vector types.
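The predefined augpy constants above are just aggregate initializations of this struct; a sketch (the include path may differ in your build):

    #include <dlpack/dlpack.h>  // path may differ in your build

    // 32 bit single-precision float, scalar (one lane).
    const DLDataType f32 = {kDLFloat, 32, 1};

    // A vectorized float4 type would use four lanes instead.
    const DLDataType f32x4 = {kDLFloat, 32, 4};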
- struct DLTensor
  #include <dlpack.h>
  Plain C Tensor object, does not manage memory.
  Public Members
  - void *data
    The opaque data pointer points to the allocated data. This will be a CUDA device pointer or cl_mem handle in OpenCL. This pointer is always aligned to 256 bytes, as in CUDA.
    For a given DLTensor, the size of the memory required to store the contents of data is calculated as follows:

        static inline size_t GetDataSize(const DLTensor* t) {
          size_t size = 1;
          for (tvm_index_t i = 0; i < t->ndim; ++i) {
            size *= t->shape[i];
          }
          size *= (t->dtype.bits * t->dtype.lanes + 7) / 8;
          return size;
        }
  - int ndim
    Number of dimensions.
  - DLDataType dtype
    The data type of the pointer.
  - int64_t *shape
    The shape of the tensor.
  - int64_t *strides
    Strides of the tensor (in number of elements, not bytes); can be NULL, indicating the tensor is compact and row-major.
  - uint64_t byte_offset
    The offset in bytes to the beginning pointer to data.
- struct DLManagedTensor
  #include <dlpack.h>
  C Tensor object, manages the memory of a DLTensor. This data structure is intended to facilitate the borrowing of a DLTensor by another framework. It is not meant to transfer the tensor. When the borrowing framework doesn’t need the tensor, it should call the deleter to notify the host that the resource is no longer needed.
  Subclassed by augpy::CudaTensor
  Public Members
  - void *manager_ctx
    The context of the original host framework of the DLManagedTensor in which the DLManagedTensor is used. It can also be NULL.
  - void (*deleter)(struct DLManagedTensor *self)
    Destructor signature void (*)(void*) - this should be called to destruct the manager_ctx which holds the DLManagedTensor. It can be NULL if there is no way for the caller to provide a reasonable destructor. The destructor deletes the argument self as well.
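A minimal sketch of the consumer-side contract described above (the function name is illustrative):

    #include <dlpack/dlpack.h>  // path may differ in your build

    // Called by a framework that borrowed a tensor and no longer needs it.
    void release_borrowed(DLManagedTensor* tensor) {
        // The deleter may be NULL if the producer could not provide one.
        if (tensor && tensor->deleter) {
            // Notifies the producing framework; the deleter also frees
            // the DLManagedTensor itself.
            tensor->deleter(tensor);
        }
    }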