Device & Memory

Classes and functions to manage GPU devices and memory.

Device Management

augpy gives you fine-grained control over which Cuda device is used and which Cuda stream kernels run on. All functions are asynchronous by design, so events and streams can be used to synchronize host code with work on the device.

There are two thread-local global variables, current_device and current_stream, that control which device and stream are currently active; both are documented below.

class augpy::CudaEvent

Convenience wrapper for the cudaEvent_t type.

Public Functions

CudaEvent()

Get a Cuda event from the event pool of the current_device.

~CudaEvent() noexcept(false)
cudaEvent_t get_event()

Return the wrapped Cuda event.

void record()

Record wrapped event on current_stream.

bool query()

Returns true if event has occurred.

void synchronize(int microseconds = 100)

Block until the event has occurred, polling every microseconds microseconds. Shorter intervals make this more accurate, but increase CPU load. Uses the standard Cuda busy-waiting method if microseconds <= 0.
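
A minimal usage sketch: record an event after enqueuing work, then wait for it from the host. The header path is an assumption; adjust it to your installation.

    #include <augpy/core.h>  // hypothetical header path; adjust to your build

    void wait_for_pending_work() {
        augpy::CudaEvent event;  // drawn from the event pool of current_device
        event.record();          // record on current_stream, after queued work
        // ... host code can keep running while the GPU works ...
        event.synchronize(100);  // poll every 100 microseconds until done
    }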

class augpy::CudaStream

Convenience wrapper for the cudaStream_t type.

Public Functions

CudaStream(int device_id = -1, int priority = -1)

Create a new Cuda stream on the given device. Lower numbers mean higher priority, and values are clipped to the valid range. Use get_device_properties to get the range of possible values for a device. See cudaStreamCreateWithPriority for more details.

Use device_id=-1 and priority=-1 to get the default_stream.

CudaStream(cudaStream_t stream)

Wrap the given cudaStream_t in a CudaStream.

~CudaStream() noexcept(false)
cudaStream_t &get_stream()

Return the wrapped Cuda stream.

void activate()

Make this the current_stream and remember the previous stream.

void deactivate()

Make the previous stream the current_stream.

void synchronize(int microseconds = 100)

Block until all work on this stream has finished, polling every microseconds microseconds. Shorter intervals make this more accurate, but increase CPU load. Uses the standard Cuda busy-waiting method if microseconds <= 0.

std::string repr()

Returns a concise string representation of this stream.
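
A sketch of how these pieces combine, assuming device 0 and the hypothetical header path from above: create a high-priority stream, make it current, then wait for its work.

    #include <augpy/core.h>  // hypothetical header path; adjust to your build

    void run_on_high_priority_stream() {
        augpy::cudaDevicePropEx props = augpy::get_device_properties(0);
        // lower numbers mean higher priority, so greatestStreamPriority
        // is the numerically lowest valid value
        augpy::CudaStream stream(0, props.greatestStreamPriority);
        stream.activate();        // subsequent kernels run on this stream
        // ... enqueue augpy operations here ...
        stream.synchronize(100);  // poll every 100 microseconds
        stream.deactivate();      // restore the previous current_stream
    }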

int augpy::current_device

Controls which GPU device is used by each thread.

cudaStream_t augpy::current_stream

Controls which Cuda stream is used by each thread.

const CudaStream augpy::default_stream = CudaStream(-1, -1)

The default Cuda stream as a wrapped CudaStream.

Device Information

struct augpy::cudaDevicePropEx : public cudaDeviceProp

The cudaDeviceProp struct extended with stream priority fields.

Public Members

int leastStreamPriority = 0

Lowest priority a Cuda stream on this device can have.

int greatestStreamPriority = 0

Highest priority a Cuda stream on this device can have.

int coresPerSM = 0

Number of Cuda cores per SM.

int numCudaCores = 0

Total number of Cuda cores.

cudaDevicePropEx augpy::get_device_properties(int device_id)

Returns the device properties of the given GPU device.

int augpy::get_num_cuda_cores(int device_id)

Returns the number of Cuda cores of the given GPU device.

int augpy::cores_per_sm(int device_id)

Given a GPU device id, returns the number of Cuda cores per SM.

int augpy::cores_per_sm(int major, int minor)

Given the major and minor compute capability (e.g., 7 and 5), returns the number of Cuda cores per SM.
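
A short sketch of how these helpers relate to the fields of cudaDevicePropEx (header path hypothetical, as above):

    #include <cstdio>
    #include <augpy/core.h>  // hypothetical header path

    void print_core_counts(int device_id) {
        augpy::cudaDevicePropEx props = augpy::get_device_properties(device_id);
        // numCudaCores should equal multiProcessorCount * coresPerSM
        std::printf("%s: %d SMs x %d cores/SM = %d Cuda cores\n",
                    props.name, props.multiProcessorCount,
                    props.coresPerSM, props.numCudaCores);
        // equivalent lookups via the standalone helpers
        int total = augpy::get_num_cuda_cores(device_id);
        int per_sm = augpy::cores_per_sm(props.major, props.minor);
        (void)total; (void)per_sm;
    }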

Memory Management

std::tuple<size_t, size_t, size_t> augpy::meminfo(int device_id)

For the device defined by device_id, return the current used, free, and total memory in bytes.
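
For example, with C++17 structured bindings (header path hypothetical):

    #include <cstdio>
    #include <augpy/core.h>  // hypothetical header path

    void report_memory(int device_id) {
        // tuple order is (used, free, total), all in bytes
        auto [used, free_bytes, total] = augpy::meminfo(device_id);
        std::printf("device %d: %zu used, %zu free, %zu total\n",
                    device_id, used, free_bytes, total);
    }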

struct augpy::managed_allocation

A chunk of managed GPU memory.

Public Functions

managed_allocation(int device_id, size_t size)

Create a new struct with the given device_id and size. Does not allocate any memory; use managed_cudamalloc instead.

void record()

Record event on the current_stream to mark the memory as in use.

Public Members

int device_id

GPU device id

size_t size

Number of bytes in allocation

void *ptr

Pointer to allocated memory

cudaEvent_t event

Cuda event used to track whether memory is currently in use

std::shared_ptr<managed_allocation> augpy::managed_cudamalloc(size_t size, int device_id)

Allocate size bytes on the GPU with the given device id. Returns a managed_allocation as std::shared_ptr. Throws cuda_error.

Once all instances of the shared_ptr are deleted, the allocated memory is marked for deletion/reuse.
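
A lifetime sketch, again assuming the hypothetical header path:

    #include <memory>
    #include <augpy/core.h>  // hypothetical header path

    void use_managed_memory() {
        // allocate 1 MiB on device 0; lifetime is tied to the shared_ptr
        std::shared_ptr<augpy::managed_allocation> buf =
            augpy::managed_cudamalloc(1 << 20, 0);
        void* ptr = buf->ptr;  // raw device pointer for kernel launches
        // ... enqueue kernels that use ptr ...
        // once the last shared_ptr copy goes away, the block is marked
        // for deletion/reuse; no explicit managed_cudafree call needed
    }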

void augpy::managed_cudafree(void *ptr)

Frees device memory allocated by managed_cudamalloc at the given location. Throws cuda_error.

void augpy::managed_eventalloc(cudaEvent_t *event)

Return a Cuda event from the event pool of the current_device. Flags cudaEventBlockingSync and cudaEventDisableTiming are set.

void augpy::managed_eventfree(cudaEvent_t event)

Mark the given Cuda event as reusable.
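
A sketch of the pool protocol (header path hypothetical):

    #include <cuda_runtime.h>
    #include <augpy/core.h>  // hypothetical header path

    void use_pooled_event() {
        cudaEvent_t event;
        augpy::managed_eventalloc(&event);  // from the pool of current_device
        cudaEventRecord(event, augpy::current_stream);
        // ... cudaEventQuery / cudaEventSynchronize as needed ...
        augpy::managed_eventfree(event);    // mark it reusable, do not destroy
    }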

void augpy::init_device(int device_id)

Initialize the GPU with the given device_id. You can, but do not need to, do this manually. It is done for you whenever you request memory or device properties for the first time.

void augpy::release()

Release all allocated memory on all GPUs. All CudaTensors become invalid immediately. Do I have to tell you this is dangerous?

cnmem

augpy uses cnmem to manage GPU device memory.

Defines

CNMEM_API
CNMEM_VERSION

Typedefs

typedef struct cnmemDevice_t_ cnmemDevice_t

Enums

enum cnmemStatus_t

Values:

enumerator CNMEM_STATUS_SUCCESS = 0
enumerator CNMEM_STATUS_CUDA_ERROR
enumerator CNMEM_STATUS_INVALID_ARGUMENT
enumerator CNMEM_STATUS_NOT_INITIALIZED
enumerator CNMEM_STATUS_OUT_OF_MEMORY
enumerator CNMEM_STATUS_UNKNOWN_ERROR
enum cnmemManagerFlags_t

Values:

enumerator CNMEM_FLAGS_DEFAULT = 0

Default flags.

enumerator CNMEM_FLAGS_CANNOT_GROW = 1

Prevent the manager from growing its memory consumption.

enumerator CNMEM_FLAGS_CANNOT_STEAL = 2

Prevent the manager from stealing memory.

enumerator CNMEM_FLAGS_MANAGED = 4

Use managed memory (cudaMallocManaged) rather than cudaMalloc.

Functions

cnmemStatus_t cnmemInit(int numDevices, const cnmemDevice_t *devices, unsigned flags)

Initialize the library and allocate memory on the listed devices.

For each device, an internal memory manager is created and the specified amount of memory is allocated (it is the size defined in device[i].size). For each named stream, an additional memory manager is created. Currently, it is implemented as a tree of memory managers: a root manager for the device and a list of children, one for each named stream.

This function must be called before any other function in the library. It has to be called by a single thread since it is not thread-safe.

Return

CNMEM_STATUS_SUCCESS, if everything goes fine, CNMEM_STATUS_INVALID_ARGUMENT, if one of the arguments is invalid, CNMEM_STATUS_OUT_OF_MEMORY, if the requested size exceeds the available memory, CNMEM_STATUS_CUDA_ERROR, if an error happens in a CUDA function.
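
A minimal initialization sketch for a single device, letting cnmem choose the pool size:

    #include <cstring>
    #include <cnmem.h>

    bool init_single_device() {
        cnmemDevice_t device;
        std::memset(&device, 0, sizeof(device));
        device.device = 0;      // manage GPU 0
        device.size = 0;        // 0 lets the implementation choose the size
        device.numStreams = 0;  // no named streams; only the NULL stream
        return cnmemInit(1, &device, CNMEM_FLAGS_DEFAULT) == CNMEM_STATUS_SUCCESS;
    }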

cnmemStatus_t cnmemFinalize()

Release all the allocated memory.

This function must be called by a single thread and after all threads that called cnmemMalloc/cnmemFree have joined. This function is not thread-safe.

Return

CNMEM_STATUS_SUCCESS, if everything goes fine, CNMEM_STATUS_NOT_INITIALIZED, if the cnmemInit function has not been called, CNMEM_STATUS_CUDA_ERROR, if an error happens in one of the CUDA functions.

cnmemStatus_t cnmemRetain()

Increase the internal reference counter of the context object.

This function increases the internal reference counter of the library. The purpose of that reference counting mechanism is to give more control to the user over the lifetime of the library. It is useful with scoped memory allocation which may be destroyed in a final memory collection after the end of main(). That function is thread-safe.

Return

CNMEM_STATUS_SUCCESS, if everything goes fine, CNMEM_STATUS_NOT_INITIALIZED, if the cnmemInit function has not been called.

cnmemStatus_t cnmemRelease()

Decrease the internal reference counter of the context object.

This function decreases the internal reference counter of the library. The purpose of that reference counting mechanism is to give more control to the user over the lifetime of the library. It is useful with scoped memory allocation which may be destroyed in a final memory collection after the end of main(). That function is thread-safe.

You can use cnmemRelease to explicitly finalize the library.

Return

CNMEM_STATUS_SUCCESS, if everything goes fine, CNMEM_STATUS_NOT_INITIALIZED, if the cnmemInit function has not been called.

cnmemStatus_t cnmemRegisterStream(cudaStream_t stream)

Add a new stream to the pool of managed streams on a device.

This function registers a new stream into a device memory manager. It is thread-safe.

Return

CNMEM_STATUS_SUCCESS, if everything goes fine, CNMEM_STATUS_INVALID_ARGUMENT, if one of the arguments is invalid.
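
For example, registering a stream created after cnmemInit so that it gets its own child manager (cudaStreamCreate is the plain CUDA runtime call):

    #include <cnmem.h>
    #include <cuda_runtime.h>

    cnmemStatus_t alloc_on_named_stream(size_t size, void **ptr) {
        cudaStream_t stream;
        if (cudaStreamCreate(&stream) != cudaSuccess) {
            return CNMEM_STATUS_CUDA_ERROR;
        }
        // give the stream its own child memory manager
        cnmemStatus_t status = cnmemRegisterStream(stream);
        if (status != CNMEM_STATUS_SUCCESS) {
            return status;
        }
        // requests on this stream now go to its manager first (see cnmemMalloc)
        return cnmemMalloc(ptr, size, stream);
    }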

cnmemStatus_t cnmemMalloc(void **ptr, size_t size, cudaStream_t stream)

Allocate memory.

This function allocates memory and initializes a pointer to device memory. If no memory is available, it returns a CNMEM_STATUS_OUT_OF_MEMORY error. This function is thread safe.

The behavior of that function is the following:

  • If the stream is NULL, the root memory manager is asked to allocate a buffer of device memory. If there is a buffer of size larger than or equal to the requested size in the list of free blocks, it is returned. If there is no such buffer but the manager is allowed to grow its memory usage (the CNMEM_FLAGS_CANNOT_GROW flag is not set), the memory manager calls cudaMalloc. If cudaMalloc fails because no more memory is available, or the manager is not allowed to grow, the manager attempts to steal memory from one of its children (unless CNMEM_FLAGS_CANNOT_STEAL is set). If that attempt also fails, the manager returns CNMEM_STATUS_OUT_OF_MEMORY.

  • If the stream is a named stream, the initial request goes to the memory manager associated with that stream. If a free node is available in the lists of that manager, it is returned. Otherwise, the request is passed to the root node and works as if the request were made on the NULL stream.

Calls to cudaMalloc are potentially costly and may induce GPU synchronizations. The mechanism to steal memory from the children also induces GPU synchronizations (the manager has to make sure no kernel uses a given buffer before stealing it), and its execution is sequential (in a multi-threaded context, the code is executed in a critical section inside the cnmem library, so there is no need for the user to wrap cnmemMalloc with locks).

Return

CNMEM_STATUS_SUCCESS, if everything goes fine, CNMEM_STATUS_NOT_INITIALIZED, if the cnmemInit function has not been called, CNMEM_STATUS_INVALID_ARGUMENT, if one of the arguments is invalid (for example, ptr == 0), CNMEM_STATUS_OUT_OF_MEMORY, if there is not enough memory available, CNMEM_STATUS_CUDA_ERROR, if an error happens in one of the CUDA functions.

cnmemStatus_t cnmemFree(void *ptr, cudaStream_t stream)

Release memory.

This function releases memory and recycles a memory block in the manager. This function is thread safe.

Return

CNMEM_STATUS_SUCCESS, if everything goes fine, CNMEM_STATUS_NOT_INITIALIZED, if the cnmemInit function has not been called, CNMEM_STATUS_INVALID_ARGUMENT, if one of the arguments is invalid (for example, ptr == 0), CNMEM_STATUS_CUDA_ERROR, if an error happens in one of the CUDA functions.
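
A matched allocate/release sketch on the NULL stream, i.e. against the root manager:

    #include <cstddef>
    #include <cnmem.h>

    cnmemStatus_t alloc_use_free(size_t size) {
        void *ptr = NULL;
        cnmemStatus_t status = cnmemMalloc(&ptr, size, NULL);
        if (status != CNMEM_STATUS_SUCCESS) {
            return status;
        }
        // ... launch kernels that use ptr on the NULL stream ...
        return cnmemFree(ptr, NULL);  // recycles the block in the root manager
    }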

cnmemStatus_t cnmemMemGetInfo(size_t *freeMem, size_t *totalMem, cudaStream_t stream)

Returns the amount of memory managed by the memory manager associated with a stream.

The pointers totalMem and freeMem must be valid. At the moment, this function has complexity linear in the number of allocated blocks, so do not call it in performance-critical sections.

Return

CNMEM_STATUS_SUCCESS, if everything goes fine, CNMEM_STATUS_NOT_INITIALIZED, if the cnmemInit function has not been called, CNMEM_STATUS_INVALID_ARGUMENT, if one of the arguments is invalid, CNMEM_STATUS_CUDA_ERROR, if an error happens in one of the CUDA functions.
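
For example, querying the root manager by passing the NULL stream:

    #include <cstdio>
    #include <cnmem.h>

    void report_manager_memory() {
        size_t free_mem = 0, total_mem = 0;
        if (cnmemMemGetInfo(&free_mem, &total_mem, NULL) == CNMEM_STATUS_SUCCESS) {
            std::printf("root manager: %zu of %zu bytes free\n",
                        free_mem, total_mem);
        }
    }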

cnmemStatus_t cnmemPrintMemoryState(FILE *file, cudaStream_t stream)

Print a list of nodes to a file.

This function is intended to be used in case of complex scenarios to help understand the behaviour of the memory managers/application. It is thread safe.

Return

CNMEM_STATUS_SUCCESS, if everything goes fine, CNMEM_STATUS_NOT_INITIALIZED, if the cnmemInit function has not been called, CNMEM_STATUS_INVALID_ARGUMENT, if one of the argument is invalid. For example, used_mem == 0 or free_mem == 0, CNMEM_STATUS_CUDA_ERROR, if an error happens in one of the CUDA functions.

const char *cnmemGetErrorString(cnmemStatus_t status)

Converts a cnmemStatus_t value to a string.

struct cnmemDevice_t_
#include <cnmem.h>

Public Members

int device

The device number.

size_t size

The size to allocate for that device. If 0, the implementation chooses the size.

int numStreams

The number of named streams associated with the device. The NULL stream is not counted.

cudaStream_t *streams

The streams associated with the device. It can be NULL. The NULL stream is managed.

size_t *streamSizes

The size reserved for each stream. It can be 0.