Device & Memory¶
Classes and functions to manage GPU devices and memory.
Device Management¶
augpy gives you fine control over which Cuda device is used and which Cuda stream kernels are run on. All functions are asynchronous by design, so events and streams can be used to synchronize host code.
There are two thread-local global variables, current_device and current_stream, that control which device and stream are currently active:
- class augpy::CudaEvent¶ Convenience wrapper for the cudaEvent_t type.
Public Functions
- CudaEvent()¶ Get a Cuda event from the event pool of the current_device.
- ~CudaEvent() noexcept(false)
- cudaEvent_t get_event()¶ Return the wrapped Cuda event.
- void record()¶ Record the wrapped event on current_stream.
- bool query()¶ Returns true if the event has occurred.
- void synchronize(int microseconds = 100)¶ Block until the event has occurred, checking in microseconds intervals. Shorter intervals make this more accurate, but increase CPU load. Uses the standard Cuda busy-waiting method if microseconds <= 0.
- class augpy::CudaStream¶ Convenience wrapper for the cudaStream_t type.
Public Functions
- CudaStream(int device_id = -1, int priority = -1)¶ Create a new Cuda stream on the given device. Lower numbers mean higher priority, and values are clipped to the valid range. Use get_device_properties to get the range of possible values for a device. See cudaStreamCreateWithPriority for more details. Use device_id=-1 and priority=-1 to get the default_stream.
- CudaStream(cudaStream_t stream)¶ Wrap the given cudaStream_t in a CudaStream.
- ~CudaStream() noexcept(false)
- cudaStream_t &get_stream()¶ Return the wrapped Cuda stream.
- void activate()¶ Make this the current_stream and remember the previous stream.
- void deactivate()¶ Make the previous stream the current_stream.
- void synchronize(int microseconds = 100)¶ Block until all work on this stream has finished, checking in microseconds intervals. Shorter intervals make this more accurate, but increase CPU load. Uses the standard Cuda busy-waiting method if microseconds <= 0.
- std::string repr()¶ Returns a concise string representation of this stream.
- int augpy::current_device¶ Controls which GPU device is used by each thread.
- cudaStream_t augpy::current_stream¶ Controls which Cuda stream is used by each thread.
- const CudaStream augpy::default_stream = CudaStream(-1, -1)¶ The default Cuda stream as a wrapped CudaStream.
Device Information¶
- struct augpy::cudaDevicePropEx : public cudaDeviceProp¶ The cudaDeviceProp struct extended with stream priority fields.
- cudaDevicePropEx augpy::get_device_properties(int device_id)¶ Returns the device properties of the given GPU device.
- int augpy::get_num_cuda_cores(int device_id)¶ Returns the number of Cuda cores of the given GPU device.
- int augpy::cores_per_sm(int device_id)¶ Given a GPU device id, returns the number of Cuda cores per SM.
- int augpy::cores_per_sm(int major, int minor)¶ Given the major and minor Cuda compute capability (e.g., 7 and 5), returns the number of Cuda cores per SM.
Memory Management¶
- std::tuple<size_t, size_t, size_t> augpy::meminfo(int device_id)¶ For the device defined by device_id, return the current used, free, and total memory in bytes.
- struct augpy::managed_allocation¶ A chunk of managed GPU memory.
Public Functions
- managed_allocation(int device_id, size_t size)¶ Make a new struct filled with device_id and size. Does not allocate memory. Use managed_cudamalloc instead.
- void record()¶
- std::shared_ptr<managed_allocation> augpy::managed_cudamalloc(size_t size, int device_id)¶ Malloc size bytes on the GPU with the given device_id. Returns a managed_allocation as std::shared_ptr. Throws cuda_error. Once all instances of the shared_ptr are deleted, the allocated memory is marked for deletion/reuse.
- void augpy::managed_cudafree(void *ptr)¶ Frees device memory allocated by managed_cudamalloc at the given location. Throws cuda_error.
- void augpy::managed_eventalloc(cudaEvent_t *event)¶ Return a Cuda event from the event pool of the current_device. The flags cudaEventBlockingSync and cudaEventDisableTiming are set.
- void augpy::managed_eventfree(cudaEvent_t event)¶ Mark the given Cuda event as reusable.
- void augpy::init_device(int device_id)¶ Initialize the GPU with the given device_id. You can, but do not need to, call this manually. It is done for you whenever you request memory or device properties for the first time.
- void augpy::release()¶ Release all allocated memory on all GPUs. All CudaTensors become invalid immediately. Do I have to tell you this is dangerous?
cnmem¶
augpy uses cnmem to manage GPU device memory.
Typedefs
- typedef struct cnmemDevice_t_ cnmemDevice_t¶
Enums
- enum cnmemStatus_t¶ Values:
- enumerator CNMEM_STATUS_SUCCESS = 0¶
- enumerator CNMEM_STATUS_CUDA_ERROR¶
- enumerator CNMEM_STATUS_INVALID_ARGUMENT¶
- enumerator CNMEM_STATUS_NOT_INITIALIZED¶
- enumerator CNMEM_STATUS_OUT_OF_MEMORY¶
- enumerator CNMEM_STATUS_UNKNOWN_ERROR¶
- enum cnmemManagerFlags_t¶ Values:
- enumerator CNMEM_FLAGS_DEFAULT = 0¶ Default flags.
- enumerator CNMEM_FLAGS_CANNOT_GROW = 1¶ Prevent the manager from growing its memory consumption.
- enumerator CNMEM_FLAGS_CANNOT_STEAL = 2¶ Prevent the manager from stealing memory.
- enumerator CNMEM_FLAGS_MANAGED = 4¶ Use managed memory (cudaMallocManaged).
Functions
- cnmemStatus_t cnmemInit(int numDevices, const cnmemDevice_t *devices, unsigned flags)¶ Initialize the library and allocate memory on the listed devices.
For each device, an internal memory manager is created and the specified amount of memory is allocated (the size defined in devices[i].size). For each named stream, an additional memory manager is created. Currently, it is implemented as a tree of memory managers: a root manager for the device and a list of children, one for each named stream.
This function must be called before any other function in the library. It has to be called by a single thread since it is not thread-safe.
- Return
CNMEM_STATUS_SUCCESS, if everything goes fine, CNMEM_STATUS_INVALID_ARGUMENT, if one of the arguments is invalid, CNMEM_STATUS_OUT_OF_MEMORY, if the requested size exceeds the available memory, CNMEM_STATUS_CUDA_ERROR, if an error happens in a CUDA function.
- cnmemStatus_t cnmemFinalize()¶ Release all the allocated memory.
This function must be called by a single thread and after all threads that called cnmemMalloc/cnmemFree have joined. This function is not thread-safe.
- Return
CNMEM_STATUS_SUCCESS, if everything goes fine, CNMEM_STATUS_NOT_INITIALIZED, if the cnmemInit function has not been called, CNMEM_STATUS_CUDA_ERROR, if an error happens in one of the CUDA functions.
- cnmemStatus_t cnmemRetain()¶ Increase the internal reference counter of the context object.
This function increases the internal reference counter of the library. The purpose of that reference counting mechanism is to give more control to the user over the lifetime of the library. It is useful with scoped memory allocation which may be destroyed in a final memory collection after the end of main(). That function is thread-safe.
- Return
CNMEM_STATUS_SUCCESS, if everything goes fine, CNMEM_STATUS_NOT_INITIALIZED, if the cnmemInit function has not been called,
- cnmemStatus_t cnmemRelease()¶ Decrease the internal reference counter of the context object.
This function decreases the internal reference counter of the library. The purpose of that reference counting mechanism is to give more control to the user over the lifetime of the library. It is useful with scoped memory allocation which may be destroyed in a final memory collection after the end of main(). That function is thread-safe.
You can use cnmemRelease to explicitly finalize the library.
- Return
CNMEM_STATUS_SUCCESS, if everything goes fine, CNMEM_STATUS_NOT_INITIALIZED, if the cnmemInit function has not been called,
- cnmemStatus_t cnmemRegisterStream(cudaStream_t stream)¶ Add a new stream to the pool of managed streams on a device.
This function registers a new stream into a device memory manager. It is thread-safe.
- Return
CNMEM_STATUS_SUCCESS, if everything goes fine, CNMEM_STATUS_INVALID_ARGUMENT, if one of the arguments is invalid,
- cnmemStatus_t cnmemMalloc(void **ptr, size_t size, cudaStream_t stream)¶ Allocate memory.
This function allocates memory and initializes a pointer to device memory. If no memory is available, it returns a CNMEM_STATUS_OUT_OF_MEMORY error. This function is thread safe.
The behavior of that function is the following:
If the stream is NULL, the root memory manager is asked to allocate a buffer of device memory. If there’s a buffer of size larger or equal to the requested size in the list of free blocks, it is returned. If there’s no such buffer but the manager is allowed to grow its memory usage (the CNMEM_FLAGS_CANNOT_GROW flag is not set), the memory manager calls cudaMalloc. If cudaMalloc fails due to no more available memory or the manager is not allowed to grow, the manager attempts to steal memory from one of its children (unless CNMEM_FLAGS_CANNOT_STEAL is set). If that attempt also fails, the manager returns CNMEM_STATUS_OUT_OF_MEMORY.
If the stream is a named stream, the initial request goes to the memory manager associated with that stream. If a free node is available in the lists of that manager, it is returned. Otherwise, the request is passed to the root node and works as if the request were made on the NULL stream.
The calls to cudaMalloc are potentially costly and may induce GPU synchronizations. The mechanism to steal memory from the children also induces GPU synchronizations (the manager has to make sure no kernel uses a given buffer before stealing it), and the execution is sequential: in a multi-threaded context, the code is executed in a critical section inside the cnmem library, so there is no need for the user to wrap cnmemMalloc with locks.
- Return
CNMEM_STATUS_SUCCESS, if everything goes fine, CNMEM_STATUS_NOT_INITIALIZED, if the cnmemInit function has not been called, CNMEM_STATUS_INVALID_ARGUMENT, if one of the arguments is invalid (for example, ptr == 0), CNMEM_STATUS_OUT_OF_MEMORY, if there is not enough memory available, CNMEM_STATUS_CUDA_ERROR, if an error happens in one of the CUDA functions.
- cnmemStatus_t cnmemFree(void *ptr, cudaStream_t stream)¶ Release memory.
This function releases memory and recycles a memory block in the manager. This function is thread safe.
- Return
CNMEM_STATUS_SUCCESS, if everything goes fine, CNMEM_STATUS_NOT_INITIALIZED, if the cnmemInit function has not been called, CNMEM_STATUS_INVALID_ARGUMENT, if one of the arguments is invalid (for example, ptr == 0), CNMEM_STATUS_CUDA_ERROR, if an error happens in one of the CUDA functions.
- cnmemStatus_t cnmemMemGetInfo(size_t *freeMem, size_t *totalMem, cudaStream_t stream)¶ Returns the amount of memory managed by the memory manager associated with a stream.
The pointers totalMem and freeMem must be valid. At the moment, this function has a complexity linear in the number of allocated blocks, so do not call it in performance-critical sections.
- Return
CNMEM_STATUS_SUCCESS, if everything goes fine, CNMEM_STATUS_NOT_INITIALIZED, if the cnmemInit function has not been called, CNMEM_STATUS_INVALID_ARGUMENT, if one of the arguments is invalid, CNMEM_STATUS_CUDA_ERROR, if an error happens in one of the CUDA functions.
- cnmemStatus_t cnmemPrintMemoryState(FILE *file, cudaStream_t stream)¶ Print a list of nodes to a file.
This function is intended to be used in case of complex scenarios to help understand the behaviour of the memory managers/application. It is thread safe.
- Return
CNMEM_STATUS_SUCCESS, if everything goes fine, CNMEM_STATUS_NOT_INITIALIZED, if the cnmemInit function has not been called, CNMEM_STATUS_INVALID_ARGUMENT, if one of the arguments is invalid (for example, used_mem == 0 or free_mem == 0), CNMEM_STATUS_CUDA_ERROR, if an error happens in one of the CUDA functions.
- const char *cnmemGetErrorString(cnmemStatus_t status)¶ Converts a cnmemStatus_t value to a string.
- struct cnmemDevice_t_¶ #include <cnmem.h>
Public Members
- int device¶ The device number.
- size_t size¶ The size to allocate for that device. If 0, the implementation chooses the size.
- int numStreams¶ The number of named streams associated with the device. The NULL stream is not counted.
- cudaStream_t *streams¶ The streams associated with the device. It can be NULL. The NULL stream is managed.
- size_t *streamSizes¶ The size reserved for each stream. It can be 0.
-
int