Core Functionality

This section covers the exceptions raised by augpy, managing devices and computation streams, and controlling how functions are run on GPUs.

Device Management

augpy gives you fine-grained control over which Cuda device is used and on which Cuda stream kernels run. All functions are asynchronous by design, so events and streams can be used to synchronize host code with the GPU.

There are two thread-local global variables that control which device and stream are currently active.

current_device

Each thread tracks its currently used Cuda device in the current_device variable. Use CudaDevice.activate() to make a device the current device and CudaDevice.deactivate() to restore the previous state. Use get_current_device() to get the currently active device.

current_stream

Each thread tracks its currently used Cuda stream in the current_stream variable. Use CudaStream.activate() to make a stream the current stream and CudaStream.deactivate() to restore the previous state. Use get_current_stream() to get the currently active stream.
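A minimal sketch of this activation pattern (device ID 1 is an assumption; any valid device ID works):

    import augpy

    device = augpy.CudaDevice(1)            # assumption: a second GPU exists
    device.activate()                       # device 1 is now current_device
    stream = augpy.CudaStream(device_id=1)  # new stream on device 1
    stream.activate()                       # this stream is now current_stream
    # ... enqueue work here; it runs on device 1 in this stream ...
    stream.deactivate()                     # restore the previous stream
    device.deactivate()                     # restore the previous device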

default_stream

You can use the default_stream to synchronize CPU and GPU execution without explicitly creating and activating a different stream.

Note

All operations in augpy are asynchronous with respect to the CPU, so calling CudaTensor.numpy() will initiate copying data from device to host memory and return immediately. Use CudaStream.synchronize(), or CudaEvent.record() and CudaEvent.synchronize(), to ensure that the data is fully copied before the array is accessed.
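For example (a sketch; tensor stands for an existing CudaTensor):

    import augpy

    # tensor: an existing augpy.CudaTensor (assumption)
    stream = augpy.get_current_stream()
    array = tensor.numpy()   # initiates the device-to-host copy, returns immediately
    stream.synchronize()     # block until the copy has finished
    # array is now safe to access

    # alternatively, with an event:
    event = augpy.CudaEvent()
    array = tensor.numpy()
    event.record()           # mark this point on current_stream
    event.synchronize()      # wait until all work up to the marker is done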

class augpy.CudaDevice(device_id: int)[source]

Create a new CudaDevice with the given Cuda device ID. 0 is the default and typically the fastest device in the system.

Parameters

device_id (int) – GPU device ID

__init__(self: augpy._augpy.CudaDevice, device_id: int) → None[source]
Return type

None

activate(self: augpy._augpy.CudaDevice) → None[source]

Make this the current_device and remember the previous device.

Return type

None

deactivate(self: augpy._augpy.CudaDevice) → None[source]

Make the previous device the current_device.

Return type

None

get_device(self: augpy._augpy.CudaDevice) → int[source]

Return the device ID.

Return type

int

get_properties(self: augpy._augpy.CudaDevice) → augpy._augpy.CudaDeviceProp[source]

Return the device properties; see get_device_properties() for more details.

Return type

CudaDeviceProp

synchronize(self: augpy._augpy.CudaDevice) → None[source]

Block until all work on this device has finished. Cuda uses busy-waiting to achieve this. See the synchronize() methods of CudaStream and CudaEvent to avoid the CPU load this incurs.

Return type

None
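For example (a sketch; prefer stream or event synchronization in code sensitive to CPU load):

    device = augpy.CudaDevice(0)
    # ... enqueue work on any stream of device 0 ...
    device.synchronize()   # busy-waits until the whole device is idle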

class augpy.CudaEvent[source]

Convenience wrapper for the cudaEvent_t type.

Creating a new CudaEvent retrieves an event from the event pool of the current_device.

__init__(self: augpy._augpy.CudaEvent) → None[source]
Return type

None

query(self: augpy._augpy.CudaEvent) → bool[source]

Returns True if the event has occurred.

Return type

bool

record(self: augpy._augpy.CudaEvent) → None[source]

Record the wrapped event on the current_stream.

Return type

None

synchronize(self: augpy._augpy.CudaEvent, microseconds: int = 100) → None[source]

Block until the event has occurred. The event status is checked every microseconds microseconds. Shorter intervals make this more accurate, but increase CPU load. If microseconds <= 0, the standard Cuda busy-waiting method is used.

Parameters

microseconds (int) – check interval

Return type

None
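A typical record/synchronize pattern might look like this (a sketch; the enqueued work is a placeholder):

    event = augpy.CudaEvent()   # taken from the current device's event pool
    # ... enqueue kernels on current_stream ...
    event.record()              # mark this point on current_stream
    event.synchronize()         # poll every 100 microseconds by default
    assert event.query()        # the event has now occurred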

class augpy.CudaStream(device_id: int = 0, priority: int = 0)[source]

Convenience wrapper for the cudaStream_t type.

Creates a new Cuda stream on the given device. Lower numbers mean higher priority, and values are clipped to the valid range. Use get_device_properties() to get the range of possible values for a device.

See:

cudaStreamCreateWithPriority

Use device_id=-1 and priority=-1 to get the default_stream.

Parameters
  • device_id (int) – GPU device ID

  • priority (int) – stream priority
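For example, to create a stream with the highest priority a device supports (a sketch for device 0):

    props = augpy.get_device_properties(0)
    # lower numbers mean higher priority; values outside the range are clipped
    stream = augpy.CudaStream(device_id=0, priority=props.greatestStreamPriority)
    stream.activate()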

__init__(self: augpy._augpy.CudaStream, device_id: int = 0, priority: int = 0) → None[source]
Return type

None

activate(self: augpy._augpy.CudaStream) → None[source]

Make this the current_stream and remember the previous stream.

Return type

None

deactivate(self: augpy._augpy.CudaStream) → None[source]

Make the previous stream the current_stream.

Return type

None

synchronize(self: augpy._augpy.CudaStream, microseconds: int = 100) → None[source]

Block until all work on this stream has finished. The stream status is checked every microseconds microseconds. Shorter intervals make this more accurate, but increase CPU load. If microseconds <= 0, the standard Cuda busy-waiting method is used.

Return type

None

augpy.get_current_device() → int[source]

Returns the active device ID.

See:

current_device.

Return type

int

augpy.get_current_stream() → augpy._augpy.CudaStream[source]

Returns the active CudaStream.

See:

current_stream

Return type

CudaStream

augpy.default_stream

The default CudaStream. Implicitly available on all Cuda devices.

augpy.release() → None[source]

Release all allocated memory on all GPUs. All CudaTensors become invalid immediately. Do I have to tell you this is dangerous?

Return type

None

Device Information

augpy.get_device_properties(device_id: int) → augpy._augpy.CudaDeviceProp[source]

Get the CudaDeviceProp for the given device.

Parameters

device_id (int) – Cuda device ID

Returns

properties of device

Return type

CudaDeviceProp
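For example (a sketch for device 0):

    props = augpy.get_device_properties(0)
    print(props.name)                    # device name
    print(props.multiProcessorCount)     # number of SMs
    print(props.coresPerMultiprocessor)  # Cuda cores per SM
    print(props.leastStreamPriority, props.greatestStreamPriority)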

class augpy.CudaDeviceProp[source]

The cudaDeviceProp struct extended with stream priority fields leastStreamPriority and greatestStreamPriority, coresPerMultiprocessor, and maxGridSize.

__init__()[source]

Initialize self. See help(type(self)) for accurate signature.

property coresPerMultiprocessor

Number of Cuda cores per multiprocessor

property coresPerSM

Number of Cuda cores per SM.

property greatestStreamPriority

Highest priority a Cuda stream on this device can have.

property l2CacheSize

Size of L2 cache in bytes

property leastStreamPriority

Lowest priority a Cuda stream on this device can have.

property major

Major compute capability

property maxGridSize

Max number of blocks in each grid dimension

property maxThreadsDim

Maximum size of each dimension of a block

property maxThreadsPerBlock

Maximum number of threads per block

property maxThreadsPerMultiProcessor

Maximum resident threads per multiprocessor

property minor

Minor compute capability

property multiProcessorCount

Number of multiprocessors on device

property name

ASCII string identifying device

property numCudaCores

Total number of Cuda cores.

property regsPerBlock

32-bit registers available per block

property regsPerMultiprocessor

32-bit registers available per multiprocessor

property sharedMemPerBlock

Shared memory available per block in bytes

property sharedMemPerMultiprocessor

Shared memory available per multiprocessor in bytes

property streamPrioritiesSupported

Device supports stream priorities

property totalConstMem

Constant memory available on device in bytes

property totalGlobalMem

Global memory available on device in bytes

property warpSize

Warp size in threads

augpy.meminfo(device_id: int = 0) → Tuple[int, int, int][source]

For the device defined by device_id, return the currently used, free, and total memory in bytes.

Return type

Tuple[int, int, int]
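For example (a sketch for device 0):

    used, free, total = augpy.meminfo(0)
    print(f"{used / 2**20:.0f} MiB used, {free / 2**20:.0f} MiB free "
          f"of {total / 2**20:.0f} MiB")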

Exceptions

exception augpy.CudaError

Raised when a problem with a GPU occurs, e.g., device is unavailable or invalid kernel configuration.

exception augpy.CuRandError

Raised when a problem with CuRand occurs, e.g., no memory left for random state.

exception augpy.CuBlasError

Raised when a problem with CuBlas occurs, e.g., no memory left for handle.

exception augpy.MemoryError

Raised when a problem with GPU memory occurs, e.g., no memory left for tensor.

exception augpy.NvJpegError

Raised when a problem with JPEG decoding occurs, e.g., corrupt or unsupported image.

Blocks and threads

Cuda code executes in blocks of threads, each of which calculates one or more values in a tensor. The number of threads in a block greatly influences kernel performance, as threads in a block share resources like caches, but can also collaborate in calculations.

A Cuda-enabled GPU is organized into streaming multiprocessors (SMs), each with a number of Cuda cores. Each SM can work on a block independently, so it is important to divide a task into at least as many blocks as there are SMs. Use get_device_properties() to find out, among other useful information, how many SMs a device has and how many Cuda cores each SM contains.

augpy functions often allow you to control how they are executed on the GPU:

  1. Set the blocks_per_sm parameter to control how many blocks the work is divided into. The total number of blocks will be blocks_per_sm times number of SMs on the GPU.

  2. Set the threads parameter to control how many threads there are in each block. If threads is zero, the number of cores per SM is used.

Together these parameters define the total number of threads started for the kernel. The calculations for each element of the tensor are distributed evenly among these threads, i.e., each thread may calculate more than one value. More values per thread is often more efficient; however, this must be balanced against the number of blocks to keep all SMs busy.

The defaults for the blocks_per_sm and threads parameters are usually sensible, but depending on your GPU architecture, other combinations may provide better performance.
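To illustrate the arithmetic (a sketch; blocks_per_sm = 8 is an arbitrary choice):

    props = augpy.get_device_properties(0)
    blocks_per_sm = 8
    threads = 0  # zero means: use the number of cores per SM
    if threads == 0:
        threads = props.coresPerMultiprocessor
    total_blocks = blocks_per_sm * props.multiProcessorCount
    total_threads = total_blocks * threads
    # each thread computes roughly ceil(num_elements / total_threads) values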