Core Functionality

This section covers the exceptions raised by augpy, managing devices and computation streams, and controlling how functions are run on GPUs.

Device Management

augpy gives you fine-grained control over which Cuda device is used and on which Cuda stream kernels run. All functions are asynchronous by design, so events and streams can be used to synchronize host code with the GPU.

There are two thread-local global variables that control which device and stream are currently active.

current_device

Each thread tracks its currently used Cuda device in the current_device variable. Use CudaDevice.activate() to make a device the current device and CudaDevice.deactivate() to restore the previous state. Use get_current_device() to get the currently active device.

current_stream

Each thread tracks its currently used Cuda stream in the current_stream variable. Use CudaStream.activate() to make a stream the current stream and CudaStream.deactivate() to restore the previous state. Use get_current_stream() to get the currently active stream.
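A minimal sketch of this activation pattern (device ID 1 is an assumption; any valid device ID works):

    import augpy

    device = augpy.CudaDevice(1)            # assumption: a second GPU exists
    device.activate()                       # device 1 is now current_device
    stream = augpy.CudaStream(device_id=1)  # new stream on device 1
    stream.activate()                       # this stream is now current_stream
    # ... enqueue work here; it runs on device 1 in this stream ...
    stream.deactivate()                     # restore the previous stream
    device.deactivate()                     # restore the previous device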

default_stream

You can use the default_stream to synchronize CPU and GPU execution without explicitly creating and activating a different stream.

Note

All operations in augpy are asynchronous with respect to the CPU, so calling CudaTensor.numpy() will initiate copying data from device to host memory and return immediately. Use CudaStream.synchronize(), or CudaEvent.record() and CudaEvent.synchronize(), to ensure that the data is fully copied before the array is accessed.
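For example (a sketch; tensor stands for an existing CudaTensor):

    import augpy

    # tensor: an existing augpy.CudaTensor (assumption)
    stream = augpy.get_current_stream()
    array = tensor.numpy()   # initiates the device-to-host copy, returns immediately
    stream.synchronize()     # block until the copy has finished
    # array is now safe to access

    # alternatively, with an event:
    event = augpy.CudaEvent()
    array = tensor.numpy()
    event.record()           # mark this point on current_stream
    event.synchronize()      # wait until all work up to the marker is done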

class augpy.CudaDevice(device_id: int)[source]

Create a new CudaDevice with the given Cuda device ID. 0 is the default and typically the fastest device in the system.

Parameters

device_id (int) – GPU device ID

__init__(self: augpy._augpy.CudaDevice, device_id: int) → None[source]
Return type

None

activate(self: augpy._augpy.CudaDevice) → None[source]

Make this the current_device and remember the previous device.

Return type

None

deactivate(self: augpy._augpy.CudaDevice) → None[source]

Make the previous device the current_device.

Return type

None

get_device(self: augpy._augpy.CudaDevice) → int[source]

Return the device ID.

Return type

int

get_properties(self: augpy._augpy.CudaDevice) → augpy._augpy.CudaDeviceProp[source]

Return the device properties; see get_device_properties() for more details.

Return type

CudaDeviceProp

synchronize(self: augpy._augpy.CudaDevice) → None[source]

Block until all work on this device has finished. Cuda uses busy-waiting to achieve this. See the synchronize() methods of CudaStream and CudaEvent to avoid the CPU load this incurs.

Return type

None
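For example (a sketch; prefer stream or event synchronization in code sensitive to CPU load):

    device = augpy.CudaDevice(0)
    # ... enqueue work on any stream of device 0 ...
    device.synchronize()   # busy-waits until the whole device is idle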

class augpy.CudaEvent[source]

Convenience wrapper for the cudaEvent_t type.

Creating a new CudaEvent retrieves an event from the event pool of the current_device.

__init__(self: augpy._augpy.CudaEvent) → None[source]
Return type

None

query(self: augpy._augpy.CudaEvent) → bool[source]

Returns True if the event has occurred.

Return type

bool

record(self: augpy._augpy.CudaEvent) → None[source]

Record the wrapped event on the current_stream.

Return type

None

synchronize(self: augpy._augpy.CudaEvent, microseconds: int = 100) → None[source]

Block until the event has occurred. The event status is checked every microseconds microseconds. Shorter intervals make this more accurate, but increase CPU load. If microseconds <= 0, the standard Cuda busy-waiting method is used.

Parameters

microseconds (int) – check interval

Return type

None
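A typical record/synchronize pattern might look like this (a sketch; the enqueued work is a placeholder):

    event = augpy.CudaEvent()   # taken from the current device's event pool
    # ... enqueue kernels on current_stream ...
    event.record()              # mark this point on current_stream
    event.synchronize()         # poll every 100 microseconds by default
    assert event.query()        # the event has now occurred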

class augpy.CudaStream(device_id: int = 0, priority: int = 0)[source]

Convenience wrapper for the cudaStream_t type.

Creates a new Cuda stream on the given device. Lower numbers mean higher priority, and values are clipped to the valid range. Use get_device_properties() to get the range of possible values for a device.

See:

cudaStreamCreateWithPriority

Use device_id=-1 and priority=-1 to get the default_stream.

Parameters
  • device_id (int) – GPU device ID

  • priority (int) – stream priority
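For example, to create a stream with the highest priority a device supports (a sketch for device 0):

    props = augpy.get_device_properties(0)
    # lower numbers mean higher priority; values outside the range are clipped
    stream = augpy.CudaStream(device_id=0, priority=props.greatestStreamPriority)
    stream.activate()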

__init__(self: augpy._augpy.CudaStream, device_id: int = 0, priority: int = 0) → None[source]
Return type

None

activate(self: augpy._augpy.CudaStream) → None[source]

Make this the current_stream and remember the previous stream.

Return type

None

deactivate(self: augpy._augpy.CudaStream) → None[source]

Make the previous stream the current_stream.

Return type

None

synchronize(self: augpy._augpy.CudaStream, microseconds: int = 100) → None[source]

Block until all work on this stream has finished. The stream status is checked every microseconds microseconds. Shorter intervals make this more accurate, but increase CPU load. If microseconds <= 0, the standard Cuda busy-waiting method is used.

Return type

None

augpy.get_current_device() → int[source]

Returns the active device ID.

See:

current_device.

Return type

int

augpy.get_current_stream() → augpy._augpy.CudaStream[source]

Returns the active CudaStream.

See:

current_stream

Return type

CudaStream

augpy.default_stream

The default CudaStream. Implicitly available on all Cuda devices.

augpy.release() → None[source]

Release all allocated memory on all GPUs. All CudaTensors become invalid immediately. Do I have to tell you this is dangerous?

Return type

None

Device Information

augpy.get_device_properties(device_id: int) → augpy._augpy.CudaDeviceProp[source]

Get the CudaDeviceProp for the given device.

Parameters

device_id (int) – Cuda device ID

Returns

properties of device

Return type

CudaDeviceProp
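For example (a sketch for device 0):

    props = augpy.get_device_properties(0)
    print(props.name)                    # device name
    print(props.multiProcessorCount)     # number of SMs
    print(props.coresPerMultiprocessor)  # Cuda cores per SM
    print(props.leastStreamPriority, props.greatestStreamPriority)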

class augpy.CudaDeviceProp[source]

The cudaDeviceProp struct extended with stream priority fields leastStreamPriority and greatestStreamPriority, coresPerMultiprocessor, and maxGridSize.

__init__()[source]

Initialize self. See help(type(self)) for accurate signature.

property coresPerMultiprocessor

Number of Cuda cores per multiprocessor

property coresPerSM

Number of Cuda cores per SM.

property greatestStreamPriority

Highest priority a Cuda stream on this device can have.

property l2CacheSize

Size of L2 cache in bytes

property leastStreamPriority

Lowest priority a Cuda stream on this device can have.

property major

Major compute capability

property maxGridSize

Max number of blocks in each grid dimension

property maxThreadsDim

Maximum size of each dimension of a block

property maxThreadsPerBlock

Maximum number of threads per block

property maxThreadsPerMultiProcessor

Maximum resident threads per multiprocessor

property minor

Minor compute capability

property multiProcessorCount

Number of multiprocessors on device

property name

ASCII string identifying device

property numCudaCores

Total number of Cuda cores.

property regsPerBlock

32-bit registers available per block

property regsPerMultiprocessor

32-bit registers available per multiprocessor

property sharedMemPerBlock

Shared memory available per block in bytes

property sharedMemPerMultiprocessor

Shared memory available per multiprocessor in bytes

property streamPrioritiesSupported

Device supports stream priorities

property totalConstMem

Constant memory available on device in bytes

property totalGlobalMem

Global memory available on device in bytes

property warpSize

Warp size in threads

augpy.meminfo(device_id: int = 0) → Tuple[int, int, int][source]

For the device defined by device_id, return the currently used, free, and total memory in bytes.

Return type

Tuple[int, int, int]
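For example (a sketch for device 0):

    used, free, total = augpy.meminfo(0)
    print(f"{used / 2**20:.0f} MiB used, {free / 2**20:.0f} MiB free "
          f"of {total / 2**20:.0f} MiB")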

Exceptions

exception augpy.CudaError

Raised when a problem with a GPU occurs, e.g., device is unavailable or invalid kernel configuration.

exception augpy.CuRandError

Raised when a problem with CuRand occurs, e.g., no memory left for random state.

exception augpy.CuBlasError

Raised when a problem with CuBlas occurs, e.g., no memory left for handle.

exception augpy.MemoryError

Raised when a problem with GPU memory occurs, e.g., no memory left for tensor.

exception augpy.NvJpegError

Raised when a problem with JPEG decoding occurs, e.g., corrupt or unsupported image.

Blocks and threads

Cuda code executes in blocks of threads, each of which calculates one or more values in a tensor. The number of threads in a block greatly influences kernel performance, as threads in a block share resources like caches, but can also collaborate in calculations.

A Cuda-enabled GPU is organized into streaming multiprocessors (SMs), each with a number of Cuda cores. Each SM can work on a block independently, so it is important to divide a task into at least as many blocks as there are SMs. Use get_device_properties() to find out, among other useful information, how many SMs a device has and how many Cuda cores each SM contains.

augpy functions often allow you to control how they are executed on the GPU:

  1. Set the blocks_per_sm parameter to control how many blocks the work is divided into. The total number of blocks will be blocks_per_sm times number of SMs on the GPU.

  2. Set the threads parameter to control how many threads there are in each block. If threads is zero, the number of cores per SM is used.

Together these parameters define the total number of threads started for the kernel. The calculations for each element of the tensor are distributed evenly among these threads, i.e., each thread may calculate more than one value. More values per thread is often more efficient; however, this must be balanced against the number of blocks to keep all SMs busy.

The defaults for the blocks_per_sm and threads parameters are usually sensible, but depending on your GPU architecture, other combinations may provide better performance.
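To illustrate the arithmetic (a sketch; blocks_per_sm = 8 is an arbitrary choice):

    props = augpy.get_device_properties(0)
    blocks_per_sm = 8
    threads = 0  # zero means: use the number of cores per SM
    if threads == 0:
        threads = props.coresPerMultiprocessor
    total_blocks = blocks_per_sm * props.multiProcessorCount
    total_threads = total_blocks * threads
    # each thread computes roughly ceil(num_elements / total_threads) values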