Core Functionality¶
Exceptions raised by augpy, management of devices and computation streams, and control over how functions are run on the GPU.
Device Management¶
augpy gives you fine control over which Cuda device is used and which Cuda stream kernels are run on. All functions are asynchronous by design, so events and streams can be used to synchronize host code.
There are two thread-local global variables that control which device and stream are currently active.
current_device¶
Each thread tracks its currently used Cuda device in the current_device variable.
Use CudaDevice.activate() to make a device the current device and CudaDevice.deactivate() to restore the previous state.
Use get_current_device() to get the currently active device.
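A minimal sketch of this pattern (assuming a machine with at least two GPUs):

    import augpy

    device = augpy.CudaDevice(1)  # select the second GPU
    device.activate()             # device 1 is now the current device
    try:
        # work submitted here runs on device 1
        print(augpy.get_current_device())
    finally:
        device.deactivate()       # the previous device is current again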
current_stream¶
Each thread tracks its currently used Cuda stream in the current_stream variable.
Use CudaStream.activate() to make a stream the current stream and CudaStream.deactivate() to restore the previous state.
Use get_current_stream() to get the currently active stream.
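The same pattern for streams, as a minimal sketch:

    import augpy

    stream = augpy.CudaStream(device_id=0)  # create a new stream on device 0
    stream.activate()                       # kernels launched now go to this stream
    try:
        # work submitted here is queued on the new stream
        print(augpy.get_current_stream())
    finally:
        stream.deactivate()                 # the previous stream is current again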
default_stream¶
You can use the default_stream to synchronize CPU and GPU execution without explicitly creating and activating a different stream.
Note
All operations in augpy are asynchronous with respect to the CPU, so calling CudaTensor.numpy() will initiate copying data from the device to the host memory and return immediately. You need to use CudaStream.synchronize(), or CudaEvent.record() and CudaEvent.synchronize(), to ensure that the data is fully copied before the array is accessed.
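A sketch of both synchronization options, where tensor stands in for any CudaTensor produced earlier:

    import augpy

    array = tensor.numpy()  # device-to-host copy starts; returns immediately

    # Option 1: block until all work queued on the current stream has finished.
    augpy.get_current_stream().synchronize()

    # Option 2: record an event after the copy and wait for that event.
    event = augpy.CudaEvent()
    event.record()       # recorded on the current stream, after the copy
    event.synchronize()  # blocks until the copy has finished

    print(array)  # now safe to access on the host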
class augpy.CudaDevice(device_id: int)¶
Create a new CudaDevice with the given Cuda device ID. 0 is the default and typically the fastest device in the system.
- Parameters: device_id (int) – GPU device ID
activate(self: CudaDevice) → None¶
Make this the current_device and remember the previous device.
- Return type: None
deactivate(self: CudaDevice) → None¶
Make the previous device the current_device.
- Return type: None
class augpy.CudaEvent¶
Convenience wrapper for the cudaEvent_t type.
Creating a new CudaEvent retrieves an event from the event pool of the current_device.
query(self: CudaEvent) → bool¶
Returns True if the event has occurred.
- Return type: bool
record(self: CudaEvent) → None¶
Record the wrapped event on the current_stream.
- Return type: None
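Events also allow non-blocking checks via query(); a small sketch:

    import augpy

    event = augpy.CudaEvent()  # taken from the event pool of the current device
    event.record()             # marks this point on the current stream

    while not event.query():   # True once all work before the event has finished
        pass                   # do useful CPU work here instead of busy-waiting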
class augpy.CudaStream(device_id: int = 0, priority: int = 0)¶
Convenience wrapper for the cudaStream_t type.
Creates a new Cuda stream on the given device. Lower numbers mean higher priority, and values are clipped to the valid range. Use get_device_properties() to get the range of possible values for a device. Use device_id=-1 and priority=-1 to get the default_stream.
__init__(self: CudaStream, device_id: int = 0, priority: int = 0) → None¶
- Return type: None
activate(self: CudaStream) → None¶
Make this the current_stream and remember the previous stream.
- Return type: None
deactivate(self: CudaStream) → None¶
Make the previous stream the current_stream.
- Return type: None
synchronize(self: CudaStream, microseconds: int = 100) → None¶
Block until all work on this stream has finished. The stream is polled at intervals of the given number of microseconds. Shorter intervals make this more accurate, but increase CPU load. Uses the standard Cuda busy-waiting method if microseconds <= 0.
- Return type: None
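A sketch that creates a stream with the highest priority the device supports, using the range reported by get_device_properties():

    import augpy

    props = augpy.get_device_properties(0)
    if props.streamPrioritiesSupported:
        # lower numbers mean higher priority; values are clipped to the valid range
        stream = augpy.CudaStream(device_id=0, priority=props.greatestStreamPriority)
    else:
        stream = augpy.CudaStream(device_id=0)

    stream.synchronize(microseconds=50)  # poll for completion every 50 microseconds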
augpy.get_current_stream() → CudaStream¶
Returns the active CudaStream.
- Return type: CudaStream
augpy.default_stream¶
The default CudaStream. Implicitly available on all Cuda devices.
augpy.release() → None¶
Release all allocated memory on all GPUs. All CudaTensors become invalid immediately. Do I have to tell you this is dangerous?
- Return type: None
Device Information¶
augpy.get_device_properties(device_id: int) → CudaDeviceProp¶
Get the CudaDeviceProp for the given device.
- Parameters: device_id (int) – Cuda device ID
- Returns: properties of the device
- Return type: CudaDeviceProp
class augpy.CudaDeviceProp¶
The cudaDeviceProp struct extended with the stream priority fields leastStreamPriority and greatestStreamPriority, as well as coresPerMultiprocessor and maxGridSize.
property coresPerMultiprocessor¶
Number of Cuda cores per multiprocessor
property coresPerSM¶
Number of Cuda cores per SM
property greatestStreamPriority¶
Highest priority a Cuda stream on this device can have.
property l2CacheSize¶
Size of L2 cache in bytes
property leastStreamPriority¶
Lowest priority a Cuda stream on this device can have.
property major¶
Major compute capability
property maxGridSize¶
Max number of blocks in each grid dimension
property maxThreadsDim¶
Maximum size of each dimension of a block
property maxThreadsPerBlock¶
Maximum number of threads per block
property maxThreadsPerMultiProcessor¶
Maximum resident threads per multiprocessor
property minor¶
Minor compute capability
property multiProcessorCount¶
Number of multiprocessors on device
property name¶
ASCII string identifying device
property numCudaCores¶
Total number of Cuda cores
property regsPerBlock¶
32-bit registers available per block
property regsPerMultiprocessor¶
32-bit registers available per multiprocessor
property sharedMemPerBlock¶
Shared memory available per block in bytes
property sharedMemPerMultiprocessor¶
Shared memory available per multiprocessor in bytes
property streamPrioritiesSupported¶
Device supports stream priorities
property totalConstMem¶
Constant memory available on device in bytes
property totalGlobalMem¶
Global memory available on device in bytes
property warpSize¶
Warp size in threads
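A short sketch that queries a few of these fields for device 0:

    import augpy

    props = augpy.get_device_properties(0)
    print(props.name)                    # device model string
    print(props.major, props.minor)      # compute capability
    print(props.multiProcessorCount)     # number of SMs
    print(props.coresPerMultiprocessor)  # Cuda cores per SM
    print(props.numCudaCores)            # total number of Cuda cores
    print(props.totalGlobalMem)          # global memory in bytes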
Exceptions¶
exception augpy.CudaError¶
Raised when a problem with a GPU occurs, e.g., the device is unavailable or the kernel configuration is invalid.
exception augpy.CuRandError¶
Raised when a problem with CuRand occurs, e.g., no memory left for the random state.
exception augpy.CuBlasError¶
Raised when a problem with CuBlas occurs, e.g., no memory left for a handle.
exception augpy.MemoryError¶
Raised when a problem with GPU memory occurs, e.g., no memory left for a tensor.
exception augpy.NvJpegError¶
Raised when a problem with JPEG decoding occurs, e.g., a corrupt or unsupported image.
Blocks and threads¶
Cuda code executes in blocks of threads, each of which calculates one or more values in a tensor. The number of threads in a block greatly influences the performance of kernels, as threads in a block share resources like caches, but can also collaborate in calculations.
A Cuda-enabled GPU is organized in SMs with a number of Cuda cores each.
Each SM can work on a block independently.
Thus, it is important that a task is divided into at least as many
blocks as there are SMs.
You can use
get_device_properties()
to find out, among other useful information, how many SMs there
are and the number of Cuda cores per SM.
augpy functions often allow you to control how they are executed on the GPU:
- Set the blocks_per_sm parameter to control how many blocks the work is divided into. The total number of blocks will be blocks_per_sm times the number of SMs on the GPU.
- Set the threads parameter to control how many threads there are in each block. If threads is zero, the number of cores per SM is used.
Together these parameters define the total number of threads that will be started for the kernel. The calculations for each element of the tensor are distributed evenly among these threads, i.e., each thread may calculate more than one value. More values per thread is often more efficient; however, this must be balanced against the number of blocks to keep all SMs busy.
The defaults for the blocks_per_sm and threads parameters are usually quite sensible, but depending on your GPU architecture other combinations may provide better performance.
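For illustration, a sketch of how these parameters combine; some_augpy_function is a hypothetical stand-in for any augpy function that accepts them:

    import augpy

    props = augpy.get_device_properties(0)

    blocks_per_sm = 2  # two blocks per SM helps keep all SMs busy
    threads = 0        # zero means: use the number of cores per SM

    total_blocks = blocks_per_sm * props.multiProcessorCount
    threads_per_block = threads if threads > 0 else props.coresPerMultiprocessor
    print("total threads:", total_blocks * threads_per_block)

    # hypothetical call; substitute any augpy function that accepts these parameters
    # some_augpy_function(tensor, blocks_per_sm=blocks_per_sm, threads=threads)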