
What are NVIDIA CUDA Cores?

We explain what the CUDA cores in NVIDIA graphics cards are, how the technology works, and how it has enabled a significant leap in computing through parallelization.

Whenever we talk about NVIDIA graphics cards, we talk about their technical specifications: the working frequency of the GPU, the amount and type of memory, the TDP or the number of CUDA cores, among others. We know well what the frequency is and how it affects the graphics chip, what the TDP is, and what the memory amount and type mean, but today we are going to focus on NVIDIA CUDA, the technology on which NVIDIA's silicon is based.


WHAT IS CUDA?


NVIDIA CUDA is the architecture on which NVIDIA graphics cards are based. It is designed for parallel computing, giving the GPU great computing power and the system great performance. To date, NVIDIA has sold millions of graphics cards built on CUDA, and the technology is highly valued by developers, researchers, scientists and gamers. The first three groups take the most advantage of it, since CUDA lets applications move smoothly through fields such as video and image processing, computational biology and chemistry, fluid dynamics simulation, seismological analysis, and many other workloads that require great computing power.

[Image: CPU vs. GPU]

NVIDIA CUDA ARCHITECTURE


In the classic architecture of a graphics card we find two types of processors, vertex processors and fragment processors, dedicated to different and indispensable tasks within the graphics pipeline and equipped with different instruction sets. This presents two important problems: on the one hand, the load imbalance that appears between the two kinds of processors, and on the other, the difference between their respective instruction sets. The natural evolution of GPU architecture has therefore been the search for a unified design with no distinction between the two types of processors. This is how the CUDA architecture was arrived at, where all the execution cores share the same instruction set and practically the same resources.

In the image we can see execution units called Streaming Multiprocessors (SMs), 8 of them in the example in the figure, interconnected through a common memory area. Each SM is in turn composed of computing cores called CUDA cores, which are in charge of executing the instructions; in our example there are 32 cores per SM, for a total of 256 processing cores. This hardware design makes it easy to program the GPU cores using a high-level language such as C for CUDA. The programmer simply writes a sequential program that invokes what is known as a kernel, which can be a simple function or a complete program.
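As a minimal CUDA C sketch of this model (the kernel and variable names here are illustrative, not from the article): a kernel is an ordinary-looking C function marked `__global__`, and each of the many threads that run it works on one element.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    // Unique global index of this thread (explained further below).
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // guard: the last block may have spare threads
        c[i] = a[i] + b[i];
}

// Host side: 256 threads per block, enough blocks to cover n elements.
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```

The same sequential-looking function body is executed by thousands of threads at once, one per array element.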

[Image: NVIDIA CUDA architecture]

This kernel runs in parallel on the GPU as a set of threads, which the programmer organizes into a hierarchy: threads are grouped into blocks, and blocks in turn are arranged into a grid. For convenience, blocks and grids can have one, two or three dimensions. In many situations the data being worked on naturally has a grid structure, but in general, decomposing the data into a thread hierarchy is not an easy task. A thread block is a set of concurrent threads that can cooperate with each other through synchronization mechanisms and that share access to a memory space exclusive to each block. A grid is a set of blocks that can execute independently and can therefore be launched in parallel on the Streaming Multiprocessors (SMs).
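These one-, two- or three-dimensional shapes are specified with the `dim3` type at launch time. A sketch with made-up names and sizes:

```cuda
#include <cuda_runtime.h>

// Assumed kernel, defined elsewhere; name and sizes are illustrative.
__global__ void processImage(unsigned char *img, int width, int height);

void launch(unsigned char *d_img, int width, int height)
{
    dim3 block(16, 16);                           // 2D block: 16 x 16 = 256 threads
    dim3 grid((width  + block.x - 1) / block.x,   // enough blocks to cover
              (height + block.y - 1) / block.y);  // the whole image
    processImage<<<grid, block>>>(d_img, width, height);
}
```

A 2D grid of 2D blocks maps naturally onto image-like data, which is exactly the "natural grid structure" case mentioned above.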

[Image: CUDA cores]

When a kernel is invoked, the programmer specifies the number of threads per block and the number of blocks that make up the mesh. Once on the GPU, each thread is assigned a unique identification number within its block, and each block receives an identifier within the mesh. This allows each thread to decide what data it has to work on, which greatly simplifies memory addressing when working with multidimensional data, such as image processing or solving differential equations in two and three dimensions.
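For example, in a hypothetical 2D image kernel, each thread can derive the pixel it owns from its block and thread identifiers:

```cuda
// Hypothetical kernel: inverts one pixel of a grayscale image per thread.
__global__ void invert(unsigned char *img, int width, int height)
{
    // Combine the block ID and the thread ID into 2D pixel coordinates.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        int idx = y * width + x;       // row-major address of the pixel
        img[idx] = 255 - img[idx];
    }
}
```

The built-in `blockIdx`, `blockDim` and `threadIdx` variables are the identifiers described above; the multidimensional addressing falls out of a single multiply-add.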

Another noteworthy aspect of the CUDA architecture is the presence of a work distribution unit that is responsible for distributing the blocks among the available SMs. The threads within each block run concurrently, and when a block finishes, the distribution unit dispatches new blocks to the free SMs. The SMs map each thread onto an SP (CUDA) core, and each thread runs independently with its own program counter and status registers. Since each thread has its own registers assigned, there is no context-switching penalty; instead, there is a limit on the maximum number of active threads, because each SM has a fixed number of registers.

A particular characteristic of the CUDA architecture is the grouping of threads into groups of 32. A group of 32 threads is called a warp, and it can be considered the unit of parallel execution, since all threads of the same warp execute physically in parallel and therefore start at the same instruction (although afterwards they are free to branch and execute independently). When a block is selected for execution within an SM, it is divided into warps; a warp that is ready to execute is selected, and the next instruction is issued to all the threads that make it up. Since they all execute the same instruction in unison, maximum efficiency is achieved when all threads coincide in their execution path (no branching). The programmer can ignore this behavior, but it should be taken into account when optimizing an application.
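A sketch of the branching issue (illustrative kernels, warp size assumed to be 32):

```cuda
// Divergent: even and odd threads of the SAME warp take different
// paths, so the hardware serializes the two branches.
__global__ void divergent(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) x[i] *= 2.0f;
    else            x[i] += 1.0f;
}

// Warp-uniform: all 32 threads of a warp share the same value of
// i / 32, so each warp takes a single path and nothing is serialized.
__global__ void uniform(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0) x[i] *= 2.0f;
    else                   x[i] += 1.0f;
}
```

Both kernels compute something similar, but only the second keeps every warp on one execution path.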

Regarding memory, during execution the threads can access data in different spaces within a memory hierarchy. Each thread has a private local memory area, and each block has a shared memory area visible to all threads of the same block, with high bandwidth and low latency (similar to a level-1 cache). Finally, all threads have access to the same global memory space (on the order of MiB or GiB) located on an external DRAM chip. Since this memory has very high latency, it is good practice to copy frequently accessed data into the shared memory area.
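The "copy hot data into shared memory" practice looks like this in a hypothetical block-wise sum (kernel name and tile size are assumptions for the sketch):

```cuda
#define TILE 256  // threads per block, chosen for this sketch

// Each block sums TILE consecutive elements using shared memory.
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float tile[TILE];          // per-block, low-latency area
    int i = blockIdx.x * TILE + threadIdx.x;

    tile[threadIdx.x] = in[i];            // one read from slow global DRAM
    __syncthreads();                      // wait until the whole tile is loaded

    // Tree reduction done entirely in fast shared memory.
    for (int s = TILE / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];        // one write back to global memory
}
```

Global memory is touched only twice per block; all the repeated accesses happen in the shared area, and `__syncthreads()` is the block-level synchronization mechanism mentioned earlier.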

The CUDA programming model assumes that the host and the device maintain their own separate memory spaces. The only memory area accessible from the host is global memory. In addition, both the reservation and release of global memory and the transfer of data between host and device must be done explicitly from the host through calls to specific CUDA functions.
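Those explicit calls are `cudaMalloc`, `cudaMemcpy` and `cudaFree`. A minimal host-side sketch (the kernel `vecScale` is hypothetical):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void vecScale(float *x, int n);   // assumed kernel, defined elsewhere

void run(int n)
{
    size_t bytes = n * sizeof(float);
    float *h_x = (float *)malloc(bytes);     // host memory space
    float *d_x;
    cudaMalloc(&d_x, bytes);                 // explicit global-memory reservation

    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);   // host -> device
    vecScale<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);   // device -> host

    cudaFree(d_x);                           // explicit release
    free(h_x);
}
```

Note the symmetry: every allocation, copy and release on the device side is issued by the host, exactly as the model requires.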

[Image: NVIDIA memory hierarchy]

WHAT IS GPU COMPUTING ACCELERATION?


The other key point of CUDA cores is the acceleration of processing when the GPU is combined with the processor (CPU), which speeds up applications in deep learning, analytics and engineering. Since NVIDIA introduced this approach, GPUs have been installed in highly energy-efficient data centers at universities, government facilities, large companies and SMEs around the world. They are of great importance in accelerating applications ranging from artificial intelligence to cars, drones and robots.


HOW DOES THIS ACCELERATION IN SOFTWARE UTILITIES WORK?


This GPU compute-acceleration scheme moves most of the computational load to the GPU, while the remaining code runs on the CPU. The transfer of the computational load is handled for the user, who simply notices that the parallelized tasks run more smoothly.

This system is easier to explain if we understand the basic differences between a processor and a graphics card. CPUs are built with a limited number of cores; an Intel Core i7-7700K, for example, has four cores, that is, four processors for serial computing. GPUs, on the other hand, have thousands of smaller but much more efficient cores, designed to run many tasks simultaneously. A clear example is the NVIDIA GTX 1080 Ti, at the time the most powerful gaming graphics card on the market, which offers 3584 CUDA cores.

The difference between the two is very important: between a 4-core processor and a graphics card with 3584 CUDA cores there is not only a giant gap in core count, but also the difference between serial and parallel computing. Serial computing means, simply put, that the cores carry out one task at a time, while the CUDA cores are each capable of taking a piece of the overall task, working on it independently, and then joining all the pieces to produce the completed result. This is what allows a graphics card to be much more efficient and offer greater power than a processor in tasks that require a lot of parallelism.
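The contrast can be sketched in a few lines (illustrative code, not from the article):

```cuda
// Serial: one CPU core walks the whole array, element by element.
void scaleCPU(float *x, int n)
{
    for (int i = 0; i < n; ++i)
        x[i] *= 2.0f;
}

// Parallel: thousands of GPU threads each take one piece of the global
// task; together their results form the completed array.
__global__ void scaleGPU(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= 2.0f;
}
```

The CPU version visits n elements one after another; the GPU version assigns each element to its own thread and processes them simultaneously.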


PROCESSING IN PARALLEL UNDER CUDA


Previously, systems centralized the processing load on the CPU, which did all the work, but now processing tasks are shared between the CPU and the GPU. This way of working required technological innovation and a new solution, which is why NVIDIA developed CUDA for parallel computing across its GeForce, Quadro and Tesla graphics cards, allowing developers to improve performance and streamline tasks.

One of the segments that has benefited the most from NVIDIA CUDA is science, where surveying, mapping, storm forecasting and other projects have become faster and easier day to day. Today thousands of researchers run molecular-dynamics work in academic and pharmaceutical settings, which accelerates pharmacological research and development, allowing faster progress toward cures for complex diseases such as cancer, Alzheimer's and other currently incurable conditions.

It is not only the scientific segment: many companies also use NVIDIA CUDA to make predictions for risky financial operations, speeding them up by a factor of eighteen or more. Other examples are the great reception Tesla GPUs have had in cloud computing and other systems that require great processing power. CUDA also allows autonomous vehicles to operate simply and efficiently, making real-time calculations that could not be done with other systems. This computing agility lets the vehicle make important decisions in a very short time in order to avoid obstacles, move without problems and prevent accidents.


CUDA PARALLEL CALCULATION PLATFORM


The great advantage of NVIDIA CUDA is the parallelization of tasks, working in parallel through extensions to C and C++ and processing tasks and data at different levels of importance. These parallelization tasks can be performed in several high-level languages, such as C, C++ and more recently Python, or simply by using open standards with OpenACC directives.

CUDA is currently the most widely used platform for accelerating such tasks, and great advances have been made thanks to development on this technology. So much so that many Spanish universities now teach CUDA-based programming, since the ability to parallelize tasks and streamline processes will be useful to students in the future.


Robert Sole

Director of Contents and Writing of this website, technician in renewable energy generation systems and low-voltage electrical technician.
