Chapter 2.3 Gradient Recording and Control in PyTorch

Author: Brench

Published: 2026-05-16

Modified: 2026-05-17

In Section 2.1, we answered the question of where gradients come from. PyTorch records the forward computation, then traces those records during the backward pass. Autograd uses the computation graph as the structure that carries gradients back through the operations.

When writing real code, however, another practical question appears almost immediately: does this computation need to be recorded at all?

During training, the answer is yes because backpropagation will follow. During validation, inference, feature extraction, or a quick run just to inspect outputs, the record is often unnecessary. Keeping it means storing intermediate results, building a graph, spending extra memory, and sometimes accidentally dragging numerical-only code into a gradient path.

This section therefore changes the angle. Instead of asking how differentiation is carried out, we ask which computations Autograd records and which ones it skips. PyTorch provides several direct controls: torch.no_grad(), torch.enable_grad(), and the inference-focused torch.inference_mode(). They leave the numerical forward results unchanged, but they affect whether a graph is built, whether backward propagation is possible, and how much extra memory and runtime overhead is introduced.

This reflects an important separation in PyTorch: operators decide how values are computed, while Autograd decides whether those computations leave differentiable records. We will begin with the most common control, no_grad(), and then compare the available gradient recording modes.

import torch
import torch.nn as nn
import torch.nn.functional as F

print('PyTorch version:', torch.__version__)

2.3.1 `torch.no_grad()`: Pause Recording

By default, when a tensor with requires_grad=True participates in an operation, PyTorch builds the corresponding computation graph. In other words, computations performed in a differentiable context are quietly recorded by Autograd. Sometimes, however, that record is not needed.

For instance, model validation usually does not require gradients because no backward pass will be run. During inference, we care about the output values, not about preserving the path that produced them. In such cases, allowing Autograd to keep recording only increases memory use and can slow execution.

For this reason, PyTorch provides the torch.no_grad() context manager, which can also be used as a decorator. It gives Autograd a clear instruction: do not record computations inside this block.
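The decorator form mentioned above can be sketched as follows. The helper name `predict` and the layer sizes are assumptions made only for this illustration; the point is that every call to the decorated function runs with recording disabled, and the surrounding gradient mode is restored afterwards.

```python
import torch
import torch.nn as nn

model = nn.Linear(6, 4)


@torch.no_grad()  # every call to predict() runs with recording disabled
def predict(net, inputs):
    # `predict` is a hypothetical helper name used only for this sketch
    return net(inputs)


out = predict(model, torch.randn(10, 6))
print('`out.requires_grad` from decorated function:', out.requires_grad)
print('grad mode after the call:', torch.is_grad_enabled())
```

The decorator is convenient when an entire function is evaluation-only, while the `with` form is better suited to disabling recording for a few lines inside a larger function.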

Here is a simple comparison. In the default mode:

model = nn.Linear(6, 4)
x = torch.randn(10, 6)
y = torch.randn(10, 4)

y_pred = model(x)
print('`y_pred.requires_grad` before `no_grad()`:', y_pred.requires_grad)

The output is True because model parameters require gradients by default, so the forward result becomes part of the computation graph.

Now put the same forward pass inside no_grad():

with torch.no_grad():
    y_pred = model(x)

print('`y_pred.requires_grad` inside `no_grad()`:', y_pred.requires_grad)

This time, the output is False.

Inside no_grad(), the forward computation still runs as usual. The difference is that newly produced results are not tracked by Autograd. Values later derived from an untracked tensor are untracked as well, unless some other input to the computation requires gradients. Calling backward() on an untracked value raises a RuntimeError because there is no computation graph to trace.

loss = F.mse_loss(y_pred, y)
try:
    loss.backward()
except RuntimeError as err:
    print('RuntimeError:', err)

Here, loss is computed outside the no_grad() block, but it depends on y_pred, which was already produced without tracking. The other input, y, does not request gradients either. As a result, loss is also outside the graph, and backward() has nothing to follow.

It is easy to assume that no_grad() changes tensor attributes such as requires_grad to False, but that is not what happens. no_grad() controls whether computations in the current block are tracked; it does not edit existing tensor properties. A tensor created outside the block may still have requires_grad=True, yet computations involving it inside no_grad() are treated as untracked.

x = torch.randn(10, 6, requires_grad=True)

with torch.no_grad():
    print('`x.requires_grad` inside `no_grad()`:', x.requires_grad)
    y_pred = model(x)
    print('`y_pred.requires_grad` inside `no_grad()`:', y_pred.requires_grad)

So no_grad() does not remove a tensor’s ability to become differentiable in principle. It only prevents new computations in that context from being recorded. The requires_grad attribute is a capability declaration, while no_grad() is a behavior switch. The two ideas are separate.

Similarly, if a new tensor is created inside no_grad() and later needs to participate in Autograd, we can call requires_grad_():

with torch.no_grad():
    x = torch.randn(10, 6)
    print('`x.requires_grad` inside `no_grad()`:', x.requires_grad)

x.requires_grad_()
print('`x.requires_grad` after `requires_grad_()`:', x.requires_grad)

In short, no_grad() temporarily disables recording. It does not permanently revoke the possibility of differentiability. PyTorch still keeps enough internal state to restore gradient recording later, and that state has some computational and memory cost. This will matter when we compare it with inference_mode(), where PyTorch turns off more Autograd-related machinery and prevents tensors created inside the context from re-entering tracking through requires_grad_().

From a lower-level perspective, numerical computation and Autograd recording are separate behaviors in PyTorch. no_grad() affects only the recording side, which is why it appears so often in validation, inference deployment, and parameter update code.
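As one illustration of the parameter update case, here is a minimal sketch of a hand-written gradient descent step. The learning rate `lr` and the layer sizes are arbitrary assumptions, and real training code would normally use an optimizer from torch.optim instead. The forward and backward passes are recorded as usual; only the in-place update is wrapped in no_grad() so that the update itself does not become part of a graph.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(6, 4)
x = torch.randn(10, 6)
y = torch.randn(10, 4)
lr = 0.1  # illustrative learning rate, not a recommendation

loss = F.mse_loss(model(x), y)  # recorded forward pass
loss.backward()                 # fills p.grad for every parameter

with torch.no_grad():
    for p in model.parameters():
        p -= lr * p.grad        # in-place update, not recorded
        p.grad.zero_()          # clear gradients for the next step

weight = model.weight
print('`weight.requires_grad` after update:', weight.requires_grad)
print('`weight.is_leaf` after update:', weight.is_leaf)
```

The parameters remain trainable leaf tensors after the update, which matches the earlier point that no_grad() does not edit the requires_grad attribute.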

The next question follows naturally: if gradient recording can be disabled, can it be restored only for a small region? What if one step inside an inference flow suddenly needs gradients? That is where torch.enable_grad() comes in.

2.3.2 `torch.enable_grad()`: Resume Recording

The previous section showed that no_grad() pauses Autograd recording. If we are already inside such a block, can we turn gradients back on for only a small part of the computation?

Yes. That is exactly what enable_grad() is for.

The nesting can also go the other way: gradients can be enabled outside, with no_grad() used for an inner region. These contexts compose freely. In the default mode, though, writing enable_grad() at the outermost level is usually redundant because gradients are already enabled.

Consider a simple example:

x = torch.randn(10, 6, requires_grad=True)

with torch.no_grad():
    y = x * 3  # Does not record computation graph
    print('`y.requires_grad` in `no_grad()`:', y.requires_grad)

    with torch.enable_grad():
        z = x * 4  # Enables gradient tracking
        print('`z.requires_grad` in `enable_grad()`:', z.requires_grad)

# Only z will have gradients tracked
z.backward(gradient=torch.ones_like(z))

The important point is the interaction between the two contexts. The outer no_grad() disables recording, and the inner enable_grad() temporarily restores it. After the inner block exits, the outer no_grad() is still in effect, so subsequent computation becomes untracked again. Gradient modes are therefore managed like a stack: entering a context pushes a mode, and leaving it restores the previous one.
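The stack-like behavior can be observed directly with torch.is_grad_enabled(), which reports the current mode. The sketch below assumes the script starts in the default mode; the list name `states` is just an illustration.

```python
import torch

states = []
states.append(torch.is_grad_enabled())          # default mode
with torch.no_grad():
    states.append(torch.is_grad_enabled())      # inside no_grad()
    with torch.enable_grad():
        states.append(torch.is_grad_enabled())  # inner enable_grad()
    states.append(torch.is_grad_enabled())      # back in the outer no_grad()
states.append(torch.is_grad_enabled())          # default mode again

print(states)  # [True, False, True, False, True]
```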

Why is this useful?

Many engineering paths are shared. Most inference computation may not need gradients, while one intermediate step might need sensitivity analysis, or a debugging branch might need to compute a temporary gradient. Without enable_grad(), the code would have to be split apart, or the outer state would have to be switched repeatedly. With it, recording can be restored exactly where it is needed.
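A sensitivity check inside an inference flow might look like the sketch below. The names `probe`, `score`, and `sensitivity` are hypothetical, and summing the outputs into one scalar is only one possible choice of quantity to differentiate. torch.autograd.grad is used so that the gradient is returned directly instead of being accumulated into the model parameters.

```python
import torch
import torch.nn as nn

model = nn.Linear(6, 4)
model.eval()

with torch.no_grad():
    batch = torch.randn(10, 6)
    preds = model(batch)  # ordinary inference, not recorded

    with torch.enable_grad():
        # Take one sample and make it a fresh leaf that requests gradients
        probe = batch[:1].clone().requires_grad_()
        score = model(probe).sum()
        (sensitivity,) = torch.autograd.grad(score, probe)

print('`preds.requires_grad`:', preds.requires_grad)
print('sensitivity shape:', tuple(sensitivity.shape))
```

For a linear layer, this gradient is simply the column sums of the weight matrix, which makes the result easy to check by hand.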

PyTorch also provides a more general interface, torch.set_grad_enabled(). It takes a boolean and sets the current gradient mode directly. no_grad() and enable_grad() are common special cases of this interface.

x = torch.randn(10, 6)
is_training = False

with torch.set_grad_enabled(is_training):
    y_pred = model(x)

When is_training=True, this behaves like enable_grad(). When is_training=False, it behaves like no_grad(). This is often a clean way to share logic between training and evaluation branches.
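One way to share logic between the two branches is sketched below. The function `run_step` is a hypothetical helper, not a PyTorch API, and the MSE loss is chosen only to match the earlier examples.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def run_step(net, inputs, targets, is_training):
    """Hypothetical helper: one forward pass shared by training and evaluation."""
    net.train(is_training)  # switch module behavior (dropout, batch norm, ...)
    with torch.set_grad_enabled(is_training):
        loss = F.mse_loss(net(inputs), targets)
    if is_training:
        loss.backward()     # a graph exists only in the training branch
    return loss


model = nn.Linear(6, 4)
x, y = torch.randn(10, 6), torch.randn(10, 4)

eval_loss = run_step(model, x, y, is_training=False)
train_loss = run_step(model, x, y, is_training=True)
print('`eval_loss.requires_grad`:', eval_loss.requires_grad)
print('`train_loss.requires_grad`:', train_loss.requires_grad)
```

Note that net.train() and set_grad_enabled() control different things: the former changes how some layers compute, the latter changes whether Autograd records. The sketch sets both from the same flag only for convenience.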

So far, we have introduced two common gradient-control contexts: no_grad() disables recording, and enable_grad() restores it. Because they nest cleanly, they form a flexible stack-based control system. Next, we turn to torch.inference_mode(), a context designed specifically for inference optimization that goes beyond no_grad() in both performance and memory behavior.

2.3.3 `torch.inference_mode()`: Stop Recording for Good

The previous two sections gave us a fairly flexible mechanism:

no_grad() disables gradient recording.
enable_grad() locally restores gradient recording.
set_grad_enabled() is a general interface for setting the current gradient mode.
Gradient modes can be nested and restored.

At first glance, that sounds like enough. Why does PyTorch need another tool called inference_mode()?

The reason is a stronger assumption: if we know not only that gradients are unnecessary now, but also that the results will never be used for backpropagation, can the framework remove even more gradient-related overhead?

This is the motivation behind inference_mode()¹.

In no_grad() mode, PyTorch still maintains version counters, view tracking, and internal checks that protect gradient correctness. These mechanisms matter during training because they prevent in-place operations from damaging graph structure and guard against shared-memory view problems. In pure inference, however, they become extra overhead. If a result will never be used for gradient computation, the framework can skip more of this tracking and perform more aggressive memory optimization. As a result, inference_mode() is usually faster and more memory-efficient than no_grad().
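The difference between the two modes is visible on the tensors themselves. The following sketch runs the same forward pass in both modes and inspects the results with Tensor.is_inference() and torch.is_inference_mode_enabled(). The variable names are illustrative, and the sketch does not measure speed or memory.

```python
import torch
import torch.nn as nn

model = nn.Linear(6, 4)
x = torch.randn(10, 6)

with torch.no_grad():
    out_ng = model(x)

with torch.inference_mode():
    inside = torch.is_inference_mode_enabled()
    out_inf = model(x)

print('inference mode enabled inside the block:', inside)
print('no_grad output:        requires_grad =', out_ng.requires_grad,
      '| is_inference =', out_ng.is_inference())
print('inference_mode output: requires_grad =', out_inf.requires_grad,
      '| is_inference =', out_inf.is_inference())
```

Both outputs have the same values and neither requires gradients, but only the second one is marked as an inference tensor.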

The extra speed of inference_mode() comes at a price: it is irreversible.

As shown earlier, a tensor created inside no_grad() can still have gradients enabled later:

with torch.no_grad():
    x = torch.randn(10, 6)

x.requires_grad_()  # we can still enable gradients for x
print('`x.requires_grad` after `requires_grad_`:', x.requires_grad)

But if a tensor is created inside inference_mode() and we later try to set requires_grad=True, PyTorch raises an error:

with torch.inference_mode():
    x = torch.randn(10, 6)

try:
    x.requires_grad_()
except RuntimeError as err:
    print('RuntimeError:', err)

This happens because inference_mode() does more than pause recording. Tensors allocated inside it are special inference tensors, marked as never entering the Autograd system. Even after gradient mode is enabled again, they cannot be switched to requires_grad=True, and PyTorch raises an error if an operation needs to save one for the backward pass. So no_grad() is a temporary shutdown, while inference_mode() is much closer to a permanent one. It is appropriate when a code block is guaranteed to be inference-only.
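If data produced in inference_mode() turns out to be needed for gradients after all, the usual way out is to clone it outside the context, which yields a normal tensor. The sketch below first tries to use the inference tensor directly in a differentiable computation, which should raise a RuntimeError, and then repeats the computation on a clone. The names `w` and `x_normal` are assumptions made for this example.

```python
import torch

with torch.inference_mode():
    x = torch.randn(10, 6)  # inference tensor

w = torch.randn(6, requires_grad=True)

try:
    # The multiplication needs to save x to compute the gradient of w
    (x * w).sum().backward()
except RuntimeError as err:
    print('RuntimeError:', err)

# Cloning outside inference mode produces a normal tensor
x_normal = x.clone()
x_normal.requires_grad_()
(x_normal * w).sum().backward()

print('`x.is_inference()`:', x.is_inference())
print('`x_normal.is_inference()`:', x_normal.is_inference())
print('`x_normal.grad` shape:', tuple(x_normal.grad.shape))
```

The clone costs an extra copy, so it is a workaround for occasional cases. Code that regularly needs gradients later is better served by no_grad().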

2.3.4 Behavior Comparison Across Gradient Modes

At this point, we have three different gradient semantics: the default mode, no_grad(), and inference_mode(). Each one expresses a different strength of commitment and leads to a different tradeoff between flexibility and performance.

In the default mode, Autograd must assume that any current computation might later participate in backpropagation. Therefore, it will:

Build a complete computation graph.
Save intermediate results needed for backpropagation.
Maintain version counters and view consistency checks.

This mode is the most flexible, but it also has the highest cost. It is usually used for the forward pass during training.

Entering no_grad() makes a temporary statement: this block of computation is not participating in backpropagation right now.

With that statement, Autograd can optimize by:

Not building a computation graph.
Not saving intermediate results.
Still retaining internal consistency mechanisms for Autograd.
Restoring the normal gradient mode after exiting the context.

This is a temporary shutdown. Flexibility remains, but performance improves noticeably. It is commonly used for validation and model evaluation.

inference_mode() makes a stronger commitment: this block of computation will never participate in gradient computation. Under that premise, Autograd can optimize more aggressively:

It does not build a computation graph.
It skips gradient-related version checks and view tracking.
Tensors created in this mode cannot re-enter the Autograd system.

This is the least reversible form of shutdown. It gives the strongest optimization, but also imposes the most restrictions. It is suitable for pure inference, model evaluation, and data processing.
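The three modes can be put side by side with a small script. This is only a sketch of the observable tensor flags. It does not measure memory or runtime, and the `modes` dictionary and `summary` structure are illustrative choices, with contextlib.nullcontext standing in for the default mode.

```python
import contextlib

import torch
import torch.nn as nn

model = nn.Linear(6, 4)
x = torch.randn(10, 6)

modes = {
    'default': contextlib.nullcontext,       # no gradient-mode change
    'no_grad': torch.no_grad,
    'inference_mode': torch.inference_mode,
}

summary = {}
for name, make_ctx in modes.items():
    with make_ctx():
        out = model(x)
    has_graph = out.grad_fn is not None
    summary[name] = (out.requires_grad, has_graph, out.is_inference())
    print(f'{name:15s} requires_grad={out.requires_grad!s:5s} '
          f'has_graph={has_graph!s:5s} is_inference={out.is_inference()}')
```

Only the default mode produces an output attached to a graph, and only inference_mode() produces an inference tensor.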

Footnotes

¹ inference_mode() was introduced in PyTorch 1.9 for inference-stage performance optimization. For implementation details, see RFC-0011-InferenceMode.

Reuse

CC BY-NC 4.0
