Commit 5987d79

Update gradient_accumulation.md (#3649)
1 parent 31af8d4 commit 5987d79

File tree

1 file changed: +4 -4 lines changed


docs/source/usage_guides/gradient_accumulation.md

Lines changed: 4 additions & 4 deletions
@@ -245,7 +245,7 @@ As was pointed out in this [blog-post](https://huggingface.co/blog/gradient_accu
 
 > [...] for gradient accumulation across token-level tasks like causal LM training, the correct loss should be computed by the **total loss across all batches in a gradient accumulation step** divided by the **total number of all non padding tokens in those batches**. This is not the same as the average of the per-batch loss values.
 
-In other words, some adjustements must be made on losses that operate on a token-level basis.
+In other words, some adjustments must be made on losses that operate on a token-level basis.
 
 ### Skeleton code
 
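To make the quoted recommendation concrete, here is a minimal sketch (hypothetical tensors, with padded label positions marked by the usual `-100` ignore index) contrasting the average of per-batch losses with the token-count normalization described above:

```python
import torch
import torch.nn.functional as F

# Hypothetical logits/labels for the micro-batches of one gradient accumulation step.
# Padded label positions are set to -100, the usual ignore_index for causal LM loss.
micro_batches = []
for seq_len in (5, 3):
    logits = torch.randn(2, seq_len, 10)         # (batch, seq, vocab)
    labels = torch.randint(0, 10, (2, seq_len))  # (batch, seq)
    labels[:, -1] = -100                         # pretend the last position is padding
    micro_batches.append((logits, labels))

# Average of per-batch mean losses: every micro-batch gets the same weight,
# no matter how many non-padding tokens it contains.
per_batch_average = torch.stack([
    F.cross_entropy(logits.flatten(0, 1), labels.flatten(), ignore_index=-100)
    for logits, labels in micro_batches
]).mean()

# Recommended: sum the un-reduced loss over all micro-batches, then divide once
# by the total number of non-padding tokens in the whole accumulation step.
total_loss = sum(
    F.cross_entropy(logits.flatten(0, 1), labels.flatten(), ignore_index=-100, reduction="sum")
    for logits, labels in micro_batches
)
total_tokens = sum((labels != -100).sum() for _, labels in micro_batches)
token_level_loss = total_loss / total_tokens
```

The two values only coincide when every micro-batch contains the same number of non-padding tokens; otherwise the per-batch average over- or under-weights some tokens, which is the discrepancy the quoted passage warns about.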

@@ -282,7 +282,7 @@ for update_step in range(total_updates):
     num_items_in_batch = accelerator.gather(num_items_in_batch).sum().item()
 
     for i, batch in enumerate(batch_samples):
-        # if we perform gradient accumulation in a multi-devices set-up, we want to avoid unecessary communications when accumulating
+        # if we perform gradient accumulation in a multi-devices set-up, we want to avoid unnecessary communications when accumulating
         # cf: https://muellerzr.github.io/blog/gradient_accumulation.html
         if (i < len(batch_samples) - 1 and accelerator.num_processes > 1):
             ctx = model.no_sync
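Pieced together with the `else` branch that falls outside this hunk, the communication-saving pattern touched by this change looks roughly as follows; `model`, `accelerator` and `batch_samples` are the objects from the guide's skeleton, and `no_sync` is the standard DDP context manager that skips the gradient all-reduce:

```python
import contextlib

for i, batch in enumerate(batch_samples):
    # Gradients only need to be synchronized across devices once per optimizer step,
    # so skip the all-reduce on every micro-batch except the last one.
    if i < len(batch_samples) - 1 and accelerator.num_processes > 1:
        ctx = model.no_sync
    else:
        ctx = contextlib.nullcontext

    with ctx():
        ...  # forward pass, loss computation, accelerator.backward(loss)
```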
@@ -294,7 +294,7 @@ for update_step in range(total_updates):
         with ctx():
             inputs, targets = batch
             outputs = model(inputs)
-            loss = loss_function(outputs, targets) # the loss function shoud sum over samples rather than averaging
+            loss = loss_function(outputs, targets) # the loss function should sum over samples rather than averaging
 
             # We multiply by num_processes because the DDP calculates the average gradient across all devices whereas dividing by num_items_in_batch already takes into account all devices
             # Same reason for gradient_accumulation_steps, but this times it's Accelerate that calculate the average gradient across the accumulated steps
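The two comments above describe a single rescaling of the summed loss before the backward pass: DDP averages gradients over `accelerator.num_processes` devices and the accumulation averages over `gradient_accumulation_steps` micro-batches, while `num_items_in_batch` already counts non-padding tokens across all of them. The step those comments lead up to might look like this (a sketch, with variable names assumed from the surrounding snippet):

```python
# loss is a sum over this micro-batch's samples/tokens (see the loss_function comment above).
# Multiplying by num_processes undoes DDP's cross-device averaging and multiplying by
# gradient_accumulation_steps undoes the averaging across accumulated steps, because the
# division by num_items_in_batch already accounts for both.
loss = (loss * gradient_accumulation_steps * accelerator.num_processes) / num_items_in_batch
accelerator.backward(loss)
```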
@@ -394,7 +394,7 @@ for update_step in range(total_gradient_updates):
     for i, batch in enumerate(batch_samples):
         inputs, labels = batch["input_ids"], batch["labels"]
         total_batched_samples += 1
-        # if we perform gradient accumulation in a multi-devices set-up, we want to avoid unecessary communications when accumulating
+        # if we perform gradient accumulation in a multi-devices set-up, we want to avoid unnecessary communications when accumulating
         # cf: https://muellerzr.github.io/blog/gradient_accumulation.html
         if (i < len(batch_samples) - 1 and accelerator.num_processes > 1):
             ctx = model.no_sync
