pytorch / torchtitan Public

Issues: pytorch/torchtitan

102 Open 235 Closed

Issues list

How to pretrain from scratch a Qwen 2.5 7B-base model using Torchtitan?

#1223 opened May 25, 2025 by tjoymeed

expert_bias is updated during training but saved checkpoint contains only zero values

#1222 opened May 24, 2025 by trestad

[Flux] Incorrect loss after loading from checkpoint

#1213 opened May 21, 2025 by CarlosGomes98

[RFC] validation and evaluation in torchtitan

#1210 opened May 20, 2025 by tianyu-l

float8 rowwise vanilla TP low throughput

Labels: bug, module: float8

#1207 opened May 20, 2025 by danielvegamyhre

[MXFP8] unable to run titan llama3 debug model with mxfp8. Assertion: n_rows % max_row_tile_size == 0

Labels: bug

#1200 opened May 16, 2025 by lessw2020

Save RNG states during checkpointing for deterministic debugging

Labels: enhancement

#1194 opened May 14, 2025 by wwwjn

document the usage of environment variables

Labels: better_engineering, documentation, high priority, triage review

#1192 opened May 14, 2025 by tianyu-l

PP Zero Bubble CI tests failure

Labels: ci test failure, high priority, module: pipelining, triage review

#1188 opened May 13, 2025 by tianyu-l

issues on llama3 compile + (async) TP + AC

Labels: ci test failure, high priority, module: torch.compile, triage review

#1185 opened May 13, 2025 by tianyu-l

Can we support outputting checkpoints directly in .pt format?

Labels: enhancement, module: checkpoint

#1177 opened May 9, 2025 by andrewor14

how to inference with pretrained model?

#1169 opened May 6, 2025 by dragen1860

[Flux] Flux Issue Tracking

#1151 opened Apr 28, 2025 by wwwjn

4 of 16 tasks

[Feature] Support validation

#1150 opened Apr 28, 2025 by CarlosGomes98

[Question] FSDP+TP CUDA_DEVICE_MAX_CONNECTIONS

Labels: documentation, module: fsdp, question

#1147 opened Apr 27, 2025 by ChenchaoZhao

Inconsistent loss when resume training with vocab size that is not divisible by world size.

Labels: high priority, module: checkpoint, triage review

#1136 opened Apr 23, 2025 by weixuansun

fully_shard() for huggingface model: pytorch caches too much GPU memory

Labels: module: fsdp, question

#1126 opened Apr 21, 2025 by mingdianliu

[DeepSeek MoE] current workstream planning

Labels: enhancement

#1125 opened Apr 21, 2025 by lessw2020

Llama 4 issue tracking

Labels: high priority, triage review

#1118 opened Apr 17, 2025 by tianyu-l

3 of 13 tasks

Seeing - "Recomputed values for the following tensors have different metadata than during the forward pass."

Labels: high priority, triage review

#1117 opened Apr 17, 2025 by githubsgi

FSDP2 root level parameter management

Labels: module: fsdp, question

#1091 opened Apr 11, 2025 by dingqingy

Torch.compile and TP during multiresolution Training

Labels: module: torch.compile, question

#1081 opened Apr 9, 2025 by nighting0le01

Is the current configuration system over-engineered?

Labels: question

#1055 opened Apr 3, 2025 by wangkuiyi

Clarify PP split point documentation.

Labels: question

#1054 opened Apr 3, 2025 by githubsgi

Overflow in F.scaled_dot_product_attention when using profiling with deterministic training

#1049 opened Apr 3, 2025 by JungHoyoun
