1.4.0

@jdnurme released this 20 Nov 22:54 · commit 9791744

What's Changed

  • Lightning multinode parquet by @jdnurme in #73
  • Install pytest on running continuous test for dataflux pypi package. by @akansha1812 in #75
  • Make DatafluxPytTrain a wrapper of DataFluxMapStyleDataset by @abhibyreddi in #74
  • updated base docker, example yaml, added readme by @jdnurme in #76
  • Fix DatafluxPytTrain.getitem by @abhibyreddi in #77
  • Run continuous test on the pypi installed package on presubmit by @akansha1812 in #78
  • Add code to make it possible to deploy training on a multi-node GKE cluster by @abhibyreddi in #81
  • Configure shared memory size by @abhibyreddi in #82
  • Reorder Dockerfile and add dockerignore to speed up builds by @MattIrv in #84
  • Correcting the checkpointing functions to handle the Path object. by @Yash9060 in #83
  • Parse bucket name from ckpt directory name instead of separate parameter for bucket name by @Yash9060 in #85
  • Make Lightning checkpoint demo work with Bernard's GKE framework and with FSDP strategy by @MattIrv in #86
  • Initialize new storage_client.bucket on every request by @Yash9060 in #87
  • Add README file for lightning image segmentation workload by @abhibyreddi in #89
  • Check in initial Parquet benchmark based on MaxText data loading benchmark by @MattIrv in #90
  • Add GKE deployment for MaxText Parquet training benchmark by @MattIrv in #91
  • Skip training when demo is run to benchmark Dataflux by @abhibyreddi in #92
  • Update the definition of the local flag by @abhibyreddi in #93
  • Allow running demo code in listing-only mode by @abhibyreddi in #95
  • Raise exception when ADC are missing by @abhibyreddi in #94
  • Update defaults for batch_size and num_workers by @abhibyreddi in #96
  • Faster Lightning Checkpoint download by @MattIrv in #99
  • Adding custom GCS Writer. by @Yash9060 in #98
  • update to latest dataflux client by @jdnurme in #101
  • add continuous benchmark with kokoro by @jdnurme in #102
  • Run image training demo as part of continuous integ tests by @abhibyreddi in #104
  • Adding GCS Custom reader by @Yash9060 in #105
  • MultiNode demo by @Yash9060 in #106
  • add benchmark code and update kokoro scripts by @jdnurme in #108
  • Parameterizing min_epochs, max_epochs & max_steps by @Yash9060 in #107
  • Add a helper method to create storage_client when needed. by @awonak in #109
  • Make step time configurable by @abhibyreddi in #110
  • Remove client initialization for fast listing from dataflux-pytorch by @akansha1812 in #111
  • Multipart checkpoint upload by @jdnurme in #114
  • adds unit tests, adds presubmit integration test, updated demo code by @jdnurme in #117
  • Add code to clear kernel cache after saving checkpoints by @abhibyreddi in #122
  • update continuous to run full benchmark by @jdnurme in #123
  • Adding benchmarking code for multi node checkpointing. by @Yash9060 in #121
  • set multipart upload to default behavior by @jdnurme in #127
  • Introduce AsyncCheckpointIO option for non-blocking checkpoint saves by @awonak in #116
  • Print average times to save and load checkpoints together by @abhibyreddi in #129
  • Changing hardcoded values to placeholders by @Yash9060 in #128
  • Make num_nodes configurable by @abhibyreddi in #130
  • update lightning bench with multipart and 10k info by @jdnurme in #131
  • update default dataflux to use multipart by @jdnurme in #133
  • Run unit tests on x86 Mac by @abhibyreddi in #115
  • implement fast download for df checkpoint by @jdnurme in #134
  • Add image segmentation benchmark results to README by @abhibyreddi in #118
  • Add single node async benchmark execution to integration tests by @awonak in #135
  • Refactor benchmark tables by @awonak in #136
  • add option to run benchmark without lightning by @jdnurme in #137
  • Fix AsyncCheckpointIO race condition by @awonak in #138
  • Update image segmentation benchmark README by @abhibyreddi in #139
  • add upload and download improvements to multinode by @jdnurme in #141
  • Update documented step time by @abhibyreddi in #142
  • CPU simulated benchmarking for GKE cluster. by @Yash9060 in #143
  • Simulated CPU benchmarking code by @Yash9060 in #145
  • Add support for multi-node checkpointing with fsspec by @abhibyreddi in #144
  • Correcting the code for simulated benchmarks by @Yash9060 in #146
  • Multi-node checkpoint benchmark improvements by @MattIrv in #149
  • Set pytorch version to 2.3.1 by @abhibyreddi in #148
  • update main readme with checkpointing bench results by @jdnurme in #150
  • Add support to benchmark multi-node checkpointing with default FSDP strategy by @abhibyreddi in #151
  • Remove duplicative pip install instructions from multi-node checkpoint benchmark readme by @MattIrv in #152
  • Skip saving checkpoints during training by @abhibyreddi in #153
  • Install checkpoint benchmark dependencies before running the benchmark by @abhibyreddi in #155
  • Update checkpoint readmes by @MattIrv in #159
  • Implement a custom FSDP strategy for benchmarking loads from boot disk by @abhibyreddi in #157
  • Added debug flag to GCSReader/Writer by @Yash9060 in #154
  • Correcting load_checkpoint for simulated benchmarks. by @Yash9060 in #161
  • Add support for benchmarking checkpoint save/restore to/from distributed filesystems by @abhibyreddi in #162
  • Correct table header row by @abhibyreddi in #163
  • Adding option to use FSspec with simulated benchmarks by @Yash9060 in #164
  • Create client for each process by @akansha1812 in #166
  • update bench script to run simulated multinode bench by @jdnurme in #167
  • Move client initialization to getitem and getitems by @akansha1812 in #170
  • Add additional timing info to multi-node simulated demo/benchmark. by @MattIrv in #172
  • add simple llama load benchmark and results by @jdnurme in #171
  • Revert "add simple llama load benchmark and results" by @jdnurme in #177
  • Update save/load print statements for FSDP benchmark by @MattIrv in #176
  • Update auto wrap policy and remove duplicate load in trainer.fit by @MattIrv in #175
  • Refactor DemoTransformer model to match Lightning demo by @MattIrv in #174
  • Remove model and path arguments from FSDP strategy constructors by @MattIrv in #173
  • PyTorch distributed checkpoint async save demo by @awonak in #168
  • updated readme to include async and multinode features by @jdnurme in #183
  • Update multi-node checkpoint benchmark instructions by @abhibyreddi in #180
  • Adding simulated benchmark stuff. by @Yash9060 in #181
  • Add 24-hour triage and PyTorch naming lines to README by @MattIrv in #185
  • Updating .gitignore by @Yash9060 in #184
  • move demo model class to common lib by @akansha1812 in #182
  • Reference const, not value by @awonak in #189
  • Update README.md with consistent naming by @MattIrv in #192
  • Remove image segmentation duplicates by @akansha1812 in #186
  • update benchmark with multinode numbers by @jdnurme in #191
  • Add demo.image_segmentation.model by @akansha1812 in #195
  • LLAMA2 Simulated version by @Yash9060 in #198
  • release 1.4.0 by @jdnurme in #199

Full Changelog: v1.3.0...v1.4.0