What's Changed
- Lightning multinode parquet by @jdnurme in #73
- Install pytest when running continuous test for dataflux pypi package. by @akansha1812 in #75
- Make DatafluxPytTrain a wrapper of DataFluxMapStyleDataset by @abhibyreddi in #74
- updated base docker, example yaml, added readme by @jdnurme in #76
- Fix DatafluxPytTrain.getitem by @abhibyreddi in #77
- Run continuous test on the pypi installed package on presubmit by @akansha1812 in #78
- Add code to make it possible to deploy training on a multi-node GKE cluster by @abhibyreddi in #81
- Configure shared memory size by @abhibyreddi in #82
- Reorder Dockerfile and add dockerignore to speed up builds by @MattIrv in #84
- Correcting the checkpointing functions to handle the Path object. by @Yash9060 in #83
- Parse bucket name from ckpt directory name instead of separate parameter for bucket name by @Yash9060 in #85
- Make Lightning checkpoint demo work with Bernard's GKE framework and with FSDP strategy by @MattIrv in #86
- Initialize new storage_client.bucket on every request by @Yash9060 in #87
- Add README file for lightning image segmentation workload by @abhibyreddi in #89
- Check in initial Parquet benchmark based on MaxText data loading benchmark by @MattIrv in #90
- Add GKE deployment for MaxText Parquet training benchmark by @MattIrv in #91
- Skip training when demo is run to benchmark Dataflux by @abhibyreddi in #92
- Update the definition of the local flag by @abhibyreddi in #93
- Allow running demo code in listing-only mode by @abhibyreddi in #95
- Raise exception when ADC are missing by @abhibyreddi in #94
- Update defaults for batch_size and num_workers by @abhibyreddi in #96
- Faster Lightning Checkpoint download by @MattIrv in #99
- Adding custom GCS Writer. by @Yash9060 in #98
- update to latest dataflux client by @jdnurme in #101
- add continuous benchmark with kokoro by @jdnurme in #102
- Run image training demo as part of continuous integ tests by @abhibyreddi in #104
- Adding GCS Custom reader by @Yash9060 in #105
- MultiNode demo by @Yash9060 in #106
- add benchmark code and update kokoro scripts by @jdnurme in #108
- Parameterizing min_epochs, max_epochs & max_steps by @Yash9060 in #107
- Add a helper method to create storage_client when needed. by @awonak in #109
- Make step time configurable by @abhibyreddi in #110
- Remove client initialization for fast listing from dataflux-pytorch by @akansha1812 in #111
- Multipart checkpoint upload by @jdnurme in #114
- adds unit tests, adds presubmit integration test, updated demo code by @jdnurme in #117
- Add code to clear kernel cache after saving checkpoints by @abhibyreddi in #122
- update continuous to run full benchmark by @jdnurme in #123
- Adding benchmarking code for multi node checkpointing. by @Yash9060 in #121
- set multipart upload to default behavior by @jdnurme in #127
- Introduce AsyncCheckpointIO option for non-blocking checkpoint saves by @awonak in #116
- Print average times to save and load checkpoints together by @abhibyreddi in #129
- Changing hardcoded values to placeholders by @Yash9060 in #128
- Make num_nodes configurable by @abhibyreddi in #130
- update lightning bench with multipart and 10k info by @jdnurme in #131
- update default dataflux to use multipart by @jdnurme in #133
- Run unit tests on x86 Mac by @abhibyreddi in #115
- implement fast download for df checkpoint by @jdnurme in #134
- Add image segmentation benchmark results to README by @abhibyreddi in #118
- Add single node async benchmark execution to integration tests by @awonak in #135
- Refactor benchmark tables by @awonak in #136
- add option to run benchmark without lightning by @jdnurme in #137
- Fix AsyncCheckpointIO race condition by @awonak in #138
- Update image segmentation benchmark README by @abhibyreddi in #139
- add upload and download improvements to multinode by @jdnurme in #141
- Update documented step time by @abhibyreddi in #142
- CPU simulated benchmarking for GKE cluster. by @Yash9060 in #143
- Simulated CPU benchmarking code by @Yash9060 in #145
- Add support for multi-node checkpointing with fsspec by @abhibyreddi in #144
- Correcting the code for simulated benchmarks by @Yash9060 in #146
- Multi-node checkpoint benchmark improvements by @MattIrv in #149
- Set pytorch version to 2.3.1 by @abhibyreddi in #148
- update main readme with checkpoint bench results by @jdnurme in #150
- Add support to benchmark multi-node checkpointing with default FSDP strategy by @abhibyreddi in #151
- Remove duplicative pip install instructions from multi-node checkpoint benchmark readme by @MattIrv in #152
- Skip saving checkpoints during training by @abhibyreddi in #153
- Install checkpoint benchmark dependencies before running the benchmark by @abhibyreddi in #155
- Update checkpoint readmes by @MattIrv in #159
- Implement a custom FSDP strategy for benchmarking loads from boot disk by @abhibyreddi in #157
- Added debug flag to GCSReader/Writer by @Yash9060 in #154
- Correcting load_checkpoint for simulated benchmarks. by @Yash9060 in #161
- Add support for benchmarking checkpoint save/restore to/from distributed filesystems by @abhibyreddi in #162
- Correct table header row by @abhibyreddi in #163
- Adding option to use FSspec with simulated benchmarks by @Yash9060 in #164
- Create client for each process by @akansha1812 in #166
- update bench script to run simulated multinode bench by @jdnurme in #167
- Move client initialization to getitem and getitems by @akansha1812 in #170
- Add additional timing info to multi-node simulated demo/benchmark. by @MattIrv in #172
- add simple llama load benchmark and results by @jdnurme in #171
- Revert "add simple llama load benchmark and results" by @jdnurme in #177
- Update save/load print statements for FSDP benchmark by @MattIrv in #176
- Update auto wrap policy and remove duplicate load in trainer.fit by @MattIrv in #175
- Refactor DemoTransformer model to match Lightning demo by @MattIrv in #174
- Remove model and path arguments from FSDP strategy constructors by @MattIrv in #173
- PyTorch distributed checkpoint async save demo by @awonak in #168
- updated readme to include async and multinode features by @jdnurme in #183
- Update multi-node checkpoint benchmark instructions by @abhibyreddi in #180
- Adding simulated benchmark stuff. by @Yash9060 in #181
- Add 24-hour triage and PyTorch naming lines to README by @MattIrv in #185
- Updating .gitignore by @Yash9060 in #184
- move demo model class to common lib by @akansha1812 in #182
- Reference const, not value by @awonak in #189
- Update README.md with consistent naming by @MattIrv in #192
- Remove image segmentation duplicates by @akansha1812 in #186
- update benchmark with multinode numbers by @jdnurme in #191
- Add demo.image_segmentation.model by @akansha1812 in #195
- LLAMA2 Simulated version by @Yash9060 in #198
- release 1.4.0 by @jdnurme in #199
Full Changelog: v1.3.0...v1.4.0