How do we reconcile the LightningModule that the user sees with the automatically parallelized model that the trainer sees? Not all gradient clipping techniques are compatible with strategies like DeepSpeed or FSDP. For example: https://github.com/PyTorchLightning/pytorch-lightning/blob/c7451b3ccf742b0e89713...
🐛 Describe the bug It looks like gradient checkpointing (activation checkpointing) is not allowed when used with torch.compile. For example, this code: import torch import torch.utils.checkpoint import torch._dynamo torch._dynamo.config...
🚀 Feature See code here: https://github.com/pseeth/autoclip Motivation: a simple method for automatically and adaptively choosing a gradient clipping threshold, based on the history of gradient norms observed during training. Experimental...
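The adaptive clipping idea described above can be sketched in a few lines of plain Python: track the norms seen so far and clip at a percentile of that history. This is a minimal illustration, not the autoclip repo's actual implementation; the percentile value and the nearest-rank interpolation are assumptions.

```python
from typing import List

def autoclip_threshold(grad_norm_history: List[float], percentile: float = 10.0) -> float:
    """Clipping threshold = p-th percentile (nearest rank) of observed gradient norms."""
    ordered = sorted(grad_norm_history)
    k = int(round(percentile / 100.0 * (len(ordered) - 1)))
    k = max(0, min(len(ordered) - 1, k))
    return ordered[k]

def clip_by_norm(grad: List[float], max_norm: float) -> List[float]:
    """Scale a gradient vector down so its L2 norm does not exceed max_norm."""
    norm = sum(g * g for g in grad) ** 0.5
    if norm <= max_norm or norm == 0.0:
        return grad
    scale = max_norm / norm
    return [g * scale for g in grad]
```

In a training loop one would append each step's gradient norm to the history, recompute the threshold, and then clip with it, so the threshold adapts as training progresses.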
https://discuss.pytorch.org/t/ddp-and-gradient-checkpointing/132244/3 https://github.com/mlfoundations/open_clip/blob/c933765dc557d88e15be968e78d7580d95f86af8/src/training/main.py#L156 Trying to figure out how to do this with accelerate ...
🚀 Feature Implement Image-Gradients for PT Lightning. Motivation Recently I was working on a vanilla PT implementation of the DenseDepth paper. They happen to use a DepthLoss as one of their loss functions. Incidentally, DepthLoss is bas...
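Losses like the one mentioned above are typically built on first-order image gradients (forward finite differences along each axis). The snippet does not show the exact DepthLoss formulation, so the following is only a sketch of the gradient step itself, using the common convention of zero-padding the last row/column so outputs match the input shape.

```python
def image_gradients(img):
    """Forward-difference gradients (dy, dx) of a 2D image given as a list of rows.

    Both outputs have the same shape as img; the last row of dy and the last
    column of dx are zero (assumed convention, matching e.g. TensorFlow's
    tf.image.image_gradients).
    """
    h = len(img)
    w = len(img[0])
    dy = [[0.0] * w for _ in range(h)]
    dx = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if i + 1 < h:
                dy[i][j] = img[i + 1][j] - img[i][j]   # vertical difference
            if j + 1 < w:
                dx[i][j] = img[i][j + 1] - img[i][j]   # horizontal difference
    return dy, dx
```

A depth loss would then penalize the difference between the gradients of the predicted and ground-truth depth maps, encouraging edges to align.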
Am launching a script that trains a model, which works well when trained without DDP but with gradient checkpointing, or with DDP but no gradient checkpointing (using Fabric as well). However, when enabling both DDP and gradient checkpointing, activated through the gradient_checkpointing_enable() function o...
[Run on Paperspace Gradient (Free GPU)](https://console.paperspace.com/github/gradient-ai/PyTorch-Lightning?machine=Free-GPU) Collection of Jupyter notebook tutorials from the [PyTorch Lightning documentation](https://pytorch-lightning.readthedocs.io/)....
Add stronger typing to gradient accumulation scheduler callback (#3558) · Lightning-AI/pytorch-lightning@c61e1e6
v2.4 How to reproduce the bug: No response. Error messages and logs: none provided. Environment: unfilled template (PyTorch Lightning version, PyTorch version, Python version, OS, CUDA...)
Description & Motivation When training different model sizes on a different number of devices or different hardware, the batch size needs to be carefully tuned in order to achieve maximum GPU utilization without incurring Out Of Memory E...
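A common strategy for the batch-size tuning described above is to grow the batch size until an out-of-memory failure, then binary-search between the last success and the first failure. The sketch below assumes a hypothetical `try_batch(bs)` probe that returns True on success and False on OOM, and that failures are monotone in batch size; it is not Lightning's actual Tuner API.

```python
def find_max_batch_size(try_batch, low=1, high_limit=1 << 20):
    """Largest batch size for which try_batch(bs) succeeds.

    try_batch is a hypothetical callable: True on success, False on an
    out-of-memory failure. Monotonicity is assumed (if bs fails, every
    larger bs fails). Doubling phase first, then binary search.
    """
    if not try_batch(low):
        return 0
    # Doubling phase: grow until the first failure or the hard limit.
    bs = low
    while bs * 2 <= high_limit and try_batch(bs * 2):
        bs *= 2
    # Binary search between last success (bs) and first failure (bs * 2).
    lo, hi = bs, min(bs * 2, high_limit)
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if try_batch(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

In practice `try_batch` would run one forward/backward step at the candidate size and catch the framework's OOM exception; re-running the search per model size and device count automates the tuning the feature request asks for.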