diff --git a/Optimizations.md b/Optimizations.md
index 6c8b332..6594423 100644
--- a/Optimizations.md
+++ b/Optimizations.md
@@ -18,4 +18,63 @@ A number of optimization can be enabled by [commandline arguments](Run-with-Cust
 Extra tips (Windows):
 - https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/3889 Disable Hardware GPU scheduling.
 - disable browser hardware acceleration
-- Go in nvidia control panel, 3d parameters, and change power profile to "maximum performance"
\ No newline at end of file
+- Go to the Nvidia Control Panel, Manage 3D settings, and set the power management mode to "Prefer maximum performance"
+
+## Memory & Performance Impact of Optimizers and Flags
+
+*This is an example test using specific hardware and configuration; your mileage may vary*
+*Tested using an Nvidia RTX 3060 and CUDA 11.7*
+
+| Cross-attention | Peak memory (GB) at batch size 1/2/4/8/16 | Initial it/s | Peak it/s | Notes |
+| --------------- | ----------------------------------------- | ------------ | --------- | ----- |
+| None | 4.1 / 6.2 / OOM / OOM / OOM | 4.2 | 4.6 | slow, and goes out-of-memory early |
+| v1 | 2.8 / 2.8 / 2.8 / 3.1 / 4.1 | 4.1 | 4.7 | slow, but lowest memory usage; does not require the sometimes problematic xformers |
+| InvokeAI | 3.1 / 4.2 / 6.3 / 6.6 / 7.0 | 5.5 | 6.6 | almost identical to the default optimizer |
+| Doggettx | 3.1 / 4.2 / 6.3 / 6.6 / 7.1 | 5.4 | 6.6 | default |
+| Doggettx | 2.2 / 2.7 / 3.8 / 5.9 / 6.2 | 4.1 | 6.3 | the `--medvram` preset gives decent memory savings without a huge performance hit |
+| Doggettx | 0.9 / 1.1 / 2.2 / 4.3 / 6.4 | 1.0 | 6.3 | the `--lowvram` preset is extremely slow due to constant swapping |
+| Xformers | 2.8 / 2.8 / 2.8 / 3.1 / 4.1 | 6.5 | 7.5 | fastest, with low memory use |
+| Xformers | 2.9 / 2.9 / 2.9 / 3.6 / 4.1 | 6.4 | 7.6 | with `PYTORCH_CUDA_ALLOC_CONF` and `--opt-channelslast` |
+
+*(the launch flags used to select each optimizer and preset are sketched at the end of this page)*
+
+Notes:
+- Performance at batch size 1 is around **70%** of peak performance
+- Peak performance is typically reached around batch size 8
+  Beyond that it grows by a few percent if you have extra VRAM, before it starts to drop as garbage collection kicks in
+- Performance with the `--lowvram` preset is very low below batch size 8, and by that point the memory savings are not that big
+
+Other possible optimizations (a launcher sketch showing where to set these follows below):
+- `PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512`
+  No performance impact; it increases the initial memory footprint a bit, but reduces memory fragmentation in long runs
+- `--opt-channelslast`
+  Hit-and-miss: appears slightly faster at higher batch sizes and slower at small ones, but the differences are within the margin of error
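+
+The optimizers and memory presets compared in the table are all selected with launch flags. As a rough sketch of how that looks, assuming the stock `webui-user.bat` launcher on Windows (the flag names below are the ones documented on the webui's commandline arguments page, and they may change between versions):
+
+```bat
+rem Example webui-user.bat - a sketch only; pick flags that match your hardware.
+rem Pick at most one cross-attention optimizer, matching the table rows above:
+rem   None     -> --disable-opt-split-attention
+rem   v1       -> --opt-split-attention-v1
+rem   InvokeAI -> --opt-split-attention-invokeai
+rem   Doggettx -> --opt-split-attention
+rem   Xformers -> --xformers
+rem Add --medvram or --lowvram to reproduce the preset rows of the table.
+set COMMANDLINE_ARGS=--xformers --medvram
+
+call webui.bat
+```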
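+
+For the allocator tweak, `PYTORCH_CUDA_ALLOC_CONF` has to be in the environment before PyTorch initializes CUDA, so the natural place to set it is the same launcher script. A minimal sketch, again assuming `webui-user.bat` (on Linux an `export` in `webui-user.sh` works the same way):
+
+```bat
+rem Reduce memory fragmentation in long runs, at the cost of a slightly
+rem larger initial footprint (see the notes above).
+set PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512
+
+rem --opt-channelslast is the hit-and-miss layout flag discussed above.
+set COMMANDLINE_ARGS=--xformers --opt-channelslast
+
+call webui.bat
+```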