Updated Optimizations (markdown)

Vladimir Mandic 2023-01-07 11:19:43 -05:00
parent 82011d27c3
commit 6df331a03a

@@ -19,3 +19,31 @@ Extra tips (Windows):
- https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/3889 Disable Hardware GPU scheduling.
- disable browser hardware acceleration
- Go to the NVIDIA Control Panel, Manage 3D settings, and set the power management mode to "Prefer maximum performance"

## Memory & Performance Impact of Optimizers and Flags
*This is an example test using specific hardware and configuration; your mileage may vary*
*Tested using an NVIDIA RTX 3060 with CUDA 11.7*

| Cross-attention | Peak memory (GB) at batch size 1/2/4/8/16 | Initial it/s | Peak it/s | Note |
| --------------- | ------------------------------------------ | ------------ | --------- | ---- |
| None | 4.1 / 6.2 / OOM / OOM / OOM | 4.2 | 4.6 | slow and early out-of-memory |
| v1 | 2.8 / 2.8 / 2.8 / 3.1 / 4.1 | 4.1 | 4.7 | slow, but lowest memory usage; does not require the sometimes-problematic xformers |
| InvokeAI | 3.1 / 4.2 / 6.3 / 6.6 / 7.0 | 5.5 | 6.6 | almost identical to the default optimizer |
| Doggettx | 3.1 / 4.2 / 6.3 / 6.6 / 7.1 | 5.4 | 6.6 | default |
| Doggettx | 2.2 / 2.7 / 3.8 / 5.9 / 6.2 | 4.1 | 6.3 | the `medvram` preset results in decent memory savings without a huge performance hit |
| Doggettx | 0.9 / 1.1 / 2.2 / 4.3 / 6.4 | 1.0 | 6.3 | the `lowvram` preset is extremely slow due to constant swapping |
| Xformers | 2.8 / 2.8 / 2.8 / 3.1 / 4.1 | 6.5 | 7.5 | fastest and low memory |
| Xformers | 2.9 / 2.9 / 2.9 / 3.6 / 4.1 | 6.4 | 7.6 | with `cuda_alloc_conf` and `opt-channelslast` |
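
Each cross-attention optimization in the table is selected with the corresponding webui command-line flag. A minimal sketch of `webui-user.sh`, assuming a recent webui build (flag names may differ between releases; check `python launch.py --help`):

```bash
# webui-user.sh -- pick at most one cross-attention flag:
#   v1 (lowest memory):           --opt-split-attention-v1
#   InvokeAI:                     --opt-split-attention-invokeai
#   Doggettx (default on CUDA):   --opt-split-attention
#   xformers (fastest here):      --xformers
# optional memory presets:        --medvram or --lowvram
export COMMANDLINE_ARGS="--xformers"
```
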
Notes:
- Performance at batch size 1 is around **70%** of peak performance
- Peak performance is typically reached around batch size 8
  Beyond that it grows by a few percent if you have extra VRAM, before it starts to drop as garbage collection kicks in
- Performance with the `lowvram` preset is very low below batch size 8, and by then the memory savings are not that big
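
A simple way to reproduce the peak-memory column is to watch VRAM usage from a second terminal while a generation runs; a sketch using `nvidia-smi` (the reading covers every process on the GPU, so subtract the idle baseline):

```bash
# print GPU memory usage once per second; note the highest value
# observed at each batch size
nvidia-smi --query-gpu=memory.used --format=csv -l 1
```
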
Other possible optimizations (see the launcher sketch after this list):
- `PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512`
  No performance impact; increases the initial memory footprint a bit, but reduces memory fragmentation in long runs
- `opt-channelslast`
  Hit-and-miss: seems to give a slight additional performance increase at higher batch sizes and to be slower at small ones, but the differences are within the margin of error
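
Both settings can be applied from the launcher script; a minimal sketch of `webui-user.sh`, assuming the stock launcher's variable names (on Windows, use `set` in `webui-user.bat` instead):

```bash
# webui-user.sh -- reduce fragmentation in long runs and try channels-last
export PYTORCH_CUDA_ALLOC_CONF="garbage_collection_threshold:0.9,max_split_size_mb:512"
export COMMANDLINE_ARGS="--xformers --opt-channelslast"
```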