Updated Optimizations (markdown)

Vladimir Mandic 2023-01-07 11:19:43 -05:00
parent 82011d27c3
commit 6df331a03a

@@ -19,3 +19,31 @@ Extra tips (Windows):
- https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/3889 Disable Hardware GPU scheduling.
- disable browser hardware acceleration
- Go to the NVIDIA Control Panel, Manage 3D settings, and set the power management mode to "Prefer maximum performance"

## Memory & Performance Impact of Optimizers and Flags
*This is an example test using specific hardware and configuration; your mileage may vary*
*Tested using an NVIDIA RTX 3060 with CUDA 11.7*

| Cross-attention | Peak memory (GB) at batch size 1/2/4/8/16 | Initial it/s | Peak it/s | Note |
| --------------- | ------------------------------------------ | ------------ | --------- | ---- |
| None | 4.1 / 6.2 / OOM / OOM / OOM | 4.2 | 4.6 | slow and early out-of-memory |
| v1 | 2.8 / 2.8 / 2.8 / 3.1 / 4.1 | 4.1 | 4.7 | slow, but lowest memory usage; does not require the sometimes-problematic xformers |
| InvokeAI | 3.1 / 4.2 / 6.3 / 6.6 / 7.0 | 5.5 | 6.6 | almost identical to the default optimizer |
| Doggettx | 3.1 / 4.2 / 6.3 / 6.6 / 7.1 | 5.4 | 6.6 | default |
| Doggettx | 2.2 / 2.7 / 3.8 / 5.9 / 6.2 | 4.1 | 6.3 | the `medvram` preset results in decent memory savings without a huge performance hit |
| Doggettx | 0.9 / 1.1 / 2.2 / 4.3 / 6.4 | 1.0 | 6.3 | the `lowvram` preset is extremely slow due to constant swapping |
| Xformers | 2.8 / 2.8 / 2.8 / 3.1 / 4.1 | 6.5 | 7.5 | fastest and low memory |
| Xformers | 2.9 / 2.9 / 2.9 / 3.6 / 4.1 | 6.4 | 7.6 | with `cuda_alloc_conf` and `opt-channelslast` |
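
Each cross-attention optimization in the table is selected with the corresponding webui command-line flag. A minimal sketch of `webui-user.sh`, assuming a recent webui build (flag names may differ between releases; check `python launch.py --help`):

```bash
# webui-user.sh -- pick at most one cross-attention flag:
#   v1 (lowest memory):           --opt-split-attention-v1
#   InvokeAI:                     --opt-split-attention-invokeai
#   Doggettx (default on CUDA):   --opt-split-attention
#   xformers (fastest here):      --xformers
# optional memory presets:        --medvram or --lowvram
export COMMANDLINE_ARGS="--xformers"
```
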
Notes:
- Performance at batch size 1 is around **70%** of peak performance
- Peak performance is typically reached around batch size 8
  Beyond that it grows by a few percent if you have extra VRAM, before it starts to drop as garbage collection kicks in
- Performance with the `lowvram` preset is very low below batch size 8, and by then the memory savings are not that big
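
A simple way to reproduce the peak-memory column is to watch VRAM usage from a second terminal while a generation runs; a sketch using `nvidia-smi` (the reading covers every process on the GPU, so subtract the idle baseline):

```bash
# print GPU memory usage once per second; note the highest value
# observed at each batch size
nvidia-smi --query-gpu=memory.used --format=csv -l 1
```
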
Other possible optimizations (see the launcher sketch after this list):
- `PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512`
  No performance impact; increases the initial memory footprint a bit, but reduces memory fragmentation in long runs
- `opt-channelslast`
  Hit-and-miss: seems to give a slight additional performance increase at higher batch sizes and to be slower at small ones, but the differences are within the margin of error
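
Both settings can be applied from the launcher script; a minimal sketch of `webui-user.sh`, assuming the stock launcher's variable names (on Windows, use `set` in `webui-user.bat` instead):

```bash
# webui-user.sh -- reduce fragmentation in long runs and try channels-last
export PYTORCH_CUDA_ALLOC_CONF="garbage_collection_threshold:0.9,max_split_size_mb:512"
export COMMANDLINE_ARGS="--xformers --opt-channelslast"
```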