Amp mode stands for "automatic mixed precision" which allows Torch to choose to execute certain operations that are numerically "safe" to be run in FP16 precision in FP16 precision (addition, subtraction), but use FP32 if unsafe (POW, etc). This saves some VRAM but also grants a significant performance boost.
This enables gradient checkpoint This will reduce the VRAM usage MANY gigabytes. There is a small performance loss, but you can also possible increase your batch size by using it.
Keeping the batch_size low reduces VRAM use. This is a more "fine dial" on VRAM use. Adjusting it up or down by 1 will increase or decrease VRAM use by about 0.5-1GB. For 12GB gpus you will need to keep batch_size 1 or 2.
The third is gradient accumulation, which doesnot reduce VRAM, but gives you a "virtual batch size multiplier" when you are not able to increase the batch_size directly.
This will combine the loss from multiple batches before applying updates. There is some small VRAM overhead to this but not as much as increasing the batch size. Increasing it beyond 2 does not continue to increase VRAM, only going from 1 to 2 seems to affect VRAM use, and by a small amount.