Change `fp8_quantize` so that we can pass around reciprocal scales everywhere.
This way, scales are always passed around in the checkpoint format.
I also noticed that when fbgemm is available, we ignore any input scale
that we might have. Skip the fbgemm path if we already have a scale.
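
For reviewers, here is a rough pure-Python sketch of the intended convention. It is not the actual implementation. The names `quantize_sketch`, `dequantize_sketch` and `fast_path_available` are made up for illustration, plain floats stand in for tensors, and rounding to the FP8 grid is left out. It assumes that "checkpoint format" means the scale you multiply by to dequantize.

```python
from typing import List, Optional, Tuple

# Largest finite value of the float8 e4m3fn format.
FP8_E4M3_MAX = 448.0


def quantize_sketch(
    values: List[float],
    scale: Optional[float] = None,
    fast_path_available: bool = False,
) -> Tuple[List[float], float, str]:
    """Return (scaled values, reciprocal scale, name of the path taken).

    `scale`, when given, is already a reciprocal scale: multiplying the
    scaled values by it recovers the original values.
    """
    # Hypothetical stand-in for the fbgemm check: the dynamic path computes
    # its own scale, so it is only taken when no scale was supplied.
    path = "dynamic" if fast_path_available and scale is None else "generic"

    if scale is None:
        # Derive a reciprocal scale from the largest magnitude.
        amax = max(max(abs(v) for v in values), 1e-12)
        scale = amax / FP8_E4M3_MAX

    # Dividing by the reciprocal scale maps values into the FP8 range;
    # clamp anything that still falls outside it.
    scaled = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]
    return scaled, scale, path


def dequantize_sketch(scaled: List[float], scale: float) -> List[float]:
    """Undo the scaling by multiplying with the reciprocal scale."""
    return [q * scale for q in scaled]
```

With this convention, a scale read from a checkpoint can be passed in and handed back unchanged, and a caller-provided scale is never silently replaced by one computed on the fast path.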