Quantized models on multi-GPU

I'm experimenting with the new CUDA acceleration for quantized models and wondering how to use sharded tensors with it. I'm having a hard time adapting `ShardedVarBuilder` to load weights the way `quantized_var_builder::VarBuilder::from_gguf` does.

Do you have any recommendations on the best approach here?

@LaurentMazare