Support for SDXL refiner #227
Conversation
More info on the CLI issue: Here's a visualization of what the unet is outputting in float16 vs float32 precision via the CLI swift command. This is a greyscale visualization of the latent channels for comparison. The expected output should be some prediction of the noise in the image (in latent space), but the fp16 output has a large number of repeated values, which of course breaks the diffusion process.
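As an aside, the fp16 symptom can be quantified without a visualization. Here's a minimal numpy sketch (illustrative only, not part of the pipeline) that counts how many latent values collapse onto the same number after a float16 round-trip:

```python
import numpy as np

def repeated_value_fraction(latent: np.ndarray) -> float:
    """Fraction of elements that share a value with at least one other element.

    A healthy noise prediction has almost all-unique float values; a degenerate
    fp16 output collapses many values onto the same number.
    """
    values, counts = np.unique(latent.ravel(), return_counts=True)
    repeated = counts[counts > 1].sum()
    return repeated / latent.size

rng = np.random.default_rng(0)
# Stand-in for a 4-channel SDXL latent; sizes are illustrative.
healthy = rng.standard_normal((4, 128, 128)).astype(np.float32)
degenerate = healthy.astype(np.float16).astype(np.float32)

print(f"fp32 repeated fraction: {repeated_value_fraction(healthy):.3f}")
print(f"fp16 repeated fraction: {repeated_value_fraction(degenerate):.3f}")
```

The fp16 round-trip produces a far higher repeated fraction, which is the same "large number of repeated values" pattern visible in the grayscale dump.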
@atiorh this PR is ready for review :)
Thanks for the PR @ZachNagengast! I am traveling this week but will aim to review it asap.
Did a cursory pass to kick off some discussions. Thanks for the great work @ZachNagengast!
One more high-level note: the current behavior seems to be to detect whether the given XL Unet model is a refiner or a base one and run the pipeline accordingly. Could we also handle the cascaded base+refiner workflow, similar to the canonical configuration in diffusers? What do you think? We could enforce a reduceMemory-like policy for the base-to-refiner transition to avoid 2x memory.
let i = 4 * (y * width + x)
if ptr[i+3] == 0 {
    // Alpha mask for controlnets
    redChannel[y * width + x] = alphaMaskValue
I attempted to replicate it with this new function, and the plannerRGBShapedArray function is still in use for the controlnet conditioning. Would be great to get another eye on whether it's working as intended.
@@ -50,7 +50,7 @@ public struct Encoder: ResourceManaging {
         scaleFactor: Float32,
         random: inout RandomSource
     ) throws -> MLShapedArray<Float32> {
-        let imageData = try image.plannerRGBShapedArray(minValue: -1.0, maxValue: 1.0)
+        let imageData = try image.toRGBShapedArray(minValue: -1.0, maxValue: 1.0)
Could you please summarize the reasons for moving away from plannerRGBShapedArray?
This function is intended to fix a bug in plannerRGBShapedArray that wasn't working right with 1024x1024. There may be another approach that fixes whatever the bug was, but I opted to just use the CoreGraphics library for simplicity.
Example of the bug (converted with plannerRGBShapedArray, then back using CGImage.fromShapedArray):
Wow, thanks @ZachNagengast! Is this for any 1024x1024 or some pixel value-dependent edge case? I might want to file a ticket for this.
In any case, it makes sense to work around it by using your function for the moment.
Yea, you can try it out. Just do:
let imageData = try image.plannerRGBShapedArray(minValue: -1.0, maxValue: 1.0)
let decodedImage = try CGImage.fromShapedArray(imageData)
and inspect the decoded image; it should look off like this for the SDXL pipeline.
@@ -93,7 +93,7 @@ public struct Encoder: ResourceManaging {

     var inputDescription: MLFeatureDescription {
         try! model.perform { model in
-            model.modelDescription.inputDescriptionsByName["z"]!
+            model.modelDescription.inputDescriptionsByName.first!.value
This preserves backwards compatibility so I am not concerned about the rename and thanks for the semantically correct fix :)
No need to update, just appreciating the change
@@ -44,7 +44,19 @@ public struct PipelineConfiguration: Hashable {
     public var encoderScaleFactor: Float32 = 0.18215
     /// Scale factor to use on the latent before decoding
     public var decoderScaleFactor: Float32 = 0.18215

+    /// If `originalSize` is not the same as `targetSize` the image will appear to be down- or upsampled.
I am inclined to recommend a dedicated StableDiffusionXLPipeline.Configuration.swift, as quite a few things on the input conditioning and pipeline are different. @ZachNagengast & @pcuenca What do you think?
Yea, I'd agree it probably shouldn't be a requirement on the caller to set the scale factor properly, especially when we already know the pipeline is SDXL and can set the default accordingly. My question would be whether we'd want a new class for this or just set up the defaults somewhere in the pipeline. Or, potentially, a broader option of including the config files with the converted models directly, alongside merges.txt and vocab.json, and using those for the default config.
potentially a broader option of just including the config files with the converted models directly
That would be nice, and we could even make it backward compatible. However, it still wouldn't address the argument-set mismatch between XL and non-XL, right? IMO, the easiest is to create two (XL and non-XL) default configs and then add your proposal from above as an extension.
Yea, that makes sense to me, and seems to be the approach Diffusers likes to take. My only nitpick is the repeated code, but it can be consolidated a fair amount by committing to the two pipelines and pulling out shared code as needed.
I agree, we can refactor over time :)
I agree with the approach discussed, separate configs do seem necessary to make usage simpler.
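For reference, the "two default configs" idea can be sketched in a few lines. This is a Python stand-in for the Swift code, just to make the shape of the proposal concrete; the field names are simplified, and the 0.13025 SDXL scale factor comes from diffusers' SDXL VAE config (treat it as an assumption here):

```python
from dataclasses import dataclass

@dataclass
class PipelineConfiguration:
    # SD 1.x/2.x VAE defaults, matching the existing PipelineConfiguration
    encoder_scale_factor: float = 0.18215
    decoder_scale_factor: float = 0.18215

def default_config(xl: bool) -> PipelineConfiguration:
    """Return the default config for the requested pipeline flavor."""
    if xl:
        # SDXL's VAE uses a different scaling factor, so the caller no longer
        # has to know to override it when the pipeline is known to be SDXL.
        return PipelineConfiguration(encoder_scale_factor=0.13025,
                                     decoder_scale_factor=0.13025)
    return PipelineConfiguration()

print(default_config(xl=True))
```

Bundling per-model config files with the converted resources (the broader option above) could then simply override these defaults when present.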
@@ -1392,6 +1425,13 @@ def parser_spec():
         "If specified, enable unet to receive additional inputs from controlnet. "
         "Each input added to corresponding resnet output."
     )
+    parser.add_argument(
Can we rename this to --unet-compute-precision-float32? Sidebar: I still don't understand why the float32 constraint would only apply to the CLI and not to any other Swift program that calls StableDiffusionXLPipeline 🤔 We should discuss this more 🙏
Yea, I'll rename it. The CLI thing is still a mystery to me; I'd love to hear if anyone can replicate it or if it's just my machine. I was just getting noise in the output when using fp16, which went away in fp32.
I am still traveling with bad internet, so I am yet to convert and test the refiner model (hence the slow review). Will confirm once I do!
@atiorh Regarding your note on the base2refiner transition, here are two approaches that could work well:
Maybe both would be useful down the road, what do you think?
Totally, this is what I had in mind too:
@@ -796,9 +827,15 @@ def convert_unet(pipe, args):
         for k, v in sample_unet_inputs.items()
     }

+    if args.xl_version and pipe.config.requires_aesthetics_score and args.unet_support_cli:
-    if args.xl_version and pipe.config.requires_aesthetics_score and args.unet_support_cli:
+    if args.xl_version and pipe.config.requires_aesthetics_score and args.unet_compute_precision_float32:
I've just tested this PR and have some high-level observations for now:
- I haven't been able to reproduce the unet inference issues in float16 with the CLI. It works for me using both cpuAndGPU and all compute units (although using all results in very slow compilation times, as expected). It'd be really useful to identify an easily reproducible case. Any ideas about how to proceed @ZachNagengast?
- The plannerRGBShapedArray issue is caused by row padding applied by vImage for performance reasons. I have a simple fix that I can submit as a separate PR for discussion. (I could submit a PR to this PR if that's preferred, but I think the problem is general enough to deal with separately.)
- I didn't see any provisions in the Swift portion of the code to deal with absent textEncoder or textEncoder2 modules, so applying the refiner in text2image mode results in an error unless both encoders exist. This is only a problem because the original refiner checkpoints don't include a textEncoder, so a conversion followed by inference on the output folder would fail.
- There was a minor typo in the name of the arg that controls float32 conversion of the UNet (it had only been changed in one of two places); I submitted a small suggestion for it.

Other than that, I agree with providing a dedicated method to run the base and refiner models one after the other, in addition to using them separately.
I'll dive deeper into the code details later :)
Great work @ZachNagengast, this is a really exciting feature!
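The row-padding failure mode described above is easy to reproduce in a few lines of numpy (illustrative only; the real buffers come from vImage, and the 64-byte alignment value is an assumption):

```python
import numpy as np

width, height, channels = 1000, 4, 3   # a width whose row byte count isn't alignment-sized
alignment = 64                          # assumed row alignment, in bytes
row_bytes = -(-width * channels // alignment) * alignment  # ceil to the next multiple

# Simulate a padded interleaved RGB buffer: each row carries trailing junk bytes.
padded = np.zeros((height, row_bytes), dtype=np.uint8)
image = (np.arange(height * width * channels) % 251).astype(np.uint8)
image = image.reshape(height, width * channels)
padded[:, : width * channels] = image

# Naive flatten-and-reshape ignores the padding and shears every row after the first:
naive = padded.ravel()[: height * width * channels].reshape(height, width, channels)
# A correct read honors row_bytes and slices the padding off each row first:
correct = padded[:, : width * channels].reshape(height, width, channels)

print("naive matches correct:", np.array_equal(naive, correct))
```

This is why the bug only shows up at certain sizes: when `width * channels` already happens to be a multiple of the row alignment, there is no padding and the naive read works, which matches the 1024x1024 issue being size-dependent.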
- Also skip loading model if check_output_correctness is missing, since the model does not require inferencing at conversion time
@pcuenca Just updated the conversion script to bundle the UnetRefiner.mlmodelc with the base unet if specified via --refiner-version.
Also saw your update to plannerRGBShapedArray, nice fix. I'll include that back in with the pipeline updates, which should be coming later today.
Regarding the float precision thing: good to know your machine didn't have the problem. Unless anyone else reports a similar issue, I'll just remove that logic for simplicity.
Here's an example refiner + base conversion script. The thinking behind the new parameter was the potential for future refiners not tightly coupled with the base, so this will let you bundle any pre-trained unet as the refiner via the --refiner-version param.
@atiorh @pcuenca Update: just pushed some code to run the UnetRefiner and also handle potentially missing models in the bundle. Haven't included any dedicated subconfigs yet, but this latest update uses
var textEncoder2: TextEncoderXLModel

/// Model used to predict noise residuals given an input, diffusion time step, and conditional embedding
var unet: Unet

/// Model used to refine the image, if present
var unetRefiner: Unet?
Shall we make unet also an optional, since we were talking about the following scenarios:
1-) Only refiner unet provided, we only refine
2-) Only base unet provided, we only run base gen
3-) Both provided, we run the full pipeline (loading/unloading base/refiner during the transition)
4-) Neither provided, we error out
If you think (1-) is not relevant, we can discuss 👍
Good question: (1-) might have a catch in the case that someone wants to use the refiner as a base model (in imageToImage mode). If only a refiner is provided, then we can assume it's imageToImage mode and require a startingImage as well. It might be useful to include some flexibility to use either a Unet or UnetRefiner for purely imageToImage applications, to save disk space.
This question also goes back to torch2coreml.py, I think: in the conversion script we could logically separate refiner unets in a similar way to how the controlnets are separated, and discourage using refiner unets as base models. Currently, refiner unets are treated the same as any other SDXL unet unless specifically set by the new argument. I do think there are cases where you'd want to use a "base" model as a refiner as more fine-tuned models/loras come out, and there are also cases where I've seen people use the refiner as a base model, but that's probably less common overall.
I see! Let's keep it as is for now then. I will test the workflows as is.
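To make the four scenarios above concrete, here is a small selection helper. It's a Python stand-in with hypothetical names, not the PR's actual API, and it folds in the startingImage caveat from the discussion:

```python
def select_mode(has_base: bool, has_refiner: bool, has_starting_image: bool) -> str:
    """Pick a pipeline mode from which unets are present in the bundle."""
    if has_base and has_refiner:
        # Full pipeline: run base, then swap to the refiner at the transition.
        return "base+refiner"
    if has_base:
        return "base-only"
    if has_refiner:
        # Refiner alone is only meaningful for image-to-image, per the discussion.
        if not has_starting_image:
            raise ValueError("refiner-only bundles require a startingImage (imageToImage mode)")
        return "refiner-only"
    raise ValueError("no unet found in the bundle")

print(select_mode(True, True, False))
```

The error branches correspond to scenario (4-) and to the refiner-without-startingImage case that makes scenario (1-) tricky.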
logger.info(f"Converting refiner")
del pipe
gc.collect()
args.model_version = args.refiner_version
@ZachNagengast Could we restore the original model_version after the refiner conversion is done, two lines down? The name override breaks some filename assumptions in the coremlcompiler calls.
Yep! Just pushed
Update: still testing but things are looking great so far! @ZachNagengast One request: could you please hardcode
Tested base+refiner flow and ran into an issue that I think may require some changes, thanks for the continued iterations @ZachNagengast!
@ZachNagengast Tested out the base+refiner and base-only workflows, works great! Final things to put in place before I can approve:
Thanks for testing it out @atiorh, just pushed refiner chunking, and the compilation step should work now too; it will save the files as
-public var refinerStart: Float = 0.7
+public var refinerStart: Float = 0.8
This was based on my own testing, but according to the paper it was specialized on the first 200 steps, so 20% seems reasonable. I noticed fewer "wrong" artifacts show up with 0.8, which seems preferable as a default, whereas I didn't notice much more "refinement" with a higher percentage of refinement steps.
Awesome, thanks for the clarification!
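For a concrete sense of what refinerStart = 0.8 means in steps: assuming a simple proportional split of the step budget (not necessarily the PR's exact arithmetic), the base and refiner shares work out like this:

```python
def split_steps(step_count: int, refiner_start: float) -> tuple:
    """Split a total step budget into (base_steps, refiner_steps).

    refiner_start is the fraction of the schedule handled by the base unet
    before handing the latents off to the refiner.
    """
    base_steps = int(step_count * refiner_start)
    return base_steps, step_count - base_steps

# e.g. 50 total steps with refinerStart = 0.8 -> base runs 40, refiner runs 10
print(split_steps(50, 0.8))
```

So moving the default from 0.7 to 0.8 hands the refiner the last 20% of steps instead of 30%, matching the observation above about fewer artifacts.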
-if FileManager.default.fileExists(atPath: urls.unetRefinerURL.path) {
-    unetRefiner = Unet(modelAt: urls.unetRefinerURL, configuration: config)
+if FileManager.default.fileExists(atPath: urls.unetRefinerChunk1URL.path) &&
+   FileManager.default.fileExists(atPath: urls.unetRefinerChunk2URL.path) {
+    unetRefiner = Unet(chunksAt: [urls.unetRefinerChunk1URL, urls.unetRefinerChunk2URL],
+                       configuration: config)
 } else {
-    unetRefiner = nil
+    unetRefiner = Unet(modelAt: urls.unetRefinerURL, configuration: config)
Shouldn't unetRefiner be nil when it's not there?
Update: after testing I can confirm that the generation loop fails if there's no refiner in the bundle.
Yes, good catch. I copied this from the controlnet logic but missed the first parts.
-self.unet.unloadResources()
+unet.unloadResources()
This is cool. But if reduceMemory is false, I think both the normal unet and the refiner would be loaded while iterating through the unet. I would suggest we always prewarm the refiner so it's unloaded until needed.
This way the UnetRefiner will always be lazy-loaded, which makes sense because we're always unloading the base unet here every time. @atiorh FYI, I'll update the logic so that the UnetRefiner is always unloaded initially on launch, not just on iOS. It was touched on before, but there might be a use case for a reduceLatency-type setting if the user wants to keep both in memory at all times; that can come later.
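The prewarm-then-unload pattern being discussed can be sketched as follows (a Python stand-in for the Swift resource management, with made-up names; prewarm here just models "compile/cache once, then free memory"):

```python
class LazyModel:
    """Toy model wrapper tracking whether its resources are resident in memory."""
    def __init__(self, name: str):
        self.name = name
        self.loaded = False

    def load(self):
        self.loaded = True

    def unload(self):
        self.loaded = False

    def prewarm(self):
        # Load once so the compiled model is cached, then immediately free it.
        self.load()
        self.unload()

base, refiner = LazyModel("unet"), LazyModel("unetRefiner")
refiner.prewarm()              # refiner stays out of memory until needed
assert not refiner.loaded

base.load()
# ... base denoising steps run up to refinerStart ...
base.unload()                  # free the base unet before the transition
refiner.load()                 # refiner is loaded only when its steps begin
```

Doing the prewarm unconditionally (rather than only when reduceMemory is set) is what keeps the two unets from ever being resident at the same time.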
@@ -270,7 +270,8 @@ public struct StableDiffusionXLPipeline: StableDiffusionPipelineProtocol {
     }

     if reduceMemory {
         unetModel.unloadResources()
+        unetRefiner?.unloadResources()
I'd maybe move this line out of the conditional for the same reason above.
I think we are also missing the hardcoded
swift/StableDiffusion/pipeline/StableDiffusionXL+Resources.swift
… from torch2coreml
Allow a custom VAE to be converted.
Looks great, thanks for the perseverance and great work @ZachNagengast! Merging now and we can address minor issues in subsequent PRs
This PR provides updates for refiner conversion to the torch2coreml script and inference utilities to the StableDiffusionXLPipeline.
How to test
Conversion (standalone):
python -m python_coreml_stable_diffusion.torch2coreml --convert-text-encoder --convert-vae-decoder --convert-vae-encoder --convert-unet --model-version stabilityai/stable-diffusion-xl-refiner-1.0 --compute-unit ALL --attention-implementation ORIGINAL --bundle-resources-for-swift-cli -o <your_output_path> --xl-version
Conversion (base + refiner):
python -m python_coreml_stable_diffusion.torch2coreml --convert-text-encoder --convert-vae-decoder --convert-vae-encoder --convert-unet --model-version stabilityai/stable-diffusion-xl-base-1.0 --refiner-version stabilityai/stable-diffusion-xl-refiner-1.0 --compute-unit ALL --attention-implementation ORIGINAL --bundle-resources-for-swift-cli -o <your_output_path> --xl-version
CLI Inference:
swift run StableDiffusionSample "a photo of an astronaut riding a horse on mars" --resource-path <your_output_path>/Resources --step-count 50 --save-every 5 --strength 0.5 --output-path <your_output_path> --image <your_input_image_path>.png --xl
Summary of the changes
Conversion script:
- Uses DiffusionPipeline, as it now automatically sets up the pipeline based on model name, so it can share all the same logic across the SDXL and SD 1/2 pipelines
- The refiner has no text_encoder, only text_encoder_2, so this handles either of them being missing
- Changed the time_ids input shape from [12] to [2, {5,6}] to better match huggingface/diffusers
- time_ids can be either [2,6] or [2,5] shape to support the refiner's aesthetic_score conditioning

Inference:
- plannerRGBShapedArray() was having major issues with 1024-sized images, so I had to add new logic for converting CGImages to a normalized MLShapedArray with the new toRGBShapedArray()
- Computes the time_ids parameters, although it assumes a square image right now; will need to be updated to support aspect ratios
- Updated encodePrompt() to recognize missing text encoders

Now for the catch:
I noticed that trying to run inference via the CLI wasn't working quite right, and I figured out that it needed the Unet in float32 precision to work. I'm not sure why this happens; the refiner works perfectly fine when running through the Diffusers app.
So the question for this PR: would you prefer to default to float32 for the refiner (adds 4.5GB to the Unet), or disable the CLI for the refiner unless the unet is in float32 precision, and offer it as a conversion argument? Or some other path? Open to any ideas on how best to go about it.
Sample outputs:
Base here done with 50 steps, Refiner done with 0.5 strength (25 steps)
Thank you for your interest in contributing to Core ML Stable Diffusion! Please review CONTRIBUTING.md first. If you would like to proceed with making a pull request, please indicate your agreement to the terms outlined in CONTRIBUTING.md by checking the box below. If not, please go ahead and fork this repo and make your updates.
We appreciate your interest in the project!