
Support for SDXL refiner #227


Merged
merged 33 commits into from Sep 26, 2023
Changes from 23 commits
33 commits
4cd95e1
Initial support for SDXL refiner
ZachNagengast Aug 3, 2023
7c03b1d
Cleanup
ZachNagengast Aug 3, 2023
db154eb
Add arg for converting Unet in float32 precision
ZachNagengast Aug 4, 2023
1d8bfff
Merge branch 'main' into sdxl-refiner
ZachNagengast Aug 8, 2023
d3be8b9
Setup scale factor with pipeline in CLI
ZachNagengast Aug 8, 2023
5f9d50c
Update cli arg and future warning
ZachNagengast Aug 20, 2023
fc204ef
Merge branch 'main' into sdxl-refiner
ZachNagengast Aug 22, 2023
f4142cf
Bundle refiner unet if specified
ZachNagengast Aug 25, 2023
3aa0e23
Merge branch 'main' into sdxl-refiner
ZachNagengast Aug 25, 2023
58b4f62
Update script for bundled refiner
ZachNagengast Aug 28, 2023
137d0c8
Merge branch 'main' into sdxl-refiner
ZachNagengast Aug 28, 2023
e28f812
Flip skip_model_load bool
ZachNagengast Aug 28, 2023
2543c1f
Cleanup
ZachNagengast Aug 28, 2023
b211d2d
Support bundled UnetRefiner
ZachNagengast Aug 29, 2023
fdc0185
Add seperate refiner config value
ZachNagengast Sep 8, 2023
81dd25b
Update readme for SDXL refiner
ZachNagengast Sep 10, 2023
d589ab8
Merge branch 'main' into sdxl-refiner
ZachNagengast Sep 11, 2023
43619e0
Add condition for new SDXL coreml input features
ZachNagengast Sep 12, 2023
352f349
Revert pipeline interface change, add extra logging on pipe load
ZachNagengast Sep 13, 2023
e5724db
Reset model_version after refiner conversion
ZachNagengast Sep 14, 2023
90864bc
Reset model_version before refiner conversion but after pipe init
ZachNagengast Sep 15, 2023
e2e2b16
Add refiner chunking
ZachNagengast Sep 19, 2023
7cb53e8
Ensure unets are unloaded for reduceMemory true
ZachNagengast Sep 19, 2023
84450eb
Handle missing UnetRefiner.mlmodelc on pipeline load
ZachNagengast Sep 19, 2023
1dae882
Prewarm refiner on load, unload on complete
ZachNagengast Sep 19, 2023
ae516e7
Force cpu_and_gpu for VAE until it can be fixed
ZachNagengast Sep 19, 2023
efda893
Include output dtype of np.float32 for all conversions
ZachNagengast Sep 19, 2023
a5f0280
Allow a custom VAE to be converted.
pcuenca Sep 25, 2023
17dec17
Revert hardcoded reduceMemory
ZachNagengast Sep 25, 2023
580bcbd
Merge branch 'main' into sdxl-refiner
ZachNagengast Sep 25, 2023
c69deb7
Fix merge
ZachNagengast Sep 25, 2023
68c39b3
Default chunking arg for --merge-chunks-in-pipeline-model when called…
ZachNagengast Sep 25, 2023
afdfbcd
Merge pull request #2 from pcuenca/custom-vae-version
ZachNagengast Sep 25, 2023
12 changes: 9 additions & 3 deletions README.md

Is --xl-version used for the base model only, or do you need to use it along with --refiner-version for the refiner?

Contributor Author

@jrittvo That's a good question, I suppose it would work for a non-xl model in the same way. All it does is convert the unet for the provided model and rename it as the refiner for that base model. The only real requirement is that both are the same kind of model so the latents match up. That covers the conversion, but the pipeline on the Swift side has nothing to handle a UnetRefiner.mlmodelc for non-XL models, so it would just be ignored.

Contributor Author

To be more specific though: yes, the --xl-version flag should be used along with --refiner-version in order to convert the refiner in the same way as the --model-version input.

@@ -209,10 +209,11 @@ An example `<selected-recipe-string-key>` would be `"recipe_4.50_bit_mixedpalett
e.g.:

```bash
python -m python_coreml_stable_diffusion.torch2coreml --convert-unet --convert-vae-decoder --convert-text-encoder --xl-version --model-version stabilityai/stable-diffusion-xl-base-1.0 --bundle-resources-for-swift-cli --attention-implementation ORIGINAL -o <output-dir>
python -m python_coreml_stable_diffusion.torch2coreml --convert-unet --convert-vae-decoder --convert-text-encoder --xl-version --model-version stabilityai/stable-diffusion-xl-base-1.0 --refiner-version stabilityai/stable-diffusion-xl-refiner-1.0 --bundle-resources-for-swift-cli --attention-implementation ORIGINAL -o <output-dir>
```

- `--xl-version`: Additional argument to pass to the conversion script when specifying an XL model
- `--refiner-version`: Additional argument to pass to the conversion script when specifying an XL refiner model, required for ["Ensemble of Expert Denoisers"](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#1-ensemble-of-expert-denoisers) inference.
- `--attention-implementation ORIGINAL` (recommended for `cpuAndGPU`)
- Due to known float16 overflow issues in the VAE, it runs in float32 precision for now
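The linked "Ensemble of Expert Denoisers" scheme hands the first fraction of the denoising steps to the base unet and the remainder to the refiner unet. As a rough illustration of the split (the function name and rounding behavior here are assumptions for the sketch, not this repo's API):

```python
def split_denoising_steps(total_steps: int, refiner_start: float) -> tuple[int, int]:
    """Split inference steps between the base and refiner unets.

    refiner_start is the fraction of steps run on the base model,
    so higher values leave fewer steps for the refiner.
    """
    if not 0.0 <= refiner_start <= 1.0:
        raise ValueError("refiner_start must be between 0 and 1")
    base_steps = round(total_steps * refiner_start)
    return base_steps, total_steps - base_steps

# e.g. 50 steps with a refiner start fraction of 0.8:
# the base unet runs 40 steps and the refiner runs the final 10.
```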

@@ -225,7 +226,7 @@ swift run StableDiffusionSample <prompt> --resource-path <output-mlpackages-dire
```

- Only `--compute-units cpuAndGPU` is supported for now
- Only the `base` model is supported, `refiner` model is not yet supported
- Only the `base` model is required; the `refiner` model is optional and is used by default if present in the resource directory
- ControlNet for XL is not yet supported


@@ -365,6 +366,7 @@ This generally takes 15-20 minutes on an M1 MacBook Pro. Upon successful executi

- `--model-version`: The model version name as published on the [Hugging Face Hub](https://huggingface.co/models?search=stable-diffusion)

- `--refiner-version`: The refiner version name as published on the [Hugging Face Hub](https://huggingface.co/models?search=stable-diffusion). This argument is optional; if specified, the refiner unet is converted and bundled alongside the model unet.

- `--bundle-resources-for-swift-cli`: Compiles all 4 models and bundles them along with necessary resources for text tokenization into `<output-mlpackages-directory>/Resources` which should be provided as input to the Swift package. This flag is not necessary for the diffusers-based Python pipeline.

@@ -439,7 +441,7 @@ This Swift package contains two products:

Both of these products require the Core ML models and tokenization resources to be supplied. When specifying resources via a directory path that directory must contain the following:

- `TextEncoder.mlmodelc` (text embedding model)
- `TextEncoder.mlmodelc` or `TextEncoder2.mlmodelc` (text embedding model)
- `Unet.mlmodelc` or `UnetChunk1.mlmodelc` & `UnetChunk2.mlmodelc` (denoising autoencoder model)
- `VAEDecoder.mlmodelc` (image decoder model)
- `vocab.json` (tokenizer vocabulary file)
@@ -453,6 +455,10 @@ Optionally, it may also include the safety checker model that some versions of S

- `SafetyChecker.mlmodelc`

Optionally, for the SDXL refiner:

- `UnetRefiner.mlmodelc` (refiner unet model)

Optionally, for ControlNet:

- `ControlledUNet.mlmodelc` or `ControlledUnetChunk1.mlmodelc` & `ControlledUnetChunk2.mlmodelc` (enabled to receive ControlNet values)
154 changes: 109 additions & 45 deletions python_coreml_stable_diffusion/torch2coreml.py

Large diffs are not rendered by default.

58 changes: 54 additions & 4 deletions swift/StableDiffusion/pipeline/CGImage+vImage.swift
@@ -4,6 +4,7 @@
import Foundation
import Accelerate
import CoreML
import CoreGraphics

@available(iOS 16.0, macOS 13.0, *)
extension CGImage {
@@ -77,7 +78,7 @@ extension CGImage {
else {
throw ShapedArrayError.incorrectFormatsConvertingToShapedArray
}

var sourceImageBuffer = try vImage_Buffer(cgImage: self)

var mediumDestination = try vImage_Buffer(width: Int(width), height: Int(height), bitsPerPixel: mediumFormat.bitsPerPixel)
@@ -88,7 +89,7 @@
nil,
vImage_Flags(kvImagePrintDiagnosticsToConsole),
nil)

guard let converter = converter?.takeRetainedValue() else {
throw ShapedArrayError.vImageConverterNotInitialized
}
@@ -99,7 +100,7 @@
var destinationR = try vImage_Buffer(width: Int(width), height: Int(height), bitsPerPixel: 8 * UInt32(MemoryLayout<Float>.size))
var destinationG = try vImage_Buffer(width: Int(width), height: Int(height), bitsPerPixel: 8 * UInt32(MemoryLayout<Float>.size))
var destinationB = try vImage_Buffer(width: Int(width), height: Int(height), bitsPerPixel: 8 * UInt32(MemoryLayout<Float>.size))

var minFloat: [Float] = Array(repeating: minValue, count: 4)
var maxFloat: [Float] = Array(repeating: maxValue, count: 4)

@@ -125,7 +126,56 @@
let imageData = redData + greenData + blueData

let shapedArray = MLShapedArray<Float32>(data: imageData, shape: [1, 3, self.height, self.width])


return shapedArray
}

private func normalizePixelValues(pixel: UInt8) -> Float {
return (Float(pixel) / 127.5) - 1.0
}

public func toRGBShapedArray(minValue: Float, maxValue: Float)
throws -> MLShapedArray<Float32> {
let image = self
let width = image.width
let height = image.height
let alphaMaskValue: Float = minValue

guard let colorSpace = CGColorSpace(name: CGColorSpace.sRGB),
let context = CGContext(data: nil, width: width, height: height, bitsPerComponent: 8, bytesPerRow: 4 * width, space: colorSpace, bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue),
let ptr = context.data?.bindMemory(to: UInt8.self, capacity: width * height * 4) else {
return []
}

context.draw(image, in: CGRect(x: 0, y: 0, width: width, height: height))

var redChannel = [Float](repeating: 0, count: width * height)
var greenChannel = [Float](repeating: 0, count: width * height)
var blueChannel = [Float](repeating: 0, count: width * height)

for y in 0..<height {
for x in 0..<width {
let i = 4 * (y * width + x)
if ptr[i+3] == 0 {
// Alpha mask for controlnets
redChannel[y * width + x] = alphaMaskValue
Contributor
@atiorh Aug 15, 2023

@vzsg had specified a similar convention in #209 for ControlNet inpainting. Tagging for visibility to ensure we are not moving in a different direction with this potential switch from plannerRGBShapedArray to toRGBShapedArray.

Contributor Author
I attempted to replicate it with this new function, and the plannerRGBShapedArray function is still in use for the controlnet conditioning. Would be great to get another eye on it to confirm it's working as intended.

greenChannel[y * width + x] = alphaMaskValue
blueChannel[y * width + x] = alphaMaskValue
} else {
redChannel[y * width + x] = normalizePixelValues(pixel: ptr[i])
greenChannel[y * width + x] = normalizePixelValues(pixel: ptr[i+1])
blueChannel[y * width + x] = normalizePixelValues(pixel: ptr[i+2])
}
}
}

let colorShape = [1, 1, height, width]
let redShapedArray = MLShapedArray<Float32>(scalars: redChannel, shape: colorShape)
let greenShapedArray = MLShapedArray<Float32>(scalars: greenChannel, shape: colorShape)
let blueShapedArray = MLShapedArray<Float32>(scalars: blueChannel, shape: colorShape)

let shapedArray = MLShapedArray<Float32>(concatenating: [redShapedArray, greenShapedArray, blueShapedArray], alongAxis: 1)

return shapedArray
}
}
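For reference, a rough NumPy analogue of the `toRGBShapedArray` conversion in the diff above (an illustrative sketch, not code from this PR): it maps 8-bit RGBA pixels into [-1, 1] float planes shaped `(1, 3, H, W)`, writing the mask value wherever alpha is zero, matching the `pixel / 127.5 - 1.0` normalization.

```python
import numpy as np

def to_rgb_planes(rgba: np.ndarray, mask_value: float = -1.0) -> np.ndarray:
    """Convert uint8 RGBA pixels (H, W, 4) to float32 planes (1, 3, H, W).

    Fully transparent pixels are replaced with mask_value, mirroring
    the alpha-mask convention used for controlnets.
    """
    rgb = rgba[..., :3].astype(np.float32) / 127.5 - 1.0
    transparent = rgba[..., 3] == 0
    rgb[transparent] = mask_value  # masks all three channels at once
    # (H, W, 3) -> (3, H, W) -> (1, 3, H, W), i.e. planar RGB layout
    return np.transpose(rgb, (2, 0, 1))[np.newaxis]
```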
2 changes: 1 addition & 1 deletion swift/StableDiffusion/pipeline/Encoder.swift
@@ -93,7 +93,7 @@ public struct Encoder: ResourceManaging {

var inputDescription: MLFeatureDescription {
try! model.perform { model in
model.modelDescription.inputDescriptionsByName["z"]!
model.modelDescription.inputDescriptionsByName.first!.value
Contributor
This preserves backwards compatibility so I am not concerned about the rename and thanks for the semantically correct fix :)

Contributor
No need to update, just appreciating the change

}
}

@@ -20,8 +20,14 @@ public struct PipelineConfiguration: Hashable {
public var negativePrompt: String = ""
/// Starting image for image2image or in-painting
public var startingImage: CGImage? = nil
//public var maskImage: CGImage? = nil
/// Fraction of inference steps to be used in `.imageToImage` pipeline mode
/// Must be between 0 and 1
/// Higher values will result in greater transformation of the `startingImage`
public var strength: Float = 1.0
/// Fraction of inference steps at which to start using the refiner unet, if present, in `textToImage` mode
/// Must be between 0 and 1
/// Higher values will result in fewer refiner steps
public var refinerStart: Float = 0.8
/// Number of images to generate
public var imageCount: Int = 1
/// Number of inference steps to perform
@@ -44,7 +50,19 @@
public var encoderScaleFactor: Float32 = 0.18215
/// Scale factor to use on the latent before decoding
public var decoderScaleFactor: Float32 = 0.18215

/// If `originalSize` is not the same as `targetSize` the image will appear to be down- or upsampled.
Contributor
@atiorh Aug 15, 2023

I am inclined to recommend a dedicated StableDiffusionXLPipeline.Configuration.swift as quite a few things on the input conditioning and pipeline are different. @ZachNagengast & @pcuenca What do you think?

Contributor Author

Yea, I'd agree it probably shouldn't be a requirement on the caller to set the scale factor properly, especially when we already know the pipeline is SDXL and can set the default properly. My question would be whether we'd want a new class for this or just setup the defaults somewhere in the pipeline. Or potentially a broader option of just including the config files with the converted models directly, alongside the merges.txt and vocab.json, and using those for the default config.

Contributor

potentially a broader option of just including the config files with the converted models directly

That would be nice and we can even make that backward compatible. However, it still wouldn't address the argument set mismatch across XL and non-XL right? IMO, the easiest is to create 2 (XL and non-XL) default configs and then add your proposal from above as an extension.

Contributor Author

Yea that makes sense to me, and seems to be the approach Diffusers likes to take. My only nitpick is the repeated code, but it can be consolidated a fair amount by committing to the two pipelines and pulling out shared code as needed.

Contributor

I agree, we can refactor over time :)

Contributor

I agree with the approach discussed, separate configs do seem necessary to make usage simpler.

/// Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952.
public var originalSize: Float32 = 1024
/// `cropsCoordsTopLeft` can be used to generate an image that appears to be “cropped” from the position `cropsCoordsTopLeft` downwards.
/// Favorable, well-centered images are usually achieved by setting `cropsCoordsTopLeft` to (0, 0).
public var cropsCoordsTopLeft: Float32 = 0
/// For most cases, `target_size` should be set to the desired height and width of the generated image.
public var targetSize: Float32 = 1024
/// Used to simulate an aesthetic score of the generated image by influencing the positive text condition.
public var aestheticScore: Float32 = 6
/// Can be used to simulate an aesthetic score of the generated image by influencing the negative text condition.
public var negativeAestheticScore: Float32 = 2.5

/// Given the configuration, what mode will be used for generation
public var mode: PipelineMode {
guard startingImage != nil else {
2 changes: 2 additions & 0 deletions swift/StableDiffusion/pipeline/StableDiffusionPipeline.swift
@@ -24,7 +24,9 @@ public enum StableDiffusionRNG {
}

public enum PipelineError: String, Swift.Error {
case missingUnetInputs
case startingImageProvidedWithoutEncoder
case startingText2ImgWithoutTextEncoder
case unsupportedOSVersion
}

25 changes: 24 additions & 1 deletion swift/StableDiffusion/pipeline/StableDiffusionXL+Resources.swift
@@ -15,6 +15,9 @@ public extension StableDiffusionXLPipeline {
public let unetURL: URL
public let unetChunk1URL: URL
public let unetChunk2URL: URL
public let unetRefinerURL: URL
public let unetRefinerChunk1URL: URL
public let unetRefinerChunk2URL: URL
public let decoderURL: URL
public let encoderURL: URL
public let vocabURL: URL
@@ -26,6 +29,9 @@
unetURL = baseURL.appending(path: "Unet.mlmodelc")
unetChunk1URL = baseURL.appending(path: "UnetChunk1.mlmodelc")
unetChunk2URL = baseURL.appending(path: "UnetChunk2.mlmodelc")
unetRefinerURL = baseURL.appending(path: "UnetRefiner.mlmodelc")
unetRefinerChunk1URL = baseURL.appending(path: "UnetRefinerChunk1.mlmodelc")
unetRefinerChunk2URL = baseURL.appending(path: "UnetRefinerChunk2.mlmodelc")
decoderURL = baseURL.appending(path: "VAEDecoder.mlmodelc")
encoderURL = baseURL.appending(path: "VAEEncoder.mlmodelc")
vocabURL = baseURL.appending(path: "vocab.json")
@@ -51,7 +57,12 @@
/// Expect URL of each resource
let urls = ResourceURLs(resourcesAt: baseURL)
let tokenizer = try BPETokenizer(mergesAt: urls.mergesURL, vocabularyAt: urls.vocabURL)
let textEncoder = TextEncoderXL(tokenizer: tokenizer, modelAt: urls.textEncoderURL, configuration: config)
let textEncoder: TextEncoderXL?
if FileManager.default.fileExists(atPath: urls.textEncoderURL.path) {
textEncoder = TextEncoderXL(tokenizer: tokenizer, modelAt: urls.textEncoderURL, configuration: config)
} else {
textEncoder = nil
}

// padToken is different in the second XL text encoder
let tokenizer2 = try BPETokenizer(mergesAt: urls.mergesURL, vocabularyAt: urls.vocabURL, padToken: "!")
@@ -67,6 +78,17 @@
unet = Unet(modelAt: urls.unetURL, configuration: config)
}

// Refiner Unet model
let unetRefiner: Unet?
if FileManager.default.fileExists(atPath: urls.unetRefinerChunk1URL.path) &&
FileManager.default.fileExists(atPath: urls.unetRefinerChunk2URL.path) {
unetRefiner = Unet(chunksAt: [urls.unetRefinerChunk1URL, urls.unetRefinerChunk2URL],
configuration: config)
} else {
unetRefiner = Unet(modelAt: urls.unetRefinerURL, configuration: config)
}


// Image Decoder
let decoder = Decoder(modelAt: urls.decoderURL, configuration: config)

@@ -83,6 +105,7 @@
textEncoder: textEncoder,
textEncoder2: textEncoder2,
unet: unet,
unetRefiner: unetRefiner,
decoder: decoder,
encoder: encoder,
reduceMemory: reduceMemory