sebastiankamph

Improve text and prompt understanding further in ComfyUI

Added 2024-09-04 20:41:41 +0000 UTC

Video here: https://youtu.be/xcR-tzLi_7Y

There's a small node called CLIPAttentionMultiply that changes how the text of your prompt is understood. It lets you scale the q, k, v and out values of the text encoder's attention by a multiplication factor (default 1).

Think of it like this: you can increase how important your prompt is. With the values shown here, we're increasing prompt understanding by a small margin, but one large enough to make a difference without breaking the images.

QKV? Query - Key - Value

q (query): Weight of tokens as they influence each other in a sentence
k (key): Weight of tokens of input text. A token is usually a word
v (value): Strength of attention to input tokens
out: Strength of output
Learn more about attention here: https://www.youtube.com/watch?v=tIvKXrEDMhk
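
To make the four multipliers a bit more concrete, here is a minimal sketch of attention in plain Python. It is not ComfyUI's code: it assumes the node simply scales the q, k, v and out projections, and it uses made-up two-number token vectors and a hypothetical attention() helper instead of real CLIP weights.

```python
import math

def softmax(xs):
    # Turn raw scores into weights that sum to 1.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, q=1.0, k=1.0, v=1.0, out=1.0):
    """Toy single-head attention; q, k, v, out are the multipliers (assumed behaviour)."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for query in queries:
        # Score every key against this query; q and k scale the two sides.
        scores = [
            sum((q * a) * (k * b) for a, b in zip(query, key)) / math.sqrt(d)
            for key in keys
        ]
        weights = softmax(scores)
        # Blend the (scaled) values by their weights, then scale the result by out.
        mixed = [
            out * sum(w * (v * value[i]) for w, value in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy "token" vectors (illustrative numbers only).
tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

base_out, base = attention(tokens, tokens, tokens)
_, boosted = attention(tokens, tokens, tokens, q=1.5, k=1.5)
doubled, _ = attention(tokens, tokens, tokens, v=2.0)

print("default weights for token 0:", base[0])
print("q=1.5, k=1.5 weights for token 0:", boosted[0])
```

In this toy version, raising q or k makes the weights more peaked, so the best-matching token gets more of the attention, while v and out just scale the strength of the result. The real text encoder has many layers and heads, so treat this as intuition only.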

Imagine you are drawing a picture with a friend, and your friend is helping you by telling you what to draw based on a description (the prompt).

Query: This is like you asking, “What should I draw next?” You focus on one part of the description, like "a cat sitting under a tree."
Key: These are all the different parts of the description your friend has in mind, like “cat,” “sitting,” “under,” and “tree.” Each key is like a clue about what’s important in the picture.
Value: These are the actual details your friend gives you when a key matches. For example, if the key is “cat,” the value might be “a small, fluffy cat with green eyes.” The values are the specific things you need to draw.

So, when you ask "What should I draw next?" (query), your friend checks all the clues they have (keys) and tells you the right details (values) to add to your picture. The AI does this to create the right parts of an image based on the input you give it!

Comparison

The comparison image below shows 10 generations using default values, CLIPAttentionMultiply, and CLIPAttentionMultiply + New Text Encoder (as per previous Patreon post https://www.patreon.com/posts/even-better-text-111362478). While there is still some randomness in the generations, both CLIPAttentionMultiply and CLIPAttentionMultiply + New Text Encoder perform better than the default values.

Generations were made with the Flux Dev NF4 model at FP16.
Download the high-res comparison at the bottom.

TLDR: Use CLIPAttentionMultiply. Connect it between your model loader's CLIP output and your text prompt node. Also use the New Text Encoder (ViT-L-14-TEXT-detail-improved-hiT-GmP-TE-only-HF.safetensors). See https://www.patreon.com/posts/even-better-text-111362478
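
If you prefer to see the wiring written out, here is a rough sketch of that part of the graph as a Python dict in ComfyUI's API (JSON) workflow style. Treat it as an assumption-heavy illustration: build_text_branch is a hypothetical helper, the T5 file name is a guess at what you have installed, using DualCLIPLoader is only one possible way to load the text encoders, and the multipliers are left at the default 1.0 (no change), so swap in the values from the workflow.

```python
def build_text_branch(prompt, q=1.0, k=1.0, v=1.0, out=1.0):
    """Sketch of: CLIP loader -> CLIPAttentionMultiply -> text prompt node."""
    return {
        "1": {
            "class_type": "DualCLIPLoader",
            "inputs": {
                # Assumed file name; use whatever T5 encoder you have installed.
                "clip_name1": "t5xxl_fp16.safetensors",
                # The New Text Encoder named in this post.
                "clip_name2": "ViT-L-14-TEXT-detail-improved-hiT-GmP-TE-only-HF.safetensors",
                "type": "flux",
            },
        },
        "2": {
            "class_type": "CLIPAttentionMultiply",
            # ["1", 0] means "output 0 of node 1", i.e. the loaded CLIP.
            "inputs": {"clip": ["1", 0], "q": q, "k": k, "v": v, "out": out},
        },
        "3": {
            "class_type": "CLIPTextEncode",
            # The prompt node reads the patched CLIP, not the loader directly.
            "inputs": {"clip": ["2", 0], "text": prompt},
        },
    }

graph = build_text_branch("a cat sitting under a tree")
for node_id, node in graph.items():
    print(node_id, node["class_type"])
```

The point is only the order: the prompt node takes its CLIP input from CLIPAttentionMultiply, which in turn takes it from the loader.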