
[WIP] Add basic transforms for multimodal #1141

Draft · wants to merge 2 commits into base: main
Conversation

@RdoubleA (Contributor) commented Jul 3, 2024

No description provided.

pytorch-bot bot commented Jul 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1141

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures, 1 Cancelled Job, 4 Unrelated Failures

As of commit 5f7e8aa with merge base f158577:


This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 3, 2024
@felipemello1 (Contributor) left a comment

It looks great! I left a few comments, mostly around naming, since tiles/patches/images are so confusing in general. I think this class would benefit from a small visual example too.

Regarding the interval parts, IMO we should try to make the operations as explicit as possible: instead of tok1, tok2, i, vision_mask[0], we should give them descriptive names like idx_start, idx_end, etc.

    participate in cross-attention with an image token will show True in the mask
    and follow these rules:
    1) Text tokens immediately following the image token up until the next image token
    2) Consecutive image tokens attend to all subsequent text tokens
Contributor

Can we make it more visual? Something like this: https://github.com/huggingface/transformers/blob/60bb571e993b7d73257fb64044726b569fef9403/src/transformers/models/llava_next/modeling_llava_next.py#L446

Or add a link to the paper + the page where they have an image for it.
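For what it's worth, here is a tiny runnable sketch of the two rules quoted above; the IMAGE sentinel, the toy sequence, and the (text_seq_len, n_images) mask shape are all made up for illustration, not taken from the PR:

```python
import torch

IMAGE = -1  # hypothetical sentinel marking an image token
# toy sequence: image, text, text, image, image, text
tokens = [IMAGE, 101, 102, IMAGE, IMAGE, 103]

image_pos = [i for i, t in enumerate(tokens) if t == IMAGE]  # [0, 3, 4]

# For each image token, the interval of token positions that attend to it
intervals = []
for n, start in enumerate(image_pos):
    is_last = n == len(image_pos) - 1
    next_is_adjacent = not is_last and image_pos[n + 1] == start + 1
    if is_last or next_is_adjacent:
        # rule 2: images in a consecutive run attend to all subsequent tokens
        intervals.append((start, len(tokens)))
    else:
        # rule 1: attended to by following text up until the next image token
        intervals.append((start, image_pos[n + 1]))

# boolean mask of shape (text_seq_len, n_images): mask[i, j] says whether
# token i participates in cross-attention with image j
mask = torch.zeros(len(tokens), len(image_pos), dtype=torch.bool)
for j, (start, end) in enumerate(intervals):
    mask[start:end, j] = True
print(mask.int())
```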


class CrossAttentionMask(Transform):
    """
    Computes the cross-attention mask for text + image inputs. Text tokens that
@felipemello1 (Contributor) commented Jul 3, 2024

If this CrossAttentionMask is specific to text + images, should we indicate that in the name? Something like MultimodalCrossAttentionMask or VisionTextCrossAttentionMask?

Contributor (Author)

Yeah that's a good point. Will rename

    text sequence.

    Args:
        num_patches (int): Number of patches per image, excluding class token.
Contributor

I wonder if we should add a link to modules/VisionTransformer for an in-depth explanation of what num_patches means. For better clarity, would it make sense to rename it num_patches_per_tile, since later we multiply it by n_tiles?

If we say "number of patches per image", it may be confusing, because an image can have a variable number of patches.

Later on you say:

single_image_seq_len = n_tiles * self.num_patches + 1
image_seq_len = single_image_seq_len * n_img

So tile != image. An image is a set of tiles.
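To make the tile vs. image distinction concrete, a small sketch with made-up numbers (tile_size=224, patch_size=14, and a 4-tile image are assumptions, not values from the PR):

```python
patches_per_tile = (224 // 14) ** 2  # 256 patches in each fixed-size tile
n_tiles = 4                          # one image split into 4 tiles
single_image_seq_len = n_tiles * patches_per_tile + 1  # 1025, +1 for the class token
```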

Contributor (Author)

Ah sorry, probably my shallow understanding of patches/images/tiles. What I intended was num_patches per tile. If it makes sense, I'd like to keep the name consistent with your vision transformer (either patch_grid_size or patch_size maybe?), whichever parameter you use to compute the number of patches.

I also assumed that at this point tiles are padded to the max across all the images. Is this incorrect? Where does the padding happen?

@felipemello1 (Contributor) commented Jul 3, 2024

patches_per_tile is a fixed size, and it's calculated as (tile_size // patch_size)**2.

What I did in VisionTransformer was to ask the user to pass tile_size and patch_size, and I calculated it for them. The VisionTransformer has a helper function that saves this value: https://github.com/felipemello1/torchtune/blob/f683812626ad4559464840112ddce516487bea5c/torchtune/modules/vision_transformer.py#L249

Maybe get it from the model, or ask for tile_size and patch_size, to avoid user confusion?
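A rough sketch of that suggestion; the class name and constructor signature here are hypothetical, not the PR's actual API:

```python
class VisionTextCrossAttentionMask:
    """Hypothetical constructor: take tile_size and patch_size and derive
    patches_per_tile, instead of asking the user for num_patches directly."""

    def __init__(self, tile_size: int, patch_size: int, image_token_id: int):
        # fixed per-tile patch count: (tile_size // patch_size) ** 2
        self.patches_per_tile = (tile_size // patch_size) ** 2
        self.image_token_id = image_token_id
```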

pass


class Compose(Transform):
Contributor

Since the torchvision Compose has different behavior, I wonder if it makes sense to change Compose to something else, so users don't get confused with tv.Compose. Maybe "ComposeTransforms"?

Contributor (Author)

How about Pipeline?

Contributor

Just for my own understanding: is the main difference from torchvision Compose that we support multiple inputs and multiple outputs here? Could we not just use torchvision Compose with a single dict?

Contributor

I tried naming something Pipeline, and Kartikay said it would confuse people, because the name is also used by other libraries :P. sklearn, I guess?
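For context on the thread above, a hypothetical sketch of the difference being discussed: torchvision.transforms.Compose threads a single positional input through each transform, while the Compose here needs to thread several named fields (tokens, images, masks, ...). The name and shape of this class are made up:

```python
from typing import Any, Callable, Dict, List


class ComposeTransforms:
    """Hypothetical sketch: chain transforms that consume and emit keyword
    arguments, unlike torchvision.transforms.Compose, which threads a single
    positional input through the chain."""

    def __init__(self, transforms: List[Callable[..., Dict[str, Any]]]):
        self.transforms = transforms

    def __call__(self, **kwargs: Any) -> Dict[str, Any]:
        for transform in self.transforms:
            kwargs = transform(**kwargs)
        return kwargs
```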

"""
Returns a list of tuples of the form (start, end) where start is the index
of the current image token and end is the index of the next image token, exclusive.
If the image token attends until the end of the sequence, end will be -1.
Contributor

nit: should we add Args:, Returns:, and Examples: sections?
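For reference, a hypothetical sketch matching the docstring above (the function name and signature are made up):

```python
from typing import List, Tuple


def get_image_token_intervals(
    tokens: List[int], image_token_id: int
) -> List[Tuple[int, int]]:
    """Hypothetical sketch of the behavior described in the docstring above."""
    # indices of every image token in the sequence
    locations = [i for i, tok in enumerate(tokens) if tok == image_token_id]
    # pair each image token index with the next one (end is exclusive)
    intervals = [(start, end) for start, end in zip(locations[:-1], locations[1:])]
    if locations:
        # the last image token attends until the end of the sequence
        intervals.append((locations[-1], -1))
    return intervals


print(get_image_token_intervals([9, 1, 2, 9, 3], image_token_id=9))
# [(0, 3), (3, -1)]
```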

vision_masks = [
    [tok1, tok2]
    for tok1, tok2 in zip(
        vision_token_locations[:-1], vision_token_locations[1:]
Contributor

nit: Maybe add a comment like "offset by one and zip to get consecutive indices".

I also guess renaming tok1, tok2 to something like tok_idx_prev, tok_idx_next would be more intuitive.
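Applying both suggestions to the quoted snippet might look like this (vision_token_locations is stubbed with toy values so the snippet runs):

```python
vision_token_locations = [0, 3, 4]  # toy indices of image tokens

# offset by one and zip to get consecutive image-token index pairs
vision_masks = [
    [tok_idx_prev, tok_idx_next]
    for tok_idx_prev, tok_idx_next in zip(
        vision_token_locations[:-1], vision_token_locations[1:]
    )
]
print(vision_masks)  # [[0, 3], [3, 4]]
```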


def __call__(self, *, tokens, images, **kwargs):
# We are still at sample level pre-collating
n_img, n_tiles, _, _, _ = images.shape
Contributor

nit: Maybe add a comment with the other dimensions, so we know what they are, but keep the "_" so we know they are not used?

You said "# We are still at sample level pre-collating". So is n_img == bsz? If so, for consistency with VisionTransformer, should we rename it?

Contributor

Yeah, please add type and shape info for the arguments to __call__.
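Something like this hypothetical docstring (the dimension names are guesses based on the unpacking above, not the PR's documented shapes):

```python
def __call__(self, *, tokens, images, **kwargs):
    """
    Args:
        tokens (List[int]): tokenized text sequence of length text_seq_len.
        images (torch.Tensor): image tiles with shape
            (n_img, n_tiles, n_channels, tile_height, tile_width); only the
            first two dimensions are used here.
    """
```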

# We are still at sample level pre-collating
n_img, n_tiles, _, _, _ = images.shape
text_seq_len = len(tokens)
single_image_seq_len = n_tiles * self.num_patches + 1
Contributor

nit: maybe add a comment explaining that the +1 is for CLS, if that's the case.
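i.e., something along these lines, as an annotated restatement of the quoted hunk (assuming the +1 really is the class embedding; that's the thread's guess, not confirmed here):

```python
# each image contributes n_tiles * patches_per_tile patch embeddings,
# plus one class (CLS) embedding -- hence the +1 (assumption, per the thread)
single_image_seq_len = n_tiles * self.num_patches + 1
```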

Comment on lines +136 to +138
image_num
* single_image_seq_len : (image_num + 1)
* single_image_seq_len,
Contributor

nit: split the line differently if the linter allows it; as written, this is confusing.
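One hypothetical restructuring, naming the slice bounds (text_start and text_end stand in for the part of the mask indexing not shown in the hunk):

```python
# name the image-sequence slice bounds instead of splitting the
# multiplication across lines
image_seq_start = image_num * single_image_seq_len
image_seq_end = (image_num + 1) * single_image_seq_len
mask[text_start:text_end, image_seq_start:image_seq_end] = True
```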

* single_image_seq_len,
] = True

kwargs.update({"encoder_mask": mask, "tokens": tokens, "images": images})
Contributor

Why do we need to also update with tokens and images? Isn't this a no-op for those args?

Contributor (Author)

Since they are explicit keyword args, they get unpacked out of kwargs, so you have to add them back in.
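A standalone illustration of that Python behavior (the field values here are toys):

```python
def transform(*, tokens, images, **kwargs):
    # tokens and images bind to the named parameters, so kwargs no longer
    # contains them; re-add them before passing the dict down the chain
    kwargs.update({"tokens": tokens, "images": images, "encoder_mask": "mask"})
    return kwargs


out = transform(tokens=[1, 2], images="img", color="red")
print(out)  # {'color': 'red', 'tokens': [1, 2], 'images': 'img', 'encoder_mask': 'mask'}
```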
