Details about GPTV generation? #7

Closed
hkunzhe opened this issue May 20, 2024 · 3 comments

Comments

hkunzhe commented May 20, 2024

Thank you for sharing this valuable dataset with the community! Could you please explain how the GPTV annotation process differs from the one described in Sec. 2.3?

youthHan (Collaborator) commented May 20, 2024

We use the same multi-shot prompt to organize all the visual tokens, the ASR texts, and the additional paired speaker diarization, plus extra text prompts to constrain GPTV to generate consistent and detailed texts. I can provide the prompt texts now, but the generation code may come quite a bit later.

prompt = "You are a chatbot that conducts conversations based on video contexts. These are frames of a short video. You mainly answer based on the given frames. You can also answer the relevant knowledge of the person or object contained in the video. The video has a high-level topic and the video content is supposed to be coherent. The video can have more than one shot and in each shot different action segments and events exist.  Do not include details that you are not sure of. \nPlease note that some speakers in the audio appear in the video, of whom the speech content should be described in the shot. Please also note that some speakers may not appear in the video, who may be background voice or camera holders, of whom the speech content should be described as narrator or background voice. "

# Assemble the multimodal content list: the instruction prompt, then per-shot text, frames and ASR.
video_shots_base64_frames = [{"type":"text","text":prompt}, {"type":"text","text":f"The video has {num_shots} shots. Each shots may contain multiple actions, scenes and subjects. "}]
for i,shot in enumerate(all_shots):
    # shot_idx_mapping, start_duration, shot_duration and base64Frames come from the per-shot metadata.
    video_shots_base64_frames.append({"type":"text","text":f"The {shot_idx_mapping[i]} shot starts from {start_duration}s to {start_duration+shot_duration}s. It contains frames: "})
    # Sample at least 4 frames per shot (roughly one per second) and attach them as low-detail images.
    video_shots_base64_frames.extend(map(lambda x: {"type":"image_url","image_url": {"url": f"data:image/jpeg;base64,{x}", "detail": "low"}},
                    sample_frames(base64Frames, max(4, int(shot_duration)))))
    video_shots_base64_frames.append({"type":"text","text":f"The ASR contained in this shot is: {video_anno['ASR'][i]}"})

# The remaining instructions are wrapped as text parts as well, so the whole list is a valid multimodal message content.
video_shots_base64_frames.append({"type":"text","text":f"The ASR content in the full video is: {video_anno['whole_ASR']}. Note that the speech content may related to the vision content. "})
video_shots_base64_frames.append({"type":"text","text":video_anno['speaker']})
video_shots_base64_frames.append({"type":"text","text":"Please create a detailed description that outlines the key actions and components. Please describe the appearance, clothing, and surrounding environment of the characters. Also, please describe the appearance and characteristics of key objects. When multiple people and objects appear, please describe them all and make their descriptions as unique as possible. "})
video_shots_base64_frames.append({"type":"text","text":"You should ensure the description is in narrative style and third-person view. You should describe the video coherently."})
video_shots_base64_frames.append({"type":"text","text":"""You should describe and include the speech content into each video shot. You should be aware that the speakers in the audio may not appear in the video. If the speakers in the audio does not appear in the video, you should still mention the speech content if it is related to the visual content and topic. The speech content should be properly rephrased from its original ASR texts, if possible. You should ignore the incomplete speech content. Do not mention the words of "ASR". DO NOT include a conclusion of ASR or speakers. """})
video_shots_base64_frames.append({"type":"text","text":f"""You should not mention duration of videos and video shots. You should not mention the number of scenes, but you ought to describe the content changes and transition. You ought to describe the content in happening and reasoning order. You should not create a video plot out of nothing. Please describe the video coherently. You should not include a separate conclusion paragraph."""})
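
For reference, here is a minimal sketch of how the assembled list could then be sent to the model. The sample_frames helper, the model name, and the request parameters below are assumptions for illustration only, not the actual generation code (which, as noted above, may be released later).

import openai

def sample_frames(frames, n):
    # Hypothetical helper: uniformly sample up to n base64-encoded frames from a shot.
    if len(frames) <= n:
        return frames
    step = len(frames) / n
    return [frames[int(i * step)] for i in range(n)]

client = openai.OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model name
    messages=[{"role": "user", "content": video_shots_base64_frames}],
    max_tokens=1024,
)
video_summary = response.choices[0].message.content  # the generated multi-shot video description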

hkunzhe closed this as completed May 21, 2024
hkunzhe reopened this May 28, 2024

hkunzhe commented May 28, 2024

@youthHan Hi, there are many 'Placeholder for audio/visual caption' entries in https://huggingface.co/datasets/mhan/Shot2Story-134K/blob/main/90k_gptv_train.json. What do these mean? Are there individual captions corresponding to each shot?
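
(A quick way to check how widespread these placeholders are: the snippet below just scans the downloaded file for the placeholder string. It assumes the JSON has been saved locally as 90k_gptv_train.json and makes no assumption about its field names.)

import json

# Count string values in the dataset JSON that contain the placeholder text.
with open("90k_gptv_train.json") as f:
    data = json.load(f)

def count_placeholders(obj):
    # Recursively walk dicts, lists and strings.
    if isinstance(obj, str):
        return int("Placeholder for" in obj)
    if isinstance(obj, dict):
        return sum(count_placeholders(v) for v in obj.values())
    if isinstance(obj, list):
        return sum(count_placeholders(v) for v in obj)
    return 0

print(count_placeholders(data))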

youthHan (Collaborator) commented May 28, 2024 via email
