Details about GPTV generation? #7
We use the same multi-shot prompt to organize all the visual tokens, ASR texts, and additional paired speaker diarization, plus additional text prompts to constrain GPTV to generate consistent and detailed text. I can provide the prompt texts now, but the generation code may come later.

```python
prompt = "You are a chatbot that conducts conversations based on video contexts. These are frames of a short video. You mainly answer based on the given frames. You can also answer the relevant knowledge of the person or object contained in the video. The video has a high-level topic and the video content is supposed to be coherent. The video can have more than one shot and in each shot different action segments and events exist. Do not include details that you are not sure of. \nPlease note that some speakers in the audio appear in the video, of whom the speech content should be described in the shot. Please also note that some speakers may not appear in the video, who may be background voice or camera holders, of whom the speech content should be described as narrator or background voice. "

# Task prompt plus a header describing the shot structure.
video_shots_base64_frames = [
    {"type": "text", "text": prompt},
    {"type": "text", "text": f"The video has {num_shots} shots. Each shots may contain multiple actions, scenes and subjects. "},
]

# Per-shot content: timing header, sampled frames, and the shot's ASR text.
# start_duration, shot_duration and shot_idx_mapping come from the shot metadata.
for i, shot in enumerate(all_shots):
    video_shots_base64_frames.append({"type": "text", "text": f"The {shot_idx_mapping[i]} shot starts from {start_duration}s to {start_duration + shot_duration}s. It contains frames: "})
    video_shots_base64_frames.extend(
        map(lambda x: {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{x}", "detail": "low"}},
            sample_frames(base64Frames, max(4, int(shot_duration)))))
    video_shots_base64_frames.append({"type": "text", "text": f"The ASR contained in this shot is: {video_anno['ASR'][i]}"})

# Whole-video ASR, speaker diarization, and the instruction texts, appended once after the loop.
video_shots_base64_frames.append({"type": "text", "text": f"The ASR content in the full video is: {video_anno['whole_ASR']}. Note that the speech content may related to the vision content. "})
video_shots_base64_frames.append({"type": "text", "text": video_anno['speaker']})
video_shots_base64_frames.append({"type": "text", "text": "Please create a detailed description that outlines the key actions and components. Please describe the appearance, clothing, and surrounding environment of the characters. Also, please describe the appearance and characteristics of key objects. When multiple people and objects appear, please describe them all and make their descriptions as unique as possible. "})
video_shots_base64_frames.append({"type": "text", "text": "You should ensure the description is in narrative style and third-person view. You should describe the video coherently."})
video_shots_base64_frames.append({"type": "text", "text": """You should describe and include the speech content into each video shot. You should be aware that the speakers in the audio may not appear in the video. If the speakers in the audio does not appear in the video, you should still mention the speech content if it is related to the visual content and topic. The speech content should be properly rephrased from its original ASR texts, if possible. You should ignore the incomplete speech content. Do not mention the words of "ASR". DO NOT include a conclusion of ASR or speakers. """})
video_shots_base64_frames.append({"type": "text", "text": """You should not mention duration of videos and video shots. You should not mention the number of scenes, but you ought to describe the content changes and transition. You ought to describe the content in happening and reasoning order. You should not create a video plot out of nothing. Please describe the video coherently. You should not include a separate conclusion paragraph."""})
```
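For reference, here is a minimal sketch of how the assembled content list could be sent to the model. The `sample_frames` helper is not shown above, so the uniform-sampling version below is a hypothetical reconstruction, and the model name and `max_tokens` value are assumptions rather than the exact settings used.

```python
from openai import OpenAI

def sample_frames(frames, num_samples):
    """Hypothetical reconstruction: uniformly sample up to num_samples frames."""
    if len(frames) <= num_samples:
        return frames
    step = len(frames) / num_samples
    return [frames[int(i * step)] for i in range(num_samples)]

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed model; the exact one used is not stated
    messages=[{"role": "user", "content": video_shots_base64_frames}],
    max_tokens=1024,  # assumed output limit
)
print(response.choices[0].message.content)
```

Note that the whole content list (prompt, shot headers, frames, ASR, and instructions) goes into a single user message, which is what makes this a multi-shot prompt rather than one request per shot.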
@youthHan Hi, there are many 'Placeholder for audio/visual caption' entries in https://huggingface.co/datasets/mhan/Shot2Story-134K/blob/main/90k_gptv_train.json. What does this mean? Are there individual captions corresponding to each shot?
For the GPTV-generated video summaries, we don't have per-shot captions.
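If it helps others with the same question, one quick way to count those placeholder entries without assuming the exact JSON schema is a recursive scan. This is only a sketch: the placeholder string is taken from the comment above, and the file path assumes you have downloaded the file from the Hugging Face repo locally.

```python
import json

PLACEHOLDER = "Placeholder for audio/visual caption"

def count_placeholders(node):
    """Recursively count string values containing the placeholder marker."""
    if isinstance(node, str):
        return int(PLACEHOLDER in node)
    if isinstance(node, dict):
        return sum(count_placeholders(v) for v in node.values())
    if isinstance(node, list):
        return sum(count_placeholders(v) for v in node)
    return 0

with open("90k_gptv_train.json") as f:
    data = json.load(f)
print(count_placeholders(data))
```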
Thank you for sharing the valuable dataset with the community! Could you please explain how the specific GPTV annotation process differs from Sec. 2.3?