Details about GPTV generation? #7
We use the same multi-shot prompt to organize all the visual tokens, ASR texts, and additional paired speaker diarization, plus additional text prompts to constrain GPTV to generate consistent and detailed text. I can provide the prompt texts now, but the generation code may come later.

```python
prompt = "You are a chatbot that conducts conversations based on video contexts. These are frames of a short video. You mainly answer based on the given frames. You can also answer the relevant knowledge of the person or object contained in the video. The video has a high-level topic and the video content is supposed to be coherent. The video can have more than one shot and in each shot different action segments and events exist. Do not include details that you are not sure of. \nPlease note that some speakers in the audio appear in the video, of whom the speech content should be described in the shot. Please also note that some speakers may not appear in the video, who may be background voice or camera holders, of whom the speech content should be described as narrator or background voice. "

# Task prompt plus a header describing the shot structure.
video_shots_base64_frames = [
    {"type": "text", "text": prompt},
    {"type": "text", "text": f"The video has {num_shots} shots. Each shots may contain multiple actions, scenes and subjects. "},
]

# Per-shot content: timing header, sampled frames, and the shot's ASR text.
# start_duration, shot_duration and shot_idx_mapping come from the shot metadata.
for i, shot in enumerate(all_shots):
    video_shots_base64_frames.append({"type": "text", "text": f"The {shot_idx_mapping[i]} shot starts from {start_duration}s to {start_duration + shot_duration}s. It contains frames: "})
    video_shots_base64_frames.extend(
        map(lambda x: {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{x}", "detail": "low"}},
            sample_frames(base64Frames, max(4, int(shot_duration)))))
    video_shots_base64_frames.append({"type": "text", "text": f"The ASR contained in this shot is: {video_anno['ASR'][i]}"})

# Whole-video ASR, speaker diarization, and the instruction texts, appended once after the loop.
video_shots_base64_frames.append({"type": "text", "text": f"The ASR content in the full video is: {video_anno['whole_ASR']}. Note that the speech content may related to the vision content. "})
video_shots_base64_frames.append({"type": "text", "text": video_anno['speaker']})
video_shots_base64_frames.append({"type": "text", "text": "Please create a detailed description that outlines the key actions and components. Please describe the appearance, clothing, and surrounding environment of the characters. Also, please describe the appearance and characteristics of key objects. When multiple people and objects appear, please describe them all and make their descriptions as unique as possible. "})
video_shots_base64_frames.append({"type": "text", "text": "You should ensure the description is in narrative style and third-person view. You should describe the video coherently."})
video_shots_base64_frames.append({"type": "text", "text": """You should describe and include the speech content into each video shot. You should be aware that the speakers in the audio may not appear in the video. If the speakers in the audio does not appear in the video, you should still mention the speech content if it is related to the visual content and topic. The speech content should be properly rephrased from its original ASR texts, if possible. You should ignore the incomplete speech content. Do not mention the words of "ASR". DO NOT include a conclusion of ASR or speakers. """})
video_shots_base64_frames.append({"type": "text", "text": """You should not mention duration of videos and video shots. You should not mention the number of scenes, but you ought to describe the content changes and transition. You ought to describe the content in happening and reasoning order. You should not create a video plot out of nothing. Please describe the video coherently. You should not include a separate conclusion paragraph."""})
```
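For reference, here is a minimal sketch of how the assembled content list could be sent to the model. The `sample_frames` helper is not shown above, so the uniform-sampling version below is a hypothetical reconstruction, and the model name and `max_tokens` value are assumptions rather than the exact settings used.

```python
from openai import OpenAI

def sample_frames(frames, num_samples):
    """Hypothetical reconstruction: uniformly sample up to num_samples frames."""
    if len(frames) <= num_samples:
        return frames
    step = len(frames) / num_samples
    return [frames[int(i * step)] for i in range(num_samples)]

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed model; the exact one used is not stated
    messages=[{"role": "user", "content": video_shots_base64_frames}],
    max_tokens=1024,  # assumed output limit
)
print(response.choices[0].message.content)
```

Note that the whole content list (prompt, shot headers, frames, ASR, and instructions) goes into a single user message, which is what makes this a multi-shot prompt rather than one request per shot.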
@youthHan Hi, there are many 'Placeholder for audio/visual caption' entries in https://huggingface.co/datasets/mhan/Shot2Story-134K/blob/main/90k_gptv_train.json. What does this mean? Are there individual captions corresponding to each shot?
For the GPTV-generated video summaries, we don't have per-shot captions.
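If it helps others with the same question, one quick way to count those placeholder entries without assuming the exact JSON schema is a recursive scan. This is only a sketch: the placeholder string is taken from the comment above, and the file path assumes you have downloaded the file from the Hugging Face repo locally.

```python
import json

PLACEHOLDER = "Placeholder for audio/visual caption"

def count_placeholders(node):
    """Recursively count string values containing the placeholder marker."""
    if isinstance(node, str):
        return int(PLACEHOLDER in node)
    if isinstance(node, dict):
        return sum(count_placeholders(v) for v in node.values())
    if isinstance(node, list):
        return sum(count_placeholders(v) for v in node)
    return 0

with open("90k_gptv_train.json") as f:
    data = json.load(f)
print(count_placeholders(data))
```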
Thank you for sharing the valuable dataset with the community! Could you please explain how the specific GPTV annotation process differs from Sec. 2.3?