Video downloader #303

Merged
merged 111 commits into from
Jan 6, 2023

dale-wahl
Member

The only part that currently works without issue is the video downloader.

Right now, I am stuck on an issue with ffmpeg, which is annoyingly the basis of several different processors. It seems that when ffmpeg is run as a Python subprocess, it produces errors such as the following:

[h264 @ 0x559bdfb273c0] Invalid NAL unit size (2767169 > 10809).
[h264 @ 0x559bdfb273c0] Error splitting the input into NAL units.
[h264 @ 0x559bdfb44040] Invalid NAL unit size (3041857 > 11882).
[h264 @ 0x559bdfb44040] Error splitting the input into NAL units.

Eventually ending as such:

Error while decoding stream #0:0: Invalid data found when processing input
    Last message repeated 7 times
frame=    3 fps=0.0 q=1.6 Lsize=N/A time=00:00:00.60 bitrate=N/A dup=0 drop=3 speed=5.05x    
video:20kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
Conversion failed!

As far as I can tell, running the same command directly, rather than through subprocess, does not produce these errors. This seems to be true on Linux at least. Run directly, the command creates the desired number of frames (depending on the provided framerate), while as a Python subprocess only a few (3-5) frames are created before failure. I was able to rule out the unzipping process by hardcoding an unzipped folder and testing that as the source.

The processors/visualisation/video_frames.py is a bare bones wrapper for subprocess running ffmpeg.

@dale-wahl
Member Author

Even more confusion. I made a simple bash script wrapper.

#!/bin/bash

ffmpeg -i $1 -s 144x144 -r $2 $3 >> $4 2>>$4

I can run that with expected variables and get a good result. But not when 4CAT runs it. I can hardcode the unzipped videos and have 4CAT use them as parameters and still get the same NAL unit errors resulting in return code 69.

Tue Oct 25 14:22:11 2022: Ran command: /usr/src/app/ffmpeg_wrapper.sh /usr/src/app/test/https_video_twimg_com_ext_tw_video_1576577278083010561_pu_vid_576x772_74u7etzog9wuumoz_mp4_tag_12.mp4 5.0 /usr/src/app/result/video_frame_%07d.jpeg /usr/src/app/result/wrapper.log
Tue Oct 25 14:22:11 2022: Error Return Code with video /usr/src/app/test/https_video_twimg_com_ext_tw_video_1576577278083010561_pu_vid_576x772_74u7etzog9wuumoz_mp4_tag_12.mp4: 69

While I can literally open up python3

import subprocess
import shlex
command = '/usr/src/app/ffmpeg_wrapper.sh /usr/src/app/test/https_video_twimg_com_ext_tw_video_1576577278083010561_pu_vid_576x772_74u7etzog9wuumoz_mp4_tag_12.mp4 5.0 /usr/src/app/result/video_frame_%07d.jpeg /usr/src/app/result/wrapper.log'
result = subprocess.run(shlex.split(command))

Return code 0 and good results. Clearly I've lost my mind.

@dale-wahl
Member Author

I've no idea. The video_frames processor refuses to work even if I extract videos myself and just feed it that directory. I made a simple ffmpeg_wrapper.py that does, as far as I can tell, exactly the same thing. It works without issue in the same environment. Something in the environment or libraries that 4CAT loads must be interfering, but I haven't a clue what it could be. The Invalid NAL unit size error seems to be rare and I cannot pin down what it really means. Something about start bytes perhaps? What is causing the difference is a mystery. Unsure how to proceed.

@dale-wahl dale-wahl requested a review from stijn-uva October 26, 2022 08:46
@dale-wahl
Member Author

I was thinking specifically about the ffmpeg error being somehow related to the byte sequence being off. Then, looking at how we use subprocess across 4CAT, I found this: https://stackoverflow.com/a/52008583/8683110. And I'll be damned, but it worked.
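
The linked answer is not reproduced in this thread, so the following is only a guess at the kind of fix involved. One commonly reported reason for ffmpeg misbehaving when launched from a background process is that it reads from the stdin it inherits from the parent. A minimal sketch of detaching stdin, assuming that is the relevant cause (`run_detached` is a hypothetical name, not 4CAT's code):

```python
import subprocess
import sys


def run_detached(command):
    """Run a command (a list of arguments) with stdin detached.

    Hypothetical helper for illustration only. It assumes the child
    process misbehaves because it reads from an inherited stdin.
    """
    return subprocess.run(
        command,
        stdin=subprocess.DEVNULL,  # child sees end-of-file instead of the parent's stdin
        stdout=subprocess.PIPE,    # capture output so it can be logged
        stderr=subprocess.PIPE,
    )


if __name__ == "__main__":
    # Stand-in for ffmpeg so the sketch runs anywhere: a child that reads all of stdin.
    demo = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]
    result = run_detached(demo)
    print(result.returncode, result.stdout.decode().strip())
```

For ffmpeg itself, the `-nostdin` option is another way to stop it from reading standard input; whether either approach matches the linked answer is an assumption.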

@dale-wahl
Member Author

dale-wahl commented Oct 27, 2022

Currently the videohash library will not work with 4CAT due to the subprocess bug. I have opened a pull request and have a fork that works, so we just have to install that version.

@stijn-uva
Member

OK, I've tested this and I think it's mostly ready for merging (aside from the hash processor). The migrate script can install ffmpeg in existing Docker containers, I haven't tested other options very extensively. Let's take a last look after the winter break.

@dale-wahl
Member Author

dale-wahl commented Dec 22, 2022

> OK, I've tested this and I think it's mostly ready for merging (aside from the hash processor). The migrate script can install ffmpeg in existing Docker containers, I haven't tested other options very extensively. Let's take a last look after the winter break.

Ok. I haven’t looked through all your commits and must have missed commenting earlier, but I had already added my working videohash library to setup.py, so there shouldn’t be any issue with that processor with either a manual install or Docker. The Docker setup was also already updated to install ffmpeg; the only issue was that a manual install needed an additional step, which could have been solved with first run. Adding it to the newest migrate script makes sense for upgrades.

Noticed that some processors (those that used iterate_archive_contents) were not removing staging areas if one was provided.
This is a proxy check, since we are using the .env file copied into the Docker container. It will always be the version used to create the Docker container, but if a user has already updated the .env file and for some reason has not yet used that .env file to create the 4CAT container, this message would still appear.
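
The staging-area cleanup mentioned above can be illustrated with a `try`/`finally` pattern that removes the directory whether it was created internally or supplied by the caller. This is a sketch under assumptions: `staging_area` is a hypothetical helper, and 4CAT's actual iterate_archive_contents is not shown in this thread.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def staging_area(path=None):
    """Yield a staging directory and always remove it afterwards.

    Hypothetical helper: if `path` is given, that directory is used (and
    created if missing); otherwise a temporary one is made. Either way
    it is deleted on exit, including when the caller raises.
    """
    staging = Path(path) if path is not None else Path(tempfile.mkdtemp())
    staging.mkdir(parents=True, exist_ok=True)
    try:
        yield staging
    finally:
        # Clean up even for caller-provided staging areas
        shutil.rmtree(staging, ignore_errors=True)
```

A caller would wrap its work in `with staging_area(some_path) as staging:` and write extracted files under `staging`; the directory is gone once the block exits.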
@dale-wahl
Member Author

I see that the allow-indirect admin setting disables ytdlp entirely. I'm wondering if we shouldn't be more explicit in the setting description. Right now it just mentions "e.g. embedded in a linked tweet", but I think we should at least mention YouTube and perhaps even link to ytdlp supported sites. Hmmm, I should turn off that reference if allow-indirect is not selected.

had a set where the last timeline was the widest and the canvas was being cut short.
Still seems that if the last is the widest, the thumbnails layer on top of "made with 4CAT".
@dale-wahl
Member Author

The last commit fixes the issue with timelines, but the "made with 4CAT" text is still overwritten if the last timeline is the full width of the canvas.

@stijn-uva stijn-uva merged commit 302205b into master Jan 6, 2023
Labels: big, dependencies, enhancement, processors