Handle large number of files #270

Open
jtmoon79 opened this issue Mar 26, 2024 · 0 comments
Labels
code improvement (enhancement not seen by the user), difficult (a difficult problem; a major coding effort or difficult algorithm to perfect), enhancement (new feature or request), P1 (important)

Comments

@jtmoon79
Owner

jtmoon79 commented Mar 26, 2024

Summary

Running s4.exe C:/Windows/ caused most file processing threads to return "Error too many files open", and each affected file was not processed.

Suggested behavior

Better handle processing many files. Some suggestions:

  • During file preprocessing (filepreprocessor.rs), eliminate more files from consideration by skipping files whose file name extension suggests processing is likely to fail. For example, a *.dll file on a Windows platform is very likely a Dynamic Link Library and will fail to process, so do not attempt to process it.
  • During file preprocessing (filepreprocessor.rs), eliminate more files from consideration by analyzing the first 1024 bytes of each file. This analysis relates to issue #257, "FileType guesstimating needs refactoring".
  • The preprocessing stage is currently single-threaded. It should become a thread pool of preprocessing threads, which could shorten initial preprocessing time. However, that would not fix "Error too many files open", so it is outside the scope of this Issue.
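
The first two suggestions could be sketched roughly as follows. This is a minimal illustration, not the code in filepreprocessor.rs: the names SKIP_EXTENSIONS, has_skipped_extension, read_head, and should_skip are hypothetical, the extension list is an assumed example, and the only content check shown is the "MZ" signature that begins Windows PE files such as DLLs. A real check would need to account for the binary log formats that s4 does process.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Hypothetical deny-list; the issue only names `*.dll` as an example.
const SKIP_EXTENSIONS: &[&str] = &["dll", "exe", "sys"];

/// Number of leading bytes to inspect, as suggested in the issue.
const HEAD_LEN: usize = 1024;

/// Returns true when the file name extension is on the deny-list
/// (compared case-insensitively, since Windows names often are upper case).
fn has_skipped_extension(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| SKIP_EXTENSIONS.iter().any(|skip| ext.eq_ignore_ascii_case(skip)))
        .unwrap_or(false)
}

/// Reads at most `limit` bytes from the start of the file.
/// The file handle is dropped (closed) before this function returns.
fn read_head(path: &Path, limit: usize) -> io::Result<Vec<u8>> {
    let mut head = Vec::with_capacity(limit);
    File::open(path)?.take(limit as u64).read_to_end(&mut head)?;
    Ok(head)
}

/// Decides whether a file can be dropped before any processing thread is spawned.
/// Files that cannot be read are not skipped here; later stages report the error.
fn should_skip(path: &Path) -> bool {
    if has_skipped_extension(path) {
        return true;
    }
    match read_head(path, HEAD_LEN) {
        // "MZ" is the signature at the start of Windows PE files (EXE, DLL).
        Ok(head) => head.starts_with(b"MZ"),
        Err(_) => false,
    }
}
```

Checking the extension first avoids opening the file at all in the common case, which matters when the goal is to keep the count of open file handles low.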

Workarounds

Limit the number of files opened by s4 using some external mechanism, for example the PowerShell cmdlet Get-ChildItem with a -Filter:

Get-ChildItem -Filter '*.log' -File -Path "C:\Windows" -Recurse -ErrorAction SilentlyContinue `
  | Select-Object -ExpandProperty FullName `
  | s4.exe -

On Unix, use find:

find /var -type f -name '*log' | s4 -
jtmoon79 added the enhancement, code improvement, P1, bug, and difficult labels on Mar 26, 2024
jtmoon79 changed the title from "Handle very large file sets" to "Handle large number of files" on Apr 10, 2024
jtmoon79 added a commit that referenced this issue Apr 20, 2024
jtmoon79 added a commit that referenced this issue Apr 21, 2024
jtmoon79 added a commit that referenced this issue Apr 21, 2024
jtmoon79 added a commit that referenced this issue May 29, 2024
Check for empty files, small files before spawning threads.
This can lead to far less overhead when there are many small files.

Issue #270
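
The idea in that commit message, checking file size before spawning a thread, might look something like the sketch below. It is an assumption-laden illustration rather than the actual commit: worth_processing and filter_candidates are hypothetical names, and the minimum size is a parameter because the issue does not state what counts as a "small" file.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Returns true for regular files of at least `min_size` bytes.
/// `fs::metadata` does not keep a file handle open, so this check
/// adds nothing to the process's count of open files.
fn worth_processing(path: &Path, min_size: u64) -> bool {
    fs::metadata(path)
        .map(|meta| meta.is_file() && meta.len() >= min_size)
        .unwrap_or(false)
}

/// Keeps only the paths that pass the size check, preserving their order.
/// Only the survivors would be handed to file processing threads.
fn filter_candidates(paths: Vec<PathBuf>, min_size: u64) -> Vec<PathBuf> {
    paths
        .into_iter()
        .filter(|path| worth_processing(path, min_size))
        .collect()
}
```

With min_size set to 1, empty files and unreadable paths are dropped before any thread is created for them.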
jtmoon79 removed the bug label on Jun 24, 2024