-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added PDF Censorship Detector Module #1586
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: kunalsz <[email protected]>
@@ -1,56 +0,0 @@ | |||
repos: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why that was removed?
CENSORSHIP_WEAKNESS = "censorship_weakness" | ||
|
||
|
||
class PDFCensorScanner(ArtemisBase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather name that LeakScanner, as badly censored PDFs may be only one of types of leaked data, and all leak scans will require the same core functionality of crawling
|
||
def run(self, current_task: Task) -> None: | ||
url = get_target_url(current_task) | ||
if url.endswith(".pdf"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid it would never run, as the URLs are website root URLs and these are never PDF urls. What do you think about
- crawling for all PDFs on a website (or using open-source sources such as archive.org or Common Crawl - https://github.com/lc/gau)
- running the check on all PDFs?
def run(self, current_task: Task) -> None: | ||
url = get_target_url(current_task) | ||
if url.endswith(".pdf"): | ||
self.log.info(f"PDF Censorship Scanner Scanning:{url}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing space
|
||
self.db.save_task_result( | ||
task=current_task, | ||
data={"detected_text": detected_text}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add URL where it was detected
Report( | ||
top_level_target=get_top_level_target(task_result), | ||
target=pdf_url, | ||
report_type=PDFCensorScannerReporter.CENSORSHIP_WEAKNESS, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after renaming the module to LEakDetector, you may rename the report type to LEAKED_SENSITIVE_DATA
] | ||
|
||
@staticmethod | ||
def get_normal_form_rules() -> Dict[ReportType, Callable[[Report], NormalForm]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this method is needed , the inherited one is enough
karton_class = PDFCensorScanner | ||
|
||
def test_simple(self) -> None: | ||
pdf_url = "file:///home/zeit/Downloads/rectangles_yes_2.pdf" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not a proper URL in the container ;) add the file to the repo
|
||
task = Task( | ||
{"type": TaskType.SERVICE.value, "service": Service.HTTP.value}, | ||
data={"detected_text": {1: [{"bbox": (105.4800033569336, 75.0, 119.63999938964844, 87.0), "text": "def"}]}}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, if the key is detected_text, the value should be text ;) what do you think about renaming that key?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, you need to add an assertion that the module returns the proper result, not provide this as data
Can you paste the pdf censorship detector container logs? |
"""Returns the email template fragment for redaction warnings.""" | ||
return [ | ||
ReportEmailTemplateFragment.from_file( | ||
os.path.join(os.path.dirname(__file__), "template.jinja2"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name the template like another ones are named: template_leaked_sensitive_data.jinja2
hello, friendly ping :) do you plan to submit an updated version of the pull request? |
In reference to the issue #327
Changes made:
pdf_censor_scanner
added usingxray
to scan for censorships in the pdfsErrors and future work
Tests are not running properly,your help will be really appreciated. I am attaching the logs here.
test.log
Even after updating the
docker compose
file. I can't see the new module in the frontend when i run./scripts/start_dev
More robust testing and reporting will be added.
@kazet looking forward to your insights !