Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added PDF Censorship Detector Module #1586

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

kunalsz
Copy link

@kunalsz kunalsz commented Mar 6, 2025

In reference to the issue #327

Changes made:

  • New module pdf_censor_scanner added using xray to scan for censorships in the pdfs
  • tests written
  • report written

Errors and future work

  • Tests are not running properly,your help will be really appreciated. I am attaching the logs here.
    test.log

  • Even after updating the docker compose file. I can't see the new module in the frontend when i run ./scripts/start_dev

  • More robust testing and reporting will be added.

@kazet looking forward to your insights !

@@ -1,56 +0,0 @@
repos:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why that was removed?

CENSORSHIP_WEAKNESS = "censorship_weakness"


class PDFCensorScanner(ArtemisBase):
Copy link
Member

@kazet kazet Mar 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather name that LeakScanner, as badly censored PDFs may be only one of types of leaked data, and all leak scans will require the same core functionality of crawling


def run(self, current_task: Task) -> None:
url = get_target_url(current_task)
if url.endswith(".pdf"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid it would never run, as the URLs are website root URLs and these are never PDF urls. What do you think about

  1. crawling for all PDFs on a website (or using open-source sources such as archive.org or Common Crawl - https://github.com/lc/gau)
  2. running the check on all PDFs?

def run(self, current_task: Task) -> None:
url = get_target_url(current_task)
if url.endswith(".pdf"):
self.log.info(f"PDF Censorship Scanner Scanning:{url}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing space


self.db.save_task_result(
task=current_task,
data={"detected_text": detected_text},
Copy link
Member

@kazet kazet Mar 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add URL where it was detected

Report(
top_level_target=get_top_level_target(task_result),
target=pdf_url,
report_type=PDFCensorScannerReporter.CENSORSHIP_WEAKNESS,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after renaming the module to LEakDetector, you may rename the report type to LEAKED_SENSITIVE_DATA

]

@staticmethod
def get_normal_form_rules() -> Dict[ReportType, Callable[[Report], NormalForm]]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this method is needed , the inherited one is enough

karton_class = PDFCensorScanner

def test_simple(self) -> None:
pdf_url = "file:///home/zeit/Downloads/rectangles_yes_2.pdf"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not a proper URL in the container ;) add the file to the repo


task = Task(
{"type": TaskType.SERVICE.value, "service": Service.HTTP.value},
data={"detected_text": {1: [{"bbox": (105.4800033569336, 75.0, 119.63999938964844, 87.0), "text": "def"}]}},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, if the key is detected_text, the value should be text ;) what do you think about renaming that key?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, you need to add an assertion that the module returns the proper result, not provide this as data

@kazet
Copy link
Member

kazet commented Mar 13, 2025

Even after updating the docker compose file. I can't see the new module in the frontend when i run ./scripts/start_dev

Can you paste the pdf censorship detector container logs?

"""Returns the email template fragment for redaction warnings."""
return [
ReportEmailTemplateFragment.from_file(
os.path.join(os.path.dirname(__file__), "template.jinja2"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name the template like another ones are named: template_leaked_sensitive_data.jinja2

@kazet
Copy link
Member

kazet commented Apr 1, 2025

hello,

friendly ping :) do you plan to submit an updated version of the pull request?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants