Added PDF Censorship Detector Module #1586

kunalsz · 2025-03-06T20:28:46Z

In reference to the issue #327

Changes made:

New module pdf_censor_scanner added, using xray to scan for censorship in PDFs
tests written
report written

Errors and future work

Tests are not running properly; your help would be really appreciated. I am attaching the logs here.
test.log
Even after updating the docker compose file, I can't see the new module in the frontend when I run ./scripts/start_dev
More robust testing and reporting will be added.

@kazet looking forward to your insights!

Signed-off-by: kunalsz <[email protected]>

kazet · 2025-03-13T09:28:27Z

.pre-commit-config.yaml

@@ -1,56 +0,0 @@
-repos:


Why was that removed?

kazet · 2025-03-13T09:30:16Z

artemis/modules/pdf_censor_scanner.py

+    CENSORSHIP_WEAKNESS = "censorship_weakness"
+
+
+class PDFCensorScanner(ArtemisBase):


I would rather name that LeakScanner, as badly censored PDFs may be only one of several types of leaked data, and all leak scans will require the same core functionality of crawling.

kazet · 2025-03-13T09:31:24Z

artemis/modules/pdf_censor_scanner.py

+
+    def run(self, current_task: Task) -> None:
+        url = get_target_url(current_task)
+        if url.endswith(".pdf"):


I'm afraid it would never run, as the URLs are website root URLs, and these are never PDF URLs. What do you think about:

crawling for all PDFs on a website (or using public sources such as archive.org or Common Crawl, e.g. via https://github.com/lc/gau), and

running the check on all PDFs?
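
The crawling step suggested above could start from something like the following. This is a minimal sketch, not Artemis code: `PDFLinkCollector` and `find_pdf_links` are hypothetical names, it only looks at `<a href>` links on a single already-downloaded page, and fetching the page or following links to other pages is left out.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin, urlparse


class PDFLinkCollector(HTMLParser):
    """Hypothetical helper: collects absolute URLs of <a> links pointing at PDFs."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.pdf_urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name != "href" or not value:
                continue
            # Resolve relative links against the page URL.
            absolute = urljoin(self.base_url, value)
            # Check only the path, so a query string does not hide the extension.
            is_pdf = urlparse(absolute).path.lower().endswith(".pdf")
            if is_pdf and absolute not in self.pdf_urls:
                self.pdf_urls.append(absolute)


def find_pdf_links(base_url: str, html: str) -> List[str]:
    """Returns the PDF links found in the given HTML, in document order."""
    collector = PDFLinkCollector(base_url)
    collector.feed(html)
    return collector.pdf_urls
```

With a helper like this, `run` would iterate over the collected URLs instead of testing whether the root URL itself ends with `.pdf`.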

kazet · 2025-03-13T09:31:36Z

artemis/modules/pdf_censor_scanner.py

+    def run(self, current_task: Task) -> None:
+        url = get_target_url(current_task)
+        if url.endswith(".pdf"):
+            self.log.info(f"PDF Censorship Scanner Scanning:{url}")


missing space

kazet · 2025-03-13T09:32:00Z

artemis/modules/pdf_censor_scanner.py

+
+        self.db.save_task_result(
+            task=current_task,
+            data={"detected_text": detected_text},


Add the URL where it was detected.
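
One possible shape for the saved data, once several PDFs are checked per task, is a list of findings that each carry their URL. This is only a sketch: `build_result_data` and the `findings` / `url` keys are assumptions for illustration, not names from the PR.

```python
from typing import Any, Dict, List


def build_result_data(detections_by_url: Dict[str, Any]) -> Dict[str, List[Dict[str, Any]]]:
    """Hypothetical helper: pairs every detection with the PDF URL it came from."""
    findings = [
        {"url": url, "detected_text": detected}
        for url, detected in detections_by_url.items()
        if detected  # skip PDFs where nothing was uncovered
    ]
    return {"findings": findings}
```

The returned dictionary could then be passed as `data=` to `save_task_result`, so the reporter knows which PDF each finding belongs to.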

kazet · 2025-03-13T09:32:23Z

artemis/reporting/modules/pdf_censor_scanner/reporter.py

+            Report(
+                top_level_target=get_top_level_target(task_result),
+                target=pdf_url,
+                report_type=PDFCensorScannerReporter.CENSORSHIP_WEAKNESS,


After renaming the module to LeakDetector, you may rename the report type to LEAKED_SENSITIVE_DATA.

kazet · 2025-03-13T09:32:47Z

artemis/reporting/modules/pdf_censor_scanner/reporter.py

+        ]
+
+    @staticmethod
+    def get_normal_form_rules() -> Dict[ReportType, Callable[[Report], NormalForm]]:


I don't think this method is needed; the inherited one is enough.

kazet · 2025-03-13T09:33:21Z

test/modules/test_pdf_censor_scanner.py

+    karton_class = PDFCensorScanner
+
+    def test_simple(self) -> None:
+        pdf_url = "file:///home/zeit/Downloads/rectangles_yes_2.pdf"


this is not a proper URL in the container ;) add the file to the repo
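
Once the PDF is committed to the repository, the test can locate it relative to the test file instead of a developer's home directory. A minimal sketch, assuming a hypothetical `data` directory next to the test module; how the file is then exposed to the module inside the container (for example, served over HTTP by a test service) is not decided here.

```python
from pathlib import Path


def fixture_path(test_file: str, name: str) -> Path:
    """Hypothetical helper: path of a fixture stored in a 'data' directory beside the test."""
    return Path(test_file).parent / "data" / name
```

In the test module this would be called as `fixture_path(__file__, "rectangles_yes_2.pdf")`.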

kazet · 2025-03-13T09:35:09Z

test/modules/test_pdf_censor_scanner.py

+
+        task = Task(
+            {"type": TaskType.SERVICE.value, "service": Service.HTTP.value},
+            data={"detected_text": {1: [{"bbox": (105.4800033569336, 75.0, 119.63999938964844, 87.0), "text": "def"}]}},


hmm, if the key is detected_text, the value should be text ;) what do you think about renaming that key?

BTW, you need to add an assertion that the module returns the proper result, rather than providing the expected result as task data.
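
The difference between feeding the expected result in as input and asserting on the output can be sketched with a mocked database. Everything here is a stand-in: `FakeLeakScanner`, the `"INTERESTING"` status string, and the test URL are assumptions for illustration, and the real test would use the project's own test base class and module.

```python
import unittest
from unittest.mock import MagicMock


class FakeLeakScanner:
    """Hypothetical stand-in for the module; a real one would download and analyze the PDF."""

    def __init__(self, db) -> None:
        self.db = db

    def run(self, task: dict) -> None:
        detected = {1: [{"text": "def"}]}  # pretend this came from the PDF analysis
        self.db.save_task_result(
            task=task,
            status="INTERESTING",
            data={"url": task["url"], "detected_text": detected},
        )


class LeakScannerTest(unittest.TestCase):
    def test_reports_uncovered_text(self) -> None:
        db = MagicMock()
        # The task carries only the input (where to scan), not the expected findings.
        task = {"url": "http://test-server/rectangles_yes_2.pdf"}

        FakeLeakScanner(db).run(task)

        # The expectations are asserted on what the module saved.
        (call,) = db.save_task_result.call_args_list
        self.assertEqual(call.kwargs["status"], "INTERESTING")
        self.assertEqual(call.kwargs["data"]["url"], task["url"])
        self.assertEqual(call.kwargs["data"]["detected_text"][1][0]["text"], "def")
```

The point of the pattern is that the test fails if the module stops detecting the text, which cannot happen when the detection is supplied as task data.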

kazet · 2025-03-13T09:37:00Z

Even after updating the docker compose file, I can't see the new module in the frontend when I run ./scripts/start_dev

Can you paste the PDF censorship detector container logs?

kazet · 2025-03-13T09:37:51Z

artemis/reporting/modules/pdf_censor_scanner/reporter.py

+        """Returns the email template fragment for redaction warnings."""
+        return [
+            ReportEmailTemplateFragment.from_file(
+                os.path.join(os.path.dirname(__file__), "template.jinja2"),


Name the template the way the other ones are named: template_leaked_sensitive_data.jinja2

kazet · 2025-04-01T07:07:14Z

hello,

friendly ping :) do you plan to submit an updated version of the pull request?

Added PDF Censorship Detector Module

3ba74b7

Signed-off-by: kunalsz <[email protected]>
