Confluence scanning: for large sites not all spaces/pages are scanned without error #26

Open
tallandtree opened this issue Aug 8, 2024 · 3 comments

Comments

@tallandtree

I use a fork of your n0s1 code to scan our (large) Confluence Cloud instance. Thanks for that, it is very useful.

However, I found out that not all spaces are being scanned, and I didn't get an error message or a timeout. I only noticed it because a test space I had added was missing from the report. The total scan took about 5 hours. My guess is that the connection was closed at some point and the client object ended up empty. I saw that you recently added error handling and did some refactoring, but the strange thing is that we didn't get any errors. I will adopt the error handling in any case.
For now, I solved the issue with the missing spaces by calling self._connect() in the get_data method for every batch of spaces to be collected. There might be a better way, but for now this works.

    def set_config(self, config):
        SERVER = config.get("server", "")
        EMAIL = config.get("email", "")
        TOKEN = config.get("token", "")
        LABEL_FALSE_POSITIVE = config.get("label_false_positive", "cict-no-secrets-confirmed")
        self._url = SERVER
        self._user = EMAIL
        self._password = TOKEN
        self.label_false_positive = LABEL_FALSE_POSITIVE
        self._connect()
        return self.is_connected()

    def _connect(self):
        from atlassian import Confluence
        if self._user and len(self._user) > 0:
            # Basic auth with e-mail address and API token
            self._client = Confluence(url=self._url, username=self._user, password=self._password)
        else:
            # Personal access token auth; use the stored attributes here
            # (SERVER and TOKEN are locals of set_config and not in scope)
            self._client = Confluence(url=self._url, token=self._password)
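For reference, this is roughly the config dict that set_config expects (just a sketch: the keys mirror the config.get() calls above, and the values are made-up placeholders):

    # Made-up example of the config dict passed to set_config() above;
    # the keys mirror the config.get() calls, the values are placeholders.
    config = {
        "server": "https://example.atlassian.net/wiki",
        "email": "scanner-bot@example.com",
        "token": "ATATT...redacted...",   # Atlassian API token
        "label_false_positive": "cict-no-secrets-confirmed",
    }
    # connected = scanner.set_config(config)   # scanner: instance of the class above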

and in get_data:

    def get_data(self, include_comments=False, test=""):
        if not self._client:
            return None, None, None, None, None, None
        start = 0
        limit = 50

        finished = False
        while not finished:
            logging.info(f"Spaces batch: {start} - {start+limit}")
            # reconnect for every batch
            self._connect()
            if not test:
                res = self._client.get_all_spaces(
                    start=start, limit=limit, expand="history"
                )
                start += limit
                spaces = res.get("results", [])
            else:
                key = test
                res = self._client.get_space(key, expand="history")
                finished = True
                spaces = [res]

Via the test parameter, I also added the possibility to scan only a single space, since the full scan takes such a long time.

For your interest, another improvement I made for our use case is a change to the generic-api-key rule in config.yaml, because we got tons of false positives from this regex matching the Confluence user macro and link macro in combination with 'key'.

  - id: generic-api-key
    description: Generic API Key
    regex: >-
      (?i)(?<!ri:user|CDATA\[\<add )(?:key|api|token|secret|client|passwd|password|auth|access)(?:[0-9a-z\-_\t
      .]{0,20})(?:[\s|']|[\s|"]){0,3}(?:=|>|:{1,3}=|\|\|:|<=|=>|:|\?=)(?:'|\"|\s|=|\x60){0,5}([0-9a-z\-_.=]{10,150})(?:['|\"|\n|\r|\s|\x60|;]|$)
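
To sanity-check the tightened rule, here is a small sketch. It assumes the pattern is evaluated with the third-party regex package (the stdlib re module rejects a variable-width lookbehind like this one); the sample strings are made up:

    # Sketch: the (?<!ri:user|CDATA\[\<add ) lookbehind should suppress matches
    # inside the Confluence user macro while a plain assignment still matches.
    import regex  # third-party package; stdlib re cannot compile this lookbehind

    GENERIC_API_KEY = regex.compile(
        r"(?i)(?<!ri:user|CDATA\[\<add )"
        r"(?:key|api|token|secret|client|passwd|password|auth|access)"
        r"(?:[0-9a-z\-_\t .]{0,20})"
        r"(?:[\s|']|[\s|\"]){0,3}(?:=|>|:{1,3}=|\|\|:|<=|=>|:|\?=)"
        r"(?:'|\"|\s|=|\x60){0,5}([0-9a-z\-_.=]{10,150})"
        r"(?:['|\"|\n|\r|\s|\x60|;]|$)"
    )

    # Confluence user macro in storage format: "key" is directly preceded by
    # "ri:user", so the lookbehind blocks the false positive.
    macro = '<ri:user ri:userkey="0123456789abcdef0123456789abcdef"/>'
    assert GENERIC_API_KEY.search(macro) is None

    # A plausible leaked credential is still reported.
    assert GENERIC_API_KEY.search('api_key = "0123456789abcdefghij"') is not None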

We also added a method to skip a page when it carries a specific label marking it as a false positive, for the cases where the 'secret' that was found is only meant as an example. The user adds that label to the page to confirm it is a false positive.

    def is_false_positive(self, page_id):
        # Pages carrying the configured label are treated as confirmed false positives
        labels_json = self._client.get_page_labels(page_id)
        labels = labels_json.get("results", [])
        for label in labels:
            if label["name"] == self.label_false_positive:
                logging.info(f"Page {page_id} is a false positive due to label {label}")
                return True
        return False
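
For context, this is roughly the payload shape is_false_positive expects from get_page_labels (an illustrative, made-up example; only the name field of each result is used):

    # Made-up example of the dict returned by self._client.get_page_labels(page_id);
    # is_false_positive only inspects results[*]["name"].
    labels_json = {
        "results": [
            {"prefix": "global", "name": "cict-no-secrets-confirmed", "id": "12345"},
            {"prefix": "global", "name": "how-to", "id": "12346"},
        ],
        "start": 0,
        "limit": 200,
        "size": 2,
    }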

And in the method get_data:

                        for p in pages:
                            comments = []
                            title = p.get("title", "")
                            page_id = p.get("id", "")
                            if self.is_false_positive(page_id):
                                continue

In any case, thanks for your code. Hope my comments are useful.
Kind regards,
Mariska

@blupants
Collaborator

Thank you for reporting the issue and for the proposed enhancement. I will add it to the next release.
Did you have the chance to test your enhancements on top of the latest main branch? Does it fix your bug, or are you still having issues?

Apologies for the late response. I am back to business now, and I should be way more responsive from now on.

@tallandtree
Author

Hi, no problem. I've not yet had the time to test your latest version; I've planned this for the first week of September. The reconnect I implemented works in any case, but I'll let you know the results once I've tested with your latest version.

@blupants
Collaborator

Hi @tallandtree, I just wanted to let you know that I have a new PR #31 out that implements a solution for this issue. Please let me know if it addresses your use cases.
Thanks
