Confluence scanning: for large sites not all spaces/pages are scanned without error #26

Open
tallandtree opened this issue Aug 8, 2024 · 3 comments

Comments

@tallandtree

I use a fork of your n0s1 code to scan our (large) Confluence Cloud instance. Thanks for that, it is very useful.

However, I found out that not all spaces are being scanned, and I didn't get an error message or a timeout. I only noticed it because a test space I had added was missing from the report. The total scan took about 5 hours. My guess is that the connection was closed at some point and the client object ended up empty. I saw that you recently added error handling and did some refactoring, but the strange thing is that we didn't get any errors. I will adopt the error handling in any case.
For now, I solved the issue with the missing spaces by calling self._connect() in the get_data method for every batch of spaces to be collected. There might be a better way, but for now this works.

    def set_config(self, config):
        SERVER = config.get("server", "")
        EMAIL = config.get("email", "")
        TOKEN = config.get("token", "")
        LABEL_FALSE_POSITIVE = config.get("label_false_positive", "cict-no-secrets-confirmed")
        self._url = SERVER
        self._user = EMAIL
        self._password = TOKEN
        self.label_false_positive = LABEL_FALSE_POSITIVE
        self._connect()
        return self.is_connected()

    def _connect(self):
        from atlassian import Confluence
        if self._user and len(self._user) > 0:
            # Basic auth with e-mail address and API token
            self._client = Confluence(url=self._url, username=self._user, password=self._password)
        else:
            # Personal access token auth; use the stored attributes here
            # (SERVER and TOKEN are locals of set_config and not in scope)
            self._client = Confluence(url=self._url, token=self._password)
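For reference, this is roughly the config dict that set_config expects (just a sketch: the keys mirror the config.get() calls above, and the values are made-up placeholders):

    # Made-up example of the config dict passed to set_config() above;
    # the keys mirror the config.get() calls, the values are placeholders.
    config = {
        "server": "https://example.atlassian.net/wiki",
        "email": "scanner-bot@example.com",
        "token": "ATATT...redacted...",   # Atlassian API token
        "label_false_positive": "cict-no-secrets-confirmed",
    }
    # connected = scanner.set_config(config)   # scanner: instance of the class above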

and in get_data:

    def get_data(self, include_comments=False, test=""):
        if not self._client:
            return None, None, None, None, None, None
        start = 0
        limit = 50

        finished = False
        while not finished:
            logging.info(f"Spaces batch: {start} - {start+limit}")
            # reconnect for every batch
            self._connect()
            if not test:
                res = self._client.get_all_spaces(
                    start=start, limit=limit, expand="history"
                )
                start += limit
                spaces = res.get("results", [])
            else:
                key = test
                res = self._client.get_space(key, expand="history")
                finished = True
                spaces = [res]

Via the test parameter, I also added the possibility to scan only a single space, since the full scan takes such a long time.

For your interest, another improvement I made for our use case is a change to the generic-api-key rule in config.yaml, because we got tons of false positives from this regex matching the Confluence user macro and link macro in combination with 'key'.

  - id: generic-api-key
    description: Generic API Key
    regex: >-
      (?i)(?<!ri:user|CDATA\[\<add )(?:key|api|token|secret|client|passwd|password|auth|access)(?:[0-9a-z\-_\t
      .]{0,20})(?:[\s|']|[\s|"]){0,3}(?:=|>|:{1,3}=|\|\|:|<=|=>|:|\?=)(?:'|\"|\s|=|\x60){0,5}([0-9a-z\-_.=]{10,150})(?:['|\"|\n|\r|\s|\x60|;]|$)
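
To sanity-check the tightened rule, here is a small sketch. It assumes the pattern is evaluated with the third-party regex package (the stdlib re module rejects a variable-width lookbehind like this one); the sample strings are made up:

    # Sketch: the (?<!ri:user|CDATA\[\<add ) lookbehind should suppress matches
    # inside the Confluence user macro while a plain assignment still matches.
    import regex  # third-party package; stdlib re cannot compile this lookbehind

    GENERIC_API_KEY = regex.compile(
        r"(?i)(?<!ri:user|CDATA\[\<add )"
        r"(?:key|api|token|secret|client|passwd|password|auth|access)"
        r"(?:[0-9a-z\-_\t .]{0,20})"
        r"(?:[\s|']|[\s|\"]){0,3}(?:=|>|:{1,3}=|\|\|:|<=|=>|:|\?=)"
        r"(?:'|\"|\s|=|\x60){0,5}([0-9a-z\-_.=]{10,150})"
        r"(?:['|\"|\n|\r|\s|\x60|;]|$)"
    )

    # Confluence user macro in storage format: "key" is directly preceded by
    # "ri:user", so the lookbehind blocks the false positive.
    macro = '<ri:user ri:userkey="0123456789abcdef0123456789abcdef"/>'
    assert GENERIC_API_KEY.search(macro) is None

    # A plausible leaked credential is still reported.
    assert GENERIC_API_KEY.search('api_key = "0123456789abcdefghij"') is not None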

We also added a method to skip a page when it carries a specific label marking it as a false positive, for the cases where the 'secret' that was found is only meant as an example. The user adds that label to the page to confirm it is a false positive.

    def is_false_positive(self, page_id):
        # Pages carrying the configured label are treated as confirmed false positives
        labels_json = self._client.get_page_labels(page_id)
        labels = labels_json.get("results", [])
        for label in labels:
            if label["name"] == self.label_false_positive:
                logging.info(f"Page {page_id} is a false positive due to label {label}")
                return True
        return False
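
For context, this is roughly the payload shape is_false_positive expects from get_page_labels (an illustrative, made-up example; only the name field of each result is used):

    # Made-up example of the dict returned by self._client.get_page_labels(page_id);
    # is_false_positive only inspects results[*]["name"].
    labels_json = {
        "results": [
            {"prefix": "global", "name": "cict-no-secrets-confirmed", "id": "12345"},
            {"prefix": "global", "name": "how-to", "id": "12346"},
        ],
        "start": 0,
        "limit": 200,
        "size": 2,
    }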

And in the method get_data:

                        for p in pages:
                            comments = []
                            title = p.get("title", "")
                            page_id = p.get("id", "")
                            if self.is_false_positive(page_id):
                                continue

In any case, thanks for your code. Hope my comments are useful.
Kind regards,
Mariska

@blupants
Collaborator

Thank you for reporting the issue and for the proposed enhancement. I will add it to the next release.
Did you have the chance to test your enhancements on top of the latest main branch? Does it fix your bug, or are you still having issues?

Apologies for the late response. I am back to business now, and I should be way more responsive from now on.

@tallandtree
Author

Hi, no problem. I've not yet had the time to test your latest version; I've planned this for the first week of September. The reconnect I implemented works in any case, but I'll let you know the results once I've tested with your latest version.

@blupants
Collaborator

Hi @tallandtree, I just wanted to let you know that I have a new PR #31 out that implements a solution for this issue. Please let me know if it addresses your use cases.
Thanks
