Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WALL-E role updates #7

Open
wants to merge 36 commits into
base: main
Choose a base branch
from
Open

Conversation

neoformit
Copy link

You may wish to have a discussion about this, but I made quite a few changes to get WALL-E working on Galaxy AU. I think these changes should improve interoperability between Galaxy servers, but would require a minor update to EU's playbook to work (additional vars).

  • Add debug logging with --debug option for walle.py
  • Modify cron job to fix environment variable issues
  • Add all required env variables to walle_bashrc
  • Catch permission error in walle.py
  • Add --kill option for walle.py to kill malicious jobs with gxadmin
  • Include a modified version of galay_jwd.py that accepts XML or YAML format for object_store_conf
  • Ignore Ansible failure of "Clone malware database (WallE)"
  • Update README

Add debug logging

walle.py --debug was really useful for getting this going. It's very verbose so probably don't want to leave it on in production.

Modify cron job

I found that cron is not able to source our Galaxy .bashrc file, probably because it contains code that is specific to an interactive shell. The result is that none of the env vars that are set in wall_bashrc make it into WALL-E's env. I fixed this by creating a new .bashrc file for WALL-E (with all required env variables) and sourcing it in crontab like so:

# crontab
0 */1 * * * BASH_ENV=/mnt/galaxy/walle/.bashrc bash -c "  source /mnt/galaxy/venv/bin/activate; /mnt/galaxy/venv/bin/python  /usr/local/bin/walle.py  --tool interactive_tool    --max-size 10   --since 24    --debug   --kill  >> /var/log/walle/walle.log 2>&1"

Add all required env variables

To enable the dedicated walle_bashrc, all required env vars are added by the walle role. This requires additional Ansible variables, which are documented in README.md. Additional env vars can be easily added in the playbook with walle_extra_env_vars.

Permission error in walle.py

Instead of failing, log a warning if PermissionDenied is raised when reading a JWD file.

Add --kill option for walle.py

Optional, of course. We won't use this yet but perhaps in future when we're confident in WALL-E's abilities. We also would like to add an --alert option to notify us in Slack when malicious files are detected - this will most likely be our default. The kill option assumes that gxadmin is accessible (can point to it with env var GXADMIN_PATH), which will be run like:

gxadmin mutate fail-job $JOB_ID --commit
gxadmin mutate fail-terminal-datasets --commit

Ignore Ansible failure of "Clone malware database (WallE)"

Useful if you want to make a local modification of checksums.yml for testing.

@neoformit
Copy link
Author

neoformit commented Sep 20, 2024

Oh dear, I just noticed that this overlaps a bit with your existing PR @mira-miracoli 😬

@neoformit neoformit marked this pull request as ready for review September 23, 2024 04:14
@mira-miracoli
Copy link
Contributor

Hi, thank you for your contibution! :)
Especially thank you for integrating the galaxy_jwd.py part into the script, I shied away from that task😅
The --kill option is a cool feature. I thought about it, but dropped it in favor of the --delete-user option,
because Galaxy automatically kills all running jobs then. And I did not see many cases for EU where we would not delete a user when we find stuff that would make us want to kill the job.
Would it be okay for you if we merge my PR first and then you rebase?

@neoformit
Copy link
Author

Hey Mira, yes please feel free to merge your PR first and I'll clean up mine. The --kill option I thought was just a bit less aggressive than --delete-user but really we should not be doing either of these until we're confident of no type 1 errors! Nice to give admins a choice of actions I suppose. I hope to push something like --slack-alert in the next week or two.

@mira-miracoli
Copy link
Contributor

@neoformit
I merged my PR now, please feel free to rebase and then we can merge yours :)

@neoformit
Copy link
Author

Hey @mira-miracoli I've rebased, there were lots of conflicts that I think I've resolved correctly but please check the diff on walle.py. It was a bit confusing at times - I noticed that in some diffs my changes were marked as "current" and other times "incoming" so it was hard to tell whose work I was accepting 🙄 I'm having a look over the diff now.

Copy link
Author

@neoformit neoformit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks ok now, I'll try running updating the role and running our playbook to check for issues.

while chunk := specimen.read(chunksize):
sha1.update(chunk)
except PermissionError:
logger.warning(f"Permission denied for file: {path}")
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two cases where I caught permission errors here, in my case it was only for one file in the JWD (I think command.sh) but this error could be fatal if all JWD files are PermissionDenied. I guess in that case it would be pretty obvious in walle.log that walle is not working.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, in my walle.log I get:

2024-10-11 04:48 - WARNING - Permission denied for file: /mnt/galaxy/tmp/job_working_directory/_interactive/24546/command.sh

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I think walle should run as root, because the jupyter users have root access inside their jupyter notebook and can save files as uid and gid 0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WallE could cleanup everything non-root and the rest we could leave to our normal cleanup scripts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Walle does not clean up, it just scans files. But in order to do so, it needs read access

@neoformit
Copy link
Author

Tested on the AU dev server and working ok with --verbose --debug

Copy link
Contributor

@mira-miracoli mira-miracoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry some of the comments are just ideas, not worth changing.
I am not sure how --kill and --debug will be used in the script, I don't see a a change in the main function
(You're probably still working on it?)

tasks/main.yml Outdated Show resolved Hide resolved
@@ -30,6 +31,13 @@ walle_envs_database:
value: "{{ galaxy_config_dir }}/galaxy.yml"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
value: "{{ galaxy_config_dir }}/galaxy.yml"
value: "{{ galaxy_config_file }}"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also change (but gh does not allow suggestions on unchanged code parts :/ (or I don't know how)

-  - key: PGHOST
-    value: 127.0.0.1
-  - key: PGUSER
-    value: galaxy
-  - key: PGDATABASE
-    value: galaxy
+  - key: PGHOST
+    value: "{{ galaxy_pg_host }}"
+  - key: PGUSER
+    value: "{{ galaxy_pg_user }}"
+  - key: PGDATABASE
+    value: "{{ galaxy_pg_db }}"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

galaxy_pg_host, galaxy_pg_user and galaxy_pg_db seem to be EU-specific playbook vars, we don't have them in AU or in the galaxyproject.galaxy role? I assumed that admins would change these values with walle_extra_env_vars if they wanted to customize them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh okay, to me it looked like you added them in the README

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep you're right, I did 🤦 I have committed this suggestion: 4ce5886. It does make it easier for admins to control these basic vars, but can still get more flexibility with walle_extra_env_vars.

defaults/main.yml Show resolved Hide resolved
defaults/main.yml Show resolved Hide resolved
while chunk := specimen.read(chunksize):
sha1.update(chunk)
except PermissionError:
logger.warning(f"Permission denied for file: {path}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Walle does not clean up, it just scans files. But in order to do so, it needs read access

files/walle.py Show resolved Hide resolved
files/walle.py Outdated Show resolved Hide resolved
files/walle.py Show resolved Hide resolved
tasks/main.yml Outdated Show resolved Hide resolved
@neoformit
Copy link
Author

Thanks a lot for the thorough review @mira-miracoli 🎉
It looks like --debug is redundant now, so I have dropped that in favour of --verbose.
The --kill logic was lost in the merge so I've replaced that - thanks for noticing. I need to get better at doing a complex rebase 🙄

I think we just have to decide how to handle the walle_env_vars which don't always have a playbook variable to default to.

@mira-miracoli
Copy link
Contributor

Of course, many thanks for the great contribution! :)

I think we just have to decide how to handle the walle_env_vars which don't always have a playbook variable to default to.

I am not happy with the dictionaries either. Ansible does not allow changing only specific k/v pairs from default, so you need to copy everything and then do your changes.
I am not sure if there are variables for all database k/v pairs. maybe we could leave it this way with hardcoded default values? I think

  - key: PGHOST
    value: 127.0.0.1
  - key: PGUSER
    value: galaxy
  - key: PGDATABASE
    value: galaxy
  - key: GXADMIN_PATH
    value: /usr/local/bin/gxadmin

should work for most Galaxy instances(?)
Only thing is if people rely on --kill to work when Ansible runs without errors and they don't actually have gxadmin.
But I think if we write that in the docs that would be users responsibility.

@neoformit
Copy link
Author

neoformit commented Oct 15, 2024

Ansible does not allow changing only specific k/v pairs from default

True, but I think they can just override them by appending to walle_extra_env_vars? Here's my logic:

walle_envs_database: 
  - key: PGHOST
    value: 127.0.0.1
  - key: PGUSER
    value: galaxy

walle_extra_env_vars:  # Defined in playbook
  - key: PGUSER
    value: custom_db_user

walle_env_vars: "{{ walle_envs_database + walle_extra_env_vars }}"
# Results in:
walle_env_vars:
  - key: PGHOST
    value: 127.0.0.1
  - key: PGUSER
    value: galaxy
  - key: PGUSER
    value: custom_db_user

If you do lineinfile with walle_env_vars, PGUSER=custom_db_user would overwrite PGUSER=galaxy, right?

@neoformit neoformit closed this Oct 15, 2024
@neoformit neoformit reopened this Oct 15, 2024
@neoformit
Copy link
Author

Sorry, didn't mean to close this. I did a typing blooper and hit some shortcut by accident 💩

@mira-miracoli
Copy link
Contributor

mira-miracoli commented Oct 15, 2024

If you do lineinfile with walle_env_vars, PGUSER=custom_db_user would overwrite PGUSER=galaxy, right?

oh smart! I did not think of this, that you could use the walle_extra_env_vars to overwrite.
Maybe we could include that in the README?

@neoformit
Copy link
Author

Maybe we could include that in the README?

Yeah for sure! Do you think it's good to also have the required playbook vars that I defined in the README?

galaxy_config_file: /path/to/galaxy.yml
galaxy_log_dir: /path/to/galaxy/log/dir
galaxy_pg_db: galaxy
galaxy_pg_user: galaxy
galaxy_pg_host: my-db-server.usegalaxy.org
galaxy_pulsar_app_conf: /path/to/pulsar/app.yml

@mira-miracoli
Copy link
Contributor

If they only appear in eu's playbooks, I think we could remove that

@mira-miracoli
Copy link
Contributor

Sorry I know this is very annoying but pyright gave me some errors.
Could you check with pyright (I just pushed a gh action) and format with black?
I can also do that, I thought it is nicer to ask before pushing stuff to your PR

@mira-miracoli
Copy link
Contributor

Closed and reopened, so the pyright action triggers (I don't know a more subtle way 🙄 )

@neoformit
Copy link
Author

That is annoying, it seems that pyright raises linting errors for type hints on dependencies (i.e. we can't add type hinting to os.environ.get). But I can see how the linting is useful. I've done the best I can there - feel free to push if you see any others that need fixing?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants