conan: fix ongoing cleanup errors (#85)
- Bump aws-nuke to v3.26.0.
- Instances set up with stop protection enabled were not deleted by aws-nuke.
  => Enable the DisableStopProtection option for aws-nuke.
- Add a 'debug' environment variable to better control the output of conan;
  by default, make conan's output a little less verbose.
- EC2Images: include disabled and deprecated images and disable deregistration protection.
  Disabled or deprecated images, and images with deregistration protection, weren't deleted by aws-nuke
  (see the configuration sketch after this list).
- `manual_cleanup.py`: release EIPs that are in a NetworkBorderGroup; aws-nuke misses them.
- `manual_cleanup.py`: a VPC can't be deleted while it has VPC Lattice target groups registered.
  Delete the VPC Lattice targets and target groups so the VPC can be removed.
- Improve the output of the Ansible playbook by reducing noise:
  * add the `--quiet` option to the aws-nuke command
  * do not include `stdout` and `stderr` in the registered output of the aws-nuke task;
    `stdout_lines` and `stderr_lines` are enough and more readable.
- `requirements.txt`: do not pin versions of Python modules; instead, use the latest version of each module,
  since those are baked into the container image anyway.
  That is useful here to get the `DeletionMode` option of the `delete_stack()` call for deleting faulty CloudFormation stacks.
- Add the duration of the "cleanup" run at the end for each sandbox, for example:
  ```
  2024-10-09T06:39:11+00:00 sandbox123 reset took 30m20s
  ```
- CloudFormation stacks sometimes get stuck in DELETE_FAILED because a resource that is part of the stack was already deleted.
  In `manual_cleanup.py`, use the `FORCE_DELETE_STACK` deletion mode.
- Fix some Ansible deprecation warnings
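
For context, the DisableStopProtection and EC2Images options mentioned above live in the aws-nuke configuration template, which is not among the files shown in this diff. Below is a minimal, hypothetical sketch of what such settings look like in an aws-nuke v3 config; the `settings` layout and the exact resource-type and option names are assumptions based on the aws-nuke v3 documentation, not a copy of the committed template.

```yaml
# Hypothetical aws-nuke v3 config excerpt (assumed syntax, not the committed template)
settings:
  EC2Instance:
    # Let aws-nuke lift stop protection before terminating instances
    DisableStopProtection: true
  EC2Image:
    # Also target disabled and deprecated AMIs, and lift deregistration protection
    IncludeDisabled: true
    IncludeDeprecated: true
    DisableDeregistrationProtection: true
```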
fridim authored Oct 10, 2024
1 parent 03dff42 commit e854837
Showing 9 changed files with 219 additions and 55 deletions.
2 changes: 1 addition & 1 deletion Containerfile.conan
@@ -9,7 +9,7 @@ RUN make sandbox-list
FROM registry.access.redhat.com/ubi8/ubi:latest
MAINTAINER Guillaume Coré <[email protected]>

ARG AWSNUKE_VERSION=v3.22.0
ARG AWSNUKE_VERSION=v3.26.0
ARG AWSNUKE_LEGACY_VERSION=v2.25.0
ARG RUSH_VERSION=v0.5.4

5 changes: 5 additions & 0 deletions conan/conan.sh
@@ -64,6 +64,10 @@ fi
# the conan process owning the lock.
lock_timeout=${lock_timeout:-2}


# Variable to manage output loglevel
debug=false

##############

export AWSCLI
@@ -87,6 +91,7 @@ export threads
export vault_file
export workdir
export sandbox_filter
export debug

ORIG="$(cd "$(dirname "$0")" || exit; pwd)"

74 changes: 37 additions & 37 deletions conan/requirements.txt
@@ -1,37 +1,37 @@
ansible-core==2.15.6
boto3==1.29.5
botocore==1.32.5
cffi==1.16.0
colorama==0.4.6
cryptography==41.0.5
decorator==5.1.1
distro==1.8.0
dnspython==2.4.2
docutils==0.20.1
gssapi==1.8.3
importlib-resources==5.0.7
ipa==4.10.2
ipaclient==4.10.2
ipalib==4.10.2
ipaplatform==4.10.2
ipapython==4.10.2
Jinja2==3.1.2
jmespath==1.0.1
MarkupSafe==2.1.3
netaddr==0.9.0
packaging==23.2
psutil==5.9.6
pyasn1==0.5.1
pyasn1-modules==0.3.0
pycparser==2.21
pypng==0.20220715.0
python-dateutil==2.8.2
PyYAML==6.0.1
qrcode==7.4.2
resolvelib==1.0.1
rsa==4.9
s3transfer==0.7.0
selinux==0.3.0
six==1.16.0
typing_extensions==4.8.0
urllib3==1.26.18
ansible-core
boto3
botocore
cffi
colorama
cryptography
decorator
distro
dnspython
docutils
gssapi
importlib-resources
ipa
ipaclient
ipalib
ipaplatform
ipapython
Jinja2
jmespath
MarkupSafe
netaddr
packaging
psutil
pyasn1
pyasn1-modules
pycparser
pypng
python-dateutil
PyYAML
qrcode
resolvelib
rsa
s3transfer
selinux
six
typing_extensions
urllib3
16 changes: 15 additions & 1 deletion conan/wipe_sandbox.sh
@@ -7,6 +7,7 @@ max_retries=${max_retries:-2}
aws_nuke_retries=${aws_nuke_retries:-0}
# retry after 48h
TTL_EVENTLOG=$((3600*24))
debug=${debug:-false}


# Mandatory ENV variables
@@ -125,7 +126,9 @@ EOM
fi

if grep -q ConditionalCheckFailedException "${errlog}"; then
echo "$(date -uIs) Another process is already cleaning up ${sandbox}: skipping"
if [ "${debug}" = "true" ]; then
echo "$(date -uIs) Another process is already cleaning up ${sandbox}: skipping"
fi
rm "${errlog}"
return 1
else
@@ -201,6 +204,7 @@ sandbox_reset() {
echo "$(date -uIs) reset sandbox${s}" >> "${eventlog}"

echo "$(date -uIs) ${sandbox} reset starting..."
start_time=$(date +%s)

export ANSIBLE_NO_TARGET_SYSLOG=True

@@ -234,8 +238,18 @@ sandbox_reset() {
-e kerberos_password="${kerberos_password:-}" \
reset_single.yml > "${logfile}"; then
echo "$(date -uIs) ${sandbox} reset OK"
end_time=$(date +%s)
duration=$((end_time - start_time))
# Calculate the time it took
echo "$(date -uIs) ${sandbox} reset took $((duration / 60))m$((duration % 60))s"

rm "${eventlog}"
else
end_time=$(date +%s)
duration=$((end_time - start_time))
# Calculate the time it took
echo "$(date -uIs) ${sandbox} reset took $((duration / 60))m$((duration % 60))s"

echo "$(date -uIs) ${sandbox} reset FAILED." >&2
echo "$(date -uIs) =========BEGIN========== ${logfile}" >&2
cat "${logfile}" >&2
110 changes: 105 additions & 5 deletions playbooks/roles/infra-aws-sandbox/files/manual_cleanup.py
@@ -15,6 +15,8 @@
with open('/tmp/aws_nuke_filters.json', 'r') as f:
aws_nuke_filter.update(json.load(f))

clientlaticce = boto3.client('vpc-lattice')

# Delete all EC2VPC

client = boto3.client('ec2')
@@ -23,6 +25,8 @@
response = client.describe_vpcs()

for vpc in response['Vpcs']:

print("Deleting VPC: " + vpc['VpcId'])
# Delete all subnets
response2 = client.describe_subnets(
Filters=[
@@ -105,6 +109,33 @@
print("Disassociated route table: " + association['RouteTableAssociationId'])
changed = True

# deregister all VPC lattice target groups

response5 = clientlaticce.list_target_groups(
vpcIdentifier=vpc['VpcId']
)

for target_group in response5['items']:
# remove all targets from the target group

response6 = clientlaticce.list_targets(
targetGroupIdentifier=target_group['arn']
)

if len(response6['items']) != 0:
clientlaticce.deregister_targets(
targetGroupIdentifier=target_group['arn'],
targets=[
{ 'id': y['id'], 'port': y['port'] } for y in response6['items']
]
)
print("Deregistered targets: " + response6['items'])

clientlaticce.delete_target_group(
targetGroupIdentifier=target_group['arn']
)
print("Deregistered target group: " + target_group['arn'])
changed = True

# Delete VPC

@@ -113,12 +144,38 @@
)

print("Deleted VPC: " + vpc['VpcId'])

changed = True

except botocore.exceptions.ClientError as e:
print(e)

try:
response = client.describe_images(Owners=['self'], IncludeDeprecated=True, IncludeDisabled=True)

for image in response['Images']:
print("Deregistering AMI: " + image['ImageId'])
client.deregister_image(
ImageId=image['ImageId']
)
print("Deregistered AMI: " + image['ImageId'])
for device in image.get('BlockDeviceMappings', []):
snapshot_id = device.get('Ebs', {}).get('SnapshotId')
if snapshot_id:
print("Deleting snapshot: %s associated with AMI: %s" % (snapshot_id, image['ImageId']))
client.delete_snapshot(SnapshotId=snapshot_id)
print("Successfully deleted snapshot: %s" % (snapshot_id))
changed = True
# Delete all snapshots
response = client.describe_snapshots(OwnerIds=['self'])

for snapshot in response['Snapshots']:
client.delete_snapshot(
SnapshotId=snapshot['SnapshotId']
)
print("Deleted snapshot: " + snapshot['SnapshotId'])
changed = True
except botocore.exceptions.ClientError as e:
print(e)

# Delete all Cognito User Pools

@@ -280,10 +337,35 @@
except botocore.exceptions.ClientError as e:
print(e)

# Cleanup Public ECR
client = boto3.client('ecr-public')

if os.environ.get('AWS_REGION') == 'us-east-1':

# Release all Elastic IPs

try:
response = client.describe_addresses()

for address in response['Addresses']:
# Disassociate address
if address.get('AssociationId'):
client.disassociate_address(
AssociationId=address['AssociationId']
)
print("Disassociated Elastic IP: " + address['AllocationId'])

client.release_address(
AllocationId=address['AllocationId'],
NetworkBorderGroup=address.get('NetworkBorderGroup', '')
)
print("Released Elastic IP: " + address['AllocationId'])
changed = True
except botocore.exceptions.ClientError as e:
print(e)



if os.environ.get('AWS_DEFAULT_REGION') == 'us-east-1':
# Cleanup Public ECR
client = boto3.client('ecr-public')
try:
response = client.describe_repositories()

@@ -361,8 +443,26 @@
changed = True
# UninitializedAccountException
except client.exceptions.UninitializedAccountException:
print("MGNSourceServer is not supported in this region")
pass
#print("MGNSourceServer is not supported in this region")

# Delete cloudformation stack
client = boto3.client('cloudformation')

try:
response = client.describe_stacks()

for stack in response['Stacks']:
# Check if stack is in DELETE_FAILED state
if stack['StackStatus'] == 'DELETE_FAILED':
client.delete_stack(
StackName=stack['StackName'],
DeletionMode='FORCE_DELETE_STACK'
)
print("Deleted stack: " + stack['StackName'])
changed = True
except botocore.exceptions.ClientError as e:
print(e)



28 changes: 24 additions & 4 deletions playbooks/roles/infra-aws-sandbox/tasks/iam.yml
@@ -5,7 +5,27 @@
template_body: "{{ lookup('file', 'CF-IAM.json') }}"
region: "{{ aws_region }}"
stack_name: roles
register: _cfiamrole
until: _cfiamrole is succeeded
delay: 60
retries: 5
register: r_cf
ignore_errors: yes

- when: r_cf is failed
block:
- name: Delete IAM role Cloudformation stack
cloudformation:
profile: "{{ account_profile }}"
region: "{{ aws_region }}"
stack_name: roles
state: absent

- name: Delete the config-rule-role role
iam_role:
profile: "{{ account_profile }}"
name: config-rule-role
state: absent

- name: Retry create IAM role using Cloudformation
cloudformation:
profile: "{{ account_profile }}"
template_body: "{{ lookup('file', 'CF-IAM.json') }}"
region: "{{ aws_region }}"
stack_name: roles
4 changes: 2 additions & 2 deletions playbooks/roles/infra-aws-sandbox/tasks/keypair.yml
@@ -7,7 +7,7 @@
key_material: "{{ opentlc_admin_backdoor }}"
aws_access_key: "{{ assumed_role.sts_creds.access_key }}"
aws_secret_key: "{{ assumed_role.sts_creds.secret_key }}"
security_token: "{{ assumed_role.sts_creds.session_token }}"
session_token: "{{ assumed_role.sts_creds.session_token }}"
loop: "{{ all_regions }}"
loop_control:
loop_var: _region
@@ -23,7 +23,7 @@
key_material: "{{ ocpkey }}"
aws_access_key: "{{ assumed_role.sts_creds.access_key }}"
aws_secret_key: "{{ assumed_role.sts_creds.secret_key }}"
security_token: "{{ assumed_role.sts_creds.session_token }}"
session_token: "{{ assumed_role.sts_creds.session_token }}"
loop: "{{ all_regions }}"
loop_control:
loop_var: _region
