Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Too big /var/log/messages on vmhost-x86-copr01 #2799

Closed
FrostyX opened this issue Jul 1, 2023 · 8 comments
Closed

Too big /var/log/messages on vmhost-x86-copr01 #2799

FrostyX opened this issue Jul 1, 2023 · 8 comments
Labels
fedora-copr-admin Tasks that need to be done by Fedora Copr administrator

Comments

@FrostyX
Copy link
Member

FrostyX commented Jul 1, 2023

<zodbot>  PROBLEM -
    vmhost-x86-copr01.rdu-cc.fedoraproject.org/Disk_Space_/ is
    WARNING: DISK WARNING - free space: / 4326 MiB (14.28% inode=59%):
    (noc01)
<nirik>  so thats because /var/log/messages is really big... it seems
    like lots of copr sessions. I'll rotate the file, but FrostyX or
    other copr folks would need to look into the underlying
    problem. I'll compress the file for now.
@FrostyX FrostyX added the fedora-copr-admin Tasks that need to be done by Fedora Copr administrator label Jul 1, 2023
@FrostyX
Copy link
Member Author

FrostyX commented Jul 4, 2023

At the moment the disk space is fine

Image

But I will try to setup some rotation for /var/log/messages that nirik had to do manually.

@FrostyX
Copy link
Member Author

FrostyX commented Jul 4, 2023

But I will try to setup some rotation for /var/log/messages that nirik had to do manually.

This might solve it
https://pagure.io/fedora-infra/ansible/pull-request/1499

@nirik
Copy link

nirik commented Jul 5, 2023

Note that it's "fine" now because I keep deleting logs so they stop alerting.

I'll just ack them so you can look more closely.

This is not just that virthost, its all of them... it seems like the backend is connecting a LOT so the messages are mostly ssh connections...

@FrostyX
Copy link
Member Author

FrostyX commented Jul 6, 2023

Note that it's "fine" now because I keep deleting logs so they stop alerting.

Thank you @nirik, I didn't know. I thought you only deleted it once.

This is not just that virthost, its all of them... it seems like the backend is connecting a LOT so the messages are mostly ssh connections...

It looks like resalloc went crazy. There are tens of

 INFO 2023-07-06 14:39:02,011 PID:3939926:140290543318720(Allocator-10992094) Allocating copr_hv_x86_64_01_prod_10992094_20230706_143901 finished => ENDED
 INFO 2023-07-06 14:39:02,015 PID:3939926:140289436006080(Allocator-10992093) Allocating copr_hv_x86_64_01_prod_10992093_20230706_143901 finished => ENDED
 INFO 2023-07-06 14:39:02,021 PID:3939926:140291608667840(Allocator-10992095) Allocating copr_hv_x86_64_01_prod_10992095_20230706_143901 finished => ENDED

rows in /var/log/resallocserver/main.log every second. When stopped resalloc for a few seconds, the SSH connections stopped appearing in the hypervisor log.

I am trying to investigate what is wrong with resalloc.

@FrostyX
Copy link
Member Author

FrostyX commented Jul 6, 2023

Ah, that's because all of those allocations fail

INFO:root:Hypervisor ID: 1
INFO:root:Using an IPv6 ending with ':c148'
DEBUG:root:Logging to qemu+ssh://[email protected]/system
DEBUG:root:cmd: mkisofs -o /tmp/tmpdhja99z3/config.iso -V eimg_config -r -J --quiet /tmp/tmpdhja99z3/config
xorriso 1.5.4 : RockRidge filesystem manipulator, libburnia project.

DEBUG:root: -> exit_status=0, time=0.105s
INFO:root:using config image /tmp/tmpdhja99z3/config.iso
DEBUG:root:cmd: virsh -c qemu+ssh://[email protected]/system vol-create-as images copr_hv_x86_64_02_prod_10992112_20230706_143902_config 1M
error: failed to connect to the hypervisor
error: End of file while reading data: Warning: Permanently added 'vmhost-x86-copr02.rdu-cc.fedoraproject.org' (ED25519) to the list of known hosts.^M
virt-ssh-helper: cannot connect to '/var/run/libvirt/libvirt-sock': Failed to connect socket to '/var/run/libvirt/libvirt-sock': Connection refused: Input/output error
DEBUG:root: -> exit_status=1, time=0.608s
DEBUG:root:Cleaning up ...
Traceback (most recent call last):
  File "/var/lib/resallocserver/provision/libvirt-new", line 461, in <module>
    _main()
  File "/var/lib/resallocserver/provision/libvirt-new", line 452, in _main
    spawner.spawn()
  File "/var/lib/resallocserver/provision/libvirt-new", line 342, in spawn
    self.create_volume_from_iso(config_vol_name, '1M', config_iso,
  File "/var/lib/resallocserver/provision/libvirt-new", line 226, in create_volume_from_iso
    self.alloc_disk(name, prealloc_size, pool)
  File "/var/lib/resallocserver/provision/libvirt-new", line 152, in alloc_disk
    raise Exception("can't create '{0}' disk".format(name))
Exception: can't create 'copr_hv_x86_64_02_prod_10992112_20230706_143902_config' disk

@nirik
Copy link

nirik commented Jul 6, 2023

Strange. It looks like they are in a weird state. :(

If you like I can upgrade/reboot them and we can see if that clears things up? Or do you want to investigate more first?

@FrostyX
Copy link
Member Author

FrostyX commented Jul 6, 2023

The libvirtd service died 4 days ago, and therefore all the virsh commands failed. I don't know the reason why it died though

● libvirtd.service - Virtualization daemon
   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sun 2023-07-02 06:36:27 GMT; 4 days ago
     Docs: man:libvirtd(8)
           https://libvirt.org
  Process: 3321797 ExecStart=/usr/sbin/libvirtd $LIBVIRTD_ARGS (code=exited, status=6)
 Main PID: 3321797 (code=exited, status=6)
    Tasks: 2 (limit: 32768)
   Memory: 64.3G
   CGroup: /system.slice/libvirtd.service
           ├─3506 /usr/sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-ro --dhcp-script=/usr/libexec/libvirt_leaseshelper
           └─3509 /usr/sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-ro --dhcp-script=/usr/libexec/libvirt_leaseshelper

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

I would say it was OOM kill but since it died on 3/4 hypervisors at the same time, IMHO something else happened. We discussed with @nirik that it would be a good idea to update and reboot those machines, but I would like to do that once @praiskup is back from vacation since the libvirt setup is a bit tricky for me.

Until then, I did systemctl start libvirtd.service and it seems to work fine. New builders are spawning, no nonsense in /var/log/messages.

@praiskup
Copy link
Member

praiskup commented Apr 8, 2024

It seems we don't experience /var/log/messages message floods nowadays. Closing as WORKSFORME

@praiskup praiskup closed this as completed Apr 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fedora-copr-admin Tasks that need to be done by Fedora Copr administrator
Projects
Archived in project
Development

No branches or pull requests

3 participants