Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

overlaybd stuck in infinite loop in pread #255

Closed
1 task
tianouya-db opened this issue Aug 24, 2023 · 3 comments
Closed
1 task

overlaybd stuck in infinite loop in pread #255

tianouya-db opened this issue Aug 24, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@tianouya-db
Copy link

What happened in your environment?

We encountered an error a few times where containerd fails to start a container with overlaybd. The error we see in containerd is like:

Aug 23 00:21:21 containerd[752]: time="2023-08-23T00:21:21.739154861Z" level=error msg="CreateContainer within sandbox \"89966fc4e8a15447b8f6711c424b0a5069ff8c71efb2a2cf47960a453c1efd13\" for &ContainerMetadata{Name:driver,Attempt:0,} failed" error="rpc error: code = Canceled desc = failed to create containerd container: context canceled"
Aug 23 00:23:35 containerd[752]: time="2023-08-23T00:23:35.118577799Z" level=error msg="CreateContainer within sandbox \"89966fc4e8a15447b8f6711c424b0a5069ff8c71efb2a2cf47960a453c1efd13\" for &ContainerMetadata{Name:driver,Attempt:0,} failed" error="rpc error: code = DeadlineExceeded desc = failed to create containerd container: context deadline exceeded"

From overlaybd.log, we see it received 503 when downloading a blob, but then ran into an unrecoverable state - it's stuck in an infinite loop in pread in sure_file.cpp. These are the logs:

2023/08/22 23:39:22|ERROR|th=00007F9282FF4E80|/src/src/overlaybd/registryfs/registryfs.cpp:432|preadv:failed to perform HTTP GET [m_url=https://harbor.xyz.com/v2/driver/blobs/sha256:df6cf190e90e2f81df91fc5dd5374c74bc21bb4d652aae4fae9869430dcb4882][offset=0]
2023/08/22 23:39:22|ERROR|th=00007F9282FF4E80|/src/src/overlaybd/cache/frontend/cached_file.cpp:181|preadvInternal:src file read failed, read : -1, expectRead : 3584, size_ : 3584, offset : 0, sum : 3584 errno=2(No such file or directory)
2023/08/22 23:39:23|INFO |th=00007F9284805280|/src/src/overlaybd/registryfs/registryfs.cpp:198|operator():getActualUrl for: https://harbor.xyz.com/v2/driver/blobs/sha256:564783e1e1be393d2c553fceb8a199f3fe25259fe2ff4f34f6d549161a7aa1f6, time used: 128 ms
2023/08/22 23:39:23|ERROR|th=00007F9284805280|/src/src/overlaybd/registryfs/registryfs.cpp:166|GET:Failed to fetch data [ret=503][url=https://harbor.xyz.com/v2/driver/blobs/sha256:564783e1e1be393d2c553fceb8a199f3fe25259fe2ff4f34f6d549161a7aa1f6]
2023/08/22 23:39:23|WARN |th=00007F9284805280|/src/src/overlaybd/registryfs/registryfs.cpp:426|preadv:failed to perform HTTP GET, going to retry [code=503][offset=0][count=1048576][ret_len=0]errno=125(Operation canceled)
2023/08/22 23:39:23|INFO |th=00007F9282FF4E80|/src/src/overlaybd/registryfs/registryfs.cpp:198|operator():getActualUrl for: https://harbor.xyz.com/v2/driver/blobs/sha256:df6cf190e90e2f81df91fc5dd5374c74bc21bb4d652aae4fae9869430dcb4882, time used: 138 ms
2023/08/22 23:39:23|INFO |th=00007F9284805280|/src/src/overlaybd/registryfs/registryfs.cpp:198|operator():getActualUrl for: https://harbor.xyz.com/v2/driver/blobs/sha256:564783e1e1be393d2c553fceb8a199f3fe25259fe2ff4f34f6d549161a7aa1f6, time used: 125 ms
2023/08/22 23:39:23|ERROR|th=00007F9284805280|/src/src/overlaybd/registryfs/registryfs.cpp:166|GET:Failed to fetch data [ret=503][url=https://harbor.xyz.com/v2/driver/blobs/sha256:564783e1e1be393d2c553fceb8a199f3fe25259fe2ff4f34f6d549161a7aa1f6]
2023/08/22 23:39:23|ERROR|th=00007F9284805280|/src/src/overlaybd/registryfs/registryfs.cpp:432|preadv:failed to perform HTTP GET [m_url=https://harbor.xyz.com/v2/driver/blobs/sha256:564783e1e1be393d2c553fceb8a199f3fe25259fe2ff4f34f6d549161a7aa1f6][offset=0]
2023/08/22 23:39:23|ERROR|th=00007F9284805280|/src/src/overlaybd/cache/frontend/cached_file.cpp:181|preadvInternal:src file read failed, read : -1, expectRead : 1048576, size_ : 721721856, offset : 0, sum : 1048576 errno=2(No such file or directory)
2023/08/22 23:39:23|INFO |th=00007F9284805280|/src/src/overlaybd/registryfs/registryfs.cpp:198|operator():getActualUrl for: https://harbor.xyz.com/v2/driver/blobs/sha256:564783e1e1be393d2c553fceb8a199f3fe25259fe2ff4f34f6d549161a7aa1f6, time used: 200 ms
2023/08/22 23:39:23|INFO |th=00007F928601D300|/src/src/overlaybd/lsmt/file.cpp:1531|do_parallel_load_index:check file if normalfile or LSMTFile
2023/08/22 23:39:23|INFO |th=00007F9285815AC0|/src/src/overlaybd/lsmt/file.cpp:1531|do_parallel_load_index:check file if normalfile or LSMTFile
2023/08/22 23:39:23|INFO |th=00007F928500EB00|/src/src/overlaybd/lsmt/file.cpp:1531|do_parallel_load_index:check file if normalfile or LSMTFile
2023/08/22 23:39:23|INFO |th=00007F9284809300|/src/src/overlaybd/lsmt/file.cpp:1531|do_parallel_load_index:check file if normalfile or LSMTFile
2023/08/22 23:39:23|INFO |th=00007F9284002B00|/src/src/overlaybd/lsmt/file.cpp:1531|do_parallel_load_index:check file if normalfile or LSMTFile
2023/08/22 23:39:23|INFO |th=00007F92837FB2C0|/src/src/overlaybd/lsmt/file.cpp:1531|do_parallel_load_index:check file if normalfile or LSMTFile
2023/08/22 23:39:23|INFO |th=00007F9282FF7B40|/src/src/overlaybd/lsmt/file.cpp:1531|do_parallel_load_index:check file if normalfile or LSMTFile
2023/08/22 23:39:23|INFO |th=00007F92827F1300|/src/src/overlaybd/lsmt/file.cpp:1531|do_parallel_load_index:check file if normalfile or LSMTFile
2023/08/22 23:39:23|INFO |th=00007F9281FE92C0|/src/src/overlaybd/lsmt/file.cpp:1531|do_parallel_load_index:check file if normalfile or LSMTFile
2023/08/22 23:39:23|INFO |th=00007F92817E6F80|/src/src/overlaybd/lsmt/file.cpp:1531|do_parallel_load_index:check file if normalfile or LSMTFile
2023/08/22 23:39:23|INFO |th=00007F9280FE1B40|/src/src/overlaybd/lsmt/file.cpp:1531|do_parallel_load_index:check file if normalfile or LSMTFile
2023/08/22 23:39:23|INFO |th=00007F92807DEB80|/src/src/overlaybd/lsmt/file.cpp:1531|do_parallel_load_index:check file if normalfile or LSMTFile
2023/08/22 23:39:23|ERROR|th=00007F9284809300|/src/src/overlaybd/lsmt/file.cpp:1158|verify_ht:header magic/type don't match
2023/08/22 23:39:23|ERROR|th=00007F9284809300|/src/src/overlaybd/lsmt/file.cpp:1552|do_parallel_load_index:failed to load index from 3-th file
2023/08/22 23:39:24|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:24|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:25|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:25|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:26|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:27|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:27|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:28|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:29|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:29|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:30|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:31|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:31|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:32|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:32|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:33|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:34|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:34|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:35|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125

Environment: Azure.

What did you expect to happen?

Restarting overlaybd-tcmu and overlaybd-snapshotter does fix the issue. However, we expect overlaybd to handle the errors properly and not run into an infinite loop..

How can we reproduce it?

We don't have a consistent repro. It seems the issue is triggered in a rare race condition. However, because of our large load, this happens quite frequently.

What is the version of your Overlaybd?

v0.6.12

What is your OS environment?

Ubuntu 20.04

Are you willing to submit PRs to fix it?

  • Yes, I am willing to fix it.
@tianouya-db tianouya-db added the bug Something isn't working label Aug 24, 2023
@yuchen0cc
Copy link
Contributor

It's indeed a defect. It happens when loading overlaybd lsmt metas encounter network failure.
Overlaybd failed to test tar/zfile header, and the opened file is treate as raw lsmt file (a 4K alignment format), however the file size is smaller than 4K. lsmt file read has a logic to make sure it reads 4K data, so it fall into an infinite loop.
#253 and #257 enhance tar/zfile checking and logging
#254 refactor sure behavior

@tianouya-db
Copy link
Author

@yuchen0cc I assume the 0.6.13 version contains the fixes?

@yuchen0cc
Copy link
Contributor

@yuchen0cc I assume the 0.6.13 version contains the fixes?

Yes, it does.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants