You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We encountered an error a few times where containerd fails to start a container with overlaybd. The error we see in containerd is like:
Aug 23 00:21:21 containerd[752]: time="2023-08-23T00:21:21.739154861Z" level=error msg="CreateContainer within sandbox \"89966fc4e8a15447b8f6711c424b0a5069ff8c71efb2a2cf47960a453c1efd13\" for &ContainerMetadata{Name:driver,Attempt:0,} failed" error="rpc error: code = Canceled desc = failed to create containerd container: context canceled"
Aug 23 00:23:35 containerd[752]: time="2023-08-23T00:23:35.118577799Z" level=error msg="CreateContainer within sandbox \"89966fc4e8a15447b8f6711c424b0a5069ff8c71efb2a2cf47960a453c1efd13\" for &ContainerMetadata{Name:driver,Attempt:0,} failed" error="rpc error: code = DeadlineExceeded desc = failed to create containerd container: context deadline exceeded"
From overlaybd.log, we see it received 503 when downloading a blob, but then ran into an unrecoverable state - it's stuck in an infinite loop in pread in sure_file.cpp. These are the logs:
2023/08/22 23:39:22|ERROR|th=00007F9282FF4E80|/src/src/overlaybd/registryfs/registryfs.cpp:432|preadv:failed to perform HTTP GET [m_url=https://harbor.xyz.com/v2/driver/blobs/sha256:df6cf190e90e2f81df91fc5dd5374c74bc21bb4d652aae4fae9869430dcb4882][offset=0]
2023/08/22 23:39:22|ERROR|th=00007F9282FF4E80|/src/src/overlaybd/cache/frontend/cached_file.cpp:181|preadvInternal:src file read failed, read : -1, expectRead : 3584, size_ : 3584, offset : 0, sum : 3584 errno=2(No such file or directory)
2023/08/22 23:39:23|INFO |th=00007F9284805280|/src/src/overlaybd/registryfs/registryfs.cpp:198|operator():getActualUrl for: https://harbor.xyz.com/v2/driver/blobs/sha256:564783e1e1be393d2c553fceb8a199f3fe25259fe2ff4f34f6d549161a7aa1f6, time used: 128 ms
2023/08/22 23:39:23|ERROR|th=00007F9284805280|/src/src/overlaybd/registryfs/registryfs.cpp:166|GET:Failed to fetch data [ret=503][url=https://harbor.xyz.com/v2/driver/blobs/sha256:564783e1e1be393d2c553fceb8a199f3fe25259fe2ff4f34f6d549161a7aa1f6]
2023/08/22 23:39:23|WARN |th=00007F9284805280|/src/src/overlaybd/registryfs/registryfs.cpp:426|preadv:failed to perform HTTP GET, going to retry [code=503][offset=0][count=1048576][ret_len=0]errno=125(Operation canceled)
2023/08/22 23:39:23|INFO |th=00007F9282FF4E80|/src/src/overlaybd/registryfs/registryfs.cpp:198|operator():getActualUrl for: https://harbor.xyz.com/v2/driver/blobs/sha256:df6cf190e90e2f81df91fc5dd5374c74bc21bb4d652aae4fae9869430dcb4882, time used: 138 ms
2023/08/22 23:39:23|INFO |th=00007F9284805280|/src/src/overlaybd/registryfs/registryfs.cpp:198|operator():getActualUrl for: https://harbor.xyz.com/v2/driver/blobs/sha256:564783e1e1be393d2c553fceb8a199f3fe25259fe2ff4f34f6d549161a7aa1f6, time used: 125 ms
2023/08/22 23:39:23|ERROR|th=00007F9284805280|/src/src/overlaybd/registryfs/registryfs.cpp:166|GET:Failed to fetch data [ret=503][url=https://harbor.xyz.com/v2/driver/blobs/sha256:564783e1e1be393d2c553fceb8a199f3fe25259fe2ff4f34f6d549161a7aa1f6]
2023/08/22 23:39:23|ERROR|th=00007F9284805280|/src/src/overlaybd/registryfs/registryfs.cpp:432|preadv:failed to perform HTTP GET [m_url=https://harbor.xyz.com/v2/driver/blobs/sha256:564783e1e1be393d2c553fceb8a199f3fe25259fe2ff4f34f6d549161a7aa1f6][offset=0]
2023/08/22 23:39:23|ERROR|th=00007F9284805280|/src/src/overlaybd/cache/frontend/cached_file.cpp:181|preadvInternal:src file read failed, read : -1, expectRead : 1048576, size_ : 721721856, offset : 0, sum : 1048576 errno=2(No such file or directory)
2023/08/22 23:39:23|INFO |th=00007F9284805280|/src/src/overlaybd/registryfs/registryfs.cpp:198|operator():getActualUrl for: https://harbor.xyz.com/v2/driver/blobs/sha256:564783e1e1be393d2c553fceb8a199f3fe25259fe2ff4f34f6d549161a7aa1f6, time used: 200 ms
2023/08/22 23:39:23|INFO |th=00007F928601D300|/src/src/overlaybd/lsmt/file.cpp:1531|do_parallel_load_index:check file if normalfile or LSMTFile
2023/08/22 23:39:23|INFO |th=00007F9285815AC0|/src/src/overlaybd/lsmt/file.cpp:1531|do_parallel_load_index:check file if normalfile or LSMTFile
2023/08/22 23:39:23|INFO |th=00007F928500EB00|/src/src/overlaybd/lsmt/file.cpp:1531|do_parallel_load_index:check file if normalfile or LSMTFile
2023/08/22 23:39:23|INFO |th=00007F9284809300|/src/src/overlaybd/lsmt/file.cpp:1531|do_parallel_load_index:check file if normalfile or LSMTFile
2023/08/22 23:39:23|INFO |th=00007F9284002B00|/src/src/overlaybd/lsmt/file.cpp:1531|do_parallel_load_index:check file if normalfile or LSMTFile
2023/08/22 23:39:23|INFO |th=00007F92837FB2C0|/src/src/overlaybd/lsmt/file.cpp:1531|do_parallel_load_index:check file if normalfile or LSMTFile
2023/08/22 23:39:23|INFO |th=00007F9282FF7B40|/src/src/overlaybd/lsmt/file.cpp:1531|do_parallel_load_index:check file if normalfile or LSMTFile
2023/08/22 23:39:23|INFO |th=00007F92827F1300|/src/src/overlaybd/lsmt/file.cpp:1531|do_parallel_load_index:check file if normalfile or LSMTFile
2023/08/22 23:39:23|INFO |th=00007F9281FE92C0|/src/src/overlaybd/lsmt/file.cpp:1531|do_parallel_load_index:check file if normalfile or LSMTFile
2023/08/22 23:39:23|INFO |th=00007F92817E6F80|/src/src/overlaybd/lsmt/file.cpp:1531|do_parallel_load_index:check file if normalfile or LSMTFile
2023/08/22 23:39:23|INFO |th=00007F9280FE1B40|/src/src/overlaybd/lsmt/file.cpp:1531|do_parallel_load_index:check file if normalfile or LSMTFile
2023/08/22 23:39:23|INFO |th=00007F92807DEB80|/src/src/overlaybd/lsmt/file.cpp:1531|do_parallel_load_index:check file if normalfile or LSMTFile
2023/08/22 23:39:23|ERROR|th=00007F9284809300|/src/src/overlaybd/lsmt/file.cpp:1158|verify_ht:header magic/type don't match
2023/08/22 23:39:23|ERROR|th=00007F9284809300|/src/src/overlaybd/lsmt/file.cpp:1552|do_parallel_load_index:failed to load index from 3-th file
2023/08/22 23:39:24|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:24|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:25|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:25|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:26|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:27|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:27|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:28|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:29|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:29|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:30|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:31|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:31|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:32|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:32|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:33|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:34|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:34|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
2023/08/22 23:39:35|ERROR|th=00007F9282FF7B40|/src/src/sure_file.cpp:110|pread:pread read partial data. count:4096, offset:0, ret:0, got_cnt:3584, errno:125
Environment: Azure.
What did you expect to happen?
Restarting overlaybd-tcmu and overlaybd-snapshotter does fix the issue. However, we expect overlaybd to handle the errors properly and not run into an infinite loop..
How can we reproduce it?
We don't have a consistent repro. It seems the issue is triggered in a rare race condition. However, because of our large load, this happens quite frequently.
What is the version of your Overlaybd?
v0.6.12
What is your OS environment?
Ubuntu 20.04
Are you willing to submit PRs to fix it?
Yes, I am willing to fix it.
The text was updated successfully, but these errors were encountered:
It's indeed a defect. It happens when loading overlaybd lsmt metas encounter network failure.
Overlaybd failed to test tar/zfile header, and the opened file is treate as raw lsmt file (a 4K alignment format), however the file size is smaller than 4K. lsmt file read has a logic to make sure it reads 4K data, so it fall into an infinite loop. #253 and #257 enhance tar/zfile checking and logging #254 refactor sure behavior
What happened in your environment?
We encountered an error a few times where containerd fails to start a container with overlaybd. The error we see in containerd is like:
From overlaybd.log, we see it received 503 when downloading a blob, but then ran into an unrecoverable state - it's stuck in an infinite loop in
pread
insure_file.cpp
. These are the logs:Environment: Azure.
What did you expect to happen?
Restarting overlaybd-tcmu and overlaybd-snapshotter does fix the issue. However, we expect overlaybd to handle the errors properly and not run into an infinite loop..
How can we reproduce it?
We don't have a consistent repro. It seems the issue is triggered in a rare race condition. However, because of our large load, this happens quite frequently.
What is the version of your Overlaybd?
v0.6.12
What is your OS environment?
Ubuntu 20.04
Are you willing to submit PRs to fix it?
The text was updated successfully, but these errors were encountered: