Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data Loss on IO errors #13

Open
fia0 opened this issue Nov 11, 2022 · 0 comments
Open

Data Loss on IO errors #13

fia0 opened this issue Nov 11, 2022 · 0 comments
Labels
bug Something isn't working T3.3 - Robustness

Comments

@fia0
Copy link
Contributor

fia0 commented Nov 11, 2022

We may construct a situation where we incur some data loss in our data storage when operations in the IO queue fail.
Following is a construction of the abstracted case:

                                                          ┌──────────────┐                                  ┌────────────┐
                                                          │              │                                  │            │
                                         ┌───────────────►│   Success    ├─────────────────────────────────►│   Valid    │
                                         │                │              │                                  │            │
┌──────────────┐           ┌─────────────┴─┐              └──────────────┘                                  └────────────┘
│              │           │               │                                                                      ▲
│              │           │               │                                                                      │
│    Insert    ├──────────►│     Write     │                                       ┌──────────────┐               │
│              │           │               │                           ┌──────────►│              ├───────────────┘
│              │           │               │                           │           │   In-cache   │
└──────────────┘           └─────────────┬─┘              ┌────────────┴─┐         │              │
                                         │                │              │         └──────────────┘
                                         └───────────────►│   Failure    │
                                                          │              │         ┌──────────────┐        ┌───────────────┐
                                                          └────────────┬─┘         │              │        │               │
                                                                       │           │   Evicted    ├───────►│    Invalid    │
                                                                       └──────────►│              │        │               │
                                                                                   └──────────────┘        └───────────────┘

The main problem lies in the condition of the cache after an IO operation has finished erroneously. If the value is still present all is well and we can try to write the data again. Problematic is the eviction of data which can happen after an operation has failed. This can happen due to two factors:

  1. The unpinning of data in the cache.
  2. The preemptive movement of nodes from the InWriteback state to Unmodified.

We will explain and discuss both here shortly.

Unpinning of data

Entries of the cache are held in a special Pinned state if cache references exist to them. This is done transparently to the user whenever a value is retrieved from cache and is bound to the lifetime of the returned CacheValueRef. This way we can ensure that a reference is always valid and may never be evicted from cache when it is required at some point in the stack.

The IO write backs fetch the desired references from cache and apply compression and checksumming to the data. Afterwards, this cache reference is dropped and we hold a buffer representing our changed data. This buffer is then given to a closure in the IO queue in the following style:

fn begin_write(&self, data: Buf, offset: DiskOffset) -> Result<(), VdevError> {
    let inner = self.inner.clone();

    let (enqueue_done, wait_for_enqueue) = futures::channel::oneshot::channel();
    let write = self.inner.pool.spawn_with_handle(async move {
        wait_for_enqueue.await.unwrap();

        let res = inner
            .by_offset(offset)
            .write(data, offset.block_offset())
            .await;

        inner.write_back_queue.mark_completed(&offset).await;
        res
    })?;

    let ret = self.inner.write_back_queue.enqueue(offset, Box::pin(write));

    // Sending fails if receiver is dropped at this point,
    // which means the future
    enqueue_done
        .send(())
        .expect("Couldn't unlock enqueued write task");

    ret
}

After the operation has completed, successfully or not, the data value runs out of scope and is dropped. The reader might recognize a problem with this, if no reference to the cache exists and the buffer is dropped we lose the state of whichever data was stored there. Briefly, we notice an error with the storage device but may never recover or reenqueue the write operation as the state of the data is now corrupted.

Preemptive node state movement

Another mechanic to prevent early eviction of non-persistent data is the ObjectKey variant Modifiied and InWriteback. The main problem is here that the movement between node states does not await the results of the IO queue but moves after the operation has been committed to the IO queue the nodes immediately, either to evict them directly from cache or to the third ObjectKey variant Unmodified which allows unrestricted evictions.

let was_present;
{
    let mut cache = self.cache.write();
    // We can safely ignore pins.
    // If it's pinned, it must be a readonly request.
    was_present = if evict {
        cache.force_remove(&ObjectKey::InWriteback(mid), object_size)
    } else {
        cache.force_change_key(
            &ObjectKey::InWriteback(mid),
            ObjectKey::Unmodified {
                offset: obj_ptr.offset,
                generation: obj_ptr.generation,
            },
        )
    };
    if was_present {
        debug!("Inserted {mid:?} into written_back");
        self.written_back.lock().insert(mid, obj_ptr.clone());
    }
}

Available Avenues

There are some possibilities of improvement in terms of data safety revealed in this process. We might either keep data pinned until the IO operation has successfully finished or persist the compressed buffer to try writing anew at a later point. This carries some implications on the tree structure and interlinking between nodes to avoid loss of coherence between them.

@fia0 fia0 added the bug Something isn't working label Nov 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working T3.3 - Robustness
Projects
None yet
Development

No branches or pull requests

1 participant