Data Loss on IO errors #13

Open

fia0 opened this issue Nov 11, 2022 · 0 comments

Labels
bug Something isn't working T3.3 - Robustness

fia0 commented Nov 11, 2022

We can construct a situation in which we incur data loss in our data storage when operations in the IO queue fail.
The following diagram shows the abstracted case:

                                                          ┌──────────────┐                                  ┌────────────┐
                                                          │              │                                  │            │
                                         ┌───────────────►│   Success    ├─────────────────────────────────►│   Valid    │
                                         │                │              │                                  │            │
┌──────────────┐           ┌─────────────┴─┐              └──────────────┘                                  └────────────┘
│              │           │               │                                                                      ▲
│              │           │               │                                                                      │
│    Insert    ├──────────►│     Write     │                                       ┌──────────────┐               │
│              │           │               │                           ┌──────────►│              ├───────────────┘
│              │           │               │                           │           │   In-cache   │
└──────────────┘           └─────────────┬─┘              ┌────────────┴─┐         │              │
                                         │                │              │         └──────────────┘
                                         └───────────────►│   Failure    │
                                                          │              │         ┌──────────────┐        ┌───────────────┐
                                                          └────────────┬─┘         │              │        │               │
                                                                       │           │   Evicted    ├───────►│    Invalid    │
                                                                       └──────────►│              │        │               │
                                                                                   └──────────────┘        └───────────────┘

The main problem lies in the state of the cache after an IO operation has finished erroneously. If the value is still present, all is well and we can simply try to write the data again. The problem is the eviction of data, which can happen after an operation has failed. This can occur due to two factors:

  1. The unpinning of data in the cache.
  2. The preemptive movement of nodes from the InWriteback state to Unmodified.

We briefly explain and discuss both below.

Unpinning of data

Cache entries are held in a special Pinned state while cache references to them exist. This happens transparently to the user whenever a value is retrieved from the cache and is bound to the lifetime of the returned CacheValueRef. This way we ensure that a reference is always valid and that an entry is never evicted from the cache while it is still required somewhere up the stack.
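As a rough illustration of this mechanism (the types below are simplified stand-ins, not the actual cache implementation), pinning can be modelled as a reference count that the returned guard decrements again when it is dropped:

// Simplified sketch of the pinning idea, not the real haura types:
// a cache entry carries a pin count, and the returned reference
// releases it on drop, so the entry cannot be evicted while any
// CacheValueRef is alive.
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

struct Entry {
    pins: AtomicUsize,
    data: Vec<u8>,
}

struct CacheValueRef {
    entry: Arc<Entry>,
}

impl CacheValueRef {
    fn new(entry: Arc<Entry>) -> Self {
        // Retrieving a value transparently pins it.
        entry.pins.fetch_add(1, Ordering::SeqCst);
        CacheValueRef { entry }
    }

    fn data(&self) -> &[u8] {
        &self.entry.data
    }
}

impl Drop for CacheValueRef {
    fn drop(&mut self) {
        // Once the last reference is gone, the entry becomes evictable again.
        self.entry.pins.fetch_sub(1, Ordering::SeqCst);
    }
}

fn is_evictable(entry: &Entry) -> bool {
    entry.pins.load(Ordering::SeqCst) == 0
}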

The IO write-back path fetches the desired references from the cache and applies compression and checksumming to the data. Afterwards, the cache reference is dropped and we are left holding a buffer representing our changed data. This buffer is then handed to a closure in the IO queue in the following style:

fn begin_write(&self, data: Buf, offset: DiskOffset) -> Result<(), VdevError> {
    let inner = self.inner.clone();

    // Delay the spawned write until it has been registered in the
    // write-back queue below.
    let (enqueue_done, wait_for_enqueue) = futures::channel::oneshot::channel();
    let write = self.inner.pool.spawn_with_handle(async move {
        wait_for_enqueue.await.unwrap();

        let res = inner
            .by_offset(offset)
            .write(data, offset.block_offset())
            .await;

        // Mark the offset as completed regardless of whether the write
        // succeeded.
        inner.write_back_queue.mark_completed(&offset).await;
        res
    })?;

    let ret = self.inner.write_back_queue.enqueue(offset, Box::pin(write));

    // Sending fails if the receiver is dropped at this point,
    // which means the spawned write future has already been dropped
    // and can never run.
    enqueue_done
        .send(())
        .expect("Couldn't unlock enqueued write task");

    ret
}

After the operation has completed, successfully or not, the data value runs out of scope and is dropped. The reader might recognize the problem with this: if no reference to the cache entry exists and the buffer is dropped, we lose the state of whatever data was stored there. In short, we notice an error from the storage device, but we may never be able to recover or re-enqueue the write operation because the in-memory state of the data is now lost.
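A minimal sketch of the ownership problem, using stand-in Buf, VdevError, and write definitions rather than the real vdev API: once the buffer has been moved into the write, a failure only hands back the error, not the data.

// Stand-ins for this sketch, not the actual types.
type Buf = Vec<u8>;

#[derive(Debug)]
struct VdevError;

async fn write(data: Buf, _block_offset: u64) -> Result<(), VdevError> {
    // Takes ownership of `data`; on error only the error comes back.
    drop(data);
    Err(VdevError)
}

async fn write_back(data: Buf, offset: u64) {
    match write(data, offset).await {
        Ok(()) => { /* the node may safely move on */ }
        Err(e) => {
            // `data` was moved into `write` and has already been dropped.
            // If the cache entry was unpinned and evicted in the meantime,
            // there is no copy left to retry the write with.
            eprintln!("write back failed: {e:?}");
        }
    }
}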

Preemptive node state movement

Another mechanism to prevent early eviction of non-persistent data are the ObjectKey variants Modified and InWriteback. The main problem here is that the movement between node states does not await the results of the IO queue; as soon as the operation has been committed to the IO queue, the nodes are moved immediately, either evicted directly from the cache or moved to the third ObjectKey variant, Unmodified, which allows unrestricted eviction.

// The node is moved out of InWriteback as soon as the write has been
// enqueued, without awaiting its result.
let was_present;
{
    let mut cache = self.cache.write();
    // We can safely ignore pins.
    // If it's pinned, it must be a readonly request.
    was_present = if evict {
        cache.force_remove(&ObjectKey::InWriteback(mid), object_size)
    } else {
        cache.force_change_key(
            &ObjectKey::InWriteback(mid),
            ObjectKey::Unmodified {
                offset: obj_ptr.offset,
                generation: obj_ptr.generation,
            },
        )
    };
    if was_present {
        debug!("Inserted {mid:?} into written_back");
        self.written_back.lock().insert(mid, obj_ptr.clone());
    }
}
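For reference, the three key states referred to above can be pictured roughly as follows; the placeholder types are assumptions for this sketch, and the actual definition in the codebase may differ. Only Unmodified entries are safe to evict freely, because a persistent copy is assumed to exist at the stored offset.

// Simplified sketch of the key states, not the actual definition.
enum ObjectKey {
    Modified(ModificationId),
    InWriteback(ModificationId),
    Unmodified {
        offset: DiskOffset,
        generation: Generation,
    },
}

// Placeholder types for this sketch.
struct ModificationId(u64);
struct DiskOffset(u64);
struct Generation(u64);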

Available Avenues

This process reveals some possibilities for improving data safety. We could either keep data pinned until the IO operation has finished successfully, or persist the compressed buffer so that the write can be retried at a later point. Both options carry implications for the tree structure and the interlinking between nodes, which must be handled to avoid a loss of coherence between them.
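A minimal sketch of the second avenue, using hypothetical Buf, VdevError, and raw_write stand-ins rather than the current API: the write hands the buffer back on failure, so the caller can re-insert it into the cache or re-enqueue the operation later.

// Stand-ins for this sketch, not the actual vdev API.
type Buf = Vec<u8>;

#[derive(Debug)]
struct VdevError;

async fn raw_write(_data: &[u8], _block_offset: u64) -> Result<(), VdevError> {
    // Stand-in for the actual device write.
    Ok(())
}

/// Write `data`; on error, return ownership of the buffer together with
/// the error so the operation can be retried.
async fn write_keep_on_error(data: Buf, block_offset: u64) -> Result<(), (Buf, VdevError)> {
    match raw_write(&data, block_offset).await {
        Ok(()) => Ok(()),
        Err(e) => Err((data, e)),
    }
}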
