
[Saffron] Add update function #3004

Closed
wants to merge 13 commits into from

Conversation

@martyall (Contributor) commented Feb 5, 2025

Summary

Users need to be able to update data held by the storage provider. Updates are submitted as "diff" objects rather than by re-posting the raw updated data. This PR implements that feature.

Changes

  • add a Diff type and a utility function for computing a Diff over binary data (sketched below).
  • add an update method on the FieldBlob type, with tests for random updates on random data.
  • add two CLI commands:
    1. calculate-diff: used to compute the Diff object from the old and new files 6eeb13e
    2. update: asks the storage provider to update the given data using the Diff object produced by (1) 61bb85f
  • update the e2e test script to perform a diff operation and check that the data was updated correctly. 2ba1a3f
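
For intuition, the core of the diff computation can be sketched as follows; this is a minimal illustration assuming the data is already chunked into field elements, and the function name and layout are illustrative rather than the PR's actual API:

use std::collections::HashMap;

use ark_ff::Field;

// Record (index, new - old) only at the positions that changed, so an
// update can be posted as a small sparse object instead of the full data.
fn sparse_diff<F: Field>(old: &[F], new: &[F]) -> HashMap<usize, F> {
    old.iter()
        .zip(new.iter())
        .enumerate()
        .filter(|(_, (o, n))| o != n)
        .map(|(i, (o, n))| (i, *n - *o))
        .collect()
}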

@martyall marked this pull request as ready for review February 5, 2025 18:39

codecov bot commented Feb 5, 2025

Codecov Report

Attention: Patch coverage is 44.53441% with 137 lines in your changes missing coverage. Please review.

Project coverage is 76.72%. Comparing base (27404af) to head (18371d0).
Report is 3 commits behind head on master.

Files with missing lines   Patch %   Lines
saffron/src/main.rs          0.00%   90 Missing ⚠️
saffron/src/utils.rs        29.41%   36 Missing ⚠️
saffron/src/cli.rs           0.00%   10 Missing ⚠️
saffron/src/blob.rs         98.00%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3004      +/-   ##
==========================================
- Coverage   76.80%   76.72%   -0.08%     
==========================================
  Files         261      261              
  Lines       61882    62079     +197     
==========================================
+ Hits        47530    47633     +103     
- Misses      14352    14446      +94     


fi
}

perturb_bytes() {
Member

Are you sure this works?

Contributor Author

wow, no it wasn't. I have replaced it with a Python script instead, because Perl is like hieroglyphics to me. But actually this makes me realize that the user should keep all the commitments on their end, not just the folded commitment with powers of alpha. As of right now, they cannot update them without recomputing, which is dumb.

ENCODED_FILE="${INPUT_FILE%.*}.bin"
DECODED_FILE="${INPUT_FILE%.*}-decoded${INPUT_FILE##*.}"
PERTURBED_FILE="${INPUT_FILE%.*}_perturbed${INPUT_FILE##*.}"
ENCODED_DIFF_FILE="${ENCODED_FILE%.*}_diff.bin"
DECODED_PERTURBED_FILE="${PERTURBED_FILE}-decoded${INPUT_FILE##*.}"
Member

I guess you want to remove the extension? i.e. ${PERTURBED_FILE%.*}

DECODED_PERTURBED_FILE="${PERTURBED_FILE%.*}-decoded${PERTURBED_FILE##*.}"

Contributor Author

OK, all the file names should be fixed now.

ENCODED_FILE="${INPUT_FILE%.*}.bin"
DECODED_FILE="${INPUT_FILE%.*}-decoded${INPUT_FILE##*.}"
PERTURBED_FILE="${INPUT_FILE%.*}_perturbed${INPUT_FILE##*.}"
Member

Nit: unify the usage of "-" and "_".

perturb_bytes "$INPUT_FILE" "$PERTURBED_FILE" 0.1

echo "Calculating diff for upated $INPUT_FILE (stored updated data in $PERTURBED_FILE)"
cargo run --release --bin saffron calculate-diff --old "$INPUT_FILE" --new "$PERTURBED_FILE" -o "$ENCODED_DIFF_FILE" $SRS_ARG
Member

Suggested change
cargo run --release --bin saffron calculate-diff --old "$INPUT_FILE" --new "$PERTURBED_FILE" -o "$ENCODED_DIFF_FILE" $SRS_ARG
cargo run --release --bin saffron calculate-diff --old "$INPUT_FILE" --new "$PERTURBED_FILE" -o "$ENCODED_DIFF_FILE" "$SRS_ARG"

(and the others below, to make shellcheck happy)

.map(|i| {
d.get(&i)
.copied()
.unwrap_or(<G as AffineRepr>::ScalarField::zero())
Member

Suggested change
.unwrap_or(<G as AffineRepr>::ScalarField::zero())
.unwrap_or(G::ScalarField::zero())

And you can then remove the use ark_ec::AffineRepr; import.

#[serde(bound = "F: CanonicalDeserialize + CanonicalSerialize")]
pub struct Diff<F> {
#[serde_as(as = "Vec<HashMap<_, o1_utils::serialization::SerdeAs>>")]
pub evaluation_diffs: Vec<HashMap<usize, F>>,
Member

Note that a vector is more efficient if you don't need to access by the modified index.

Contributor Author

I'm assuming that updates can be sparse enough that using a full vector here would be overkill?

Member

You can use a vector of type Vec<(idx, F)>, or a linear memory layout where idx and F are inlined (tuples are not, I think).
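
For reference, a sketch of that layout with illustrative names (the PR's struct uses Vec<HashMap<usize, F>> instead):

// Sparse per-chunk diffs in a flat, contiguous layout: entries are stored
// inline and in index order, there is no hashing, and iteration is cheap;
// the trade-off is O(n) lookup by index.
pub struct SparseDiff<F> {
    pub evaluation_diffs: Vec<Vec<(usize, F)>>,
}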

#[serde_as]
#[derive(Clone, Debug, Serialize, Deserialize, PartialEq)]
#[serde(bound = "F: CanonicalDeserialize + CanonicalSerialize")]
pub struct Diff<F> {
Member

I guess we can restrict it to PrimeField directly.
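
A sketch of the tightened bound (serde attributes elided; in arkworks, PrimeField already implies CanonicalSerialize and CanonicalDeserialize, so the explicit serde bound becomes redundant):

use std::collections::HashMap;

use ark_ff::PrimeField;

pub struct Diff<F: PrimeField> {
    pub evaluation_diffs: Vec<HashMap<usize, F>>,
}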

.par_iter()
.zip(old_elems)
.map(|(n, o)| {
n.iter()
Member

If you want to optimize further by keeping the hashmap, you can also parallelize here.

Contributor Author (@martyall, Feb 6, 2025)

I'm not a rayon pro, but I have a bad feeling about using par_iter on something that can be really long (thousands of elements) where the per-item work is trivial (here, subtracting field elements), and on top of that, nesting multiple par_iter calls.

I don't have a problem using par_iter on the chunks of size domain_size, since even a 500MB blob is only about 260 chunks (i.e. polynomials) and the work per chunk can be substantial.

If you know a reason I shouldn't worry about this, can you post a doc link? Otherwise I would want to test and time it before agreeing.
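
A sketch of the chunk-level parallelism described above, with illustrative names; the parallelism is across the domain-sized chunks only, and the trivial per-element work stays sequential:

use ark_ff::Field;
use rayon::prelude::*;

// Apply sparse diffs chunk by chunk: few chunks, each with substantial
// work, so rayon's scheduling overhead is amortized.
fn apply_diffs<F: Field + Send + Sync>(chunks: &mut [Vec<F>], diffs: &[Vec<(usize, F)>]) {
    chunks
        .par_iter_mut()
        .zip(diffs.par_iter())
        .for_each(|(chunk, diff)| {
            for (i, delta) in diff {
                chunk[*i] += *delta; // trivial per-element work: keep it serial
            }
        });
}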


fn random_perturbation(threshold: f64, data: &[u8]) -> Vec<u8> {
let mut rng = rand::thread_rng();
data.iter()
Member

par_iter here as well, I guess.

}))
{
// start with some random user data
let mut xs_blob = FieldBlob::<Vesta>::encode::<_, DefaultFqSponge<VestaParameters, PlonkSpongeConstantsKimchi>>(&*SRS, *DOMAIN, &xs);
Member

I guess you can also add a type alias for DefaultFqSponge<VestaParameters, PlonkSpongeConstantsKimchi> at the top to make this cleaner.
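
Something like the following, mirroring the snippet above (the alias name is illustrative, and imports are as in the surrounding test):

type VestaFqSponge = DefaultFqSponge<VestaParameters, PlonkSpongeConstantsKimchi>;

let mut xs_blob = FieldBlob::<Vesta>::encode::<_, VestaFqSponge>(&*SRS, *DOMAIN, &xs);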

@@ -11,8 +11,34 @@ SRS_ARG=""
if [ $# -eq 2 ]; then
SRS_ARG="--srs-filepath $2"
fi

ENCODED_FILE="${INPUT_FILE%.*}.bin"
DECODED_FILE="${INPUT_FILE%.*}-decoded${INPUT_FILE##*.}"
Member

Also, this doesn't handle a missing dot. And what about files without an extension at all?

&srs, &domain, diff,
);
args.assert_commitment
.into_iter()
Member

The type is Option<T>. I wouldn't use into_iter() here, even if it works.

Contributor Author

Yeah, this is dumb; it's a Haskell pattern, but I forgot about if let in Rust.
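
The if let version would read something like this (names taken from the snippet above; computed_commitment is illustrative):

// Only check when the user actually supplied an expected commitment.
if let Some(expected) = args.assert_commitment {
    assert_eq!(computed_commitment, expected, "commitment mismatch");
}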

domain: &Radix2EvaluationDomain<G::ScalarField>,
diff: Diff<G::ScalarField>,
) {
let updates: Vec<(usize, PolyComm<G>, DensePolynomial<G::ScalarField>)> = diff
Member

That's not what we want to do; it is not efficient. You only want to use the Lagrange polynomials.

Contributor Author (@martyall, Feb 6, 2025)

Yeah, this is what I figured as well, but I saw while working on this that it isn't part of the SRS (just the commitments). I didn't see another type in the codebase that manages this cache; am I missing something?

If we need to make something, this update needs the Lagrange polynomials in the monomial basis, since this is what the data is stored in. If it's an expensive one-time setup (seems like it to me), can I suggest leaving a TODO and a follow-up PR which:

  • creates a type that manages the cache of Lagrange polynomials for a given domain
  • provides (de)serialize instances
  • adds a CLI arg to load the cache from a file, applied where necessary
  • adds a script that generates and serializes it, cached for CI like we do with the SRS
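
For reference, a sketch of the Lagrange-basis commitment update the reviewer is pointing at, assuming a precomputed cache of commitments [L_i] to the Lagrange polynomials (exactly what the follow-up PR described above would provide; all names are illustrative). Because the commitment is additively homomorphic, for a sparse diff {(i, δ_i)} we have com(p_new) = com(p_old) + Σ_i δ_i · [L_i], with no FFTs and no recommitment of the full polynomial:

use ark_ec::{AffineRepr, CurveGroup};

// Update a commitment from a sparse evaluation diff using precomputed
// commitments to the Lagrange polynomials of the fixed domain.
fn update_commitment<G: AffineRepr>(
    old_commitment: G,
    lagrange_commitments: &[G],       // [L_i], one per evaluation point
    diff: &[(usize, G::ScalarField)], // (index, delta) pairs
) -> G {
    let mut acc = old_commitment.into_group();
    for (i, delta) in diff {
        acc += lagrange_commitments[*i].into_group() * *delta;
    }
    acc.into_affine()
}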

domain: &Radix2EvaluationDomain<G::ScalarField>,
diff: Diff<G::ScalarField>,
) {
let updates: Vec<(usize, PolyComm<G>, DensePolynomial<G::ScalarField>)> = diff
Member

Let's also add some timers, activated in debug mode. We want to know how fast it is.
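
A sketch of debug-only timing with std::time::Instant (the surrounding call is hypothetical):

// Time the update, but only report it in debug builds.
let start = std::time::Instant::now();
let updates = compute_updates(&srs, &domain, diff); // hypothetical call
if cfg!(debug_assertions) {
    eprintln!("diff update took {:?}", start.elapsed());
}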

@dannywillems (Member) left a comment

See my comments.

@martyall (Contributor Author) commented Feb 6, 2025

@dannywillems

Thanks for the thorough review! Your comments made me realize I need to do a little work re: managing commitments on the user and server side. I would like to put this back in draft for now and submit that work as a separate PR before continuing here.

In the meantime, can you address the questions about rayon and the Lagrange polynomials?

@martyall marked this pull request as draft February 7, 2025 03:48
@martyall (Contributor Author) commented Feb 7, 2025

@dannywillems I managed to correct my errors and get this working, and I am currently breaking it up into smaller pieces, starting with #3005 and #3006. Going to close this, as review is no longer needed. I will cherry-pick the scripts and some of the CLI stuff in further follow-ups.
