Trust - User has to trust that the host of the PIS will not use their SSO credentials for nefarious purposes #21

Open
carrollgt91 opened this issue Mar 24, 2020 · 3 comments
Labels
open question Type: Discussion 🔈 When further discussion and debate is required

Comments

@carrollgt91
Contributor

In addition to being an attractive target for hackers, storing these SSO credentials presents an interesting trust problem from the perspective of more privacy-conscious users.

We are just committing to the user that we are not storing their information. We are not providing strong guarantees, cryptographic or otherwise, that we will not use this information for our own gain.

Especially for more powerful SSO integrations, such as bank accounts, it might be hard to convince folks to trust us.

@carrollgt91 carrollgt91 added Type: Discussion 🔈 When further discussion and debate is required open question labels Mar 24, 2020
@PlamenHristov
Contributor

PlamenHristov commented Apr 4, 2020

Solution:

I would suggest we keep everything encrypted. What this will ensure:

  1. Even if the identity server gets hacked, user data (which is already anonymised) will not be compromised.
  2. Even if we want to, we cannot do anything nefarious.

Implementation

Encryption

Let's put in place a facility for the user to generate a public/private key pair. Specifically, let's use BIP-39. This lets us generate a random seed phrase for the user, which we can combine with BIP-32 to derive the master public/private key pair (m/0'/0'/0'):

  • The user will encrypt the data client side before sending it to us.

  • If we integrate directly with a medical records provider, we can send the data to the user for encryption before storing it.
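To make the seed-phrase idea concrete, here is a minimal sketch of the derivation using only the Python standard library. It follows the published BIP-39 seed stretching (PBKDF2-HMAC-SHA512, 2048 rounds) and BIP-32 hardened child derivation, but it is an illustration under assumptions, not a vetted wallet implementation: the function names are hypothetical, mnemonic wordlist and checksum validation are omitted, only hardened path segments are supported, and public-key derivation (which needs elliptic-curve arithmetic) is left out.

```python
import hashlib
import hmac
import unicodedata
from typing import Tuple

# Order of the secp256k1 curve, which BIP-32 uses for key arithmetic.
SECP256K1_N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
HARDENED_OFFSET = 0x80000000  # indices >= 2**31 are "hardened" (written i')


def seed_from_phrase(phrase: str, passphrase: str = "") -> bytes:
    """BIP-39 step: stretch a seed phrase (plus optional passphrase) into 64 bytes."""
    password = unicodedata.normalize("NFKD", phrase).encode("utf-8")
    salt = unicodedata.normalize("NFKD", "mnemonic" + passphrase).encode("utf-8")
    return hashlib.pbkdf2_hmac("sha512", password, salt, 2048, dklen=64)


def master_key(seed: bytes) -> Tuple[int, bytes]:
    """BIP-32 step: split HMAC-SHA512 output into (private key, chain code)."""
    digest = hmac.new(b"Bitcoin seed", seed, hashlib.sha512).digest()
    return int.from_bytes(digest[:32], "big"), digest[32:]


def hardened_child(key: int, chain_code: bytes, index: int) -> Tuple[int, bytes]:
    """Derive hardened child `index'` from a parent private key and chain code."""
    data = (
        b"\x00"
        + key.to_bytes(32, "big")
        + (index + HARDENED_OFFSET).to_bytes(4, "big")
    )
    digest = hmac.new(chain_code, data, hashlib.sha512).digest()
    # The spec also rejects the (astronomically unlikely) invalid results;
    # that check is skipped in this sketch.
    child = (int.from_bytes(digest[:32], "big") + key) % SECP256K1_N
    return child, digest[32:]


def derive_private_key(seed: bytes, path: str) -> bytes:
    """Derive the 32-byte private key at a hardened-only path like m/0'/0'/0'."""
    parts = path.split("/")
    if parts[0] != "m":
        raise ValueError("path must start with 'm'")
    key, chain_code = master_key(seed)
    for part in parts[1:]:
        if not part.endswith("'"):
            raise ValueError("this sketch only supports hardened segments")
        key, chain_code = hardened_child(key, chain_code, int(part[:-1]))
    return key.to_bytes(32, "big")


# Hypothetical phrase for illustration only; a real one comes from the BIP-39 wordlist.
seed = seed_from_phrase("example seed phrase for illustration only", "optional password")
user_key = derive_private_key(seed, "m/0'/0'/0'")
```

Because everything is derived deterministically from the phrase, re-entering the same phrase (and password) on another device reproduces the same keys.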

Data consumer integration

From here we can either:

  • Provide the encrypted data (after the user's consent) so that data consumers (using OpenMined infrastructure) can learn on it.

  • Provide a way for the data consumer to decrypt the data, as follows: when a data consumer comes in, the user generates the next public/private key pair (m/0'/0'/1'), encrypts the data locally, sends the newly encrypted data to us, and sends the public/private key pair to the data consumer (at a URL which we verified during data consumer onboarding). From there, the data consumer can download and decrypt the data.
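The "next key pair per data consumer" step could be tracked on the client with a small registry like the sketch below. The class and method names are hypothetical, and the scheme (index 0 reserved for the user's own key, consumers numbered from 1 in order of arrival) is an assumption read off the m/0'/0'/0' and m/0'/0'/1' examples above.

```python
class ConsumerKeyPaths:
    """Assigns each data consumer its own hardened derivation path."""

    def __init__(self, prefix: str = "m/0'/0'") -> None:
        self.prefix = prefix
        self._index_by_consumer = {}  # consumer id -> derivation index

    def master_path(self) -> str:
        # Index 0 is reserved for the user's own (master) copy of the data.
        return f"{self.prefix}/0'"

    def path_for(self, consumer_id: str) -> str:
        # Reuse the existing index for a known consumer, otherwise take the next one.
        if consumer_id not in self._index_by_consumer:
            self._index_by_consumer[consumer_id] = len(self._index_by_consumer) + 1
        return f"{self.prefix}/{self._index_by_consumer[consumer_id]}'"


paths = ConsumerKeyPaths()
first = paths.path_for("consumer-a")   # hypothetical consumer ids
second = paths.path_for("consumer-b")
```

Only the mapping from consumer to index needs to be remembered; the keys themselves can always be re-derived from the seed phrase.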

Consequence:

  1. We'll have one encrypted copy (each with a different public/private key pair) for each data consumer per user. So if we have m data consumers and n users, (at some point in time) we may need to store m*n copies, plus one copy per user encrypted with that user's master (root) key.
  2. If the user knows the seed phrase (optionally combined with a password for extra security), the keys can be restored on whatever device the user is using.
  3. The user experience should be quite sleek and fairly intuitive.
  4. It should also solve this issue

@NiWaRe

NiWaRe commented Apr 6, 2020

@PlamenHristov Although I had a different approach in mind, your encryption approach also sounds good! :)
Picking up on your first option in the section "Data consumer integration" I thought about using the OpenMined PySyft and PyGrid libraries (Including for encryption, etc.)

Goal

As discussed on the Slack channel, also with @carrollgt91, I understood the goal of this team (this repo) to be building a server that acts as the middleman between the sensitive data of a user who wants to be automatically authenticated and a data consumer who wants an authentication or some data (e.g. another app which wants to train on some sensitive data).
So, based on the blog entry, I imagined the SSO credentials not being stored on the PIS but rather being part of the client-side data scraping directly on the user's device (also leveraging the possibility that the user is still signed in to different apps, as @carrollgt91 suggested in the #covid_mobile_data_collection channel), with the scraped data then sent to the PIS on demand.
The PIS (our work) should then:

  1. provide the SSI (Self-Sovereign Identity) team with the necessary data from service-endpoints (which then would be on the clients themselves as explained above) to be able to populate some DID documents or do whatever they need to do to issue the authenticating credentials.
  2. provide data to other COVID-apps which can then learn on the sensitive data, scraped from services or stored from device-sensors by the data-mining team.

Suggestion

If my description of the goal of this specific repo is correct, I would suggest using the Public or Private Grid Platform via the PyGrid library to make the exchange of sensitive data from service endpoints, or of training data, possible.

  • The second use-case of the server (sensitive (scraped) user data on the device) would simply use the encrypted FL which should already be implemented in PyGrid. Doing the tutorials, I only saw vanilla FL being implemented, but @hericlesme stated here that encrypted FL is also already implemented. So no extra encryption from our side would be needed here, as PyGrid already uses the Additive Secret Sharing technique to encrypt the data/models (I believe).
  • For the first use-case (providing data from service endpoints for DID credentials for the SSI team), I actually don't know if things could work the way I described in the first paragraph. I imagined that, instead of focusing mainly on hosting and training ML models securely with decentralized data, we could alter PyGrid such that either the gateway (or the user directly, being a worker in the grid) can send a pointer to the data on the client. Alternatively, instead of models, the user's data could be hosted in a decentralized manner (encrypted via Additive Secret Sharing) by other worker nodes (potentially dedicated to hosting). It would be updated on demand with newly scraped data from the user's worker node, and a pointer to this data would again be shared and made available through some API for the SSI team.
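For readers unfamiliar with Additive Secret Sharing, the sketch below shows the core idea in plain Python. It is not PyGrid's or PySyft's API; the function names and the modulus are assumptions chosen for illustration. A value is split into random shares that individually reveal nothing, and the parties can still add shared values without ever seeing the inputs.

```python
import secrets
from typing import List

# Hypothetical modulus for this sketch; secrets must lie in [0, MODULUS).
MODULUS = 2**61 - 1


def share(secret: int, n_parties: int) -> List[int]:
    """Split `secret` into n additive shares that sum to it modulo MODULUS."""
    if n_parties < 2:
        raise ValueError("need at least two parties")
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    # The last share is chosen so that all shares sum to the secret.
    shares.append((secret - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares: List[int]) -> int:
    """Recombine all shares; any strict subset looks uniformly random."""
    return sum(shares) % MODULUS


def add_shared(a: List[int], b: List[int]) -> List[int]:
    """Each party adds its own two shares locally; no secret is revealed."""
    return [(x + y) % MODULUS for x, y in zip(a, b)]


heart_rate = share(72, 3)       # hypothetical sensor readings
other_reading = share(65, 3)
total = reconstruct(add_shared(heart_rate, other_reading))
```

In this sketch each worker node would hold one entry of each share list, which is why hosting the data on several workers does not expose it to any single one of them.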

If I understood our goal correctly, in both scenarios no direct channel would need to be established between the user and the data consumer, because either the data consumer would train their model using PyGrid (the second use-case) or the data is only provided to the SSI team, which would then do the issuing, validation, etc. of credentials for authentication with the data consumer.

I may have too limited knowledge about the detailed working of PyGrid and the specific data needs of the SSI team, but potentially this could help us use much of the already existing code from other OpenMined projects.

@carrollgt91
Contributor Author

#21

@PlamenHristov Great thoughts here. I think some form of user-managed encryption scheme does solve a lot of the issues here, both this one and the security breach piece, which is great. Just to make sure I understand your proposal, it seems that

  • the user will have to manage these key pairs on their device (something we could implement via a client-side application)
  • the keys will be complex enough that they would need to transfer those keys from one device to another if they wanted to manage their data on more than one device
  • when a data consumer wants access to some data, they would need to do a key exchange with the user in order to get a key pair that will allow them to decrypt the data

Assuming those assumptions are correct...

One thing I really like about this proposal is how easy the UX is for the user to share data with data consumers when they're on the same device that has the key pair on it. It's not meaningfully different from an SSO handshake where the app you're signing into requests certain data from the sign-on provider, e.g. sign in with Facebook -> provide your name and profile photos.

However, there are some additional challenges we'd need to overcome with this strategy. I'm not as familiar with what we'll need to do to hook into the rest of OpenMined infrastructure (i.e. PySyft), so I'm not going to comment much on that piece, and instead I'll focus on the data consumer use case.

  1. The key exchange would need to be implemented in a way that allows the data consumer to use the data immediately. I think we'd want to supply client libraries that make this process very easy, similar to how there are tons of off-the-shelf client libraries for the OAuth and OpenID protocols. Ease of integration for the data consumer is really important, and if we have to ask them to implement custom decryption, I think that will reduce the number of applications willing to integrate. The more we can lean on existing libraries for this, the better - there's a lot to like about the BIP-based crypto you linked to, and there are a good number of libraries for it in different ecosystems. However, I think it's worth examining alternative encryption schemes to find the one that would be easiest for the data consumer to integrate with.

  2. We'll definitely need to have more robust client-side applications built to generate/manage these keys, as well as house the user's sensitive data. Here are a few things that we'd need:

  • Key invalidation (this would also need to be implemented server-side to allow for data to be re-encrypted under the new key)
  • Multi-Device syncing - this would be tricky to do in a secure way without some sort of peer-to-peer handshake or a "master password" concept akin to how password managers implement key sharing
  • Client-side data storage - in order for key invalidation to work, the client will need to store all of the user's sensitive data locally. Given that we're going to be obtaining much of this data via web-scraping, we already need to solve the problem of ensuring that the data stored client-side is verifiable, but this is doubly true with this approach. In a situation where the server is allowed to at least momentarily gain access to unencrypted data, it can compare the hash of that data to the hash that was generated during the initial collection of the data to ensure that the user hasn't tampered with the data in the meantime. There is likely a way to accomplish this with modern crypto when the server never sees the plaintext, but I am not aware of a solution to that problem off the top of my head.
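The hash comparison described in the last point (the case where the server does get momentary access to the unencrypted data) could look roughly like the sketch below. The function names and the record layout are hypothetical, and SHA-256 over canonical JSON is an assumed choice, not something the thread settled on.

```python
import hashlib
import hmac
import json


def fingerprint(record: dict) -> str:
    """Hash a record in a canonical form so key order does not change the digest."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_untampered(record: dict, stored_fingerprint: str) -> bool:
    """Compare against the digest stored at collection time, in constant time."""
    return hmac.compare_digest(fingerprint(record), stored_fingerprint)


# At collection time the server keeps only the digest of the scraped record.
collected = {"source": "example-provider", "test_result": "negative"}
stored = fingerprint(collected)

# Later, the record returned by the client is checked before it is passed on.
ok = is_untampered({"test_result": "negative", "source": "example-provider"}, stored)
tampered = is_untampered({"source": "example-provider", "test_result": "positive"}, stored)
```

This only covers the plaintext case; it does not answer the open question of verifying data that the server can never decrypt.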

3 participants