
Upload recording feature #787

Open · wants to merge 13 commits from feature/upload-recording into main
Conversation

@KIRA009 (Contributor) commented Jun 21, 2024

What kind of change does this PR introduce?
This PR addresses #724

Summary
This PR adds a script that deploys an app to AWS Lambda, which is then used by the OpenAdapt app to upload zip files of recordings from users.

Checklist

  • My code follows the style guidelines of OpenAdapt
  • I have performed a self-review of my code
  • If applicable, I have added tests to prove my fix is functional/effective
  • I have linted my code locally prior to submission
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (e.g. README.md, requirements.txt)
  • New and existing unit tests pass locally with my changes

How can your code be run and tested?
From the project root, run `python -m scripts.recording_uploader.deploy` (ensure that you have the necessary AWS creds configured). Once the command completes, note the API URL in the output and paste it into the `RECORDING_UPLOAD_URL` variable in `config.py`. Then start the app, navigate to a recording detail page, and click the "Upload recording" button. Check the S3 bucket to confirm that the recording has been uploaded (in the form of a zip file).

Other information

@abrichr (Member) left a comment:

Thank you for putting this together @KIRA009 ! Just left a few small comments, happy to chat about any of it if you like! 🙏 😄

with open(file_path, "rb") as file:
    files = {"file": (filename, file)}
    resp = requests.put(upload_url, files=files)
    resp.raise_for_status()
Member:

What do you think about returning the response here?
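For illustration, a minimal sketch of this suggestion: the helper returns the response so the caller can inspect status or headers. The HTTP `put` callable is injected so the sketch runs without network access (in the app it would be `requests.put`); the function name is illustrative, not the PR's exact code.

```python
import os


def upload_file(upload_url: str, file_path: str, put) -> object:
    """Upload a file via a presigned URL and return the HTTP response.

    `put` is the HTTP PUT callable (e.g. requests.put), injected so this
    sketch stays dependency-free for demonstration purposes.
    """
    filename = os.path.basename(file_path)
    with open(file_path, "rb") as file:
        files = {"file": (filename, file)}
        resp = put(upload_url, files=files)
    resp.raise_for_status()
    return resp  # returned instead of discarded, per the review suggestion
```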

@@ -0,0 +1,244 @@

# Created by https://www.gitignore.io/api/osx,linux,python,windows,pycharm,visualstudiocode
Member:

Interesting, can you please clarify why this is necessary / preferable to a minimal .gitignore? What did you need to ignore here? Why not keep it in the root .gitignore?

Contributor Author:

This is autogenerated from the template. You're right though, this isn't needed.


## Deploy the application

There is a `deploy` script that creates the s3 bucket and deploys the application using the SAM CLI (included as part of the dev dependencies of this project). The bucket name is hardcoded in the script. The SAM CLI is set up to run in `guided` mode, which will prompt the user every time befor deploying, in case the user wants to change the default values.
Member:

Typo: befor -> before

Member:

Can we make the bucket name configurable, with the aws region etc?
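One way to make these configurable is CLI flags on the deploy script, sketched here with argparse (flag names and defaults are assumptions, not from the PR):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """CLI flags so the bucket name and region are no longer hardcoded."""
    parser = argparse.ArgumentParser(description="Deploy the recording uploader.")
    parser.add_argument("--bucket", default="openadapt", help="S3 bucket name")
    parser.add_argument("--region-name", default="us-east-1", help="AWS region")
    parser.add_argument("--guided", action="store_true", help="Run SAM in guided mode")
    return parser
```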

@@ -0,0 +1 @@
"""Init file for the recording_uploader package."""
Member:

If this is a package, should we move it outside of scripts?

Contributor Author:

It should be named as a module instead 😅

def get_presigned_url() -> dict:
"""Generate a presigned URL for uploading a recording to S3."""
bucket = "openadapt"
region_name = "us-east-1"
Member:

What do you think about putting these in NAMED_CONSTANTS at the top of this file, and passing them in as default keyword arguments to get_presigned_url?
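A sketch of that suggestion, with the boto3 call elided so it runs without AWS credentials (constant names are illustrative):

```python
# Module-level named constants, passed in as default keyword arguments.
DEFAULT_BUCKET = "openadapt"
DEFAULT_REGION_NAME = "us-east-1"


def get_presigned_url(
    bucket: str = DEFAULT_BUCKET,
    region_name: str = DEFAULT_REGION_NAME,
) -> dict:
    """Generate a presigned URL for uploading a recording to S3."""
    endpoint_url = f"https://s3.{region_name}.amazonaws.com"
    # The real function would create a boto3 client here and call
    # client.generate_presigned_url(...); elided in this sketch.
    return {"bucket": bucket, "endpoint_url": endpoint_url}
```

Callers keep the no-argument call site, while tests or other environments can override either value.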

if guided:
    commands.append("--guided")
subprocess.run(commands, cwd=CURRENT_DIR, check=True)
print("Lambda function deployed successfully.")
Member:

What do you think about using logger here?
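For example, with stdlib logging (the project itself uses loguru, whose `logger` offers an equivalent interface via `from loguru import logger`; the function name is illustrative):

```python
import logging

logger = logging.getLogger(__name__)


def report_deployed(api_url: str) -> None:
    """Log instead of print, so output respects the app's log configuration."""
    logger.info("Lambda function deployed successfully. API URL: %s", api_url)
```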

Bucket=bucket,
)
except (s3.exceptions.BucketAlreadyExists, s3.exceptions.BucketAlreadyOwnedByYou):
proceed = input(f"Bucket '{bucket}' already exists. Proceed? [y/N] ")
Member:

Is this necessary? What happens if the user proceeds if the bucket has already been created?

@KIRA009 (Contributor Author) commented Jun 24, 2024:

I thought this would be good to have in case the user doesn't want to overwrite existing buckets, but I realise that is quite a niche case

Member:

Will this remove existing data?

Contributor Author:

It's unlikely that it will, given we are generating random filenames, so yes, maybe we can remove this part.

Contributor Author:

I have removed this

"Bucket": bucket,
"Key": key,
},
ExpiresIn=3600,
@abrichr (Member) commented Jun 22, 2024:

Can you please make this a named constant, e.g. ONE_HOUR_IN_SECONDS = 60 * 60?
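The suggested constant, for concreteness:

```python
# Presigned URL expiry as a named constant instead of a magic number.
ONE_HOUR_IN_SECONDS = 60 * 60

# ...then pass ExpiresIn=ONE_HOUR_IN_SECONDS to generate_presigned_url().
```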

region_name=region_name,
endpoint_url=f"https://s3.{region_name}.amazonaws.com",
)
key = f"recordings/{uuid4()}.zip"
Member:

What do you think about adding the user's unique id to the path, e.g. recordings/{user_id}/{upload_id}.zip?
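The suggested key scheme could look like this (the helper name is illustrative, not from the PR):

```python
from uuid import uuid4


def build_recording_key(user_id: str) -> str:
    """Namespace uploads per user: recordings/{user_id}/{upload_id}.zip."""
    upload_id = uuid4()
    return f"recordings/{user_id}/{upload_id}.zip"
```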

Contributor Author:

Makes sense

@abrichr (Member) commented Jun 22, 2024

Once we start scaling we should consider supporting B2, e.g. app.py:

"""Lambda-like function for generating a presigned URL for uploading a recording to B2."""

from typing import Any
from uuid import uuid4
import json
from b2sdk.v2 import B2Api, InMemoryAccountInfo

def get_b2_client() -> B2Api:
    """Create and return a B2 client."""
    info = InMemoryAccountInfo()
    b2_api = B2Api(info)
    b2_api.authorize_account("production", "applicationKeyId", "applicationKey")
    return b2_api

def lambda_handler(*args: Any, **kwargs: Any) -> dict:
    """Main entry point for the function."""
    return {
        "statusCode": 200,
        "body": json.dumps(get_presigned_url()),
    }

def get_presigned_url() -> dict:
    """Generate a presigned URL for uploading a recording to B2."""
    bucket_name = "openadapt"
    b2_api = get_b2_client()
    bucket = b2_api.get_bucket_by_name(bucket_name)
    file_name = f"recordings/{uuid4()}.zip"
    file_info = {'how': 'good-file'}

    presigned_url = bucket.get_upload_url(file_name, file_info=file_info)
    
    return {"url": presigned_url['upload_url'], "upload_auth_token": presigned_url['authorization_token']}

For now let's stick with s3.

@KIRA009 (Contributor Author) commented Jul 1, 2024

Before this is merged, we need to set up the upload URL and update it in `config.py` - `RECORDING_UPLOAD_URL`

region_name (str): The AWS region to deploy the Lambda function to.
guided (bool): Whether to use the guided SAM deployment.
"""
s3 = boto3.client(
Member:

@KIRA009 can you please modify this to use the credentials specified in openadapt.config? Specifically we should add AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY or similar.

Also, please add some documentation regarding the permissions required for this IAM user.

Contributor Author:

Do we want to add these keys to the config? They won't be used anywhere else in the project, and I think when you run the deploy script, if boto3 doesn't find appropriate keys in the place where it's looking, the user is notified of that.

For the access keys, I followed this without any explicit permissions - https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html#Using_CreateAccessKey. These keys are the ones that you need to run the deploy script, which then creates appropriate IAM users on its own. Once it's deployed, you can delete the original access keys if you want. Should I add this to the readme?

@@ -0,0 +1 @@
boto3==1.34.84
Member:

What do you think about moving all of this outside of /scripts, and into e.g. /admin or similar?

Contributor Author:

I think the uploader module has to remain inside the recording_uploader module. If you mean moving the recording_uploader module itself, we could, but I think scripts is also a good enough place for it to be in. Let me know.

Member:

Yes I meant to move recording_uploader into a new directory, /admin or similar.

If you want to self-host the app, you should run the following scripts

**recording_uploader**
- Ensure that you have valid AWS credentials added in your environment
Member:

What do you think about loading the AWS credentials from config.py?

@KIRA009 force-pushed the feature/upload-recording branch from bfffb05 to 3554bb4 on November 9, 2024, 15:46
    except Exception as exc:
        logger.exception(exc)

Thread(target=_inner).start()
Member:

Should we add some error handling / retrying / reporting around this?

Contributor Author:

Added retries
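A minimal retry wrapper of this kind might look like the following (a sketch; the name, signature, and fixed-delay policy are assumptions, not the PR's exact code):

```python
import time


def with_retries(func, attempts: int = 3, delay_seconds: float = 1.0):
    """Call func(), retrying on any exception and re-raising the last one."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: propagate the final error
            time.sleep(delay_seconds)
```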

file_path (str): The path to the file to upload.
body (dict): The body of the request.
"""
filename = os.path.basename(file_path)
Member:

Suggested change
filename = os.path.basename(file_path)
file_name = os.path.basename(file_path)

Member:

What do you think about logging this path?

Contributor Author:

Done

@abrichr (Member) left a comment:

Thank you for putting this together @KIRA009 !

What do you think about something like this:

class Recording(db.Base):
    __tablename__ = "recording"
...
    s3_url = sa.Column(sa.String, nullable=True)

And perhaps display this value in the Dashboard.

@abrichr (Member) commented Nov 10, 2024

What do you think about renaming the directory to simply uploader instead of recording_uploader?

@abrichr (Member) commented Nov 10, 2024

@KIRA009 what do you think about https://www.cloudflare.com/en-ca/developer-platform/products/r2/:

Forever Free
10 GB / month

# check if aws credentials are set
if os.getenv("AWS_ACCESS_KEY_ID") is None:
    raise ValueError("AWS_ACCESS_KEY_ID is not set")
if os.getenv("AWS_SECRET_ACCESS_KEY") is None:
Member:

Why not read this from config?

Contributor Author:

This script is not supposed to be part of the OpenAdapt app; it's an admin script that needs to be run by the owner (you) on a machine that has the relevant AWS creds in its environment. From the PR description:

From the project root, run python -m scripts.recording_uploader.deploy (ensure that you have the necessary aws creds configured). Once the command completes, note the api url in the output, and paste that onto the config.py's RECORDING_UPLOAD_URL variable

Because config.py is more closely related to settings of the app, I didn't think it'd be useful to add these there.

Member:

I think we want to read from config.py.

Contributor Author:

How would you suggest a user override these settings? The default values will be empty in config.py and config.defaults.json, so the only ways are to manually edit the config.json file in the data folder, or to expose it in the dashboard settings page. A regular user won't need to edit this, and might get confused if it's in the dashboard settings.

region_name=region_name,
endpoint_url=f"https://s3.{region_name}.amazonaws.com",
)
bucket = "openadapt"
Member:

What do you think about defining this is config.py?

Contributor Author:

The same reason as above; plus, this is hardcoded because this script will only be run once in a while (only when the lambda function is changed), and ideally we won't be changing bucket names between runs.
