
Checkoff 2 #18

Open · sheikhshack opened this issue Nov 24, 2020 · 2 comments
Labels: question (Further information is requested)
Milestone: Checkoff 2

sheikhshack (Owner) commented Nov 24, 2020

Hey guys, you can comment whatever questions you have here to be asked to prof. Thanks!

sheikhshack (Owner, Author) commented Nov 25, 2020

Production

  1. Do we need to use Elastic IPs? Can we use placement groups to increase throughput?
  2. Are we graded on the speed of deployment, or is a deployment script that works sufficient?
  3. For 'take in credentials as input', do we assume the user supplies credentials via ~/.aws/credentials, or must we let the user specify each argument via the CLI? (See the sketch after this list.)
  4. For teardown, is terminating the instances sufficient?
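
A minimal sketch, assuming a Python/boto3 deployment script, of taking credentials either from ~/.aws/credentials or as CLI arguments, plus teardown by terminating instances. The AMI ID, key pair name, and region are hypothetical placeholders, not part of the actual checkoff spec:

```python
import argparse

import boto3


def make_session(args: argparse.Namespace) -> boto3.Session:
    """Build a boto3 session from CLI args, else fall back to ~/.aws/credentials."""
    if args.access_key and args.secret_key:
        return boto3.Session(
            aws_access_key_id=args.access_key,
            aws_secret_access_key=args.secret_key,
            region_name=args.region,
        )
    # No explicit keys given: boto3's default chain reads ~/.aws/credentials.
    return boto3.Session(region_name=args.region)


def main() -> None:
    parser = argparse.ArgumentParser(description="Launch or tear down EC2 instances")
    parser.add_argument("action", choices=["launch", "teardown"])
    parser.add_argument("--access-key")
    parser.add_argument("--secret-key")
    parser.add_argument("--region", default="ap-southeast-1")
    args = parser.parse_args()

    ec2 = make_session(args).resource("ec2")
    if args.action == "launch":
        instances = ec2.create_instances(
            ImageId="ami-0123456789abcdef0",  # hypothetical AMI
            InstanceType="t2.micro",
            KeyName="my-keypair",             # hypothetical key pair
            MinCount=1,
            MaxCount=3,
        )
        print("Launched:", [i.id for i in instances])
    else:
        # Teardown: terminate every running instance in this region.
        running = ec2.instances.filter(
            Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
        )
        for inst in running:
            inst.terminate()
            print("Terminating:", inst.id)


if __name__ == "__main__":
    main()
```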

HDFS

  1. Can we use libraries like Flintrock, or others like those mentioned in this article? (Usage sketch below.)
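
A minimal sketch of driving Flintrock from the same Python deployment script. The cluster name, key pair, and AMI are hypothetical, and the exact flags should be confirmed against `flintrock launch --help` for the installed Flintrock version:

```python
import subprocess

# Launch an HDFS + Spark cluster on EC2 via Flintrock's CLI.
subprocess.run(
    [
        "flintrock", "launch", "checkoff2-cluster",  # hypothetical cluster name
        "--num-slaves", "2",
        "--install-hdfs",
        "--install-spark",
        "--ec2-region", "ap-southeast-1",
        "--ec2-instance-type", "t2.medium",
        "--ec2-key-name", "my-keypair",              # hypothetical key pair
        "--ec2-identity-file", "my-keypair.pem",
        "--ec2-ami", "ami-0123456789abcdef0",        # hypothetical AMI
        "--ec2-user", "ec2-user",
    ],
    check=True,
)

# Teardown is a single command as well.
subprocess.run(["flintrock", "destroy", "checkoff2-cluster", "--assume-yes"], check=True)
```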

Spark

  1. Can we just install Mongo on the NameNode and do a direct import via the mongo CLI?
  2. Or can we import MongoDB data via Spark? (See the connector sketch below.)
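
If we go the Spark route, a minimal sketch using the MongoDB Spark Connector. The connector version is an assumption (3.x here, and it must match the Spark/Scala build), and the host, database, and collection names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("mongo-import")
    # Fetch the connector at submit time; version must match the Spark/Scala build.
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1")
    .config("spark.mongodb.input.uri", "mongodb://namenode-host:27017/checkoff.reviews")
    .getOrCreate()
)

# "mongo" is the connector's registered data source name in the 3.x series.
df = spark.read.format("mongo").load()
df.printSchema()
```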

General

  1. Is there a performance metric or benchmark we should compare against to evaluate how good our script is?
  2. What is meant by code quality?

sheikhshack (Owner, Author) commented Nov 26, 2020

Prof Reply

Checkpoint 2 DB

Creds
  • Can be either

Elastic IP
  • Just continue as we were

Speed
  • Minimum requirement is functional
  • If you can do it fast, more points

Flintrock
  • You are allowed to use it!

Logs
  • Yep, this is what we want

Hadoop and Spark
  • Store the output of TF-IDF in a file or on the CLI
  • TF-IDF output might be very big, so you might want to store it somewhere (see the sketch after this reply)

Timing of performance
  • Setup of infra, and running processes
  • Prof will specify how many nodes he wants

Startup
  • Separate scripts for production and analytics
  • Give prof the option to start up the analytics when production is done
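
On the "store TF-IDF output in a file" point, a minimal sketch of an assumed pyspark.ml pipeline (not the prof's exact spec; all paths are hypothetical) that writes the vectors to HDFS instead of dumping them on the CLI:

```python
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tfidf").getOrCreate()

# One document per line; rename the default "value" column for the Tokenizer.
docs = spark.read.text("hdfs:///data/corpus.txt").withColumnRenamed("value", "text")

words = Tokenizer(inputCol="text", outputCol="words").transform(docs)
tf = HashingTF(inputCol="words", outputCol="tf").transform(words)
tfidf = IDF(inputCol="tf", outputCol="tfidf").fit(tf).transform(tf)

# The full TF-IDF matrix can be very large, so persist it on HDFS
# rather than printing it to the terminal.
tfidf.select("tfidf").write.mode("overwrite").parquet("hdfs:///output/tfidf")
```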
