
Improve Docs #91

Open
pottekkat opened this issue Oct 4, 2020 · 46 comments

@pottekkat
Contributor

The docs of learningOrchestra need to be validated, tested, and improved by test users. This covers both the README and the docs page.

@pottekkat added the help wanted, good first issue, and hacktoberfest labels on Oct 4, 2020
@LaChapeliere
Contributor

Hi, I'm happy to help with the doc, because I had trouble understanding what the project was about when looking through the readme. Focusing on user doc (rather than dev doc), I'm proposing to go over the Readme with the view of a user who knows how to write data mining scripts but might not be familiar with infrastructure, cloud services, microservices...

@LaChapeliere
Contributor

Proposed outline:

  1. One-sentence summary
  2. Thumbnail and indicators
  3. Introduction: what is the project, who is it for, why should I use it?
  4. Table of Contents
  5. Quick-start
  6. Installation instructions
  7. Usage instructions
  8. About learningOrchestra

@LaChapeliere
Contributor

I'll start a branch and update here when I'm missing info

@pottekkat
Contributor Author

@LaChapeliere Thank you for contributing. Sure, you can make the changes to the README and the docs repo. You could start a draft PR so we can track the progress.

@LaChapeliere
Contributor

First question, probably for @riibeirogabriel
"learningOrchestra facilitates and streamlines iterative processes in a Data Science project pipeline " but "learningOrchestra is a distributed Machine Learning processing tool"
After reading your monograph, I think I understand why you used data science in one case and machine learning in another, but for user-friendliness purposes, I strongly recommend picking one to describe your project. Personally, I'd use data science/data mining because you're talking to data scientists. Then you can refer to machine learning methods for analysis microservices, of course

@pottekkat
Contributor Author

I think so too. I also think that the exact purpose of this project gets lost somewhere in the docs. How can we clear things up in the README and the docs?

PS: We can first fix the README and then move on to the rest of the docs.

@LaChapeliere
Contributor

@navendu-pottekkat I've shared a first proposal for the intro in this draft PR ⬆️
You are right, let's not try to change everything at the same time. Plus improving the docs requires understanding the code, so it's more work ^^

@riibeirogabriel
Member

First question, probably for @riibeirogabriel
"learningOrchestra facilitates and streamlines iterative processes in a Data Science project pipeline " but "learningOrchestra is a distributed Machine Learning processing tool"
After reading your monograph, I think I understand why you used data science in one case and machine learning in another, but for user-friendliness purposes, I strongly recommend picking one to describe your project. Personally, I'd use data science/data mining because you're talking to data scientists. Then you can refer to machine learning methods for analysis microservices, of course

Yep, I think so too. We can write the readme with the data scientist as the audience. The "distributed machine learning processing tool" phrasing came from the beginning of the project, when we wanted to make a distributed tool, but that mindset has changed.

@LaChapeliere
Contributor

First question, probably for @riibeirogabriel
"learningOrchestra facilitates and streamlines iterative processes in a Data Science project pipeline " but "learningOrchestra is a distributed Machine Learning processing tool"
After reading your monograph, I think I understand why you used data science in one case and machine learning in another, but for user-friendliness purposes, I strongly recommend picking one to describe your project. Personally, I'd use data science/data mining because you're talking to data scientists. Then you can refer to machine learning methods for analysis microservices, of course

Yep, I think so too. We can write the readme with the data scientist as the audience. The "distributed machine learning processing tool" phrasing came from the beginning of the project, when we wanted to make a distributed tool, but that mindset has changed.

I'll write the Readme from that perspective, and that way you can decide later whether to change the motto and thumbnail.

@LaChapeliere
Contributor

@riibeirogabriel I'm trying to figure out the installation process. There's one big thing that is unclear to me: you mention Linux hostS and clusters, so I'm guessing you can run learningOrchestra in parallel across several machines? If that's the case, how/where do you link the machines together?
Also, you need to already own the machines on which you are running learningOrchestra, right? So you have to rent some VMs from some cloud provider first? Do you plan to add a feature where learningOrchestra can facilitate that?

@riibeirogabriel
Member

We link the machines with Docker swarm. A requirement to run learningOrchestra is a Docker swarm cluster provided by the user; with this cluster, we can run learningOrchestra without worrying about infrastructure. And yes, the user needs to rent machines in the cloud to use learningOrchestra, but if they already have local machines, it is possible to run it locally. What kind of feature do you think could facilitate this infrastructure and cloud setup? Does that make sense?

@riibeirogabriel
Member

The user needs to create a Docker swarm cluster to link the machines.
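(For reference, the standard Docker swarm workflow is roughly the sketch below; `<MANAGER-IP>` and `<TOKEN>` are placeholders for your own values, not learningOrchestra-specific settings.)

```
# On the machine that will be the swarm manager:
docker swarm init --advertise-addr <MANAGER-IP>

# The init command prints a "docker swarm join" command with a token;
# run that command on every worker machine, e.g.:
docker swarm join --token <TOKEN> <MANAGER-IP>:2377

# Back on the manager, confirm that all nodes joined the cluster:
docker node ls
```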

@LaChapeliere
Contributor

Yup, I understand, thank you! I have no idea how to set up a swarm cluster, but I guess I'll have to read up on it then :D
I'm not sure what kind of feature would facilitate that, but I know it would make a huge difference for non-"System and architecture" oriented data scientists like me.
Honestly, even ssh-ing into the university-maintained cluster makes some of us run away screaming 🙀

@riibeirogabriel
Member

riibeirogabriel commented Oct 4, 2020

We put a link in the requirements teaching how to create a Docker swarm cluster; it is easy! I agree with you, we need to make the architecture/infrastructure part of learningOrchestra easier. I will create an issue!

@LaChapeliere
Contributor

@riibeirogabriel The microservices REST API has to be called with curl, right? Could you give me an example of a complete command you would enter in the terminal of your manager instance to use one of the microservices?
(Sorry for my many tech questions)

@riibeirogabriel
Member

riibeirogabriel commented Oct 4, 2020

@LaChapeliere curl isn't very friendly to use; we can use GUI programs like Postman or Insomnia to call the REST APIs. The calls to each API are shown in each microservice's docs. Take a look at a GUI REST client (Postman) calling a learningOrchestra microservice:
[screenshot: Postman request to a learningOrchestra microservice]
You can see each microservice's API calls at https://learningorchestra.github.io/learningOrchestra-docs/database-api/
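(For readers who do prefer the terminal, a raw call would look something like the sketch below; the port, endpoint path, and JSON fields are assumptions for illustration only, so check the database-api docs linked above for the real routes.)

```
# Hypothetical example: submit a dataset URL to the database API microservice.
# Port 5000, the /files path, and the JSON fields are illustrative assumptions;
# the actual routes are documented in the database-api page linked above.
curl -X POST http://<MANAGER-IP>:5000/files \
     -H "Content-Type: application/json" \
     -d '{"filename": "titanic_training", "url": "https://example.com/train.csv"}'
```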

@LaChapeliere
Contributor

@riibeirogabriel I'm not sure whether to cry or be happy, I've always muddled through with curl because I thought I didn't have any other option... Thanks!

@riibeirogabriel
Member

hahaha, the python package was created to abstract away the calls for the user; the user only needs to call the methods of each microservice's class and it is done!
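(A minimal sketch of what that looks like from the user's side; the import path, class name, and method below are hypothetical stand-ins, not the package's real API, which is documented in the python package docs.)

```python
# Hypothetical sketch only: the import path, class, and method names here are
# illustrative placeholders, not the real learningOrchestra client API.
from learning_orchestra_client import DatabaseApi  # assumed import path

db = DatabaseApi(cluster_ip="<MANAGER-IP>")  # assumed constructor argument
db.insert_file(                              # assumed method name
    filename="titanic_training",
    url="https://example.com/train.csv",
)
```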

@LaChapeliere
Contributor

@riibeirogabriel I've pushed the quick start to PR #95, could you please check that I haven't missed anything? Then I'll dive into the more detailed instructions.

@LaChapeliere
Contributor

@riibeirogabriel Would learningOrchestra work on a single-node swarm or is it a deal breaker?

@riibeirogabriel
Member

If the node has around 12 GB of RAM, a quad-core processor, and 100 GB of disk, it can work with small datasets, but to deal with big data a cluster is necessary!

@LaChapeliere
Contributor

If the node has around 12 GB of RAM, a quad-core processor, and 100 GB of disk, it can work with small datasets, but to deal with big data a cluster is necessary!

Wow, quad-core processor?

Yeah, I guessed resources would be the problem, though I didn't imagine it to require that much. I just wanted to check that learningOrchestra wouldn't crash on a single node.

@LaChapeliere
Contributor

Another noob question, because I don't have the setup to run it on my own computer: I imagine when we run sudo ./run.sh, it runs in a loop and we have to run the python client commands/the REST calls from another terminal? And we can terminate learningOrchestra like any command-line program, with Ctrl+C?
Bonus question: Have you tried to see what happens with terminal multiplexers that enable persistent session, like tmux?

@riibeirogabriel
Member

riibeirogabriel commented Oct 4, 2020

Yeah, I guessed resources would be the problem, though I didn't imagine it to require that much. I just wanted to check that learningOrchestra wouldn't crash on a single node.

Yep, learningOrchestra wasn't planned to run on a single node.

@riibeirogabriel
Member

riibeirogabriel commented Oct 4, 2020

Another noob question, because I don't have the setup to run it on my own computer: I imagine when we run sudo ./run.sh, it runs in a loop and we have to run the python client commands/the REST calls from another terminal? And we can terminate learningOrchestra like any command-line program, with Ctrl+C?
Bonus question: Have you tried to see what happens with terminal multiplexers that enable persistent session, like tmux?

It doesn't work that way; the run script does the deploy and then finishes haha. To shut down learningOrchestra it is necessary to run docker stack rm microservice. You reminded me to put this in the docs; there is no information on how to shut down this software.
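(So the full lifecycle, as described above, is roughly the sketch below; `docker stack rm microservice` is the command quoted in this comment, and `docker stack services` is a standard Docker command to check the deployment.)

```
# Deploy: run.sh performs the stack deploy and then exits
sudo ./run.sh

# Optional: check that the services came up
docker stack services microservice

# Shut down learningOrchestra by removing the stack
docker stack rm microservice
```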

@riibeirogabriel
Member

Bonus question: Have you tried to see what happens with terminal multiplexers that enable persistent sessions, like tmux?

That only applies if the run.sh execution is endless, right?

@LaChapeliere
Contributor

Yes, it doesn't matter since the deploy is one-time.

@LaChapeliere
Contributor

How long does the deploy take, typically?

@riibeirogabriel
Member

Around 10 minutes, but it depends on the machine's resources.

@LaChapeliere
Contributor

@riibeirogabriel Could you confirm the categories I've assigned to each microservice, please?

Database API- Gather data
Projection API- Visualize data and results (I'm assuming we're talking about mapping data points into display points?)
Data type API- Clean data
Histogram API- Visualize data and results
t-SNE API- Train machine learning models + Visualize data and results
PCA API- Train machine learning models + Visualize data and results
Model builder API- Train machine learning models

Can any of them also be tagged "Evaluate machine learning models"?

@LaChapeliere
Contributor

LaChapeliere commented Oct 5, 2020

Just a checklist to keep track of the progress:

  • User-friendly README
  • User doc: microservices
  • User doc: python package
  • Clarify the different repos
  • Dev doc + contributing guide
  • Going back to usage instructions in Readme to update with links to the rest of the doc
  • Get a new user (or several) to follow the doc on a simple use case
  • Solve detected problems
  • Unify doc for all repos

@riibeirogabriel
Member

Database API - CRUD operations on preprocessed data, new data, and results (except for the t-SNE and PCA APIs, each of which has its own CRUD operations)
Projection API - Preprocessing data, creating a projection from a stored dataset (I don't understand your question here)
Data type API - Preprocessing data, changing the type of data fields (between text and number, the main JSON types)
Histogram API - Preprocessing data, creating a histogram from some fields of a stored dataset
t-SNE API - Preprocessing data, creating a t-SNE image plot from the dataset content
PCA API - Preprocessing data, creating a PCA image plot from the dataset content
Model builder API - Trains, evaluates, and makes predictions with machine learning models using several classifiers

@riibeirogabriel
Member

riibeirogabriel commented Oct 5, 2020

@LaChapeliere Do you understand each microservice's function?

@riibeirogabriel
Member

All microservices except t-SNE and PCA do CRUD on your data using the database API microservice, so the user must use the database API microservice to visualize and handle the results from the other microservices.

@riibeirogabriel
Member

The t-SNE and PCA microservices don't store your data in MongoDB, so each of them has its own CRUD operations.

@LaChapeliere
Contributor

I understand the microservices better now, thanks 👍
I was actually trying to sort them into categories corresponding to the data science pipeline steps you are covering. So could you label each microservice with those: Gather data, Clean data, Train machine learning models, Evaluate machine learning models, Visualize data and results?
Or tell me how else you'd like to categorise them if this doesn't work 🌵

@riibeirogabriel
Member

When we wrote the first monograph, we cataloged the microservices into 8 types: Load Data; Load Model; Pre-processing; Tuning; Training; Evaluation; Production; and Observer. Not all microservice types have been created yet; we plan to create all of them by the second monograph. The existing microservices are cataloged into these types:
Database API - Load Data
Projection API - Pre-processing
Data type API - Pre-processing
Histogram API - Pre-processing
t-SNE API - Pre-processing
PCA API - Pre-processing
Model builder API - Pre-processing, Training, and Evaluation (we will decouple this microservice by the second monograph)

@riibeirogabriel
Member

Then maybe you meant categorizing the types within preprocessing, right?

@LaChapeliere
Contributor

Hum, is your advisor a data scientist? Your preprocessing sounds weird to me. For me, preprocessing means preparing the data so that you can run analyses on it. Data cleaning, type casting, date formatting, stop word removal, ...

@riibeirogabriel
Member

riibeirogabriel commented Oct 5, 2020

One of my advisors is a data scientist. learningOrchestra needs more microservices to also cover these steps, but you need to know how the model builder microservice works: the model builder takes a Python 3 code snippet as a parameter. This code is written by the user, and with it the data scientist can handle the particularities of a dataset. We figured that no single microservice could handle the particularities of all datasets, so we created the preprocess code parameter. Did you see the preprocess code created for the Titanic dataset? It does all the steps you mentioned (data cleaning, type casting (the data type handler microservice haha), date formatting, stop word removal) and more! Please take a quick look at https://learningorchestra.github.io/learningOrchestra-docs/modelbuilder-api/. The model builder needs a string (the Python 3 preprocess code); it then interprets this string as Python code and runs those instructions on the dataset (a pyspark dataframe). If you scroll down to the preprocessor_code Example section, you will see the Titanic code; it is sent as a string (the python client package is the best way to make a request with this code). What do you think?
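(To illustrate the idea, a tiny sketch of such a preprocessor string is shown below; the DataFrame variable name and column names are placeholder assumptions, and the real contract plus the full Titanic example live in the modelbuilder-api page linked above.)

```python
# Hypothetical sketch: a small preprocessing snippet passed to the model builder
# as a string. The DataFrame variable name ("dataset") and the columns used here
# are assumptions for illustration, not the documented preprocessor_code contract.
preprocessor_code = """
from pyspark.sql.functions import col

# drop rows with a missing age and cast the fare column to a numeric type
dataset = dataset.dropna(subset=["Age"])
dataset = dataset.withColumn("Fare", col("Fare").cast("double"))
"""
```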

@riibeirogabriel
Member

And I will work on your PR tomorrow!

@LaChapeliere
Contributor

I think that you definitely need to cut that microservice into several ones 🤣
But I agree with your labelling for that microservice; I was more surprised that you labelled the Projection API, Histogram API, t-SNE API, and PCA API as "Preprocessing".
For me, t-SNE and PCA are models.
The projection can be simple visualisation or a machine learning model, depending on what kind of projection it is. Histograms are definitely visualisation. I understand that they don't actually produce a graph in this microservice setup, so I'm not sure how to label them, but I think "Preprocessing" is too confusing.

@riibeirogabriel
Member

You are right! I also think that "preprocessing" was confusing, but we will improve it in the next releases leading up to the second monograph, and yes, we will cut the model builder microservice into several microservices.

@riibeirogabriel
Member

PCA and t-SNE definitely are not preprocessing microservices; my advisor thought of using them to visualize the state of a dataset at each step of the pipeline.

@LaChapeliere
Contributor

I'll give some thought to the category names tonight and propose something else tomorrow?

@riibeirogabriel
Member

Sounds good, there is no rush, and thanks!
