
Improve Docs #91

Open
pottekkat opened this issue Oct 4, 2020 · 46 comments

@pottekkat
Contributor

The docs of learningOrchestra need to be validated, tested, and improved by test users. This covers both the README and the docs page.

@pottekkat added the help wanted, good first issue, and hacktoberfest labels on Oct 4, 2020
@LaChapeliere
Contributor

Hi, I'm happy to help with the doc, because I had trouble understanding what the project was about when looking through the readme. Focusing on user doc (rather than dev doc), I'm proposing to go over the Readme with the view of a user who knows how to write data mining scripts but might not be familiar with infrastructure, cloud services, microservices...

@LaChapeliere
Contributor

Proposed outline:

  1. One-sentence summary
  2. Thumbnail and indicators
  3. Introduction: what is the project, who is it for, why should I use it?
  4. Table of Contents
  5. Quick-start
  6. Installation instructions
  7. Usage instructions
  8. About learningOrchestra

@LaChapeliere
Contributor

I'll start a branch and update here when I'm missing info

@pottekkat
Contributor Author

@LaChapeliere Thank you for contributing. Sure, you can make the changes to the README and the docs repo. You could start a draft PR so we can track the progress.

@LaChapeliere
Contributor

First question, probably for @riibeirogabriel
"learningOrchestra facilitates and streamlines iterative processes in a Data Science project pipeline " but "learningOrchestra is a distributed Machine Learning processing tool"
After reading your monograph, I think I understand why you used data science in one case and machine learning in another, but for user-friendliness purposes, I strongly recommend picking one to describe your project. Personally, I'd use data science/data mining because you're talking to data scientists. Then you can refer to machine learning methods for analysis microservices, of course

@pottekkat
Contributor Author

I think so too. I also think that the exact purpose of this project gets lost somewhere in the docs. How can we clear things up in the README and the docs?

PS: We can first fix the README and then move on to the rest of the docs.

@LaChapeliere
Contributor

@navendu-pottekkat I've shared a first proposal for the intro in this draft PR ⬆️
You are right, let's not try to change everything at the same time. Plus improving the docs requires understanding the code, so it's more work ^^

@riibeirogabriel
Member

First question, probably for @riibeirogabriel
"learningOrchestra facilitates and streamlines iterative processes in a Data Science project pipeline " but "learningOrchestra is a distributed Machine Learning processing tool"
After reading your monograph, I think I understand why you used data science in one case and machine learning in another, but for user-friendliness purposes, I strongly recommend picking one to describe your project. Personally, I'd use data science/data mining because you're talking to data scientists. Then you can refer to machine learning methods for analysis microservices, of course

Yep, I think so too. We can write the readme with the data scientist as the audience. The "distributed machine learning processing tool" phrasing came from the beginning of the project, when we wanted to make a distributed tool, but that mindset has changed.

@LaChapeliere
Contributor

First question, probably for @riibeirogabriel
"learningOrchestra facilitates and streamlines iterative processes in a Data Science project pipeline " but "learningOrchestra is a distributed Machine Learning processing tool"
After reading your monograph, I think I understand why you used data science in one case and machine learning in another, but for user-friendliness purposes, I strongly recommend picking one to describe your project. Personally, I'd use data science/data mining because you're talking to data scientists. Then you can refer to machine learning methods for analysis microservices, of course

Yep, I think so too. We can write the readme with the data scientist as the audience. The "distributed machine learning processing tool" phrasing came from the beginning of the project, when we wanted to make a distributed tool, but that mindset has changed.

I'll write the Readme from that perspective, and that way you can decide later whether to change the motto and thumbnail.

@LaChapeliere
Contributor

@riibeirogabriel I'm trying to figure out the installation process. There's one big thing that is unclear to me: you mention Linux hostS and clusters, so I'm guessing you can run learningOrchestra in parallel across several machines? If that's the case, how/where do you link the machines together?
Also, you need to already own the machines on which you are running learningOrchestra, right? So you have to rent some VMs from some cloud provider first? Do you plan to add a feature where learningOrchestra can facilitate that?

@riibeirogabriel
Member

We link the machines with Docker swarm. A requirement to run learningOrchestra is a Docker swarm cluster provided by the user; with this cluster, we can run learningOrchestra without worrying about infrastructure. And yes, the user needs to rent machines in the cloud to use learningOrchestra, but if they already have local machines, it is possible to run it locally. What kind of feature do you think could facilitate this infrastructure and cloud setup? Does that make sense?

@riibeirogabriel
Member

The user needs to create a Docker swarm cluster to link the machines.
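(For reference, the standard Docker swarm workflow is roughly the sketch below; `<MANAGER-IP>` and `<TOKEN>` are placeholders for your own values, not learningOrchestra-specific settings.)

```
# On the machine that will be the swarm manager:
docker swarm init --advertise-addr <MANAGER-IP>

# The init command prints a "docker swarm join" command with a token;
# run that command on every worker machine, e.g.:
docker swarm join --token <TOKEN> <MANAGER-IP>:2377

# Back on the manager, confirm that all nodes joined the cluster:
docker node ls
```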

@LaChapeliere
Contributor

Yup, I understand, thank you! I have no idea how to set up a swarm cluster, but I guess I'll have to read up on it then :D
I'm not sure what kind of feature would facilitate that, but I know it would make a huge difference for non-"System and architecture" oriented data scientists like me.
Honestly, even ssh-ing into the university-maintained cluster makes some of us run away screaming 🙀

@riibeirogabriel
Member

riibeirogabriel commented Oct 4, 2020

We put a link in the requirements teaching how to create a Docker swarm cluster; it is easy! I agree with you, we need to make the architecture/infrastructure part of learningOrchestra easier. I will create an issue!

@LaChapeliere
Contributor

@riibeirogabriel The microservices REST API has to be called with curl, right? Could you give me an example of a complete command you would enter in the terminal of your manager instance to use one of the microservices?
(Sorry for my many tech questions)

@riibeirogabriel
Member

riibeirogabriel commented Oct 4, 2020

@LaChapeliere curl isn't very friendly to use; we can use GUI programs like Postman or Insomnia to call the REST APIs. The calls to each API are shown in each microservice's docs. Take a look at a GUI REST client (Postman) calling a learningOrchestra microservice:
[screenshot: Postman request to a learningOrchestra microservice]
You can see each microservice's API calls at https://learningorchestra.github.io/learningOrchestra-docs/database-api/
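(For readers who do prefer the terminal, a raw call would look something like the sketch below; the port, endpoint path, and JSON fields are assumptions for illustration only, so check the database-api docs linked above for the real routes.)

```
# Hypothetical example: submit a dataset URL to the database API microservice.
# Port 5000, the /files path, and the JSON fields are illustrative assumptions;
# the actual routes are documented in the database-api page linked above.
curl -X POST http://<MANAGER-IP>:5000/files \
     -H "Content-Type: application/json" \
     -d '{"filename": "titanic_training", "url": "https://example.com/train.csv"}'
```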

@LaChapeliere
Contributor

@riibeirogabriel I'm not sure whether to cry or be happy, I've always muddled through with curl because I thought I didn't have any other option... Thanks!

@riibeirogabriel
Member

hahaha, the python package was created to abstract away the calls for the user; the user only needs to call the methods of each microservice's class and it is done!
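(A minimal sketch of what that looks like from the user's side; the import path, class name, and method below are hypothetical stand-ins, not the package's real API, which is documented in the python package docs.)

```python
# Hypothetical sketch only: the import path, class, and method names here are
# illustrative placeholders, not the real learningOrchestra client API.
from learning_orchestra_client import DatabaseApi  # assumed import path

db = DatabaseApi(cluster_ip="<MANAGER-IP>")  # assumed constructor argument
db.insert_file(                              # assumed method name
    filename="titanic_training",
    url="https://example.com/train.csv",
)
```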

@LaChapeliere
Contributor

@riibeirogabriel I've pushed the quick start to PR #95, could you please check that I haven't missed anything? Then I'll dive into the more detailed instructions.

@LaChapeliere
Contributor

@riibeirogabriel Would learningOrchestra work on a single-node swarm or is it a deal breaker?

@riibeirogabriel
Member

If the node has around 12 GB of RAM, a quad-core processor, and 100 GB of disk, it can work with small datasets, but to deal with big data a cluster is necessary!

@LaChapeliere
Contributor

If the node has around 12 GB of RAM, a quad-core processor, and 100 GB of disk, it can work with small datasets, but to deal with big data a cluster is necessary!

Wow, quad-core processor?

Yeah, I guessed resources would be the problem, though I didn't imagine it to require that much. I just wanted to check that learningOrchestra wouldn't crash on a single node.

@LaChapeliere
Contributor

Another noob question, because I don't have the setup to run it on my own computer: I imagine when we run sudo ./run.sh, it runs in a loop and we have to run the python client commands/the REST calls from another terminal? And we can terminate learningOrchestra like any command-line program, with Ctrl+C?
Bonus question: Have you tried to see what happens with terminal multiplexers that enable persistent session, like tmux?

@riibeirogabriel
Member

riibeirogabriel commented Oct 4, 2020

Yeah, I guessed resources would be the problem, though I didn't imagine it to require that much. I just wanted to check that learningOrchestra wouldn't crash on a single node.

Yep, learningOrchestra wasn't planned to run on a single node.

@riibeirogabriel
Member

riibeirogabriel commented Oct 4, 2020

Another noob question, because I don't have the setup to run it on my own computer: I imagine when we run sudo ./run.sh, it runs in a loop and we have to run the python client commands/the REST calls from another terminal? And we can terminate learningOrchestra like any command-line program, with Ctrl+C?
Bonus question: Have you tried to see what happens with terminal multiplexers that enable persistent session, like tmux?

It doesn't work that way; the run script does the deploy and then finishes haha. To shut down learningOrchestra it is necessary to run docker stack rm microservice. You reminded me to put this in the docs; there is no information on how to shut down this software.
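(So the full lifecycle, as described above, is roughly the sketch below; `docker stack rm microservice` is the command quoted in this comment, and `docker stack services` is a standard Docker command to check the deployment.)

```
# Deploy: run.sh performs the stack deploy and then exits
sudo ./run.sh

# Optional: check that the services came up
docker stack services microservice

# Shut down learningOrchestra by removing the stack
docker stack rm microservice
```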

@riibeirogabriel
Member

Bonus question: Have you tried to see what happens with terminal multiplexers that enable persistent sessions, like tmux?

That only applies if the run.sh execution is endless, right?

@LaChapeliere
Contributor

Yes, it doesn't matter since the deploy is one-time.

@LaChapeliere
Contributor

How long does the deploy take, typically?

@riibeirogabriel
Member

Around 10 minutes, but it depends on the machine's resources.

@LaChapeliere
Contributor

@riibeirogabriel Could you confirm the categories I've assigned to each microservice, please?

Database API- Gather data
Projection API- Visualize data and results (I'm assuming we're talking about mapping data points into display points?)
Data type API- Clean data
Histogram API- Visualize data and results
t-SNE API- Train machine learning models + Visualize data and results
PCA API- Train machine learning models + Visualize data and results
Model builder API- Train machine learning models

Can any of them also be tagged "Evaluate machine learning models"?

@LaChapeliere
Contributor

LaChapeliere commented Oct 5, 2020

Just a checklist to keep track of the progress:

  • User-friendly README
  • User doc: microservices
  • User doc: python package
  • Clarify the different repos
  • Dev doc + contributing guide
  • Going back to usage instructions in Readme to update with links to the rest of the doc
  • Get a new user (or several) to follow the doc on a simple use case
  • Solve detected problems
  • Unify doc for all repos

@riibeirogabriel
Member

Database API - CRUD operations on preprocessed data, new data, and results (except for the t-SNE and PCA APIs, each of which has its own CRUD operations)
Projection API - Preprocessing data, creating a projection from a stored dataset (I don't understand your question here)
Data type API - Preprocessing data, changing the type of data fields (between text and number, the main JSON types)
Histogram API - Preprocessing data, creating a histogram from some fields of a stored dataset
t-SNE API - Preprocessing data, creating a t-SNE image plot from the dataset content
PCA API - Preprocessing data, creating a PCA image plot from the dataset content
Model builder API - Trains, evaluates, and makes predictions with machine learning models using several classifiers

@riibeirogabriel
Member

riibeirogabriel commented Oct 5, 2020

@LaChapeliere Do you understand each microservice's function?

@riibeirogabriel
Member

All microservices except t-SNE and PCA do CRUD on your data using the database API microservice, so the user must use the database API microservice to visualize and handle the results from the other microservices.

@riibeirogabriel
Member

The t-SNE and PCA microservices don't store your data in MongoDB, so each of them has its own CRUD operations.

@LaChapeliere
Contributor

I understand the microservices better now, thanks 👍
I was actually trying to sort them into categories corresponding to the data science pipeline steps you are covering. So could you label each microservice with those: Gather data, Clean data, Train machine learning models, Evaluate machine learning models, Visualize data and results?
Or tell me how else you'd like to categorise them if this doesn't work 🌵

@riibeirogabriel
Member

When we wrote the first monograph, we cataloged the microservices into 8 types: Load Data; Load Model; Pre-processing; Tuning; Training; Evaluation; Production; and Observer. Not all microservice types have been created yet; we plan to create all of them by the second monograph. The existing microservices are cataloged into these types:
Database API - Load Data
Projection API - Pre-processing
Data type API - Pre-processing
Histogram API - Pre-processing
t-SNE API - Pre-processing
PCA API - Pre-processing
Model builder API - Pre-processing, Training, and Evaluation (we will decouple this microservice by the second monograph)

@riibeirogabriel
Member

Then maybe you meant categorizing the types within preprocessing, right?

@LaChapeliere
Contributor

Hum, is your advisor a data scientist? Your preprocessing sounds weird to me. For me, preprocessing means preparing the data so that you can run analyses on it. Data cleaning, type casting, date formatting, stop word removal, ...

@riibeirogabriel
Member

riibeirogabriel commented Oct 5, 2020

One of my advisors is a data scientist. learningOrchestra needs more microservices to also cover these steps, but you need to know how the model builder microservice works: the model builder takes a Python 3 code snippet as a parameter. This code is written by the user, and with it the data scientist can handle the particularities of a dataset. We figured that no single microservice could handle the particularities of all datasets, so we created the preprocess code parameter. Did you see the preprocess code created for the Titanic dataset? It does all the steps you mentioned (data cleaning, type casting (the data type handler microservice haha), date formatting, stop word removal) and more! Please take a quick look at https://learningorchestra.github.io/learningOrchestra-docs/modelbuilder-api/. The model builder needs a string (the Python 3 preprocess code); it then interprets this string as Python code and runs those instructions on the dataset (a pyspark dataframe). If you scroll down to the preprocessor_code Example section, you will see the Titanic code; it is sent as a string (the python client package is the best way to make a request with this code). What do you think?
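(To illustrate the idea, a tiny sketch of such a preprocessor string is shown below; the DataFrame variable name and column names are placeholder assumptions, and the real contract plus the full Titanic example live in the modelbuilder-api page linked above.)

```python
# Hypothetical sketch: a small preprocessing snippet passed to the model builder
# as a string. The DataFrame variable name ("dataset") and the columns used here
# are assumptions for illustration, not the documented preprocessor_code contract.
preprocessor_code = """
from pyspark.sql.functions import col

# drop rows with a missing age and cast the fare column to a numeric type
dataset = dataset.dropna(subset=["Age"])
dataset = dataset.withColumn("Fare", col("Fare").cast("double"))
"""
```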

@riibeirogabriel
Member

And I will work on your PR tomorrow!

@LaChapeliere
Contributor

I think that you definitely need to cut that microservice into several ones 🤣
But I agree with your labelling for that microservice; I was more surprised that you labelled the Projection API, Histogram API, t-SNE API, and PCA API as "Preprocessing".
For me, t-SNE and PCA are models.
The projection can be simple visualisation or a machine learning model, depending on what kind of projection it is. Histograms are definitely visualisation. I understand that they don't actually produce a graph in this microservice setup, so I'm not sure how to label them, but I think "Preprocessing" is too confusing.

@riibeirogabriel
Member

You are right! I also think that "preprocessing" was confusing, but we will improve it in the next releases leading up to the second monograph, and yes, we will cut the model builder microservice into several microservices.

@riibeirogabriel
Member

PCA and t-SNE definitely are not preprocessing microservices; my advisor thought of using them to visualize the state of a dataset at each step of the pipeline.

@LaChapeliere
Contributor

I'll give some thought to the category names tonight and propose something else tomorrow?

@riibeirogabriel
Member

Sounds good, there is no rush, and thanks!
