Improve Docs #91
Hi, I'm happy to help with the docs, because I had trouble understanding what the project was about when looking through the README. Focusing on user docs (rather than dev docs), I'm proposing to go over the README from the viewpoint of a user who knows how to write data mining scripts but might not be familiar with infrastructure, cloud services, microservices...
Proposed outline:
I'll start a branch and update here when I'm missing info.
@LaChapeliere Thank you for contributing. Sure, you can make the changes to the README and the docs repo. Maybe you can start a draft PR so we can track the progress.
First question, probably for @riibeirogabriel…
I think so too. I also think that the exact purpose of this project gets lost somewhere in the docs. How can we clear things up in the README and the docs? PS: We can first fix the README and then move on to the rest of the docs.
@navendu-pottekkat I've shared a first proposal for the intro in this draft PR ⬆️
Yep, I guess so too. We can write the README with the data scientist in mind; the "distributed machine learning processing tool" motto was chosen at the beginning of the project, when we wanted to make a distributed tool, but that mindset has changed.
I'll write the README from that perspective then, and you can decide whether to change the motto and thumbnail later on.
@riibeirogabriel I'm trying to figure out the installation process. There's one big thing that is unclear to me: you mention Linux hosts (plural) and clusters, so I'm guessing you can run learningOrchestra distributed over several machines? If that's the case, how/where do you link the machines together?
We link the machines with Docker swarm; a requirement to run learningOrchestra is a Docker swarm cluster provided by the user. With this cluster, we can run learningOrchestra without worrying about infrastructure. And yes, the user needs to rent machines in the cloud to use learningOrchestra, but if you already have local machines, it's possible to run it locally. What kind of feature do you think could make this infra/cloud setup easier? Does that make sense?
The user needs to create a Docker swarm cluster to link the machines.
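For illustration, a minimal sketch of what "linking the machines" looks like, here written with the docker Python SDK (the CLI equivalents are `docker swarm init` and `docker swarm join`); the IP address is a placeholder, and the learningOrchestra requirements point to the official Docker swarm tutorial for the real steps:

```python
# A rough sketch only: joining two machines into a Docker swarm with the docker
# Python SDK (pip install docker). The IP address is a placeholder, not a value
# from the learningOrchestra docs.
import docker

# Run this part on the machine that will be the swarm manager.
manager = docker.from_env()
manager.swarm.init(advertise_addr="192.168.0.10")            # placeholder manager IP
worker_token = manager.swarm.attrs["JoinTokens"]["Worker"]   # token workers must present

# Run this part on each worker machine (a separate host in practice).
worker = docker.from_env()
worker.swarm.join(
    remote_addrs=["192.168.0.10:2377"],  # manager address plus the default swarm port
    join_token=worker_token,
)
```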
Yup, I understand, thank you! I have no idea how to set up a swarm cluster, but I guess I'll have to read up on it then :D
We put a link in the requirements teaching how to create a Docker swarm cluster, it is easy! I agree with you, we need to make the architecture/infrastructure part of learningOrchestra easier, I will create an issue!
@riibeirogabriel The microservices' REST APIs have to be called with curl, right? Could you give me an example of a complete command you would enter in the terminal of your manager instance to use one of the microservices?
@LaChapeliere curl isn't very friendly to use; we can use other programs with a GUI, like Postman or Insomnia, to call the REST APIs. The calls to each API are shown in each microservice's docs. Take a look at a GUI REST API client (Postman) calling a learningOrchestra microservice:
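(The Postman screenshot referenced above doesn't reproduce here; as a rough text stand-in, here is the same kind of call made from Python with the requests library. The port, route, and payload fields are placeholders, not the documented API; the real ones are listed in each microservice's docs.)

```python
# Sketch only: the port, route, and JSON fields below are placeholders, not the
# documented learningOrchestra API. Check each microservice's docs for the real ones.
import requests

CLUSTER_IP = "192.168.0.10"  # address of the swarm manager running learningOrchestra

response = requests.post(
    f"http://{CLUSTER_IP}:5000/some-microservice",
    json={"filename": "titanic_training", "url": "https://example.com/titanic.csv"},
)
print(response.status_code, response.json())
```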
@riibeirogabriel I'm not sure whether to cry or be happy, I've always muddled through with curl because I thought I didn't have any other option... Thanks!
Hahaha, the Python package was created to abstract the calls away from the user; the user only needs to call the methods of each microservice's class and it is done!
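A hypothetical sketch of that idea, one class per microservice so the user calls a method instead of hand-building HTTP requests; the class, method, and route below are invented for illustration and are not the real client package API:

```python
# Hypothetical sketch: DatabaseApi, insert_csv, and the route are placeholder names,
# not the real learningOrchestra Python client API.
import requests


class DatabaseApi:
    def __init__(self, cluster_ip: str):
        self.url = f"http://{cluster_ip}:5000/database"  # placeholder port and route

    def insert_csv(self, filename: str, csv_url: str) -> dict:
        # One method call hides the whole REST round trip from the user.
        return requests.post(self.url, json={"filename": filename, "url": csv_url}).json()


db = DatabaseApi("192.168.0.10")
print(db.insert_csv("titanic_training", "https://example.com/titanic.csv"))
```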
@riibeirogabriel I've pushed the quick start to PR #95, could you please check that I haven't missed anything? That way I'll dive into the more detailed instructions.
@riibeirogabriel Would learningOrchestra work on a single-node swarm or is it a deal breaker?
If this node has around 12 GB of RAM, a quad-core processor, and 100 GB of disk, it can work with small datasets, but to handle big data, a cluster is necessary!
Wow, quad-core processor? Yeah, I guessed resources would be the problem, though I didn't imagine it to require that much. I just wanted to check that learningOrchestra wouldn't crash on a single node.
Another noob question, because I don't have the setup to run it on my own computer: I imagine when we run …
Yep, learningOrchestra wasn't planned to run on a single node.
It doesn't work that way; the run just performs the deploy and then finishes haha. To shut down learningOrchestra you need to run a separate command.
Yes, it doesn't matter since the deploy is one-time.
How long is the deploy, typically?
Around 10 minutes, but it depends on the machine's resources.
@riibeirogabriel Could you confirm the categories I've assigned to each microservice, please? Database API - Gather data. Can any of them also be tagged "Evaluate machine learning models"?
Just a checklist to keep track of the progress:
Database API - CRUD operations on preprocessed data, new data, and results (except the t-SNE and PCA APIs, each of which has its own CRUD operations)
@LaChapeliere Do you understand each microservice's function?
All microservices except t-SNE and PCA perform CRUD on your data through the Database API microservice, so the user must use the Database API microservice to visualize and handle the results from the other microservices.
t-SNE and PCA don't store your data in MongoDB, so each of them has its own CRUD operations.
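As a concrete but hypothetical illustration of that flow, reading a result produced by another microservice back out through the Database API might look roughly like this; the route, filename, and query parameters are placeholders, not documented endpoints:

```python
# Sketch only: fetching another microservice's result through the Database API.
# The route, filename, and pagination parameters below are placeholders.
import requests

CLUSTER_IP = "192.168.0.10"

result = requests.get(
    f"http://{CLUSTER_IP}:5000/database/titanic_training_prediction",
    params={"skip": 0, "limit": 20},  # placeholder pagination
)
print(result.json())
```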
I understand the microservices better now, thanks 👍
When we wrote the first monograph, we catalogued the microservices into 8 types: Load Data; Load Model; Pre-
So maybe you meant to categorize those types as preprocessing, right?
Hmm, is your advisor a data scientist? Your "preprocessing" sounds weird to me. For me, preprocessing means preparing the data so that you can run analyses on it: data cleaning, type casting, date formatting, stop-word removal, ...
One of my advisors is a data scientist. learningOrchestra needs more microservices to cover those steps too, but you need to know how the model builder microservice works: it takes a Python 3 code string as a parameter, written by the user, and with this code the data scientist can handle the particularities of their dataset. We figured that no single microservice could handle the particularities of every dataset, so we created the preprocess code parameter. Did you see the preprocess code created for the Titanic dataset? It covers all the steps you mentioned (data cleaning, type casting (data type handler microservice haha), date formatting, stop-word removal) and more! Please take a quick look at https://learningorchestra.github.io/learningOrchestra-docs/modelbuilder-api/. The model builder needs a string (the Python 3 preprocess code), interprets that string as Python code, and runs those instructions against the dataset (a PySpark DataFrame). If you scroll down to the preprocessor_code Example section, you will see the Titanic code; this code is sent as a string (the Python client package is the best way to make a request with this code). What do you think?
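To make that concrete, here is a rough sketch of the idea; the DataFrame variable name and the commented submission call are placeholders, and the real Titanic example lives in the preprocessor_code Example section of the modelbuilder-api docs:

```python
# Sketch of the preprocess-code idea: the user writes plain Python that manipulates
# a PySpark DataFrame and ships it to the model builder as a string. The DataFrame
# name "training_df" and the commented submission call are placeholders.
preprocessor_code = """
from pyspark.sql.functions import col

# Drop columns the model should not see, cast types, and fill missing values.
training_df = training_df.drop("Name", "Ticket", "Cabin")
training_df = training_df.withColumn("Age", col("Age").cast("double"))
training_df = training_df.na.fill({"Age": 30.0, "Embarked": "S"})
"""

# The model builder microservice receives this string in the request body and runs
# it against the dataset on the cluster; a hypothetical client call might look like:
# model_builder.create_model(preprocessor_code=preprocessor_code, ...)
```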
And I will work on your PR tomorrow!
I think you definitely need to cut that microservice into several ones 🤣
You are right! I also think the preprocessing part was confusing, but we will improve it in the next releases for the second monograph, and yes, we will split the model builder microservice into several microservices.
PCA and t-SNE are definitely not preprocessing microservices; my advisor thought of using them to visualize the state of the dataset at each step of the pipeline.
I'll give some thought to the category names tonight and propose something else tomorrow?
Sounds good, there is no haste, and thanks!
The learningOrchestra docs need to be validated, tested, and improved by test users. This covers both the README and the docs page.