Questions about DARTS #99
Thanks for pointing out these questions. (1) (k+1)*k/2 arises because the k-th node has (k+1) preceding nodes, and selecting two of them gives C(k+1, 2) possibilities. The 2 input nodes are pre-defined according to human experts' experience. If isomorphism is considered, you would need another way to represent this DAG. Before pruning the fully connected graph into the "2-input-nodes version", the k-th node has (k+1) preceding nodes and thus (k+1) incoming edges, giving (1+1) + (2+1) + (3+1) + (4+1) = 14 learnable edges. (2) We hypothesize that the normal cell and the reduction cell will have very different topological structures. (3) No theoretical guarantee. (4) Because at each iteration DARTS needs the weighted sum, under the architecture parameters, of the outputs of every candidate operation, which is O(N), whereas GDAS only needs to "sample" one candidate operation, which is O(1).
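The edge-count arithmetic above can be checked in a few lines. This is a standalone sketch; the variable names are mine, not from the DARTS code:

```python
import math

# Each intermediate node k (k = 1..4) in a DARTS cell can receive input from
# (k + 1) preceding nodes: the two cell-level inputs plus the k - 1 earlier
# intermediate nodes. Summing over the four nodes gives the edge count.
num_edges = sum(k + 1 for k in range(1, 5))   # (1+1) + (2+1) + (3+1) + (4+1)
print(num_edges)                              # -> 14 learnable edges

# If exactly two input edges are kept per node, the number of ways to choose
# them from the (k + 1) candidates is C(k+1, 2) = (k+1)*k/2.
choices_per_node = [math.comb(k + 1, 2) for k in range(1, 5)]
print(choices_per_node)                       # -> [1, 3, 6, 10]
```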
Why does the k-th node have (k+1) preceding nodes?
Because each cell also takes the outputs of the two previous cells as inputs, so even the first node in a cell already has two preceding nodes.
Someone told me the above, but I am not familiar with Gumbel and how it actually helps speed up GDAS relative to DARTS. I suppose it is the Gumbel-max trick mentioned in the paper. I do not quite understand expressions (3) and (5) in the GDAS paper.
You could have a look at our code: https://github.com/D-X-Y/AutoDL-Projects/blob/main/lib/models/cell_searchs/search_model_gdas.py#L89
@D-X-Y Could you comment on this reply about your Gumbel-max code implementation? @Unity05 was suggesting using softargmax.
Hi, in your code, would you be able to describe the computation logic of the Gumbel-max implementation? Besides, why would the Gumbel-max computation need it? How exactly does Gumbel-max transform equation (3) into equation (5)?
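For readers sharing this confusion, here is a minimal, self-contained sketch of the Gumbel-max trick that Eq.(3) and Eq.(5) of the GDAS paper build on: perturb each logit with independent Gumbel(0, 1) noise and take the argmax, which yields a sample from the softmax distribution. The function name is hypothetical, not from the repository:

```python
import math
import random

def gumbel_max_sample(logits, rng=random):
    """Sample index i with probability softmax(logits)[i] via the Gumbel-max
    trick: argmax_i (logits[i] + g_i), where g_i ~ Gumbel(0, 1)."""
    gumbels = [-math.log(-math.log(rng.random())) for _ in logits]
    perturbed = [l + g for l, g in zip(logits, gumbels)]
    return max(range(len(logits)), key=perturbed.__getitem__)

random.seed(0)
logits = [2.0, 0.0, 0.0]  # architecture logits heavily favouring operation 0
counts = [0, 0, 0]
for _ in range(5000):
    counts[gumbel_max_sample(logits)] += 1
print(counts)  # operation 0 is sampled most often
```

The key point for GDAS is that only the single sampled operation is evaluated in the forward pass, while the softmax over logits is still what the sampling distribution follows.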
For the question on
@D-X-Y by the way, why does PNASNet mention it?
Solution:
So, in this case, I suppose I could use only a single set of weights for both normal cells and reduction cells? As for Algorithm 1, how is A different from W?
Sorry for the late reply, I'm a little bit busy these days.
Yes, I borrowed the idea of how to implement Gumbel from PyTorch, with a few modifications.
Yes, in this case, the architecture weights for normal cells and reduction cells are shared.
@D-X-Y I am a bit confused about the difference between the two. Edit: I think I got it now; a single cell contains 4 distinct nodes. By the way, in Algorithm 1, why does GDAS update A first?
I feel it does not matter? Updating A first or W first should give similar results.
For GDAS, would https://networkx.org/documentation/stable/tutorial.html#multigraphs be suitable for both forward inference and backward propagation?
I'm not familiar with networkx.
@D-X-Y I am confused as to how https://github.com/D-X-Y/AutoDL-Projects/blob/main/xautodl/models/cell_searchs/search_model_gdas.py implements multiple parallel connections between nodes.
@D-X-Y I am confused as to how equation (7) is an approximation of equation (5), as described in the GDAS paper.
@ProMach
If you run Eq.(5) infinitely many times and Eq.(7) infinitely many times, their average results should be very close.
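The maintainer's point above can be demonstrated numerically: averaging many hard one-hot samples drawn via the Gumbel-max trick recovers the softmax probabilities used in the expectation. The logits below are made up for illustration:

```python
import math
import random

random.seed(0)
logits = [1.0, 0.5, -0.5]

# Softmax probabilities: the expected (soft) weighting, Eq.(5)-style.
exps = [math.exp(l) for l in logits]
probs = [e / sum(exps) for e in exps]

# Repeatedly draw a hard one-hot choice via Gumbel-max (Eq.(7)-style
# discrete sampling) and average the resulting one-hot vectors.
n = 20000
counts = [0.0] * len(logits)
for _ in range(n):
    g = [-math.log(-math.log(random.random())) for _ in logits]
    i = max(range(len(logits)), key=lambda j: logits[j] + g[j])
    counts[i] += 1
empirical = [c / n for c in counts]

print([round(p, 3) for p in probs])
print([round(p, 3) for p in empirical])  # close to the softmax probabilities
```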
For #99 (comment), how do you actually update both A and W? Could you point me to the relevant code for the update portion?
@ProMach, at a single iteration, we will first update
If we update
I feel it does not matter for the order of updating A and W.
the issue lingering in my head is that if
What do you mean by
You could have a look at the codes here. Would you mind clarifying what you think the codes should be?
Let me rephrase my question: how do you define the update? It seems to be different from what the DARTS paper originally proposed; see equations (5) and (6) of the DARTS paper.
Yes, following the DARTS paper, I should switch the order of updating A and W.
During training of W, should I use a particular architecture found inside that particular epoch?
It depends on the NAS algorithm. For DARTS, they use the whole supernet. For GDAS, we use an architecture candidate randomly sampled based on the architecture parameters.
The candidate is chosen using Gumbel-argmax (equations (5) and (6) of the GDAS paper), instead of being chosen uniformly at random.
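The cost difference between the two schemes discussed here can be sketched with toy scalar "operations". This is an illustration only; the functions and the alphas are made up and do not come from either codebase:

```python
import math
import random

# Toy "candidate operations" on a scalar input; stand-ins for conv/pool ops.
ops = [lambda x: x, lambda x: 2 * x, lambda x: -x]
alphas = [1.0, 0.0, -1.0]  # architecture parameters for one edge

def darts_edge(x):
    """DARTS-style edge: weighted sum over ALL N candidates -> O(N) op calls."""
    exps = [math.exp(a) for a in alphas]
    weights = [e / sum(exps) for e in exps]
    return sum(w * op(x) for w, op in zip(weights, ops))

def gdas_edge(x, rng=random):
    """GDAS-style edge: Gumbel-max picks ONE candidate -> O(1) op calls."""
    g = [-math.log(-math.log(rng.random())) for _ in alphas]
    i = max(range(len(alphas)), key=lambda j: alphas[j] + g[j])
    return ops[i](x)

print(darts_edge(1.0))  # deterministic mixture of all three ops
random.seed(0)
print(gdas_edge(1.0))   # output of one sampled op only
```

Per edge and per forward pass, DARTS evaluates every candidate, while GDAS evaluates just the sampled one, which is where the roughly N-fold speedup in the search comes from.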
For #99 (comment), there are two types of outputs from the blue node. One type (multiple edges) connects to the inputs of the other blue nodes? The other type (a single edge) connects directly to the yellow node?
@D-X-Y I implemented a draft code for GDAS. However, could you advise whether this edge-weight training-epoch mechanism will actually work for GDAS?
Yes, you are right~ |
@ProMach Yes. DARTS also uses
I personally feel the implementations are incorrect. I haven't fully checked the codes, but at least the input to every cell/node should not be the same.
How do I code the forward pass function correctly for edge-weight training?
Note: This is for training step 2 inside Algorithm 1 of the DARTS paper.
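For orientation, the alternating update in Algorithm 1 of the DARTS paper can be mimicked on a toy problem: architecture parameters are stepped on a validation objective, then weights on a training objective. The scalar stand-ins `w`, `a` and the quadratic losses below are entirely made up, purely to show the update ordering:

```python
# Toy first-order alternation in the spirit of DARTS Algorithm 1.
# w stands in for network weights, a for architecture parameters.
def train_loss_grad_w(w, a):
    return 2 * (w - a)        # d/dw of the made-up train loss (w - a)^2

def val_loss_grad_a(w, a):
    return 2 * (a - 0.5 * w)  # d/da of the made-up val loss (a - w/2)^2

w, a, lr = 1.0, 0.0, 0.1
for _ in range(200):
    a -= lr * val_loss_grad_a(w, a)    # step 1: update A on validation data
    w -= lr * train_loss_grad_w(w, a)  # step 2: update W on training data
print(round(w, 3), round(a, 3))
```

In a real implementation each gradient comes from a different mini-batch (validation for A, training for W), but the control flow is the same alternation.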
Why do we return self.weights, instead of returning the value computed using them?
I am not sure how to train edge weights, hence the question about
Sorry, I misinterpreted the purpose of the two forward pass functions.
However, I am still not sure how to code for it.
@D-X-Y For ordinary NN training, we have some feature-map outputs. However, for the edge-weight (NAS) training operation, there are no feature-map outputs. So, what should be fed in?
Is using
@D-X-Y I managed to get my own GDAS code implementation up and running. However, the loss stays the same, which indicates the training process is still incorrect. Could you advise?
@D-X-Y Looking at the output of graph.named_parameters(), some of the internal connections within the supernet architecture are still not connected properly. Any comments?
For the DARTS complexity analysis, does anyone have an idea how to derive the (k+1)*k/2 expression? Why 2 input nodes? How would the calculated value change if graph isomorphism is considered? Why "2+3+4+5" learnable edges? If a connection is absent, shouldn't the paper avoid adding 1, since it does not actually contribute to the learnable-edge configurations?
Why do we need to train the weights for normal cells and reduction cells separately, as shown in Figures 4 and 5 below?
How should the nodes be arranged so that the NAS search actually converges with minimum error? Note: not all nodes are connected to every other node.
Why is GDAS 10 times faster than DARTS ?