1. Use a proper initialization method to avoid vanishing or exploding gradients. For example, with layers = [3, 5, 1], initialize the second layer as W[2] = np.random.randn(1, 5) * np.sqrt(1/5), i.e. scale the random weights by the square root of 1 over the number of units feeding into the layer. This keeps the initial weights from being too large or too small, and it can shorten training time significantly (see the initialization sketch after this list).

2. Use regularization to reduce variance (overfitting).
   L2 regularization: add lambda/(2*m) * (sum of the squares of all weights) to the cost function. The backward-propagation formulas also change: each dW[l] gains an extra (lambda/m) * W[l] term (see the L2 sketch below).
   Dropout (inverted dropout): generate a mask with the same shape as the layer's activations, mask[l] = np.random.rand(A[l].shape[0], A[l].shape[1]); mask[l] = (mask[l] < keep_prob).astype(int). During forward propagation use A[l] * mask[l] / keep_prob instead of A[l]; during backward propagation use dA[l] * mask[l] / keep_prob instead of dA[l] (see the dropout sketch below).

3. Use different optimization methods to reduce the time it takes the cost to decay.
   Batch gradient descent: the traditional way, one update per pass over the whole training set.
   Mini-batch gradient descent: choose a mini-batch size such as... (see the mini-batch sketch below).
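
A minimal NumPy sketch of the scaled initialization from item 1, assuming the layers = [3, 5, 1] sizing from the example; the function name `initialize_parameters` and the dictionary keys are illustrative, not from the original notes.

```python
import numpy as np

def initialize_parameters(layer_dims, seed=0):
    """Gaussian weights scaled by sqrt(1 / fan_in), zero biases."""
    np.random.seed(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        # sqrt(1/n_prev) keeps the variance of activations roughly constant
        # across layers, so gradients neither vanish nor explode at the start.
        params["W" + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                * np.sqrt(1.0 / layer_dims[l - 1]))
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

params = initialize_parameters([3, 5, 1])
print(params["W2"].shape)  # (1, 5), scaled by sqrt(1/5)
```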
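
A small sketch of the L2 penalty and the matching gradient term from item 2, assuming weights are stored in a dict keyed "W1", "W2", ...; the helper names `l2_cost_term`, `l2_dW`, and `lambd` are illustrative.

```python
import numpy as np

def l2_cost_term(params, lambd, m):
    # Penalty added to the unregularized cost: (lambda / (2*m)) * sum of squared weights.
    squared = sum(np.sum(np.square(W)) for key, W in params.items() if key.startswith("W"))
    return (lambd / (2 * m)) * squared

def l2_dW(dW, W, lambd, m):
    # Extra backprop term: each dW[l] picks up (lambda / m) * W[l].
    return dW + (lambd / m) * W

params = {"W1": np.random.randn(5, 3), "W2": np.random.randn(1, 5)}
print(l2_cost_term(params, lambd=0.7, m=100))
```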
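
A sketch of inverted dropout applied to one hidden layer, assuming A[l] is the activation matrix of shape (units, examples); the function names and the keep_prob = 0.8 value are illustrative.

```python
import numpy as np

def dropout_forward(A, keep_prob):
    # The mask has the same shape as the activations A[l], not the weights.
    mask = (np.random.rand(A.shape[0], A.shape[1]) < keep_prob).astype(int)
    # Dividing by keep_prob ("inverted" dropout) keeps the expected value of A
    # unchanged, so nothing special is needed at test time.
    return A * mask / keep_prob, mask

def dropout_backward(dA, mask, keep_prob):
    # Shut down the same units in the gradient and apply the same scaling.
    return dA * mask / keep_prob

A = np.random.randn(5, 10)                       # 5 hidden units, 10 examples
A_drop, mask = dropout_forward(A, keep_prob=0.8)
dA_drop = dropout_backward(np.random.randn(5, 10), mask, keep_prob=0.8)
```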
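
A sketch of splitting the training set into mini-batches for item 3, assuming X has shape (features, m) and Y has shape (1, m) with examples as columns; `random_mini_batches` and the batch size of 64 are illustrative choices, not values from the original notes.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    # Shuffle the columns (examples), then cut them into consecutive mini-batches;
    # gradient descent then takes one update step per mini-batch.
    np.random.seed(seed)
    m = X.shape[1]
    perm = np.random.permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [(X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

X = np.random.randn(3, 1000)                     # 3 features, 1000 examples
Y = (np.random.rand(1, 1000) > 0.5).astype(int)
batches = random_mini_batches(X, Y, batch_size=64)
print(len(batches), batches[0][0].shape)         # 16 batches, first of shape (3, 64)
```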