1. Use a proper initialization method to avoid vanishing or exploding gradients.
For example, with layer dimensions [3, 5, 1], the second layer can be initialized as
W[2] = np.random.randn(1, 5) * np.sqrt(1/5)
(randn gives zero-mean Gaussian values, and sqrt(1/5) is Xavier scaling by the number of units in the previous layer).
This keeps the initial weights from being too big or too small, and it can shorten the training time significantly.
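As a rough sketch (assuming a fully connected network whose layer sizes are given in a list, e.g. layer_dims = [3, 5, 1]; the function name is made up here), Xavier-style initialization for every layer could look like this:

import numpy as np

def initialize_parameters(layer_dims, seed=1):
    # Xavier-style initialization: scale each weight matrix by sqrt(1 / n_prev),
    # where n_prev is the number of units feeding into the layer.
    np.random.seed(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        parameters["W" + str(l)] = np.random.randn(n_curr, n_prev) * np.sqrt(1.0 / n_prev)
        parameters["b" + str(l)] = np.zeros((n_curr, 1))
    return parameters

params = initialize_parameters([3, 5, 1])
print(params["W2"].shape)   # (1, 5)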
2. Use regularization to reduce the variance (overfitting).
L2 regularization: add lambda/(2*m) * sum(all squared weights) to the cost function, where m is the number of training examples. We also need to change the backward-propagation formula: each dW[l] gains an extra (lambda/m) * W[l] term.
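A minimal sketch of both changes (assuming parameters is a dictionary with keys W1, b1, W2, b2, ...; lambd is the regularization strength and m the number of examples; these names are assumptions):

import numpy as np

def l2_cost_term(parameters, lambd, m):
    # Extra term added to the cost: lambda/(2*m) * sum of all squared weights.
    squared = sum(np.sum(np.square(W)) for key, W in parameters.items() if key.startswith("W"))
    return lambd / (2 * m) * squared

def dW_with_l2(dW, W, lambd, m):
    # Backward propagation changes too: each gradient gains a (lambda/m) * W term.
    return dW + (lambd / m) * W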
Dropout (inverted dropout): generate a mask with the same shape as the layer's activations:
mask[l] = np.random.rand(A[l].shape[0], A[l].shape[1])
mask[l] = (mask[l] < keep_prob).astype(int)
During forward propagation, use A[l] * mask[l] / keep_prob instead of A[l]; during backward propagation, use dA[l] * mask[l] / keep_prob instead of dA[l].
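A minimal sketch of inverted dropout applied to one layer (A and dA stand for the layer's activations and their gradient; the function names are made up here):

import numpy as np

def dropout_forward(A, keep_prob, seed=1):
    # The mask has the same shape as the activations; dividing by keep_prob keeps
    # the expected value of the activations unchanged (the "inverted" part).
    np.random.seed(seed)
    mask = (np.random.rand(*A.shape) < keep_prob).astype(int)
    return A * mask / keep_prob, mask

def dropout_backward(dA, mask, keep_prob):
    # Apply the same mask and scaling to the gradient flowing back.
    return dA * mask / keep_prob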
3. Use a different optimization method to make the cost decrease faster.
Batch gradient descent: the traditional way; each update uses the whole training set.
Mini-batch gradient descent: choose a mini-batch size such as 32, 64, or 128, compute forward and backward propagation on each mini-batch, then iterate over all mini-batches (as sketched below).
Stochastic gradient descent: mini-batch size = 1.
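A minimal sketch of splitting the data into shuffled mini-batches (assuming X has shape (n_features, m) and Y has shape (1, m); setting batch_size = 1 recovers stochastic gradient descent):

import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    # Shuffle the example columns, then slice them into batches of batch_size.
    np.random.seed(seed)
    m = X.shape[1]
    perm = np.random.permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [(X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
            for k in range(0, m, batch_size)]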
Momentum: in the update W[l] = W[l] - learning_rate * dW[l], replace dW[l] with v[l] = beta * v[l] + (1 - beta) * dW[l].
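A minimal sketch of the momentum update for one layer (v starts as a zero array with the same shape as dW; the hyperparameter values are typical defaults, not prescriptions):

def momentum_update(W, dW, v, learning_rate=0.01, beta=0.9):
    # Exponentially weighted average of past gradients replaces the raw gradient.
    v = beta * v + (1 - beta) * dW
    W = W - learning_rate * v
    return W, v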
Adam: replace dW[l] with (corrected v[l]) / (sqrt(corrected s[l]) + epsilon),
where corrected v[l] = (beta1 * v[l] + (1 - beta1) * dW[l]) / (1 - beta1^t),
corrected s[l] = (beta2 * s[l] + (1 - beta2) * dW[l]^2) / (1 - beta2^t),
t is the number of update iterations performed so far,
and epsilon is a small value to avoid division by zero.
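A minimal sketch of the Adam update for one layer (v and s start as zero arrays shaped like dW; t starts at 1 and counts the updates performed so far; hyperparameter values are common defaults):

import numpy as np

def adam_update(W, dW, v, s, t, learning_rate=0.001,
                beta1=0.9, beta2=0.999, epsilon=1e-8):
    # First moment (momentum-like) and second moment (RMSprop-like) estimates.
    v = beta1 * v + (1 - beta1) * dW
    s = beta2 * s + (1 - beta2) * np.square(dW)
    # Bias correction so the estimates are not too small in the first iterations.
    v_corrected = v / (1 - beta1 ** t)
    s_corrected = s / (1 - beta2 ** t)
    W = W - learning_rate * v_corrected / (np.sqrt(s_corrected) + epsilon)
    return W, v, s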
All of these optimization methods aim to smooth or rebalance the gradient direction so that each update step is more useful.
Besides, we can also use batch normalization to reduce the convergence time. Batch normalization reduces internal covariate shift.
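A minimal sketch of the batch-norm forward step for one layer's pre-activations Z (gamma and beta are learnable scale and shift parameters; the shapes and names here are assumptions):

import numpy as np

def batch_norm_forward(Z, gamma, beta, epsilon=1e-8):
    # Normalize each unit's pre-activations over the mini-batch, then rescale and shift.
    mu = np.mean(Z, axis=1, keepdims=True)
    var = np.var(Z, axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + epsilon)
    return gamma * Z_norm + beta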