Trick_1

Batch_Normalization

with tf.variable_scope('fc_1'):
    out = tf.layers.dense(out, 4000)
    out = tf.layers.batch_normalization(out, momentum=bn_momentum, training=is_training)
    out = tf.nn.relu(out)

...

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss_op, global_step=tf.train.get_global_step())

- BN can be placed before or after the ReLU; the difference in accuracy is usually slight (see the benchmark below)

https://github.com/ducha-aiki/caffenet-benchmark/blob/master/batchnorm.md

- is_training = False

  1. BN uses its internally stored moving averages of mean and variance to normalize the batch, not the batch's own mean and variance (see the sketch after these notes).

  2. The BN internal variables (moving_mean and moving_variance) are also not updated.

- update_ops

https://www.tensorflow.org/api_docs/python/tf/layers/batch_normalization

When training, the moving_mean and moving_variance need to be updated. By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they need to be executed alongside the train_op. Also, be sure to add any batch_normalization ops before getting the update_ops collection. Otherwise, update_ops will be empty, and training/inference will not work properly.
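
A minimal sketch of the is_training switch, assuming a placeholder-based setup (the names is_training, bn_momentum, X, Y, train_op and acc_op refer to the snippets on this page):

import tensorflow as tf

# Boolean placeholder that toggles BN between batch statistics and moving averages
is_training = tf.placeholder(tf.bool, shape=[], name='is_training')
bn_momentum = 0.9  # illustrative value

# ... build the model, loss_op, train_op and acc_op as shown on this page ...

# with tf.Session() as sess:
#     # training step: batch statistics are used, moving averages get updated
#     sess.run(train_op, feed_dict={X: x_batch, Y: y_batch, is_training: True})
#     # evaluation step: the stored moving_mean / moving_variance are used instead
#     sess.run(acc_op, feed_dict={X: x_val, Y: y_val, is_training: False})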

No activation layer @ output

In general, no other activation function is applied after the output layer (besides the softmax itself); the last dense layer produces raw logits.
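
A minimal sketch of this, reusing the out tensor from the BN snippet above and assuming 10 classes (the layer size is illustrative; labels are integer class indices): the last dense layer emits raw logits, and the softmax only appears inside the loss or at inference time.

# Output layer: raw logits, no ReLU / sigmoid / softmax here
logits = tf.layers.dense(out, 10, activation=None)

# The softmax is applied implicitly inside the loss ...
loss_op = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

# ... and explicitly only when probabilities are needed at inference time
probs = tf.nn.softmax(logits)
predictions = tf.argmax(logits, 1)  # argmax of logits == argmax of probs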

Learning rate decay

global_step = tf.train.get_or_create_global_step()

# decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)
learning_rate = tf.train.exponential_decay(0.0001, global_step, decay_steps=50, decay_rate=0.1)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
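
With the values above, the rate decays smoothly: 1e-4 at step 0, 1e-5 at step 50, 1e-6 at step 100 (0.0001 * 0.1^(step/50)). Passing staircase=True instead keeps the rate piecewise constant, dropping by 10x every 50 steps, as sketched below:

# Alternative (illustrative): step-wise decay instead of smooth decay
learning_rate = tf.train.exponential_decay(
    0.0001, global_step, decay_steps=50, decay_rate=0.1, staircase=True)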

tf.argmax

import tensorflow as tf

A = [[0, 1, 2, 3, 4, 3, 2, 1, 0]]
B = [
    [1, 3, 4],
    [2, 4, 1],
    [9, 4, 1]
]

with tf.Session() as sess:
    print(sess.run(tf.argmax(A, 1)))
    print(sess.run(tf.argmax(B, 1)))

# [4]
# [2 1 0]

sparse_softmax_cross_entropy & softmax_cross_entropy

Given equivalent labels, they produce the same result; the difference is only in the label format each expects.

sparse_*

As such, with the sparse functions the shapes of logits and labels differ: labels contain one integer class index per example, whereas logits have shape [batch_size, num_classes], one unnormalized score per class.

>> For sparse_softmax_cross_entropy

  1. labels shape = [batch_size]

  2. each label is an int class index

  3. logits shape = [batch_size, num_classes] (raw scores, not one-hot)

  4. labels dtype: int32 or int64

= = =

>> For softmax_cross_entropy

  1. labels shape = [batch_size, num_classes]

  2. each label is a one-hot encoding

  3. logits shape = [batch_size, num_classes] (raw scores)

  4. labels dtype: float32 or float64
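
A minimal sketch of the equivalence (class count and values are illustrative): feeding integer labels to the sparse variant and their one-hot encoding to the dense variant gives the same per-example losses. softmax_cross_entropy_with_logits_v2 assumes TF >= 1.5; the non-v2 op behaves the same here.

import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1],
                      [0.5, 2.5, 0.3]])          # shape [batch_size, num_classes]
int_labels = tf.constant([0, 1])                 # shape [batch_size], int class indices
onehot_labels = tf.one_hot(int_labels, depth=3)  # shape [batch_size, num_classes]

sparse_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=int_labels, logits=logits)
dense_loss = tf.nn.softmax_cross_entropy_with_logits_v2(
    labels=onehot_labels, logits=logits)

with tf.Session() as sess:
    print(sess.run(sparse_loss))  # per-example losses
    print(sess.run(dense_loss))   # same values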

tf.losses already applies reduce_mean

logits = build_model(is_training, X)
predictions = tf.argmax(logits, 1)

labels = tf.argmax(Y, 1)

loss_op = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
acc_op = tf.reduce_mean(tf.cast(tf.equal(labels, predictions), tf.float32))

global_step = tf.train.get_or_create_global_step()

optimizer = tf.train.AdamOptimizer(learning_rate=0.00001)

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss_op, global_step=global_step)

https://stackoverflow.com/questions/47034888/how-to-choose-cross-entropy-loss-in-tensorflow
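
A small sketch of what "already applies reduce_mean" means, reusing logits and int_labels from the sketch above: the tf.nn.* ops return one loss per example, while the tf.losses.* wrappers return a scalar that is already averaged over the batch (with the default reduction and unit weights), so no extra tf.reduce_mean is needed.

per_example = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=int_labels, logits=logits)        # shape [batch_size]
manual_mean = tf.reduce_mean(per_example)    # scalar

already_mean = tf.losses.sparse_softmax_cross_entropy(
    labels=int_labels, logits=logits)        # scalar, equal to manual_mean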

Sigmoid vs softmax

In simple binary classification, there's no big difference between the two.

In the case of multi-class classification, sigmoid can handle non-exclusive labels (a.k.a. multi-label classification), while softmax handles mutually exclusive classes.
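
A minimal sketch of the contrast (shapes and values are illustrative): sigmoid cross-entropy treats every class as an independent yes/no decision, so one example can carry several positive labels, while softmax cross-entropy assumes exactly one correct class per example.

logits = tf.constant([[2.0, -1.0, 0.5]])  # one example, 3 classes

# Multi-label: classes are independent, labels need not sum to 1
multi_labels = tf.constant([[1.0, 0.0, 1.0]])
multi_loss = tf.nn.sigmoid_cross_entropy_with_logits(
    labels=multi_labels, logits=logits)   # one loss per class

# Mutually exclusive classes: exactly one correct class
exclusive_labels = tf.constant([[1.0, 0.0, 0.0]])
exclusive_loss = tf.nn.softmax_cross_entropy_with_logits_v2(
    labels=exclusive_labels, logits=logits)  # one loss per example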
