python - Very low GPU usage during training in TensorFlow

I am trying to train a simple multi-layer perceptron for a 10-class image classification task, which is part of an assignment for the Udacity Deep Learning course. To be more precise, the task is to classify letters rendered in various fonts (the dataset is called notMNIST).

The code I ended up with looks fairly simple, but no matter what I do, I always get very low GPU usage during training. I measure the load with GPU-Z, and it shows just 25-30%.

Here is my current code:

    import logging

    import tensorflow as tf

    graph = tf.Graph()
    with graph.as_default():
        tf.set_random_seed(52)

        # dataset definition
        dataset = Dataset.from_tensor_slices({'x': train_data, 'y': train_labels})
        dataset = dataset.shuffle(buffer_size=20000)
        dataset = dataset.batch(128)
        iterator = dataset.make_initializable_iterator()
        sample = iterator.get_next()
        x = sample['x']
        y = sample['y']

        # actual computation graph
        keep_prob = tf.placeholder(tf.float32)
        is_training = tf.placeholder(tf.bool, name='is_training')

        fc1 = dense_batch_relu_dropout(x, 1024, is_training, keep_prob, 'fc1')
        fc2 = dense_batch_relu_dropout(fc1, 300, is_training, keep_prob, 'fc2')
        fc3 = dense_batch_relu_dropout(fc2, 50, is_training, keep_prob, 'fc3')
        logits = dense(fc3, num_classes, 'logits')

        with tf.name_scope('accuracy'):
            accuracy = tf.reduce_mean(
                tf.cast(tf.equal(tf.argmax(y, 1), tf.argmax(logits, 1)), tf.float32),
            )
            accuracy_percent = 100 * accuracy

        with tf.name_scope('loss'):
            loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))

        update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
        with tf.control_dependencies(update_ops):
            # ensures that we execute the update_ops before performing the train_op
            # needed for batch normalization (apparently)
            train_op = tf.train.AdamOptimizer(learning_rate=1e-3, epsilon=1e-3).minimize(loss)

    with tf.Session(graph=graph) as sess:
        tf.global_variables_initializer().run()
        step = 0
        epoch = 0
        while True:
            # restart the dataset iterator at the beginning of each epoch
            sess.run(iterator.initializer, feed_dict={})
            while True:
                step += 1
                try:
                    sess.run(train_op, feed_dict={keep_prob: 0.5, is_training: True})
                except tf.errors.OutOfRangeError:
                    logging.info('end of epoch #%d', epoch)
                    break

            # end of epoch: evaluate on the full training and test sets in one run each
            train_l, train_ac = sess.run(
                [loss, accuracy_percent],
                feed_dict={x: train_data, y: train_labels, keep_prob: 1, is_training: False},
            )
            test_l, test_ac = sess.run(
                [loss, accuracy_percent],
                feed_dict={x: test_data, y: test_labels, keep_prob: 1, is_training: False},
            )
            logging.info('train loss: %f, train accuracy: %.2f%%', train_l, train_ac)
            logging.info('test loss: %f, test accuracy: %.2f%%', test_l, test_ac)

            epoch += 1

Here's what I've tried so far:

  1. I changed the input pipeline from a simple feed_dict to the Dataset API. As far as I understood, it is supposed to take care of the efficiency of the input, e.g. by loading data in a separate thread. So there should not be any bottleneck associated with the input.

  2. I collected traces as suggested here. However, these traces didn't show anything interesting: more than 90% of the train step is matmul operations.

  3. I changed the batch size. When I change it from 128 to 512, the load increases from ~30% to ~38%; when I increase it further to 2048, the load goes to ~45%. I have 6 GB of GPU memory, and the dataset consists of single-channel 28x28 images. Am I really supposed to use such a big batch size? Should I increase it further?
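GPU load alone does not say how fast training actually runs, so the batch-size experiment in item 3 can also be compared in examples per second. The following is a minimal sketch, not part of the original code: `measure_throughput` and `dummy_step` are hypothetical names, and `step_fn` is assumed to be a zero-argument callable that runs exactly one training step (for instance a wrapper around `sess.run(train_op, ...)`).

```python
import time


def measure_throughput(step_fn, batch_size, num_steps=50, warmup=5):
    """Return the number of examples processed per second by step_fn.

    step_fn is a hypothetical zero-argument callable that runs exactly one
    training step on one batch of batch_size examples.
    """
    # Warm-up steps are excluded so one-time setup cost does not skew the result.
    for _ in range(warmup):
        step_fn()

    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    elapsed = time.perf_counter() - start

    return batch_size * num_steps / elapsed


def dummy_step():
    """Stand-in workload so the sketch runs without TensorFlow."""
    sum(i * i for i in range(10000))


if __name__ == "__main__":
    # Replace dummy_step with a real training step to compare batch sizes.
    rate = measure_throughput(dummy_step, batch_size=128)
    print("examples per second: %.0f" % rate)
```

Running this once per batch size shows whether the higher load at 512 or 2048 actually translates into more examples processed per second.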

Generally, should I worry about the low load? Is it really a sign that I am training inefficiently?

Here are the GPU-Z screenshots with 128 images in the batch. You can see low load with occasional spikes to 100% when I measure accuracy on the entire dataset after each epoch.

[Screenshot: GPU load]

MNIST-size networks are tiny, and it's hard to achieve high GPU (or CPU) efficiency for them; I think 30% is not unusual for your application. You will get higher computational efficiency with a larger batch size, meaning you can process more examples per second, but you will also get lower statistical efficiency, meaning you need to process more examples in total to reach the target accuracy. So it's a trade-off. For tiny character models like yours, the statistical efficiency drops off very quickly after a batch size of about 100, so it's probably not worth trying to grow the batch size for training. For inference, you should use the largest batch size you can.
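To give a rough sense of how small this network is, here is a back-of-the-envelope estimate of the matmul work per training step. It is only a sketch: the layer widths come from the question (28x28 inputs, then 1024, 300, 50 and 10 units), the function name is hypothetical, and the factor of three for forward plus backward pass is a common rule of thumb rather than a measurement. Batch normalization, dropout and the optimizer update are ignored.

```python
# Layer widths from the question: 28x28 inputs, then 1024, 300, 50 and 10 units.
LAYER_SIZES = [28 * 28, 1024, 300, 50, 10]


def forward_matmul_flops(batch_size, layer_sizes=LAYER_SIZES):
    """Floating-point operations of the dense-layer matmuls in one forward pass."""
    weights = sum(
        n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )
    # Each weight contributes one multiply and one add per example.
    return 2 * batch_size * weights


if __name__ == "__main__":
    for batch_size in (128, 512, 2048):
        forward = forward_matmul_flops(batch_size)
        # Assumption: the backward pass costs roughly twice the forward pass.
        step = 3 * forward
        print("batch %4d: about %.2f GFLOP per training step" % (batch_size, step / 1e9))
```

At batch size 128 this comes to under one GFLOP per step. On a GPU with a few TFLOP/s of peak throughput (an assumed figure, not taken from the question), that is well under a millisecond of arithmetic, so fixed per-step overhead can easily dominate, which fits the observation that the load rises as the batch grows.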


