python - Very low GPU usage during training in Tensorflow


I'm trying to train a simple multi-layer perceptron for a 10-class image classification task, which is part of an assignment for the Udacity deep-learning course. To be more precise, the task is to classify letters rendered in various fonts (the dataset is called notMNIST).

The code I ended up with looks fairly simple, but no matter what I do I get very low GPU usage during training. I measure the load with GPU-Z and it shows 25-30%.

Here is my current code:

graph = tf.Graph()
with graph.as_default():
    tf.set_random_seed(52)

    # dataset definition
    dataset = Dataset.from_tensor_slices({'x': train_data, 'y': train_labels})
    dataset = dataset.shuffle(buffer_size=20000)
    dataset = dataset.batch(128)
    iterator = dataset.make_initializable_iterator()
    sample = iterator.get_next()
    x = sample['x']
    y = sample['y']

    # actual computation graph
    keep_prob = tf.placeholder(tf.float32)
    is_training = tf.placeholder(tf.bool, name='is_training')

    fc1 = dense_batch_relu_dropout(x, 1024, is_training, keep_prob, 'fc1')
    fc2 = dense_batch_relu_dropout(fc1, 300, is_training, keep_prob, 'fc2')
    fc3 = dense_batch_relu_dropout(fc2, 50, is_training, keep_prob, 'fc3')
    logits = dense(fc3, num_classes, 'logits')

    with tf.name_scope('accuracy'):
        accuracy = tf.reduce_mean(
            tf.cast(tf.equal(tf.argmax(y, 1), tf.argmax(logits, 1)), tf.float32),
        )
        accuracy_percent = 100 * accuracy

    with tf.name_scope('loss'):
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))

    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        # ensures that we execute the update_ops before performing the train_op
        # needed for batch normalization (apparently)
        train_op = tf.train.AdamOptimizer(learning_rate=1e-3, epsilon=1e-3).minimize(loss)

with tf.Session(graph=graph) as sess:
    tf.global_variables_initializer().run()
    step = 0
    epoch = 0
    while True:
        sess.run(iterator.initializer, feed_dict={})
        while True:
            step += 1
            try:
                sess.run(train_op, feed_dict={keep_prob: 0.5, is_training: True})
            except tf.errors.OutOfRangeError:
                logger.info('End of epoch #%d', epoch)
                break

        # end of epoch
        train_l, train_ac = sess.run(
            [loss, accuracy_percent],
            feed_dict={x: train_data, y: train_labels, keep_prob: 1, is_training: False},
        )
        test_l, test_ac = sess.run(
            [loss, accuracy_percent],
            feed_dict={x: test_data, y: test_labels, keep_prob: 1, is_training: False},
        )
        logger.info('Train loss: %f, train accuracy: %.2f%%', train_l, train_ac)
        logger.info('Test loss: %f, test accuracy: %.2f%%', test_l, test_ac)

        epoch += 1
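For reference, a quick sanity check that the heavy ops actually end up on the GPU (a minimal sketch, not part of my original code; it assumes the same TF 1.x Session API as above) is to enable device placement logging:

import tensorflow as tf

# Log which device every op is assigned to; the matmul/dense ops
# should appear on /device:GPU:0 rather than on the CPU.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(graph=graph, config=config) as sess:
    sess.run(tf.global_variables_initializer())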

Here's what I have tried so far:

  1. I changed the input pipeline from a simple feed_dict to tensorflow.contrib.data.Dataset. As far as I understood, it is supposed to take care of the efficiency of the input, e.g. load data in a separate thread, so there should not be a bottleneck associated with the input. (A prefetch variant of this pipeline is sketched right after this list.)

  2. I collected traces as suggested here: https://github.com/tensorflow/tensorflow/issues/1824#issuecomment-225754659 However, these traces didn't show anything interesting: >90% of the train step is matmul operations.

  3. I changed the batch size. When I change it from 128 to 512, the load increases from ~30% to ~38%, and when I increase it further to 2048, the load goes up to ~45%. I have 6GB of GPU memory and the dataset consists of single-channel 28x28 images. Am I supposed to use such a big batch size? Should I increase it further?
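Regarding point 1, here is a minimal sketch of how the same pipeline could additionally overlap input preparation with the training step (assuming a TF/contrib version where Dataset.prefetch() is available; my actual code above does not do this):

from tensorflow.contrib.data import Dataset  # TF 1.3-style import, as used above

dataset = Dataset.from_tensor_slices({'x': train_data, 'y': train_labels})
dataset = dataset.shuffle(buffer_size=20000)
dataset = dataset.batch(128)
# Keep one batch prepared ahead of time so the GPU does not wait for input.
dataset = dataset.prefetch(1)
iterator = dataset.make_initializable_iterator()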

Generally speaking, should I worry about the low load? Is it a sign that I am training inefficiently?

Here's a GPU-Z screenshot with 128 images in the batch. You can see the low load with occasional spikes to 100%, which occur when I measure accuracy on the entire dataset after each epoch.

[screenshot: GPU load]

MNIST-sized networks are tiny and it's hard to achieve high GPU (or CPU) efficiency for them, so I think 30% is not unusual for your application. You will get higher computational efficiency with a larger batch size, meaning you can process more examples per second, but you will also get lower statistical efficiency, meaning you need to process more examples in total to reach the target accuracy. So it's a trade-off. For tiny character models like yours, the statistical efficiency drops off quickly after a batch size of about 100, so it's probably not worth trying to grow the batch size for training. For inference, you should use the largest batch size you can.
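As a rough illustration of that trade-off (a minimal sketch, not part of the answer itself; it reuses the Dataset pipeline and array names from the question):

from tensorflow.contrib.data import Dataset

# Training: a modest batch size keeps statistical efficiency high,
# i.e. each example seen contributes more toward reaching the target accuracy.
train_dataset = Dataset.from_tensor_slices({'x': train_data, 'y': train_labels})
train_dataset = train_dataset.shuffle(buffer_size=20000).batch(128)

# Inference/evaluation: no gradient updates, so only computational
# efficiency matters - batch as large as memory allows.
eval_dataset = Dataset.from_tensor_slices({'x': test_data, 'y': test_labels})
eval_dataset = eval_dataset.batch(4096)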

