Background
While interning at my current company, Kuaishou, I was part of a project where I needed to set up an LSTM model, and I had to transfer my skill set from PyTorch to TensorFlow in a short period of time, which was a really painful experience. So I want to share this blog with you to ease your pain.
Some complaints
I have to admit TensorFlow is a very powerful computing framework: it has a much larger and stronger community behind it than PyTorch, and it gives you access to some fancy, high-level features like TensorFlow Serving, TensorBoard, distributed computation, etc. But it has a very steep learning curve and is extremely unfriendly to new users because its API changes so rapidly between versions. Maybe you first struggle through the tf.placeholder style of data preprocessing and modelling, and then suddenly find out that's not how people use it anymore. How would you feel then? Well, if this is the first blog you've read, you'll know what I mean in a few hours...
What exactly is in this blog
In this blog I present most of the experiments I did while writing the LSTM model code with TensorFlow 1.9 and Python 3.6. I split my toy experiments into 4 parts:
- Data input process
- LSTM cell test
- Shared embedding test
- Miscellaneous functions
Who should read this blog
You should already have some experience with TensorFlow and want more detailed, practical knowledge without writing these toy examples yourself. I'm confident these functions and tests will be helpful if you are dealing with a high volume of data and want to keep your code neat.
Data input process
TFRecord saving and reading
import tensorflow as tf
Findings from these experiments
- A sparse tensor can be used in an embedding layer directly: missing (default) entries are ignored and are not counted in the denominator when the combiner is mean.
- tf.sparse_tensor_to_dense needs a default_value that matches the tensor's own data type: int and float tensors can use 0, while a tf.string tensor needs a str, e.g. '0'.
- If you use tf.data.Dataset.batch and your data has varying lengths, you should use parse_example instead of parse_single_example.
- tf.pad can also pad the outer dimensions, which adds a layer of zeros around your data.
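To make the first point concrete, here is a minimal plain-Python sketch (no TensorFlow needed) of how a mean combiner treats a variable-length id list compared with a zero-padded dense one. The embedding table and the helper `mean_combine` are hypothetical and only mimic the arithmetic; they are not the TensorFlow implementation.

```python
# Hypothetical 2-dimensional embedding table: id -> vector.
EMBEDDINGS = {
    0: [0.0, 0.0],   # id 0 doubles as the padding id in the dense case
    1: [1.0, 2.0],
    2: [3.0, 4.0],
}


def mean_combine(ids):
    """Average the embedding vectors of the given (non-empty) id list."""
    dim = len(EMBEDDINGS[0])
    total = [0.0] * dim
    for i in ids:
        for d in range(dim):
            total[d] += EMBEDDINGS[i][d]
    return [t / len(ids) for t in total]


sparse_row = [1, 2]          # only the ids that are really present
padded_row = [1, 2, 0, 0]    # same example, zero-padded to length 4

sparse_mean = mean_combine(sparse_row)   # denominator is 2
padded_mean = mean_combine(padded_row)   # denominator is 4
print(sparse_mean, padded_mean)
```

The sparse version averages over two ids, while the padded version divides by four and shrinks the result, which is why feeding the sparse form to the embedding layer matters when the combiner is mean.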
import tensorflow as tf
op=tf.feature_column.input_layer(next_batch[0],feature_columns=name_tensor)
[array([[ 0.29952478, -0.02905731, -0.14833574, -0.25489837, 0.13668409], [ 0.36393905, -0.18847883, 0.01317748, 0.02921137, -0.3228819 ], [ 0.20875314, -0.471264 , -0.23475473, -0.10564104, -0.1293019 ]], dtype=float32)]
12
x: [[ 1 2 3 0] [ 4 5 6 8] [23 1 0 0]]
x_star: [[15. 25. 0. ] [23.1 0. 0. ] [ 1. 2. 3. ]]
y: [[1] [2] [3]]
testing pad for lstm
testing feature columns for lstm
testing string feature columns
[[b'a' b'b'] [b'a' b'0'] [b'b' b'0']]
[4 5 6 8]
[2 6]
<class 'numpy.ndarray'>
3
Tensorflow LSTM CELL
Single cell case
test the output of LSTM cell
import tensorflow as tf
(2, 10, 64)
(2, 64)
Tensor("rnn/transpose_1:0", shape=(2, 10, 64), dtype=float64)
LSTMStateTuple(c=<tf.Tensor 'rnn/while/Exit_3:0' shape=(2, 64) dtype=float64>, h=<tf.Tensor 'rnn/while/Exit_4:0' shape=(2, 64) dtype=float64>)
Multicell case
tf.reset_default_graph()
<class 'numpy.ndarray'>
(2, 9, 64)
Something related to StateTuple
fast predict
This was a tricky part of my project; I will briefly explain what it is and why we need it.
tensorflow fast predict
Use a generator to keep .predict open
Why do you need this? Detailed explanation here
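The pattern can be sketched without TensorFlow. In the sketch below, `SlowSetupModel` is a hypothetical stand-in for an Estimator whose predict call pays a one-time setup cost (in the real case, presumably rebuilding the graph and reloading the checkpoint), and `FastPredict` is an assumed wrapper name; neither is a TensorFlow API.

```python
class SlowSetupModel:
    """Hypothetical stand-in for an Estimator: costly setup, cheap inference."""

    def __init__(self):
        self.setup_count = 0

    def predict(self, input_generator):
        # Executed once per opened generator; stands in for the expensive part.
        self.setup_count += 1
        for features in input_generator:
            yield [2 * x for x in features]   # toy "model": double every value


class FastPredict:
    """Keeps a single predict() generator alive across requests."""

    def __init__(self, model):
        self.model = model
        self.next_features = None
        self.predictions = None

    def _feed(self):
        # Endless input stream: always hands over the most recent request.
        while True:
            yield self.next_features

    def predict(self, features):
        self.next_features = features
        if self.predictions is None:
            # First request: open the prediction generator exactly once.
            self.predictions = self.model.predict(self._feed())
        return next(self.predictions)


model = SlowSetupModel()
fast = FastPredict(model)
first = fast.predict([1, 2])
second = fast.predict([3])
print(first, second, model.setup_count)
```

Both requests are answered, yet the setup counter stays at 1, because the second request simply pulls the next item from the generator that is already open.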
import tensorflow as tf
tf.data.Dataset.from_generator
import numpy as np
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/util/tf_should_use.py:118: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use tf.global_variables_initializer instead.
{'a': array([[[ 0.6078665 , 0.7673362 ], [-1.095272 , -0.44154257], [ 0.24826635, 1.8101764 ]]], dtype=float32), 'b': array([[-1.3138337 , 0.05587422]], dtype=float32)}
import numpy as np
LSTMStateTuple(c=array([[[ 0.15593866, -1.1535455 ], [ 0.8337963 , 0.3000586 ], [ 1.3395942 , -0.65611506]]], dtype=float32), h=array([[ 0.5950521 , -0.82992613]], dtype=float32))
import os
None
import tensorflow as tf
None
import configparser as cp
test shared embedding
How to use embedding
tf.feature_column.input_layer(.., ..)
How tf.VarLenFeature behaves when read in batches
The data comes back as a tf.SparseTensor.
embedding: make_parse_example_spec
The parse spec is tf.VarLenFeature because a categorical feature can hold several values per example; the embeddings of those values are then combined by sum, mean, etc.
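As a rough illustration of what a batch of variable-length features looks like once parsed, the sketch below builds the three components a tf.SparseTensor carries (indices, values, dense_shape) from a ragged Python list, then densifies it with a default value. The helper names `to_sparse` and `to_dense` are made up for this example, and the code only mimics the layout in plain Python.

```python
def to_sparse(batch):
    """Turn a ragged batch into (indices, values, dense_shape)."""
    indices, values = [], []
    for row, items in enumerate(batch):
        for col, item in enumerate(items):
            indices.append([row, col])   # coordinate of each present value
            values.append(item)
    dense_shape = [len(batch), max(len(items) for items in batch)]
    return indices, values, dense_shape


def to_dense(indices, values, dense_shape, default_value):
    """Fill a dense grid with default_value, then write the present values."""
    rows, cols = dense_shape
    dense = [[default_value] * cols for _ in range(rows)]
    for (row, col), value in zip(indices, values):
        dense[row][col] = value
    return dense


batch = [['a', 'b'], ['a'], ['b']]       # three examples of different lengths
indices, values, dense_shape = to_sparse(batch)
dense = to_dense(indices, values, dense_shape, default_value='0')
print(indices, values, dense_shape)
print(dense)
```

Note that the default value for the string data is the str '0', in line with the dtype rule for densifying string tensors.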
fc = tf.feature_column.categorical_column_with_hash_bucket('my_fc',
{'my_fc': VarLenFeature(dtype=tf.string)}
fc = tf.feature_column.categorical_column_with_vocabulary_file('my_fc',vocabulary_file='abc',vocabulary_size=100,
{'my_fc': VarLenFeature(dtype=tf.string)}
Check tf.VarLenFeature
fc = tf.feature_column.categorical_column_with_vocabulary_file('my_fc',vocabulary_file='abc',vocabulary_size=100,
True
VarLenFeature(dtype=tf.string)
my_fc
<class 'tensorflow.python.feature_column.feature_column._SharedEmbeddingColumn'>
_SharedEmbeddingColumn(categorical_column=_VocabularyFileCategoricalColumn(key='my_fc', vocabulary_file='abc', vocabulary_size=100, num_oov_buckets=10, dtype=tf.string, default_value=-1), dimension=32, combiner='sum', initializer=<tensorflow.python.ops.init_ops.TruncatedNormal object at 0x11e6f6668>, shared_embedding_collection_name='my_em_fc', ckpt_to_load_from=None, tensor_name_in_ckpt=None, max_norm=None, trainable=True)
True
my_fc_shared_embedding
Check the naming of these methods
fc = tf.feature_column.categorical_column_with_vocabulary_list('my_fc',vocabulary_list=['a','b'],
my_fc_indicator my_fc
_SharedEmbeddingColumn(categorical_column=_VocabularyListCategoricalColumn(key='my_fc', vocabulary_list=('a', 'b'), dtype=tf.string, default_value=-1, num_oov_buckets=10), dimension=32, combiner='mean', initializer=<tensorflow.python.ops.init_ops.TruncatedNormal object at 0x11e6f3668>, shared_embedding_collection_name='my_em_fc', ckpt_to_load_from=None, tensor_name_in_ckpt=None, max_norm=None, trainable=True)
{'my_fc': VarLenFeature(dtype=tf.string)}
target keys my_fc day day_bucketized
{'my_fc': VarLenFeature(dtype=tf.string)}
{'day': FixedLenFeature(shape=(2,), dtype=tf.float32, default_value=None)}
{'day': FixedLenFeature(shape=(2,), dtype=tf.float32, default_value=None)}
True
embedding - input_layer
The embedding layer should be defined before you initialize variables (e.g. with tf.global_variables_initializer), because in most cases it is also a trainable layer whose variables need to be initialized.
target = {}
Miscellany
tf.gather_nd
a = np.array([[1,2,3],[4,5,6]])
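For readers who have not met tf.gather_nd before, the following plain-Python mimic shows what it picks out of the same 2x3 array. The function below is an illustrative re-implementation for nested lists with a flat list of coordinates, not the TensorFlow op, and the index lists are assumed examples.

```python
def gather_nd(params, indices):
    """Plain-Python mimic of tf.gather_nd for nested lists."""
    result = []
    for index in indices:
        item = params
        for i in index:      # walk one axis per coordinate
            item = item[i]
        result.append(item)
    return result


a = [[1, 2, 3], [4, 5, 6]]
scalars = gather_nd(a, [[0, 1], [1, 2]])   # full coordinates pick single elements
rows = gather_nd(a, [[1]])                 # a partial coordinate picks a whole row
print(scalars, rows)
```

Each inner index list is a coordinate: with as many entries as the array has axes it selects one element, and with fewer entries it selects the remaining sub-array.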
tf.less + tf.where
import tensorflow as tf
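A common way to combine the two ops is to build a boolean mask with tf.less and then use the three-argument form of tf.where to choose between two tensors element by element. The sketch below mimics that behaviour on flat Python lists; the threshold and the values are assumed for illustration, and `less` and `where` here are stand-ins rather than the TensorFlow functions.

```python
def less(xs, threshold):
    """Elementwise xs < threshold, like tf.less against a scalar."""
    return [x < threshold for x in xs]


def where(condition, xs, ys):
    """Elementwise select: take from xs where condition is True, else from ys."""
    return [x if c else y for c, x, y in zip(condition, xs, ys)]


values = [1, 5, 2, 8]
mask = less(values, 3)                               # which entries are below 3
kept = where(mask, values, [0] * len(values))        # zero out everything else
print(mask, kept)
```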
Tensor computation
a = tf.constant([[2,3,5],[4,5,7]])
Tensor reshape
Confirm the reshape logic: [batch_size, max_len, feature_dim]
a = tf.constant([[2,3,5],[4,5,7],[3,4,5],[1,5,9]])
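To see which rows end up in which batch entry, here is a plain-Python sketch of a row-major reshape applied to the same 4x3 data. The target shape [2, 2, 3] (batch_size=2, max_len=2, feature_dim=3) is an assumption chosen for the example, and `reshape_3d` is a hypothetical helper, not a TensorFlow function.

```python
def reshape_3d(rows, batch_size, max_len, feature_dim):
    """Row-major reshape of a 2-D nested list into [batch_size, max_len, feature_dim]."""
    flat = [v for row in rows for v in row]          # row-major flattening
    assert len(flat) == batch_size * max_len * feature_dim
    return [
        [
            flat[(b * max_len + t) * feature_dim:(b * max_len + t + 1) * feature_dim]
            for t in range(max_len)
        ]
        for b in range(batch_size)
    ]


a = [[2, 3, 5], [4, 5, 7], [3, 4, 5], [1, 5, 9]]
out = reshape_3d(a, batch_size=2, max_len=2, feature_dim=3)
print(out)
```

Consecutive rows stay together: the first two rows become the two time steps of the first batch entry, and the last two rows become the second.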
import .py from parent directory
import os, sys
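One common way to do this is to put the parent directory on sys.path before importing. The helper below is a minimal sketch under that assumption; `add_parent_dir_to_path` and the example path are hypothetical names, not taken from the project.

```python
import os
import sys


def add_parent_dir_to_path(script_path):
    """Put the parent of the script's directory on sys.path and return it."""
    script_dir = os.path.dirname(os.path.abspath(script_path))
    parent_dir = os.path.dirname(script_dir)
    if parent_dir not in sys.path:
        sys.path.insert(0, parent_dir)
    return parent_dir


# Inside a real script you would pass __file__; a made-up path is used here
# so the snippet also runs in an interactive session.
parent = add_parent_dir_to_path(os.path.join("project", "sub", "run.py"))
print(parent)
```

After this call, a module that lives in the parent directory can be imported with a normal import statement.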