Photo by Humble Lamb on Unsplash

RNN: Character-level text generation with TensorFlow 2.0, from scratch to deep insights.

Bhautik Donga
11 min read · Apr 11, 2020


Hello friends, I have gone through many blogs on RNNs, but I felt most of them had holes in their code-level explanations, which pushed me to write this post. I hope this blog can fill those holes and give you deeper knowledge.

Here I assume that you have some basic theoretical knowledge of RNNs; with that, you can easily follow the code. While working on this project, I had a lot of confusion about layer structure, and especially about input and output shapes, which I tried to resolve, and here I am trying to clear up every bit of that confusion. For this project, we are going to use the text file ‘The Jungle Book’. If you want to follow along, the complete notebook with the code is given here.

Now let’s start our journey with RNNs by importing some basic libraries.

import tensorflow as tf
import numpy as np
import os
import time

Now let’s open our text file and decode it with ‘utf-8’. After that, we remove some extra stuff given at the start and end of the text file.

text = open('The Jungle Book.txt', 'rb').read().decode(encoding='utf-8')
start_index = text.find("START OF THIS PROJECT GUTENBERG")
end_index = text.find("End of the Project Gutenberg")
text = text[start_index : end_index]

Let’s check the total number of characters in this text file and how many unique characters there are.

vocab = sorted(set(text))
print('Length of text: {} characters.'.format(len(text)))
print('Unique characters: {}'.format(len(vocab)))

From the output of the above code, we can see that we have a sequence of 279,191 characters in total and 77 unique characters. Since we cannot feed raw characters into a TensorFlow model, let us create mappings from each character to an integer and from each integer back to a character. After that, we convert the whole text file to an integer sequence by applying this mapping.

char2idx = {char:i for i, char in enumerate(vocab)}
idx2char = np.array(vocab)
text_encoded = [char2idx[c] for c in text]
text_encoded = np.array(text_encoded)

Now we have an integer representation for each character. Notice that we mapped the characters to indices from 0 to len(vocab) - 1.

# Show how the first 31 characters from the text are mapped to integers
print('Text: {} \n==> Encoded as : {}'.format(text[:31], text_encoded[:31]))

Create training examples and targets

The next part is to divide the text into example sequences. Each input sequence will contain seq_length characters from the text. For each input sequence, the corresponding target contains the same length of text, except shifted one character to the right. So we break the text into chunks of seq_length+1. For example, if seq_length is 4 and our text is ‘Hello’, the input sequence would be ‘Hell’ and the target sequence ‘ello’.

To do this, first use the tf.data.Dataset.from_tensor_slices function to convert the text vector into a stream of character indices.

# The maximum length sentence we want for a single input in characters
seq_length = 100
example_per_epoch = len(text)//seq_length  # each example consists of seq_length characters
# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_encoded)
for i in char_dataset.take(5):
    print(idx2char[i.numpy()])

The batch method lets us easily convert these individual characters to sequences of the desired size.

drop_remainder: representing whether the last batch should be dropped in the case it has fewer than batch_size elements; the default behavior is not to drop the smaller batch.

Here we have one very long sequence of integers representing characters, so it is helpful to divide it into many shorter sequences. We apply the .batch() function on char_dataset to create these sequences. Every integer acts as one element, so one batch creates one sequence, which means we have to set the batch size equal to (seq_length+1).

sequences = char_dataset.batch(batch_size=seq_length+1, drop_remainder=True)

# The repr function prints the string representation of an object.
for item in sequences.take(5):
    print(repr(''.join(idx2char[item.numpy()])))

Now for each sequence, duplicate it and shift it to form the input and target text by using the map method to apply a simple function to each batch:

def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

for input_ex, target_ex in dataset.take(1):
    print('Input data: ', repr(''.join(idx2char[input_ex.numpy()])))
    print('Output data:', repr(''.join(idx2char[target_ex.numpy()])))

Each index of these vectors (the input vector and the target vector of integers) is processed as one time step, which means the number of timesteps in one example equals the sequence length. For the input at time step 0, the model receives the index for ‘F’ and tries to predict the index for ‘i’ as the next character. At the next timestep, it does the same thing, but the RNN considers the previous step’s context in addition to the current input character.

for i, (input_idx, target_idx) in enumerate(zip(input_ex[:5], target_ex[:5])):
print("Step {:4d}".format(i))
print("Input : {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
print("Expected output : {} ({:s})".format(target_idx, repr(idx2char[target_idx])))

Create training batches

With the first .batch() call we created sequences of characters and built a dataset from them. As we know, one sequence acts as one input example, so for batch training we create a batch of more than one example. We again use the .batch() function, this time with batch_size equal to the number of sequences we want in a batch. Note that now every sequence acts as one element, so one batch contains batch_size sequences. So far we used tf.data to split the text into manageable sequences; before feeding this data into the model, we still need to shuffle the data and pack it into batches.

BATCH_SIZE = 64
BUFFER_SIZE = 10000
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
print(dataset)

From the output of the above code, we can see that we have 64 examples (sequences) in a batch and every example has 100 timesteps, or we can say 100 characters, or in layman’s terms 100 input features.

Build The Model

# Length of vocabulary in chars
vocab_size = len(vocab)
# The embedding dimension
embedding_dim = 256
# Number of RNN units
rnn_units = 1024
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=vocab_size,
                                  output_dim=embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.GRU(units=rnn_units,
                            return_sequences=True,
                            stateful=True,
                            recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

Let’s first try to understand every layer and each of its parameters.

Here we have a fixed batch_size, so we pass it through the batch_input_shape parameter; the stateful GRU below needs this fixed batch dimension. If the batch size could vary (for example, if we had set drop_remainder=False so the last batch could be smaller), we would not fix the batch dimension this way and could describe the input with the input_shape / batch_size parameters instead.

Embedding layer:
The Embedding layer turns positive integers (indexes) into dense vectors of fixed size. For example, it takes one word (or character) and represents it as a vector of length output_dim. If we one-hot encoded our text and fed that to the model, this layer would not be needed, but for millions of words it is not practical to feed arrays of that size as input. So we simply apply an embedding to each word or character (here acting as an integer) and get a vector representation of it, which has a lower dimension and eventually takes less memory.
Here input_dim corresponds to the vocabulary size (the number of distinct characters or words) of your dataset, so that this layer knows how many distinct tokens it has to map into the vector space.
output_dim is an arbitrary hyperparameter that indicates the dimension of your embedding space. In other words, if you set output_dim=x, each word in the sentence will be characterized by x features.
input_length should be set to SEQ_LENGTH (an integer indicating the length of each sentence), assuming that all the sentences have the same length. Here we do not define this parameter, as it is not mandatory.
This layer can only be used as the first layer in a model. For visualization, this image can be helpful:

image from the TensorFlow RNN blog.
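Before moving on to the GRU layer, here is a minimal sketch of my own (not part of the original notebook) that shows the Embedding layer in isolation: each integer index becomes a dense vector of length embedding_dim.

# Hypothetical stand-alone demo of the Embedding layer (my addition, not from the notebook).
embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim)
sample = text_encoded[:5].reshape(1, 5)   # one "sentence" of 5 character indices, shape (1, 5)
print(embedding(sample).shape)            # expected: (1, 5, 256)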

GRU Layer:
Here units defines the dimensionality of the GRU output. As seen in the image above, for each word or character being processed there is effectively one GRU cell with shared weights. For one character or word, our Embedding layer gives a vector of size embedding_dim as output; the GRU takes this vector as input and gives a vector of size rnn_units as output, which carries richer, contextual information.
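As a quick check (again a sketch of my own, not from the original notebook), we can run a stand-alone GRU with return_sequences=True on a dummy batch of embedded characters and confirm that it emits one rnn_units-sized vector per timestep.

# Hypothetical stand-alone demo of the GRU layer (my addition, not from the notebook).
gru = tf.keras.layers.GRU(units=rnn_units, return_sequences=True)
dummy_embedded = tf.random.normal([1, 5, embedding_dim])   # pretend output of the Embedding layer
print(gru(dummy_embedded).shape)                           # expected: (1, 5, 1024)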

Dense Layer:
As seen in the image, a Dense layer is applied at every single character position: the output of the GRU layer at each timestep is connected to the Dense layer, which projects it onto a chosen number of neurons. Now consider this as a classification problem: we want to predict the next character, so we need one output neuron per character in the vocabulary, so that the model can produce a score for every possible next character, and the character with the maximum probability is taken as the predicted next character. So we give the Dense layer a number of units equal to vocab_size.
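And, as a final sketch of my own (not from the original notebook), the Dense layer is applied independently at every timestep, projecting each rnn_units-sized GRU output down to vocab_size logits.

# Hypothetical stand-alone demo of the Dense layer (my addition, not from the notebook).
dense = tf.keras.layers.Dense(vocab_size)
dummy_gru_out = tf.random.normal([1, 5, rnn_units])   # pretend output of the GRU layer
print(dense(dummy_gru_out).shape)                     # expected: (1, 5, 77)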

model = build_model(vocab_size=len(vocab),
                    embedding_dim=embedding_dim,
                    rnn_units=rnn_units,
                    batch_size=BATCH_SIZE)
model.summary()

Output is:

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (64, None, 256) 19712
_________________________________________________________________
gru (GRU) (64, None, 1024) 3938304
_________________________________________________________________
dense (Dense) (64, None, 77) 78925
=================================================================
Total params: 4,036,941
Trainable params: 4,036,941
Non-trainable params: 0
_________________________________________________________________

Run the model

# First check the shape of the output:
for input_ex_batch, target_ex_batch in dataset.take(1):
    ex_batch_prediction = model(input_ex_batch)  # it simply takes the input and computes the output with the initial (untrained) weights
    print(ex_batch_prediction.shape, "# (batch_size, sequence_length, vocab_size)")

In the above example, the sequence length of the input is 100 but the model can be run on inputs of any length.
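As a quick sanity check of my own (not in the original notebook), feeding only the first 50 characters of each sequence from the same batch gives a correspondingly shorter output:

# Hypothetical check (my addition): shorter sequences, same fixed batch size of 64.
shorter_input = input_ex_batch[:, :50]    # shape (64, 50)
print(model(shorter_input).shape)         # expected: (64, 50, 77)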

To get actual predictions from the model, we need to sample from the output distribution to get concrete character indices. This distribution is defined by the logits over the character vocabulary.
Note: It is important to sample from this distribution because if we take the argmax of the distribution, it can easily get the model stuck in a loop.

Let’s first try it for the first example of the batch:

sampled_indices = tf.random.categorical(ex_batch_prediction[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()
print(sampled_indices)

Decode these to see the text predicted by this untrained model. Don’t be confused about the shape of input_ex_batch: since we took a single batch from the dataset, input_ex_batch has the shape (number of sequences, sequence length), and below we are taking the first sequence.

print('Input: \n', repr("".join(idx2char[input_ex_batch[0].numpy()])))
print('\nPredicted sequence for next characters is: \n', repr("".join(idx2char[sampled_indices])))

Train the model

At this point, the problem can be treated as a standard classification problem. Given the previous RNN state, and the input this time step, predict the class of the next character.

Attach an optimizer, and a loss function

The standard tf.keras.losses.sparse_categorical_crossentropy loss function works in this case because it is applied across the last dimension of the predictions. Because our model returns logits, we need to set the from_logits flag.

def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

ex_batch_loss = loss(target_ex_batch, ex_batch_prediction)
print("Prediction shape: ", ex_batch_prediction.shape, " # (batch_size, sequence_length, vocab_size)")
print("Scalar loss: ", ex_batch_loss.numpy().mean())

Let’s compile the model:

model.compile(optimizer='adam', loss=loss)

Configure checkpoints:

Use a tf.keras.callbacks.ModelCheckpoint to ensure that checkpoints are saved during training:

# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix,
                                                         save_weights_only=True)

EPOCHS = 10
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Generate text:

Restore the latest checkpoint

To keep this prediction step simple, use a batch size of 1. Because of the way the RNN state is passed from timestep to timestep, the model only accepts a fixed batch size once built. To run the model with a different batch_size, we need to rebuild the model and restore the weights from the checkpoint.

tf.train.latest_checkpoint(checkpoint_dir)

Now let’s build the model:

model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
# Builds the model based on input shapes received.
model.build(tf.TensorShape([1, None]))

model.build(): This method only exists for users who want to call it in a standalone way (as a substitute for calling the model on real data to build it). It will never be called by the framework (and thus it will never throw unexpected errors in an unrelated workflow).

model.summary()

Output:

Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (1, None, 256) 19712
_________________________________________________________________
gru_1 (GRU) (1, None, 1024) 3938304
_________________________________________________________________
dense_1 (Dense) (1, None, 77) 78925
=================================================================
Total params: 4,036,941
Trainable params: 4,036,941
Non-trainable params: 0
_________________________________________________________________

The prediction loop:

The following code block generates the text:

  • It starts by choosing a start string, initializing the RNN state, and setting the number of characters to generate.
  • Get the prediction distribution of the next character using the start string and the RNN state.
  • Then, use a categorical distribution to calculate the index of the predicted character. Use this predicted character as our next input to the model.
  • The RNN state returned by the model is fed back into the model, so that it now has more context instead of only one character. After predicting the next character, the modified RNN state is again fed back into the model; this is how the model builds up context from the previously predicted characters.
def generate_text(model, start_string, num_generate, temperature):

    # Converting our start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    # Add a batch dimension: shape (len(start_string),) -> (1, len(start_string))
    input_eval = tf.expand_dims(input_eval, axis=0)

    # Empty list to store our results
    text_generated = []

    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)

        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # using a categorical distribution to predict the
        # character returned by the model
        predictions = predictions / temperature

        # We get predictions for every timestep, but we only want the
        # last one, so [-1] keeps the sample for the last timestep and
        # [0] pulls the integer id out of the length-1 axis.
        predicted_id = tf.random.categorical(predictions,
                                             num_samples=1)[-1, 0].numpy()

        # We pass the predicted character as the next input to the
        # model along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx2char[predicted_id])

    return (start_string + ''.join(text_generated))

The value of temperature determines how random the generated text is (lower is less random, meaning more predictable).
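To see what this means concretely, here is a small illustration of my own (not part of the original notebook): dividing a fixed set of logits by different temperatures before the softmax sharpens or flattens the resulting distribution.

# Hypothetical temperature demo (my addition, not from the notebook).
logits = tf.constant([[2.0, 1.0, 0.0]])
for t in [0.5, 1.0, 2.0]:
    print(t, tf.nn.softmax(logits / t).numpy())   # lower t -> sharper, higher t -> flatter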

print(generate_text(model,
                    start_string=u"The moon was sinking behind the",
                    num_generate=500, temperature=1.0))

Output:

The moon was sinking behind the hills and alt off
beains. Meep and putt not labbed it his byell.” Han would
had to ne there-People,” Mahind Baloo’s
(Gose. And as yourher agains into for to rot
E will thistily heard A bround intimettle back, and never teased
ot Tadible hunting ropen beany and the that
farcims of yellon in the Jungle were night take the MeaCBut Katarall, shouth, He care un ho scoully of the
othing to have pit--and
then belon flefth of any, leary hedry, for tiges abong they bawgun and memy
young is to cabel hors

Let’s check the output for temperature=0.5:

print(generate_text(model,
                    start_string=u"The moon was sinking behind the",
                    num_generate=500, temperature=0.5))

Output:

The moon was sinking behind the pires of the back to the
not to the elephants all the time to the lind of himself that the had streng than
the liges and streng the thickers of the jungle and hild never and sung and all the seals
said he will to the watch. “Onder then the warked
and still, and the time to the siges of the big not cound himself the tund of a could beached and the foress and but the Bandar-log thing in the beaches or the near to the Council Rock.

“I’m lies man and there is ne that he street to your nearus

I know these are not grammatical sentences, but still, the model produces pretty good text. If you like, you can tune some of the parameters and try to make this model more robust. I hope you enjoyed this post. If you did, please give it some claps. Thank you.
