Generating Arbitrary Text with Markov Chains in Rust

A friend and I wanted to work on a quick project together and we came up with the idea of making a chatbot. What would make this chatbot special is that it would use the input of another one of our friends so it would post messages that are generated from past messages he's sent.

How it works

We'll need to give our program some input, for example the following text:

I really like reeses but I do not really like almond joys.

Given the above input, it would be broken down into the following:

Prefix            Suffix
------           ------
I really         like
really like      reeses
like reeses      but
reeses but       I
but I            do
I do             not
do not           really
not really       like
really like      almond
like almond      joys

We're breaking the sentence down into bigrams and storing what word it's followed by.

If you're unfamiliar with bigrams like I was, they're really just two part n-grams. If you're also unfamiliar with n-grams like I was, they're all the possible combinations of adjacent words in a given sentence. For example, the first n-gram from above would be I really and the next n-gram would use the second word as its first word, eg: really like and so on.

Getting started

Now that we have a basic idea how we want to model our text generator, lets get started by generating our project. I'm using the Rust nightly build, but it should also work on the stable release.

We can start our project by running cargo new markov --bin. We need the --bin flag because cargo generates projects as a library by default.

We'll add a new file in src/markov.rs with the following:

use std::collections::HashMap;

pub struct Markov {
    map: HashMap<String, Vec<String>>
}

impl Markov {
    pub fn new() -> Markov {
        Markov {
            map: HashMap::new()
        }
    }
}

So far it's pretty simple, we import HashMap to make it available in this module, define a new struct called Markov and give it a single property called map. The hash map is typed to having keys of type String and values that are of type Vec<String>. We use a vector because each prefix can have multiple suffixes.

Next we'll write the code to parse a sentence and insert it into our hash map. We're going to put this code in the impl block as we did with new.

pub fn parse(&mut self, sentence: &str) {
    let words = sentence.split(" ").collect::<Vec<&str>>();
    let word_count = words.len();

    for n in 0..word_count {
        if n + 2 < word_count {
            let key = format!("{} {}", words[n], words[n + 1]);
            let value = words[n + 2];

            self.insert(key, value.to_string())
        }
    }
}

First we define our parse function to take a mutable reference to self so we can insert the pairs later, and a reference to the sentence we're parsing.

Next we're splitting on space to get each word individually and then looping over each word. If there's at least two more words in our sentence then we create our key using the current word and the word following it. The value is the word after the last word in our key.

We've called self.insert but haven't defined it yet. Still inside the impl block, lets define it so we can insert our pairs.

fn insert(&mut self, key: String, value: String) {
    if self.map.contains_key(&key) {
        let current_value = self.map.get_mut(&key).unwrap();
        current_value.push(value);
    } else {
        self.map.insert(key, vec!(value));
    }
}

Here We check if our hash map already has the given key. If it does then we grab a mutable reference to the vector and push the value onto it. If the value isn't already in our hash map then we insert the pair wrapping value in a vector.

Generating sentences

Now that we can parse sentences and store n-grams we can start generating sentences!

We're going to need to be able to select random keys and values from our n-grams. For that we'll need to use the rand crate. We can add it to Cargo.toml.

[dependencies]
rand = "0.3"

We need to declare that we want to use the rand crate and use it. We can change the top of src/markov.rs to look like the following code:

extern crate rand;

use std::collections::HashMap;
use self::rand::{thread_rng, Rng};

Finally we can start writing the bread and butter of our program, generating semi-random sentences. This is the "chain" part in "markov chain" because we're chaining each state to a new state.

pub fn generate_sentence(self) -> String {
    let mut rng = thread_rng();
    let keys = self.map.keys().collect::<Vec<&String>>();

    let mut key = rng.choose(&keys).expect("could not get random value").to_string();
    let mut sentence = key.clone();

    loop {
        match self.map.get(&key) {
            Some(values) => {
                let value = rng.choose(values).expect("could not get value");
                sentence = format!("{} {}", sentence, value);

                key = next_key(&key, value);
            }
            None => break
        }
    }

    sentence
}

First thing we do is we grab a random key from our map and use it as the beginning of our sentence. Both key and sentence are declared as mut because we'll be re-assigning them as we go.

Next, inside the loop we try to match on getting the value for the current key. If there's no pair for the given key then we break out of the loop and return our sentence. If there is a pair then we take the value returned from map, get a random value from the suffixes, append the value to our sentence, and finally use the helper function next_key to fetch the next key.

We also need to define the helper function next_key at the bottom of src/markov.rs outside of the impl block.

fn next_key(key: &str, value: &str) -> String {
    let last_word = key.split(" ").last().expect("could not get last word");
    format!("{} {}", last_word, value)
}

The results

Now that our generator is complete, lets see it in action. Open up src/main.rs and replace the contents with the following:

mod markov;
use markov::Markov;

fn main() {
    let mut map = Markov::new();
    map.parse("I really like reeses but I do not really like almond joys.");
    map.parse("I would really like to travel the world.");
    println!("{:?}", map.generate_sentence());
}

All that's left for us to do is check out the output. We can run our program with cargo run and it should output a random sentence! It's probably not that exciting right now but the more data you feed it the better (or worse!) the results become.

Conclusion

Rust is a fun language to write code in, and writing a markov chain in it was fun. Thanks to the great ecosystem of packages integrating it into a Slack bot or Discord bot was extremely simple. In fact, the majority of the code in this blog post comes from Slack Markov, a bot that responds to @ mentions with markov chain generated sentences based on a specific Slack user's history.