Generating Arbitrary Text with Markov Chains in Rust
A friend and I wanted to work on a quick project together and we came up with the idea of making a chatbot. What would make this chatbot special is that it would use the input of another one of our friends so it would post messages that are generated from past messages he’s sent.
How it works
We’ll need to give our program some input, for example the following text:
I really like reeses but I do not really like almond joys.
Given the above input, it would be broken down into the following:
Prefix Suffix
------ ------
I really like
really like reeses
like reeses but
reeses but I
but I do
I do not
do not really
not really like
really like almond
like almond joys
We’re breaking the sentence down into bigrams and storing what word it’s followed by.
If you’re unfamiliar with bigrams like I was, they’re really just two part
n-grams. If you’re also unfamiliar with n-grams like I was, they’re all the
possible combinations of adjacent words in a given sentence. For example, the
first n-gram from above would be I really
and the next n-gram would use the
second word as its first word, eg: really like
and so on.
Getting started
Now that we have a basic idea how we want to model our text generator, lets get started by generating our project. I’m using the Rust nightly build, but it should also work on the stable release.
We can start our project by running cargo new markov --bin
. We need the
--bin
flag because cargo generates projects as a library by default.
We’ll add a new file in src/markov.rs
with the following:
use std::collections::HashMap;
pub struct Markov {
map: HashMap<String, Vec<String>>
}
impl Markov {
pub fn new() -> Markov {
Markov {
map: HashMap::new()
}
}
}
So far it’s pretty simple, we import HashMap
to make it available in this
module, define a new struct called Markov
and give it a single property called
map
. The hash map is typed to having keys of type String
and values that are
of type Vec<String>
. We use a vector because each prefix can have multiple
suffixes.
Next we’ll write the code to parse a sentence and insert it into our hash map.
We’re going to put this code in the impl
block as we did with new
.
pub fn parse(&mut self, sentence: &str) {
let words = sentence.split(" ").collect::<Vec<&str>>();
let word_count = words.len();
for n in 0..word_count {
if n + 2 < word_count {
let key = format!("{} {}", words[n], words[n + 1]);
let value = words[n + 2];
self.insert(key, value.to_string())
}
}
}
First we define our parse
function to take a mutable reference to self
so we
can insert the pairs later, and a reference to the sentence we’re parsing.
Next we’re splitting on space to get each word individually and then looping over each word. If there’s at least two more words in our sentence then we create our key using the current word and the word following it. The value is the word after the last word in our key.
We’ve called self.insert
but haven’t defined it yet. Still inside the impl
block, lets define it so we can insert our pairs.
fn insert(&mut self, key: String, value: String) {
if self.map.contains_key(&key) {
let current_value = self.map.get_mut(&key).unwrap();
current_value.push(value);
} else {
self.map.insert(key, vec!(value));
}
}
Here We check if our hash map already has the given key
. If it does then we
grab a mutable reference to the vector and push the value onto it. If the value
isn’t already in our hash map then we insert the pair wrapping value
in a
vector.
Generating sentences
Now that we can parse sentences and store n-grams we can start generating sentences!
We’re going to need to be able to select random keys and values from our
n-grams. For that we’ll need to use the rand
crate. We can add it to
Cargo.toml
.
[dependencies]
rand = "0.3"
We need to declare that we want to use the rand
crate and use
it. We can
change the top of src/markov.rs
to look like the following code:
extern crate rand;
use std::collections::HashMap;
use self::rand::{thread_rng, Rng};
Finally we can start writing the bread and butter of our program, generating semi-random sentences. This is the “chain” part in “markov chain” because we’re chaining each state to a new state.
pub fn generate_sentence(self) -> String {
let mut rng = thread_rng();
let keys = self.map.keys().collect::<Vec<&String>>();
let mut key = rng.choose(&keys).expect("could not get random value").to_string();
let mut sentence = key.clone();
loop {
match self.map.get(&key) {
Some(values) => {
let value = rng.choose(values).expect("could not get value");
sentence = format!("{} {}", sentence, value);
key = next_key(&key, value);
}
None => break
}
}
sentence
}
First thing we do is we grab a random key from our map and use it as the
beginning of our sentence. Both key
and sentence
are declared as mut
because we’ll be re-assigning them as we go.
Next, inside the loop we try to match on getting the value for the current
key
. If there’s no pair for the given key then we break out of the loop and
return our sentence. If there is a pair then we take the value returned from
map, get a random value from the suffixes, append the value to our sentence
,
and finally use the helper function next_key
to fetch the next key.
We also need to define the helper function next_key
at the bottom of
src/markov.rs
outside of the impl block.
fn next_key(key: &str, value: &str) -> String {
let last_word = key.split(" ").last().expect("could not get last word");
format!("{} {}", last_word, value)
}
The results
Now that our generator is complete, lets see it in action. Open up src/main.rs
and replace the contents with the following:
mod markov;
use markov::Markov;
fn main() {
let mut map = Markov::new();
map.parse("I really like reeses but I do not really like almond joys.");
map.parse("I would really like to travel the world.");
println!("{:?}", map.generate_sentence());
}
All that’s left for us to do is check out the output. We can run our program
with cargo run
and it should output a random sentence! It’s probably not that
exciting right now but the more data you feed it the better (or worse!) the
results become.
Conclusion
Rust is a fun language to write code in, and writing a markov chain in it was
fun. Thanks to the great ecosystem of packages integrating it into a Slack bot
or Discord bot was extremely simple. In fact, the majority of the code in this
blog post comes from Slack Markov, a bot that responds to @
mentions with
markov chain generated sentences based on a specific Slack user’s history.