Markov chains in Ruby - Ruby Quiz #74
April 09, 2006
Here’s my solution for the quiz No 74. It generates text with a first order Markov chain.
It generates text with a first order Markov Chain. Because the matrix of a Markov chain for native language tests is sparse, the chain is stored in a hash of hashes for better memory usage.
There are some open issues i think, but i did not had the time to implement them
- for usage of higher order markov chains one should add a list of previous elements in the method MarkovChain#add_elems(). But this breaks other parts of the code and these parts have to be adapted to this.
- better performance of the MarkovChain#rand() method
- and of course there are a lot of open issues with respect to natural language processing, the code does not produce grammatical valid sentences, no quotes, no comma, etc.
I programmed it to learn Ruby and to improve my coding skills, because i only have 2,3 weeks of experience with Ruby. I bought the “Programming Ruby” book by Dave Thomas et. al. and started to read it two weeks and two days ago and i used it a lot the last 48 hours. I think there are some parts of the code where there are better solutions or could be better expressed in a more “rubyish” way.
I used Java and Perl a lot in the last years and i will use Ruby now instead of Perl. I can remember my experiences with hashes of hashes in Perl. This was awkward and often a pain. And on friday as i started coding for this quiz i suddenly realized that i had none of this issues in Ruby. Wow ! Because i know the functional language Haskell i am used to iterators, map’s, lambda’s etc. and i tried to use iterators and block in the program, but i think i have to get more experience here in Ruby.
MarkovChain
# A Markov Chain contains a graph which edges are labeled with probabilities.
# The graph is stored as two hashes, one for the absolute probabilites, one for
# the relative probabilities. The nodes of the graph correspond to the keys of the hashes.
#
# The rand() method is worst case O(n), for small lists this is ok, but for
# large lists binary search will be better O(lg n)
class MarkovChain
# Initializes the hashes.
#
def initialize()
@absolute = {}
@relative = {}
@sum_edges = {}
@dirty = {}
end
# The nodes of the graph correspond to the keys of the hashes.
#
def nodes
@absolute.keys
end
# Add an edge from node a to node b.
#
def add(a, b)
@absolute[a] ||= {}
@absolute[a][b] ||= 0
@absolute[a][b] += 1
@sum_edges[a] ||= 0
@sum_edges[a] += 1
@dirty[a] = true
end
# Adds a list of elements pairwise.
# TODO this is for order = 1. For higher orders a list of previous elements
# could be added into the hash, but this will break other parts of the code
def add_elems(elems)
a = nil
elems.each() do |b|
add(a, b) if ( a )
a = b
end
end
# Calculates all the relative cumulative probabilities.
#
def recalc_all()
@relative = @absolute.dup()
@relative.each_pair do | a, hash |
recalc(a) if @dirty[a]
end
end
# Calculates the relative cumulative probabilities.
#
def recalc(a)
cum = 0.0
@relative[a] = @absolute[a].dup()
sum = @sum_edges[a] * 1.0
@relative[a].each_pair do | b, value|
cum = cum + ( value / sum )
@relative[a][b] = cum
end
@dirty.delete(a)
@relative[a]
end
# Generates a random successor of node a according to the probabilities learned
# from the example tests.
#
def rand(a)
recalc(a) if @dirty[a]
if a.nil? || @relative[a].nil? || @relative[a].length == 0
return nil
elsif @relative[a].length == 1
return @relative[a].keys[0]
else
# this is a linear search now with worst case O(n), TODO log(n) binary search
value = Kernel.rand()
candidates = @relative[a].select { |b, prob| prob > value }
return candidates[0][0]
end
end
def to_s
result = []
result << "MarkovChain"
@absolute.each_pair do | key, hash |
result << "#{key}: ABS: #{@absolute[key].to_s()} - REL: #{@relative[key].to_s()}"
end
result.join("\n")
end
end
StoryGenerator
require 'markov-chain'
# A generator for stories. Fill the Markov chain with the methods add()
# or add_file() and generate a story with the method story().
#
class StoryGenerator
attr_reader :mc
# Initializes.
#
def initialize(mc = nil)
if mc.nil?
@mc = MarkovChain.new()
else
@mc = mc
end
end
# Adds the words to the MarkovChain
#
def add(words)
@mc.add_elems(words)
end
# Adds a file to the MarkovChain.
#
def add_file(file)
puts "Reading file #{file}" if ( $VERBOSE )
words = Words.parse_file(file)
if ( $VERBOSE )
puts "Read in #{words.length} words."
puts "Inserting file #{file}"
STDOUT.flush()
end
@mc.add_elems(words)
end
# Genereates a story with n words (".", "," etc. counting as a word)
#
def story(n)
elems = generate(n, ".")
format(elems)
end
# Generates a story from the Markov Chain mc of length n and which starts with a
# successor of word.
#
def generate(n, word)
lexicon = @mc.nodes()
word = lexicon[rand(lexicon.length)] if word.nil?
elems = []
1.upto(n) do |i|
word = @mc.rand(word)
# if no word is word, take a random one from the lexicon
if word.nil?
elems += ["."]
word = lexicon[rand(lexicon.length)]
end
elems << word
if ( i % LOG_WORDS_NUM == 0 && $VERBOSE )
puts "Generated #{i} words."
STDOUT.flush()
end
end
elems
end
# Formats the elements.
#
def format(elems)
text = elems.join(" ")
text.gsub!(/\ [.,!?;]\ /, '\1 ')
text
end
end
Words
# A parser for texts in native languages, for example english or german.
#
# Example:
# "To be or not to be."
#
# will be parsed into the array
#
# ["To", "be", "or", "not", "to", "be", "."]
#
class Words
attr_reader :words
# Returns the words of a file.
#
def Words.parse_file(filename)
i = 1
all_words = []
File.open(filename) do | file |
file.each_line() do |line|
ws = Words.parse(line)
next if ws.nil?
all_words += ws
i += 1
if ( i % LOG_FILE_READ_NUM == 0 && $VERBOSE )
puts " Read #{i} lines."
STDOUT.flush()
end
end
end
all_words
end
# Returns the words of a line.
#
def Words.parse(line)
# replace newline
line.gsub!(/\r\n/, '')
return nil if line.length == 0
# seperate all special characters by blanks
line.gsub!(/([,;!?:.])/, ' \1 ')
# remove unused special characters
line.gsub!(/["'\-+=\t\r\n()]/, " ")
# split the line into words
words = line.split(/[ ]+/).select { |w| w.length > 0 }
words
end
end
Main
#!/usr/local/bin/ruby -w
require 'getoptlong'
require 'story-generator'
require 'markov-chain'
require 'words'
# Prints out the synopsis.
#
def synopsis()
puts "ruby main-markov-chains.rb [--help] [--number_of_words N|-n N] [--width W|-w W] [--verbose|-v] textfiles"
end
# these ugly global variables are used for printing out the progress
LOG_FILE_READ_NUM = 1000
LOG_WORDS_NUM = 100
# command line option
$VERBOSE = nil
$NUMBER_OF_WORDS = 100
$WIDTH = 78
opts = GetoptLong.new(
["--help", "-h", GetoptLong::NO_ARGUMENT ],
["--number_of_words", "--num", "-n", GetoptLong::REQUIRED_ARGUMENT ],
["--width", "-w", GetoptLong::REQUIRED_ARGUMENT ],
["--verbose", "-v", GetoptLong::NO_ARGUMENT ]
)
opts.each do |opt, arg|
case opt
when "--help"
synopsis()
exit(0)
when "--number_of_words"
$NUMBER_OF_WORDS = Integer(arg)
when "--verbose"
$VERBOSE = true
when "--width"
$WIDTH = Integer(arg)
end
end
files = ARGV
if files.length() == 0
STDERR.puts("Error: Missing argument")
synopsis()
exit(1)
end
# main
# our story generator
sg = StoryGenerator.new()
# feed all the files into it
files.each do |file|
sg.add_file(file)
end
# calculate the probabilities
if ( $VERBOSE )
puts "Calculating probabilities ..."
STDOUT.flush()
end
sg.mc.recalc_all()
# generate the story
if ( $VERBOSE )
puts "Generating..."
STDOUT.flush()
end
story = sg.story($NUMBER_OF_WORDS)
# and print it out formatted
index = 0
last = 0
space = 0
while index < story.length()
# remember the last non word element
space = index if ( story[index, 1] =~ /[ ,:;!?.]/ );
if (index - last == $WIDTH )
raise "There is a word larger then the width" if ( last == space+1 )
puts story[last..space]
index = space
last = space + 1
end
index += 1
end
puts story[last..-1]
See also the code repository on github.