Here’s my solution for the quiz No 74 . It generates text with a first
order Markov chain.
It generates text with a first order Markov Chain. Because the matrix of a
Markov chain for native language tests is sparse, the chain is stored in a
hash of hashes for better memory usage.
There are some open issues i think, but i did not had the time to implement
them
for usage of higher order markov chains one should add a list
of previous elements in the method MarkovChain#add_elems(). But this
breaks other parts of the code and these parts have to be adapted to
this.
better performance of the MarkovChain#rand() method
and of course there are a lot of open issues with respect to natural
language processing, the code does not produce grammatical valid
sentences, no quotes, no comma, etc.
I programmed it to learn Ruby and to improve my coding skills, because i only
have 2,3 weeks of experience with Ruby. I bought the “Programming Ruby” book
by Dave Thomas et. al. and started to read it two weeks and two days ago and
i used it a lot the last 48 hours. I think there are some parts of the code
where there are better solutions or could be better expressed in a more
“rubyish” way.
I used Java and Perl a lot in the last years and i will use Ruby now instead
of Perl. I can remember my experiences with hashes of hashes in Perl. This was
awkward and often a pain. And on friday as i started coding for this quiz i
suddenly realized that i had none of this issues in Ruby. Wow ! Because i
know the functional language Haskell i am used to iterators, map’s, lambda’s
etc. and i tried to use iterators and block in the program, but i think i
have to get more experience here in Ruby.
MarkovChain
# A Markov Chain contains a graph which edges are labeled with probabilities.
# The graph is stored as two hashes, one for the absolute probabilites, one for
# the relative probabilities. The nodes of the graph correspond to the keys of the hashes.
#
# The rand() method is worst case O(n), for small lists this is ok, but for
# large lists binary search will be better O(lg n)
class MarkovChain
# Initializes the hashes.
#
def initialize ()
@absolute = {}
@relative = {}
@sum_edges = {}
@dirty = {}
end
# The nodes of the graph correspond to the keys of the hashes.
#
def nodes
@absolute . keys
end
# Add an edge from node a to node b.
#
def add ( a , b )
@absolute [ a ] ||= {}
@absolute [ a ][ b ] ||= 0
@absolute [ a ][ b ] += 1
@sum_edges [ a ] ||= 0
@sum_edges [ a ] += 1
@dirty [ a ] = true
end
# Adds a list of elements pairwise.
# TODO this is for order = 1. For higher orders a list of previous elements
# could be added into the hash, but this will break other parts of the code
def add_elems ( elems )
a = nil
elems . each () do | b |
add ( a , b ) if ( a )
a = b
end
end
# Calculates all the relative cumulative probabilities.
#
def recalc_all ()
@relative = @absolute . dup ()
@relative . each_pair do | a , hash |
recalc ( a ) if @dirty [ a ]
end
end
# Calculates the relative cumulative probabilities.
#
def recalc ( a )
cum = 0.0
@relative [ a ] = @absolute [ a ]. dup ()
sum = @sum_edges [ a ] * 1.0
@relative [ a ]. each_pair do | b , value |
cum = cum + ( value / sum )
@relative [ a ][ b ] = cum
end
@dirty . delete ( a )
@relative [ a ]
end
# Generates a random successor of node a according to the probabilities learned
# from the example tests.
#
def rand ( a )
recalc ( a ) if @dirty [ a ]
if a . nil? || @relative [ a ]. nil? || @relative [ a ]. length == 0
return nil
elsif @relative [ a ]. length == 1
return @relative [ a ]. keys [ 0 ]
else
# this is a linear search now with worst case O(n), TODO log(n) binary search
value = Kernel . rand ()
candidates = @relative [ a ]. select { | b , prob | prob > value }
return candidates [ 0 ][ 0 ]
end
end
def to_s
result = []
result << "MarkovChain"
@absolute . each_pair do | key , hash |
result << " #{ key } : ABS: #{ @absolute [ key ]. to_s () } - REL: #{ @relative [ key ]. to_s () } "
end
result . join ( " \n " )
end
end
StoryGenerator
require 'markov-chain'
# A generator for stories. Fill the Markov chain with the methods add()
# or add_file() and generate a story with the method story().
#
class StoryGenerator
attr_reader :mc
# Initializes.
#
def initialize ( mc = nil )
if mc . nil?
@mc = MarkovChain . new ()
else
@mc = mc
end
end
# Adds the words to the MarkovChain
#
def add ( words )
@mc . add_elems ( words )
end
# Adds a file to the MarkovChain.
#
def add_file ( file )
puts "Reading file #{ file } " if ( $VERBOSE )
words = Words . parse_file ( file )
if ( $VERBOSE )
puts "Read in #{ words . length } words."
puts "Inserting file #{ file } "
STDOUT . flush ()
end
@mc . add_elems ( words )
end
# Genereates a story with n words (".", "," etc. counting as a word)
#
def story ( n )
elems = generate ( n , "." )
format ( elems )
end
# Generates a story from the Markov Chain mc of length n and which starts with a
# successor of word.
#
def generate ( n , word )
lexicon = @mc . nodes ()
word = lexicon [ rand ( lexicon . length )] if word . nil?
elems = []
1 . upto ( n ) do | i |
word = @mc . rand ( word )
# if no word is word, take a random one from the lexicon
if word . nil?
elems += [ "." ]
word = lexicon [ rand ( lexicon . length )]
end
elems << word
if ( i % LOG_WORDS_NUM == 0 && $VERBOSE )
puts "Generated #{ i } words."
STDOUT . flush ()
end
end
elems
end
# Formats the elements.
#
def format ( elems )
text = elems . join ( " " )
text . gsub! ( /\ [.,!?;]\ / , '\1 ' )
text
end
end
Words
# A parser for texts in native languages, for example english or german.
#
# Example:
# "To be or not to be."
#
# will be parsed into the array
#
# ["To", "be", "or", "not", "to", "be", "."]
#
class Words
attr_reader :words
# Returns the words of a file.
#
def Words . parse_file ( filename )
i = 1
all_words = []
File . open ( filename ) do | file |
file . each_line () do | line |
ws = Words . parse ( line )
next if ws . nil?
all_words += ws
i += 1
if ( i % LOG_FILE_READ_NUM == 0 && $VERBOSE )
puts " Read #{ i } lines."
STDOUT . flush ()
end
end
end
all_words
end
# Returns the words of a line.
#
def Words . parse ( line )
# replace newline
line . gsub! ( /\r\n/ , '' )
return nil if line . length == 0
# seperate all special characters by blanks
line . gsub! ( /([,;!?:.])/ , ' \1 ' )
# remove unused special characters
line . gsub! ( /["'\-+=\t\r\n()]/ , " " )
# split the line into words
words = line . split ( /[ ]+/ ). select { | w | w . length > 0 }
words
end
end
Main
#!/usr/local/bin/ruby -w
require 'getoptlong'
require 'story-generator'
require 'markov-chain'
require 'words'
# Prints out the synopsis.
#
def synopsis ()
puts "ruby main-markov-chains.rb [--help] [--number_of_words N|-n N] [--width W|-w W] [--verbose|-v] textfiles"
end
# these ugly global variables are used for printing out the progress
LOG_FILE_READ_NUM = 1000
LOG_WORDS_NUM = 100
# command line option
$VERBOSE = nil
$NUMBER_OF_WORDS = 100
$WIDTH = 78
opts = GetoptLong . new (
[ "--help" , "-h" , GetoptLong :: NO_ARGUMENT ],
[ "--number_of_words" , "--num" , "-n" , GetoptLong :: REQUIRED_ARGUMENT ],
[ "--width" , "-w" , GetoptLong :: REQUIRED_ARGUMENT ],
[ "--verbose" , "-v" , GetoptLong :: NO_ARGUMENT ]
)
opts . each do | opt , arg |
case opt
when "--help"
synopsis ()
exit ( 0 )
when "--number_of_words"
$NUMBER_OF_WORDS = Integer ( arg )
when "--verbose"
$VERBOSE = true
when "--width"
$WIDTH = Integer ( arg )
end
end
files = ARGV
if files . length () == 0
STDERR . puts ( "Error: Missing argument" )
synopsis ()
exit ( 1 )
end
# main
# our story generator
sg = StoryGenerator . new ()
# feed all the files into it
files . each do | file |
sg . add_file ( file )
end
# calculate the probabilities
if ( $VERBOSE )
puts "Calculating probabilities ..."
STDOUT . flush ()
end
sg . mc . recalc_all ()
# generate the story
if ( $VERBOSE )
puts "Generating..."
STDOUT . flush ()
end
story = sg . story ( $NUMBER_OF_WORDS )
# and print it out formatted
index = 0
last = 0
space = 0
while index < story . length ()
# remember the last non word element
space = index if ( story [ index , 1 ] =~ /[ ,:;!?.]/ );
if ( index - last == $WIDTH )
raise "There is a word larger then the width" if ( last == space + 1 )
puts story [ last .. space ]
index = space
last = space + 1
end
index += 1
end
puts story [ last ..- 1 ]
See also the code repository on github .