Markov chains in Ruby - Ruby Quiz #74

Here’s my solution for quiz No. 74. It generates text with a first-order Markov chain. Because the transition matrix of a Markov chain over natural-language text is sparse, the chain is stored in a hash of hashes for better memory usage.

There are some open issues, I think, but I did not have the time to implement them:

  • To use higher-order Markov chains, one would have to add a list of previous elements in the method MarkovChain#add_elems(). But this breaks other parts of the code, which would have to be adapted as well.
  • Better performance of the MarkovChain#rand() method.
  • And of course there are a lot of open issues with respect to natural language processing: the code does not produce grammatically valid sentences, no quotes, no commas, etc.
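To sketch the first open issue: a higher-order chain would key the transition counts on the last k elements instead of a single element. The following is only an illustration, not part of my solution (the method name `count_transitions` and the sliding `window` are made up here); MarkovChain#add and #rand would need the same array-valued keys.

```ruby
# Illustrative only: count transitions keyed on the last `order` elements.
# Arrays work as hash keys in Ruby, so the sparse hash-of-hashes layout
# carries over to higher orders unchanged.
def count_transitions(elems, order = 2)
  counts = Hash.new { |h, k| h[k] = Hash.new(0) }
  window = []
  elems.each do |b|
    # only count once the window holds a full k-element history
    counts[window.dup][b] += 1 if window.length == order
    window << b
    window.shift if window.length > order
  end
  counts
end
```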

I wrote it to learn Ruby and to improve my coding skills, because I only have two or three weeks of experience with Ruby. I bought the “Programming Ruby” book by Dave Thomas et al., started to read it two weeks and two days ago, and used it a lot in the last 48 hours. I think there are some parts of the code where there are better solutions, or which could be expressed in a more “rubyish” way.

I used Java and Perl a lot in the last years, and I will now use Ruby instead of Perl. I can remember my experiences with hashes of hashes in Perl: awkward and often a pain. On Friday, as I started coding for this quiz, I suddenly realized that I had none of these issues in Ruby. Wow! Because I know the functional language Haskell, I am used to iterators, maps, lambdas, etc., and I tried to use iterators and blocks in the program, but I think I still have to gain more experience here with Ruby.

MarkovChain

#
# A Markov Chain contains a graph whose edges are labeled with probabilities.
# The graph is stored as two hashes: one for the absolute probabilities, one
# for the relative probabilities. The nodes of the graph correspond to the
# keys of the hashes.
#
# The rand() method is worst case O(n); for small lists this is OK, but for
# large lists a binary search would be better, O(lg n).
#
class MarkovChain

  # Initializes the hashes.
  #

  def initialize()
    @absolute = {}
    @relative = {}
    @sum_edges = {}
    @dirty = {}
  end
  
  # The nodes of the graph correspond to the keys of the hashes.
  #

  def nodes
    @absolute.keys
  end
  
  # Add an edge from node a to node b.
  #

  def add(a, b)
    @absolute[a] ||= {} 
    @absolute[a][b] ||= 0
    @absolute[a][b] += 1
    @sum_edges[a] ||= 0
    @sum_edges[a] += 1
    @dirty[a] = true
  end

  # Adds a list of elements pairwise.
  # TODO this is for order = 1. For higher orders a list of previous elements
  # could be added into the hash, but this will break other parts of the code.

  def add_elems(elems)
    a = nil
    elems.each() do |b|
      add(a, b) if ( a )
      a = b
    end
  end

  # Calculates all the relative cumulative probabilities.
  #

  def recalc_all()
    # recalculate only the nodes that changed since the last call; clean
    # nodes keep their cumulative hashes from earlier calls (the previous
    # version re-duplicated @absolute, which reset clean nodes to raw counts)
    @absolute.each_key do |a|
      recalc(a) if @dirty[a]
    end
  end
  
  # Calculates the relative cumulative probabilities.
  #

  def recalc(a)
    cum = 0.0
    @relative[a] = @absolute[a].dup()
    sum = @sum_edges[a] * 1.0
    @relative[a].each_pair do | b, value|
      cum = cum + ( value / sum )
      @relative[a][b] = cum
    end
    @dirty.delete(a)
    @relative[a]
  end
  
  # Generates a random successor of node a according to the probabilities
  # learned from the example texts.
  #

  def rand(a)
    recalc(a) if @dirty[a]
    if a.nil? || @relative[a].nil? || @relative[a].length == 0
      return nil
    elsif @relative[a].length == 1
      return @relative[a].keys[0]
    else
      # this is a linear search with worst case O(n); TODO: O(log n) binary search
      value = Kernel.rand()
      # Enumerable#find returns the first [key, value] pair matching the
      # block; Hash#select returns a Hash since Ruby 1.9, so indexing its
      # result with [0][0] would break there
      candidate = @relative[a].find { |b, prob| prob > value }
      return candidate[0]
    end
  end

  def to_s
    result = []
    result << "MarkovChain"
    @absolute.each_pair do | key, hash |
      result << "#{key}: ABS: #{@absolute[key].to_s()} - REL: #{@relative[key].to_s()}"
    end
    result.join("\n")
  end

end
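For the TODO in rand(): if the cumulative probabilities are kept in a sorted array of [probability, word] pairs instead of a hash, the linear scan becomes a binary search. A sketch of that idea, not part of the class above (`pick` and its arguments are my names for illustration):

```ruby
# Binary search for the first pair whose cumulative probability exceeds
# the random value; O(log n) instead of the O(n) scan in rand().
# `cumulative` is a non-empty array of [probability, word] pairs sorted
# by probability, with the last probability equal to 1.0.
def pick(cumulative, value)
  lo, hi = 0, cumulative.length - 1
  while lo < hi
    mid = (lo + hi) / 2
    if cumulative[mid][0] > value
      hi = mid       # mid still qualifies, keep it in range
    else
      lo = mid + 1   # everything up to mid is too small
    end
  end
  cumulative[lo][1]
end
```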

StoryGenerator

require 'markov-chain'

# A generator for stories. Fill the Markov chain with the methods add()
# or add_file() and generate a story with the method story().
#

class StoryGenerator

  attr_reader :mc
  
  # Initializes.
  #

  def initialize(mc = nil)
    @mc = mc || MarkovChain.new()
  end

  # Adds the words to the MarkovChain.
  #

  def add(words) 
    @mc.add_elems(words)
  end

  # Adds a file to the MarkovChain.
  #

  def add_file(file) 
    puts "Reading file #{file}" if ( $VERBOSE )
    words = Words.parse_file(file)
    if ( $VERBOSE )
      puts "Read in #{words.length} words."
      puts "Inserting file #{file}"
      STDOUT.flush()
    end
    @mc.add_elems(words)
  end
  
  # Generates a story with n words (".", "," etc. counting as a word).
  #

  def story(n)
    elems = generate(n, ".")
    format(elems)
  end
  
  # Generates a story of length n from the Markov chain mc, starting with
  # a successor of word.
  #

  def generate(n, word)
    lexicon = @mc.nodes()
    word = lexicon[rand(lexicon.length)] if word.nil?
    elems = []
    
    1.upto(n) do |i|
      word = @mc.rand(word)
      # if no successor is found, take a random word from the lexicon
      if word.nil?
        elems += ["."]
        word = lexicon[rand(lexicon.length)]
      end
      elems << word
      if ( i % LOG_WORDS_NUM == 0 && $VERBOSE )
        puts "Generated #{i} words." 
        STDOUT.flush()
      end
    end
    elems
  end
  
  # Formats the elements.
  #

  def format(elems)
    text = elems.join(" ")
    # remove the blank before punctuation; the capture group keeps the
    # punctuation character itself (the previous regexp had no group, so
    # \1 was empty and the punctuation was deleted)
    text.gsub!(/ ([.,!?;]) /, '\1 ')
    text
  end

end

Words

# A parser for texts in natural languages, for example English or German.
#
# Example:
#
#   "To be    or not to    be."
#
# will be parsed into the array
#
#   ["To", "be", "or", "not", "to", "be", "."]
#

class Words

  attr_reader :words
  
  # Returns the words of a file.
  #

  def Words.parse_file(filename)
    i = 1
    all_words = []
    File.open(filename) do | file |
      file.each_line() do |line|
        ws = Words.parse(line)
        next if ws.nil?
        all_words += ws
        i += 1 
        if ( i % LOG_FILE_READ_NUM == 0 && $VERBOSE ) 
          puts "  Read #{i} lines." 
          STDOUT.flush()
        end
      end
    end
    all_words
  end

  # Returns the words of a line.
  #

  def Words.parse(line)
    # replace newline
    line.gsub!(/\r\n/, '')
    return nil if line.length == 0
    # separate all special characters by blanks
    line.gsub!(/([,;!?:.])/, ' \1 ')
    # remove unused special characters
    line.gsub!(/["'\-+=\t\r\n()]/, " ")
    # split the line into words
    words = line.split(/[ ]+/).select { |w| w.length > 0 }
    words
  end

end
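The tokenization in Words.parse boils down to two substitutions and a split; as a standalone sketch of the same idea (`tokenize` is my name here, not part of the class, and without the destructive in-place gsub!):

```ruby
# Pad punctuation with blanks, drop quote/dash characters, split on
# whitespace, and discard empty strings.
def tokenize(line)
  line = line.gsub(/([,;!?:.])/, ' \1 ')
  line = line.gsub(/["'\-+=\t\r\n()]/, ' ')
  line.split(/ +/).reject { |w| w.empty? }
end
```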

Main

#!/usr/local/bin/ruby -w


require 'getoptlong'
require 'story-generator'
require 'markov-chain'
require 'words'

# Prints out the synopsis.
#

def synopsis()
  puts "ruby main-markov-chains.rb [--help] [--number_of_words N|-n N] [--width W|-w W] [--verbose|-v] textfiles"
end

# these ugly global variables are used for printing out the progress
LOG_FILE_READ_NUM = 1000
LOG_WORDS_NUM = 100

# command line options ($VERBOSE doubles as Ruby's built-in warning flag)
$VERBOSE = nil
$NUMBER_OF_WORDS = 100
$WIDTH = 78

opts = GetoptLong.new(
  ["--help", "-h", GetoptLong::NO_ARGUMENT ],
  ["--number_of_words", "--num", "-n", GetoptLong::REQUIRED_ARGUMENT ],
  ["--width", "-w", GetoptLong::REQUIRED_ARGUMENT ],
  ["--verbose", "-v", GetoptLong::NO_ARGUMENT ]
)

opts.each do |opt, arg|
  case opt
  when "--help"
    synopsis()
    exit(0)
  when "--number_of_words"
    $NUMBER_OF_WORDS = Integer(arg)
  when "--verbose"
    $VERBOSE = true
  when "--width"
    $WIDTH = Integer(arg)
  end 
end

files = ARGV

if files.length() == 0
  STDERR.puts("Error: Missing argument")
  synopsis()
  exit(1)
end

# main

# our story generator
sg = StoryGenerator.new()
# feed all the files into it
files.each do |file|
  sg.add_file(file)
end  

# calculate the probabilities
if ( $VERBOSE )
  puts "Calculating probabilities ..."
  STDOUT.flush()
end
sg.mc.recalc_all()

# generate the story
if ( $VERBOSE )
  puts "Generating..."
  STDOUT.flush()
end
story = sg.story($NUMBER_OF_WORDS)

# and print it out formatted
index = 0
last = 0
space = 0
while index < story.length()
  # remember the position of the last non-word character
  space = index if ( story[index, 1] =~ /[ ,:;!?.]/ )
  if (index - last == $WIDTH )
    raise "There is a word longer than the width" if ( last == space + 1 )
    puts story[last..space]
    index = space
    last = space + 1
  end
  index += 1
end
puts story[last..-1]
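The index-juggling loop above can also be expressed with a single regexp substitution. A hypothetical alternative sketch (`wrap` is not part of the original script):

```ruby
# Wrap text at `width` columns: grab up to `width` characters that end at
# whitespace (or at the end of the string) and append a newline to each
# chunk. Caveat: like the loop above, it assumes no single word exceeds
# the width.
def wrap(text, width = 78)
  text.gsub(/(.{1,#{width}})(\s+|\Z)/, "\\1\n")
end
```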

See also the code repository on GitHub.


Remark: This post was adapted to the new blog format in November 2016.
