Word Automation

Word is a well-known part of the Microsoft Office package. It is widely used for writing documents and, like most of Microsoft's applications, it supports COM automation. Instead of learning Word automation by writing a bunch of small scripts that, for example, create a new document, open an existing one or alter a document's content, we'll learn it by developing a small utility that converts Textile to Word documents.

Textile is a lightweight markup language mostly used to write web pages, blogs or wikis. Thanks to its simple syntax, Textile documents are easy to write. However, they are rarely used in raw form. Instead, they are processed and converted to HTML or some other format.

Processing the Textile format requires an adequate parser. In Ruby, the easiest way to parse and process Textile documents is through the RedCloth library. At the time of writing this chapter the RedCloth version is 4.2.9 and it must be installed from sources, because the prebuilt gem has binaries only for Ruby 1.8 and 1.9. Moreover, the gem's build procedure has an error, so we have to fix the installation manually after installing RedCloth.

C:\>gem install RedCloth --platform=ruby
Fetching: RedCloth-4.2.9-x86-mingw32.gem (100%)
Successfully installed RedCloth-4.2.9-x86-mingw32
Parsing documentation for RedCloth-4.2.9-x86-mingw32
Installing ri documentation for RedCloth-4.2.9-x86-mingw32
Done installing documentation for RedCloth after 4 seconds
1 gem installed

Converting Textile to HTML with RedCloth is very simple. We create a RedCloth object, passing the Textile string as an argument to the constructor, and call its to_html method. However, if we try the conversion right after installation, it fails.

C:\>ruby -rredcloth -e "puts RedCloth.new('Some *bold* text').to_html"
c:/Ruby/22/lib/ruby/2.2.0/rubygems/core_ext/kernel_require.rb:54:in `require': cannot load such file -- 2.2/redcloth_scan (LoadError)
Couldn't load 2.2/redcloth_scan
The $LOAD_PATH was:
c:/Ruby/22/lib/ruby/gems/2.2.0/extensions/x86-mingw32/2.2.0/RedCloth-4.2.9
c:/Ruby/22/lib/ruby/gems/2.2.0/gems/RedCloth-4.2.9/lib
c:/Ruby/22/lib/ruby/gems/2.2.0/gems/RedCloth-4.2.9/lib/case_sensitive_require
c:/Ruby/22/lib/ruby/gems/2.2.0/gems/RedCloth-4.2.9/ext
c:/Ruby/22/lib/ruby/site_ruby/2.2.0
c:/Ruby/22/lib/ruby/site_ruby/2.2.0/i386-msvcrt
c:/Ruby/22/lib/ruby/site_ruby
c:/Ruby/22/lib/ruby/vendor_ruby/2.2.0
c:/Ruby/22/lib/ruby/vendor_ruby/2.2.0/i386-msvcrt
c:/Ruby/22/lib/ruby/vendor_ruby
c:/Ruby/22/lib/ruby/2.2.0
c:/Ruby/22/lib/ruby/2.2.0/i386-mingw32
        from c:/Ruby/22/lib/ruby/2.2.0/rubygems/core_ext/kernel_require.rb:54:in `require'
        from c:/Ruby/22/lib/ruby/gems/2.2.0/gems/RedCloth-4.2.9/lib/redcloth.rb:13:in `<top (required)>'
        from c:/Ruby/22/lib/ruby/2.2.0/rubygems/core_ext/kernel_require.rb:128:in `require'
        from c:/Ruby/22/lib/ruby/2.2.0/rubygems/core_ext/kernel_require.rb:128:in `rescue in require'
        from c:/Ruby/22/lib/ruby/2.2.0/rubygems/core_ext/kernel_require.rb:39:in `require'

Ruby obviously cannot load the extension library redcloth_scan.so. The problem is that the library is not copied to the folder where Ruby looks for it. We will fix that now. Go to c:/Ruby/22/lib/ruby/gems/2.2.0/gems/RedCloth-4.2.9/lib, create a folder named 2.2 and move the file redcloth_scan.so to the new folder. Now try again to convert the Textile string to HTML.

C:\>ruby -rredcloth -e "puts RedCloth.new('Some *bold* text').to_html"
<p>Some <strong>bold</strong> text</p>

Now that we have a working RedCloth gem installed, our goal is to extend RedCloth so that we can create Word documents in a similar way.

C:\>ruby -rredcloth -e "puts RedCloth.new('Some *bold* text').to_word"

Actually, our to_word method will have to accept the path or name of the file where the Word document will be saved, but we will come to that later. First let's see how we can extend RedCloth.

On RedCloth's site there is a small section on how to customize the gem to perform a new conversion. We have to create a new formatter that outputs text in the desired format. To be sure what we have to do, let's check the gem's sources and see how the library itself performs conversions. In the file textile_doc.rb, in the lib\redcloth directory under RedCloth's installation directory, we can find the definition of the to_html method.

def to_html( *rules )
  apply_rules(rules)

  to(RedCloth::Formatters::HTML)
end

The method applies the rules passed as arguments and calls the to method, passing it the name of the module defined in the lib\redcloth\formatters\html.rb file. This module is used to extend our string object with the methods called by the parser. During parsing, whenever a new Textile marker is found in the input string, RedCloth calls the method that should perform the conversion corresponding to that marker, passing it a hash with values relevant to the current context. We can see what this actually means by making a simple formatter that only prints the names of the methods called by the parser, together with the arguments passed to them. In the rwin_book folder create a new directory word, and in it create the file missing_formatter.rb with the following code.

require 'redcloth'

module RedCloth
  module Formatters
    module Missing
      include RedCloth::Formatters::Base

      def escape(text)
        text
      end

      def p(value)
        value[:text] + "\n"
      end

      def method_missing(name, *args, &block)
        puts "Called #{name} with arguments #{args}"
        args[0][:text]
      end
    end
  end
end

module RedCloth
  class TextileDoc
    def to_missing(*rules)
      apply_rules(rules)
      to(RedCloth::Formatters::Missing)
    end
  end
end

This simple module has three methods. The first one, escape, is called for each word found in the input string and can be used to escape special characters or perform per-word transformations if needed. The next method is responsible for processing paragraphs. Paragraphs in Textile are blocks of lines separated by blank lines. Whenever a new paragraph is found, the method p is called. The last method we define is method_missing. We already met this method in previous chapters. Here, we use it just to log the messages, and their arguments, sent from the parser to our string object. At the end we define a new method in the TextileDoc class: to_missing.
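The mechanism can also be seen in isolation, without RedCloth. The following is only a minimal sketch: LoggingFormatter and its calls method are hypothetical names invented for illustration, and the last line merely plays the role of the parser.

```ruby
# Hypothetical stand-in for a formatter module; not part of RedCloth.
module LoggingFormatter
  # Log of [method name, arguments] pairs received so far.
  def calls
    @calls ||= []
  end

  # Record every message the object does not otherwise understand.
  def method_missing(name, *args, &block)
    return super unless args[0].is_a?(Hash)
    calls << [name, args]
    args[0][:text]
  end
end

doc = String.new("Some *bold* text")
doc.extend(LoggingFormatter)   # only this one string gains the methods

# Play the parser's role: it found *bold* and asks for :strong.
result = doc.strong(:text => "bold")
```

Because extend affects a single object, other strings in the program are left untouched.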

Before we use the new formatter we need a Textile document to process. Here is the content of the sample1.textile file that will be used.

h1. Give RedCloth a try!

h2. Sub heading of level 2

h3. Sub-sub heading of level 3

A *_simple_* paragraph with _simple_ *text*
a line break, some _emphasis_ and a "link":http://redcloth.org


*-bold strikethrough text-*
*_-+bold italic strikethrough underlined text+-_*

The last step is to create the script sample_missing_formatter.rb, which uses the new RedCloth "formatter" to process Textile.

require "./missing_formatter"
File.open('sample1.textile', 'r') do |f|
  r = RedCloth.new(f.read)
  puts r.to_missing
end

Running this script gives the following output.

C:\projects\ruby\rwin_book\word>ruby sample_missing_formatter.rb
Called em with arguments [{:text=>"simple"}]
Called strong with arguments [{:text=>"simple"}]
Called em with arguments [{:text=>"simple"}]
Called strong with arguments [{:text=>"text"}]
Called em with arguments [{:text=>"emphasis"}]
Called link with arguments [{:name=>"link", :text=>"link", :href=>"http://redcloth.org", :name_without_attributes=>"link"}]
Called del with arguments [{:text=>"bold strikethrough text"}]
Called strong with arguments [{:text=>"bold strikethrough text"}]
Called ins with arguments [{:text=>"bold italic strikethrough underlined text"}]
Called del with arguments [{:text=>"bold italic strikethrough underlined text"}]
Called em with arguments [{:text=>"bold italic strikethrough underlined text"}]
Called strong with arguments [{:text=>"bold italic strikethrough underlined text"}]
h1. Give RedCloth a try!
h2. Sub heading of level 2
h3. Sub-sub heading of level 3
A simple paragraph with simple text
a line break, some emphasis and a link
bold strikethrough text
bold italic strikethrough underlined text

This rough analysis of the conversion process gives us initial knowledge about the requirements for a future formatter that converts Textile to Word documents. Comparing the input file with the output log reveals the set of functions we have to implement. The phrase modifiers for italic, bold, strikethrough and underlined text trigger calls to the em, strong, del and ins methods respectively. Each method is passed a hash with the source text as the value of the :text key. Document structure modifiers such as heading, bullet or table markers are processed in a similar way, except that the corresponding formatter methods are not called unless they are defined.

The first line of our Missing formatter brings in two methods we will need in the Word formatter. It includes the RedCloth::Formatters::Base module in our module, and the methods before_transform and after_transform are already defined there. As their names say, these methods are called before any transformation begins and after the transformation of the whole input string is finished.

Knowing how the RedCloth gem works, we can start to implement our Word converter. Contrary to the usual way Textile documents are processed, in which the result of a transformation is a string, Word conversion results in a file. That means we have to pass a file name to our formatter before the transformation starts. Here is how the to_word method might be implemented.

module RedCloth
  class TextileDoc
    def to_word(file, *rules)
      apply_rules(rules)
      @word_file = file
      to(RedCloth::Formatters::Word)
    end
  end
end

We now have the name of the Word file stored in the @word_file instance variable of the TextileDoc object, where our future formatter can reach it. The first thing we have to do in the formatter is to prevent document processing if the file name is not defined. We can use the before_transform method to check whether the file name is defined and raise an error if it is not.

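module RedCloth
  module Formatters
    module Word
      include RedCloth::Formatters::Base

      # Name of the target file, as stored by TextileDoc#to_word.
      def file
        @word_file
      end

      def before_transform(value)
        raise "undefined file name" if file.nil? || file.empty?
      end
    end
  end
end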

We are working with Word and, naturally, we have to start the application and create a new document before we try to write anything in it. We could do that in before_transform, set instance variables and use them later in the code, but we will instead create two methods for that purpose.

def word
  @word ||= WIN32OLE.new("Word.Application")
end

def doc
  @doc ||= word.documents.add
end

In the first method we start Word if the @word variable is not yet set, which is the case when the method is called for the first time. Every other call immediately returns the already running Word instance remembered in the @word variable. The second method calls the add method on Word's documents collection. This method creates a new, empty document and returns it. The Document COM object is stored in the @doc variable and returned by each subsequent call.
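The effect of the ||= idiom can be checked without Word. The sketch below is only an illustration: FakeApp is a hypothetical stand-in for WIN32OLE.new("Word.Application") that counts how many times it was started, and Session is an invented holder for the memoizing method.

```ruby
# FakeApp is a hypothetical stand-in for the Word application object,
# so the sketch runs on a machine without Word.
class FakeApp
  @started = 0

  class << self
    attr_accessor :started
  end

  def initialize
    self.class.started += 1
  end
end

class Session
  def app
    # The right-hand side runs only while @app is nil (or false).
    @app ||= FakeApp.new
  end
end

session = Session.new
first  = session.app
second = session.app

puts first.equal?(second)  # is it the very same object?
puts FakeApp.started       # how many instances were created
```

Both calls return the same object, and only one instance is ever created per Session.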

There is just one more thing to do before we start implementing the methods that perform conversions from Textile to Word. Just as we checked whether the file name is set before any transformation, we have to save and close our document and quit Word once the transformation is finished. As you might have already guessed, we will use the after_transform method.

def after_transform(value)
  doc.saveas file
  doc.close
  word.quit
end

Without this method we would end up with numerous instances of the Word application, one for each call of the to_word method.
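The same concern applies when a handler raises an error halfway through a conversion, because after_transform may then never run. One possible safeguard, shown here only as a sketch, is Ruby's ensure clause. FakeWord and with_word are hypothetical names used for illustration and are not part of the formatter developed in this chapter.

```ruby
# FakeWord is a hypothetical stand-in for the Word application object.
class FakeWord
  attr_reader :running

  def initialize
    @running = true
  end

  def quit
    @running = false
  end
end

# Yields an application object and always quits it afterwards,
# whether the block finishes normally or raises.
def with_word
  app = FakeWord.new
  yield app
ensure
  app.quit if app
end

leaked = nil
begin
  with_word do |app|
    leaked = app
    raise "conversion failed"
  end
rescue RuntimeError
  # The error still reaches the caller, but the application was closed.
end

puts leaked.running
```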

When we used the RedCloth::Formatters::Missing formatter we saw that each Textile modifier has a corresponding method that is called by the parser. We will start implementing the conversion methods with the set of heading functions. Heading markers in Textile consist of the letter h followed by the number of the heading level and a dot ('.') character. The parser looks for methods named the same way (without the dot) and, if the formatter defines such a method, it is called. In Word a heading is defined through the style of the paragraph. The names of the heading styles are "Heading 1", "Heading 2", etc. Therefore, whenever our heading method is called we have to add the text passed as an argument, change the style of the current paragraph, add a line break to make room for a new paragraph and reset the style to "Normal". Following these rules we can define our first heading processor.

def h1(value)
  doc.paragraphs.add if word.selection.nil?
  word.selection.typetext value[:text]
  word.selection.style = "Heading 1"
  word.selection.typetext "\n"
  word.selection.style = "Normal"
  ""
end

The first line of the h1 method adds a new paragraph to the document if the document is empty, which we test by checking whether Word's selection object is nil. Next, we use the selection's typetext method, which adds text to the current paragraph. After that we change the style of the current paragraph to "Heading 1", create a new paragraph by adding a line break at the end of the current one, and reset the style to "Normal". RedCloth expects all methods to return a string value. Usually this string is modified according to the target format. The HTML formatter, for example, returns the original text surrounded by <h1>...</h1> tags. Since we do not expect the Word formatter to produce a string as its final result, we simply return an empty string from h1.

We need a similar method for each heading level. This can be a boring task, so let's do it the Ruby way by adding the following right after including the Base formatter in our RedCloth::Formatters::Word module.

[:h1, :h2, :h3, :h4, :h5, :h6].each do |heading|
  define_method(heading) do |*args|
    heading_level = heading.to_s.match(/h(\d)/)[1]
    doc.paragraphs.add if word.selection.nil?

    word.selection.typetext args[0][:text]
    word.selection.style = "Heading #{heading_level}"
    word.selection.typetext "\n"
    word.selection.style = "Normal"
    ""
  end
end

For each element of the array we define a method that first determines the heading level with a regular expression.

heading_level = heading.to_s.match(/h(\d)/)[1]

The regular expression matches the letter 'h' followed by a digit. The digit is captured, and the capture is used to set the value of the heading_level variable. After this line, the code is the same as in the h1 method, except that the style name is built from heading_level.
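The combination of define_method and the captured digit can be tried out without Word. In this minimal sketch HeadingDemo is a hypothetical class, and the generated methods simply return the style name with the text instead of driving a document.

```ruby
# HeadingDemo is a hypothetical, Word-free illustration of generating
# one method per heading level.
class HeadingDemo
  [:h1, :h2, :h3].each do |heading|
    define_method(heading) do |*args|
      # "h2" -> "2": the digit captured by the group in the pattern
      level = heading.to_s.match(/h(\d)/)[1]
      "Heading #{level}: #{args[0][:text]}"
    end
  end
end

demo = HeadingDemo.new
puts demo.h2(:text => "Sub heading")
```

Only the levels listed in the array get a method, so demo.h4 would raise NoMethodError here.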

Due to the way Textile parsers work, handling phrase modifiers is a little more complicated than handling document structure modifiers. RedCloth expects the formatter to return a formatted string from each of the modifier handling functions. It calls the handling method each time it finds a phrase modifier token, with the text that should be modified in the argument. Thus, for the *_phrase_* sequence the parser calls two methods for the same piece of text: first em with the argument {:text => "phrase"}, and after that strong, whose :text value is whatever em returned. If we are converting to HTML we can simply return <em>phrase</em> as the result of the call to em, and <strong><em>phrase</em></strong> as the result of the call to strong. The order in which the modifiers end up nested has no influence on how the result is rendered in a browser.
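The chaining for the HTML case can be sketched in a few lines. TinyHtml is a hypothetical miniature written for illustration only; RedCloth's real HTML formatter is more elaborate. The last two assignments play the role of the parser.

```ruby
# TinyHtml is a hypothetical miniature of an HTML-style formatter.
module TinyHtml
  module_function

  def em(value)
    "<em>#{value[:text]}</em>"
  end

  def strong(value)
    "<strong>#{value[:text]}</strong>"
  end
end

# For *_phrase_* the inner modifier is handled first, and its result
# becomes the :text of the outer one.
inner = TinyHtml.em(:text => "phrase")
outer = TinyHtml.strong(:text => inner)

puts outer  # prints <strong><em>phrase</em></strong>
```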

It is hard to use this approach in Word. The same text can occur more than once in one paragraph. If we change the font and add the text during the first method call, there is no way to know whether the call to the next modifier method (strong in this case) is for a new word that should be displayed in bold or is just an additional font style for the previous phrase. Instead of sequentially changing the input text in each call to a phrase modifier handler, we can replace the input text with a placeholder and store the target format in a variable. When the whole paragraph is processed, the resulting text is a mix of text that should be displayed in the normal font and our placeholders. Additionally, each new modifier sequence must be replaced with a unique placeholder, so that we can later directly find the correct set of font styles along with the text they should be applied to.

Although this approach seems complex, it is actually quite simple, as we will see shortly. The first thing we need is a way to convert Textile font tokens to Word font styles, and a hash in which we keep our placeholders together with the information about the target text and font styles.

def styles_map
  {
    :strong => :bold,
    :em => :italic,
    :ins => :underline,
    :del_phrase => :strikethrough,
    :del => :strikethrough,
  }
end

def styles
  @styles ||= {}
end

The styles_map method returns a hash with Textile modifiers as keys and Word font styles as values, and styles returns the instance variable @styles, which is assigned an empty hash on the first call.

We must also decide what to use for placeholders. Let's use the less-than character < followed by wf, the number of the placeholder and the greater-than character >. Thus placeholders will be of the form <wf1>. This is not the best solution, since such a sequence of characters might be found in the source text. A better way would be to use a random sequence of characters, but for this purpose our placeholders are good enough. Now let's see what one of the phrase modifier handlers should look like.

def em(value)
  val = ""
  if value[:text].match(/<wf(\d+)>/) && styles["<wf#{$1}>"]
    val = "<wf#{$1}>"
    styles[val][:styles] << styles_map[:em]
  else
    val = "<wf#{next_tag_no}>"
    styles[val] = {:text => value[:text], :styles => [styles_map[:em]]}
  end
  val
end

First we test whether the input text matches the placeholder pattern. If it doesn't, we create a new placeholder. The number of the placeholder is returned by the method next_tag_no.

def next_tag_no
  @tag_no ||= 0
  @tag_no += 1
end

After that we use the placeholder as a key in our styles hash, with a new hash as its value. Under the :text key we store the original text, and under the :styles key an array of Word font styles. The return value of the function in this case is the current placeholder. Since the parser uses this value when it calls the next phrase modifier handler for the same part of the text, our next modifier, if called, receives the placeholder as the value of the :text key in its input hash.

The if clause first tests whether the input text matches the placeholder pattern and, if it does, additionally checks whether we already have information about the text and font styles in the styles hash. Only if both conditions are met do we set the return value to the current placeholder and add the new font style to the array of existing ones.
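The random placeholders mentioned earlier as a more robust alternative could be generated with SecureRandom from Ruby's standard library. This is only a sketch of that idea: random_placeholder is a hypothetical helper, the <wf-...> format is an assumption, and the pattern used by the handlers to recognize placeholders would have to change accordingly.

```ruby
require 'securerandom'

# Hypothetical alternative to the numbered <wfN> scheme: 16 random
# hexadecimal characters make a clash with the source text unlikely.
def random_placeholder
  "<wf-#{SecureRandom.hex(8)}>"
end

# A pattern that recognizes such placeholders and captures the random part.
RANDOM_PLACEHOLDER = /<wf-(\h{16})>/

a = random_placeholder
b = random_placeholder
puts a
puts a.match(RANDOM_PLACEHOLDER) ? "recognized" : "not recognized"
```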

As with the heading modifiers, we want to avoid writing one function per phrase modifier, so we again turn to the beauty of the Ruby language.

[:strong, :em, :ins, :del, :del_phrase].each do |tag|
  define_method(tag) do |*args|
    val = ""
    if args[0][:text].match(/<wf(\d+)>/) && styles["<wf#{$1}>"]
      val = "<wf#{$1}>"
      styles[val][:styles] << styles_map[tag]
    else
      val = "<wf#{next_tag_no}>"
      styles[val] = {:text => args[0][:text], :styles => [styles_map[tag]]}
    end
    val
  end
end
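The bookkeeping done by these handlers can be exercised without RedCloth or Word. In the sketch below PlaceholderDemo is a hypothetical harness that reuses the same logic in slightly condensed form, and the last two assignments play the role of the parser processing *_simple_*.

```ruby
# PlaceholderDemo is a hypothetical, Word-free harness for the
# placeholder technique described above.
class PlaceholderDemo
  STYLES_MAP = { :strong => :bold, :em => :italic,
                 :ins => :underline, :del => :strikethrough }.freeze

  def styles
    @styles ||= {}
  end

  def next_tag_no
    @tag_no ||= 0
    @tag_no += 1
  end

  STYLES_MAP.each_key do |tag|
    define_method(tag) do |value|
      key = value[:text][/<wf\d+>/]
      if key && styles[key]
        # The text already is a known placeholder: add one more style.
        styles[key][:styles] << STYLES_MAP[tag]
      else
        # First modifier seen for this text: create a new placeholder.
        key = "<wf#{next_tag_no}>"
        styles[key] = { :text => value[:text], :styles => [STYLES_MAP[tag]] }
      end
      key
    end
  end
end

demo  = PlaceholderDemo.new
inner = demo.em(:text => "simple")     # creates <wf1>
outer = demo.strong(:text => inner)    # reuses <wf1>
p demo.styles
```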

Handling links is simpler, since they are processed in a single call and we only need to store the text that will be displayed and a URL.

def link(value)
  val = "<wf#{next_tag_no}>"
  styles[val] = {:text => value[:text], :href => value[:href], :styles => []}
  val
end

At the end of paragraph processing RedCloth calls the formatter's p method with text that contains all our placeholders, and the @styles hash holds all the information needed to properly format the paragraph in the Word document. The modified text of the two paragraphs of our sample Textile document is given below; the first two lines belong to the first paragraph.

A <wf1> paragraph with <wf2> <wf3>
a line break, some <wf4> and a <wf5>
<wf6>
<wf7>

When p is called for the first paragraph, the styles hash has the following content.

{"<wf1>"=>{:text=>"simple", :styles=>[:italic, :bold]},
 "<wf2>"=>{:text=>"simple", :styles=>[:italic]},
 "<wf3>"=>{:text=>"text", :styles=>[:bold]},
 "<wf4>"=>{:text=>"emphasis", :styles=>[:italic]},
 "<wf5>"=>{:text=>"link", :href=>"http://redcloth.org", :styles=>[]}}

It is now fairly clear what we have to do. All unaltered parts of the text are simply added to the Word document, and whenever we find a placeholder we use the value of the :text key from the corresponding element of the styles hash and apply the font styles kept under the :styles key.

def set_font_styles(word_font, styles)
  styles.each do |style|
    word_font.Bold = true if style == :bold
    word_font.Italic = true if style == :italic
    word_font.Underline = true if style == :underline
    word_font.Strikethrough = true if style == :strikethrough
  end
end

def reset_font_styles(word_font, styles)
  styles.each do |style|
    word_font.Bold = false if style == :bold
    word_font.Italic = false if style == :italic
    word_font.Underline = false if style == :underline
    word_font.Strikethrough = false if style == :strikethrough
  end
end

def p(value)
  doc.paragraphs.add if word.selection.nil?

  if styles.empty?
    word.selection.typetext value[:text]
  else
    text = [value[:text]]
    styles.keys.each do |st|
      parts = text.collect {|t| t.split st}.flatten
      curr = parts.shift
      word.selection.typetext curr

      if styles[st][:href]
        doc.hyperlinks.add(word.selection.range,
                           styles[st][:href], nil,
                           nil, styles[st][:text])
      else
        set_font_styles(word.selection.font, styles[st][:styles])
        word.selection.typetext styles[st][:text]
        reset_font_styles(word.selection.font, styles[st][:styles])
      end

      text = parts
    end
    styles.clear
    word.selection.typetext text[0] unless text.empty?
  end
  word.selection.typetext "\n"
  ""
end

Just as we did for headings, we add a new paragraph to the Word document if we are about to add text to an empty document. If no styles were used, the paragraph text is added to the end of the document as it is. Otherwise the text is added piece by piece, with the stored font styles applied to each placeholder's text, and the styles hash is cleared for the next paragraph.
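The splitting step can be examined separately from Word. The following sketch uses a different technique from the p method above: a single split with a capture group, which keeps the placeholders in the result. The helper runs_for is a hypothetical name, and it returns the pieces as [text, styles] pairs instead of typing them into a document.

```ruby
# Hypothetical helper: turn placeholder text into [text, styles] runs.
# String#split keeps the delimiters when the pattern has a capture group.
def runs_for(text, styles)
  text.split(/(<wf\d+>)/).reject(&:empty?).map do |piece|
    entry = styles[piece]
    entry ? [entry[:text], entry[:styles]] : [piece, []]
  end
end

styles = {
  "<wf1>" => { :text => "simple", :styles => [:italic, :bold] },
  "<wf2>" => { :text => "text",   :styles => [:bold] }
}

runs = runs_for("A <wf1> paragraph with <wf2>", styles)
runs.each { |text, font_styles| puts "#{text.inspect} #{font_styles.inspect}" }
```

Each pair could then be typed into the document with the listed styles switched on before the text and off after it.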