Extending

Serializer

A serializer generates an ElementTree node for each element that has already been parsed. Serializers work with the ElementTree API so that the generated content can be easily manipulated in serializers and hooks. The generated tree is converted to its textual HTML representation at the end of the process.
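
To illustrate the build-then-convert flow described above, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for whichever ElementTree implementation the library relies on (an assumption), and the element names are purely illustrative.

```python
import xml.etree.ElementTree as etree  # stdlib stand-in for the ElementTree API

# Build the tree the way a serializer would: attach new nodes to a parent.
root = etree.Element('div')
para = etree.SubElement(root, 'p')
para.text = 'Hello'

# The result is still a tree, so it can be changed after it was created.
para.set('class', 'intro')

# Only at the very end is the tree converted to HTML text.
html = etree.tostring(root, encoding='unicode', method='html')
print(html)
```

Keeping the content as a tree until the last step is what lets serializers and hooks change tags and attributes without any string manipulation.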

A serializer is passed a reference to the element into which new content should be inserted. When the serializer is done, it calls the hooks defined for that kind of element.

Supported OOXML document elements are:
  • Paragraph
  • Text
  • Link
  • Image
  • Table / Table Cell
  • Footnote
  • Symbol
  • List
  • Break
  • Table Of Contents (parsed only)
  • TextBox (parsed only)
  • Math (parsed only)
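
Example of a custom serializer for the Break element:

from lxml import etree  # ElementTree API implementation (lxml assumed here)

import ooxml
from ooxml import doc, serialize


def serialize_break(ctx, document, elem, root):
    # A text-wrapping break becomes <br>; any other break becomes a page break.
    if elem.break_type == u'textWrapping':
        _div = etree.SubElement(root, 'br')
    else:
        _div = etree.SubElement(root, 'span')
        _div.set('style', 'page-break-after: always;')

    # Let the hooks registered for page breaks adjust the new element.
    serialize.fire_hooks(ctx, document, elem, _div, ctx.get_hook('page_break'))

    return root


dfile = ooxml.read_from_file('doc_with_math_element.docx')

# Replace the default serializer for Break elements with our own.
opts = {
    'serializers': {
        doc.Break: serialize_break,
    }
}

print(serialize.serialize(dfile.document, opts))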

Hook

Hooks allow quick and easy manipulation of the generated ElementTree elements. Hooks are called for each newly created element. Using hooks, we can slightly modify or completely rewrite the content generated by serializers.
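
The idea of calling hooks on every new element can be sketched as follows. This is not the library's own code: run_hooks and the two hook functions are hypothetical names invented for this illustration, and the standard library's ElementTree stands in for the real tree implementation.

```python
import xml.etree.ElementTree as etree  # stdlib stand-in for the ElementTree API


def run_hooks(source, node, hooks):
    """Hypothetical dispatcher: pass the new node to every registered hook."""
    for hook in hooks or []:
        hook(source, node)


def add_class(source, node):
    # A slight modification: only an attribute is added.
    node.set('class', 'note')


def to_blockquote(source, node):
    # A more drastic change: the element's tag is replaced.
    node.tag = 'blockquote'


root = etree.Element('div')
node = etree.SubElement(root, 'p')
node.text = 'Quoted text'

# Hooks run in list order right after the element has been created.
run_hooks(None, node, [add_class, to_blockquote])

html = etree.tostring(root, encoding='unicode', method='html')
print(html)
```

Because each hook receives the live element, later hooks see the changes made by earlier ones.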

Example

We are using MS Word to edit our document. Using the style “Quote” we mark certain parts of the document as quotes, and using the style “Title” we mark the title. The sample code below uses hooks to put the title inside an <h1> element and to add the class “our_quote” to the quote element.

Sample code

import six

import ooxml
from ooxml import serialize


def check_for_header(ctx, document, el, elem):
    # Paragraphs styled as "Title" become <h1> elements.
    if hasattr(el, 'style_id'):
        if el.style_id == 'Title':
            elem.tag = 'h1'


def check_for_quote(ctx, document, el, elem):
    # Paragraphs styled as "Quote" get the extra class "our_quote".
    if hasattr(el, 'style_id'):
        if el.style_id == 'Quote':
            # strip() avoids a leading space when no class was set before.
            elem.set('class', (elem.get('class', '') + ' our_quote').strip())


file_name = '../files/03_hooks.docx'
dfile = ooxml.read_from_file(file_name)

# Both hooks run for every generated <p> element.
opts = {
    'hooks': {
        'p': [check_for_quote, check_for_header]
    }
}

six.print_(serialize.serialize(dfile.document, opts))
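
The effect of the two hooks can be checked without a .docx file. The sketch below repeats the hook logic and calls it by hand; FakeParagraph is a hypothetical stand-in for a parsed paragraph (only style_id is modelled), and the standard library's ElementTree replaces the real tree implementation.

```python
import xml.etree.ElementTree as etree  # stdlib stand-in for the ElementTree API


class FakeParagraph(object):
    """Hypothetical stand-in for a parsed paragraph; only style_id is modelled."""

    def __init__(self, style_id):
        self.style_id = style_id


def check_for_header(ctx, document, el, elem):
    if getattr(el, 'style_id', None) == 'Title':
        elem.tag = 'h1'


def check_for_quote(ctx, document, el, elem):
    if getattr(el, 'style_id', None) == 'Quote':
        elem.set('class', (elem.get('class', '') + ' our_quote').strip())


title = etree.Element('p')
quote = etree.Element('p')

# Call the hooks in the same order as in the 'hooks' option above.
for hook in (check_for_quote, check_for_header):
    hook(None, None, FakeParagraph('Title'), title)
    hook(None, None, FakeParagraph('Quote'), quote)

print(title.tag, quote.tag, quote.get('class'))
```

The context and document arguments are passed as None here only because these two hooks never use them.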