Changing Moses

This is going to be a bit of a stream-of-consciousness post. I'm trying to modify Moses to provide support for linking phrases in the XML input. The EBMT system often translates two or more separate phrases that come from the same example. Because these phrases are intrinsically related, they should either be used together or not at all.

As far as the XML code itself is concerned, this is not a major task. I will probably modify it to allow something along the lines of:

<multi-phrase><phrase1 span="1-2" english="abc"/><phrase2 span="4-6" english="xyz"/></multi-phrase>

Here, the phrases within the multi-phrase tag must either all be used or all be left unused.
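To make the proposed format concrete, here is a minimal sketch of reading one such group into a list of linked spans. It is written in Python purely for illustration and is not how Moses handles its XML input. The function name `parse_multi_phrase` is made up, and the tag and attribute names are simply those from the tentative example above.

```python
import xml.etree.ElementTree as ET


def parse_multi_phrase(xml_text):
    """Return a list of (start, end, target) tuples for one multi-phrase group.

    Hypothetical helper: assumes every child carries a span="M-N" attribute
    and an english="..." attribute, as in the tentative format above.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "multi-phrase":
        raise ValueError("expected a <multi-phrase> element")
    parts = []
    for child in root:
        # "1-2" -> (1, 2); the span is kept exactly as written in the input.
        start, end = (int(n) for n in child.attrib["span"].split("-"))
        parts.append((start, end, child.attrib["english"]))
    return parts


example = ('<multi-phrase><phrase1 span="1-2" english="abc"/>'
           '<phrase2 span="4-6" english="xyz"/></multi-phrase>')
linked_parts = parse_multi_phrase(example)
```

The result keeps the parts of one group together in a single list, which is the property the decoder would need to respect.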

The tricky part of the modification is in changing the internal representation of phrases. Moses currently follows a process something like this during decoding:

  1. Generate list of all TranslationOptions (mappings from source phrases to target phrases) from all available sources, including:
    • Phrase table
    • XML Input
    • Other (confusion networks, etc.)
  2. Create an initial hypothesis with no tokens translated, place it on stack #0
  3. Work through stacks in order. Each stack holds the hypotheses with a certain number of tokens translated (e.g. stack #1 holds hypotheses with one token translated)
  4. At each stack, expand every hypothesis on the stack. For example, at stack #1:
    • Work through pre-computed list of TranslationOptions finding all that translate any number of contiguous tokens starting with the next untranslated token (i.e. token #2 in this example)
    • For each relevant TranslationOption, generate a new hypothesis and place it on the appropriate stack (e.g. if the TranslationOption translates three more tokens, tokens 2-4, then place it on stack #4 as we now have four tokens translated)
  5. Continue until all stacks have been processed. Complete translations are now available on the last stack and can be scored.
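The steps above can be sketched as a toy decoder. This is a simplified Python illustration of the process as I've described it, not Moses code: it has no scoring, pruning or recombination, and the names `Option`, `Hypothesis` and `decode` are invented for the example. Token positions are 0-based here, with inclusive end positions.

```python
from collections import namedtuple

# Hypothetical stand-ins for Moses's TranslationOption and Hypothesis.
Option = namedtuple("Option", "start end target")     # source span, inclusive
Hypothesis = namedtuple("Hypothesis", "covered output")  # tokens translated so far


def decode(num_tokens, options):
    """Expand hypotheses stack by stack; return those covering every token."""
    # Stack n holds the hypotheses that have translated exactly n tokens.
    stacks = [[] for _ in range(num_tokens + 1)]
    stacks[0].append(Hypothesis(0, ()))
    for n in range(num_tokens):
        for hyp in stacks[n]:
            for opt in options:
                # Only options starting at the next untranslated token apply.
                if opt.start != hyp.covered or opt.end >= num_tokens:
                    continue
                covered = opt.end + 1
                stacks[covered].append(
                    Hypothesis(covered, hyp.output + (opt.target,)))
    return stacks[num_tokens]


toy_options = [
    Option(0, 0, "a"), Option(1, 2, "bc"), Option(0, 1, "ab"),
    Option(2, 2, "c"), Option(1, 1, "b"),
]
complete = decode(3, toy_options)
complete_outputs = {hyp.output for hyp in complete}
```

For the three-token toy input, the last stack ends up with three complete hypotheses: `a` + `bc`, `ab` + `c`, and `a` + `b` + `c`.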

Unfortunately, this process allows very little flexibility in terms of the order in which tokens can be translated. Ideally, I'd like a single TranslationOption to represent my entire fragmented phrase. Each stack could still represent the number of tokens translated, but they would not necessarily be contiguous. Hypothesis expansion would then be modified to find all TranslationOptions that filled any of the remaining gaps, rather than only looking for TranslationOptions starting at a particular token. This, alas, requires vast changes to the innards of Moses and is likely to significantly reduce its efficiency.
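A rough sketch of what that gap-filling expansion might look like, again in illustrative Python rather than anything taken from Moses. It assumes each option records the set of source positions it covers (so a fragmented phrase is one option with a non-contiguous set) and each hypothesis records the set of positions already translated. `GapOption` and `decode_with_gaps` are hypothetical names.

```python
from collections import namedtuple

# positions: frozenset of 0-based source token indices (may be non-contiguous)
GapOption = namedtuple("GapOption", "positions target")


def decode_with_gaps(num_tokens, options):
    """Return the set of (covered, output) pairs that cover every token."""
    # Stack n still holds hypotheses with n tokens translated, but the
    # translated tokens no longer have to form a prefix of the sentence.
    stacks = [set() for _ in range(num_tokens + 1)]
    stacks[0].add((frozenset(), ()))
    for n in range(num_tokens):
        for covered, output in list(stacks[n]):
            for opt in options:
                # Every option must be tested against the coverage set,
                # not just those starting at one particular token.
                if not opt.positions or opt.positions & covered:
                    continue
                new_covered = covered | opt.positions
                stacks[len(new_covered)].add(
                    (new_covered, output + (opt.target,)))
    return stacks[num_tokens]


gap_options = [
    GapOption(frozenset({0, 2}), "abc...xyz"),  # one fragmented phrase
    GapOption(frozenset({1}), "m"),
    GapOption(frozenset({3}), "n"),
]
gap_results = decode_with_gaps(4, gap_options)
```

Even in this tiny case, the three options can be applied in any order, so six complete hypotheses are produced where the left-to-right process would consider far fewer. That gives some feel for why I expect this approach to be costly.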

I'm now thinking about representing the parts of my fragmented phrase as separate TranslationOptions, but introducing a new structure to link them together as dependants. Each hypothesis will be aware of any dependants it still needs to use, and when it reaches the part of the sentence that a dependant covers, it will use only that TranslationOption.
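Here is a minimal sketch of how that bookkeeping might work, in illustrative Python and under several assumptions of mine: decoding stays left to right as described earlier, the first part of a group (in source order) is the one that "opens" it, and groups do not overlap each other. The names `LinkedOption`, `Hyp`, `decode_linked` and the `group` field are all hypothetical.

```python
from collections import namedtuple

# group is None for ordinary options, or a shared id for linked parts.
LinkedOption = namedtuple("LinkedOption", "start end target group")
# pending: frozenset of linked parts this hypothesis still has to use.
Hyp = namedtuple("Hyp", "covered output pending")


def decode_linked(num_tokens, options):
    """Left-to-right decoding where linked parts are used all-or-nothing."""
    groups = {}
    for opt in sorted(options, key=lambda o: o.start):
        if opt.group is not None:
            groups.setdefault(opt.group, []).append(opt)
    stacks = [[] for _ in range(num_tokens + 1)]
    stacks[0].append(Hyp(0, (), frozenset()))
    for n in range(num_tokens):
        for hyp in stacks[n]:
            for opt in options:
                if opt.start != hyp.covered or opt.end >= num_tokens:
                    continue
                if opt in hyp.pending:
                    pending = hyp.pending - {opt}   # dependant is now used
                elif any(opt.start <= p.end and p.start <= opt.end
                         for p in hyp.pending):
                    continue  # these tokens are reserved for a dependant
                elif opt.group is None:
                    pending = hyp.pending
                elif opt == groups[opt.group][0]:
                    # Opening a group commits us to all of its later parts.
                    pending = hyp.pending | frozenset(groups[opt.group][1:])
                else:
                    continue  # a later part is unusable without the first
                stacks[opt.end + 1].append(
                    Hyp(opt.end + 1, hyp.output + (opt.target,), pending))
    # Safety check: a finished hypothesis must owe no dependants.
    return [h for h in stacks[num_tokens] if not h.pending]


linked_options = [
    LinkedOption(0, 1, "abc", "g"), LinkedOption(3, 5, "xyz", "g"),
    LinkedOption(0, 1, "q", None), LinkedOption(2, 2, "m", None),
    LinkedOption(2, 3, "bad", None), LinkedOption(3, 5, "r", None),
    LinkedOption(4, 5, "s", None),
]
linked_outputs = {h.output for h in decode_linked(6, linked_options)}
```

In this toy run, `abc` and `xyz` only ever appear together: once `abc` is chosen, options overlapping tokens 3-5 are blocked until `xyz` is used, and `xyz` is never offered to a hypothesis that started with `q`.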