Preliminary results

I have finished the extensive changes to my EBMT system needed to support the new features discussed in previous blog posts, including multiple alignments per example and the use of multiple examples per input. For full details, see the CVS changes brought in with commit #341, for which the commit message reads:

Massive changes. Now includes support for different alignment file types,
multiple alignments per example, overlapping translations and multiple translations
for a segment, caching of examples, correct XML printing
(with words tagged as phrases, rather than individually), and more.
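To illustrate what "words tagged as phrases, rather than individually" means, here is a minimal sketch and not the code from the commit. It assumes the Moses-style XML markup in which a tag wraps the source words and carries the suggested target in a `translation` attribute. The tag name `phrase`, the function `tag_phrases`, and the placeholder target tokens are all hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def tag_phrases(tokens, phrases, tag="phrase"):
    """Wrap each matched span of source tokens in a single XML tag.

    phrases is a list of (start, end, translation) tuples, with end
    exclusive and end > start. Spans are assumed not to overlap.
    The tag name is an assumption made for this sketch.
    """
    by_start = {start: (end, translation) for start, end, translation in phrases}
    out = []
    i = 0
    while i < len(tokens):
        if i in by_start:
            end, translation = by_start[i]
            # One tag around the whole phrase, not one tag per word.
            words = escape(" ".join(tokens[i:end]))
            out.append(f"<{tag} translation={quoteattr(translation)}>{words}</{tag}>")
            i = end
        else:
            out.append(escape(tokens[i]))
            i += 1
    return " ".join(out)


# Toy input: "small house" is covered by one example; T1 T2 are placeholders.
print(tag_phrases(["a", "small", "house", "here"], [(1, 3, "T1 T2")]))
```

Tagging the span as a unit keeps the example's translation together, whereas tagging each word separately would leave the decoder free to treat them as unrelated suggestions.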

I ran preliminary tests using the new system, but found that the version of Moses I was using did not support multiple alignments, so I ran the test using only the top alignment for each example. The results on the IWSLT test data for English->Chinese were as follows:

NIST score = 2.7285  BLEU score = 0.0465 for system "moses+ebmt"
NIST score = 3.1452  BLEU score = 0.0701 for system "moses"

These are obviously not very impressive results. Part of the problem is that the IWSLT data is actually from a Chinese->English track and consequently has only one Chinese reference translation for each example (as opposed to four in the opposite direction). However, we can see from the results that we are still not close to beating the SMT system. Another major issue concerns one of the benefits we had hoped to gain from the syntax-based matching: finding words which are related to each other even if they are not adjacent. Moses does not currently allow us to specify such relations in the XML input.
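The effect of having only one reference can be seen in the clipped n-gram precision that BLEU is built on. The following is a minimal sketch, not the official scoring script: the function names are hypothetical, and toy English tokens stand in for the Chinese data.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n=1):
    """Clipped n-gram precision of one candidate against one or more references.

    Each candidate n-gram is credited at most as many times as it appears
    in the single reference where it is most frequent.
    """
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    matched = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())


# Toy data (made up for illustration only).
candidate = "the cat sat on the mat".split()
ref_a = "a cat is on the mat".split()
ref_b = "the cat sat on a rug".split()

single = clipped_precision(candidate, [ref_a])
multi = clipped_precision(candidate, [ref_a, ref_b])
print(single, multi)
```

Adding a second reference can only keep the precision the same or raise it, because a valid wording missing from one reference may appear in another. With a single Chinese reference, correct but differently worded output gets no credit.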

After meeting with Steve, I have the following list of things to explore:

  • Modify Moses to be aware of phrases which must be translated together or not at all
  • Try to evaluate the EBMT system independently of Moses (number of segments translated, etc.)
  • Try same test on Europarl data
  • Try translating all four English sentences and compare each to the Chinese reference
  • Find new data with more reference translations
  • Try different parsers
  • See what difference worse CCG parser models make (less training vs more random)
  • Get a Chinese speaker to look at my output
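One possible Moses-independent measure from the list above is the fraction of input tokens covered by at least one EBMT translation. This is only a sketch under assumed data structures: the `coverage` function and the (start, end) span representation are hypothetical, not part of the current system.

```python
def coverage(num_tokens, spans):
    """Fraction of input tokens covered by at least one translated span.

    spans is a list of (start, end) token offsets, end exclusive.
    Overlapping spans are counted once.
    """
    if num_tokens == 0:
        return 0.0
    covered = set()
    for start, end in spans:
        # Clamp to the sentence boundaries before marking tokens as covered.
        covered.update(range(max(start, 0), min(end, num_tokens)))
    return len(covered) / num_tokens


# Toy sentence of 6 tokens with two overlapping matched spans.
print(coverage(6, [(0, 2), (1, 4)]))
```

Averaging this over a test set would show how much of the input the EBMT component handles at all, separately from how well Moses uses its suggestions.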

My feeling is that the lack of support for defining relationships between translated phrases in Moses' XML input is one of the biggest problems. I plan to work on this first.