Tried running the system with the GIZA++ alignment 'probabilities' removed. BLEU score barely changed. Changing XML mode from exclusive to inclusive gave a big performance jump though. Looks like I need to play around with the probability parameter some more.
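For reference, here is a rough sketch of the kind of marked-up input this is about. Moses reads suggested translations from XML tags carrying a translation attribute and an optional prob attribute, and the -xml-input switch decides whether those suggestions replace the phrase table for that span (exclusive) or compete with it (inclusive). The helper name xml_phrase, the tag name p, the example phrases and the file names are made up for illustration, and the helper does no XML escaping.

```shell
# Hypothetical helper: wrap a source phrase in Moses-style XML markup.
# Usage: xml_phrase SOURCE TRANSLATION [PROB]
# No escaping is done, so quotes or angle brackets in the text would break it.
xml_phrase() {
  if [ -n "${3:-}" ]; then
    printf '<p translation="%s" prob="%s">%s</p>' "$2" "$3" "$1"
  else
    # With the probability left out, Moses falls back to its default.
    printf '<p translation="%s">%s</p>' "$2" "$1"
  fi
}

# Example: one suggested phrase with a probability attached.
xml_phrase 'la maison' 'the house' 0.8
echo

# The two modes compared above would then be run roughly like this
# (moses.ini and input.xml are placeholder file names):
#   moses -f moses.ini -xml-input exclusive < input.xml
#   moses -f moses.ini -xml-input inclusive < input.xml
```

Dropping the third argument gives the "probabilities removed" variant of the same input.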
Do Everything
Finally created a script that pulls the whole compile-EBMT-SMT-BLEU process together into one easy package. do_everything.sh grabs the latest revision from Subversion, compiles it, runs the EBMT stage, feeds the result through Moses, and finally scores the output using BLEU. All of the output is stored tidily in a new folder, along with details of the software revision and all script parameters.
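The overall shape of such a script might look something like the sketch below. This is not the real do_everything.sh: the runs/ folder layout, the run-info.txt file and the three wrapper scripts (run_ebmt.sh, run_moses.sh, score_bleu.sh) are all hypothetical stand-ins. Only svn update, svnversion and make are real commands.

```shell
# Create a run folder named after the revision and a timestamp,
# record the revision and parameters in it, and print its path.
make_run_dir() {
  rev=$1
  shift
  dir="runs/r${rev}-$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$dir"
  printf 'revision: %s\nparameters: %s\n' "$rev" "$*" > "$dir/run-info.txt"
  printf '%s\n' "$dir"
}

# Illustrative pipeline; runs in a subshell so that set -e stays local.
run_pipeline() (
  set -e
  svn update                 # grab the latest revision
  rev=$(svnversion)          # may carry modifiers such as a trailing M
  make                       # compile
  out=$(make_run_dir "$rev" "$@")
  ./run_ebmt.sh "$@" > "$out/ebmt.xml"                # hypothetical EBMT wrapper
  ./run_moses.sh < "$out/ebmt.xml" > "$out/moses.txt" # hypothetical Moses wrapper
  ./score_bleu.sh "$out/moses.txt" > "$out/bleu.txt"  # hypothetical BLEU wrapper
)
```

Keeping the folder creation in its own function makes it easy to check that every run records which revision and parameters produced it.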
GIZA++ Structure
Just a quick one to keep track of what I know about how GIZA++ works.
Files:
hmm.{h,cc}: defines and creates HMM network
model1.{h,cc}: defines and initialises tTable (translation table?)
TTables.{h,cc}: template used for tTable
Alignment Scores
There's a big gap between this and my last post. I'll try and fill it in later. In the meantime, today I'm going to look at the alignment scores from GIZA++ and how they're affecting my translations.
Firstly, the scores seem very low (i.e. all on the order of 10 raised to a large negative power).
Q: What do these scores represent?
A: Not sure yet, but the only sentences that get high (>0.001) scores are short ones.
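One likely explanation (an assumption, not something confirmed above) is that the score is a product of per-word probabilities, so it shrinks roughly exponentially with sentence length. If so, taking the geometric mean per target word, score^(1/length), would make sentences of different lengths comparable. The sketch below assumes the uncompacted A3.final layout, where each sentence pair starts with a header line of the form "# Sentence pair (1) source length 7 target length 8 alignment score : 2.4e-10". The function name normalise_scores is made up.

```shell
# Hypothetical helper: print pair id, target length and the per-word
# (geometric mean) score for each A3.final-style header line.
# Assumes field 10 is the target length and the last field is the score.
normalise_scores() {
  awk '/^# Sentence pair/ {
    len = $10 + 0
    score = $NF + 0
    # Skip malformed lines and zero scores, where log() is undefined.
    if (len > 0 && score > 0)
      printf "%s\t%d\t%.4f\n", $4, len, exp(log(score) / len)
  }' "$@"
}
```

Run it as normalise_scores followed by the path to an A3.final file, or pipe header lines into it. A pair with score 0.25 over two target words comes out at 0.5 per word.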
Finding min, max, avg on command line
Handy awk script to find min, max, avg, etc:
awk 'NR == 1 { sum = min = max = $2 + 0; next }
     { v = $2 + 0; sum += v; if (v < min) min = v; if (v > max) max = v }
     END { if (NR > 0) print "avg: " sum/NR " min: " min " max: " max }' ../giza.en-fr/en-fr.A3.final.compact
Changing Moses
This is going to be a bit of a stream-of-consciousness post. I'm trying to modify Moses to provide support for linking phrases in the XML input. The EBMT system often translates two or more separate phrases that come from the same example. Because they are intrinsically related, these should either be used together or not at all.
Preliminary results
I finished making the extensive changes to my EBMT system required to support the new features discussed in previous blog posts, including support for multiple alignments per example, and use of multiple examples per input. For full details, see the CVS changes brought in with commit #341, for which the commit message reads:
Massive changes. Now includes support for different alignment file types, multiple alignments per example, overlapping translations and multiple translations for a segment, caching of examples, correct XML printing (with words tagged as phrases, rather than individually), and more.
Issues producing tagged phrases
An important issue pertaining to the structure of the output has arisen while implementing the phrase translation output modules.
Adding support for multiple alignments
Adding support for multiple GIZA++ alignments (an n-best list) requires a lot of fairly substantial changes to my current infrastructure.