Syllabus

Homework 1

Due 11:59pm on Tuesday, Feb. 12, 2013

Word alignment is a fundamental task in statistical machine translation. This homework will give you an opportunity to try your hand at developing solutions to this challenging and interesting problem.

Getting started

Go to your clone of your course GitHub repository on the machine where you will be doing this assignment, and run the following command to obain the code and data you will need:

./tools/get-new-assignments

You will obtain a very simple heuristic aligner written in Python and 100,000 German-English parallel sentences from the Europarl corpus, version 7. The heuristic aligner uses set similarity to determine which words are aligned to each other in a corpus of parallel sentences. The intuition is that if you look at the set of sentence pairs that contain an English word $x$, and that set is similar to the set of sentence pairs that contain a German word $y$, then these words are likely to be translations of each other. The set similarity measure we use is Dice’s coefficient, defined in terms of sets $X$ and $Y$ as follows:

$$D(X,Y) = \frac{2 \times |X \cap Y|}{|X| + |Y|}$$

Dice’s coefficient ranges in value from $0$ to $1$.

In our formulation, every pair of words $(e,g)$ in the parallel corpus receives a Dice “score” $\delta(e,g)$. The aligner goes through all pairs of sentences and aligns English word $e_i$ to German word $g_j$ if $\delta(e_i,g_j) > \tau$. By making $\tau$ closer to $1$, fewer (hopefully, higher precision) points are aligned; by making it closer to $0$, more points are aligned. By default, our aligner uses $\tau = 0.5$ as its threshold.

Run the baseline heuristic model 1,000 sentences using the command:

./align -n 1000 | ./check > dice.al

This runs the aligner and stores the output in dice.al. To view and score the alignments, run this command (use the -n N option to display verbose output for only the first $N$ sentences):

./grade < dice.al

This command scores the alignment quality by comparing the output alignments against a set of human alignment annotations using a metric called the alignment error rate (AER), which balances precision and recall of the guessed alignments (paper). Look at the terrible output of this heuristic model – it’s better than chance, but not any good. Try training on 10,000 sentences instead of 1,000, by specifying the change on the command line:

./align -n 10000 | ./check | ./grade

Performance should improve. Another experiment that you can do is change the threshold criterion $\tau$ used by the aligner using the -t X option. How does this affect alignment error rate?

The Challenge

Your task for this assignment is to improve the alignment error rate as much as possible (lower is better). It shouldn’t be hard to improve it at least some: you’ve probably noticed that thresholding a Dice coefficient is a bad idea because alignments don’t compete against one another. A good way to correct this is with a model where alignments compete against each other, such as IBM Model 1. It forces all of the words you are conditioning on to compete to explain each word that is generated.

IBM Model 1 is a simple probabilistic translation model we talked about in class. A source sentence $\textbf{f} = \langle \varepsilon, f_1, f_2 , \ldots , f_{n} \rangle$ (where $\varepsilon$ represents a null token present in every sentence) and a desired target sentence length $m$ are given, and conditioned on these, Model 1 defines a distribution over translations of length $m$ into the target language using the following process:

\begin{align*} \mbox{for each}&i \in [1,2,\ldots,m] \\ a_i &\sim \textrm{Uniform}(0,1,2,\ldots,n) \\ e_i &\sim \textrm{Categorical}(\boldsymbol{\theta}_{f_{a_i}}) \end{align*}

The random variables $\textbf{a} = \langle a_1,a_2, \ldots a_m \rangle$ are the alignments that pick out a source word to translate at each position in the target sentence. A source word $f_j$ may be translated any number of times (0, 1, 2, etc.), but each word in the target language $e_i$ that is generated is generated exactly one time by exactly one source word.

The marginal (marginalizing over all possible alignments) likelihood of a sentence $\textbf{e} = \langle e_1, e_2, \ldots , e_m \rangle$, given $\textbf{f}$ and $m$ is:

$$p(\textbf{e} \mid \textbf{f},m) = \prod_{i=1}^m \sum_{j = 0}^{n} p(a_i = j) \times p(e_i \mid f_j)$$

The iterative EM update for this model is straightforward. At each iteration, for every pair of an English word type $e$ and a German word type $f$, you count up the expected (fractional) number of times tokens $f$ are aligned to tokens of $e$ and divide by the expected number of times that $f$ was chosen as a translation source. That will give you a new estimate of the translation probabilities $p(f \mid e)$, which leads to new alignment expectations, and so on. We recommend developing on a small data set (1,000 sentenes) and a few iterations of EM (in practice, Model 1 needs only 4 or 5 iterations to give good results). When you are finished, you should see both qualitative and quantitative improvements in the alignments.

Developing a Model 1 aligner is enough to earn seven points. But alignment isn’t a solved problem, and the goal of this assignment isn’t for you to just implement a well-known algorithm (in fact, you are not required to implement Model 1 at all, as long as you can beat it). To get full credit you’ll need to experiment. Here are some ideas that might inspire you:

Ground Rules

• You may work in independently or in groups of any size, under these conditions:

• You must notify us by posting a public note to piazza including the e-mails of everyone who will be working in the group (max=3).
• Everyone in the group will receive the same grade on the assignment.
• Once you have formed a group, you may not disband until the next homework.
• You must turn in the following by submitting to the public GitHub repository

• hw1/output.txt - your alignment of the full set of paralle data (dev/test/training). You can push new output as often as you like up until the assignment deadline. The output will be evaluated on the test data (which you do not have gold alignments for), but the grade program will give you a good idea how you are doing. Whoever has the lowest AER score on the test set when the deadline passes receives the most bonus points.
• hw1/README.md - a brief description of the algorithms you tried.
• hw1/... - your source code and revision history. We want to see evidence of regular progress over the course of the project. You don’t have to git push to the public repository unless you want to, but you should be committing changes with git add and git commit. We expect to see evidence that you are trying things out and learning from what you see.
• You may only use data or code resources other than the ones we provide with advance permission. We’ll probably ask you to make your resources available to everyone. So if, say, you have a cool idea for using the Stanford parser (which knows both German and English!), or a German-English dictionary, that’s great. But to make things fair everyone should have access to the same set of resources, so we’ll ask you to share the parses. This kind of constrained data condition is common in real-world evaluations of AI systems. In keeping with this philosophy, we’re much more likely to approve your request if you ask early. Do not ask the night before the assignment. A few things are off-limits: Giza++, cdec’s alignment tools, the Berkeley Aligner, or anything else that already does the alignment for you. You must write your own code for alignment. If you want to do system combination, join forces with your classmates to form a group.

Acknowledgements

This assignment has been done in various forms in the past by Philipp Koehn, John DeNero, and at Adam Lopez, Chris Callison-Burch, and Matt Post.

 Unless otherwise indicated, this content has been adapted from this course by Chris Dyer. Both the original and new content are licensed under a Creative Commons Attribution 3.0 Unported License.