Machine Translation

11-731 | Spring 2013


Homework 2

Due 11:59pm on Tuesday, Mar. 5, 2013

Automatic evaluation of machine translation output relative to a human reference (or, if available, a set of reference translations) is a core challenge in machine translation. Not only are such evaluation measures useful for determining whether a change to an MT system improved or degraded its performance, but being able to solve the translation evaluation problem also provides a simpler version of the translation problem itself: an effective evaluation measure must determine whether a hypothesis is fluent and whether it has the same meaning as the reference. These roughly correspond to the language model and translation model in the noisy channel decomposition of the translation problem we talked about.

Getting started

Go to your clone of your course GitHub repository on the machine where you will be doing this assignment, and run the following command to obtain the code and data you will need:


In this homework, you will study the translation evaluation problem as follows. You will be given tuples of two English-language translation hypotheses ($\textbf{h}_1$ and $\textbf{h}_2$) and a reference translation ($\textbf{e}$) that was generated by a human translator. Your task will be to decide whether $\textbf{h}_1$ is a better translation than $\textbf{h}_2$, whether they are equally good (or bad), or whether $\textbf{h}_2$ is a better translation than $\textbf{h}_1$, by designing a function $f(\textbf{h}_1,\textbf{h}_2,\textbf{e}) \mapsto \{ -1,0,1 \} $ indicating these three options, respectively.

To illustrate, suppose that $\textbf{h}_1$ and $\textbf{h}_2$ are the outputs of two different translation systems on the following input:

Chris hat die Prüfung bestanden, jedoch nur knapp.

where the two systems produce:

$\textbf{h}_1 = $ Chris has insisted on the test, but just barely.
$\textbf{h}_2 = $ Chris has passed the test, but just barely.

Let us assume that we are given the following reference translation:

$\textbf{e} = $ Chris just barely passed the test.

We would hope that our automatic evaluation function $f$ would return $1$ with these inputs, indicating that $\textbf{h}_2$ is a better translation.

This assignment comes with a very simple evaluation algorithm implemented:

$$\begin{align*} f(\textbf{h}_1,\textbf{h}_2,\textbf{e}) = \begin{cases} -1 & \textrm{if }\ell(\textbf{h}_1,\textbf{e}) > \ell(\textbf{h}_2,\textbf{e}) \\ 0 & \textrm{if }\ell(\textbf{h}_1,\textbf{e}) = \ell(\textbf{h}_2,\textbf{e}) \\ 1 & \textrm{otherwise} \end{cases} \end{align*}$$

where $\ell(\textbf{h},\textbf{e})$ is the count of words in $\textbf{h}$ that are also in $\textbf{e}$.
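In code, this baseline heuristic could be sketched as follows (function names are illustrative, not the actual contents of the evaluate script):

```python
def word_matches(h, ref_set):
    """Count words in hypothesis h that also appear in the reference set."""
    return sum(1 for w in h if w in ref_set)

def simple_eval(h1, h2, ref):
    """Return -1 if h1 overlaps the reference more, 0 on a tie, 1 otherwise."""
    rset = set(ref)
    l1 = word_matches(h1, rset)
    l2 = word_matches(h2, rset)
    if l1 > l2:
        return -1
    elif l1 == l2:
        return 0
    return 1

# The example from above, tokenized naively on whitespace:
h1 = "Chris has insisted on the test , but just barely .".split()
h2 = "Chris has passed the test , but just barely .".split()
ref = "Chris just barely passed the test .".split()
print(simple_eval(h1, h2, ref))  # -> 1 (h2 is preferred)
```

Here $\ell(\textbf{h}_1,\textbf{e}) = 6$ and $\ell(\textbf{h}_2,\textbf{e}) = 7$, so the baseline correctly prefers $\textbf{h}_2$.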

To run this baseline heuristic model, use the following command:

./evaluate | ./check > simple.pred

This runs the baseline evaluator and stores its predictions in simple.pred. To score the predictions, run this command:

./grade < simple.pred

This command scores the prediction quality relative to human judgements of translation quality and reports the accuracy of the evaluator and the confusion matrix.
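The bookkeeping behind such a report can be sketched as follows (this is an illustration of the computation, not the actual grade script, whose output format may differ):

```python
from collections import Counter

def grade(pred, gold):
    """Accuracy and confusion matrix of predictions vs. human judgements.

    pred and gold are parallel lists of labels in {-1, 0, 1}.
    """
    confusion = Counter(zip(gold, pred))  # (gold label, predicted label) -> count
    correct = sum(confusion[(g, g)] for g in (-1, 0, 1))
    return correct / len(gold), confusion

# Toy example: three of four predictions agree with the human judgements.
acc, conf = grade(pred=[1, 1, 0, -1], gold=[1, 0, 0, -1])
print(acc)  # -> 0.75
```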

The Challenge

Your task for this assignment is to improve the accuracy of your translation evaluator relative to human judgements of translation quality as much as possible.

Developing an implementation that returns a prediction of which translation hypothesis is better relative to the reference by using the simple METEOR score is enough to earn seven points. The simple METEOR score is defined as the weighted harmonic mean of the precision and recall of word matches in each hypothesis against the reference (be sure to tune the precision-recall trade-off!).
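A sketch of this simple METEOR score, using unigram matches on word types (the parameter name alpha and the tokenization are assumptions; tune alpha on the development data):

```python
def meteor(h, ref, alpha=0.9):
    """Weighted harmonic mean of unigram precision and recall of h against ref.

    alpha close to 1 emphasizes recall; this trade-off should be tuned.
    """
    hset, rset = set(h), set(ref)
    matches = len(hset & rset)
    if matches == 0:
        return 0.0
    p = matches / len(hset)  # precision
    r = matches / len(rset)  # recall
    return p * r / (alpha * p + (1 - alpha) * r)

def meteor_eval(h1, h2, ref, alpha=0.9):
    """Return -1, 0, or 1 according to which hypothesis scores higher."""
    m1, m2 = meteor(h1, ref, alpha), meteor(h2, ref, alpha)
    return -1 if m1 > m2 else (0 if m1 == m2 else 1)

h1 = "Chris has insisted on the test , but just barely .".split()
h2 = "Chris has passed the test , but just barely .".split()
ref = "Chris just barely passed the test .".split()
print(meteor_eval(h1, h2, ref))  # -> 1 (h2 scores higher)
```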

This is a very simple baseline to implement. However, evaluation is by no means a solved problem, and the goal of this assignment is for you to develop a new approach to evaluating machine translation output that performs better than standard algorithms (as usual, you are not even required to implement the simple METEOR baseline at all, as long as you can beat it). To get full credit, you will need to experiment. Here are some ideas that might inspire you:

  • Use WordNet to permit matching on synonyms
  • Compute string similarity using string subsequence kernels
  • Use an $n$-gram language model to better assess fluency
  • Develop a single-sentence variant of BLEU
  • Use a dependency parser to assess syntactic wellformedness (paper)
  • Learn a classifier (from the provided training data) that uses many different features (paper)
  • See what evaluation measures other people have implemented (paper)

Be creative!
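As one illustration of the first idea, exact-word matching can be relaxed to synonym matching. A toy synonym table stands in for WordNet below; in practice you would look up synsets, for example via nltk.corpus.wordnet (an assumption about tooling, not a requirement):

```python
# Toy stand-in for WordNet synonym lookup (hypothetical entries).
TOY_SYNONYMS = {
    "exam": {"exam", "test", "examination"},
    "test": {"test", "exam", "examination"},
}

def expand(word):
    """The word together with its synonyms (just the word if none are known)."""
    return TOY_SYNONYMS.get(word, {word})

def synonym_matches(h, ref):
    """Count hypothesis words whose synonym set intersects the reference."""
    rset = set(ref)
    return sum(1 for w in h if not expand(w).isdisjoint(rset))

print(synonym_matches("Chris passed the exam".split(),
                      "Chris passed the test".split()))  # -> 4
```

With exact matching, "exam" would not match "test"; the synonym expansion recovers that match.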

Ground Rules

  • You may work independently or in groups of any size, under these conditions:

    • You must notify us by posting a public note to Piazza including the e-mails of everyone who will be working in the group (max=3).
    • Everyone in the group will receive the same grade on the assignment.
    • Once you have formed a group, you may not disband until the next homework.
  • You must turn in the following by submitting to the public GitHub repository:

    • hw2/output.txt - your predictions (1 per line) on data/test.hyp1-hyp2-ref
    • hw2/ - a brief description of the algorithms you tried.
    • hw2/... - your source code and revision history. We want to see evidence of regular progress over the course of the project. You don’t have to git push to the public repository unless you want to, but you should be committing changes with git add and git commit. We expect to see evidence that you are trying things out and learning from what you see.

You should feel free to use additional data resources such as thesauruses, WordNet, or parallel data. You are also free (and encouraged!) to use additional codebases, tools, and libraries except for those designed to evaluate machine translation systems. You must write your own evaluation function. However, if you want your evaluation to depend on lemmatizers, stemmers, automatic parsers, or part-of-speech taggers, or you would like to learn a metric using a general machine learning toolkit, that is fine. But translation evaluators including (but not limited to) available implementations of BLEU, METEOR, TER, NIST, and others are not permitted. You may of course inspect these systems if you want to understand how they work, although they tend to include other functionality that is not the focus of this assignment. It is possible to complete the assignment with a very modest amount of code. If you aren't sure whether something is permitted, ask us.


A version of this assignment was used in the past in the course developed by Adam Lopez, Chris Callison-Burch, and Matt Post.

Unless otherwise indicated, this content has been adapted from this course by Chris Dyer. Both the original and new content are licensed under a Creative Commons Attribution 3.0 Unported License.