Due 11:59pm on Tuesday, Mar. 5, 2013
Automatic evaluation of machine translation output relative to a human reference (or, if available, a set of reference translations) is a core challenge in machine translation. Not only are such evaluation measures useful for determining whether a change to an MT system improved or degraded its performance, but solving the translation evaluation problem also amounts to solving a simpler version of the translation problem itself: an effective evaluation measure must determine whether a hypothesis is fluent and whether it has the same meaning as the reference. These two requirements roughly correspond to the language model and the translation model in the noisy channel decomposition of the translation problem that we discussed.
Go to your clone of your course GitHub repository on the machine where you will be doing this assignment, and run the following command to obtain the code and data you will need:
./tools/get-new-assignments
In this homework, you will study the translation evaluation problem as follows. You will be given tuples of two English-language translation hypotheses ($\textbf{h}_1$ and $\textbf{h}_2$) and a reference translation ($\textbf{e}$) that was generated by a human translator. Your task will be to decide whether $\textbf{h}_1$ is a better translation than $\textbf{h}_2$, whether they are equally good (or bad), or whether $\textbf{h}_2$ is better than $\textbf{h}_1$, by designing a function $f(\textbf{h}_1, \textbf{h}_2, \textbf{e}) \in \{ -1, 0, 1 \}$ that indicates these three options, respectively.
To illustrate, suppose that $\textbf{h}_1$ and $\textbf{h}_2$ are the outputs of two different translation systems for the following input:
where the two systems produce:
Let us assume that we are given the following reference translation:
We would hope that our automatic evaluation function $f$ would return $1$ with these inputs, indicating that $\textbf{h}_2$ is a better translation.
This assignment comes with a very simple evaluation algorithm implemented:
$$\begin{align*} f(\textbf{h}_1,\textbf{h}_2,\textbf{e}) = \begin{cases} -1 & \textrm{if }\ell(\textbf{h}_1,\textbf{e}) > \ell(\textbf{h}_2,\textbf{e}) \\ 0 & \textrm{if }\ell(\textbf{h}_1,\textbf{e}) = \ell(\textbf{h}_2,\textbf{e}) \\ 1 & \textrm{otherwise} \end{cases} \end{align*}$$
where $\ell(\textbf{h},\textbf{e})$ is the count of words in $\textbf{h}$ that are also in $\textbf{e}$.
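For concreteness, here is a minimal sketch of this baseline in Python. The ` ||| ` field separator and the exact data layout are assumptions about the provided files; the bundled `evaluate` script is the authoritative implementation.

```python
# Minimal sketch of the baseline, assuming each line of the data file holds
# "h1 ||| h2 ||| e" (this separator is an assumption; check the bundled
# `evaluate` script for the real format).

def word_matches(h, ref_words):
    # l(h, e): count of words in h that also appear in the reference
    return sum(1 for w in h if w in ref_words)

def simple_f(h1, h2, e):
    ref_words = set(e)
    l1 = word_matches(h1, ref_words)
    l2 = word_matches(h2, ref_words)
    if l1 > l2:
        return -1  # h1 is the better translation
    elif l1 == l2:
        return 0   # tie
    else:
        return 1   # h2 is the better translation

with open('data/test.hyp1-hyp2-ref') as f:
    for line in f:
        h1, h2, e = [part.split() for part in line.split(' ||| ')]
        print(simple_f(h1, h2, e))
```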
To run this baseline heuristic model, use the following command:
./evaluate | ./check > simple.pred
This runs the baseline evaluator, pipes its output through a format checker, and stores the predictions in simple.pred
. To score the predictions, run this command:
./grade < simple.pred
This command scores the prediction quality relative to human judgements of translation quality and reports the accuracy of the evaluator and the confusion matrix.
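Conceptually, the scoring step amounts to something like the following sketch. The gold-judgement path below is our invention for illustration, and the actual grade script may compute things differently.

```python
# Hypothetical sketch of the scoring step: compare predictions to human
# judgements and tabulate a confusion matrix. The gold-label path below is
# an assumption; the real `grade` script may work differently.
from collections import Counter

with open('simple.pred') as f:
    pred = [int(line) for line in f]
with open('data/train.gold') as f:  # hypothetical path to human judgements
    gold = [int(line) for line in f]

accuracy = sum(p == g for p, g in zip(pred, gold)) / float(len(gold))
confusion = Counter(zip(gold, pred))  # maps (gold, predicted) -> count

print('accuracy: %.4f' % accuracy)
for g in (-1, 0, 1):
    print('gold %2d: %s' % (g, [confusion[(g, p)] for p in (-1, 0, 1)]))
```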
Your task for this assignment is to improve the accuracy of your translation evaluator relative to human judgements of translation quality as much as possible.
An implementation that predicts which translation hypothesis is better relative to the reference using the simple METEOR score is enough to earn seven points. The simple METEOR score is defined as the weighted harmonic mean of the precision and recall of word matches in each hypothesis against the reference (be sure to tune the precision-recall trade-off!).
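Here is a self-contained sketch of such a scorer, following the definition above. The weight alpha controls the precision-recall trade-off and should be tuned against the training judgements (alpha = 0.5 gives the unweighted harmonic mean, i.e. F1):

```python
def simple_meteor(h, e, alpha=0.5):
    # Precision and recall of word matches against the reference e
    ref_words = set(e)
    matches = sum(1 for w in h if w in ref_words)
    if matches == 0:
        return 0.0
    precision = matches / float(len(h))
    recall = matches / float(len(e))
    # Weighted harmonic mean; alpha trades recall off against precision
    return precision * recall / (alpha * precision + (1.0 - alpha) * recall)

def meteor_f(h1, h2, e, alpha=0.5):
    m1 = simple_meteor(h1, e, alpha)
    m2 = simple_meteor(h2, e, alpha)
    return -1 if m1 > m2 else (0 if m1 == m2 else 1)
```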
This is a very simple baseline to implement. However, evaluation is by no means a solved problem, and the goal of this assignment is for you to develop a new approach to evaluating machine translation output that performs better than standard algorithms (as usual, you are not even required to implement the simple METEOR baseline at all, as long as you can beat it). To get full credit, you will need to experiment. Here are some ideas that might inspire you:
Be creative!
You may work independently or in groups of any size, under these conditions:
You must turn in the following by submitting to the public GitHub repository:

- hw2/output.txt - your predictions (1 per line) on data/test.hyp1-hyp2-ref
- hw2/README.md - a brief description of the algorithms you tried.
- hw2/... - your source code and revision history. We want to see evidence of regular progress over the course of the project. You don't have to git push to the public repository unless you want to, but you should be committing changes with git add and git commit. We expect to see evidence that you are trying things out and learning from what you see.

You should feel free to use additional data resources such as thesauruses, WordNet, or parallel data. You are also free (and encouraged!) to use additional codebases, tools, and libraries, except for those designed to evaluate machine translation systems. You must write your own evaluation function. However, if you want your evaluation to depend on lemmatizers, stemmers, automatic parsers, or part-of-speech taggers, or you would like to learn a metric using a general machine learning toolkit, that is fine. But translation evaluators, including (but not limited to) available implementations of BLEU, METEOR, TER, NIST, and others, are not permitted. You may of course inspect these systems if you want to understand how they work, although they tend to include other functionality that is not the focus of this assignment. It is possible to complete the assignment with a very modest amount of code. If you aren't sure whether something is permitted, ask us.
This assignment is adapted from one used in the course developed by Adam Lopez, Chris Callison-Burch, and Matt Post.
Unless otherwise indicated, this content has been adapted from this course by Chris Dyer. Both the original and new content are licensed under a Creative Commons Attribution 3.0 Unported License.