Natural Language Processing (11-411 and 11-611)

Important Links

Project Data

Wikipedia articles

Data Set 1
a1.htm a1.txt a2.htm a2.txt a3.htm a3.txt a4.htm a4.txt a5.htm a5.txt a6.htm a6.txt a7.htm a7.txt a8.htm a8.txt a9.htm a9.txt a10.htm a10.txt
Data Set 2
a1.htm a1.txt a2.htm a2.txt a3.htm a3.txt a4.htm a4.txt a5.htm a5.txt a6.htm a6.txt a7.htm a7.txt a8.htm a8.txt a9.htm a9.txt a10.htm a10.txt
Data Set 3
a1.htm a1.txt a2.htm a2.txt a3.htm a3.txt a4.htm a4.txt a5.htm a5.txt a6.htm a6.txt a7.htm a7.txt a8.htm a8.txt a9.htm a9.txt a10.htm a10.txt
Data Set 4
a1.htm a1.txt a2.htm a2.txt a3.htm a3.txt a4.htm a4.txt a5.htm a5.txt a6.htm a6.txt a7.htm a7.txt a8.htm a8.txt a9.htm a9.txt a10.htm a10.txt
Download all

Remember: in addition to these articles, your systems will be tested against an unknown set of 10 Wikipedia articles from a different category.

When we run your systems, we will pass the .txt filenames on the command line. However, the .htm files will also be available alongside the raw text files, so your program can use the HTML markup if you like. For example, if we run ./ask s1/a1.txt 5, your program can safely read from s1/a1.htm instead.
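
As a minimal sketch of the filename swap described above, a program could derive the .htm path from the .txt argument it receives and fall back to the text file if the HTML is missing. The function names `preferred_source` and `load_article` are hypothetical, and the fallback and the UTF-8 decoding are assumptions rather than course requirements.

```python
from pathlib import Path


def preferred_source(txt_path: str) -> Path:
    """Return the sibling .htm file if it exists, otherwise the .txt path.

    Example: "s1/a1.txt" maps to "s1/a1.htm" when that file is present.
    """
    txt = Path(txt_path)
    htm = txt.with_suffix(".htm")  # same directory and stem, different extension
    return htm if htm.is_file() else txt


def load_article(txt_path: str) -> str:
    """Read the article from whichever file preferred_source() selects.

    Assumption: the files decode as UTF-8; undecodable bytes are replaced
    rather than raising an error.
    """
    source = preferred_source(txt_path)
    return source.read_text(encoding="utf-8", errors="replace")
```

With this in place, a script invoked as ./ask s1/a1.txt 5 could call `load_article(sys.argv[1])` and would get the HTML markup whenever s1/a1.htm is available.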

Questions and answers

Write sample Easy/Medium/Hard questions (tab-separated format)

Show questions and answers written by your team (tab-separated format)

Question/answer pairs from previous years (may be useful in training or testing your systems, though the held-out test set will be different from all of these)