INEX 2010 Link-The-Wiki Task and Result Submission Specification

Introduction

Link-the-Wiki in 2010

The Link-the-Wiki track aims to produce a standard procedure and metrics for the evaluation of link discovery between documents. Given a new “orphan” (unlinked) document, the task is to analyse the text and recommend a set of outgoing links from anchors (specified as passages in the orphan document) to Best Entry Points (BEPs) in existing documents in the collection. The BEP for a link should be the position in the target document from which the reader, having just followed the link, should begin reading.

History

Until 2009, successive versions of the INEX Wikipedia Collection were used for the Link-the-Wiki track. In 2009, the INEX 2009 Wikipedia Collection was used alongside the Te Ara Encyclopedia Collection, the latter of which has no existing links. Separate tasks were run for linking within each collection and linking between the two collections. In 2010, only the Te Ara collection is being used.

Task

One task will be run:

  1. Link-Te-Ara: Identify anchor-to-BEP links within the Te Ara Encyclopedia Collection. All topics will be used, and a subset of them will be chosen for evaluation.

Topics

The Te Ara collection can be downloaded from the INEX website.

All topics (documents) in the collection can be found in a single file, named xml.xml, within the archive. This file is an XML dump from the SQL database that backs the Te Ara website, and each topic is contained within a <row> tag in the file.

When calculating offsets (explained in the Result Submission section) of anchors and BEPs, count the number of non-XML characters (characters within text nodes) from the beginning of the entire file, not the beginning of the relevant topic.
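
As a rough illustration of this offset convention, the following Python sketch accumulates the lengths of all text nodes from the start of the file and records where each <row> begins. It assumes that offsets count Unicode characters after entity references have been expanded, and that whitespace between tags counts as text. XML2FOL remains the reference implementation, so its output takes precedence wherever the two disagree. The function name and the sample data are hypothetical.

```python
import xml.parsers.expat


def text_and_row_offsets(xml_data):
    """Return (text, row_offsets) for an XML document.

    text is the concatenation of every text node in document order;
    row_offsets lists the text offset at which each <row> element starts.
    """
    chunks = []
    row_offsets = []
    position = 0  # characters of text seen so far, counted from file start

    def on_start(name, attrs):
        if name == "row":
            row_offsets.append(position)

    def on_text(data):
        nonlocal position
        chunks.append(data)
        position += len(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.CharacterDataHandler = on_text
    parser.Parse(xml_data, True)
    return "".join(chunks), row_offsets


# Hypothetical two-topic file; the real collection is the xml.xml dump.
sample = (b"<rows><row><title>Kiwi</title><body>A bird &amp; icon</body></row>"
          b"<row><title>Tui</title></row></rows>")
text, rows = text_and_row_offsets(sample)
anchor_offset = text.find("bird")  # offset from the start of the whole file
```

Because the count runs over the whole file, an anchor in the second topic has an offset that includes all of the text in the first.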

Rules

A submission should include all of the topics in the collection. Missing topics will be regarded as having a score of zero for the purpose of evaluation.

All links should be given as “outgoing” links, i.e. the list of links for a given topic should include only links whose anchors are in that topic. In previous years, incoming links were also allowed, but this is not necessary when all of the documents in the collection are being linked.

Each topic may have up to 50 anchors, and each anchor may have up to 5 BEPs, each in different target documents.
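
The limits above could be checked before submission with a short Python sketch like the one below. It is not an official tool. It assumes that the text content of each tobep element identifies the target document, as the example submission suggests, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

MAX_ANCHORS_PER_TOPIC = 50
MAX_BEPS_PER_ANCHOR = 5


def check_limits(submission_xml):
    """Return a list of human-readable rule violations (empty if none)."""
    problems = []
    root = ET.fromstring(submission_xml)
    for topic in root.iter("topic"):
        topic_id = topic.get("file")
        anchors = list(topic.iter("anchor"))
        if len(anchors) > MAX_ANCHORS_PER_TOPIC:
            problems.append(f"topic {topic_id}: {len(anchors)} anchors")
        for anchor in anchors:
            # Assumption: the tobep text is the target document identifier.
            targets = [(bep.text or "").strip() for bep in anchor.findall("tobep")]
            where = f"topic {topic_id}, anchor at offset {anchor.get('offset')}"
            if len(targets) > MAX_BEPS_PER_ANCHOR:
                problems.append(f"{where}: {len(targets)} BEPs")
            if len(set(targets)) != len(targets):
                problems.append(f"{where}: repeated target document")
    return problems


# Hypothetical run fragment with six BEPs, two of them in the same document.
beps = "".join(f'<tobep offset="0">{doc}</tobep>' for doc in [1, 2, 3, 4, 5, 5])
sample = ('<inexltw-submission><topic file="9638"><outgoing>'
          f'<anchor offset="0" length="1">{beps}</anchor>'
          '</outgoing></topic></inexltw-submission>')
problems = check_limits(sample)
```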

Useful Tools

To assist with the calculation of offsets and lengths, two programs are provided: XML2FOL, which outputs the offsets and lengths of the nodes in a given XML file, and XML2TXT, which converts an XML document into a text-only document. The XML2FOL program serves as the reference implementation for the File/Offset/Length format.

Another program will soon be available for checking runs before submission. It will verify that all the required run details have been specified, and it will check that each anchor's offset and length match the specified anchor text. It cannot catch all errors, so participants remain responsible for ensuring that their programs produce sensible output.
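
Until that program is released, the offset-and-length check it describes can be approximated with a sketch such as the following. This is not the official checker. It assumes that `text` holds the text-only form of the collection and that character positions in it coincide with submission offsets. The function name and the sample values are hypothetical.

```python
def find_anchor_mismatches(text, anchors):
    """Return the anchors whose offset and length do not select their name.

    anchors is an iterable of (offset, length, name) tuples; each result
    is (offset, length, name, actual), where actual is the text found.
    """
    mismatches = []
    for offset, length, name in anchors:
        # Negative or empty spans can never match a non-empty anchor text.
        actual = text[offset:offset + length] if offset >= 0 and length > 0 else ""
        if actual != name:
            mismatches.append((offset, length, name, actual))
    return mismatches


# Hypothetical text and anchors: the second anchor points at the wrong span.
text = "Hot air balloons rise"
mismatches = find_anchor_mismatches(text, [(8, 8, "balloons"), (0, 3, "air")])
```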

Assessment

A selection of topics will be manually assessed by INEX participants. Each topic will be assigned to a participant, who will use the provided assessment GUI to specify whether each link is deemed to be relevant or not. These assessments will then be used to evaluate the results.

Result Submission

Format

Results are to be submitted in the following XML format. It is identical to last year's format for the Link-Te-Ara task, except that certain elements and attributes are now optional, and the offset calculation is different.

Example

An example of a submission in the correct format is given below:

<inexltw-submission participant-id="12"
   run-id="Otago_LTeAraA2B_01"
   task="LTeAra">
   <details>
      <machine>
         <cpu>Intel Celeron</cpu>
         <speed>1.06GHz</speed>
         <cores>1</cores>
         <hyperthreads>1</hyperthreads>
         <memory>128MB</memory>
      </machine>
      <time>3.04 seconds</time>
   </details>
   <description>Describe the approach here, NOT in the run-id.</description>
   <collections>
      <collection>TeAra_2010_Collection</collection>
   </collections>
   <topic file="9638" name="Matariki – Māori New Year">
      <outgoing>
         <anchor offset="7445748" length="8" name="balloons">
            <tobep offset="7952293">10151</tobep>
            <tobep offset="10553520">12991</tobep>
            <tobep offset="11686141">14270</tobep>
            <tobep offset="8016276">10208</tobep>
            <tobep offset="7226359">9363</tobep>
         </anchor>
         ...
      </outgoing>
   </topic>
</inexltw-submission>
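
A run in this shape can be generated programmatically. The Python sketch below builds a reduced version of the example with the standard library's ElementTree. The function name, its parameters, and the layout of the `topics` argument are illustrative assumptions, and only the time element is written under details to keep the sketch short.

```python
import xml.etree.ElementTree as ET


def build_submission(participant_id, run_id, task, run_time,
                     description, collection, topics):
    """Serialise a run to XML.

    topics maps a topic file id to a list of (offset, length, name, beps)
    tuples, where beps is a list of (target_file, bep_offset) pairs.
    """
    root = ET.Element("inexltw-submission", {
        "participant-id": str(participant_id),
        "run-id": run_id,
        "task": task,
    })
    details = ET.SubElement(root, "details")
    ET.SubElement(details, "time").text = run_time
    ET.SubElement(root, "description").text = description
    collections = ET.SubElement(root, "collections")
    ET.SubElement(collections, "collection").text = collection
    for topic_file, anchors in topics.items():
        topic = ET.SubElement(root, "topic", {"file": str(topic_file)})
        outgoing = ET.SubElement(topic, "outgoing")
        for offset, length, name, beps in anchors:
            anchor = ET.SubElement(outgoing, "anchor", {
                "offset": str(offset), "length": str(length), "name": name})
            for target_file, bep_offset in beps:
                tobep = ET.SubElement(anchor, "tobep", {"offset": str(bep_offset)})
                tobep.text = str(target_file)
    return ET.tostring(root, encoding="unicode")


# Values copied from the example above; only two of its BEPs are included.
run_xml = build_submission(
    12, "Otago_LTeAraA2B_01", "LTeAra", "3.04 seconds",
    "Describe the approach here, NOT in the run-id.",
    "TeAra_2010_Collection",
    {"9638": [(7445748, 8, "balloons",
               [("10151", 7952293), ("12991", 10553520)])]},
)
```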

DTD

The DTD for the submission format is given below:

<!ELEMENT inexltw-submission (details, description, collections, topic+)>
<!ATTLIST inexltw-submission
   participant-id CDATA #REQUIRED
   run-id CDATA #REQUIRED
   task (LTeAra) #REQUIRED
>

<!ELEMENT details (machine|time)*>

<!ELEMENT machine (cpu|speed|cores|hyperthreads|memory)*>
<!ELEMENT cpu (#PCDATA)>
<!ELEMENT speed (#PCDATA)>
<!ELEMENT cores (#PCDATA)>
<!ELEMENT hyperthreads (#PCDATA)>
<!ELEMENT memory (#PCDATA)>

<!ELEMENT time (#PCDATA)>

<!ELEMENT description (#PCDATA)>

<!ELEMENT collections (collection+)>
<!ELEMENT collection (#PCDATA)>

<!ELEMENT topic (outgoing|anchor+)>
<!ATTLIST topic
   file CDATA #REQUIRED
   name CDATA #IMPLIED
>

<!ELEMENT outgoing (anchor+)>

<!ELEMENT anchor (tobep+)>
<!ATTLIST anchor
   name CDATA #IMPLIED
   offset CDATA #REQUIRED
   length CDATA #REQUIRED
>

<!ELEMENT tobep (#PCDATA)>
<!ATTLIST tobep
   offset CDATA #REQUIRED
>