Mahmoud El-Haj
PhD candidate at the School of Computer Science and Electronic Engineering
Essex University

SUMMARISATION CORPORA

This page explains how to obtain single-document and multi-document summarisation corpora. Details for each corpus are given below.

1- Essex Arabic Summaries Corpus (EASC)

Click the link below to download a copy of the EASC corpus:
Download EASC Corpus

About EASC:
EASC is an Arabic natural language resource. It contains 153 Arabic articles and 765 human-generated extractive summaries of those articles. The summaries were generated using Mechanical Turk (http://www.mturk.com/).

The major features of EASC include:
- File names and extensions are formatted to be compatible with current evaluation systems such as ROUGE and AutoSummENG.
- The corpus is available in two encodings: UTF-8 and ISO-8859-6 (Arabic).
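
Since the corpus ships in two encodings, it can be useful to convert text from one to the other before feeding it to a tool that expects only UTF-8. The following is a minimal Python sketch, not part of the corpus distribution. The function name `transcode_to_utf8` and the sample string are assumptions made for illustration.

```python
def transcode_to_utf8(raw: bytes, source_encoding: str = "iso-8859-6") -> bytes:
    """Decode bytes from the source encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Illustrative sample (not taken from the corpus): an Arabic word
# encoded as ISO-8859-6, one byte per letter.
sample = "سلام".encode("iso-8859-6")
utf8_bytes = transcode_to_utf8(sample)

# The same text now uses two bytes per letter in UTF-8.
print(len(sample), len(utf8_bytes))
```

To convert a file, read it in binary mode (`open(path, "rb")`), pass the bytes through the function, and write the result to a new file in binary mode.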

The Essex Arabic Summaries Corpus (EASC) uses copyright material. Users of the corpus are responsible for ensuring that they comply with the terms of the copyrights that apply to the source material and the derived works (summaries) and the terms of relevant copyright law.

Any other original data distributed with this corpus is made available under the Creative Commons Attribution-ShareAlike license (http://creativecommons.org/licenses/by-sa/3.0/). You must provide details of the source of the material when using it.
_______________________________________________________

2- Multi-document Summarisation Dataset

The dataset is derived from publicly available WikiNews (http://www.wikinews.org/) English texts.

The source texts were under CC Attribution Licence V2.5 (cf. http://creativecommons.org/licenses/by/2.5/).
Texts in other languages have been translated by native speakers of each language.

The documents hold no meta-data or tags: they consist of plain text files encoded in UTF-8 (without a byte order mark, BOM).
Tables and formatting have been removed.
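
The encoding claim above can be checked on a downloaded file with a few lines of Python. This is a hedged sketch: the helper name `is_bomless_utf8` is an assumption for illustration and is not part of the dataset or its tooling.

```python
import codecs


def is_bomless_utf8(raw: bytes) -> bool:
    """Return True if raw is valid UTF-8 and does not start with a UTF-8 BOM."""
    if raw.startswith(codecs.BOM_UTF8):
        return False
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Illustrative inputs (not taken from the dataset).
print(is_bomless_utf8("plain text".encode("utf-8")))
print(is_bomless_utf8(codecs.BOM_UTF8 + b"plain text"))
```

To check a file, read it with `open(path, "rb").read()` and pass the bytes to the function. If a file does carry a BOM, decoding it with Python's `utf-8-sig` codec strips the mark.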

The dataset contains 700 files, 100 for each of the following languages:
- Arabic
- Czech
- English
- French
- Greek
- Hebrew
- Hindi

See the following README for the corpora license agreement and description:
(http://www.nist.gov/tac/2011/Summarization/README.MultiLing2011)

Please visit the Text Analysis Conference (TAC) website for details on how to get the corpora:

TAC Website:
(http://www.nist.gov/tac/)

MultiLing Dataset:
(http://www.nist.gov/tac/2011/Summarization/index.html)

MultiLing Webpage:
(http://users.iit.demokritos.gr/~ggianna/TAC2011/MultiLing2011.html)

 
LINKS

Find me on LinkedIn
Mahmoud El-Haj LinkedIn

Follow my work on ResearchGate
Mahmoud El-Haj ResearchGate

School of Computer Science and Electronic Engineering

Essex University

_______________________________

CONTACT

E-mail:
melhaj@essex.ac.uk

Office:
Computer Science Dept. 5B-531-NLE

Address:
University of Essex
Wivenhoe Park, Colchester, CO4 3SQ

 
Copyright © 2012-2013 Mahmoud El-Haj