SUMMARISATION CORPORA
This page describes how to obtain single-document and multi-document
summarisation corpora. Details for each corpus are given below.
1- Essex Arabic Summaries Corpus (EASC)
Click the link below for a copy of the EASC corpus:
Download EASC Corpus
About EASC:
EASC is an Arabic natural language resource. It contains 153 Arabic
articles and 765 human-generated extractive summaries of those articles.
These summaries were generated using Mechanical Turk (http://www.mturk.com/).
Among the major features of EASC are:
- File names and extensions are formatted to be compatible with current
evaluation systems such as ROUGE and AutoSummENG.
- The corpus is available in two encodings: UTF-8 and ISO-8859-6 (Arabic).
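Because the corpus is distributed in two encodings, a reader script may not know in advance which copy it has been given. The following is a minimal sketch, not part of the EASC distribution, that tries UTF-8 first and falls back to ISO-8859-6. The function name `read_arabic_text` and the order of the candidate encodings are assumptions made for illustration. Trying UTF-8 first is a heuristic that relies on ISO-8859-6 Arabic text rarely forming valid UTF-8.

```python
from pathlib import Path


def read_arabic_text(path, encodings=("utf-8", "iso-8859-6")):
    """Decode a file by trying each candidate encoding in order.

    Hypothetical helper: the name and the default encoding order are
    illustrative assumptions, not something shipped with the corpus.
    """
    raw = Path(path).read_bytes()
    for enc in encodings:
        try:
            # Strict decoding raises on bytes that are invalid in `enc`.
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode {path} with any of {encodings}")
```

If the encoding of a given copy is already known, passing it explicitly to `open(path, encoding=...)` is simpler and avoids the guesswork.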
The Essex Arabic Summaries Corpus (EASC) uses copyright material. Users
of the corpus are responsible for ensuring that they comply with the
terms of the copyrights that apply to the source material and the
derived works (summaries) and the terms of relevant copyright law.
Any other original data that is distributed with this corpus is made
available under the Creative Commons Attribution-ShareAlike license (http://creativecommons.org/licenses/by-sa/3.0/).
You must provide details of the source of the material when using it.
_______________________________________________________
2- Multi-document Summarisation Dataset
The dataset is derived from publicly available WikiNews
(http://www.wikinews.org/) English texts.
The source texts were under CC Attribution Licence V2.5 (cf.
http://creativecommons.org/licenses/by/2.5/).
Texts in other languages have been translated by native speakers of each
language.
The documents hold no metadata or tags: they consist of plain text files
encoded in UTF-8 (without a Byte Order Mark, BOM).
Tables and formatting have been removed.
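The file format described above (plain text, UTF-8, no BOM) can be checked when loading a document. The sketch below is an illustration only. The function name `read_utf8_no_bom` is hypothetical and no particular directory layout for the dataset is assumed.

```python
import codecs
from pathlib import Path


def read_utf8_no_bom(path):
    """Read a plain-text file, checking that it is UTF-8 without a BOM.

    Hypothetical helper written for illustration; it is not part of the
    dataset or of any TAC tooling.
    """
    raw = Path(path).read_bytes()
    if raw.startswith(codecs.BOM_UTF8):
        raise ValueError(f"{path} starts with a UTF-8 byte order mark")
    # Strict decoding: raises UnicodeDecodeError if the bytes are not UTF-8.
    return raw.decode("utf-8")
```

A file that passes this check can be handed directly to a tokeniser or summariser without stripping any leading marker.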
The dataset contains 700 files, 100 for each of the following
languages:
- Arabic
- Czech
- English
- French
- Greek
- Hebrew
- Hindi
Visit the following for the corpora license agreement and description:
(http://www.nist.gov/tac/2011/Summarization/README.MultiLing2011)
Please visit the Text Analysis Conference (TAC) website for details on
how to get the corpora:
TAC Website:
(http://www.nist.gov/tac/)
MultiLing Dataset:
(http://www.nist.gov/tac/2011/Summarization/index.html)
MultiLing Webpage:
(http://users.iit.demokritos.gr/~ggianna/TAC2011/MultiLing2011.html)
LINKS
Find me on LinkedIn
Mahmoud El-Haj LinkedIn
Follow my work on ResearchGate
Mahmoud El-Haj ResearchGate
School of Computer Science and Electronic Engineering
Essex University
_______________________________
CONTACT
E-mail:
melhaj@essex.ac.uk
Office:
Computer Science Dept. 5B-531-NLE
Address:
University of Essex
Wivenhoe Park, Colchester, CO4 3SQ