Unannotated corpora

Corpora specific to a text type

  • Arabic Wikipedia XML corpus: the 30 most frequent categories of the Arabic Wikipedia XML corpus gathered by Ludovic Denoyer and Patrick Gallinari were selected to provide a testbed for the single-label categorization task in Arabic. (Universidad Politécnica de Valencia)
  • Blogs Clustering Corpus: a set of corpora made up of discussion threads extracted from two blog websites, Boing Boing and Slashdot. The aim of this corpus is to support experiments with supervised and unsupervised classifiers on narrow-domain short texts, specifically in the medical field, with documents related to the “cancer” topic. (Universidad Politécnica de Valencia)
  • CesCa (Catalan and Spanish): CesCa is a Catalan corpus of school writing produced by 2,400 pupils between the ages of five and sixteen. Each informant wrote several types of text: vocabulary lists, narrative and definition texts, and jokes. (Universitat de Barcelona)

Corpora for specific applications

  • Arabic QA: the document collection consists of Arabic newswire articles collected from the Web. We manually built a set of Arabic questions to ensure that each question has a correct answer in the documents. The proportions of the question types (factoid, list, …) are similar to those used in CLEF 2006. (Universidad Politécnica de Valencia)
  • Question Classification Corpus: 6,000 questions in Spanish.
  • CICLing-2002 Clustering Corpus: a pre-processed version of 48 scientific abstracts from the CICLing 2002 conference (computational linguistics). The aim of this corpus is to support experiments with supervised and unsupervised classifiers on narrow-domain short texts. (Universidad Politécnica de Valencia together with other institutions)
  • Cross-Lingual Plagiarism Corpus: the CliPA corpus was created as a resource for designing and testing methods for the automatic detection of cross-lingual plagiarism. It contains a set of original text fragments in English and around twelve different plagiarised versions of them in Spanish (Italian will be added soon). The plagiarised fragments were produced both by “human plagiarisers” and by machine translators. To create a realistic plagiarism detection environment, the corpus also includes a set of text fragments on the same topic that were originally written in Spanish. (Universidad Politécnica de Valencia)
  • KnCr clustering corpus: a new narrow-domain short-text corpus in the medical domain, built by downloading the latest sample of documents provided in MEDLINE and selecting only those related to the “Cancer” domain. (Universidad Politécnica de Valencia)
  • Single-label hep-ex Clustering Corpus: a pre-processed version of hep-ex, the collection of scientific abstracts compiled by the University of Jaén. The aim of the pre-processed version is to support experiments with supervised and unsupervised classifiers on narrow-domain short texts. (Universidad de Jaén, Universidad Politécnica de Valencia and other institutions)

Corpora annotated with linguistic information


  • AnCora-CA and AnCora-ES (an extension of the earlier corpora CESS-ECE and 3LB) are multilevel annotated corpora of Catalan and Spanish, consisting of 500,000 words per language, mostly from newspaper articles. AnCora is annotated with morphological (PoS), syntactic (constituents and functions) and semantic (argument structure and thematic roles, semantic class, named entities and WordNet senses) information. All resulting layers are independent of each other, which makes data management easier. The annotation was performed manually, semi-automatically or fully automatically, depending on the linguistic information encoded. (Universitat de Barcelona)
  • AnCora-CO-CA and AnCora-CO-ES are subsets of the multilevel annotated corpus AnCora, consisting of 400,000 words per language and enriched with coreference information: all noun phrases (NPs), whether pronominal or with a nominal head, that point to the same entity are linked. (Universitat de Barcelona)
  • ANCORA-DEP-CA and ANCORA-DEP-ES can be used as a source of information for inducing grammars and for developing, improving and/or evaluating dependency-based syntactic parsers and semantic role labelling algorithms. This corpus was used in the CoNLL Shared Task 2009: Syntactic and Semantic Dependencies in Multiple Languages, where the core of the task is to predict syntactic and semantic dependencies and their labelling. (Universitat de Barcelona)
  • DeliciousT140: a collection of 144,574 web documents in English with their corresponding tag information, extracted from Delicious in June 2008 from the feeds that the site offered for its 140 most popular tags. (Universidad Nacional de Educación a Distancia)

Corpora annotated for specific applications

  • GeoSemCor2.0: the SemCor corpus labeled with WordNet 2.0 synsets, enriched with labels for synsets that are related to geographical entities.
  • Emoticorpus: an annotated corpus of quotes from the Italian Wikiquote collection. It can be used for experiments on Automatic Humour Recognition (AHR). (Universidad Politécnica de Valencia)
  • ANERcorp: an Arabic NER corpus consisting of 150,000 tokens. (Universidad Politécnica de Valencia)
  • AQA: a multilingual collection of questions and answers in Spanish, Italian and English (based on the CLEF 2008 collection), manually annotated with anaphoric relations. (Universidad de Alicante)


  • ANERgazet: a set of three Arabic gazetteers (people, locations and organizations), intended mainly for the Arabic NER task but also usable for other Arabic NLP tasks. (Universidad Politécnica de Valencia)
  • AnCoraVerb-CA and AnCoraVerb-ES are two verbal lexicons containing 2,143 and 2,603 different verbs for Catalan and Spanish respectively. In the AnCora-Verb lexicons, the mapping between the syntactic functions, arguments and thematic roles of each verbal predicate is established by taking into account the verbal semantic class and the diathesis alternations in which the predicate can participate. Each verbal predicate may be divided into different senses, where each sense is related to one or more semantic classes (Lexical Semantic Structures). These classes are differentiated basically according to the four event classes (accomplishments, achievements, states and activities) and to the diathesis alternations in which a sense can occur. (Universitat de Barcelona)
  • EuroWordNet: a multilingual database with wordnets for several European languages (Dutch, Italian, Spanish, German, French, Czech and Estonian). (Universidad Politécnica de Cataluña, Universitat de Barcelona and Universidad Nacional de Educación a Distancia)
  • MCR (Multilingual Central Repository): the MCR integrates wordnets from five different languages (together with four English WordNet versions) into the same EuroWordNet framework. It also integrates WordNet Domains (Magnini and Cavaglià, 2000) and new versions of the Base Concepts and the Top Concept Ontology. The final version of the MCR contains 1,642,389 semantic relations between synsets, most of them acquired by automatic means. (Universidad Politécnica de Catalunya together with other members of the MEANING project)
  • Arabic WordNet (AWN): a lexical database of the Arabic language that follows the development process of Princeton English WordNet and EuroWordNet. (Universidad Politécnica de Cataluña, Universidad de Barcelona together with other institutions)


Specific semantic domains

  • Geo-WordNet: a semi-automatically generated mapping from WordNet 2.0 to geographical coordinates. (Universidad Politécnica de Valencia)
  • OntoFis: an intermediate-granularity ontology formalized in OWL-DL. It has been used to label 45,000 terms related to the concepts 'chemical composition', 'active ingredient' and 'name of the drugs marketed in Spain'. (Universidad de Alicante)

Third-party resources


Lexicons and thesauri


  • Unified Medical Language System (UMLS): made up of three resources: Metathesaurus, Semantic Network and Lexicon. It contains 140 vocabularies (MeSH and DeCS among others) in 17 languages. The 2008AA release contained more than one and a half million concepts and close to 8 million normalized names for those concepts. Its semantic network consists of 135 general concept types. Spanish accounts for 4.14% of this resource.
recursos.txt · Last modified: 2009/11/02 11:06 by textmess
CC Attribution-Noncommercial-Share Alike 3.0 Unported