Skip to content

Commit

Permalink
* initialize repository with original files
Browse files Browse the repository at this point in the history
- from http://wordnetcode.princeton.edu/glosstag.shtml
- split noun.xml into noun-0.xml and noun-1.xml because of github's
  file size limit, but otherwise exactly same the same content
  • Loading branch information
odanoburu committed Apr 18, 2019
0 parents commit 71e11d4
Show file tree
Hide file tree
Showing 8,253 changed files with 19,157,459 additions and 0 deletions.
The diff you're trying to view is too large. We only load the first 3000 changed files.
32 changes: 32 additions & 0 deletions LICENSE.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
WordNet Release 3.0

This software and database is being provided to you, the LICENSEE, by
Princeton University under the following license. By obtaining, using
and/or copying this software and database, you agree that you have
read, understood, and will comply with these terms and conditions.:

Permission to use, copy, modify and distribute this software and
database and its documentation for any purpose and without fee or
royalty is hereby granted, provided that you agree to comply with
the following copyright notice and statements, including the disclaimer,
and that the same appear on ALL copies of the software, database and
documentation, including modifications that you make for internal
use or for distribution.

WordNet Gloss Disambiguation Project Copyright 2008
by Princeton University. All rights reserved.

THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON
UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON
UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT-
ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE
OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT
INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR
OTHER RIGHTS.

The name of Princeton University or Princeton may not be used in
advertising or publicity pertaining to distribution of the software
and/or database. Title to copyright in this software, database and
any associated documentation shall at all times remain with
Princeton University and LICENSEE agrees to preserve same.
583 changes: 583 additions & 0 deletions README.txt

Large diffs are not rendered by default.

313 changes: 313 additions & 0 deletions dtd/glosstag.dtd
Original file line number Diff line number Diff line change
@@ -0,0 +1,313 @@
<!--=======================================================================
File: glosstag.dtd
Date: 04/01/2008
Synopsis: DTD for WordNet Gloss Disambiguation project.
=======================================================================-->

<!ENTITY % id "id CDATA #REQUIRED">
<!--Unique identifier-->

<!ELEMENT wordnet (synset)+>
<!ATTLIST wordnet ver CDATA #REQUIRED>

<!ELEMENT synset (terms?, keys?, gloss, gloss, gloss)>
<!ATTLIST synset pos CDATA #REQUIRED
ofs CDATA #REQUIRED
%id;>
<!--Wrapper for the synset, carries the pointer to the offset in the WordNet db file for the synset. A synset contains synset terms, sense keys, and 3 versions of the gloss.

@pos contains one of n, v, a, or r for noun, verb, adj, or adv and
indicates which of the db files the synset is found in.

@ofs contains the byte offset in data.pos file for the synset.
-->

<!ELEMENT gloss (((classif | aux)*, def, aux*, ex*) | text | orig)>
<!ATTLIST gloss desc (wsd | text | orig) #REQUIRED>
<!--This is the wrapper for the contents of a gloss.

@desc="orig" means this contains a copy of the original WordNet
gloss in the child <orig> tag.

@desc="text" means this contains the tokenized text of the gloss
in the child <text> tag.

@desc="wsd" is assigned to disambiguated glosses. Minimally, this must
contain a definition. It can be preceded by domain classification and/or
auxiliary info (usually in parens, but not always), and optionally followed
by more auxiliary info and zero or more examples. Contents of classif and
aux (with the exception of verb arguments) are not sense-tagged. Only the
synset terms in ex's are sense-tagged.

-->

<!ELEMENT def (aux | wf | cf | mwf | qf)*>
<!ATTLIST def %id;>
<!-- Wraps the definition text. With the exception of the contents of aux with
@tag="ignore", and stoplist words/collocations, all open class words
and collocations within defs are sense-tagged.
-->

<!ELEMENT aux (wf | cf | mwf | qf)+>
<!ATTLIST aux tag (ignore) #IMPLIED
type (arg) #IMPLIED>
<!--This tag demarcates auxiliary info. <aux> contents are always secondary to
the main sense of the synset. aux info generally precedes or follows the
def, but can also be embedded within the def text. There are two kinds of
auxiliary info, @tag="ignore" and @type="arg". Those assigned @tag="ignore"
contain mainly grammatical or usage information, some qualifying text such
as a year born, time period, or date range, or a chemical or other symbol.
The contents of @tag="ignore" aux's are not sense-tagged. aux's that
are assigned @type="arg" only appear in verb glosses, and contain
the argument, or typical argument, for the preceding verb. They are set
off in this way so that the syntax of the definition fits that of the
lemma (defining verb is intransitive if the lemma is intransitive). The
contents of @type="arg" aux's are sense-tagged.
-->

<!ELEMENT classif (wf | cf | mwf)+>
<!ATTLIST classif type (cat | use | reg | unk) #REQUIRED>
<!--This is a wrapper for domain classification info preceding a def. For
the purposes of tagging/WSD, this text is "ignorable", as it is repeated
in the usage, region, and category pointers for the synset.

The classification types for glosses are "cat" (domain category), "use"
(usage), and "reg" (region). "unk" is for unknown.
-->

<!ELEMENT ex (wf | cf | mwf | qf)+>
<!ATTLIST ex %id;>
<!--ex's contain a single sentence exemplifying one of the synset words or
collocations. Only the synset terms in ex's have been
sense-tagged.
-->

<!ELEMENT qf (wf | cf | mwf | qf)+>
<!ATTLIST qf rend (sq | dq) #IMPLIED>
<!--A qf delimits single- and double-quoted forms, replacing the actual
left and right-hand quote marks. The @rend attribute indicates whether
the quote marks are to be rendered as single (sq) or double (dq).
-->

<!ELEMENT id EMPTY>
<!ATTLIST id sk CDATA #REQUIRED
lemma CDATA #REQUIRED
coll CDATA #IMPLIED
%id;>
<!--id is the sense tag for the parent element (a wf or glob).

The <id/> holds the sense key for the synset referenced in the @sk attribute,
and the lemma form corresponding to the lexeme it refers to in @lemma.
The lemma here is the WordNet entry form, without wn pos appended. Case
is preserved so that it matches exactly the WordNet entry form (unlike
on the sense key).

@coll points to the collocation (ie, <glob> tag) that the <id> is the sense tag
for. @coll on the <id> matches the @coll on <glob>.

<id/>'s are assigned during sense-tagging, and will only exist for
disambiguated words/collocations (that is, only words not within auxiliary
text, or on the stoplist, or in example sentences when not a synset
word/collocation).

If a word or collocation is tagged to more than one sense, it will have
more than one id, one id per sense tag.
-->

<!ELEMENT mwf (wf | cf)+>
<!ATTLIST mwf type (date | drange | nrange| num | time | curr | meas | math | other) #REQUIRED
>
<!--mwf's delimit multi-word forms belonging to a number of semantic classes,
which were automatically assigned during preprocessing. The class is
indicated by the value in the mwf's @type attribute. All or part of the
mwf may be sense tagged to WordNet entry words/collocations.

Attributes:

type="date" indicates the mwf is a date.
type="drange" indicates the mwf is a date range.
type="nrange" indicates the mwf is a numeric range other than a (recognizable) date.
type="num" indicates the mwf is a numeric form.
type="time" indicates the mwf is a time.
type="curr" indicates the mwf is currency.
type="meas" indicates the mwf is a measurement.
type="math" indicates the mwf is a formula or other mathematical form.
type="other" indicates other multi-word forms (eg., groupings of symbols).
-->

<!ELEMENT wf (#PCDATA | id)*>
<!ATTLIST wf tag (un | auto | man | ignore) "un"
lemma CDATA #IMPLIED
pos CDATA #IMPLIED
type (punc | year | chem | num | time | symb | curr | math | abbr | acronym) #IMPLIED
rdf CDATA #IMPLIED
sep CDATA #IMPLIED
%id;>
<!--A wf is a single-word form (or punctuation) that is not part of a
WordNet collocation.

@lemma contains all potential lemmas for the orthographic form, where the
lemma is the WordNet entry form of the word with wn pos appended (eg.
the form "flies" has 3 potential lemmas: the noun fly, the verb fly, and
the noun flies, hence, @lemma="flies%1|fly%1|fly%2").

Neither @lemma nor <id/> are required (as @type="punc" wfs do not get
assigned a lemma, and not all wfs are sense-tagged). @lemma remains
unchanging for the duration of the auto/manual sense tagging phases.

@sep contains the character separating this wf from the next in print. Valid
values for the @sep attribute are "-", "", and " ", for hyphenated words not
in wn, cases where no space follows the <wf>, and the default case,
respectively. The default value for @sep is a space, not explicitly assigned.

@type is assigned to wf's that are punctuation, abbreviations, acronyms,
or belong to one of a small set of semantic classes listed below.

@pos was automatically assigned to wf/cfs within <def> only.

Attribute values for tag:

tag="un" indicates the wf has not been sense-tagged (ie, is untagged)
tag="auto" indicates the wf was automatically disambiguated
tag="man" indicates the wf was manually disambiguated
tag="ignore" indicates that the wf is to be ignored during disambiguation

Attribute values for type:

type="punc" indicates the wf is punctuation.
type="year" indicates the wf is a year.
type="chem" indicates the wf is a chemical name.
type="num" indicates the wf is a number.
type="time" indicates the wf is a time.
type="math" indicates the wf is mathematical symbol, variable, etc.
type="symb" indicates the wf is a symbol.
type="curr" indicates the wf is currency.
type="abbr" indicates the wf is an abbreviation.
type="acronym" indicates the wf is an acronym.

Pos values (a variation on Penn Treebank's tagset):

( open paren
) close paren
, comma
. final punc (stop, question mark, exclamation point)
... ellipsis
: colon, semicolon, emdash
CC coordinating conjunction
CD number (spelled out or numeral)
DT determiner
FW foreign or unknown word
IN preposition
JJ adjective
JJR adjective, comparative
JJS adjective, superlative
MD modal
NN noun, singular or mass
NNP proper noun, singular or plural
NNS noun, plural
PDT predeterminer
PRP personal pronoun
PRP$ possessive pronoun
RB adverb
RBR adverb, comparative
RBS adverb, superlative
RP particle
SYM symbol
TO to
UH interjection
VB verb, base form or present tense
VBD verb, past tense
VBG verb, gerund/present participle
VBN verb, past participle
VBP verb, non-3rd person sing present
VBZ verb, 3rd person sing present
WDT wh-determiner
WP wh-pronoun
WP$ possessive wh-pronoun
WRB wh-adverb

-->

<!ELEMENT cf (#PCDATA | glob)*>
<!ATTLIST cf tag (un | ignore) #REQUIRED
lemma CDATA #IMPLIED
pos CDATA #IMPLIED
type (punc | year | chem | num | time | symb | curr | math | abbr | acronym) #IMPLIED
coll CDATA #REQUIRED
rdf CDATA #IMPLIED
sep CDATA #IMPLIED
%id;>
<!--A cf is a wf that is a collocation form, i.e., it is part of a WordNet collocation.
A collocation may be contiguous or non-contiguous. All cf's for the same collocation
are linked together via the @coll attribute. The "head" cf (i.e., the first word
in the collocation) is the form that gets assigned the sense tag (which
is a child of the <glob> tag corresponding with that collocation). Thus,
the sense tag is not an immediate child of <cf>, it is a descendant of it.

Neither @lemma nor <glob/> are required (as @type="punc" cfs do not have
a lemma, and non-head cfs do not bear the <glob/> tag, unless they
are ALSO the head cf of another collocation). @lemma remains unchanging
for the duration of the auto/manual sense tagging phases.

The @coll attribute is required for all cfs, and contains a unique alpha
character that identifies all cf's belonging to the same collocation. If
the cf is a form in more than one collocation, @coll will contain a
comma-delimited list of alpha chars. @coll values are unique
within a gloss, that is, they start over with "a" for each gloss.

@sep contains the character separating this cf from the next in print. Valid
values for the @sep attribute are "-", "'", and "", for hyphenated words not
in wn, contractions that get split, and cases where no space follows the <cf>,
respectively. The default value for @sep is a space, not explicitly assigned.

@pos was automatically assigned to wf/cfs within <def> only.

Since the <glob> tag carries sense tag info, the @tag attribute on the cf does
not indicate man or auto.

Attribute values for tag:

tag="un" indicates the cf was not disambiguated on its own (ie.,
independent of the collocation)
tag="ignore" indicates the cf is a stoplist item

See wf for values for @type and @pos.

-->

<!ELEMENT glob (id)*>
<!ATTLIST glob tag (un | auto | man) #REQUIRED
glob (auto | man) #REQUIRED
lemma CDATA #REQUIRED
coll CDATA #REQUIRED
%id;>
<!--
Bears the lemma and sense tag info for a collocation. The @glob attribute
indicates whether the collocation was automatically or manually globbed.

@tag indicates if the glob has been sense-tagged, and if so, whether
it was sense-tagged automatically or manually.

@coll is the alpha identifier for the collocation to which this lemma
and sense tag "belongs". All cfs for a collocation will reference this
value in the cf's @coll attribute.

If the collocation has been sense tagged, it will contain one or more
<id/> children, one for each sense tag.
-->

<!ELEMENT text (#PCDATA)>
<!--Contains the tokenized text of the WordNet gloss, including definition and
example sentences. Paired left and right quotes are encoded in UTF-8. -->

<!ELEMENT orig (#PCDATA)>
<!--Contains the original WordNet version of the gloss, including
definition and example sentences, as it exists in the database files. -->

<!ELEMENT terms (term)+>
<!ELEMENT term (#PCDATA)>
<!--Contains WordNet synset terms-->

<!ELEMENT keys (sk)+>
<!ELEMENT sk (#PCDATA)>
<!--Contains WordNet synset keys for the synset terms-->
Loading

0 comments on commit 71e11d4

Please sign in to comment.