WordNet - A Lexical Database for English
This is a Racket FFI interface to Princeton University’s WordNet® library. The following excerpt from the WordNet website summarizes what WordNet is.
WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet’s structure makes it a useful tool for computational linguistics and natural language processing.
WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings. However, there are some important distinctions. First, WordNet interlinks not just word forms—strings of letters—but specific senses of words. As a result, words that are found in close proximity to one another in the network are semantically disambiguated. Second, WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus does not follow any explicit pattern other than meaning similarity.
1 Requirements
This package has been developed and tested on Mac OS X (10.10). The instructions here should largely be applicable for Linux and other forms of Unix. If you run into any issues, please contact the author.
2 Installing the WordNet library
The WordNet library is available from the WordNet website. The default build produces a static library, which the Racket FFI cannot load. Follow the instructions in this section to build a shared library instead.
Assume that ~/Downloads/WordNet-3.0 is the directory into which the tarball has been untar’d.
> cd ~/Downloads/WordNet-3.0
Edit the configure.ac file and add the following lines to it, after the line that says AC_PROG_INSTALL:
AC_ENABLE_SHARED
AC_DISABLE_STATIC
AC_PROG_LIBTOOL(libtool)
Edit the lib/Makefile.am file and replace its contents with the following:
lib_LTLIBRARIES = libWN.la
libWN_la_SOURCES = binsrch.c morph.c search.c wnglobal.c wnhelp.c wnrtl.c \
                   wnutil.c
libWN_la_CPPFLAGS = $(INCLUDES) -fPIC
libWN_la_LDFLAGS = -shared -fPIC
INCLUDES = -I$(top_srcdir) -I$(top_srcdir)/include
SUBDIRS = wnres
Now reconfigure and build the distribution, replacing the prefix to suit your installation:
> autoreconf -i
> ./configure --prefix="/usr/local"
> make
> sudo make install
The library and its associated data will now be installed under /usr/local.
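Once the install finishes, a quick way to confirm that Racket can see the shared library is to try loading it through the FFI. The sketch below is not part of this package. It assumes the library is named libWN (following lib_LTLIBRARIES = libWN.la above) and that the lib directory under your prefix is on the loader's search path; try-load-wordnet is a hypothetical helper name.

```racket
#lang racket/base
(require ffi/unsafe)

;; Hypothetical helper: returns an ffi-lib handle, or #f when the
;; library cannot be found, instead of raising an exception.
(define (try-load-wordnet [name "libWN"])
  (ffi-lib name #:fail (lambda () #f)))

(define wn-lib (try-load-wordnet))

(displayln
 (if wn-lib
     "libWN loaded"
     "libWN not found; check the install prefix and the loader search path"))
```

If the library lives outside the default search path, passing a full path such as "/usr/local/lib/libWN" to the helper may work instead.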
3 About the library
(require wn/wn)    package: base
The WordNet library consists of a few sections: Search, Morphology and Utilities. This Racket interface to the library leaves out some of the utilities because they are largely redundant. The documentation of the original C library functions is available on the WordNet website.
The library must be initialized before any of the functions can be used. The following function initializes the library.
4 High Level Interface
procedure
(<search-fn> word part-of-speech [#:recursive recursive?]) → (listof string?)
word : string?
part-of-speech : parts-of-speech?
recursive? : boolean? = #t
Here <search-fn> stands for any one of the following functions:
antonyms
hypernyms
hyponyms
entails
similars
member-meronyms
substance-meronyms
part-meronyms
member-holonyms
substance-holonyms
part-holonyms
meronyms
holonyms
causes
participles-of-verb
attributes
derivations
classifications
classes
synonyms
noun-coordinates
hierarchical-meronyms
hierarchical-holonyms
classification-categories
classification-usages
classification-regionals
class-categories
class-usages
class-regionals
instances-of
instances
procedure
(lemma word part-of-speech) → (or/c string? #f)
word : string?
part-of-speech : parts-of-speech?
procedure
(parts-of-speech? x) → boolean?
x : any/c
5 The C Library Interface
This section covers the lower-level C library interface. The high-level interface covers most of what is necessary, but should you need deeper access to the library, the following documentation should help.
5.1 Basic Type Definitions
procedure
(search-type? x) → boolean?
x : any/c
'hypernym, 'recursive-hypernym,
'hyponym, 'recursive-hyponym,
'entails, 'recursive-entails,
'similar, 'recursive-similar,
'member-meronym, 'recursive-member-meronym,
'substance-meronym, 'recursive-substance-meronym,
'part-meronym, 'recursive-part-meronym,
'member-holonym, 'recursive-member-holonym,
'substance-holonym, 'recursive-substance-holonym,
'part-holonym, 'recursive-part-holonym,
'meronym, 'recursive-meronym,
'holonym, 'recursive-holonym,
'cause, 'recursive-cause,
'particple-of-verb, 'recursive-particple-of-verb,
'see-also, 'recursive-see-also,
'pertains-to, 'recursive-pertains-to,
'attribute, 'recursive-attribute,
'verb-group, 'recursive-verb-group,
'derivation, 'recursive-derivation,
'classification, 'recursive-classification,
'class, 'recursive-class,
'synonyms, 'recursive-synonyms,
'polysemy, 'recursive-polysemy,
'frames, 'recursive-frames,
'noun-coordinates, 'recursive-noun-coordinates,
'relatives, 'recursive-relatives,
'hierarchical-meronym, 'recursive-hierarchical-meronym,
'hierarchical-holonym, 'recursive-hierarchical-holonym,
'keywords-by-substring, 'recursive-keywords-by-substring,
'overview, 'recursive-overview,
'classification-category, 'recursive-classification-category,
'classification-usage, 'recursive-classification-usage,
'classification-regional, 'recursive-classification-regional,
'class-category, 'recursive-class-category,
'class-usage, 'recursive-class-usage,
'class-regional, 'recursive-class-regional,
'instance-of, 'recursive-instance-of,
'instances, 'recursive-instances.
The names here are more readable renderings of the “search ptrs” listed in the WordNet documentation, and they correspond to #define’d constants in the file wn.h in the WordNet source directory. The ‘recursive-’ versions of these constants are negated, following the WordNet convention of using negative search types for recursive searches. For more information about these search types, it is best to refer to the code: the WordNet documentation is sparse and will mostly direct you to play with the command-line tools.
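To make the negation convention concrete, here is a minimal sketch of how a symbolic search type could be turned into the integer the C library expects. It is not this package's implementation: search-type->int and base-search-types are hypothetical names, only two entries are shown, and the values 2 and 3 are assumed to match HYPERPTR and HYPOPTR in wn.h, so verify them against your copy.

```racket
#lang racket/base
(require racket/string)

;; Illustrative subset only; the integer values are an assumption
;; about wn.h and should be checked against the WordNet source.
(define base-search-types
  (hash 'hypernym 2
        'hyponym 3))

;; Map a search-type symbol to its integer code. A 'recursive-
;; prefixed symbol maps to the negated code of its base type.
;; Returns #f for symbols outside the table.
(define (search-type->int sym)
  (define name (symbol->string sym))
  (cond
    [(hash-ref base-search-types sym #f)]
    [(string-prefix? name "recursive-")
     (define base
       (hash-ref base-search-types
                 (string->symbol
                  (substring name (string-length "recursive-")))
                 #f))
     (and base (- base))]
    [else #f]))

(search-type->int 'hypernym)            ; 2
(search-type->int 'recursive-hypernym)  ; -2
(search-type->int 'no-such-type)        ; #f
```

Keeping only the base constants in the table and deriving the recursive ones by negation avoids listing every pair twice.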
procedure
(limited-search-type? x) → boolean?
x : any/c
procedure
(c-synset? x) → boolean?
x : any/c
5.2 Search Functions
procedure
(find-the-info search-str part-of-speech search-type sense-id) → (or/c string? #f)
search-str : string?
part-of-speech : parts-of-speech?
search-type : search-type?
sense-id : non-negative-integer?
procedure
(available-search-types string part-of-speech) → (listof search-type?)
string : string?
part-of-speech : parts-of-speech?
procedure
(find-the-info-ds search-str part-of-speech search-type sense-id) → (or/c c-synset? #f)
search-str : string?
part-of-speech : parts-of-speech?
search-type : limited-search-type?
sense-id : non-negative-integer?
syntax
(in-senses c-synset-ptr)
c-synset-ptr : (or/c c-synset? #f)
syntax
(in-results c-synset-ptr)
c-synset-ptr : (or/c c-synset? #f)
syntax
(in-words c-synset-ptr)
c-synset-ptr : (or/c c-synset? #f)
5.3 Example for iteration forms
(define (hypernyms word part-of-speech)
  ;; A sense-id of 0 asks for all senses (ALLSENSES in wn.h).
  (let ([synset (find-the-info-ds word part-of-speech 'recursive-hypernym 0)])
    (remove-duplicates
     (for*/list ([sense (in-senses synset)]
                 [result (in-results sense)]
                 [w (in-words result)])
       w))))
5.4 The c-synset data-structure
The c-synset data structure is defined as follows. For each of the following fields, a field accessor called c-synset-<field-name> is defined and can be used to access data from returned C pointers. Refer to the FFI documentation for more information.
(define-cstruct _c-synset
  ([here-i-am long]                        ; current file position
   [synset-type _adjective-markers]        ; type of ADJ synset
   [file-num int]                          ; file number that synset comes from
   [part-of-speech string]                 ; part of speech
   [word-count int]                        ; number of words in synset
   [c-words _string-pointer]               ; words in synset (pointer to string)
   [lex-id _int-pointer]                   ; unique id in lexicographer file (pointer to int)
   [wn-sense _int-pointer]                 ; sense number in wordnet (pointer to int)
   [which-word int]                        ; which word in synset we're looking for
   [pointer-count int]                     ; number of pointers
   [pointer-type _int-pointer]             ; pointer types (pointer to int)
   [pointer-offsets _long-pointer]         ; pointer offsets (pointer to long)
   [pointer-part-of-speech _int-pointer]   ; pointer part of speech (pointer to int)
   [pointer-to _int-pointer]               ; pointer 'to' fields (pointer to int)
   [pointer-from _int-pointer]             ; pointer 'from' fields (pointer to int)
   [verb-frame-count int]                  ; number of verb frames
   [frame-ids _int-pointer]                ; frame numbers (pointer to int)
   [frame-to _int-pointer]                 ; frame 'to' fields (pointer to int)
   [definition string]                     ; synset gloss (definition)
   [key uint]                              ; unique synset key
   [next-synset _c-synset-pointer/null]    ; ptr to next synset containing searchword (pointer to synset)
   [next-form _c-synset-pointer/null]      ; ptr to list of synsets for alternate spelling of wordform (pointer to synset)
   [search-type _search-type]              ; type of search performed
   [pointer-list _c-synset-pointer/null]   ; ptr to synset list result of search (pointer to synset)
   [head-word string]                      ; if pos is "s", this is cluster head word
   [head-sense short]))                    ; sense number of headword
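The next-synset and pointer-list fields make a c-synset a linked structure, which is presumably what in-senses and in-results walk. The sketch below imitates that traversal with an ordinary Racket struct so that it runs without the C library. mock-synset, chain and all-words are hypothetical names, the sample data is made up, and the exact links each iteration form follows are an assumption rather than something this documentation confirms.

```racket
#lang racket/base
(require racket/list)

;; Plain Racket stand-in for a c-synset, keeping only three fields.
(struct mock-synset (words next-synset pointer-list))

;; Collect the nodes reachable from start by repeatedly following
;; next-synset until a #f (null) link is reached.
(define (chain start)
  (let loop ([node start] [acc '()])
    (if node
        (loop (mock-synset-next-synset node) (cons node acc))
        (reverse acc))))

;; Assumed traversal: senses are linked through next-synset, and each
;; sense's results start at pointer-list and continue via next-synset.
(define (all-words synset)
  (remove-duplicates
   (for*/list ([sense (in-list (chain synset))]
               [result (in-list (chain (mock-synset-pointer-list sense)))]
               [w (in-list (mock-synset-words result))])
     w)))

;; Made-up data: two senses of "dog" that share one result node.
(define animal (mock-synset '("animal") #f #f))
(define canine (mock-synset '("canine" "canid") animal #f))
(define sense-2 (mock-synset '("dog") #f animal))
(define sense-1 (mock-synset '("dog" "domestic dog") sense-2 canine))

(all-words sense-1) ; '("canine" "canid" "animal")
```

Passing #f to all-words yields an empty list, mirroring the (or/c c-synset? #f) contract on the iteration forms.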