diff --git a/README b/README
index bbad423..39b382f 100644
--- a/README
+++ b/README
@@ -1,38 +1,39 @@
This script extracts proper nouns from an English text with NLTK.
Install dependencies
--------------------
* Install NLTK according to your OS (for example, through pkg on FreeBSD)
* Install numpy (pkg install py27-numpy)
* Download the needed NLTK resources with nltk.download():
+** averaged_perceptron_tagger
** maxent_treebank_pos_tagger
** punkt
** treebank
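The download step can be scripted rather than typed interactively. The following is a minimal sketch, not part of the repository: the helper name download_resources and its downloader parameter are assumptions made for illustration, and whether every listed resource is still published depends on your NLTK version.

```python
# Resource names exactly as listed in this README.
RESOURCES = [
    "averaged_perceptron_tagger",
    "maxent_treebank_pos_tagger",
    "punkt",
    "treebank",
]


def download_resources(names=RESOURCES, downloader=None):
    """Try to fetch each resource; return the names that failed.

    `downloader` is a hypothetical hook (any callable taking a resource
    name and returning a truthy value on success). It defaults to
    nltk.download, which returns True or False.
    """
    if downloader is None:
        import nltk  # imported lazily so the helper can be exercised without NLTK
        downloader = nltk.download
    failed = []
    for name in names:
        if not downloader(name):
            failed.append(name)
    return failed


if __name__ == "__main__":
    missing = download_resources()
    if missing:
        print("Could not download: " + ", ".join(missing))
```

Returning the list of failures instead of raising lets you see every missing resource in a single run.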
Source text
-----------
You need a copy of the text you want to extract from as plain text.
Source English word list
------------------------
The expected format is a lowercase word list, one word per line.
The file should be named wordsEn.txt; to use another name, change it in the eliminate-common-nouns script.
Such a file is available at http://www-01.sil.org/linguistics/wordlists/english/
Usage
-----
./extract-proper-nouns source.txt > nouns.txt
To sort them and eliminate duplicates:
./extract-proper-nouns source.txt | sort | uniq > nouns.txt
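For readers who want to see what such an extraction involves, here is a rough sketch of one common NLTK approach: tokenize, tag parts of speech, and keep the tokens tagged NNP or NNPS (the Penn Treebank proper-noun tags). It is an assumption about the general technique, not a copy of the extract-proper-nouns script, and all function names are hypothetical.

```python
def proper_nouns(tagged_tokens):
    """Keep the words whose tag is NNP or NNPS (Penn Treebank proper nouns)."""
    return [word for word, tag in tagged_tokens if tag in ("NNP", "NNPS")]


def extract_from_text(text):
    """Tokenize and tag `text` with NLTK, then filter the proper nouns.

    Requires the tokenizer and tagger resources to be downloaded already.
    """
    import nltk  # imported lazily; only this function needs NLTK
    tokens = nltk.word_tokenize(text)
    return proper_nouns(nltk.pos_tag(tokens))


def sorted_unique(words):
    """Rough equivalent of `sort | uniq`: drop duplicates, then sort."""
    return sorted(set(words))
```

Note that Python sorts by code point, while sort(1) follows the current locale, so the two orderings may differ for mixed-case or accented words.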
To discard known English words:
./eliminate-common-nouns nouns.txt
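The filtering idea behind this step can be sketched as a set lookup against the lowercase word list. This is an illustration under assumptions, not the eliminate-common-nouns script itself: the function names are hypothetical, and the real script may match words differently.

```python
def load_known_words(lines):
    """Build a set of lowercase words from an iterable of lines, skipping blanks."""
    return {line.strip().lower() for line in lines if line.strip()}


def discard_known_words(candidates, known_words):
    """Return the candidates whose lowercase form is not in `known_words`."""
    kept = []
    for candidate in candidates:
        word = candidate.strip()
        if word and word.lower() not in known_words:
            kept.append(word)
    return kept


if __name__ == "__main__":
    # File names follow the README; adjust them to your setup.
    with open("wordsEn.txt") as word_file:
        known = load_known_words(word_file)
    with open("nouns.txt") as noun_file:
        for noun in discard_known_words(noun_file, known):
            print(noun)
```

Lowercasing each candidate before the lookup is what lets a capitalized common word such as "Apple" match the lowercase list; a set keeps each lookup constant-time even for a large word list.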
Acknowledgment
--------------
Thank you to Rama for the NLTK suggestion and some brief guidance.
The original code idea is from Alvations and can be seen at http://stackoverflow.com/a/17672491/1930997.