This repository was archived by the owner on Feb 14, 2023. It is now read-only.
tag-clusterer. It clusters tags. Generates .tsx.
jimregan/tag-clusterer
tag-clusterer. It clusters tags.
Actually, it doesn't. It's just a pair of filters that delexicalise tagged
text. The delexicalised text is then clustered by a word clustering tool
(currently only mkcls), and the resulting clusters are used to generate a
tagset specification (.tsx) file for use with apertium-tagger.
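As a rough illustration of what delexicalising means here, the sketch below replaces each tagged lexical unit with its tag sequence. It assumes the usual Apertium stream notation (^surface/lemma<tag1><tag2>$) and joins tags with dots, as the .atx patterns do. The function name and the exact output layout are assumptions, not the behaviour of the repository's own filters.

```python
import re

# One lexical unit in Apertium stream notation: ^surface/lemma<tag1><tag2>$
# (assumed input layout; the real filters may accept more variants)
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^<$]*)((?:<[^>]+>)+)\$")


def delexicalise(line: str) -> str:
    """Replace every tagged lexical unit with its dot-joined tag sequence."""
    tokens = []
    for _surface, _lemma, tags in LEXICAL_UNIT.findall(line):
        # "<det><def><sp>" becomes "det.def.sp"
        tokens.append(".".join(re.findall(r"<([^>]+)>", tags)))
    return " ".join(tokens)


if __name__ == "__main__":
    print(delexicalise("^The/the<det><def><sp>$ ^cats/cat<n><pl>$"))
    # prints: det.def.sp n.pl
```

Each output token contains no whitespace, so a word clustering tool can treat a whole tag sequence as one "word".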
The input must be tagged (not analysed) text in the format used for
supervised training of the tagger. Feed its output to mkcls (experiment with
the number of classes it generates), then feed the result into mkcls-to-tsx.pl.
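To make the last step more concrete, here is a minimal sketch of turning a class listing into a .tsx skeleton. It assumes the clustering output has one "token<TAB>class number" pair per line, and it emits one def-label per class containing a tags-item for each token. The function name, the CLASS labels and the tagger name are hypothetical, and what mkcls-to-tsx.pl actually writes may differ.

```python
from collections import defaultdict
from xml.sax.saxutils import quoteattr


def classes_to_tsx(class_lines, tagger_name="clustered"):
    """Build a minimal .tsx document from 'token<TAB>class' lines (assumed format)."""
    groups = defaultdict(list)
    for line in class_lines:
        line = line.strip()
        if not line:
            continue
        token, class_id = line.split("\t")
        groups[class_id].append(token)

    out = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        f"<tagger name={quoteattr(tagger_name)}>",
        "  <tagset>",
    ]
    # One def-label per cluster, ordered by numeric class id
    for class_id in sorted(groups, key=int):
        out.append(f'    <def-label name="CLASS{class_id}">')
        for token in groups[class_id]:
            out.append(f"      <tags-item tags={quoteattr(token)}/>")
        out.append("    </def-label>")
    out += ["  </tagset>", "</tagger>"]
    return "\n".join(out)


if __name__ == "__main__":
    print(classes_to_tsx(["n.sg\t2", "n.pl\t2", "det.def.sp\t3"]))
```

The labels produced this way are only numbered, so they would normally be renamed by hand to something meaningful.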
semi-lexicalise.pl can take a set of tags to treat as stopwords, in the
apertium-transfer-tools .atx format, and semi-lexicalise the input. In this
case, the generated .tsx file will have closed classes, and may be usable
without extra intervention.
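The idea of semi-lexicalising can be sketched as follows: tokens whose tags match a stopword pattern keep their lemma, while everything else is reduced to its tags. This is only an illustration under stated assumptions: a trailing ".*" is taken to match any (possibly empty) run of further tags, the "lemma:tags" output form is made up for the example, and both function names are hypothetical.

```python
def tags_match(pattern: str, tags: str) -> bool:
    """Match a dot-joined tag string against an .atx-style pattern.

    Assumption: '*' only appears as a trailing '.*' wildcard.
    """
    if pattern.endswith(".*"):
        prefix = pattern[:-2]
        return tags == prefix or tags.startswith(prefix + ".")
    return tags == pattern


def semi_lexicalise(lemma: str, tags: str, stopwords) -> str:
    """Keep the lemma for stopword-like tokens, otherwise return the tags only.

    `stopwords` is a list of (tag_pattern, lemma_or_None) pairs.
    """
    for pattern, required_lemma in stopwords:
        if tags_match(pattern, tags) and required_lemma in (None, lemma):
            return f"{lemma}:{tags}"  # hypothetical output form
    return tags


if __name__ == "__main__":
    stop = [("det.*", None), ("pr", None)]
    print(semi_lexicalise("the", "det.def.sp", stop))  # prints: the:det.def.sp
    print(semi_lexicalise("cat", "n.sg", stop))        # prints: n.sg
```

Because the lexicalised tokens form small, fixed sets, the classes they end up in are natural candidates for closed classes.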
.atx looks like this:
<?xml version="1.0" encoding="iso-8859-1"?>
<transfer-at source="Portuguese" target="Spanish">
  <source>
    <lexicalized-words>
      <lexicalized-word tags="cnjsub"/>
      <lexicalized-word tags="det.*"/>
      <lexicalized-word tags="pr"/>
      <lexicalized-word tags="prn.tn.*"/>
      <lexicalized-word tags="prn.enc.*"/>
      <lexicalized-word tags="prn.pro.*"/>
      <lexicalized-word tags="rel.*"/>
      <lexicalized-word tags="vbser.*"/>
      <lexicalized-word tags="vbhaver.*"/>
      <lexicalized-word tags="vbmod.*"/>
      <lexicalized-word tags="vblex.*" lemma="há"/>
    </lexicalized-words>
  </source>
  <target>
    <lexicalized-words>
      <lexicalized-word tags="cnjsub"/>
      <lexicalized-word tags="det.*"/>
      <lexicalized-word tags="pr"/>
      <lexicalized-word tags="prn.tn.*"/>
      <lexicalized-word tags="prn.enc.*"/>
      <lexicalized-word tags="prn.pro.*"/>
      <lexicalized-word tags="rel.*"/>
      <lexicalized-word tags="vbser.*"/>
      <lexicalized-word tags="vbhaver.*"/>
      <lexicalized-word tags="vbmod.*"/>
      <lexicalized-word tags="vblex.*" lemma="hacer"/>
    </lexicalized-words>
  </target>
</transfer-at>
semi-lexicalise.pl only uses the <source> part, so you only need that.
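For readers who want to inspect such a file, the following sketch reads only the <source> entries using Python's standard xml.etree.ElementTree. It is not how semi-lexicalise.pl is implemented, and the function name is hypothetical. The input is passed as bytes so that the declared iso-8859-1 encoding is honoured by the parser.

```python
import xml.etree.ElementTree as ET


def read_source_stopwords(atx_bytes: bytes):
    """Return (tags, lemma) pairs from the <source> part of an .atx document.

    `lemma` is None when the attribute is absent.
    """
    root = ET.fromstring(atx_bytes)
    words = root.findall("./source/lexicalized-words/lexicalized-word")
    return [(w.get("tags"), w.get("lemma")) for w in words]


if __name__ == "__main__":
    sample = (
        '<?xml version="1.0" encoding="iso-8859-1"?>'
        '<transfer-at source="Portuguese" target="Spanish"><source>'
        '<lexicalized-words><lexicalized-word tags="det.*"/>'
        '</lexicalized-words></source></transfer-at>'
    ).encode("iso-8859-1")
    print(read_source_stopwords(sample))
```

A file with no <target> element parses the same way, since only the <source> path is queried.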