That's a very thorough description. It's always great to see layman-readable expositions of research.
I find it nice that a purely functional language can handle this sort of problem (state management) more cleanly than what I usually do with my Python-based samplers.
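To make that contrast concrete, here is a minimal sketch (not the post's actual code) of a single Gibbs-style tag update in Haskell, with the sampler's sufficient statistics held in one explicit Counts value that is threaded through the update rather than mutated in place. The Counts type, the add-one smoothing, and the resample/bump/weight helpers are all invented for illustration.

```haskell
-- Minimal sketch: explicit-state Gibbs update for a toy HMM tagger.
-- The count tables are an immutable value, so "decrement, resample,
-- increment" can't silently leave them inconsistent.

import qualified Data.Map.Strict as M
import System.Random (randomRIO)

-- Hypothetical count tables: tag->word emissions and tag->tag transitions.
data Counts = Counts
  { emit  :: M.Map (Int, String) Int
  , trans :: M.Map (Int, Int) Int
  } deriving Show

-- Add d to the count stored under key k.
bump :: Ord k => Int -> k -> M.Map k Int -> M.Map k Int
bump d k = M.insertWith (+) k d

-- Unnormalized weight for assigning tag t to word w after tag prev,
-- with add-one smoothing standing in for the real model's probabilities.
weight :: Counts -> Int -> String -> Int -> Double
weight c prev w t =
  fromIntegral (1 + M.findWithDefault 0 (t, w) (emit c))
    * fromIntegral (1 + M.findWithDefault 0 (prev, t) (trans c))

-- One update: remove the old assignment, sample a new tag in proportion
-- to its weight, and return the new tag together with the new counts.
resample :: Int -> Counts -> Int -> String -> Int -> IO (Int, Counts)
resample numTags c prev w old = do
  let c' = c { emit  = bump (-1) (old, w) (emit c)
             , trans = bump (-1) (prev, old) (trans c) }
      ws = [ weight c' prev w t | t <- [0 .. numTags - 1] ]
  r <- randomRIO (0, sum ws)
  let new = pick r (zip [0 ..] ws)
      c'' = c' { emit  = bump 1 (new, w) (emit c')
               , trans = bump 1 (prev, new) (trans c') }
  return (new, c'')
  where
    pick _  [(t, _)]       = t
    pick r' ((t, x) : xs)  = if r' <= x then t else pick (r' - x) xs
    pick _  []             = error "empty tag set"

main :: IO ()
main = do
  -- Seed the tables with one existing assignment of tag 1 to "bank",
  -- then resample that assignment.
  let c0 = Counts (M.fromList [((1, "bank"), 1)]) (M.fromList [((0, 1), 1)])
  (t, c1) <- resample 3 c0 0 "bank" 1
  print (t, c1)
```

The point is only that the counts are an ordinary value: in my Python samplers the same bookkeeping is scattered across mutations of shared dictionaries.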
Very nice. It would be considerably enhanced by some examples of how the algorithm performs on real data.
(I wonder how it does if you feed it not words from a natural language, but tokens from a programming language. Can this sort of technique be adapted to infer the whole grammar? I guess that would be more difficult -- you're trying to learn a much more complicated sort of structure.)
If you feed it words and appropriate features from a programming language, you should get something close to a tokenizer. Laura Dietz ( http://www.mpi-inf.mpg.de/~dietz/index.html#publications ) has some work that applies similar techniques to programming languages to find bugs.
This is an unsupervised part-of-speech tagger, not an unsupervised parser. Grammar induction is not possible in Aria's model, assuming we are talking about a PCFG or some other hierarchical grammar.
It is still an open problem to do unsupervised or mostly-unsupervised part-of-speech tagging that performs as well as (or close to) the usual supervised models. For English, in the standard domains for which annotated corpora exist, this isn't needed; but if you want to apply tagging to a completely different domain, or to a resource-poor language, this sort of technique is necessary.
This is relevant because it performs as well as (or better than) state-of-the-art methods while being faster and simpler.