(Jan 28 2014): Found this old draft I never got around to polishing. It’s very rough, but this is a developer notebook after all, so here it goes out into the world.
- Install Mahout
- Run as core (Create new file)
- Create sequence files from text in directory
- Get dictionary size (to pass into cvb)
- Create vectors in the format cvb expects. This creates two files: matrix holds the sparse vectors, and docIndex maps each numeric key back to the original key.
- Dump docs related to topics
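
To make the data shapes in the steps above concrete, here is a small plain-Python sketch. It is not Mahout code and does not call any Mahout API; the file names, the whitespace tokenizer, and the variable names (`dictionary`, `matrix`, `doc_index`) are all assumptions made for illustration. It shows what "dictionary size" means, and why a vector file keyed by number needs a companion index to get back to the original document keys.

```python
from collections import Counter

# Toy corpus keyed by original document key (hypothetical file names).
docs = {
    "/notes/a.txt": "topic models find topics in text",
    "/notes/b.txt": "sparse vectors store term counts",
    "/notes/c.txt": "topics are distributions over terms",
}

# Dictionary: every distinct term gets an integer id.
terms = sorted({t for text in docs.values() for t in text.split()})
dictionary = {term: i for i, term in enumerate(terms)}

# Dictionary size: the number of distinct terms, i.e. the vector width.
dictionary_size = len(dictionary)

# matrix:    numeric key -> sparse term-count vector {term id: count}
# doc_index: numeric key -> original key
matrix = {}
doc_index = {}
for row_id, (key, text) in enumerate(docs.items()):
    counts = Counter(dictionary[t] for t in text.split())
    matrix[row_id] = dict(counts)
    doc_index[row_id] = key


def original_key(row_id):
    """Map a numeric row id back to the original document key."""
    return doc_index[row_id]


if __name__ == "__main__":
    print("dictionary size:", dictionary_size)
    for row_id, vector in matrix.items():
        print(row_id, original_key(row_id), vector)
```

Since the vectors are keyed by number, anything computed per document from them will presumably be keyed the same way, so a lookup like `original_key` is what ties the topic dump in the last step back to real documents.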