Attaining Vision

Developer notebook of Daniel Alberto Cañas

Installing Scikit-learn on Amazon EC2

(Jan 28 2014): Updated installation to conform to the recommended virtualenv source install.

I’ve been using scikit-learn over the past few weeks on a project. While developing and analyzing the data I just needed to get work done without the hassle of a complex installation, and the Ubuntu image on EC2 provided just that. Now that the project is ready to be deployed, I need to install scikit-learn on the default Amazon Linux AMI. As I learned, installing scikit-learn is not trivial. It has only two dependencies, but those dependencies have dependencies of their own, and you have to sift through the documentation of at least five packages to truly understand what needs to be installed and in what order. So I decided to brush up on my writing skills, dust off the old blog, and pen a simple guide that I can reference later. I’ll explain what the scikit-learn dependencies are and how to install them on the Amazon Linux AMI, specifically image ami-1624987f.

Requirements

First we need Python version 2.6 or greater installed. The two main requirements are NumPy and SciPy, each of which has its own dependencies, listed below.

  1. NumPy
    • C compiler (gcc)
    • Fortran compiler (gfortran)
    • Python header files (2.4.x - 3.2.x)
    • BLAS or LAPACK (strongly recommended)
  2. SciPy
    • NumPy
    • Complete LAPACK library
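
A quick way to see which of these tools are already present on a fresh instance is a small shell check. This is only a sketch: check_tool is a helper name made up for illustration, and it only looks for executables on the PATH, so it says nothing about the Python header files or the BLAS/LAPACK libraries.

```shell
#!/bin/sh
# Report whether a build tool is on the PATH.
# check_tool is a hypothetical helper name, not part of any package.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
        return 1
    fi
}

# Compilers and interpreter named in the requirements list above.
for tool in gcc gfortran python; do
    check_tool "$tool" || true
done
```

Anything reported as missing has to be installed before attempting the NumPy and SciPy builds.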

Running LDA on Mahout

(Jan 28 2014): Found this old draft I never got around to polishing. It’s very rough, but this is a developer notebook after all, so here it goes out into the world.

  • Install mahout
  • Run as core (Create new file)
  • Create sequence files from text in directory
bin/mahout seqdirectory -c UTF-8 -i data/documents/2012/ -o seqfiles
  • Create document vectors
bin/mahout seq2sparse -i seqfiles/ -o normalized-3-gram -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -s 5 -md 3 -x 90 -ng 3 -ml 50 -n 2 -seq
bin/mahout seq2sparse -i seqfiles/ -o normalized-3-gram -ow -chunk 200 --minSupport 5 --minDF 3 --maxDFPercent 90 -ng 3 -ml 50 -n 2 -seq
  • Get dictionary size (to pass into cvb)
bin/mahout seqdumper -i normalized-3-gram/dictionary.file-0 -c
  • Create vectors in the format cvb expects. This creates two files: matrix holds the sparse vectors, and docIndex maps each numeric key to the original key
bin/mahout rowid -i normalized-3-gram/tf-vectors -o normalized-3-gram/tf-vectors-cvb
  • run cvb
bin/mahout cvb  -i normalized-3-gram/tf-vectors-cvb/matrix -o normalized-3-gram/cvb -k 100 -ow -x 20 -nt 640 -dict normalized-3-gram/dictionary.file-0 -dt normalized-3-gram/cvb-doc-topics-10
  • View index
bin/mahout seqdumper -i normalized-3-gram/tf-vectors-cvb/docIndex
  • view results
bin/mahout vectordump -i normalized-3-gram/cvb-10 -o prob3 -d normalized-3-gram/dictionary.file-0 -dt sequencefile -p 1 -sort true -vs 5
  • dump docs related to topics
bin/mahout vectordump -i normalized-3-gram/cvb-doc-topics-10/ -o prob5 -p 1 
  • doc id mappings
bin/mahout seqdumper -i normalized-3-gram/tf-vectors-cvb/docIndex -o doc-id-mappings 
  • JAVA OPTS
-Xmx12000m -server -XX:+UseParallelGC -XX:+UseParallelOldGC -Xms8000m

RUN IT

nohup /mahout/bin/mahout cvb  -i normalized-3-gram/tf-vectors-cvb/matrix -o normalized-3-gram/cvb-100 -k 100 -ow -nt 237731 -dict normalized-3-gram/dictionary.file-0 -dt normalized-3-gram/cvb-doc-topics-100 -x 20 &
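
The "get dictionary size" step above exists to supply the -nt value for cvb. Rather than copying the number by hand, it could be captured in a variable. This is a rough sketch under one assumption I have not re-verified: that seqdumper -c prints a line of the form "Count: N". dict_size is a hypothetical helper name.

```shell
# Pull the number out of any "Count: N" line read from standard input.
# dict_size is a hypothetical helper; the "Count:" prefix is an assumption
# about what `seqdumper -c` prints.
dict_size() {
    awk '/^Count:/ { print $2 }'
}

# Intended use, with the paths from the steps above:
#   NUM_TERMS=$(bin/mahout seqdumper -i normalized-3-gram/dictionary.file-0 -c | dict_size)
# and then pass -nt "$NUM_TERMS" to the cvb command.
```

If the output format differs, only the awk pattern needs to change.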

Jess, Ants, Modules, and Conflicts: Revisited

In my previous post I explored the Jess defmodule construct. Experimenting more with defmodules I ran the simulator source and got the following result:

Jess, the Rule Engine for the Java Platform
Copyright (C) 2008 Sandia Corporation
Jess Version 7.1p2 11/5/2008

Module Ant Simulator
---------------------
Food item gathered OK
Food item gathered OK
Food item gathered OK
Food item gathered OK
Food item gathered OK
Food item gathered OK
Food item gathered OK
Food item gathered OK
Food item gathered OK
Food item gathered OK
Food item gathered OK
Food item gathered OK
Garbage emptied OK
Still food silly.Enemy Ant has appeared.
Food item gathered OK
Enemy appeared... Will attack.
Enemy Ant has appeared.
Attacking enemy ant...
Ant killed by enemy :-( <Fact-17>

I thought I had found a mistake. On the surface it looks like the ant is gathering food before going off to fight the enemy. This doesn’t make any sense, given that the Jess documentation states that when an auto-focus rule is activated, the module it appears in is automatically pushed onto the focus stack and becomes the focus module. So when an enemy ant appears, the threat module is immediately pushed onto the stack and gains focus. Once the threat module has focus, the attack-enemy-ant rule fires. To understand why, we must understand that the rule is already on the agenda in the threat module, given that its activation is what triggered the auto-focus property in the first place.

Jess, Ants, Modules, and Conflicts

After attending the last two editions of Rules Fest, I was unable to make it to San Francisco this year. The conference is a chance to meet the people behind the algorithms and technologies being used in expert systems. At least for the last two years, it was a small enough conference that sitting down and chatting with the expert systems experts was possible. During a panel session last year, the discussion drifted into the differences between Jess and Drools, and it was mentioned that Drools did not have support for what Jess calls modules. Besides a few cursory projects, I haven’t really used Drools, so I may be mistaken, and with Drools development advancing at such a furious pace it may have already added that functionality. Anyway, having never used modules in any of my Jess projects, I fired up Jess and decided to take a gander at modules.

Embedding RapidMiner as a Library in an Application (Part II)

My last post described how to create attributes in Java, the first step in the process of creating a RapidMiner example set. This entry will show the steps necessary to get from attributes to a full-fledged example set.

First, let’s review the workflow for creating an example set:

[Figure: Flow for creating an example set]

We already know how to create attributes; the next step is to create an example table. Then we’ll populate the example table with data before finally creating the example set.

Embedding RapidMiner as a Library in an Application

I was recently involved in a project deploying real-time predictive models created using RapidMiner. As the documentation is sparse, I struggled at first to grasp the RapidMiner internals. But once I got the basics of the RapidMiner data model, the process was pretty straightforward. So I decided to write up the basic steps for integrating RapidMiner into a Java application, along with a basic understanding of the data model and a few tips and gotchas I encountered along the way.