Monday, November 5, 2012

Getting started with Ramp: Detecting insults

Ramp is a Python library for rapid machine learning prototyping. It provides a simple, declarative syntax for exploring features, algorithms and transformations quickly and efficiently. At its core it's a pandas wrapper around various Python machine learning and statistics libraries (scikit-learn, rpy2, etc.). Some features:
  • Fast caching and persistence of all intermediate and final calculations -- nothing is recomputed unnecessarily.
  • Advanced training and preparation logic. Ramp respects the current training set, even when using complex trained features and blended predictions, and also tracks the given preparation set (the x values used in feature preparation -- e.g. the mean and stdev used for feature normalization).
  • A growing library of feature transformations, metrics and estimators. Ramp's simple API allows for easy extension.

Detecting insults

Let's try Ramp out on the Kaggle Detecting Insults in Social Commentary challenge. I recommend grabbing Ramp straight from the GitHub repo so you are up to date.

First, we load up the data using pandas. (You can download the data from the Kaggle site; you'll have to sign up.)
import pandas

training_data = pandas.read_csv('train.csv')

print training_data

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3947 entries, 0 to 3946
Data columns:
Insult     3947  non-null values
Date       3229  non-null values
Comment    3947  non-null values
dtypes: int64(1), object(2)
We've got about 4000 comments along with the date they were posted and a boolean indicating whether or not the comment was classified as insulting. If you're curious, the insults in question range from the relatively civilized ("... you don't have a basic grasp on biology") to the mundane ("suck my d***, *sshole"), to the truly bottom-of-the-internet horrific (pass).
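
Since 'Insult' is a 0/1 column, its mean gives the fraction of comments labeled insulting -- a quick sanity check using nothing but pandas:
# fraction of comments labeled as insults
print training_data['Insult'].mean()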

Anyways, let's set up a DataContext for Ramp. This involves providing a store (to save cached results to) and a pandas DataFrame with our actual data.
from ramp import *

context = DataContext(
              store='~/data/insults/ramp', 
              data=training_data)
We just provided a directory path for the store, so Ramp will use the default HDFPickleStore, which attempts to store objects (on disk) in the fast HDF5 format and falls back to pickling if that is not an option.
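
You can also pass a store object explicitly instead of a path. A minimal sketch, assuming HDFPickleStore is importable from ramp.store and takes the path as its argument:
from ramp.store import HDFPickleStore

# equivalent to passing the path string, with the store made explicit
context = DataContext(
              store=HDFPickleStore('~/data/insults/ramp'),
              data=training_data)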

Next, we'll specify a base configuration for our analysis.
base_config = Configuration(
    target='Insult',
    metrics=[metrics.AUC()],
    )
Here we have specified the DataFrame column 'Insult' as the target for our classification and AUC as our metric.
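
For reference, AUC measures how well the predicted probabilities rank positives above negatives. Here's the classic scikit-learn toy example (auc_score is the name in scikit-learn versions of this era; newer releases call it roc_auc_score):
import sklearn.metrics

# 3 of the 4 positive/negative pairs are ranked correctly, so AUC = 0.75
print sklearn.metrics.auc_score(
    [0, 0, 1, 1],           # true labels
    [0.1, 0.4, 0.35, 0.8])  # predicted probabilities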

Model exploration

Now comes the fun part -- exploring features and algorithms. We create a ConfigFactory for this purpose, which takes our base config and provides an iterator over declared feature sets and estimators.
# the sklearn submodules must be imported explicitly;
# "import sklearn" alone isn't enough
import sklearn.linear_model
import sklearn.naive_bayes
from ramp.estimators.sk import BinaryProbabilities

base_features = [
    Length('Comment'),  
    Log(Length('Comment') + 1)
]

factory = ConfigFactory(base_config,
    features=[
        # first feature set is basic attributes
        base_features,

        # second feature set adds word features
        base_features + [
            text.NgramCounts(
                text.Tokenizer('Comment'),
                mindocs=5,
                bool_=True)],

        # third feature set creates character 5-grams
        # and then selects the top 1000 most informative
        base_features + [
            trained.FeatureSelector(
                [text.NgramCounts(
                    text.CharGrams('Comment', chars=5),
                    bool_=True,
                    mindocs=30)
                ],
                selector=selectors.BinaryFeatureSelector(),
                n_keep=1000,
                target=F('Insult')),
            ],

        # the fourth feature set creates 100 latent vectors
        # from the character 5-grams
        base_features + [
            text.LSI(
                text.CharGrams('Comment', chars=5),
                mindocs=30,
                num_topics=100),
            ]
    ],

    # we'll try two estimators (and wrap them so
    # we get class probabilities as output):
    model=[
        BinaryProbabilities(
            sklearn.linear_model.LogisticRegression()),
        BinaryProbabilities(
            sklearn.naive_bayes.GaussianNB())
    ])
We've defined some base features along with four feature sets that seem promising.
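
Since the factory simply crosses feature sets with models, iterating it yields 4 x 2 = 8 configurations. You can peek at them before committing to a run -- a sketch that assumes the factory yields a fresh iterator each time you loop over it:
# print each generated configuration (feature set x estimator)
for i, config in enumerate(factory):
    print i, config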

Now, let's run cross-validation and compare the results:
for config in factory:
    models.cv(config, context, folds=5, repeat=2, 
              print_results=True)
Here are a couple of snippets of the output:
...

Configuration
 model: Probabilites for LogisticRegression(
          C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty=l2, tol=0.0001)
 3 features
 target: Insult
auc
0.8679 (+/- 0.0101) [0.8533,0.8855]

...

Configuration
 model: Probabilites for GaussianNB()
 3 features
 target: Insult
auc
0.6055 (+/- 0.0171) [0.5627,0.6265]

...
The logistic regression model has, of course, dominated Naive Bayes. The best feature sets are the 100-vector LSI and the top-1000 character 5-grams. Once a feature is computed, it does not need to be computed again, even in separate contexts. The binary feature selection is an exception, though: because it uses target "y" values to select features, Ramp needs to recreate it for each cross-validation fold using only that fold's training values. (You can also cheat and tell it not to do this, training it just once against the entire data set.)

Predictions

We can also create a quick utility that processes a given comment and spits out its probability of being an insult:
from hashlib import md5

def probability_of_insult(config, ctx, txt):
    # create a unique, deterministic index for this text
    idx = int(md5(txt).hexdigest()[:10], 16)

    # add the new comment to our DataFrame
    d = pandas.DataFrame(
            {'Comment': [txt]},
            index=pandas.Index([idx]))
    ctx.data = ctx.data.append(d)

    # specify which instances to predict with predict_index
    # and make the prediction
    pred, predict_x, predict_y = models.predict(
            config,
            ctx,
            predict_index=pandas.Index([idx]))

    return pred[idx]
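
The logreg_lsi_100_config used below is the logistic regression + 100-topic LSI configuration from the factory above. A sketch of building it directly, assuming Configuration accepts the same features and model arguments the factory crosses over:
# logistic regression over the 100-topic LSI feature set
logreg_lsi_100_config = Configuration(
    target='Insult',
    metrics=[metrics.AUC()],
    features=base_features + [
        text.LSI(
            text.CharGrams('Comment', chars=5),
            mindocs=30,
            num_topics=100),
        ],
    model=BinaryProbabilities(
        sklearn.linear_model.LogisticRegression()))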
And we can run it on some sample text:
probability_of_insult(
        logreg_lsi_100_config, 
        context, 
        "ur an idiot")

> .8483555

probability_of_insult(
        logreg_lsi_100_config, 
        context, 
        "ur great")

> .099361
Ramp will need to create the model for the full training data set the first time you make a prediction, but will then cache and store it, allowing you to quickly classify subsequent text.

And more

There's more machine learning goodness to be had with Ramp. Full documentation is coming soon, but for now you can take a look at the code on GitHub: https://github.com/kvh/ramp. Email me or submit an issue if you have any bugs/suggestions/comments.

6 comments:

  1. This looks interesting. Thanks for sharing.

    Did I get it right that the basic idea is to have a unified, pandas-based framework to piece together features and models from the different libraries?

  2. Yep, that's the idea, with added caching and training logic.

  3. Thank you, interesting project.

  4. Looks very interesting, thanks. I am going through the tutorial, but I am getting the stack trace below on every call to cv (the insult data seems to be exactly the same as you imported, i.e. the dimensions are exactly the same as shown).

    By the way, I had to add this for it to work, I think it's missing: import sklearn.naive_bayes

    >>> models.cv(c, context, folds=5, repeat=2,print_results=True)
    loading 'Length(Comment) [1dd14719]'
    loading 'log(Add(Length(Comment), 1)) [ff9b53cc]'
    loading 'Insult [7c3354d6]'
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/.virtualenvs/data/lib/python2.7/site-packages/ramp/models.py", line 117, in cv
        metric.score(actuals,preds))
      File "/.virtualenvs/data/lib/python2.7/site-packages/ramp/metrics.py", line 32, in score
        return self.metric(actual, predicted, **self.kwargs)
      File "/.virtualenvs/data/lib/python2.7/site-packages/sklearn/metrics/metrics.py", line 247, in auc_score
        fpr, tpr, tresholds = roc_curve(y_true, y_score)
      File "/.virtualenvs/data/lib/python2.7/site-packages/sklearn/metrics/metrics.py", line 150, in roc_curve
        signal = np.c_[y_score, y_true]
      File "/.virtualenvs/data/lib/python2.7/site-packages/numpy/lib/index_tricks.py", line 323, in __getitem__
        res = _nx.concatenate(tuple(objs),axis=self.axis)
    ValueError: array dimensions must agree except for d_0

    Replies
    1. mugiwara:

      The error you are experiencing is a result of changes I've made to how probability predictions are handled. I've updated the code here to reflect those changes -- you'll want to use the "BinaryProbabilities" wrapper for the sklearn estimators (you might need to pull the latest Ramp code from github to use this). This will allow you to compute the AUC without further post-processing.

      Thanks for sharing the error!

    2. That was fast, thanks! I was using the package from pypi, but I will pull the latest code from github and try the new version.
