- Fast caching and persistence of all intermediate and final calculations -- nothing is recomputed unnecessarily.
- Advanced training and preparation logic. Ramp respects the current training set, even when using complex trained features and blended predictions, and also tracks the given preparation set (the x values used in feature preparation -- e.g. the mean and stdev used for feature normalization.)
- A growing library of feature transformations, metrics and estimators. Ramp's simple API allows for easy extension.
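To make the preparation-set idea from the list above concrete, here is a minimal sketch in plain Python. It is not Ramp's implementation: the function name `normalize` and the `prep_rows` argument are hypothetical, and the only point is that the mean and standard deviation come from the preparation rows and are then applied to every row.

```python
from statistics import mean, pstdev

def normalize(values, prep_rows):
    """Scale all values using statistics from the preparation rows only.

    `prep_rows` is a hypothetical argument: the positions of the rows
    that are allowed to influence the mean and standard deviation.
    """
    prep_values = [values[i] for i in prep_rows]
    mu = mean(prep_values)
    sigma = pstdev(prep_values) or 1.0  # avoid dividing by zero
    return [(v - mu) / sigma for v in values]

lengths = [10.0, 20.0, 30.0, 1000.0]
# Only the first three rows are in the preparation set, so the
# outlier in the last row does not shift the mean or stdev.
scaled = normalize(lengths, prep_rows=[0, 1, 2])
```

Rows outside the preparation set are still transformed, but they never leak into the statistics, which is what keeps a held-out row from influencing its own features.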
Detecting insults
Let's try Ramp out on the Kaggle Detecting Insults in Social Commentary challenge. I recommend grabbing Ramp straight from the GitHub repo so you are up to date.
First, we load the data using pandas. (You can download the data from the Kaggle site; you'll have to sign up first.)
import pandas
training_data = pandas.read_csv('train.csv')
print training_data
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3947 entries, 0 to 3946
Data columns:
Insult 3947 non-null values
Date 3229 non-null values
Comment 3947 non-null values
dtypes: int64(1), object(2)
We've got about 4000 comments along with the date they were posted and a boolean indicating whether or not the comment was classified as insulting. If you're curious, the insults in question range from the relatively civilized ("... you don't have a basic grasp on biology") to the mundane ("suck my d***, *sshole"), to the truly bottom-of-the-internet horrific (pass).
Anyways, let's set up a DataContext for Ramp. This involves providing a store (to save cached results to) and a pandas DataFrame with our actual data.
from ramp import *
context = DataContext(
    store='~/data/insults/ramp',
    data=training_data)
We just provided a directory path for the store, so Ramp will use the default HDFPickleStore, which attempts to store objects (on disk) in the fast HDF5 format and falls back to pickling if that is not an option.
Next, we'll specify a base configuration for our analysis.
base_config = Configuration(
    target='Insult',
    metrics=[metrics.AUC()],
)
Here we have specified the DataFrame column 'Insult' as the target for our classification and the AUC for our metric.
Model exploration
Now comes the fun part -- exploring features and algorithms. We create a ConfigFactory for this purpose, which takes our base config and provides an iterator over declared feature sets and estimators.
import sklearn.linear_model
import sklearn.naive_bayes
from ramp.estimators.sk import BinaryProbabilities
base_features = [
    Length('Comment'),
    Log(Length('Comment') + 1)
]
factory = ConfigFactory(base_config,
    features=[
        # first feature set is basic attributes
        base_features,
        # second feature set adds word features
        base_features + [
            text.NgramCounts(
                text.Tokenizer('Comment'),
                mindocs=5,
                bool_=True)],
        # third feature set creates character 5-grams
        # and then selects the top 1000 most informative
        base_features + [
            trained.FeatureSelector(
                [text.NgramCounts(
                    text.CharGrams('Comment', chars=5),
                    bool_=True,
                    mindocs=30)
                ],
                selector=selectors.BinaryFeatureSelector(),
                n_keep=1000,
                target=F('Insult')),
            ],
        # the fourth feature set creates 100 latent vectors
        # from the character 5-grams
        base_features + [
            text.LSI(
                text.CharGrams('Comment', chars=5),
                mindocs=30,
                num_topics=100),
            ]
        ],
    # we'll try two estimators (and wrap them so
    # we get class probabilities as output):
    model=[
        BinaryProbabilities(
            sklearn.linear_model.LogisticRegression()),
        BinaryProbabilities(
            sklearn.naive_bayes.GaussianNB())
    ])
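For readers unfamiliar with character n-gram features, here is a rough, library-free sketch of what a boolean character 5-gram feature set might look like. It does not use Ramp's code. The helper names are hypothetical, and it assumes that `mindocs` means "keep only n-grams that occur in at least this many documents" and that `bool_=True` means presence/absence rather than counts.

```python
from collections import Counter

def char_ngrams(text, n=5):
    """Return the set of distinct character n-grams in `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def boolean_ngram_features(docs, n=5, min_docs=2):
    """Build 0/1 features for n-grams seen in at least `min_docs` docs."""
    grams_per_doc = [char_ngrams(d.lower(), n) for d in docs]
    # Document frequency: in how many documents does each n-gram occur?
    doc_freq = Counter(g for grams in grams_per_doc for g in grams)
    vocab = sorted(g for g, count in doc_freq.items() if count >= min_docs)
    rows = [[1 if g in grams else 0 for g in vocab]
            for grams in grams_per_doc]
    return vocab, rows

docs = ["you idiot", "what an idiot", "have a nice day"]
vocab, rows = boolean_ngram_features(docs, n=5, min_docs=2)
# vocab is [' idio', 'idiot']; the first two rows are [1, 1], the last [0, 0]
```

With thousands of comments the vocabulary gets large quickly, which is presumably why the third and fourth feature sets above reduce it by selection or by LSI.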
We've defined some base features along with four feature sets that seem promising.
Now, let's run cross-validation and compare the results:
for config in factory:
    models.cv(config, context, folds=5, repeat=2,
              print_results=True)
Here are a couple snippets of the output:
...
Configuration
model: Probabilites for LogisticRegression(
C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, penalty=l2, tol=0.0001)
3 features
target: Insult
auc
0.8679 (+/- 0.0101) [0.8533,0.8855]
...
Configuration
model: Probabilites for GaussianNB()
3 features
target: Insult
auc
0.6055 (+/- 0.0171) [0.5627,0.6265]
...
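As a reminder of what the auc figures above measure, here is a small illustrative sketch in plain Python. It is not the scikit-learn routine that Ramp calls; it computes the same quantity by brute force. AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, so 0.5 is chance level and 1.0 is a perfect ranking.

```python
def auc(actuals, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the number of rows, so this
    is for illustration only.
    """
    positives = [s for a, s in zip(actuals, scores) if a == 1]
    negatives = [s for a, s in zip(actuals, scores) if a == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

actuals = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6]
result = auc(actuals, scores)  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```

Read this way, the 0.6055 for GaussianNB in the snippet above is only modestly better than ranking comments at random.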
The logistic regression model has, of course, dominated naive Bayes. The best feature sets are the 100-vector LSI and the 1000 selected character 5-grams. Once a feature is computed, it does not need to be computed again in separate contexts. The binary feature selection is an exception, though: because it uses the target "y" values to select features, Ramp needs to recreate it for each cross-validation fold using only the given training values. (You can also cheat and tell it not to do this, training it just once against the entire data set.)
Predictions
We can also create a quick utility that processes a given comment and spits out its probability of being an insult:
from hashlib import md5
from pandas import DataFrame
def probability_of_insult(config, ctx, txt):
    # create a unique index for this text
    idx = int(md5(txt).hexdigest()[:10], 16)
    # add the new comment to our DataFrame
    d = DataFrame(
        {'Comment':[txt]},
        index=pandas.Index([idx]))
    ctx.data = ctx.data.append(d)
    # Specify which instances to predict with predict_index
    # and make the prediction
    pred, predict_x, predict_y = models.predict(
        config,
        ctx,
        predict_index=pandas.Index([idx]))
    return pred[idx]
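The `idx` line in `probability_of_insult` is worth a closer look. Below is a standalone sketch of the same trick; the helper name `text_index` is hypothetical. One assumption to flag: the function above passes `txt` straight to `md5`, which works for Python 2 byte strings, whereas this sketch encodes the text as UTF-8 first so that it also runs on Python 3.

```python
from hashlib import md5

def text_index(txt):
    """Map a comment to a 40-bit integer derived from its MD5 digest."""
    digest = md5(txt.encode("utf-8")).hexdigest()
    # The first 10 hex characters are 40 bits, so collisions are
    # unlikely for a few thousand comments but not impossible.
    return int(digest[:10], 16)

a = text_index("ur an idiot")
b = text_index("ur great")
```

Because the index depends only on the text, the same comment always maps to the same row label, which is what makes it possible to recognise a comment that has been seen before.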
And we can run probability_of_insult on some sample text:
probability_of_insult(
    logreg_lsi_100_config,
    context,
    "ur an idiot")
> .8483555
probability_of_insult(
    logreg_lsi_100_config,
    context,
    "ur great")
> .099361
Ramp will need to create the model for the full training data set the first time you make a prediction, but will then cache and store it, allowing you to quickly classify subsequent text.
And more
There's more machine learning goodness to be had with Ramp. Full documentation is coming soon, but for now you can take a look at the code on github: https://github.com/kvh/ramp. Email me or submit an issue if you have any bugs/suggestions/comments.
This looks interesting. Thanks for sharing.
Did I get it right that the basic idea is to have a unified, pandas-based framework to piece together features and models from the different libraries?
Yep, that's the idea, with added caching and training logic.
Thank you, interesting project.
Looks very interesting, thanks. I am going through the tutorial, but I am getting the stacktrace below on every call to cv (the insult data seems to be exactly the same you imported, i.e. the dimensions are exactly the same as shown)
By the way, I had to add this for it to work, I think it's missing: import sklearn.naive_bayes
>>> models.cv(c, context, folds=5, repeat=2,print_results=True)
loading 'Length(Comment) [1dd14719]'
loading 'log(Add(Length(Comment), 1)) [ff9b53cc]'
loading 'Insult [7c3354d6]'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.virtualenvs/data/lib/python2.7/site-packages/ramp/models.py", line 117, in cv
metric.score(actuals,preds))
File "/.virtualenvs/data/lib/python2.7/site-packages/ramp/metrics.py", line 32, in score
return self.metric(actual, predicted, **self.kwargs)
File "/.virtualenvs/data/lib/python2.7/site-packages/sklearn/metrics/metrics.py", line 247, in auc_score
fpr, tpr, tresholds = roc_curve(y_true, y_score)
File "/.virtualenvs/data/lib/python2.7/site-packages/sklearn/metrics/metrics.py", line 150, in roc_curve
signal = np.c_[y_score, y_true]
File "/.virtualenvs/data/lib/python2.7/site-packages/numpy/lib/index_tricks.py", line 323, in __getitem__
res = _nx.concatenate(tuple(objs),axis=self.axis)
ValueError: array dimensions must agree except for d_0
mugiwara:
The error you are experiencing is a result of changes I've made to how probability predictions are handled. I've updated the code here to reflect those changes -- you'll want to use the "BinaryProbabilities" wrapper for the sklearn estimators (you might need to pull the latest Ramp code from github to use this). This will allow you to compute the AUC without further post-processing.
Thanks for sharing the error!
That was fast, thanks! I was using the package from pypi, but I will pull the latest code from github and try the new version.