Comparison of Semantic Role Labeling (aka Shallow Semantic Parsing) Software

There are quite a few high-quality Semantic Role Labelers out there, I recently tried a few of them out and thought I'd share my experiences.

If you are unfamiliar with SRL, from wikipedia:
Semantic role labeling, sometimes also called shallow semantic parsing, is a task in natural language processing consisting of the detection of the semantic arguments associated with the predicate or verb of a sentence and their classification into their specific roles. For example, given a sentence like "Mary sold the book to John", the task would be to recognize the verb "to sell" as representing the predicate, "Mary" as representing the seller (agent), "the book" as representing the goods (theme), and "John" as representing the recipient. This is an important step towards making sense of the meaning of a sentence. A semantic representation of this sort is at a higher-level of abstraction than a syntax tree. For instance, the sentence "The book was sold by Mary to John" has a different syntactic form, but the same semantic roles.
SRL is generally the final step in an NLP pipeline consisting of tokenizer -> tagger -> syntactic parser -> SRL. The following tools implement various parts of this pipeline, typically using existing external libraries for the steps up to role labeling.

I've provided sample output where possible for the sentence: "Remind me to move my car friday." Ideally, an SRL should extract the two roles (remind and move) and their proper arguments (including the temporal "friday" argument).

Without further ado...


In order from most recent release to oldest:

Authors: Anders Bj√∂rkelund, Bernd Bohnet, Love Hafdell, and Pierre Nugues
Latest release: Nov 2012
Comes with nice server/web-interface. Has trained models for english, chinese and german. Newer version w graph-based parser, but does not provide trained models. Achieved some top scores at CoNLL 2009 shared task (SRL-only). You can try it out yourself here:
~1.5gb RAM
Example output

Authors: Dipanjan Das, Andre Martins, Nathan Schneider, Desai Chen and Noah A. Smith at Carnegie Mellon University.
Latest release: May 2012
trained on FrameNet. Extracts nominal frames as well as verbal.
Resource intensive (~8gb RAM for me on 64bit).
Example output


Authors: R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa
Latest release: August 2011
The only completely self-contained library on this list. Very fast and efficient c code. Non-commercial license.
~180mb RAM
Author: Mihai Surdeanu
Latest release: 2007
If you want to compile with gcc > 4.3 you need to add some explicit c headers (much easier than trying to install multiple gccs!). I put a patched version up on github if you're interested (I also made some OSX compatibility changes and hacked on a server mode):
c++ code, uses AdaBoost and Charniak parser. Fast and efficient.
~150mb RAM
Example output


Authors: K. Erk and S. Pado
Latest release: 2007
You'll need to download and install TnT, TreeTagger, Collins parser and mallet to get this running. Uses actual framenet labels (including nominal targets),  and comes with pre-trained classifiers for FrameNet 1.3.
Low memory usage.


I couldn't get a local install of this working. The web demo works though, so you can give that a go. You can see all their software demos here:


Authors: Lund University
This work has been subsumed by Mate-tools.
Example output


The java libraries can get memory hungry, so if you are looking for something more lightweight, I would recommend either SwiRL or SENNA. In terms of labeling performance, direct comparisons between most of these libraries is hard due to their varied outputs and objectives. Most perform at or near state-of-the-art, so it's more about what fits your needs.

Let me know if I missed any!

Getting started with Ramp: Detecting insults

Ramp is a python library for rapid machine learning prototyping. It provides a simple, declarative syntax for exploring features, algorithms and transformations quickly and efficiently. At its core it's a pandas wrapper around various python machine learning and statistics libraries (scikit-learn, rpy2, etc.). Some features:
  • Fast caching and persistence of all intermediate and final calculations -- nothing is recomputed unnecessarily.
  • Advanced training and preparation logic. Ramp respects the current training set, even when using complex trained features and blended predictions, and also tracks the given preparation set (the x values used in feature preparation -- e.g. the mean and stdev used for feature normalization.)
  • A growing library of feature transformations, metrics and estimators. Ramp's simple API allows for easy extension.

Detecting insults

Let's try Ramp out on the Kaggle Detecting Insults in Social Commentary challenge. I recommend grabbing Ramp straight from the Github repo so you are up-to-date.

First, we load up the data using pandas. (You can download the data from the Kaggle site, you'll have to sign up.)
import pandas

training_data = pandas.read_csv('train.csv')

print training_data

Int64Index: 3947 entries, 0 to 3946
Data columns:
Insult     3947  non-null values
Date       3229  non-null values
Comment    3947  non-null values
dtypes: int64(1), object(2)
We've got about 4000 comments along with the date they were posted and a boolean indicating whether or not the comment was classified as insulting. If you're curious, the insults in question range from the relatively civilized ("... you don't have a basic grasp on biology") to the mundane ("suck my d***, *sshole"), to the truly bottom-of-the-internet horrific (pass).

Anyways, let's set up a DataContext for Ramp. This involves providing a store (to save cached results to) and a pandas DataFrame with our actual data.
from ramp import *

context = DataContext(
We just provided a directory path for the store, so Ramp will use the default HDFPickleStore, which attempts to store objects (on disk) in the fast HDF5 format and falls back to pickling if that is not an option.

Next, we'll specify a base configuration for our analysis.
base_config = Configuration(
Here we have specified the DataFrame column 'Insult' as the target for our classification and the AUC for our metric.

Model exploration

Now comes the fun part -- exploring features and algorithms. We create a ConfigFactory for this purpose, which takes our base config and provides an iterator over declared feature sets and estimators.
import sklearn
from import BinaryProbabilities

base_features = [
    Log(Length('Comment') + 1)

factory = ConfigFactory(base_config,
        # first feature set is basic attributes

        # second feature set adds word features
        base_features + [

        # third feature set creates character 5-grams
        # and then selects the top 1000 most informative
        base_features + [
                    text.CharGrams('Comment', chars=5),

        # the fourth feature set creates 100 latent vectors
        # from the character 5-grams
        base_features + [
                text.CharGrams('Comment', chars=5),

    # we'll try two estimators (and wrap them so
    # we get class probabilities as output):
We've defined some base features along with four feature sets that seem promising.

 Now, let's run cross-validation and compare the results:
for config in factory:, context, folds=5, repeat=2, 
Here are a couple snippets of the output:

 model: Probabilites for LogisticRegression(
          C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty=l2, tol=0.0001)
 3 features
 target: Insult
0.8679 (+/- 0.0101) [0.8533,0.8855]


 model: Probabilites for GaussianNB()
 3 features
 target: Insult
0.6055 (+/- 0.0171) [0.5627,0.6265]

The Logistic Regression model has of course dominated Naive Bayes. The best feature sets are the 100-vector LSI and the 1000-word character 5-grams. Once a feature is computed, it does not need to be computed again in separate contexts. The binary feature selection is an exception to this though: because it uses target "y" values to select features, Ramp needs to recreate it for each cross validation fold using only the given training values (You can also cheat and tell it not to do this, training it just once against the entire data set.)


We can also create a quick utility that processes a given comment and spit out it's probability of being an insult:
def probability_of_insult(config, ctx, txt):
    # create a unique index for this text
    idx = int(md5(txt).hexdigest()[:10], 16)

    # add the new comment to our DataFrame
    d = DataFrame(
            index=pandas.Index([idx])) =

    # Specify which instances to predict with predict_index
    # and make the prediction
    pred, predict_x, predict_y = models.predict(

    return pred[idx]
And we can run it on some sample text:
        "ur an idiot")

> .8483555

        "ur great")

> .099361
Ramp will need to create the model for the full training data set the first time you make a prediction, but will then cache and store it, allowing you to quickly classify subsequent text.

And more

There's more machine learning goodness to be had with Ramp. Full documentation is coming soon, but for now you can take a look at the code on github:  Email me or submit an issue if you have any bugs/suggestions/comments.