Sunday, November 18, 2012

Comparison of Semantic Role Labeling (aka Shallow Semantic Parsing) Software

There are quite a few high-quality Semantic Role Labelers out there. I recently tried a few of them and thought I'd share my experiences.

If you are unfamiliar with SRL, from Wikipedia:
Semantic role labeling, sometimes also called shallow semantic parsing, is a task in natural language processing consisting of the detection of the semantic arguments associated with the predicate or verb of a sentence and their classification into their specific roles. For example, given a sentence like "Mary sold the book to John", the task would be to recognize the verb "to sell" as representing the predicate, "Mary" as representing the seller (agent), "the book" as representing the goods (theme), and "John" as representing the recipient. This is an important step towards making sense of the meaning of a sentence. A semantic representation of this sort is at a higher-level of abstraction than a syntax tree. For instance, the sentence "The book was sold by Mary to John" has a different syntactic form, but the same semantic roles.
SRL is generally the final step in an NLP pipeline consisting of tokenizer -> tagger -> syntactic parser -> SRL. The following tools implement various parts of this pipeline, typically using existing external libraries for the steps up to role labeling.

I've provided sample output where possible for the sentence: "Remind me to move my car friday." Ideally, an SRL system should extract the two predicates (remind and move) and their proper arguments (including the temporal "friday" argument).

Without further ado...

Labelers

In order from most recent release to oldest:

Mate-tools

Authors: Anders Björkelund, Bernd Bohnet, Love Hafdell, and Pierre Nugues
Latest release: Nov 2012
Comes with a nice server/web interface. Has trained models for English, Chinese and German. A newer version uses a graph-based parser, but does not provide trained models. Achieved some of the top scores at the CoNLL 2009 shared task (SRL-only). You can try it out yourself here: http://barbar.cs.lth.se:8081/
~1.5gb RAM
Example output

SEMAFOR

Authors: Dipanjan Das, Andre Martins, Nathan Schneider, Desai Chen and Noah A. Smith at Carnegie Mellon University.
Latest release: May 2012
Trained on FrameNet. Extracts nominal frames as well as verbal ones.
Resource intensive (~8gb RAM for me on 64-bit).
Example output

SENNA

Authors: R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa
Latest release: August 2011
The only completely self-contained library on this list. Very fast and efficient C code. Non-commercial license.
~180mb RAM
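
For a rough idea of how you might call SENNA from Python, here's a minimal sketch (the binary name, invocation details and column layout vary by platform and version, so treat them as assumptions):
import subprocess

SENNA_DIR = '/path/to/senna'  # directory holding the senna binary and its data files

def senna_tag(sentence):
    # SENNA reads one sentence per line on stdin and writes
    # whitespace-aligned columns (token, POS, chunk, NER, SRL, ...) to stdout
    p = subprocess.Popen(['./senna'], cwd=SENNA_DIR,
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    out, _ = p.communicate(sentence + '\n')
    return [line.split() for line in out.strip().split('\n')]

for row in senna_tag('Remind me to move my car friday .'):
    print row
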
SwiRL

Author: Mihai Surdeanu
Latest release: 2007
If you want to compile with gcc > 4.3 you need to add some explicit C headers (much easier than trying to install multiple gccs!). I put a patched version up on GitHub if you're interested (I also made some OSX compatibility changes and hacked on a server mode): github.com/kvh/SwiRL
C++ code; uses AdaBoost and the Charniak parser. Fast and efficient.
~150mb RAM
Example output

Shalmaneser

Authors: K. Erk and S. Pado
Latest release: 2007
You'll need to download and install TnT, TreeTagger, the Collins parser and Mallet to get this running. Uses actual FrameNet labels (including nominal targets), and comes with pre-trained classifiers for FrameNet 1.3.
Low memory usage.

Curator

I couldn't get a local install of this working. The web demo works though, so you can give that a go. You can see all their software demos here: http://cogcomp.cs.illinois.edu/curator/demo/

LTH

Authors: Lund University
This work has been subsumed by Mate-tools.
Example output

Conclusions

The Java libraries can get memory hungry, so if you are looking for something more lightweight, I would recommend either SwiRL or SENNA. In terms of labeling performance, direct comparisons between most of these libraries are hard due to their varied outputs and objectives. Most perform at or near state-of-the-art, so it's more about what fits your needs.

Let me know if I missed any!

Monday, November 5, 2012

Getting started with Ramp: Detecting insults

Ramp is a python library for rapid machine learning prototyping. It provides a simple, declarative syntax for exploring features, algorithms and transformations quickly and efficiently. At its core it's a pandas wrapper around various python machine learning and statistics libraries (scikit-learn, rpy2, etc.). Some features:
  • Fast caching and persistence of all intermediate and final calculations -- nothing is recomputed unnecessarily.
  • Advanced training and preparation logic. Ramp respects the current training set, even when using complex trained features and blended predictions, and also tracks the given preparation set (the x values used in feature preparation -- e.g. the mean and stdev used for feature normalization).
  • A growing library of feature transformations, metrics and estimators. Ramp's simple API allows for easy extension.

Detecting insults

Let's try Ramp out on the Kaggle Detecting Insults in Social Commentary challenge. I recommend grabbing Ramp straight from the Github repo so you are up-to-date.

First, we load up the data using pandas. (You can download the data from the Kaggle site; you'll have to sign up.)
import pandas

training_data = pandas.read_csv('train.csv')

print training_data

pandas.core.frame.DataFrame
Int64Index: 3947 entries, 0 to 3946
Data columns:
Insult     3947  non-null values
Date       3229  non-null values
Comment    3947  non-null values
dtypes: int64(1), object(2)
We've got about 4000 comments along with the date they were posted and a boolean indicating whether or not the comment was classified as insulting. If you're curious, the insults in question range from the relatively civilized ("... you don't have a basic grasp on biology") to the mundane ("suck my d***, *sshole"), to the truly bottom-of-the-internet horrific (pass).

Anyways, let's set up a DataContext for Ramp. This involves providing a store (to save cached results to) and a pandas DataFrame with our actual data.
from ramp import *

context = DataContext(
              store='~/data/insults/ramp', 
              data=training_data)
We just provided a directory path for the store, so Ramp will use the default HDFPickleStore, which attempts to store objects (on disk) in the fast HDF5 format and falls back to pickling if that is not an option.
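
If you'd rather name the store explicitly, the equivalent would look something like this (a sketch assuming HDFPickleStore is importable from ramp's store module -- check the source for the exact path):
from ramp.store import HDFPickleStore

context = DataContext(
              store=HDFPickleStore('~/data/insults/ramp'),
              data=training_data)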

Next, we'll specify a base configuration for our analysis.
base_config = Configuration(
    target='Insult',
    metrics=[metrics.AUC()],
    )
Here we have specified the DataFrame column 'Insult' as the target for our classification and the AUC for our metric.

Model exploration

Now comes the fun part -- exploring features and algorithms. We create a ConfigFactory for this purpose, which takes our base config and provides an iterator over declared feature sets and estimators.
import sklearn.linear_model
import sklearn.naive_bayes
from ramp.estimators.sk import BinaryProbabilities

base_features = [
    Length('Comment'),  
    Log(Length('Comment') + 1)
]

factory = ConfigFactory(base_config,
    features=[
        # first feature set is basic attributes
        base_features,

        # second feature set adds word features
        base_features + [
            text.NgramCounts(
                text.Tokenizer('Comment'),
                mindocs=5,
                bool_=True)],

        # third feature set creates character 5-grams
        # and then selects the top 1000 most informative
        base_features + [
            trained.FeatureSelector(
                [text.NgramCounts(
                    text.CharGrams('Comment', chars=5),
                    bool_=True,
                    mindocs=30)
                ],
                selector=selectors.BinaryFeatureSelector(),
                n_keep=1000,
                target=F('Insult')),
            ],

        # the fourth feature set creates 100 latent vectors
        # from the character 5-grams
        base_features + [
            text.LSI(
                text.CharGrams('Comment', chars=5),
                mindocs=30,
                num_topics=100),
            ]
    ],

    # we'll try two estimators (and wrap them so
    # we get class probabilities as output):
    model=[
        BinaryProbabilities(
            sklearn.linear_model.LogisticRegression()),
        BinaryProbabilities(
            sklearn.naive_bayes.GaussianNB())
    ])
We've defined some base features along with four feature sets that seem promising.

Now, let's run cross-validation and compare the results:
for config in factory:
    models.cv(config, context, folds=5, repeat=2, 
              print_results=True)
Here are a couple snippets of the output:
...

Configuration
 model: Probabilites for LogisticRegression(
          C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty=l2, tol=0.0001)
 3 features
 target: Insult
auc
0.8679 (+/- 0.0101) [0.8533,0.8855]

...

Configuration
 model: Probabilites for GaussianNB()
 3 features
 target: Insult
auc
0.6055 (+/- 0.0171) [0.5627,0.6265]

...
The Logistic Regression model has, of course, dominated Naive Bayes. The best feature sets are the 100-vector LSI and the 1000-word character 5-grams. Once a feature is computed, it does not need to be computed again in separate contexts. The binary feature selection is an exception to this, though: because it uses target "y" values to select features, Ramp needs to recreate it for each cross-validation fold using only the given training values. (You can also cheat and tell it not to do this, training it just once against the entire data set.)

Predictions

We can also create a quick utility that processes a given comment and spits out its probability of being an insult:
from hashlib import md5
from pandas import DataFrame

def probability_of_insult(config, ctx, txt):
    # create a unique index for this text
    idx = int(md5(txt).hexdigest()[:10], 16)

    # add the new comment to our DataFrame
    d = DataFrame(
            {'Comment':[txt]}, 
            index=pandas.Index([idx]))
    ctx.data = ctx.data.append(d)

    # Specify which instances to predict with predict_index
    # and make the prediction
    pred, predict_x, predict_y = models.predict(
            config, 
            ctx,
            predict_index=pandas.Index([idx]))

    return pred[idx]
And we can run it on some sample text:
probability_of_insult(
        logreg_lsi_100_config, 
        context, 
        "ur an idiot")

> .8483555

probability_of_insult(
        logreg_lsi_100_config, 
        context, 
        "ur great")

> .099361
Ramp will need to create the model for the full training data set the first time you make a prediction, but will then cache and store it, allowing you to quickly classify subsequent text.

And more

There's more machine learning goodness to be had with Ramp. Full documentation is coming soon, but for now you can take a look at the code on github: https://github.com/kvh/ramp.  Email me or submit an issue if you have any bugs/suggestions/comments.

Friday, February 17, 2012

What's the best machine learning algorithm?

Correct answer: “it depends”.

Next best answer: Random Forests.

A Random Forest is a machine learning procedure that trains and aggregates a large number of individual decision trees. It works for any generic classification or regression problem; is robust to different variable input types, missing data, and outliers; has been shown to perform extremely well across large classes of data; and scales reasonably well computationally (it’s also map-reducible). Perhaps best of all, it requires little tuning to get good results. Robustness and ease of use are not appreciated as much as they should be in machine learning (not to the extent buzzwordy names are, anyways), and it's hard to beat tree ensembles, and Random Forests in particular, on these dimensions.

Random forests work by generating a large number (typically hundreds) of decision trees in a specific random way such that each is de-correlated from the others. Since each decision tree is a low-bias, high-variance estimator, and each is relatively uncorrelated with the others, when we aggregate their predictions we get a final prediction with low bias AND low variance. Magic. The trick is in getting trees trained on the same dataset to be uncorrelated. This is accomplished by evaluating only a randomly sampled subset of features at each node in each tree, and by training each tree on a randomly sampled subset (bootstrap) of the data points.
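
For instance, here's a quick sketch using scikit-learn's implementation on a toy dataset (the R package mentioned below has an analogous interface):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer releases

iris = load_iris()

# n_estimators is the number of trees; max_features caps the random
# subset of features considered at each split (bootstrap sampling of
# rows is on by default) -- the two sources of de-correlation above
forest = RandomForestClassifier(n_estimators=500, max_features='sqrt')

print cross_val_score(forest, iris.data, iris.target, cv=5).mean()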

Put simply, if you have a machine learning problem and you don’t know what to use, you should use random forests. Here, in table form (courtesy of Hastie, Tibshirani and Friedman), is why:

[Chart: "Some characteristics of different learning methods" from The Elements of Statistical Learning, comparing neural nets, SVMs, trees, MARS and k-NN/kernel methods on handling of mixed data, robustness, scalability, interpretability and predictive power.]
Random forests inherit most of the good attributes of "Trees" in the above chart, but in addition also have state-of-the-art predictive power. Their main drawbacks are a lack of good interpretability (though most other highly predictive algorithms do even worse on this front) and computational performance -- if you need real-time predictions in production, it could be hard to justify evaluating hundreds or thousands of trees.

If you are interested in playing around, grab the R package.

I recently heard the president of Kaggle, Jeremy Howard, mention that Random Forests seem to show up in a disproportionate number of winning entries in their data mining competitions. Cross-validation, I call that.


More reading:
A Comparison of Decision Tree Ensemble Creation Techniques
An Empirical Comparison of Supervised Learning Algorithms

Wednesday, February 15, 2012

Recurrent -- A python library for natural language parsing of recurring events

For a project I'm working on I needed the ability to turn a natural language phrase like "every other saturday starting next month" into iCalendar-standard RRULEs. I couldn't find a python library that implemented this, so I built it. Check it out on github.


Here are some example input phrases and output recurrence rules:


  • 'on weekdays' => 'RRULE:BYDAY=MO,TU,WE,TH,FR;INTERVAL=1;FREQ=WEEKLY'
  • 'daily starting march 3rd until april 5th' => 'DTSTART:20120303\nRRULE:FREQ=DAILY;INTERVAL=1;UNTIL=20120405'
  • 'the first and third friday of every month' => 'RRULE:BYDAY=1FR,3FR;INTERVAL=1;FREQ=MONTHLY'
  • 'once a year on the fourth thursday in november' => 'RRULE:BYMONTH=11;BYDAY=4TH;INTERVAL=1;FREQ=YEARLY'
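In code, usage looks something like this (a sketch based on the README; treat the exact constructor and method names as assumptions):
import datetime
from recurrent import RecurringEvent
from dateutil.rrule import rrulestr

r = RecurringEvent(now_date=datetime.datetime(2012, 2, 15))
rrule_string = r.parse('on weekdays')
print rrule_string
# 'RRULE:BYDAY=MO,TU,WE,TH,FR;INTERVAL=1;FREQ=WEEKLY'

# the output is standard iCalendar, so dateutil can expand occurrences
rule = rrulestr(rrule_string, dtstart=datetime.datetime(2012, 2, 15))
print rule.after(datetime.datetime(2012, 2, 15))  # the next weekday
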
It's an alpha release currently, so please submit any issues you find.

Thursday, January 5, 2012

Basic Income Guarantee

Job losses from this recession are staggering. Calculated Risk has a nice visualization of the magnitude of man-hours lost.

These losses have not been spread evenly, of course -- the less educated have taken the brunt.

Even under optimistic scenarios (which we are not currently experiencing), it will take 5 to 10 years to get these jobs back. If they ever come back.

This means a lot of people without work and income. The question then is: what should we do about this? Should we do anything?

One option is to give everyone a basic income. The basic income guarantee is an "unconditional, government-insured guarantee that all citizens will have enough income to meet their basic needs." It differs specifically from a negative income tax, like our current social programs, in that every individual (not household) receives the benefit, regardless of wealth or income, unconditionally.

Why would this work? There are a bunch of reasons, which I'll get into, but more pressing is the one big reason why people think it won't work: people will lose the motivation to work. I think this belief represents a misunderstanding of human motivation. People work hard -- go to college, put in the hours -- because of innate desires for status and social approval, not out of terror of failing to meet their basic needs. In fact, under pressure of homelessness or starvation, people will often be forced to make sub-optimal long-term decisions, like foregoing investment (schooling) in themselves or their children.


Anyways, here are the pros and cons as I see them, and my rebuttals.

Cons

Reduced incentive to work
I addressed this above, but there is also hard evidence. Manitoba, Canada, actually tried a basic income program for 5 years, known as Mincome. The results:
On the whole, the research results were encouraging to those who favour a GAI [Guaranteed Annual Income]. The reduction in work effort was modest: about one per cent for men, three per cent for wives, and five per cent for unmarried women. These are small effects in absolute terms.
Smaller and shorter-term studies in the US have found slightly larger effects, around a 5% total drop in employment. I think this is a real drawback. It's easy to see how demand for certain low-paying jobs would decrease, raising prices and reducing output for those goods below an optimal level. The good part about a basic income is that its distortion of incentives is very simple and fairly predictable, unlike most federal social programs, which target specific actions or demographics and create myriad twisted incentives and unintended consequences.

Redistribution
You may or may not see this as a con, but in the end a basic income is a redistribution of wealth from the rich to the poor. (Yes, the "rich" get the income too, but they are paying for their own in taxes, as well as the poor's.) You can make a moral argument against redistribution, but I think it is a weak one. Luck plays an enormous role in any one person's outcome. To the extent we can smooth out luck, I think we should. A basic income guarantee seems like a reasonable step in the right direction.

Inflation
Some people argue a basic income would simply raise prices. Again, this money is not just printed; it comes from a (presumably progressive) tax, so any effect on the money supply would come from marginal consumption differences at high and low income levels, which do exist empirically, but the effect would be small and predictable (i.e. preventable).

Political hazard
It would be hard politically to ever reduce the basic income, and very politically profitable to increase it. This would quickly result in another runaway social program (we have enough of those already). I think this is very easily solved by tying the basic income amount to hard economic and government indicators (inflation, GDP, debt ratio) in a way that guarantees both the solvency of the state and the utility of the income. The rate setting could never be something easily changed by politicians.


Pros

Simplification
I am a HUGE proponent of simplified (not smaller) government. Currently we have

  • medicaid
  • medicare
  • CHIP
  • social security
  • foodstamps
  • unemployment benefits
  • Section 8 housing
  • TANF
  • and more...
All of these could be replaced by a dead simple basic income program: cut everyone over 18 with a social security number a check. The opportunities for the government to screw it up are minimal; the bureaucracy would be tiny.

Increased education
Research on the Mincome program found significant increases in education.

Reduced poverty, homelessness
This is of course the big benefit. The negative externalities of poverty are large (cite, cite). Reducing it benefits everyone. Even if you are rich, have no feelings of empathy for the poor, and only want the best for yourself (in short you are an asshole), I still think it is in your best interest, economically and politically, to keep poverty low.



I was prompted to write about this by my growing belief that human labor, as a factor of production, will eventually and inevitably be priced out of most of the economy. Machines have been and will continue to produce more and more of our goods without our input. This is an awesome thing, and it doesn't mean humans won't find other productive work to fill their time, it just means as a proportion of our total output, our contribution will be continually less significant. This means a lot of surplus value, value that could be productively put towards a basic income instead of, say, capitalists' (bulging) pockets.

Now I sound like a Marxist.

Sunday, November 20, 2011

Introducing Scientifiqa, the science-based Q&A site

I just launched Scientifiqa, yet another general Q&A site (and Stack Exchange clone). What makes this one different? Every answer must be a citation (and summary) of a peer-reviewed academic article or survey. The hope is that this requirement will ensure high-quality discussions and give people an easy and quick way to come up to speed on the current scientific understanding in a particular area.

Some questions already posted:


Or ask your own question.

Please share any suggestions or feedback!

Monday, October 3, 2011

A better arXiv

Imagine if arXiv were a fully functional online community.

Here’s one vision:


Member profiles 
The community would be open to anyone (possibly contingent upon approval by existing members). Profiles would hook into LinkedIn, Mendeley, ResearchGate, etc. to pull in your education, institutions and publications.

Reputation
Every member of the community would have a (private) reputation score based on their “impact” in their field. All actions on the site would be weighted by the member’s reputation, whose calculation would be based on:
  • published papers (traditional impact scores)
  • education, employment and institutions (verified by community)
  • actions within the arXiv community (peer reviews, upvoted comments, well-rated submissions)
This would have to be handled openly and with care, but would help keep quality high and entice well-respected academics to engage in the forum.
Peer review
Members could write and request (public or private) reviews of submissions. This would include “rating” the submission across various dimensions. Possible dimensions:
  • methodology
  • novelty
  • expected impact
Comments
There would also be an outlet for less formal discussion, with upvoting and threading.
Article discovery
There would be support for sorting and filtering submissions by ratings and reviews. There could be a published “journal” of best articles every month, possibly selected by weighted vote, if people felt the need for a formal publication.
In addition there would be a personalized feed of submissions, based on topic filters and a recommender system.
Open access
Most importantly, ANYONE could read the output of this community. Even the taxpayer/college student who funded it all...
In addition to these community features, there could be added capabilities for submitting papers. This could be simple things like support for attaching code, data or arbitrary media; or a more significant overhaul of a “paper”, turning it into what is essentially a web page, with inline links for references, expandable sections, embedded media, etc. Unfortunately, there is no standardized typesetting solution for the web yet, so some work might be required to make this happen.

If this got built and people adopted it, I think it could deliver a swift, fatal blow to the academic publishing industry -- something it desperately needs.

We built science.io with this end vision in mind. If you are involved with arXiv or are interested in making this or something like it happen, I’d love to chat. Get in touch: kvh@science.io