Fermi Paradox for Startups

Fermi paradox for business ideas -- for when you come up with a brilliant business idea and are wondering how no one on the planet has ever thought of it before.

Why doesn’t this business/product exist yet?

Reason | Constraint | Examples | Relative likelihood
No one has thought of the idea before | Creativity | Nuclear fission in 1910 | ~0% (only happens once per idea!)
People have thought of the idea, but no one with the resources/ambition/skills required to attempt it | Initiative | Nuclear fission in 1936 | 5%
People have attempted it but no one has succeeded in building a working solution | Technology | Heavier-than-air flight in 1890 | 5%
People have succeeded in building a solution, but no one wants it | Market - Existence | Segway in 2001 | 10%
People have built a solution that customers will pay for but there is no effective distribution channel | Distribution | Ride sharing in 2000 | 5%
People have built a profitable solution, the market where it works is just small | Market - Scale | Segway in 2017 | 25%
You just aren't paying attention | Your own ignorance | Every idea you haven't heard of | 50%

Successful startups typically come out of the “Initiative” stage, by bringing hustle and cash to new technology, or the “Distribution” stage, by bringing hustle and cash to new distribution channels. The catch is that those two stages only account for about 10% of the ideas you think of!

How the World Works

The first 16 years of my education focused on memorizing facts: passphrases that I handed back to my teacher at exam time (with varying success). After I left the formal education system, I learned about the world through articles and front-page news: stories – interesting stories – that usually had little connection with reality.

Which is to say I went the first 20 or 30 years of my life not learning how the world actually works. To understand how the world actually works requires building a model of the fundamental forces and causes that drive events and behaviors. It requires zooming out of time and place to see the bigger patterns. Memorizing for exams doesn’t incentivize that kind of learning, but it’s the most valuable kind.

The books below helped give me the right frameworks for interpreting and predicting the world. After I had read them I felt like I had a superpower, like the hood had been lifted off the world and I could see its moving parts plainly for the first time. Such is the power of the condensed knowledge of hundreds of years of thought and research contained in books. It’s like cheating, if you can call 8,000-plus pages of dense reading cheating.

The best books are comparative history books. They ask: what are the big events and unknowns that need to be explained and what is the best explanation that fits the cases we can find? You won't find any theory-first books on the list (e.g. of the Gladwellian type). They suffer from a selection problem: it’s too easy to find history (or science) that supports a given theory, so we end up with the most seductive (or confirming) theories being the most popular and widely accepted. Without starting with the history, the whole picture, it's hard to know what theories to believe.
This list isn’t the only such list you could create – there are great books I’ve missed (please share your suggestions) and even greater books that have yet to be written. But it's a start.
  1. History
  2. People
  3. Organizations


History

The first four books are “big history”, covering the whole sweep of humanity and its creations. These books give a foundation for interpreting the modern world. Reading history is like having the magical ability to live parallel lives and explore parallel worlds.

1. Sapiens: A Brief History of Humankind

Sapiens is a big history of the human species, from our evolution in Africa millions of years ago, through the cultural, agricultural, and scientific revolutions. It’s an unvarnished look at the human condition on the grandest scale. It introduces us to a recurring theme: the world is fundamentally evolutionary (“Darwinian”) – those organisms, empires, cultures, organizations, and ideas that survive and/or “reproduce” successfully are the ones that will tend to exist in the world. As in: religion spread not because of its righteousness or the will of a higher power, but because groups of humans that happened to adopt (or be susceptible to) religion cooperated better than those that did not, and were therefore more successful at reproducing and expanding.

Further reading: Non-zero; Guns, Germs, and Steel

2. The Origins of Political Order

Origins of Political Order is another big history book, covering political development and the formation of states from the earliest civilizations up to the industrial revolution. Fukuyama’s sweeping knowledge of comparative history makes for a highly credible journey through the basic realities of state building: what it means to have a functioning government, and how you arrive at one. Few history books achieve so much in so few pages.

3. The Worldly Philosophers

Heilbroner gives us a tour of the history of economic thought, through the lives of the great minds that developed it. Along the way we learn the fundamentals of economics and its historical development, as well as about economic philosophy’s (large, occasionally catastrophic) role in shaping our world, all while giving perspective on modern economic orthodoxy. Immeasurably better than the remedial calculus course you were given as economics 101.

4. The Birth of Plenty

For most of human civilization economics was boring: almost all production was agricultural, growth was driven by a Malthusian population cycle, and entrepreneurship and innovation were limited. The industrial revolution and rise of modern liberal capitalism changed that and transformed the world in the process. Today, direct economic forces drive nearly everything in our daily lives. This is why we are reading a second book on economics.
The Birth of Plenty gives a history of modern economic development, describing the conditions under which economic growth and prosperity occur.



People

We all have an intuitive understanding of ourselves, the people in our lives, and the social interactions that are so important to our world. Nonetheless, we carry blindspots (often by design) to some of the basic aspects of our behavior. The three books in this section illuminate the more hidden elements of people and their behaviors, as well as providing a foundation for thinking about the source and motivation of behaviors in general.

5. Sociobiology

Sociobiology provides the foundation for understanding animal, and thus human, behavior as social organisms, covering the biological and evolutionary origins of altruism, cooperation, aggression, sex, and everything in between.

Further reading: The Selfish Gene; On Human Nature

6. The Social Animal

Psychology is plagued by bad science and by popular writing that misinterprets it. There are a lot of over-reaching extrapolations and over-confident generalizations of narrow, contrived experiments on college students. This isn’t researchers’ fault; the science is inherently hard: you can only manipulate one variable in an experiment, but social and psychological contexts involve dozens of complex dimensions, many of which can be important to the outcome. To make it worse, often both context and outcomes are not directly observable and can only be measured by proxy. Bottom line: it is hard to experimentally verify human behavior.
With that giant caveat, I present The Social Animal, one of the best of the lot. It is a compendium of social psychology research, covering our behaviors under social conditions, including all our quirks and dysfunctions.

Further reading: Influence; Thinking, Fast and Slow

7. The World’s Religions

For most of human civilization, religion was (and still is, to various extents) the foundation of society, culture, rule of law, and even economic production. The World’s Religions is a dive into each of the world’s seven main religions – Hinduism, Buddhism, Confucianism, Taoism, Islam, Judaism, and Christianity – plus consideration of primal religions. Smith is an engaging, if not particularly critical, narrator.



Organizations

Organizations are the foundation for the basic functions of society – businesses, municipalities, civic groups, churches, etc. They are the day-to-day institutions that operate the world. We’ll study the two most important here: Political and business organizations.

8. Titan: The Life of John D. Rockefeller

Titan is the story of the life of John D. Rockefeller and the rise of Standard Oil. It’s also the story of entrepreneurship, industry, and business – and the fundamental forces at work in capitalism. [Sidenote: I’m looking for a better business book recommendation, since the list includes another, more important, book on oil. It would be nice to have some diversity of industry!]

9. The Power Broker

This biography of New York public servant Robert Moses is a case study in power and politics – how an individual can accumulate power and resources politically, and the true dynamics of political systems. One of the best biographies ever written.

Further reading: The Dictator’s Handbook

10. The Prize: The Epic Quest for Oil, Money, and Power

A recounting of the 20th century through the lens of oil, which is a very illuminating lens. It shows us the interaction of state and business, and how the dynamics of power and money play out on a grand scale.

Intelligence is not magic

I just finished reading Nick Bostrom’s Superintelligence. I found the book to be impressive, thought-provoking, and total nonsense.

If you haven’t read it, the crux of it is this: Imagine an agent with unlimited superpowers (specifically and especially the power for unlimited self-improvement) determined to achieve some goal. Boy, wouldn’t that be bad?

It would. An agent with unlimited god-like superpowers would be hard to stop, by definition. We could have ended the book right there, and saved each other 400 pages of trouble.

The book reminds me of the Less Wrong crowd: take a seductive but fundamentally flawed premise — in the LW case: that humans have a stable, knowable utility function; or here in the AI case: that super-intelligence implies omniscience and omnipotence — and then build up an impressive edifice of logic and analysis, adorned with great quantities of invented pseudo-technical jargon, and proceed to evangelize from the top.

The problem is that superintelligence does not imply omniscience or omnipotence. Let me try to convince you of that.

Intelligence is not fitness
First, it might be helpful to define intelligence (something Bostrom never really does!). Importantly, it’s easy to conflate intelligence with fitness for a particular environment. Intelligence is a component of fitness, but an arbitrarily useful or useless one, depending on the circumstances. For example, if you put me in a cage with a drooling, starved lion, no amount of intelligence will save me — there is the fundamental constraint that I cannot physically alter the “hungry -> hunt prey -> eat prey” part of the lion’s brain with my available resources. I am not fit for this particular environment, even if I might be intelligent. This is why, despite our superior intelligence, ants and bacteria are wildly more successful species (by almost any possible metric: total biomass, impact on the planet, longevity, probability of existence X years in the future, comparative ability to annihilate the other respective species, total energy consumption. Even total data exchanged!). Ants and bacteria are fit for earth.

Formally, the fitness of an agent is its inherent ability to achieve a broad range of goals and sub-goals in a given environment (typically related to a super-goal of survival, reproduction, or “winning” some game). We can think of an agent as composed of three capabilities: 1) input sensors, 2) a decision-making system, and 3) available actions. This agent exists in an environment and is created with an initial state.



Intelligence, then, is the quality of the decision-making component: for a given environment, starting state, and set of input data, how close to optimal is the agent in choosing a sequence of available actions to maximize the probability of achieving its goal? So besides intelligence, an agent’s fitness is determined by its input sensors, its available actions (both of which are related to its physical form), and its starting state in the environment (starting in a cage with a hungry lion, for instance). The available actions are a function of the state of the environment and the state of the agent, but may be fundamentally/physically constrained.
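To make that decomposition concrete, here is a minimal sketch in Python (my own framing, not Bostrom’s; all names are illustrative):

from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class Agent:
    sense: Callable[[Any], Any]              # 1) input sensors: environment state -> observation
    decide: Callable[[Any, List[str]], str]  # 2) decision-making system ("intelligence")
    actions: Callable[[Any], List[str]]      # 3) available actions, constrained by physical form

def step(agent, env_state):
    # Fitness depends on all three components plus the starting state,
    # not just on how good `decide` is.
    options = agent.actions(env_state)
    if not options:
        return None  # a perfect decision-maker with no useful actions is still stuck
    observation = agent.sense(env_state)
    return agent.decide(observation, options)

A caged human facing a hungry lion has a perfectly good decide function; it’s the actions set that comes up empty.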

I’m not able to beat the lion because there is no plausible action sequence that allows me to gain the strength or jaw power to overcome the lion, or to alter the lion’s deeply-ingrained hunting instinct. I am fundamentally limited by my available set of actions (my physical form), and my starting state, so no amount of intelligence can help me.

Similarly, if I play a game of poker against the world’s best poker player, someone far better at choosing betting actions than me, i.e. more intelligent in this domain, I would do quite well if I had a hidden video feed of my opponent’s cards. The expert player would have no chance, despite greater intelligence in the given environment, since I have access to additional input data that they do not. Intelligence, again, would be of no help.

Now, in the real world, an intelligent agent can, of course, take a sequence of actions that augments their physical capabilities such that they acquire additional input data and/or additional candidate actions. But the exact capabilities they can acquire, given their starting state and physical form, is an open question. It’s not a given that they can eventually acquire all capabilities necessary. The agent may be fundamentally constrained in a way it can’t possibly overcome. Intelligence is not omnipotence.

Take, for example, a common trope of runaway AI scenarios where the AI manipulates an unwitting human into doing its bidding. Critical to the scenario is supposing that there actually exists some sequence of words and actions that produces the desired action in the human. But it’s possible, maybe even likely, that no such sequence exists. Just like you can’t get your puppy to take a shit where you want him to, despite your supposedly far superior intelligence, you can’t necessarily make a human unlock the network firewall just because you’ve outsmarted them. Theoretically, it is quite possible that there does not exist any sequence of visual or audio signals sent to a human brain that produces a particular desired action — no matter how perfect your model of that brain and its responses to stimuli, or how many scenarios you can play out on that brain. If my dog wants to shit on my sofa, and all I have to stop him is a microphone to project my voice into the room, there might not be any sequence of audio signals I can produce that will stop him. It’s quite likely, actually. My physical instantiation in this scenario prevents me from taking the required actions (physically preventing the dog from getting on the sofa; offering him Kibbles treats to please, pretty please, shit somewhere else; etc.). It’s hopeless for me. Go ahead, Ruffo, ruin the couch. Good boy.

Here is where Bostrom would pull out his magic superintelligence wand and have the AI discover a new theory of sound waves that allows it to use the microphone to map and alter the neurons in the dog’s brain. Good luck with that!

You can’t always outsmart everyone
Ever played rock-paper-scissors against a 4-year-old? Well, I just played my niece the other day, and she’s bad at it. Like, can’t even make a scissors with her hand bad. Really though, rock-paper-scissors is pretty simple. By the end my niece mostly got it; her main problems were a faulty random number generator and insufficient finger dexterity. No amount of superintelligence is going to help you do much better: the only way to win consistently is to predict a random number generator, which is impossible. But how much of the real world is like a rock-paper-scissors game? How often does the optimal strategy depend on responding in a relatively straightforward, simple way to what are essentially random events? Well, perhaps a lot.
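If you want to see that intuition play out, here is a quick simulation (mine, and deliberately crude): against an opponent with a working random number generator, every strategy averages out to roughly zero.

import random

MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def score(mine, theirs):
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1

def average_score(strategy, rounds=100000):
    total, history = 0, []
    for _ in range(rounds):
        opponent = random.choice(MOVES)  # a working random number generator
        total += score(strategy(history), opponent)
        history.append(opponent)
    return float(total) / rounds

print(average_score(lambda history: "rock"))                              # ~0.0
print(average_score(lambda history: history[-1] if history else "rock"))  # ~0.0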

The world is enormously complex and inherently chaotic, in a way that is likely permanently intractable. Predicting weather patterns, or even a given human’s response to a specific stimulus, depends on an enormous number of variables and on complex, interdependent interactions that may have no good model and that are highly sensitive to small changes. The bottom line, theoretically, is that the computational power of the universe will always so vastly outstrip any of our attempts to model it that predicting the future, and the world’s reactions to our actions, will often be hopeless, no matter how much intelligence you have. Intelligence is not omniscience.

Take for example the Trump candidacy. Could a superintelligence have predicted his eventual election to the highest office in the world? An event of significant importance if you are trying to plan for the future. Could it have beaten the prediction market’s estimation of 80% favoring Hillary in May? Probably, yes. It’s likely the world was fundamentally underestimating the chances of a Trump presidency, and a superintelligence -- like say FiveThirtyEight -- could have done better. But there is no chance that it could have made the prediction with any certainty. The best it could have said is something like 70% favoring Hillary instead of 80%, given the facts at the time. Trump’s ultimate election likely came down to some quite random, small events — Russians hacking servers, James Comey making a personal decision a few days before the election, Hillary just generally bungling it — things that would be extremely hard (impossible?) to predict back in May. It is easy to imagine many scenarios where the ball fell the other way, and we ended up with madam president Clinton.

Knowledge is not free
The world is chaotic, but we can still learn simplified models of it that are usefully accurate. Learning these models, though, is expensive, for a few reasons. Primarily, because you have to actually physically observe and interact with the world to understand it: most knowledge is not deducible from first principles. Knowledge — the accumulation and modeling of input data by the decision making system — is what makes a superintelligence useful. It allows it to make accurate predictions about the future state of its environment and the consequences of its possible actions. But you can’t make these predictions without a lot of observation and interaction with that environment. You can’t just use your super brain to start deducing how everything in the world works. Our world is a specific example of a possible world, but one of an infinite many such possible worlds (presumably). Science, math, physics -- all of human knowledge — is the process of discovering which of those logically and mathematically consistent (we presume) worlds we actually live in. Even the simplest, most fundamental laws of the universe have only revealed themselves after careful and laborious inspection and experimentation. Only close observation of manipulations of our environment allow us to gather knowledge about it and how it works. A superintelligence can draw inferences faster and more accurately given the same data (this is basically the definition of superintelligence), but only within the limits of the given data. Einstein’s theory of relativity would have been wrong in 1850. Well, not so much wrong, but no more right than any other of the infinite possible explanations for observed phenomena, and indeed not the simplest or most likely explanation. It was only the unexplainable observations of the Michelson-Morley experiments that required relativity.

The point is that learning how the world works takes great time and resources, and you can’t just spin up a giant brain in a vacuum and get knowledge for free. Even ingesting all of the observational data humans have accumulated to date (books, the internet) will likely limit you to not much more than humans already know. To develop a better model of how humans socialize will require, well, socializing with a lot of humans. Maybe it could be inferred from, say, watching all of the available video content on Youtube, but it’s also likely that doing so will give you a very warped perception of the world.

Which brings us to the bottom line: an agent can’t know if its model of the world is correct without explicit testing, either through sufficiently accurate (and therefore expensive) simulation or through actual interaction with the world. An agent could build an extremely advanced and sophisticated model of the world based on watching Youtube videos, but it won’t know in what ways its model is weirdly broken until it tests it in the real world (like, whoa, not everyone is an incessant nattering ass 😀).

Further, humans can, and have, created systems with immunity to intelligence and manipulation. This is why it’s still mostly a “who you know” world — we trust and like the people we spend time with and develop a close connection with. Importantly, we require the other person to make sacrifices of their time and resources before we trust them and do business with them. It is a mechanism that is impervious to intelligence and gaming. And it’s how the world still works, for the most part, for better or worse.

To be clear, none of this is to deny that machine intelligence will eventually surpass and render obsolete human intelligence, or to deny that the evolution of the technology we humans create will eventually displace us. But it will be a scenario that plays out over decades, centuries, or, most likely, millennia; not “hours or days”. It will require, just like the beginning of organic life, the fortuitous mix of a large number of elements in the right environment, and many, many false starts, before the machines supersede us. More likely is that by the time superintelligence happens, humans will have retreated to virtual worlds anyways. You’ve seen the movie.

The perfect smartphone

Let me introduce you to the 2014 BLU Advance 4.0:



By any objective measure, it's a sad, useless device. Sorry BLU, it's true. Retailing for just $70 in 2014, it was an off-brand, bare-bones “hold-over” phone for almost anyone who bought one at the time. The only consistent pro in the Amazon reviews is "price" and the most frequent con is "one of the worst purchases of my life". Now, three years later, an eon in smartphone time, I can’t imagine there is anyone on the entire planet who is still using their BLU Advance 4.0 as their day-to-day phone.

Anyone, that is, except me.

Here in late 2017, with gorgeous, refined, powerful phones released every few months, I use just one smart device: the BLU Advance 4.0. And it’s my favorite phone. Ever.

Why?

Well, with the BLU Advance 4.0 in your hands, it would not be immediately obvious. You’d spend the first few moments desperately tilting the phone back and forth, trying to find a viewing angle where you could actually see the screen. This would be futile. Once your eyes had adjusted to the dark, washed-out -- yet highly reflective -- screen you’d be able to make out the pin lock screen. Moving your thumb to enter the code with the swift, precise motions you’ve honed over thousands of hours of smartphone usage, you’d find none of your taps had registered. You’d then learn, over a frustrating few minutes, that the touchscreen requires enormous and precise pressure to register anything accurately. A few minutes later, your thumb aching, you’d have finally arrived at the home screen, perhaps then venturing to check your email, clicking through a link in one of your emails. As the BLU heroically summoned its resources to render the modern bloated web page it would crumple under the memory load and then force close the browser. Thus would end your (not atypical) session with the BLU Advance 4.0.

So no, it’s not because it is a great phone, per se, that I use it, but really the opposite: the BLU Advance 4.0 is the optimally useless smart device. Smart phones, like most technology, have given us something that “we could never live without”, even though just a few years earlier the world was humming along just splendidly without them; and something that -- again like most technology -- has turned on its creators: controlling them; adding stresses, temptations, and tests of will that were unknown to humankind before its introduction. Our phones are constantly demanding our attention -- check your email, play that game, scroll your instagram, watch a youtube highlight, swipe your dignity away on tinder -- each surrender reinforcing the dopamine reward pathways. Do we really get any benefit from having all these things real-time, right in our pocket? How many of them could wait, or just disappear forever? Ever go on a weekend trip where you don't have reception? How blissful and wonderful is that time; how quickly do we forget smartphones ever mattered.

With my BLU Advance 4.0, I don’t have any of these stresses, temptations, micro-addictions, or existential questions: my smartphone has returned to the background; a utility, a tool that I choose to use when I need it -- when I really need it. With 600mb of storage after the OS and pre-installed bloatware, the phone only has room for 5-10 downloaded apps, which means only the essential utilities: Google Maps, messaging (Signal, Whatsapp, Messenger), Audible, Spotify, Podkicker, and Lyft. And if you do try to download Instagram, Youtube, or a functional modern browser, for instance, you will find the BLU’s meek hardware capable of rendering only a maddeningly slow and glitchy experience, slow and glitchy enough to break the dopamine cycle.

Since my Nexus 5x stopped working two months ago and I resurrected my BLU Advance 4.0 from the mass grave of smart devices in the bottom drawer of my Ikea desk at home, I’ve been more peaceful, relaxed and happy. Intermittent downtime that used to be dominated by smartphone usage I now find filled with reflection on my day, observation of my surroundings, and peaceful daydreaming.

You can try to go cold turkey and buy a flip phone instead, or leave your phone at home, but you will not last. Each of us now has “essential” tasks we need to perform on our phones; you won’t be able to escape them all if you plan on integrating into modern (business) life. The BLU Advance 4.0 gives you just enough to perform these essential functions, and absolutely nothing more.

Is it all roses? No. The BLU Advance 4.0 takes grainy smudges for pictures, for instance. But the best pictures are mementos, not masterpieces anyways. And, yes, you will occasionally curse your BLU Advance 4.0 to the grave for being the glitchy, unresponsive piece of shit that it is. BUT this only affords you the opportunity to reflect on the meaning of your life, the insignificance of the day-to-day, and the eternal virtue — and reward — of patience.

Switching to the BLU has been the best lifestyle change I’ve made this year. I’d tell you to go out and buy one right now, but the secret seems to be out already.

Comparison of Semantic Role Labeling (aka Shallow Semantic Parsing) Software

There are quite a few high-quality Semantic Role Labelers out there; I recently tried a few of them out and thought I'd share my experiences.

If you are unfamiliar with SRL, from wikipedia:
Semantic role labeling, sometimes also called shallow semantic parsing, is a task in natural language processing consisting of the detection of the semantic arguments associated with the predicate or verb of a sentence and their classification into their specific roles. For example, given a sentence like "Mary sold the book to John", the task would be to recognize the verb "to sell" as representing the predicate, "Mary" as representing the seller (agent), "the book" as representing the goods (theme), and "John" as representing the recipient. This is an important step towards making sense of the meaning of a sentence. A semantic representation of this sort is at a higher-level of abstraction than a syntax tree. For instance, the sentence "The book was sold by Mary to John" has a different syntactic form, but the same semantic roles.
SRL is generally the final step in an NLP pipeline consisting of tokenizer -> tagger -> syntactic parser -> SRL. The following tools implement various parts of this pipeline, typically using existing external libraries for the steps up to role labeling.

I've provided sample output where possible for the sentence: "Remind me to move my car friday." Ideally, an SRL system should extract the two predicates (remind and move) and their proper arguments (including the temporal "friday" argument).
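For reference, here is roughly the structure a labeler should recover for that sentence, written as a plain Python literal. The role names are generic and mine; each tool below uses its own inventory (PropBank- or FrameNet-style labels), so treat this only as a sketch of the target output.

expected_roles = [
    {"predicate": "remind",
     "recipient": "me",                    # the one being reminded
     "content": "to move my car friday"},  # what to be reminded of
    {"predicate": "move",
     "agent": "me",                        # who does the moving
     "theme": "my car",                    # what gets moved
     "time": "friday"},                    # the temporal argument
]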

Without further ado...

Labelers

In order from most recent release to oldest:

Mate-tools

Authors: Anders Björkelund, Bernd Bohnet, Love Hafdell, and Pierre Nugues
Latest release: Nov 2012
Comes with a nice server/web interface. Has trained models for English, Chinese, and German. A newer version uses a graph-based parser, but does not provide trained models. Achieved some of the top scores at the CoNLL 2009 shared task (SRL-only). You can try it out yourself here: http://barbar.cs.lth.se:8081/
~1.5gb RAM
Example output

SEMAFOR

Authors: Dipanjan Das, Andre Martins, Nathan Schneider, Desai Chen and Noah A. Smith at Carnegie Mellon University.
Latest release: May 2012
Trained on FrameNet. Extracts nominal frames as well as verbal ones.
Resource intensive (~8gb RAM for me on 64bit).
Example output

SENNA

Authors: R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa
Latest release: August 2011
The only completely self-contained library on this list. Very fast and efficient C code. Non-commercial license.
~180mb RAM

SwiRL

Author: Mihai Surdeanu
Latest release: 2007
If you want to compile with gcc > 4.3 you need to add some explicit c headers (much easier than trying to install multiple gccs!). I put a patched version up on github if you're interested (I also made some OSX compatibility changes and hacked on a server mode): github.com/kvh/SwiRL
C++ code; uses AdaBoost and the Charniak parser. Fast and efficient.
~150mb RAM
Example output

Shalmaneser

Authors: K. Erk and S. Pado
Latest release: 2007
You'll need to download and install TnT, TreeTagger, the Collins parser, and MALLET to get this running. Uses actual FrameNet labels (including nominal targets) and comes with pre-trained classifiers for FrameNet 1.3.
Low memory usage.

Curator

I couldn't get a local install of this working. The web demo works though, so you can give that a go. You can see all their software demos here: http://cogcomp.cs.illinois.edu/curator/demo/

LTH

Authors: Lund University
This work has been subsumed by Mate-tools.
Example output

Conclusions

The Java libraries can get memory hungry, so if you are looking for something more lightweight, I would recommend either SwiRL or SENNA. In terms of labeling performance, direct comparisons between most of these libraries are hard due to their varied outputs and objectives. Most perform at or near state-of-the-art, so it's more about what fits your needs.

Let me know if I missed any!

Getting started with Ramp: Detecting insults

Ramp is a python library for rapid machine learning prototyping. It provides a simple, declarative syntax for exploring features, algorithms and transformations quickly and efficiently. At its core it's a pandas wrapper around various python machine learning and statistics libraries (scikit-learn, rpy2, etc.). Some features:
  • Fast caching and persistence of all intermediate and final calculations -- nothing is recomputed unnecessarily.
  • Advanced training and preparation logic. Ramp respects the current training set, even when using complex trained features and blended predictions, and also tracks the given preparation set (the x values used in feature preparation -- e.g. the mean and stdev used for feature normalization.)
  • A growing library of feature transformations, metrics and estimators. Ramp's simple API allows for easy extension.

Detecting insults

Let's try Ramp out on the Kaggle Detecting Insults in Social Commentary challenge. I recommend grabbing Ramp straight from the Github repo so you are up-to-date.

First, we load up the data using pandas. (You can download the data from the Kaggle site, you'll have to sign up.)
import pandas

training_data = pandas.read_csv('train.csv')

print training_data

pandas.core.frame.DataFrame
Int64Index: 3947 entries, 0 to 3946
Data columns:
Insult     3947  non-null values
Date       3229  non-null values
Comment    3947  non-null values
dtypes: int64(1), object(2)
We've got about 4000 comments along with the date they were posted and a boolean indicating whether or not the comment was classified as insulting. If you're curious, the insults in question range from the relatively civilized ("... you don't have a basic grasp on biology") to the mundane ("suck my d***, *sshole"), to the truly bottom-of-the-internet horrific (pass).

Anyways, let's set up a DataContext for Ramp. This involves providing a store (to save cached results to) and a pandas DataFrame with our actual data.
from ramp import *

context = DataContext(
              store='~/data/insults/ramp', 
              data=training_data)
We just provided a directory path for the store, so Ramp will use the default HDFPickleStore, which attempts to store objects (on disk) in the fast HDF5 format and falls back to pickling if that is not an option.

Next, we'll specify a base configuration for our analysis.
base_config = Configuration(
    target='Insult',
    metrics=[metrics.AUC()],
    )
Here we have specified the DataFrame column 'Insult' as the target of our classification and AUC as our metric.

Model exploration

Now comes the fun part -- exploring features and algorithms. We create a ConfigFactory for this purpose, which takes our base config and provides an iterator over declared feature sets and estimators.
import sklearn
from ramp.estimators.sk import BinaryProbabilities

base_features = [
    Length('Comment'),  
    Log(Length('Comment') + 1)
]

factory = ConfigFactory(base_config,
    features=[
        # first feature set is basic attributes
        base_features,

        # second feature set adds word features
        base_features + [
            text.NgramCounts(
                text.Tokenizer('Comment'),
                mindocs=5,
                bool_=True)],

        # third feature set creates character 5-grams
        # and then selects the top 1000 most informative
        base_features + [
            trained.FeatureSelector(
                [text.NgramCounts(
                    text.CharGrams('Comment', chars=5),
                    bool_=True,
                    mindocs=30)
                ],
                selector=selectors.BinaryFeatureSelector(),
                n_keep=1000,
                target=F('Insult')),
            ],

        # the fourth feature set creates 100 latent vectors
        # from the character 5-grams
        base_features + [
            text.LSI(
                text.CharGrams('Comment', chars=5),
                mindocs=30,
                num_topics=100),
            ]
    ],

    # we'll try two estimators (and wrap them so
    # we get class probabilities as output):
    model=[
        BinaryProbabilities(
            sklearn.linear_model.LogisticRegression()),
        BinaryProbabilities(
            sklearn.naive_bayes.GaussianNB())
    ])
We've defined some base features along with four feature sets that seem promising.

 Now, let's run cross-validation and compare the results:
for config in factory:
    models.cv(config, context, folds=5, repeat=2, 
              print_results=True)
Here are a couple snippets of the output:
...

Configuration
 model: Probabilites for LogisticRegression(
          C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty=l2, tol=0.0001)
 3 features
 target: Insult
auc
0.8679 (+/- 0.0101) [0.8533,0.8855]

...

Configuration
 model: Probabilites for GaussianNB()
 3 features
 target: Insult
auc
0.6055 (+/- 0.0171) [0.5627,0.6265]

...
The Logistic Regression model has of course dominated Naive Bayes. The best feature sets are the 100-vector LSI and the top-1000 character 5-grams. Once a feature is computed, it does not need to be computed again in separate contexts. The binary feature selection is an exception to this, though: because it uses target "y" values to select features, Ramp needs to recreate it for each cross-validation fold using only the given training values. (You can also cheat and tell it not to do this, training it just once against the entire data set.)

Predictions

We can also create a quick utility that processes a given comment and spits out its probability of being an insult:
from hashlib import md5
from pandas import DataFrame

def probability_of_insult(config, ctx, txt):
    # create a unique index for this text
    idx = int(md5(txt).hexdigest()[:10], 16)

    # add the new comment to our DataFrame
    d = DataFrame(
            {'Comment':[txt]}, 
            index=pandas.Index([idx]))
    ctx.data = ctx.data.append(d)

    # Specify which instances to predict with predict_index
    # and make the prediction
    pred, predict_x, predict_y = models.predict(
            config, 
            ctx,
            predict_index=pandas.Index([idx]))

    return pred[idx]
And we can run it on some sample text:
probability_of_insult(
        logreg_lsi_100_config, 
        context, 
        "ur an idiot")

> .8483555

probability_of_insult(
        logreg_lsi_100_config, 
        context, 
        "ur great")

> .099361
Ramp will need to create the model for the full training data set the first time you make a prediction, but will then cache and store it, allowing you to quickly classify subsequent text.

And more

There's more machine learning goodness to be had with Ramp. Full documentation is coming soon, but for now you can take a look at the code on github: https://github.com/kvh/ramp.  Email me or submit an issue if you have any bugs/suggestions/comments.

What's the best machine learning algorithm?

Correct answer: “it depends”.

Next best answer: Random Forests.

A Random Forest is a machine learning procedure that trains and aggregates a large number of individual decision trees. It works for any generic classification or regression problem; is robust to different variable input types, missing data, and outliers; has been shown to perform extremely well across large classes of data; and scales reasonably well computationally (it’s also map-reducible). Perhaps best of all, it requires little tuning to get good results. Robustness and ease-of-use are not often appreciated as they should be in machine learning (not to the extent buzzwordy names are, anyways), and it's hard to beat tree ensembles, and Random Forests in particular, on these dimensions.

Random forests work by generating a large number (typically hundreds) of decision trees in a specific random way such that each is de-correlated from the others. Since each decision tree is a low-bias, high-variance estimator, and each is relatively uncorrelated with the others, when we aggregate their predictions we get a final prediction with low bias AND low variance. Magic. The trick is in getting trees trained on the same dataset to be uncorrelated. This is accomplished by using randomly sampled subsets of features for evaluation at each node in each tree and a randomly sampled subset (bootstrap) of data points to train each tree.
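If you want to see those knobs spelled out, here is a quick sketch using scikit-learn's implementation rather than the R package mentioned below (the dataset is just a stand-in):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

model = RandomForestClassifier(
    n_estimators=500,     # hundreds of trees, aggregated by majority vote
    max_features="sqrt",  # random subset of features considered at each split
    bootstrap=True,       # each tree trains on a bootstrap sample of the rows
    n_jobs=-1,
)

print(cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())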

Put simply, if you have a machine learning problem and you don’t know what to use, you should use random forests. Here, in table form (courtesy of Hastie, Tibshirani and Friedman), is why:




Random forests inherit most of the good attributes of "Trees" in the above chart, but in addition also have state-of-the-art predictive power. Their main drawbacks are a lack of good interpretability, something that most other highly predictive algorithms do even worse on; and computational performance -- if you need something for real-time production, it could be hard to justify using random forests and spending the time to evaluate hundreds or thousands of trees.

If you are interested in playing around, grab the R package.

I recently heard the president of Kaggle, Jeremy Howard, mention that Random Forests seem to show up in a disproportionate number of winning entries in their data mining competitions. Cross-validation, I call that.


More reading:
A Comparison of Decision Tree Ensemble Creation Techniques
An Empirical Comparison of Supervised Learning Algorithms