Issue 30 Spring 2010

A beginner's guide to data mining:

What it's for, and what you need

By Kevin MacDonell

A recent special report in The Economist deals with the issue of managing data. Information and data, once scarce, are now abundant in every field. We're drowning in it! And I would say that advancement professionals are in the thick of it. Prospect researchers and their colleagues contribute to the growing mountain of data at their institutions simply by managing their own processes: e.g., coding major gift prospects as they move through a pipeline. Records, donor relations, alumni affairs, events management and other functions add to their own piles through the day-to-day activity of keeping addresses up to date, tracking responses from solicitation programs, processing gifts, and even noting the fact that a certain gala attendee is allergic to shellfish. While our institutions may be able to store all of this information in a central database, individually we may be working the data quite intensively. I think of our own Annual Giving manager, and the complex phonathon and mailing segments she creates. And yet, somehow all this information fails to answer some of our most pressing questions.

Past donor history is often the most important criterion used in forming calling groups for the current phonathon season. (Did they give last year? Better call them again.) Typically, though, the majority of an institution's alumni have never given. Trying to determine which segment of that vast pool of non-donors you should focus your resources on in order to convert them to donors is a challenge. Obviously you're not going to find the answer in giving history data. The above-mentioned example applies specifically to alumni databases, but the next unknown applies to any non-profit.

Which past donors are most likely to be elevated to higher levels of giving? Your hospital foundation's donor database might contain thousands of donors with low levels of lifetime giving. Certainly some of those donors are ready to be asked to give a lot more, but which ones? No doubt, as well, there are donors at higher levels who are ready for a personal call from a development officer. There are major gifts out there, but looking at giving history alone, one donor looks just the same as another at their own level. What about planned giving donors, who tend to fly under the radar already? (Hmm, that mass mailing of planned giving packages didn't generate the response you'd hoped.) If there's an area of giving that requires the personal touch, it's planned giving. But you've got limited staff, and a potential prospect pool of thousands. Sure, you can narrow the field to individuals of a certain age who exhibit the rule-of-thumb characteristics (many small gifts over an extended period of time). But these pools are still going to be too large. And if that rule of thumb isn't reliable, you're focusing on the wrong people. Worse, you're probably approaching them too late in life!

Data mining and predictive modeling can help. Often we use these terms interchangeably, but there is a difference. I associate data mining with the kind of work that an annual giving manager might do to segment the database based on characteristics and assumptions around giving likelihood and channel preference: the results are something the user expects to get already. Predictive analytics, on the other hand, extracts information from the database that the user did not know existed: relationships between variables and donor behaviour which are often non-intuitive. Specifically, a predictive model is a statistical tool that produces a numerical score for every individual in your database, the score being an indicator of how likely it is that an individual will engage in a particular desired behaviour (e.g., donating to our cause), based on knowledge we could not have surmised on our own.

The results we get from a predictive model are not definite answers, but probable answers. It's not about picking individual winners, but about dividing whole populations into segments that have varying degrees of likelihood that they will engage in the desired behaviour. Typical statements I could make after creating a model and scoring everyone in the database would be:

"This 10% of non-donors are most likely to give via the calling program this year"

or

"Here is a list of alumni in the Toronto area in the highest percentile likelihood of being interested in Planned Giving."

Statements such as these should be familiar to anyone who's ever heard Environment Canada report that there is a 60% chance of rain for the weekend. Probable knowledge is imperfect, but it's far preferable to an absence of knowledge.
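The idea of probable rather than definite answers can be made concrete: a model assigns every individual a score, and you simply work the list from the top down. Here is a minimal sketch, with made-up IDs and scores, of how scored individuals might be ranked into a "most likely first" order:

```python
# Hypothetical model scores for five individuals (IDs and values are made up).
# A higher score means a higher estimated likelihood of the desired behaviour.
scores = {"A": 0.91, "B": 0.15, "C": 0.62, "D": 0.88, "E": 0.05}

# Rank individuals from most to least likely; the top slice of this list is
# the segment you would hand to the calling program first.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['A', 'D', 'C', 'B', 'E']
```

With 29,000 records instead of five, the "top 10%" in a statement like the one above is simply the first 2,900 names on this ranked list.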

The list of things you need to get started in predictive modeling is simple to state, yet rather difficult or time-consuming to obtain or master. The three things you need are:

1. A data file

2. Statistics software

3. Something to predict

I'm not being facetious about item number 3. Your choice of desired outcome (annual giving likelihood, planned giving likelihood, or just any giving in general) will determine the type of model you build and in some cases the statistical tools you use to build it with.

The primary hurdles are accessing and organizing the data, and learning how to work with the data using the appropriate software. It's these two steps, not the modeling itself, which are the real barriers to progress for people like you and me who are not data analysts first and foremost. The good news is that it doesn't matter where your data comes from. StFX University uses Banner, but the data could just as easily come from Raiser's Edge, or even an Access database. All you need to be able to do is extract the data in a usable form. For me, that means forming Access queries to extract the data I want from Banner, which took me many months to become proficient at!

If that does not appeal, you can work closely with someone in IT, or someone in your department who is already an old hand at pulling data. What should your data "look like"? If your data were to be displayed as a table (or an Excel spreadsheet), there would be one row for each individual in your database, with no duplicate rows. When I build a model, my data file has one row for every single living alumna/us in our database - that's 29,000 records. That's a lot for a spreadsheet program to handle, but not for stats software as long as your computer is reasonably perky.

For columns, you will need:

1. A column for a unique ID, which matches up with the IDs in your database. Ideally, once you've produced predictive scores for everyone, you will upload your scores to your database (or arrange to have this done). Once they're in the database they can be used for the purpose for which they were created. For example, your Phonathon manager can build his or her calling groups with them.

2. A column for "Total lifetime giving", if you are building a propensity-to-give model. If you want to build a model specifically for Phonathon, then this column should be limited to lifetime giving to the Phonathon campaign. But if this is your first model, stick with Lifetime Giving, as that will give you the most data to work with, and it's the most straightforward way to proceed. The scores that result will be powerfully predictive not only for Annual Giving, but Planned Giving and major giving as well.

3. Columns for other data, which you will explore for relationships with Lifetime Giving. This can be almost anything, but some of the most useful and common predictors used include Class Year (or age), marital status, home phone number, business phone number, and employment (position or employer).

These few elements of data are more than enough to get started. If those are not available, try looking for event attendance data, survey data, contact preference codes - really, anything you can get your hands on.
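To make the "one row per individual" idea concrete, here is a small sketch of what such an extract might look like once pulled from your database. The column names are illustrative, not from Banner or any particular system:

```python
import csv
from io import StringIO

# A hypothetical extract: one row per individual, no duplicate rows.
# The unique ID column is what lets you load scores back into the
# database once the model is built; the other columns are predictors.
extract = """id,lifetime_giving,class_year,married,has_home_phone,has_employer
1001,250.00,1987,1,1,1
1002,0.00,2004,0,1,1
1003,75.00,1995,1,0,0
"""

rows = list(csv.DictReader(StringIO(extract)))
ids = [row["id"] for row in rows]
print(ids)  # ['1001', '1002', '1003']
```

Note that yes/no facts like "do we have a home phone number?" are recorded as 1 or 0, which is exactly the form in which presence or absence of data becomes a predictor.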

Four things to note about your predictor data:

1. You probably don't have a phone number for everyone in your database. Probably EVERY database has tons of missing data. This is not a problem at all, because the very presence or absence of a phone number (or 'position' or 'marital status') will be predictive in and of itself.

2. For the data you do have, it's not terribly important that the data be valid. A phone number may be out of date, but the very fact that you have it at all may be predictive of giving.

3. If you are trying to predict giving potential, then your predictor data must NOT include data that exists solely because the prospect is a donor. This can be a rather subtle point.

For example, a lot of the business phone numbers in your database might be the result of a giving transaction. That's OK, because probably a lot of them aren't. An example of a predictor you should avoid would be "received a donor recognition award" or "attended the donor recognition gala". Clearly these are just stand-ins for "Giving", and using them as predictors isn't going to tell you anything new. (If you are predicting something other than "Giving", something with a yes/no outcome such as "Planned Giving potential", then giving-based predictors are not a problem. But that case is beyond the scope of this article.)

4. Predictors may be non-intuitive, so don't toss out a potential predictor without testing it, just because it doesn't make sense. The Economist article mentioned above notes that predictive analytics conducted by American Airlines found that the single most significant predictor that someone booking a seat would actually follow through with the flight is that they ordered a vegetarian meal! We're looking for relationships, not causes. (In other words, "correlation, not causation.") We are not concerned about whether or not event attendance "causes" giving, only that the two are associated with each other. Which they usually are.

With some data in hand, the next stage begins with pasting the data into a statistical software package. I use Data Desk (www.datadesk.com), but if you work at a university or large hospital you can probably access free stats software used by faculty or medical researchers. One common choice is called SPSS. There's also Minitab, and "R", which is open-source and free. Consult your IT department. (You could use Excel, but it would be cumbersome.)

A full description of how to construct a score using the software would depend on what you're using, and would be too long to describe here anyway. But essentially there are two ways to create a propensity-to-give model: the simple-score method, or the regression method. Each has its pros and cons, but the best introduction to predictive modeling is via the simple-score method. (For various reasons, I choose to use the regression method.) The simple-score method is clearly and succinctly described in a well-known book by Peter Wylie called "Data Mining for Fundraisers." You really don't need to understand anything about statistics other than what means (a.k.a. averages) and medians are, and Wylie explains those, too.

Briefly, the method involves discovering which predictors are strongly correlated with giving (positively or negatively), and creating a score by summing up all the predictors (+1 for a positive, and -1 for a negative). It can be that simple! After that, it can be as technical and rarefied as you desire. In my experience, making the leap from being interested in predictive modeling to actually doing it involves some combination of training, reading, conferences and conversations with others who are doing this work. As a journalism-school graduate with absolutely no stats background whatsoever, I encourage you to ignore any number-phobia you might have and learn to leverage the power latent in your database!
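The summing step of the simple-score method can be sketched in a few lines. This is a minimal illustration, not Wylie's exact procedure; the predictor names and correlation signs are assumptions you would establish beforehand by testing each predictor against giving in your stats software:

```python
# Sign of each predictor's correlation with lifetime giving, determined
# in advance: +1 for positively correlated, -1 for negatively correlated.
# Field names are hypothetical examples.
predictor_signs = {
    "has_home_phone": +1,      # having a phone number on file: positive
    "married": +1,             # marital status on file: positive
    "contact_restricted": -1,  # a "do not contact" flag: negative
}

def simple_score(record):
    """Sum +1/-1 over the predictors that are present (flagged 1) in a record."""
    return sum(sign for field, sign in predictor_signs.items() if record.get(field))

# One alum with a home phone and marital status on file, no contact restriction:
alum = {"has_home_phone": 1, "married": 1, "contact_restricted": 0}
print(simple_score(alum))  # 2
```

Scoring every row of your data file this way, and then sorting by score, yields the ranked segments described earlier.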

Kevin MacDonell is Prospect Research Coordinator at St. Francis Xavier University. He writes about predictive modeling for fundraising on his blog at cooldata.wordpress.com.

Kevin also serves on the APRA Canada Board of Directors as Treasurer.
