
Stages of data mining

In the first article, we discussed what data mining is and is not, why it is important and how it could affect our lives. In this article, we explore the stages necessary for setting up a knowledge discovery environment in which data mining algorithms can be applied effectively.

As with any serious endeavor, the key word here is plan, plan and plan. A wise man once said that any project manager worth their salt would run through a project twice: once in their head and once during execution. This is exactly the case for any knowledge discovery project. Don’t be mistaken; the knowledge discovery process need not be a long-winded commitment requiring an extended period of time. If sufficient preparation is done, the process can be relatively short and painless.
What then are the steps of a knowledge discovery process?

Most of the usual suspects are here, as depicted in the diagram below:

Data Load: We typically start with a collection of data from multiple sources that we want to store in one coherent location. Typical sources range from tables in the databases of various systems (if you are lucky) to POS (point of sale) records, transaction receipts, facsimiles, printouts or even a series of screen-scraping exercises. Data trapped in non-digital records would need to be transcribed, of course.
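
As a rough illustration of the load stage, the sketch below copies two small CSV extracts into a single in-memory SQLite database. The extracts, the table names and the load_csv helper are all hypothetical stand-ins; a real project would read from its own source systems.

```python
import csv
import io
import sqlite3

# Hypothetical extracts from two source systems (stand-ins for real files).
CRM_CSV = "user_name,address\nJohn,2 Toa Payoh Lorong 8\nBob,84 Jurong West\n"
POS_CSV = "transaction,locality\nXNC 839903,Braddell\nCXV 4908439,Jurong East\n"


def load_csv(conn, table, text):
    """Create `table` from the CSV header and copy every row into it."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


conn = sqlite3.connect(":memory:")  # one coherent location for all sources
users_loaded = load_csv(conn, "users", CRM_CSV)
transactions_loaded = load_csv(conn, "transactions", POS_CSV)
```

Everything is stored as text at this point; typing and validation are left to the later stages.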

Data Cleansing: By now we should have all the required data fields in digital form, residing in a single coherent location. The data fields are still undeniably raw and not yet in any form suitable for analysis but, believe me, we are already well on our way in the knowledge discovery process.
We next have to clean the data in our possession to ensure that each table contains data that is relatively free of errors, anomalies and misspellings. The key word here is “relatively”. We need to decide whether going the extra mile to squeeze out that extra percentage point of cleanliness is worth the effort. While it is always great to have as accurate a data set as possible, more often than not a small percentage of unclean data will not significantly affect the results of any analysis applied to the data set.

Over the years, many data cleansing techniques have been developed; an elaborate discussion of them could fill an entire article. In fact, there are companies that make a living solely by cleaning data for corporations. The more common techniques typically involve comparing each data item against some form of custom-built or industry-strength taxonomy.
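
To make the taxonomy idea concrete, here is a minimal sketch that compares each raw value against a small reference list using fuzzy string matching from Python's standard difflib module. The list of locality names, the cleanse helper and the 0.8 similarity cutoff are assumptions chosen for illustration, not an industry taxonomy.

```python
import difflib

# Hypothetical reference taxonomy of valid locality names.
LOCALITIES = ["Braddell", "Potong Pasir", "Jurong East", "Jurong West", "Toa Payoh"]


def cleanse(value, taxonomy=LOCALITIES, cutoff=0.8):
    """Return the closest taxonomy entry, or None if nothing is similar enough."""
    tidy = " ".join(value.split()).title()  # collapse whitespace, normalize case
    matches = difflib.get_close_matches(tidy, taxonomy, n=1, cutoff=cutoff)
    return matches[0] if matches else None


raw = ["braddell ", "Potong  Passir", "Toa Payo", "Orchard"]
cleaned = [cleanse(item) for item in raw]
```

Values that come back as None are the ones worth a human look; how far to chase them is exactly the “relatively clean” judgment call described above.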

Transformation: Now that the raw data is reasonably clean, we begin the next process of transforming it into a form that is suitable for data mining and analysis. The need for data transformation is best illustrated with an example. Consider the records found in the tables in figure 2. The fields as presented seem to be useless for analysis. What results can one realistically gather from one table showing details of people’s addresses and another showing the locality where each transaction was made? The situation here could be one of too much detail, which makes analysis impractical. On closer observation, those who are a little familiar with the geography of Singapore would notice that everyone in table 1, except for Bob, lives in the vicinity of each other. Similarly, the “Locality” field stores locations that appear to be in the vicinity of each other as well. To do effective analysis, such as figuring out whether there is a relationship between where a person stays and where he buys stuff, we might want to transform the tables into the form shown in figure 3, consistently encoding “Address” and “Locality” information in the Braddell-Toa Payoh vicinity as 01 and in the Jurong vicinity as 02. There is obviously a lot more to data transformation than transcribing addresses in a vicinity into numbers, but further elaboration can be left to a future article.
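
The encoding step described above might be sketched as a simple keyword lookup. Only the codes 01 and 02 come from the article; the keyword lists, the encode_vicinity helper and the fallback code 99 are assumptions made for illustration, and a real project would more likely use postal codes or a proper gazetteer.

```python
# Assumed keyword lists per vicinity; only the codes 01 and 02 come from the article.
VICINITY_KEYWORDS = {
    "01": ("toa payoh", "braddell", "potong pasir"),
    "02": ("jurong",),
}


def encode_vicinity(text, default="99"):
    """Map a free-text address or locality to a vicinity code."""
    lowered = text.lower()
    for code, keywords in VICINITY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return code
    return default  # hypothetical "unknown" code, not from the article


addresses = ["2 Toa Payoh Lorong 8", "23 Braddell Heights", "84 Jurong West"]
localities = ["Braddell", "Potong Pasir", "Jurong East", "Jurong West"]

address_codes = [encode_vicinity(a) for a in addresses]
locality_codes = [encode_vicinity(name) for name in localities]
```

Once both fields share the same small set of codes, a question such as “do people shop near where they live?” becomes a simple comparison of two columns.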

Records Optimization: Some consider this phase part of data transformation; others keep it separate. Whether the two are separate is purely academic, and we are not going to worry about it here.

The Optimization phase does the familiar conversion of raw data into formats that make the analysis process easier and more meaningful. The key difference between this phase and the previous one is that the number of records coming out of this phase is naturally smaller than the number we began with. We are again faced with a situation where the raw data holds so much detail that effective analysis becomes impractical. Think of it this way: if we have 5 million records of customers’ daily transactions at retail outlets, and it takes weeks of computing time to churn out useful models, can we optimize the data set by aggregating the transactions into weekly or even monthly totals? This can cut the required processing from weeks to days, which is often more palatable, at least for the initial investigative phase. Again, deciding which records to consolidate and what criteria to apply is more art than science. An iterative approach to selection is often adopted to suit the resource limitations at hand.
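
A minimal sketch of this kind of consolidation, assuming each transaction is a (customer, date, amount) tuple, rolls daily records up into one total per customer per ISO week. The sample data and the aggregate_weekly helper are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily transactions: (customer_id, purchase_date, amount).
daily = [
    ("0001", date(2010, 1, 4), 12.50),
    ("0001", date(2010, 1, 6), 7.50),
    ("0001", date(2010, 1, 12), 20.00),
    ("0002", date(2010, 1, 5), 5.00),
]


def aggregate_weekly(transactions):
    """Sum amounts per (customer, ISO year, ISO week)."""
    totals = defaultdict(float)
    for customer, day, amount in transactions:
        iso_year, iso_week, _ = day.isocalendar()
        totals[(customer, iso_year, iso_week)] += amount
    return dict(totals)


weekly = aggregate_weekly(daily)
```

Here four daily records collapse into three weekly ones; on millions of rows the same idea is usually pushed down into the database as a GROUP BY.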

Data Upload: Once the records have been scrubbed clean, transformed and selected, they are loaded into files or tables suitable for analysis.

This should be a relatively straightforward phase, where data sets from multiple tables are converted into the file formats in which data mining algorithms will discover hidden trends and rules. Many tools accept CSV (comma-separated values); others require the user to import the data into their own table formats.
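
For the CSV route, a sketch using Python's standard csv module could look like the following. The records mirror the encoded rows of figure 3; the column names and the to_csv helper are assumptions.

```python
import csv
import io

# Hypothetical prepared records, shaped like the encoded table in figure 3.
rows = [
    {"user_id": "0001", "address": "01", "locality": "01"},
    {"user_id": "0002", "address": "01", "locality": "01"},
]


def to_csv(records, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv(rows, ["user_id", "address", "locality"])
```

To write to disk instead, the StringIO buffer can be swapped for a file opened with newline="" so that the csv module controls the line endings.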

Data Discovery: This is probably the most exciting stage of the knowledge discovery process. This is the time to apply the various data mining algorithms to the data set, sit back and decipher the findings that the algorithms uncover for us. The difficulty is often: where does one begin? There are no rules of thumb here, but there are preferences. Many prefer to start with statistical analysis, doing the simple things like finding the mean, median, mode and standard deviation of the data elements, followed by a rough visualization of the data set across fields that have suspected relationships. You will get a lot of noise, of course, but there will be times when the selected data reveals possible trends that analysis algorithms can zero in on later.
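
The simple first-pass statistics mentioned above can be computed with Python's standard statistics module. The spend figures below are made up for illustration.

```python
import statistics

# Hypothetical weekly spend figures for one customer segment.
spend = [10, 20, 20, 30, 40]

summary = {
    "mean": statistics.mean(spend),
    "median": statistics.median(spend),
    "mode": statistics.mode(spend),
    "stdev": statistics.stdev(spend),  # sample standard deviation
}
```

A mean that sits well above the median, as it does here, is an early hint of a skewed field that may deserve a closer look.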

What typically happens next is a series of iterative steps in which one applies directed and undirected mining algorithms first to a large data set, then to a more selective data set for detailed analysis across a collection of variables, and then zooms back out to a larger data set if the detailed analysis fails to uncover anything interesting. Instead of going on and on, we’ll cover the details of this process in a structured and organized way across multiple future articles.

Visualization: Thought we had this in the previous stage? Well, yes and no. In the data discovery stage, visualization techniques are used to “get a feel” for the underlying characteristics of the data. In this phase, we concentrate on the presentation of information. We need to design the presentation of our findings from the discovery phase in a format that delivers the best possible impact. Why is this needed? Because, more often than not, the output churned out by the discovery phase can be rather boring or difficult for, say, a finance manager to understand. An experienced data analyst would “massage” the raw data and map it onto the appropriate visualization tools to clearly depict the results uncovered in the discovery phase. Many times, rudimentary charts such as pie charts and line graphs will do. For impressing business executives and clearly depicting inter-variable relationships, 3-D plots are naturally the favorites.

Figure 2

Table 1
     User Name   Address
1.   John        2 Toa Payoh Lorong 8
2.   James       23 Braddell Heights
3.   Mary        78 Daisy Ave
4.   Helen       93 Woskel Rd
5.   Ann         #19-192 Upper Serangoon Rd
6.   Joanna      #01-198 Woodsville Rd
7.   May         27 Jalan Lateh
8.   Bob         84 Jurong West

Table 2
     Transaction   Locality
1.   XNC 839903    Braddell
2.   XBH 2878900   Potong Pasir
3.   CXV 4908439   Jurong East
4.   XVG 943003    Jurong West

Figure 3

     User ID   Transaction Details   Address   Locality
1.   0001      -------               01        01
2.   0002      -------               01        01
3.   0003      -------               01        01

Aren’t these stages similar to those found in a typical data processing project? Well, one shouldn’t be too surprised. Knowledge discovery is, after all, from a macro perspective, the processing of data. The age-old adage “GIGO” (garbage in, garbage out) naturally holds true, and particularly so in knowledge discovery: erroneous raw data leads to flawed predictions and conclusions. And as with any data processing project, good planning is essential to an excellent outcome. Too many have fallen into the trap of analyzing data before it is sufficiently cleaned and transformed. A similar number have suffered from “analysis paralysis”, where far too many resources are spent on analysis that digs too deep into the data to be useful. What is enough and yet not too much? What is useful and significant and yet not too little? This fine line is, perhaps, what separates the art from the science of data mining.



© Copyright 2010 Edenred Co., Ltd. All rights reserved.
