

Genetic Data Mining Method for the Proper Use of the Correlation Coefficient
Bruce Ratner, Ph.D.

Assessing the relationship between a predictor variable and a target variable is an essential task in building a statistical linear model. If a relationship is identified and tractable, then one of the variables (sometimes both) is re-expressed to reflect the uncovered relationship, and the re-expressed variable is then tested for inclusion in the model. Most methods of variable assessment are based on the well-known correlation coefficient, which is often misused because its linearity assumption is not tested. The purpose of this article is to illustrate a genetic data mining method, the GenIQ Model©, that is perhaps the best “data-straightener” available today. I use the third set of x and y values from the well-known Anscombe data.


OUTLINE



I. Anscombe Data

ID     x       y

 1    10    7.46
 2     8    6.77
 3    13   12.74
 4     9    7.11
 5    11    7.81
 6    14    8.84
 7     6    6.08
 8     4    5.39
 9    12    8.15
10     7    6.42
11     5    5.73



II. GenIQ Model (Tree Display) 

[Figure: GenIQ Model tree display (GenIQTree_1)]

The GenIQ Model (Code)

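x1 = .6550772;                                    /* constant divisor */
          x2 = x; 
     If x1 NE 0 Then x1 = x2 / x1; Else x1 = 1;   /* protected division: x1 = x / .6550772 */
          x2 = x; 
               x3 = x; 
          x2 = x2 + x3;                            /* x2 = 2x */
          x2 = Cos(x2);                            /* x2 = cos(2x), x in radians */
     x1 = x1 + x2;                                 /* x1 = x / .6550772 + cos(2x) */
GenIQvar(y) = x1;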
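
The listing reduces to a single expression: GenIQvar(y) = x / 0.6550772 + cos(2x), with x in radians. As a minimal sketch, here is a step-by-step Python transcription of the listing. The function name geniq_score is an assumption made for illustration and is not part of the GenIQ software.

```python
import math

def geniq_score(x: float) -> float:
    """Step-by-step Python transcription of the GenIQ listing (hypothetical helper name)."""
    x1 = 0.6550772
    x2 = x
    # Protected division, as in the listing: fall back to 1 if the divisor is 0.
    x1 = x2 / x1 if x1 != 0 else 1.0
    x2 = x
    x3 = x
    x2 = x2 + x3        # 2x
    x2 = math.cos(x2)   # cos(2x), argument in radians
    x1 = x1 + x2        # x / 0.6550772 + cos(2x)
    return x1

# Score a few x values from the Anscombe data.
for x in (13, 14, 4):
    print(x, round(geniq_score(x), 4))
```

Scoring every x in the data this way should give the GenIQvar(y) column of the results table, up to rounding.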



III. GenIQ Model Results
The results of the GenIQ Model as a data-straightener are in Table 2, which is sorted in descending order of the GenIQ Model score, GenIQvar(y). The rank-order prediction is perfect: the y values also descend from the top of the table to the bottom.

Table 2. GenIQ Model Rank-order Prediction

ID     x       y    GenIQvar(y)

 3    13   12.74      20.4919
 6    14    8.84      20.4089
 9    12    8.15      18.7426
 5    11    7.81      15.7920
 1    10    7.46      15.6735
 4     9    7.11      14.3992
 2     8    6.77      11.2546
10     7    6.42      10.8225
 7     6    6.08      10.0031
11     5    5.73       6.7936
 8     4    5.39       5.9607
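
The rank-order claim can be checked directly by sorting the observations once by model score and once by y, then comparing the two ID sequences. The Python sketch below assumes the score formula read off the GenIQ code, x / 0.6550772 + cos(2x); the names score, ids_by_score and ids_by_y are illustrative only.

```python
import math

# (ID, x, y) for the third Anscombe data set, as listed in Section I.
data = [(1, 10, 7.46), (2, 8, 6.77), (3, 13, 12.74), (4, 9, 7.11),
        (5, 11, 7.81), (6, 14, 8.84), (7, 6, 6.08), (8, 4, 5.39),
        (9, 12, 8.15), (10, 7, 6.42), (11, 5, 5.73)]

def score(x):
    # Assumed closed form of the GenIQ listing: x / 0.6550772 + cos(2x).
    return x / 0.6550772 + math.cos(2 * x)

# Order the IDs by descending model score and by descending y.
ids_by_score = [r[0] for r in sorted(data, key=lambda r: score(r[1]), reverse=True)]
ids_by_y = [r[0] for r in sorted(data, key=lambda r: r[2], reverse=True)]

# The rank-order prediction is perfect when both orderings agree.
perfect_rank_order = ids_by_score == ids_by_y
print(ids_by_score)
print(perfect_rank_order)
```

The comparison is unambiguous here because neither the scores nor the y values contain ties.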



Perhaps the best way to illustrate the GenIQ Model as both a data-straightener and a data mining tool is with two plots: y versus x, and GenIQvar(y) versus x.

[Plot: y by x]

[Plot: GenIQvar(y) by x]



IV. Summary
Perhaps the GenIQ Model is an excellent data-straightener and data mining tool, all in one. What do you think?
Oh, two more numbers: the correlation coefficients between y and x, and between GenIQvar(y) and x, are 0.81629 and 0.9895, respectively.
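
Both coefficients can be recomputed from the data in Section I. This is a minimal sketch in Python using the textbook Pearson formula. It assumes the score formula read off the GenIQ code, x / 0.6550772 + cos(2x), and the helper name pearson is illustrative.

```python
import math

# Third Anscombe data set, in ID order.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

def pearson(a, b):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    s_aa = sum((p - mean_a) ** 2 for p in a)
    s_bb = sum((q - mean_b) ** 2 for q in b)
    return s_ab / math.sqrt(s_aa * s_bb)

# Assumed closed form of the GenIQ listing, evaluated at each x.
geniq = [v / 0.6550772 + math.cos(2 * v) for v in x]

r_yx = pearson(y, x)      # correlation between y and x
r_gx = pearson(geniq, x)  # correlation between GenIQvar(y) and x
print(round(r_yx, 5), round(r_gx, 4))
```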


For more information about this article, call Bruce Ratner at 516.791.3544 or 1 800 DM STAT-1; or e-mail at br@dmstat1.com.