Data defines the model by dint of genetic programming, producing the best decile table.
Genetic Data Mining Method for the Proper Use of the Correlation Coefficient
Bruce Ratner, Ph.D.
Assessing the relationship between a predictor variable and a target variable is an essential task in building a statistical linear model. If the relationship is identified and tractable, then one of the variables (sometimes both) is re-expressed to reflect the uncovered relationship, and the re-expressed variable is then tested for inclusion in the model. Most methods of variable assessment are based on the well-known correlation coefficient, which is often misused because its linearity assumption is not tested. The purpose of this article is to illustrate a genetic data mining method, the GenIQ Model©, that is perhaps the best “data-straightener” available today. I use the third set of x and y values from the well-known Anscombe data.
OUTLINE
I. Anscombe Data
ID    x     y
 1   10    7.46
 2    8    6.77
 3   13   12.74
 4    9    7.11
 5   11    7.81
 6   14    8.84
 7    6    6.08
 8    4    5.39
 9   12    8.15
10    7    6.42
11    5    5.73
II. GenIQ Model (Tree Display)
The GenIQ Model (Code)
x1 = .6550772;
x2 = x;
If x1 NE 0 Then x1 = x2 / x1; Else x1 = 1;
x2 = x;
x3 = x;
x2 = x2 + x3;
x2 = Cos(x2);
x1 = x1 + x2;
GenIQvar(y) = x1;
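As a cross-check, the GenIQ code above can be transcribed step by step into Python. This is only a sketch: the function name `geniq_var` is an assumption introduced for illustration, and the cosine is assumed to take its argument in radians. Algebraically, the listing reduces to x / 0.6550772 + cos(2x).

```python
import math

def geniq_var(x):
    """Step-by-step transcription of the GenIQ listing (hypothetical name)."""
    x1 = 0.6550772
    x2 = x
    # Protected division, kept only to mirror the original listing
    if x1 != 0:
        x1 = x2 / x1
    else:
        x1 = 1
    x2 = x
    x3 = x
    x2 = x2 + x3          # 2x
    x2 = math.cos(x2)     # cos(2x), argument in radians (assumed)
    x1 = x1 + x2          # x / 0.6550772 + cos(2x)
    return x1

# Score the eleven x values of the Anscombe data used in this article
for x in (10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5):
    print(x, round(geniq_var(x), 4))
```

The printed scores can be compared against the GenIQvar(y) column reported in the results.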
III. GenIQ Model Results
The results of the GenIQ Model as a data-straightener are in Table 2. The table is ordered by descending GenIQ Model score, GenIQvar(y), and the y values fall in exactly the same descending order: a perfect rank-order prediction.
Table 2. GenIQ Model Rank-order Prediction
ID    x     y      GenIQvar(y)
 3   13   12.74    20.4919
 6   14    8.84    20.4089
 9   12    8.15    18.7426
 5   11    7.81    15.7920
 1   10    7.46    15.6735
 4    9    7.11    14.3992
 2    8    6.77    11.2546
10    7    6.42    10.8225
 7    6    6.08    10.0031
11    5    5.73     6.7936
 8    4    5.39     5.9607
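The rank-order claim for Table 2 can be checked mechanically. The sketch below sorts the rows by descending score and tests whether y never increases down the table. The names `rows`, `ranked`, and `is_perfect_rank_order` are assumptions made for this illustration, and the values are copied from Table 2, listed here in ID order.

```python
# (ID, x, y, GenIQvar(y)) copied from Table 2, listed in ID order
rows = [
    (1, 10, 7.46, 15.6735),
    (2, 8, 6.77, 11.2546),
    (3, 13, 12.74, 20.4919),
    (4, 9, 7.11, 14.3992),
    (5, 11, 7.81, 15.7920),
    (6, 14, 8.84, 20.4089),
    (7, 6, 6.08, 10.0031),
    (8, 4, 5.39, 5.9607),
    (9, 12, 8.15, 18.7426),
    (10, 7, 6.42, 10.8225),
    (11, 5, 5.73, 6.7936),
]

# Order the table by descending GenIQ score
ranked = sorted(rows, key=lambda r: r[3], reverse=True)

# Rank-order is perfect if y never increases as the score decreases
ys = [r[2] for r in ranked]
is_perfect_rank_order = all(a >= b for a, b in zip(ys, ys[1:]))

print([r[0] for r in ranked])
print(is_perfect_rank_order)
```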
Perhaps the best way to illustrate the GenIQ Model as both a data-straightener and a data mining tool is with the plots below: Plot y*x and Plot GenIQvar*x.
IV. Summary
Perhaps the GenIQ Model is an excellent data-straightener and data mining tool, all in one. What do you think? Oh, two things: the correlation coefficients between y and x, and between GenIQvar(y) and x, are 0.81629 and 0.9895, respectively.
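The two correlation coefficients quoted in the summary can be recomputed from the tabled values. The sketch below uses the textbook Pearson formula. The helper name `pearson` and the variable names are assumptions for illustration, and the GenIQvar(y) scores are the rounded values from Table 2, so the last digits may differ slightly from those of a full-precision calculation.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / math.sqrt(s_aa * s_bb)

# Values in ID order (1..11), taken from the Anscombe data table and Table 2
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
geniq = [15.6735, 11.2546, 20.4919, 14.3992, 15.7920, 20.4089,
         10.0031, 5.9607, 18.7426, 10.8225, 6.7936]

r_yx = pearson(x, y)        # correlation between y and x
r_gx = pearson(x, geniq)    # correlation between GenIQvar(y) and x

print(round(r_yx, 5), round(r_gx, 4))
```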
For more information about this article, call Bruce Ratner at 516.791.3544 or 1 800 DM STAT-1, or e-mail br@dmstat1.com.