General CHAID Introductory Overview

CHAID is a prominent acronym in research; it stands for Chi-squared Automatic Interaction Detector. It is one of the oldest tree classification methods, proposed by Kass in 1980. CHAID builds non-binary trees, that is, trees in which more than two branches can attach to a single root or node. It relies on a single algorithm and is well suited to the analysis of larger data sets. The CHAID algorithm yields many multiway frequency tables. CHAID has been popular in marketing research, particularly in segmentation studies. Both CHAID and C&RT construct trees in which each nonterminal node identifies a split condition chosen to yield an optimum prediction or classification. For this reason, both types of algorithm can be applied to regression-type problems as well as classification-type problems.

For classification problems, where the dependent variable is categorical, CHAID relies on the Chi-square test to determine the best next split at each step. For regression-type problems, where the dependent variable is continuous, the program computes F tests. The algorithm proceeds through the following steps:
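
As a minimal sketch of the classification case, the Chi-square test behind a candidate split can be computed from a table of counts with one row per predictor category and one column per class of the dependent variable. The function names chi2_sf and chi_square_test and the example counts are illustrative assumptions, not part of any particular CHAID program. The p-value uses a closed-form series so that only the Python standard library is needed. The F test used for regression-type problems is not shown.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over k < df/2 of (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2) / k
            total += term
        return math.exp(-x / 2) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite correction series
    root = math.sqrt(x)
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    tail = math.erfc(root / math.sqrt(2))
    return tail + math.sqrt(2 / math.pi) * math.exp(-x / 2) * total


def chi_square_test(table):
    """Pearson chi-square test of independence.

    table is a list of rows of counts; every row and column total is
    assumed to be greater than zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = sum(
        (obs - r * c / n) ** 2 / (r * c / n)
        for row, r in zip(table, row_totals)
        for obs, c in zip(row, col_totals)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, chi2_sf(stat, df)


# Hypothetical counts: three predictor categories (rows) by two classes (columns).
example_table = [[30, 10], [20, 20], [10, 30]]
stat, df, p_value = chi_square_test(example_table)
print(stat, df, p_value)
```

A small p-value suggests that the class distribution differs across the predictor's categories, which is what makes the predictor a candidate for the next split. In practice a statistics library would normally supply this test.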

  • Preparation of Predictors: The first step is to create categorical predictors out of any continuous predictors by dividing their distributions into a number of categories with approximately equal numbers of observations. For categorical predictors, the categories are already naturally defined.
  • Merging of Categories: The next step is to cycle through the predictors and determine, for each predictor, the pair of categories that is least significantly different with respect to the dependent variable. For classification problems, where the dependent variable is categorical, this is computed with a Chi-square test; for regression problems, where the dependent variable is continuous, it is computed with an F test.
  • Selection of the Split Variable: The next step is to choose as the split variable the predictor with the smallest adjusted p-value. If the smallest adjusted p-value for any predictor is greater than some alpha-to-split value, no further splits are performed and the respective node becomes a terminal node.
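
The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than a definitive implementation: the dependent variable is assumed to be binary and coded 0/1, the predictor is treated as nominal so that any two categories may be merged, the alpha_merge threshold is a hypothetical stopping rule for merging, and the adjusted p-values in the last step are taken as given because the adjustment itself is not described here. All function and parameter names are made up for the example.

```python
import math
from itertools import combinations


def equal_frequency_bins(values, n_bins):
    """Step 1: assign each continuous value a bin index with roughly equal counts.

    Simplification: tied values may land in different bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def pair_p_value(counts_a, counts_b):
    """Chi-square p-value (1 degree of freedom) for two categories x binary target."""
    table = [counts_a, counts_b]
    total = sum(map(sum, table))
    col = [table[0][j] + table[1][j] for j in range(2)]
    stat = 0.0
    for row in table:
        for j in range(2):
            expected = sum(row) * col[j] / total
            if expected > 0:
                stat += (row[j] - expected) ** 2 / expected
    return math.erfc(math.sqrt(stat / 2))


def merge_categories(categories, target, alpha_merge=0.05):
    """Step 2: repeatedly merge the least significantly different pair.

    Returns a dict mapping each merged group (a tuple of categories) to its
    [count of target 0, count of target 1]. alpha_merge is an assumed threshold.
    """
    counts = {}
    for c, y in zip(categories, target):
        counts.setdefault((c,), [0, 0])[y] += 1
    while len(counts) > 1:
        (a, b), p = max(
            ((pair, pair_p_value(counts[pair[0]], counts[pair[1]]))
             for pair in combinations(counts, 2)),
            key=lambda item: item[1],
        )
        if p <= alpha_merge:  # every remaining pair differs significantly
            break
        counts[a + b] = [u + v for u, v in zip(counts.pop(a), counts.pop(b))]
    return counts


def select_split(adjusted_p_values, alpha_split=0.05):
    """Step 3: predictor with the smallest adjusted p-value, or None for a terminal node."""
    name = min(adjusted_p_values, key=adjusted_p_values.get)
    return name if adjusted_p_values[name] <= alpha_split else None


# Hypothetical data: categories "a" and "b" behave alike, "c" does not.
cats = ["a"] * 20 + ["b"] * 20 + ["c"] * 20
ys = [0] * 10 + [1] * 10 + [0] * 11 + [1] * 9 + [0] * 2 + [1] * 18
print(equal_frequency_bins([5, 1, 4, 2, 3, 6], 3))
print(merge_categories(cats, ys))
print(select_split({"age": 0.03, "income": 0.20}))
```

In this toy data the categories "a" and "b" are merged because their class distributions are nearly identical, while "c" stays separate. Returning None from select_split corresponds to declaring the node terminal.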