Abstract
Over the last several years, many companies have adopted some form of automated credit evaluation tool (credit scoring) to evaluate credit risk and improve credit department productivity. Previously, in the March 2003 issue of "Business Credit", we discussed "Judgment or Rules Based Systems ", "Behavior Scoring (Statistical-based Systems) " and systems based on "Neural Networks. " In this article, we want to introduce a new technology, "Genetic Algorithmbased Systems " that has certain characteristics that may make it superior to all of the current automated methods used for solving certain types of risk analysis problems.
In head-to-head tests against standard predictive models developed using techniques like linear and logistic regression, Genetic algorithms (GAs) were found to:
* Deliver anywhere from a 5% to 20% improvement in predicting key operating and financial success factors like probability of payment or collection.
* Provide better insights because they use 100% of the available data rather than an analyst selected subset.
* Take anywhere from 50% to 75% less time to develop with potential commensurate savings in system development and implementation costs.
* Update on a daily basis to reflect the most recent customer behavior (i.e., all customers' terms and limits can be updated daily, and the probability of a given customer's payment for a new order, is based on the most current data).
* Offer clear and understandable results, (i.e. the user does not have to be a statistician to understand the models output).
What are Genetic Algorithms?
The application of genetic algorithms to predictive modeling is based on Darwin's principle of "survival of the fittest." A simple analogy may help to explain the basic underlying technology. Consider the common process used to produce a champion racehorse. Champions are almost always the product of selective breeding. The current Kentucky Derby winner "Smarty Jones" will earn more money for his owners in stud fees than by winning races. Other horse breeders will pay significant money to have their mares produce a colt fathered by "Smarty Jones." Their logic is that likelihood of producing a future winner is significantly increased by breeding their mares with a known winner. In GA terms, the probability of winning a race is a known as a fitness measure and winning the Kentucky Derby is about the highest fitness measure for a horse. If the mare is also a race winner then they have bred two horses with high fitness measures, thereby increasing the chance that the resulting colt will also have a high fitness measure (probability of winning). Unfortunately, a drawback in breeding horses is that from conception to birth takes about 340 days so you don't get a lot of output to review during a breeder's career.
In a loose parallel to selective breeding, a GA breeds an initial "generation" of random predictive models. You can think of a model as a row containing a set of variables, each in a different column. Each model is tested for fitness against user-defined criteria. The process starts with a random selection of variables since at the onset we do not necessarily know the characteristics of a Derby winner versus a plow horse. The initial selection of models that includes all of the available data. After an extensive "trial and error" process that may breed millions of models, over tens of thousands of generations, the GA will come up with a Derby winner (king model) that contains the characteristics (variables) necessary to predict the specific outcome.
As you will see below, better models (those with higher fitness measures) are more likely to be selected for breeding and will tend to survive while weaker models (plow pullers) will die out. By applying the basic genetic principles of cloning, mating and mutation, the GA increases the diversity of models. This search of possible outcomes is repeated for thousands or tens of thousands of generations with the better models surviving. Ultimately, the better "evolved" model "wins" and represents the best solution to the business analytic problem.
Breeding for Success
As noted above, GAs mimic standard biological processes. To illustrate this process envision a series of genes, or gene group, representing everything about an independent (predictive) variable. The types of variables that can be included are continuous (numbers from -oo (infinity) to + ∞), categorical (male, female), date and system derived interaction. For example, if SIC Code was contained in the model as a categorical variable, the gene group SIC Code could contain a gene for every possible code.
Genes are combined to form a chromosome. Chromosomes include gene groups for all variables on the input dataset, along with specific genes that represent interactions between variables. Each chromosome represents a model equation or potential solution to the business problem.
Figure 1 illustrates the relationship between gene groups (variables) and a model chromosome:
A chromosome is made up of gene groups for all variables and interactions.
For example, in Figure 1 above, the variables (Varn) could have the following values (Note: Each variable could contain any possible value for that data element found in the population):
Var 1 = SIC Code
Var 2 = DSO
Var 3 = DSO Trend
Var 4 = DSO Variance
Var 5 = % Accounts 0-30 days
Var 6 = % Accounts 0-30 days Trend
Var 7 = % Accounts 0-30 days Variance, through
Var^sup n^ = Any possible data element that might be related to the decision process the model is being developed for.
In addition to gene groups for each variable, the GA automatically appends genes for interaction variables based upon chromosome configuration parameters. These new variables help the GA discover subtle interactions that greatly enhance the overall model performance. Interaction variables (Int^sup n^) can be any combination of the variables (For example: Int 1 = (Var 5)/(Var8))
While the chromosome contains genes for all available variables, the GA will adjust the variables selected as a means of ensuring stability and avoiding the potential for overfitting of solutions. Figure 2 illustrates the "model" chromosome, identifying those variables that are part of a model equation.
In other words, the GA for Model 1 would use V3, V6, VlO and V13; Model 2 would be Vl, V2, V8 and V12; Model 3 would be V3, V6, V9 and V13 and Model n would be V2, V5, V8 and V12.
How the Evolutionary Process Works
Once the basic structure of the chromosomes is complete, the GA will create an initial population of models. The default initial population will contain 100 chromosomes. Once this initial population is created, each model is tested for fitness based upon user selected fitness metrics. The "most fit" models are more likely to survive and be selected for breeding. The probability of mating is based upon a model's fitness. Imagine a roulette wheel with each model assigned a portion based upon a model's proportion of total fitness. Figure 3 illustrates a model fitness table and corresponding "roulette wheel":
Note that Model 5 has the highest fitness and therefore has the largest portion of the "roulette wheel". As the GA selects models for mating, those with the higher fitness are more likely to be selected.
Following fitness evaluation, the GA will apply the basic principles of genetics to breed the next generation of models. These consist of cloning, mating, mutating and the introduction of random models.
Cloning
Cloning is the process of copying a chromosome from one generation to the next. With the default option, the GA always clones the best model, the "king" model to the next generation.
Mating
For mating, the GA combines pairs of models through the genetic process of "crossover". Portions of the first chromosome are combined with sections of the second chromosome to create a new pair of chromosomes for the next generation. The GA automatically adjusts the rate of crossover based upon learning, or lack thereof.
For example, the GA might combine Models 1 and 2 from Figure 2. Two new models are formed, where Model l(new) contains V3 and V6 from the original Model 1 and V8 and V12 from the original Model 2. Likewise, Model 2 (new) would contain VlO and Vl 3 form the original Model 1 and Vl and V2 from the original Model 2.
Mutation
Following crossover, each of the new chromosomes can be subjected to mutations. During mutation, each bit has a small chance of flipping from a 1 to a O, or a O to a 1. As with crossover, the GA controls the mutation rate and will adjust it as necessary.
Random Models
For each generation, a random set of models is introduced that may be selected for crossover during the subsequent generation process. Introduction of random models helps enhance the overall model fitness and helps avoid a "local optimum". The genetic algorithm controls the introductory rate of random models and will adjust this rate to help the learning process.
Putting it all Together
Using the basic genetic processes, the GA breeds new generations of models. Figure 4 illustrates this process. For each generation, the GA evaluates each model against the user-defined fitness metric. The better models are more likely to be selected for breeding. The end result: a "king" model which represents the best-evolved scoring model (in the form of an equation for use in production scoring).
After the scoring model has run for an adequate number of generations (typically around 5,000 to 10,000), the credit manager has a range of basic analytic tools at his or her disposal. Standard evaluation charts can help show how well the model performed against existing models.
In Summary
Companies that have yet to adopt an automated credit evaluation system should be sure to add genetic algorithms to the technologies they review. And companies that are already using a form of automated credit risk evaluation would do well to determine if a genetic algorithm based system would be an improvement over their existing methodology.
In many cases, predictive modeling software based on genetic algorithms outperforms linear or logistic regressions by 5% to 20% percent when predicting key performance metrics like probability of payment or probability of collection. In addition, GAs can explore a broader range of outcomes and as a result can identify patterns undetectable by other modeling techniques. GAs also consider 100% of the available data and create new virtual variables that are transformation or combinations of existing variables that expand data exploration. And finally, GAs have the ability to update models overnight, thereby reducing the rate of model decay and guaranteeing that all model decisions will be based on the latest available information.
Douglas Newell is Founder ana President of Genafytics, a software company and applications provider that specializes in the application of Genetic Algorithms to variousdecision support requirements, in particular in marketing and credit areas where it is necessary to predict the future behavior of customers. He can be reached at 978-465-6373 or via e-mail at dnewell@senalvtics. com.
Albert Fensterstock is Managing Director of Albert Fensterstock Associates, a consulting company that specializes in the application of the Internet and decision support technology to the credit function. Previously, Mr. Fensterstock helped found CreditRiskMonitor.com, Inc., an Internet-based financial information, analysis and news service created specifically for the corporate credit function. He can be reached at 516-873-6900 or via e-mail at albie@jafassociates. ors.
Copyright Credit Research Foundation Second Quarter 2004
Provided by ProQuest Information and Learning Company. All rights Reserved