# Data mining for patterns

Most technical papers on using artificial intelligence for predicting the markets focus on the approaches we’ve covered these past few months, such as support vector machines and hidden Markov models. Here we will explore methods that can create rules from data. This offers traders an interesting path, depending, of course, on whether the rules are judged to be trusted or are merely based on statistical artifacts of the data.

This type of learning method is designed to work on categorical and symbolic data. There isn’t a good way to apply it to numerical data. We can solve this issue by developing a stock market data mining approach that:

• Transforms the numeric stock data to symbolic sequences,
• Carries out sequential and non-sequential association analysis, and
• Uses the mined rules in classifying/predicting the further price movements.

We will discuss two different types of analysis. The first looks for patterns between within a given time series and the other looks for relationships between different data series.

### Data transformation

One way to change numerical price series into symbolic data is to create a variable that records “UP” or “DOWN” data based on if the day is an up day or a down day. We then can look at sequences of these symbols to find patterns that are predictive.

The numeric-to-symbolic conversion transforms market features (such as open, high, low and closing prices) into a string of symbols:

UP when (close – open) /
close > Threshold
DN when (open – close) /
close > Threshold
Level when (close – open) /
close = Threshold

Such a conversion process can be extended to granulate the numerical data into different time frames and it provides a large collection of symbol strings, hopefully at various time granularities, that can then be used for different applications.

We can create different window sizes in this database and apply associative learning methods to these patterns. If we have daily data, we can look at a pattern over five days and try to predict the next day’s value.

How many data points we use in the pattern determines the window size. These learning methods can search the database to find exploitable patterns and return rules based on them. This analysis can be done for one market or for patterns between multiple markets. We also can analyze multiple features of the same market, such as market patterns and overbought/oversold conditions.

### Associative learning

For associative learning, we use the highest statistical confidence. This is a combination of the number of supporting cases, which is inverse to the window size and accuracy of the pattern.

For the window size, we want to match the mined rule with the longest antecedent with the testing data sequence. Generally speaking, we support rules with smaller, rather than larger, window sizes. This is because lower accuracy is made up for by having more cases, which increases confidence. Longer matching sequences have higher accuracy but fewer supporting cases, resulting in lower confidence levels.

Majority vote is the most common approach for this type of analysis. Simply put, we will use  patterns of different window lengths, add up their results and select based on a threshold of these combined results.

The modified Apriori algorithm is a classic algorithm used in data mining for learning association rules. Suppose you have records of large numbers of transactions at a shopping center as follows:

Transaction     Items bought
T1                     Item 1, item 2, item 3
T2                     Item 1, item 2
T3                     Item 2, item 5
T4                     Item 1, item 2, item 5

In this case, by learning associative rules, we are finding the items that are purchased together more frequently than others. For example, the above data show that Item 1 and item 2 are bought together frequently. Online shopping sites use this type of data to recommend related items. The code for the Apriori algorithm is in the public domain and available online, but it’s easier to understand if we bypass the equations for now. Below, the letters represent different items:

Transaction    Items bought
T1                    T, U, R, K, E, Y
T2                    D, U, R, K, E, Y
T3                    T, A, K, E
T4                    T, U, C, K, Y
T5                    C, U, U, K, I, E

Our first threshold is when an item is said to be “frequently bought” if it is purchased at least 60% of the time. So, for here it should be bought at least three times (see “Seven steps to find frequently bought,” below).

Clearly, for data as simple as this example, we could just eyeball the original and come up with the same answer relatively quickly enough. However, for large sets of data, you’ll need to follow this process (or, more likely, program it).

Note that the Apriori algorithm does not consider the self-joining of an item itself when forming a candidate item set. However, in time series data mining, consecutive appearance of the same symbol may happen, such as three straight “up days” leading to a bearish day.

### Decision trees

In decision tree learning, an iterative dichotomiser 3 (ID3), an algorithm invented by Ross Quinlan, is used to generate a decision tree from a dataset. ID3 cannot handle continuous data and break it into discrete numbers automatically. It also has issues with noisy data.

C4.5 is an algorithm developed by Quinlan that generates decision trees that can be used for classification problems.
It extends the ID3 algorithm by dealing with both continuous and discrete attributes, missing values and pruning trees after construction. (Its commercial successor is C5.0/See5, which is a lot faster than C4.5, is more memory efficient and is used for building smaller decision trees.)

The first of C4.5’s improvements to ID3 is that C4.5 can handle both discrete and continuous attributes. C4.5 creates a threshold and then splits the list into those whose attribute values are above the threshold and those that are less than or equal to it. It also can handle missing data and attributes with differing costs. In addition, we also can prune trees after creation; that is, C4.5 goes back through the tree once it has been created and attempts to remove unnecessary branches.

Being a supervised learning algorithm, it requires a set of training examples. Each example can be seen as a pair: an input object and a desired output value (class). The algorithm analyzes the training set and builds a classifier that must be able to correctly classify both training and test examples. This classifier is based on a decision tree built to be the simplest tree possible. A test example is an input object and the algorithm must predict an output value.

“Choosing a path” (below) is an example of a decision tree built using C4.5. Although much of the open source code is beyond the scope of what we can detail here, the steps for creating decision trees are relatively straightforward. The input for the algorithm consists of a set of examples described by continuous or discrete attributes, each example belonging to one class. The output is a decision tree or a set of rules that assigns a class to a new case.

The algorithm does the following:

1. Checks for the base case.
2. Finds the attribute with the highest informational gain (“A_best”)
3. Partitions S into S1, S2, S3…, according to the values of A_best
4. Repeats the steps for S1, S2, S3.

The base cases are the following:

• All the examples from the training set belong to the same class (a tree leaf labeled with that class is returned).
• The training set is empty (returns a tree leaf called failure).
• The attribute list is empty (returns a leaf labeled with the most frequent class or the disjunction of all the classes).

As mentioned, the C4.5 algorithm improves the ID3 algorithm by allowing numerical attributes, permitting missing values and performing tree pruning. To deal with attributes that have a lot of numerical values, a binary split is performed. The values of the attribute are sorted. Then the gain is computed for every split point of an attribute. It is chosen the best split point (h) and the A attribute is split like this: A ≤ h, A > h.

An optimal binary split is done when it is evaluated on the gain only between points of different classes. For example, we don’t have to evaluate the gain between 75 and 80 because they are assigned to the same class. In addition, we should not repeat the order of the values at each node because the order for a node’s children can be determined from the order of the values of the parent.

Missing values are usually marked with a question mark (“?”). Dealing with missing values involves imputation. Imputation means that if an important feature is missing, it can be estimated from the available data.

Distribution-based imputation is done when we split the example into multiple instances, each with a different value for the missing feature. It is assigned a weight corresponding to the estimated probability for the particular missing value and the weights sum up to 1.

Tree pruning is done to obtain smaller trees and avoid over-fitting the model to the data. Pruning can be done in two ways: Pre-pruning and post-pruning. With pre-pruning, we stop building the tree when the attributes are irrelevant. It is harder to perform, but faster. When post-pruning, we build the entire tree and remove certain branches. We can do this by either using sub-tree raising or sub-tree replacement. The C4.5 algorithm performs sub-tree replacement. This means that a sub-tree is replaced by a leaf if it reduces the classification error.

C4.5 can build models that are easy to interpret and implement, but it is not perfect. One problem is that a small variation in data can lead to different decision trees. Another issue is that this method works better on large data sets. C4.5 is the most used decision tree algorithm and works well on real-world problems. It also allows for simpler rules and more intuitive interpretation because of tree pruning process.

### Rough sets

One of the most powerful technologies for machine learning is rough sets technology. This is the approach that we will use most in future articles on rule induction.

The rough sets theory was proposed by Polish scientist Zdzisław I. Pawlak in 1982. In the rough sets theory, humans use their general knowledge to classify the world around them as abstract or concrete. Everything is classified according to its characteristics, and those with nearly identical characteristics may be put into the same group. This is called indiscernible relation and is the basis of rough sets theory.

One of the main advantages of rough sets theory is that it does not need any preliminary or additional information about data. The main problems that can be approached using rough sets theory include data reduction, discovery of data dependencies, estimation of data significance, generation of decision algorithms from data, approximate classification of data, discovery of patterns in data and discovery of cause-effect relationships.

The rough sets approach to data analysis hinges on two basic concepts, namely the lower and the upper approximations of a set, referring to the elements that doubtlessly belong to the set, and the elements that possibly belong to the set.

The concepts of core and reduct are two very important concepts of the rough sets theory. If the set of attributes is dependent, one can be interested in finding all possible minimal subsets of attributes. These lead to the same number of elementary sets as the whole set of attributes (reducts) in finding the set of all indispensable attributes (core).

The information system has the job of interpreting the rough sets. Simplification of the information system can be used to recognize some values of attributes that are not necessary for the system. For example, some attributes that are redundant can be deleted or be filtered by means of the simplification procedures. If after deleting an attribute the number of elementary sets in the information system is the same, then that attribute is considered dispensable.

Hence, the simplification can contain the minimal subsets of independent attributes that represent the whole set.
The core is the necessary element for representing knowledge or rules and is the common part of all reducts.

### Generating rules

Once we have set up our rough sets we need to use them to generate rules. We will be using the LEMS algorithm to generate rules. LEM2 (Learning by Example Module, Version 2) is a machine learning algorithm based on rough set theory. LEM2 learns a discriminant rule set (that is, it learns the smallest set of minimal rules describing a concept). This algorithm can generate both certain and possible rules from a decision table with attributes being numerical as well categorical.

For inconsistent data, LEM2 induces two sets of rules: The certain rule set and the possible rule set. The first set is computed from lower approximations of concepts and the second one from upper approximations. It is assumed that the rule set is used automatically by a classification component. Nevertheless, induced rules are available and comprehensible by the user. Thus, it is possible to use rules manually, as in other systems.