An alternating decision tree (ADTree) is a machine learning method for classification. It generalizes decision trees and has connections to boosting.
An ADTree consists of an alternation of decision nodes, which specify a predicate condition, and prediction nodes, which contain a single number. An instance is classified by an ADTree by following all paths for which all decision nodes are true, and summing any prediction nodes that are traversed.
History
ADTrees were introduced by Yoav Freund and Llew Mason.[1] However, the algorithm as presented had several typographical errors. Clarifications and optimizations were later presented by Bernhard Pfahringer, Geoffrey Holmes and Richard Kirkby.[2] Implementations are available in Weka and JBoost.
Motivation
Original boosting algorithms typically used either decision stumps or decision trees as weak hypotheses. As an example, boosting decision stumps creates a set of T weighted decision stumps (where T is the number of boosting iterations), which then vote on the final classification according to their weights. Individual decision stumps are weighted according to their ability to classify the data.
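As a sketch of this weighted voting (the stumps, thresholds, and weights below are illustrative values, not taken from any paper or dataset):

```python
# A decision stump predicts +1 or -1 by thresholding a single attribute.
def stump(attr_index, threshold):
    return lambda x: 1 if x[attr_index] < threshold else -1

# Hypothetical ensemble of three stumps with made-up accuracy weights.
stumps = [stump(0, 0.5), stump(1, 2.0), stump(0, 1.5)]
alphas = [0.9, 0.4, 0.7]

def classify(x):
    # Weighted vote: sign of the weighted sum of stump predictions.
    vote = sum(a * h(x) for a, h in zip(alphas, stumps))
    return 1 if vote >= 0 else -1

# Stump votes for [0.2, 3.0]: +1, -1, +1 -> 0.9 - 0.4 + 0.7 = 1.2 -> class +1
print(classify([0.2, 3.0]))  # 1
```

Note that the ensemble is flat: nothing records how one stump relates to another, which is exactly the lack of structure that ADTrees address.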
Boosting a simple learner results in an unstructured set of hypotheses, making it difficult to infer correlations between attributes. Alternating decision trees introduce structure to the set of hypotheses by requiring that they build off a hypothesis that was produced in an earlier iteration. The resulting set of hypotheses can be visualized in a tree based on the relationship between a hypothesis and its "parent."
Another important feature of boosted algorithms is that the data is given a different distribution at each iteration. Instances that are misclassified are given a larger weight while accurately classified instances are given reduced weight.
Alternating decision tree structure
An alternating decision tree consists of decision nodes and prediction nodes. Decision nodes specify a predicate condition. Prediction nodes contain a single number. ADTrees always have prediction nodes as both root and leaves. An instance is classified by an ADTree by following all paths for which all decision nodes are true and summing any prediction nodes that are traversed. This is different from binary classification trees such as CART (Classification and regression tree) or C4.5 in which an instance follows only one path through the tree.
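The multi-path evaluation described above can be sketched in a few lines of Python. The node classes, predicates, and scores here are hypothetical illustrations, not any library's API:

```python
# Minimal ADTree sketch: a prediction node carries a score and a list of
# decision nodes; a decision node holds a predicate and two child
# prediction nodes (one for predicate true, one for false).

class PredictionNode:
    def __init__(self, score, deciders=()):
        self.score = score
        self.deciders = list(deciders)

class DecisionNode:
    def __init__(self, predicate, if_true, if_false):
        self.predicate = predicate   # function: instance -> bool
        self.if_true = if_true       # PredictionNode
        self.if_false = if_false     # PredictionNode

def score(node, x):
    # Add this prediction node's value, then follow EVERY decision node
    # beneath it (unlike CART or C4.5, which follow a single path).
    total = node.score
    for d in node.deciders:
        child = d.if_true if d.predicate(x) else d.if_false
        total += score(child, x)
    return total

# Tiny illustrative tree: root score 0.5, one decision node on attribute 0.
root = PredictionNode(0.5, [DecisionNode(lambda x: x[0] < 1.0,
                                         PredictionNode(0.3),
                                         PredictionNode(-0.7))])
print(round(score(root, [0.2]), 3))  # 0.8  (0.5 + 0.3, sign -> class +1)
```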
Example
The following tree was constructed using JBoost on the spambase dataset[3] (available from the UCI Machine Learning Repository).[4] In this example, spam is coded as 1 and regular email is coded as −1.
![An ADTree for 6 iterations on the Spambase dataset.](http://upload.wikimedia.org/wikipedia/commons/thumb/7/7d/Spambase_adtree_example.png/800px-Spambase_adtree_example.png)
The following table contains part of the information for a single instance.
Feature | Value |
---|---|
char_freq_bang | 0.08 |
word_freq_hp | 0.4 |
capital_run_length_longest | 4 |
char_freq_dollar | 0 |
word_freq_remove | 0.9 |
word_freq_george | 0 |
Other features | ... |
The instance is scored by summing all of the prediction nodes through which it passes. For the instance above, the score is computed as shown in the following table.
Iteration | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|
Instance values | — | .08 < .052 = f | .4 < .195 = f | 0 < .01 = t | 0 < 0.005 = t | — | .9 < .225 = f |
Prediction | -0.093 | 0.74 | -1.446 | -0.38 | 0.176 | 0 | 1.66 |
The final score of 0.657 is positive, so the instance is classified as spam. The magnitude of the value is a measure of confidence in the prediction. The original authors list three potential levels of interpretation for the set of attributes identified by an ADTree:
- Individual nodes can be evaluated for their own predictive ability.
- Sets of nodes on the same path may be interpreted as having a joint effect.
- The tree can be interpreted as a whole.
Care must be taken when interpreting individual nodes, as the scores reflect a reweighting of the data in each iteration.
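The score in the table above can be verified with a one-line computation, summing the prediction values of the nodes the instance traverses (the iteration-5 node is not reached, contributing 0):

```python
# Prediction values traversed by the example instance, iterations 0-6.
predictions = [-0.093, 0.74, -1.446, -0.38, 0.176, 0, 1.66]
total = sum(predictions)
print(round(total, 3))  # 0.657 -> positive, so the instance is classified as spam
```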
Description of the algorithm
The inputs to the alternating decision tree algorithm are:
- A set of inputs (x_1, y_1), ..., (x_m, y_m) where x_i is a vector of attributes and y_i is either −1 or 1. Inputs are also called instances.
- A set of weights w_i corresponding to each instance.
The fundamental element of the ADTree algorithm is the rule. A single rule consists of a precondition, a condition, and two scores. A condition is a predicate of the form "attribute <comparison> value." A precondition is simply a logical conjunction of conditions. Evaluation of a rule involves a pair of nested if statements:
```
if precondition then
    if condition then
        return score_one
    else
        return score_two
    end if
else
    return 0
end if
```
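This evaluation logic can be written directly as a small Python class (the class name and the example predicate are illustrative):

```python
# A rule returns one of two scores when its precondition holds,
# and 0 otherwise.
class Rule:
    def __init__(self, precondition, condition, score_one, score_two):
        self.precondition = precondition  # predicate: instance -> bool
        self.condition = condition        # predicate: instance -> bool
        self.score_one = score_one
        self.score_two = score_two

    def evaluate(self, x):
        if self.precondition(x):
            if self.condition(x):
                return self.score_one
            return self.score_two
        return 0.0

# The root rule of an ADTree has precondition and condition "true";
# 0.25 here is a made-up score.
r0 = Rule(lambda x: True, lambda x: True, 0.25, 0.0)
print(r0.evaluate({"char_freq_bang": 0.08}))  # 0.25
```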
Several auxiliary functions are also required by the algorithm:
- W+(c) returns the sum of the weights of all positively labeled examples that satisfy predicate c.
- W−(c) returns the sum of the weights of all negatively labeled examples that satisfy predicate c.
- W(c) returns the sum of the weights of all examples that satisfy predicate c.
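A straightforward way to realize these helpers, assuming examples are stored as (x, y, w) triples with label y in {−1, +1} and weight w (the data below is illustrative):

```python
# Weight-sum helpers W+(c), W-(c), W(c) over weighted labeled examples.
def w_plus(examples, c):
    return sum(w for x, y, w in examples if y == 1 and c(x))

def w_minus(examples, c):
    return sum(w for x, y, w in examples if y == -1 and c(x))

def w_total(examples, c):
    return sum(w for x, y, w in examples if c(x))

examples = [([0.2], 1, 0.5), ([1.5], -1, 0.25), ([0.8], 1, 0.25)]
c = lambda x: x[0] < 1.0   # an example predicate on attribute 0
print(w_plus(examples, c), w_minus(examples, c), w_total(examples, c))
# 0.75 0 0.75
```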
The algorithm is as follows:
```
function ad_tree
input  Set of m training instances (x_1, y_1), ..., (x_m, y_m)

w_i = 1/m for all i
a = (1/2) ln( W+(true) / W−(true) )
R_0 = a rule with scores a and 0, precondition "true" and condition "true"
P = {true}
C = the set of all possible conditions
for j = 1 ... T
    choose p ∈ P and c ∈ C that minimize
        z = 2 ( sqrt( W+(p ∧ c) W−(p ∧ c) )
              + sqrt( W+(p ∧ ¬c) W−(p ∧ ¬c) ) ) + W(¬p)
    P = P ∪ { p ∧ c, p ∧ ¬c }
    a_1 = (1/2) ln( (W+(p ∧ c) + 1) / (W−(p ∧ c) + 1) )
    a_2 = (1/2) ln( (W+(p ∧ ¬c) + 1) / (W−(p ∧ ¬c) + 1) )
    R_j = new rule with precondition p, condition c, and scores a_1 and a_2
    w_i = w_i * e^( −y_i R_j(x_i) )  for all i
end for
return the set of rules R_j
```
The precondition set grows by two preconditions in each iteration, and the tree structure of a set of rules can be derived by noting which precondition is used in each successive rule.
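The loop above can be sketched as a compact Python implementation. This is a minimal illustration under simplifying assumptions: conditions are single-attribute threshold tests drawn from the observed values, rules are stored as plain tuples, and the toy dataset at the bottom is invented; it is not an optimized or official implementation.

```python
import math
from itertools import product

def train_adtree(X, y, T=3):
    m = len(X)
    w = [1.0 / m] * m

    # W(pred, label): sum of current weights of examples satisfying pred
    # (optionally restricted to one label) -- the W+, W-, W helpers.
    def W(pred, label=None):
        return sum(wi for xi, yi, wi in zip(X, y, w)
                   if pred(xi) and (label is None or yi == label))

    true = lambda x: True
    a = 0.5 * math.log(W(true, 1) / W(true, -1))
    rules = [(true, true, a, 0.0)]          # R_0: scores a and 0
    preconds = [true]
    # Candidate conditions: thresholds between observed attribute values.
    conds = [(j, t) for j in range(len(X[0]))
             for t in sorted({xi[j] for xi in X})]

    def rule_score(rule, x):
        p, c, a1, a2 = rule
        if not p(x):
            return 0.0
        return a1 if c(x) else a2

    for _ in range(T):
        best = None
        for p, (j, t) in product(preconds, conds):
            c = lambda x, j=j, t=t: x[j] < t
            pc = lambda x, p=p, c=c: p(x) and c(x)
            pnc = lambda x, p=p, c=c: p(x) and not c(x)
            # z from the pseudocode above.
            z = 2 * (math.sqrt(W(pc, 1) * W(pc, -1))
                     + math.sqrt(W(pnc, 1) * W(pnc, -1))) \
                + W(lambda x, p=p: not p(x))
            if best is None or z < best[0]:
                best = (z, p, c, pc, pnc)
        _, p, c, pc, pnc = best
        a1 = 0.5 * math.log((W(pc, 1) + 1) / (W(pc, -1) + 1))
        a2 = 0.5 * math.log((W(pnc, 1) + 1) / (W(pnc, -1) + 1))
        rule = (p, c, a1, a2)
        rules.append(rule)
        preconds += [pc, pnc]
        # Reweight: misclassified instances gain weight.
        w = [wi * math.exp(-yi * rule_score(rule, xi))
             for xi, yi, wi in zip(X, y, w)]
    return rules

def classify(rules, x):
    # Sum the scores of all rules, exactly as in ADTree evaluation.
    return sum((a1 if c(x) else a2) if p(x) else 0.0
               for p, c, a1, a2 in rules)

# Toy one-attribute dataset: small values are positive.
X = [[0.0], [0.2], [0.8], [1.0]]
y = [1, 1, -1, -1]
rules = train_adtree(X, y, T=2)
print(classify(rules, [0.1]) > 0)   # True
```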
Empirical results
Figure 6 in the original paper[1] demonstrates that ADTrees are typically as robust as boosted decision trees and boosted decision stumps. Equivalent accuracy can typically be achieved with a much simpler tree structure than that produced by recursive partitioning algorithms.
References
- ^ a b Freund, Y.; Mason, L. (1999). "The alternating decision tree learning algorithm" (PDF). Proceedings of the Sixteenth International Conference on Machine Learning (ICML '99). Morgan Kaufmann. pp. 124–133. ISBN 978-1-55860-612-8.
- ^ Pfahringer, Bernhard; Holmes, Geoffrey; Kirkby, Richard (2001). "Optimizing the Induction of Alternating Decision Trees" (PDF). Advances in Knowledge Discovery and Data Mining. PAKDD 2001. Lecture Notes in Computer Science. Vol. 2035. Springer. pp. 477–487. doi:10.1007/3-540-45357-1_50. ISBN 978-3-540-45357-4.
- ^ "Spambase Data Set". UCI Machine Learning Repository. 1999.
- ^ Dua, D.; Graff, C. (2019). "UCI Machine Learning Repository". University of California, Irvine, School of Information and Computer Sciences.
External links
- An introduction to Boosting and ADTrees (has many graphical examples of alternating decision trees in practice).
- JBoost software implementing ADTrees.