Chapter 1: Decision Trees
In this section we'll implement the algorithms shown as pseudocode in Chapter 1, DecisionTreeTrain and DecisionTreeTest. If you want to follow along in the REPL and haven't done so already, either clone the repository or follow the steps in Project Setup.
If you don't already have a REPL session open, start it from the project directory:
cd ciml
lein repl
Require priority-map, pprint, and Incanter's dataset function:
(require '[clojure.data.priority-map :refer :all])
(require '[clojure.pprint :refer [pprint]])
(require '[incanter.core :refer [dataset]])
If you don't want to follow along step by step, just read the source code in decision_trees.clj and skip to Testing.
Create a dataset that represents the training data found in the appendix:
(def training-data
(dataset
[:rating :easy? :ai? :sys? :thy? :morning?]
[[2 1 1 0 1 0]
[2 1 1 0 1 0]
[2 0 1 0 0 0]
[2 0 0 0 1 0]
[2 0 1 1 0 1]
[1 1 1 0 0 0]
[1 1 1 0 1 0]
[1 0 1 0 1 0]
[0 0 0 0 0 1]
[0 1 0 0 1 1]
[0 0 1 0 1 0]
[0 1 1 1 1 1]
[-1 1 1 1 0 1]
[-1 0 0 1 1 0]
[-1 0 0 1 0 1]
[-1 1 0 1 0 1]
[-2 0 0 1 1 0]
[-2 0 1 1 0 1]
[-2 1 0 1 0 0]
[-2 1 0 1 0 1]]))
Just for fun, pretty print the rows:
(pprint (:rows training-data))
; => ({:morning? 0, :thy? 1, :sys? 0, :ai? 1, :easy? 1, :rating 2}
; {:morning? 0, :thy? 1, :sys? 0, :ai? 1, :easy? 1, :rating 2}
; {:morning? 0, :thy? 0, :sys? 0, :ai? 1, :easy? 0, :rating 2}
; {:morning? 0, :thy? 1, :sys? 0, :ai? 0, :easy? 0, :rating 2}
; {:morning? 1, :thy? 0, :sys? 1, :ai? 1, :easy? 0, :rating 2}
; {:morning? 0, :thy? 0, :sys? 0, :ai? 1, :easy? 1, :rating 1}
; {:morning? 0, :thy? 1, :sys? 0, :ai? 1, :easy? 1, :rating 1}
; {:morning? 0, :thy? 1, :sys? 0, :ai? 1, :easy? 0, :rating 1}
; {:morning? 1, :thy? 0, :sys? 0, :ai? 0, :easy? 0, :rating 0}
; {:morning? 1, :thy? 1, :sys? 0, :ai? 0, :easy? 1, :rating 0}
; {:morning? 0, :thy? 1, :sys? 0, :ai? 1, :easy? 0, :rating 0}
; {:morning? 1, :thy? 1, :sys? 1, :ai? 1, :easy? 1, :rating 0}
; {:morning? 1, :thy? 0, :sys? 1, :ai? 1, :easy? 1, :rating -1}
; {:morning? 0, :thy? 1, :sys? 1, :ai? 0, :easy? 0, :rating -1}
; {:morning? 1, :thy? 0, :sys? 1, :ai? 0, :easy? 0, :rating -1}
; {:morning? 1, :thy? 0, :sys? 1, :ai? 0, :easy? 1, :rating -1}
; {:morning? 0, :thy? 1, :sys? 1, :ai? 0, :easy? 0, :rating -2}
; {:morning? 1, :thy? 0, :sys? 1, :ai? 1, :easy? 0, :rating -2}
; {:morning? 0, :thy? 0, :sys? 1, :ai? 0, :easy? 1, :rating -2}
; {:morning? 1, :thy? 0, :sys? 1, :ai? 0, :easy? 1, :rating -2})
Now let's look at DecisionTreeTrain. Follow along using the pseudocode listed in the book. The first line indicates that we need a way to determine the most frequent answer in the dataset. Let's start with a function that returns a label for a given row. Rows with a rating of 0 or more are labelled :like. Rows with a rating of less than 0 are labelled :hate.
(defn label [row]
(if (< (row :rating) 0) :hate :like))
(->> training-data :rows first label)
; => :like
(->> training-data :rows last label)
; => :hate
Looks good. Now define a function that will label all rows:
(defn labels [rows]
(->> rows (map label)))
(->> training-data :rows labels)
; => (:like :like :like :like :like :like :like :like :like :like :like :like :hate :hate :hate :hate :hate :hate :hate :hate)
Now let's go about finding the most common label:
(->> training-data :rows labels frequencies)
; => {:like 12, :hate 8}
(->> training-data :rows labels frequencies (into (priority-map-by >)) first)
; => [:like 12]
This returns the first entry, which priority-map-by guarantees has the largest value. Use ffirst instead of first to get just the key of that entry:
(->> training-data :rows labels frequencies (into (priority-map-by >)) ffirst)
; => :like
Now let's define that as a function:
(defn most-common-label [rows]
(->> rows labels frequencies (into (priority-map-by >)) ffirst))
(most-common-label (:rows training-data))
; => :like
Next, the pseudocode has a conditional statement that needs to determine whether the labels are unambiguous. Let's do that by collecting the labels into a hash set and checking whether the set contains exactly one element.
(->> training-data :rows labels (into #{}))
; => #{:like :hate}
(->> training-data :rows labels (into #{}) count)
; => 2
(->> training-data :rows labels (into #{}) count (= 1))
; => false
Define that as a function:
(defn unambiguous [rows]
(->> rows labels (into #{}) count (= 1)))
(unambiguous [{:rating 2} {:rating 1} {:rating 0}])
; => true
(unambiguous [{:rating 2} {:rating 1} {:rating -2}])
; => false
Next, the pseudocode has a conditional statement that checks whether the set of remaining features is empty. We can use Clojure's built-in empty? function for that.
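For example, empty? works directly on the hash sets of features we'll be passing around:
(empty? #{})
; => true
(empty? #{:morning? :ai?})
; => false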
After that, the meat of the algorithm starts. For each of the remaining features, it splits up the dataset into subsets based upon the value of the given feature.
(->> training-data :rows (group-by :easy?) pprint)
; => {1
; [{:morning? 0, :thy? 1, :sys? 0, :ai? 1, :easy? 1, :rating 2}
; {:morning? 0, :thy? 1, :sys? 0, :ai? 1, :easy? 1, :rating 2}
; {:morning? 0, :thy? 0, :sys? 0, :ai? 1, :easy? 1, :rating 1}
; {:morning? 0, :thy? 1, :sys? 0, :ai? 1, :easy? 1, :rating 1}
; {:morning? 1, :thy? 1, :sys? 0, :ai? 0, :easy? 1, :rating 0}
; {:morning? 1, :thy? 1, :sys? 1, :ai? 1, :easy? 1, :rating 0}
; {:morning? 1, :thy? 0, :sys? 1, :ai? 1, :easy? 1, :rating -1}
; {:morning? 1, :thy? 0, :sys? 1, :ai? 0, :easy? 1, :rating -1}
; {:morning? 0, :thy? 0, :sys? 1, :ai? 0, :easy? 1, :rating -2}
; {:morning? 1, :thy? 0, :sys? 1, :ai? 0, :easy? 1, :rating -2}],
; 0
; [{:morning? 0, :thy? 0, :sys? 0, :ai? 1, :easy? 0, :rating 2}
; {:morning? 0, :thy? 1, :sys? 0, :ai? 0, :easy? 0, :rating 2}
; {:morning? 1, :thy? 0, :sys? 1, :ai? 1, :easy? 0, :rating 2}
; {:morning? 0, :thy? 1, :sys? 0, :ai? 1, :easy? 0, :rating 1}
; {:morning? 1, :thy? 0, :sys? 0, :ai? 0, :easy? 0, :rating 0}
; {:morning? 0, :thy? 1, :sys? 0, :ai? 1, :easy? 0, :rating 0}
; {:morning? 0, :thy? 1, :sys? 1, :ai? 0, :easy? 0, :rating -1}
; {:morning? 1, :thy? 0, :sys? 1, :ai? 0, :easy? 0, :rating -1}
; {:morning? 0, :thy? 1, :sys? 1, :ai? 0, :easy? 0, :rating -2}
; {:morning? 1, :thy? 0, :sys? 1, :ai? 1, :easy? 0, :rating -2}]}
Simple enough. We only need the subsets, so call vals on the result:
(->> training-data :rows (group-by :easy?) vals pprint)
; => ([{:morning? 0, :thy? 1, :sys? 0, :ai? 1, :easy? 1, :rating 2}
; {:morning? 0, :thy? 1, :sys? 0, :ai? 1, :easy? 1, :rating 2}
; {:morning? 0, :thy? 0, :sys? 0, :ai? 1, :easy? 1, :rating 1}
; {:morning? 0, :thy? 1, :sys? 0, :ai? 1, :easy? 1, :rating 1}
; {:morning? 1, :thy? 1, :sys? 0, :ai? 0, :easy? 1, :rating 0}
; {:morning? 1, :thy? 1, :sys? 1, :ai? 1, :easy? 1, :rating 0}
; {:morning? 1, :thy? 0, :sys? 1, :ai? 1, :easy? 1, :rating -1}
; {:morning? 1, :thy? 0, :sys? 1, :ai? 0, :easy? 1, :rating -1}
; {:morning? 0, :thy? 0, :sys? 1, :ai? 0, :easy? 1, :rating -2}
; {:morning? 1, :thy? 0, :sys? 1, :ai? 0, :easy? 1, :rating -2}]
; [{:morning? 0, :thy? 0, :sys? 0, :ai? 1, :easy? 0, :rating 2}
; {:morning? 0, :thy? 1, :sys? 0, :ai? 0, :easy? 0, :rating 2}
; {:morning? 1, :thy? 0, :sys? 1, :ai? 1, :easy? 0, :rating 2}
; {:morning? 0, :thy? 1, :sys? 0, :ai? 1, :easy? 0, :rating 1}
; {:morning? 1, :thy? 0, :sys? 0, :ai? 0, :easy? 0, :rating 0}
; {:morning? 0, :thy? 1, :sys? 0, :ai? 1, :easy? 0, :rating 0}
; {:morning? 0, :thy? 1, :sys? 1, :ai? 0, :easy? 0, :rating -1}
; {:morning? 1, :thy? 0, :sys? 1, :ai? 0, :easy? 0, :rating -1}
; {:morning? 0, :thy? 1, :sys? 1, :ai? 0, :easy? 0, :rating -2}
; {:morning? 1, :thy? 0, :sys? 1, :ai? 1, :easy? 0, :rating -2}])
Define that as a function:
(defn split-by-feature-values [rows feature]
(->> rows (group-by feature) vals))
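As a quick sanity check, splitting on :easy? yields the two subsets we just printed:
(count (split-by-feature-values (:rows training-data) :easy?))
; => 2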
Now we need a way to count the majority vote for each subset. To facilitate interactive development of that function, get the first subset:
(def subset (first (split-by-feature-values (:rows training-data) :morning?)))
Group the subset by label:
(->> subset (group-by label) vals pprint)
; => ([{:morning? 0, :thy? 1, :sys? 0, :ai? 1, :easy? 1, :rating 2}
; {:morning? 0, :thy? 1, :sys? 0, :ai? 1, :easy? 1, :rating 2}
; {:morning? 0, :thy? 0, :sys? 0, :ai? 1, :easy? 0, :rating 2}
; {:morning? 0, :thy? 1, :sys? 0, :ai? 0, :easy? 0, :rating 2}
; {:morning? 0, :thy? 0, :sys? 0, :ai? 1, :easy? 1, :rating 1}
; {:morning? 0, :thy? 1, :sys? 0, :ai? 1, :easy? 1, :rating 1}
; {:morning? 0, :thy? 1, :sys? 0, :ai? 1, :easy? 0, :rating 1}
; {:morning? 0, :thy? 1, :sys? 0, :ai? 1, :easy? 0, :rating 0}]
; [{:morning? 0, :thy? 1, :sys? 1, :ai? 0, :easy? 0, :rating -1}
; {:morning? 0, :thy? 1, :sys? 1, :ai? 0, :easy? 0, :rating -2}
; {:morning? 0, :thy? 0, :sys? 1, :ai? 0, :easy? 1, :rating -2}])
Count the population of each group:
(->> subset (group-by label) vals (map count))
; (8 3)
...and get the maximum:
(->> subset (group-by label) vals (map count) (apply max))
; 8
Now we know how to implement majority-vote-count:
(defn majority-vote-count [rows]
(->> rows (group-by label) vals (map count) (apply max)))
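Applying it to the subset we've been working with returns the maximum we just computed:
(majority-vote-count subset)
; => 8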
Now, given split-by-feature-values and majority-vote-count, we can implement a function that scores a given feature:
(defn score-feature [rows feature]
(->> (split-by-feature-values rows feature)
(map majority-vote-count)
(apply +)))
(score-feature (:rows training-data) :morning?)
; => 13
(score-feature (:rows training-data) :sys?)
; => 18
Given score-feature, we can implement a function that scores multiple features:
(defn score-features [rows features]
(zipmap features (map (partial score-feature rows) features)))
(score-features (:rows training-data) #{:morning? :thy? :sys? :ai? :easy?})
; => {:easy? 12, :thy? 14, :sys? 18, :morning? 13, :ai? 15}
With the functions that we have implemented, we're almost ready to write the training function. We just need a few more functions to handle some odds and ends that aren't explicit in the pseudocode.
Given an Incanter dataset that represents our training data, we need a way to get a hash set of its features:
(defn feature-set [dataset]
(let [column-set (->> dataset :column-names (into #{}))]
(disj column-set :rating)))
(feature-set training-data)
; => #{:ai? :morning? :sys? :thy? :easy?}
We also need a way to determine which features are relevant for a given group of rows. For example, consider the following rows:
[{:rating 2 :morning? 1 :ai? 0}
{:rating 2 :morning? 1 :ai? 1}
{:rating -2 :morning? 1 :ai? 1}]
In this case, :morning? is not relevant, so it should be excluded from consideration during scoring.
(defn relevant-feature? [rows feature]
(< 1 (count (split-by-feature-values rows feature))))
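A quick check against the rows above confirms the behaviour: every row has :morning? set to 1, so splitting on it produces a single subset.
(relevant-feature? [{:rating 2 :morning? 1 :ai? 0}
                    {:rating 2 :morning? 1 :ai? 1}
                    {:rating -2 :morning? 1 :ai? 1}]
                   :morning?)
; => false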
(defn retain-relevant [rows features]
(into #{} (filter (partial relevant-feature? rows) features)))
(retain-relevant [{:rating 2 :morning? 1 :ai? 0}
{:rating 2 :morning? 1 :ai? 1}
{:rating -2 :morning? 1 :ai? 1}]
#{:morning? :ai?})
; => #{:ai?}
Now we are ready to implement the training function.
(defn decision-tree-train
([dataset] (decision-tree-train (:rows dataset) (feature-set dataset)))
([rows features]
(if (or (unambiguous rows) (empty? features))
(most-common-label rows)
(let [relevant-features (retain-relevant rows features)
scored-features (score-features rows relevant-features)
sorted-by-score (into (priority-map-by >) scored-features)
winning-feature (ffirst sorted-by-score)
split-rows (group-by winning-feature rows)
remaining-features (disj relevant-features winning-feature)]
{:feature winning-feature
0 (decision-tree-train (split-rows 0) remaining-features)
1 (decision-tree-train (split-rows 1) remaining-features)}))))
This function has two arities. The first takes a single parameter, the Incanter dataset that represents the training data; it simply pulls the rows and the set of features from the dataset and delegates to the two-argument arity.
The second arity is almost a direct translation of the pseudocode listed in the chapter. First, if the rows are unambiguous or the remaining feature set is empty, it returns the most common label. Otherwise, it determines which features are still relevant, scores those features, picks a winning feature (the one with the highest score), splits the rows on the winning feature's values, and determines which features remain to be considered. Finally, the function builds the subtrees by recursively calling itself on each subset of rows.
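You can see the root-level choice directly from the scores computed earlier: :sys? wins with a score of 18.
(->> (score-features (:rows training-data) (feature-set training-data))
     (into (priority-map-by >))
     ffirst)
; => :sys?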
Observe the output:
(pprint (decision-tree-train training-data))
; => {:feature :sys?,
; 0 :like,
; 1
; {:feature :ai?,
; 0 :hate,
; 1
; {:feature :thy?,
; 0 {:feature :easy?, 0 :like, 1 :hate},
; 1 :like}}}
Later on, the chapter suggests modifying the training function to accept a depth limit as a hyperparameter. This is a simple modification of the original training function:
(defn decision-tree-train-with-limit
([dataset limit]
(decision-tree-train-with-limit (:rows dataset) (feature-set dataset) limit))
([rows features limit]
(if (or (= 0 limit) (unambiguous rows) (empty? features))
(most-common-label rows)
(let [relevant-features (retain-relevant rows features)
scored-features (score-features rows relevant-features)
sorted-by-score (into (priority-map-by >) scored-features)
winning-feature (ffirst sorted-by-score)
split-rows (group-by winning-feature rows)
remaining-features (disj relevant-features winning-feature)]
{:feature winning-feature
0 (decision-tree-train-with-limit (split-rows 0) remaining-features (dec limit))
1 (decision-tree-train-with-limit (split-rows 1) remaining-features (dec limit))}))))
(decision-tree-train-with-limit training-data 0)
; => :like
(decision-tree-train-with-limit training-data 1)
; => {:feature :sys?, 0 :like, 1 :hate}
(decision-tree-train-with-limit training-data 2)
; => {:feature :sys?, 0 :like, 1 {:feature :ai?, 0 :hate, 1 :like}}
The test function is simple:
(defn decision-tree-test [tree test-point]
(if (map? tree)
(decision-tree-test (tree (test-point (tree :feature))) test-point)
tree))
If you didn't follow along and implement all of the functions in the REPL, you can get them from the project:
(use 'ciml.decision-trees)
(use 'incanter.core)
To test out the decision tree:
(-> training-data
decision-tree-train
(decision-tree-test {:ai? 0 :easy? 1 :morning? 0 :thy? 1 :sys? 1}))
; => :hate
(-> training-data
decision-tree-train
(decision-tree-test {:ai? 0 :easy? 0 :morning? 1 :thy? 1 :sys? 0}))
; => :like
(def dessert-decision-tree
(decision-tree-train
(dataset [:rating :chocolate? :vanilla? :strawberry?]
[[2 1 0 0]
[1 0 1 0]
[1 0 0 1]
[-1 0 1 1]
[-2 1 1 1]])))
(decision-tree-test dessert-decision-tree {:strawberry? 1, :chocolate? 1, :vanilla? 0})
; => :like
(decision-tree-test dessert-decision-tree {:strawberry? 1, :chocolate? 1, :vanilla? 1})
; => :hate
Not too bad: even after the in-depth discussion of the algorithm, the implementation comes in at under 100 lines of code, including the sample training data and both versions of the training function. Most of it is composed of small functions that are specific and concise.
This implementation does have its limitations, however. In particular, it only works with binary feature values. If the training data includes nominal or continuous features, the result is a decision tree that returns nil for some inputs.
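For instance, here's a sketch of that failure mode (the test point below is hypothetical and uses a non-binary value for :sys?):
(-> training-data
    decision-tree-train
    ; hypothetical test point with :sys? set to 2 (not 0 or 1)
    (decision-tree-test {:ai? 1 :easy? 1 :morning? 0 :thy? 1 :sys? 2}))
; => nil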