Running XGBoost with RStudio
Let's say we are trying to use xgboost to make predictions about our data. Here is the sample data we're going to be using:
Some terminology before moving on: xgboost uses the term label for the expected output we train the model against. Yes, it is confusing at first. The label is the outcome our predictions are meant to reproduce.
Basically, what we're trying to find out is whether smoking and high sugar intake lead to a person having the disease. This is fake data, of course. There are people who smoke and eat as much chocolate as they like and still look sharp. (Not me, though.)
First we will create this data in R. The code below loads the required libraries, creates a data frame 'a', and then converts 'a' into a data table 'd'.
require(xgboost)
require(Matrix)
require(data.table)
if (!require('vcd')) install.packages('vcd')
a <- data.frame(
  id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
  smoked = c('Yes', 'No', 'Casual', 'Casual', 'Casual', 'Yes', 'Yes', 'Yes',
             'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes',
             'Yes', 'Casual', 'Casual'),
  highIntakeSugar = c('Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes',
                      'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes',
                      'Yes', 'Yes', 'Yes', 'Yes'),
  sex = c('M', 'F', 'F', 'M', 'F', 'F', 'M', 'F', 'F', 'M', 'F', 'F', 'M',
          'F', 'F', 'F', 'F', 'M', 'F', 'F'),
  disease = c('Yes', 'No', 'Unknown', 'Unknown', 'Yes', 'Unknown', 'Unknown',
              'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes',
              'Yes', 'Yes', 'Yes', 'Yes'),
  age = c(20, 21, 45, 45, 40, 45, 35, 40, 45, 45, 40, 45, 45, 40, 45, 40,
          45, 45, 40, 45)
)
d <- data.table(a)   # convert the data frame into a data table
d[, id := NULL]      # drop the id column; it carries no signal
Because xgboost only deals with numeric data, we use a sparse model matrix to one-hot encode the factor columns. We're going to omit age from the sparse matrix, as shown below:
s <- sparse.model.matrix(disease ~ . - age, data = d)
Looking at the variable s, we get the dataset below. The smoked column has been separated into two columns. Other columns behave the same way: sex becomes sexM and holds 1 or 0, where 1 denotes "Male" and 0 denotes "Female". The same principle applies to the other columns, such as the sugar one. This gives us a table that looks like the one below:
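If you want to see the encoding for yourself rather than trust the table, you can inspect s directly. This is a small sketch assuming the sparse matrix s built above; as.matrix is only safe here because the data is tiny.

```r
# Column names are generated by sparse.model.matrix from the factor levels.
colnames(s)
# Densify just to eyeball the 0/1 encoding (fine for 20 rows).
head(as.matrix(s))
```

You should see dummy columns such as smokedNo, smokedYes, highIntakeSugarYes and sexM, with one reference level per factor folded into the intercept.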
The next thing we need to do is train our model against the expected results, passed in as the label (yes, it is a bit confusing, but that is what xgboost calls it).
ov <- d[, disease] == 'Yes'
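The label has to be numeric or logical, and the comparison above gives us exactly that. One subtlety worth noting: the 'Unknown' rows also compare as FALSE, so this labelling treats 'Unknown' the same as 'No'. A standalone illustration (disease_demo and ov_demo are just throwaway names):

```r
# A few outcomes like the ones in our data.
disease_demo <- c("Yes", "No", "Unknown", "Unknown", "Yes")
# Comparison yields a logical vector usable as an xgboost label.
ov_demo <- disease_demo == "Yes"
print(ov_demo)  # TRUE FALSE FALSE FALSE TRUE
```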
Here, we are telling xgboost that these are the outcomes we expect it to learn from the data. The next step is to create our xgboost model:
mdl <- xgboost(data = s, label = ov, max_depth = 4, eta = 1, nthread = 2,
               nrounds = 10, objective = "binary:logistic")
[1] train-error:0.000000
[2] train-error:0.000000
[3] train-error:0.000000
[4] train-error:0.000000
[5] train-error:0.000000
[6] train-error:0.000000
[7] train-error:0.000000
[8] train-error:0.000000
[9] train-error:0.000000
[10] train-error:0.000000
Wow, our model is perfect. :) That is really because our data is simple and straightforward.
To see which features are the most important in our model, you can try the following command:
xgb.importance(feature_names = colnames(s), model = mdl)
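Since the whole point is to make predictions, it is worth showing how to score data with the fitted model. This is a sketch assuming the mdl, s and d objects built above; with objective = "binary:logistic", predict returns probabilities between 0 and 1, and the 0.5 cutoff used here is an arbitrary choice.

```r
# Predicted probability of disease for each row of the training matrix.
p <- predict(mdl, s)
# Turn probabilities into Yes/No classes with a 0.5 cutoff.
pred_class <- ifelse(p > 0.5, "Yes", "No")
# Cross-tabulate predictions against the recorded outcomes.
table(predicted = pred_class, actual = d[, disease])
```

In a real project you would score a held-out matrix built with the same sparse.model.matrix formula, not the training data itself.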
You can always try out the complete code sample below:
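The embedded code sample did not survive publishing, so here is the walkthrough reassembled end to end from the steps above; nothing new, just the same code in one runnable piece.

```r
require(xgboost)
require(Matrix)
require(data.table)

# Build the fake smoking/sugar/disease dataset.
a <- data.frame(
  id = 1:20,
  smoked = c('Yes', 'No', 'Casual', 'Casual', 'Casual', 'Yes', 'Yes', 'Yes',
             'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes',
             'Yes', 'Casual', 'Casual'),
  highIntakeSugar = c('Yes', 'No', rep('Yes', 18)),
  sex = c('M', 'F', 'F', 'M', 'F', 'F', 'M', 'F', 'F', 'M', 'F', 'F', 'M',
          'F', 'F', 'F', 'F', 'M', 'F', 'F'),
  disease = c('Yes', 'No', 'Unknown', 'Unknown', 'Yes', 'Unknown', 'Unknown',
              rep('Yes', 13)),
  age = c(20, 21, 45, 45, 40, 45, 35, 40, 45, 45, 40, 45, 45, 40, 45, 40,
          45, 45, 40, 45)
)

d <- data.table(a)
d[, id := NULL]                      # drop the id column

# One-hot encode the factors, leaving age out of the matrix.
s <- sparse.model.matrix(disease ~ . - age, data = d)

# Label: TRUE where the disease outcome is 'Yes'.
ov <- d[, disease] == 'Yes'

# Train the model.
mdl <- xgboost(data = s, label = ov, max_depth = 4, eta = 1, nthread = 2,
               nrounds = 10, objective = "binary:logistic")

# Feature importance.
xgb.importance(feature_names = colnames(s), model = mdl)
```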