Running XGBoost with R Studio

Let's say we want to use xgboost to make predictions about our data. Here is the sample data that we're going to use:

Some terminology before moving on. xgboost in R uses the term label for the expected output we give the model when we're building it. Yes, it is really confusing. In other words, the label is the value we want our predictions to output.

Basically, what we're trying to find out is whether smoking and high sugar intake lead to a person having a disease. This is fake data, of course. There are people who smoke and eat as much chocolate as they like and still look sharp (not me, though).

First we will create this data in R. The code below loads some libraries and then creates a data frame called 'a'. Next it converts 'a' into a data table 'd' and drops the id column.


library(xgboost)
library(Matrix)
library(data.table)

if (!require('vcd')) install.packages('vcd')

# Build the sample data: 20 people, one row each
a <- data.frame(
  id = 1:20,
  smoked = c('Yes','No','Casual','Casual','Casual','Yes','Yes','Yes','Yes','Yes',
             'Yes','Yes','Yes','Yes','Yes','Yes','Yes','Yes','Casual','Casual'),
  highIntakeSugar = c('Yes','No','Yes','Yes','Yes','Yes','Yes','Yes','Yes','Yes',
                      'Yes','Yes','Yes','Yes','Yes','Yes','Yes','Yes','Yes','Yes'),
  sex = c('M','F','F','M','F','F','M','F','F','M',
          'F','F','M','F','F','F','F','M','F','F'),
  disease = c('Yes','No','Unknown','Unknown','Yes','Unknown','Unknown','Yes','Yes','Yes',
              'Yes','Yes','Yes','Yes','Yes','Yes','Yes','Yes','Yes','Yes'),
  age = c(20,21,45,45,40,45,35,40,45,45,
          40,45,45,40,45,40,45,45,40,45)
)

# Convert the data frame 'a' into a data table 'd'
d <- data.table(a, keep.rownames = FALSE)

# Drop the id column, it carries no information for the model
d[, id := NULL]

Because xgboost works with numeric data, we convert our categorical columns into a sparse matrix of 0/1 indicator columns. We're going to leave age out of the sparse matrix, as shown below:

 s <- sparse.model.matrix(age ~ ., data = d)

Looking at the variable s, we get the dataset below. The smoked column is separated into 2 columns, smokedNo and smokedYes.

Other columns like sex become sexM and hold 1 or 0. The value 1 denotes "Male" while 0 denotes "Female". The same principle applies to other columns like highIntakeSugarYes.

This gives us a table that looks like the one below:
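As a rough illustration of this encoding, here is a minimal sketch using base R's model.matrix(), which builds the same kind of indicator columns as sparse.model.matrix() but returns an ordinary dense matrix. The three-row data frame `toy` is made up for this example and is not the data set from this post.

```r
# Hypothetical three-row data frame, only to show the encoding.
toy <- data.frame(
  smoked = factor(c("Casual", "No", "Yes")),
  sex    = factor(c("F", "M", "F")),
  age    = c(30, 40, 50)
)

# age sits on the left of the formula, so it is left out of the matrix.
m <- model.matrix(age ~ ., data = toy)

# Each factor gets one 0/1 column per level except its first level,
# which serves as the baseline (here "Casual" and "F").
print(colnames(m))
print(m[, "sexM"])
```

The first level of each factor has no column of its own: a row with 0 in both smokedNo and smokedYes is a "Casual" smoker.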

The next thing we need to do is tell the model the results we expect, using a label. Yep, it is a bit confusing, but that is what R calls it.

 ov = d[,disease] == 'Yes'

Here we are saying that these are the disease results we expect, and the model should learn from them. The next step is to create our xgboost model:
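To make the label concrete, here is a small sketch of how a comparison like the one above turns a text column into the values the model learns from. The vector `disease` is a made-up stand-in, not the full column from this post.

```r
# Hypothetical stand-in for the disease column.
disease <- c("Yes", "No", "Unknown", "Yes")

# TRUE only where the value is exactly "Yes"; both "No" and "Unknown"
# become FALSE, so unknown cases are treated as negatives.
ov <- disease == "Yes"

# As numbers, the label is 1 for TRUE and 0 for FALSE.
label <- as.numeric(ov)
print(label)
```

If counting "Unknown" as "no disease" is not what you want, one option is to drop those rows before building the matrix and the label.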

mdl <- xgboost(data = s, label = ov, max_depth = 4, eta = 1, nthread = 2, nrounds = 10, objective = "binary:logistic")

 [1] train-error:0.000000 
[2] train-error:0.000000 
[3] train-error:0.000000 
[4] train-error:0.000000 
[5] train-error:0.000000 
[6] train-error:0.000000 
[7] train-error:0.000000 
[8] train-error:0.000000 
[9] train-error:0.000000 
[10] train-error:0.000000

Wow, our model is perfect. :) It's really because our data is small and straightforward, and because the error is measured on the same rows the model was trained on.
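One caveat worth checking, sketched here with base R's model.matrix() on a made-up data frame `toy` (an assumption for illustration, not this post's data): if the formula is age ~ . and the data still contains the disease column, the matrix carries a diseaseYes column that is identical to the label. A model can then reach a training error of zero just by reading that column.

```r
# Hypothetical data frame for illustration only.
toy <- data.frame(
  smoked  = factor(c("Yes", "No", "Casual", "Yes")),
  disease = factor(c("Yes", "No", "Unknown", "Yes")),
  age     = c(20, 21, 45, 40)
)

m  <- model.matrix(age ~ ., data = toy)
ov <- toy$disease == "Yes"

# The encoded diseaseYes column matches the label row for row.
leak <- all(m[, "diseaseYes"] == as.numeric(ov))
print(leak)

# Training error is the share of rows predicted wrongly; predicting
# straight from the leaked column gets every row right.
pred <- m[, "diseaseYes"] > 0.5
train_error <- mean(pred != ov)
print(train_error)

# Leaving disease out of the features avoids the leak.
m_clean <- model.matrix(age ~ . - disease, data = toy)
print(colnames(m_clean))
```

With the disease columns removed from the features, a training error above zero is to be expected, since people with the same smoking, sugar and sex values can have different outcomes.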

To see which features are important in our model, you can use the following command:

 xgb.importance(feature_names = colnames(s), model = mdl)
