
Running XGBoost with R Studio



Let's say we want to use xgboost to make predictions about our data. Here is the sample data we're going to use:





Some terminology before moving on. xgboost uses the term label for the expected output we supply when building our model. Yes, it is confusing: the label is the known outcome that our predictions try to reproduce.



Basically, what we're trying to find is whether smoking and a high sugar intake lead to a person having a disease. This is fake data, of course. There are people who smoke and eat as much chocolate as they like and still look sharp (not me, though).

First we will create this data in R. The code below loads some libraries, creates a data frame called 'a', and then converts 'a' into a data table 'd'.


require(xgboost)

require(Matrix)

require(data.table)

if (!require('vcd')) install.packages('vcd')

a = data.frame(id=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17, 18, 19,20), smoked=c('Yes','No','Casual', 'Casual', 'Casual','Yes', 'Yes', 'Yes','Yes', 'Yes', 'Yes','Yes', 'Yes', 'Yes','Yes','Yes','Yes', 'Yes', 'Casual','Casual'), highIntakeSugar=c('Yes','No','Yes', 'Yes','Yes', 'Yes', 'Yes','Yes','Yes', 'Yes','Yes','Yes', 'Yes','Yes','Yes', 'Yes','Yes', 'Yes', 'Yes','Yes'), sex=c('M','F','F', 'M','F','F', 'M','F','F', 'M','F','F', 'M','F','F','F','F', 'M','F','F'), disease=c('Yes','No','Unknown','Unknown','Yes','Unknown', 'Unknown','Yes','Yes', 'Yes','Yes','Yes', 'Yes', 'Yes','Yes', 'Yes','Yes', 'Yes', 'Yes','Yes'), age=c(20,21,45, 45, 40,45, 35, 40,45, 45, 40,45, 45,40,45,40,45,45,40,45))

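d <- data.table(a)  # convert the data frame 'a' into the data table 'd'

d[, id := NULL]  # drop the id column, which carries no information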





Because xgboost deals with numeric data, we use a sparse matrix to encode the categorical columns as 0/1 indicator columns. We're going to omit age from the sparse matrix, as shown below:

 s <- sparse.model.matrix(age ~ ., data = d)

Looking at the variable s, we get the dataset below. The smoked column is split into 2 columns.

Other columns, like sex, become sexM and hold 1 or 0. The value 1 denotes "Male" while 0 denotes "Female". The same principle applies to other columns, like highIntakeSugarYes.

This gives us a table that looks like below :-
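The same encoding can be seen on a tiny made-up data frame. This is a minimal sketch, assuming base R's model.matrix as a dense stand-in for sparse.model.matrix; the toy data frame and its values are hypothetical, not the data above:

```r
# Hypothetical three-row data frame, for illustration only
toy <- data.frame(
  sex    = factor(c("M", "F", "F")),
  smoked = factor(c("Yes", "No", "Casual")),
  age    = c(20, 21, 45)
)

# age is on the left of the formula, so it is not turned into a column
m <- model.matrix(age ~ ., data = toy)

colnames(m)  # "(Intercept)" "sexM" "smokedNo" "smokedYes"
m
```

Each factor loses its first level, which becomes the baseline. A row with smokedNo = 0 and smokedYes = 0 is therefore a "Casual" smoker, which is why a three-level column needs only two indicator columns.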







The next thing we need to do is give the model the expected results through the label. Yes, it is a bit confusing, but that is what it is called.
 ov = d[,disease] == 'Yes'
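To make the effect of this comparison concrete, here is a small sketch on a made-up vector (the values are hypothetical, not the column above). Note that anything other than 'Yes', including 'Unknown', ends up as FALSE:

```r
# Hypothetical disease values, for illustration only
disease <- c("Yes", "No", "Unknown", "Yes")

ov_toy <- disease == "Yes"  # logical vector: TRUE only where the value is "Yes"
ov_toy                      # TRUE FALSE FALSE TRUE
as.numeric(ov_toy)          # 1 0 0 1
```

So with this label, the model is learning "disease is Yes" versus "everything else", and the 'Unknown' rows are treated as negatives.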


Here we are saying that a disease value of 'Yes' is the outcome the model should learn to predict. The next step is to create our xgboost model:

mdl <- xgboost(data = s, label = ov, max_depth = 4, eta = 1, nthread = 2, nrounds = 10, objective = "binary:logistic")


 [1] train-error:0.000000 
[2] train-error:0.000000 
[3] train-error:0.000000 
[4] train-error:0.000000 
[5] train-error:0.000000 
[6] train-error:0.000000 
[7] train-error:0.000000 
[8] train-error:0.000000 
[9] train-error:0.000000 
[10] train-error:0.000000 
Wow, our model is perfect. :) It's really because our data is simple and straightforward, and because the error is measured on the same rows the model was trained on.
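The train-error figure is just the fraction of rows that are misclassified. A minimal sketch of that calculation, assuming the usual convention that a predicted probability above 0.5 counts as a positive; the probabilities and labels below are made up and do not come from the model above:

```r
# Hypothetical predicted probabilities and true labels, for illustration only
prob  <- c(0.92, 0.15, 0.60, 0.40)
label <- c(TRUE, FALSE, FALSE, TRUE)

pred <- prob > 0.5                  # threshold the probabilities at 0.5
train_error <- mean(pred != label)  # fraction of rows predicted wrongly
train_error                         # 0.5: rows 3 and 4 are wrong
```

A value of 0.000000 on every round, as in the log above, means every training row landed on the correct side of the threshold.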

To see which features matter most to our model, you can use the following command:



 xgb.importance(feature_names = colnames(s), model = mdl)

You can always try it out using the code sample below:










