Fast.Ai Machine Learning: 1

Savan Nahar
8 min read · Nov 3, 2018

This series will take you through fast.ai’s Introduction to Machine Learning for Coders. I’ll briefly explain each lesson, walk through the code, and describe what goes on behind the scenes. Thanks to Jeremy and Rachel, who have created the best guide for all the coders out there.

Lessons : 1 , 2, 3, 4, 5, 6, 7, 8, 9, 10, 12

This is Part 1/12 of the Lecture Notes: INTRODUCTION TO RANDOM FORESTS

Lesson 1 will show you how to create a “random forest” — perhaps the most widely applicable machine learning model — to create a solution to the “Blue Book for Bulldozers” Kaggle competition, which will get you into the top 25% on the leaderboard. You’ll learn how to use a Jupyter Notebook to build and analyze models, how to download data, and other basic skills you need to get started with machine learning in practice.

Environment Setup

If you are using PaperSpace then use the script given here. Assuming you have your GPU and Anaconda Setup (Preferably CUDA ≥9):

Conda install for GPU :
conda install -c pytorch pytorch-nightly cuda92
conda install -c fastai torchvision-nightly
conda install -c fastai fastai
Conda install for CPU :
conda install -c pytorch pytorch-nightly-cpu
conda install -c fastai torchvision-nightly-cpu
conda install -c fastai fastai
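Once the installs finish, a quick sanity check like the following (a minimal sketch, run in a notebook or Python shell) confirms that PyTorch sees the GPU and that fastai imports cleanly:

import torch
print(torch.__version__)
print(torch.cuda.is_available())   # True means PyTorch can see the GPU

from fastai.imports import *       # should import without errors if fastai installed correctly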

Getting started with the Code

You can find my notebook here (Added comments and Explanation) and find the Kaggle dataset here.

Importing the required libraries

%load_ext autoreload
%autoreload 2
%matplotlib inline
  • autoreload : automatically reloads modules before executing code, so changes to imported source files take effect without restarting the kernel.
  • %matplotlib inline : a magic function which sets the backend of matplotlib to the ‘inline’ backend. With this backend, the output of plotting commands is displayed inline within the Jupyter notebook.
from fastai.imports import *
from fastai.structured import *
from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from IPython.display import display
from sklearn import metrics
  • fastai.structured : works with Pandas DataFrames, is not dependent on PyTorch, and can be used separately from the rest of the fastai library to process and work with tabular data.
  • fastai.imports : contains all the basic and necessary libraries.
  • DataFrameSummary : expects a pandas DataFrame to summarise.
  • RandomForestRegressor : a meta-estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting.
  • metrics : includes score functions, performance metrics, pairwise metrics and distance computations.

Now, if you want to know what a particular library or function does, and what code runs behind it, use Jupyter’s introspection. To get the description of a module or function, type

?ImportModuleName (or press Shift+Tab)

If you want to view its source code, type

??ImportModuleName (or press Shift+Tab twice)
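For example, to inspect the random forest class and the fastai helper used later in this lesson (run each in its own notebook cell):

?RandomForestRegressor
??add_datepart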

About the Dataset

We will be looking at the Blue Book for Bulldozers Kaggle Competition: “The goal of the contest is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration. The data is sourced from auction result postings and includes information on usage and equipment configurations.” This is a very common type of dataset and prediction problem, and similar to what you may see in your project or workplace.

Reading the Dataset

Usually, the data we get is in CSV format. We can take a quick look at it using the shell command

!head Data/Train.csv

But this is difficult to read, so we use Pandas. pandas is the most important library when you are working with structured data and is usually imported as pd.

df_raw = pd.read_csv(f'{PATH}Train.csv', low_memory=False, 
parse_dates=["saledate"])
  • parse_dates — parses the listed columns as dates.
  • low_memory=False — by default, pandas processes the file in chunks, which lowers memory use while parsing but can produce mixed-type columns. To ensure no mixed types, either set low_memory=False or specify the types with the dtype parameter.

Evaluation

It’s important to note what metric is being used for a project. Generally, selecting the metric(s) is an important part of the project setup. However, in this case, Kaggle tells us what metric to use: RMSLE (root mean squared log error) between the actual and predicted auction prices. Therefore we take the log of the prices so that RMSE will give us what we need.

df_raw.SalePrice = np.log(df_raw.SalePrice)
  • np — NumPy lets us treat arrays, matrices, vectors and high-dimensional tensors as if they were Python variables.
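As a small illustration (not from the notebook, and the helper names here are only indicative): the root mean squared log error of the raw prices is exactly the root mean squared error of the log prices, which is why taking the log up front lets us optimise plain RMSE from now on.

def rmsle(pred, actual): return np.sqrt(np.mean((np.log(pred) - np.log(actual))**2))
def rmse_(pred, actual): return np.sqrt(np.mean((pred - actual)**2))

# rmsle(preds, prices) gives the same value as rmse_(np.log(preds), np.log(prices))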

Preprocessing

This dataset contains a mix of continuous and categorical variables.

  • continuous — numbers whose meaning is numeric, such as price.
  • categorical — either numbers whose meaning is not continuous, like zip code, or strings such as “large”, “medium”, “small”
add_datepart(df_raw, 'saledate')

The method extracts particular date fields from a complete datetime for the purpose of constructing categoricals. You should always consider this feature extraction step when working with date-time. Without expanding your date-time into these additional fields, you can’t capture any trend/cyclical behaviour as a function of time at any of these granularities.
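A rough pandas equivalent of what add_datepart does, as an illustrative sketch (applied to a dataframe that still has the saledate column; the real fastai function adds more fields such as week, day of year and start/end-of-period flags, and the column names below are only indicative):

tmp = df_raw.copy()                     # hypothetical copy taken before add_datepart ran
tmp['saleYear'] = tmp.saledate.dt.year
tmp['saleMonth'] = tmp.saledate.dt.month
tmp['saleDay'] = tmp.saledate.dt.day
tmp['saleDayofweek'] = tmp.saledate.dt.dayofweek
tmp['saleElapsed'] = (tmp.saledate - pd.Timestamp('1970-01-01')).dt.total_seconds()
tmp = tmp.drop('saledate', axis=1)      # the original datetime column is no longer needed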

df_raw.head()

This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it.

train_cats(df_raw)
df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True, inplace=True)

Pandas has a concept of a category data type, but by default it would not turn anything into a category for you. Fast.ai provides a function called train_cats which creates categorical variables for everything that is a String. Behind the scenes, it creates a column that is an integer and it is going to store a mapping from the integers to the strings. train_cats is called “train” because it is training data specific. It is important that validation and test sets will use the same category mappings (in other words, if you used 1 for “high” for a training dataset, then 1 should also be for “high” in validation and test datasets). For validation and test dataset, use apply_cats instead.

  • df_raw.UsageBand.cat — the .cat accessor gives you access to category-specific attributes and methods, assuming the column is a category.

The order does not matter too much, but since we are going to be creating a decision tree that splits things at a single point (i.e. High vs. Low and Medium, or High and Low vs. Medium), it would be a little bit weird if the categories were not in a sensible order.

  • inplace will ask Pandas to change the existing dataframe rather than returning a new one.

There is a kind of categorical variable called “ordinal”. An ordinal categorical variable has some kind of order (e.g. “Low” < “Medium” < “High”). Random forests are not terribly sensitive to that fact, but it is worth noting.
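A rough sketch of what train_cats does behind the scenes (illustrative only, not the fastai source): every string column is turned into an ordered pandas category, and the string-to-integer mapping then lives on the .cat accessor.

# turn every string column into an (ordered) pandas category
for name, col in df_raw.items():
    if col.dtype == 'object':
        df_raw[name] = col.astype('category').cat.as_ordered()

df_raw.UsageBand.cat.categories    # the category labels
df_raw.UsageBand.cat.codes.head()  # the integer codes actually stored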

df_raw.dtypes

This gives you a summary of all the datatypes in the dataframe, which is particularly useful for checking the categorical variables.

Handling empty/null values is one of the most important parts of preprocessing.

display_all(df_raw.isnull().sum().sort_index()/len(df_raw))

The above adds up the number of null values in each column, sorts them by index (pandas.Series.sort_index) and divides by the number of rows in the dataset, giving the fraction of missing values per column.
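display_all is a small helper defined in the notebook; it looks roughly like this, widening pandas’ display limits so nothing is truncated:

def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
        display(df)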

Saving the work

Reading the CSV took about 10 seconds and processing took another 10, so if we do not want to wait again, it is a good idea to save our work. Here we will save it in feather format, which writes the data to disk in essentially the same format it has in RAM. This is by far the fastest way to save something, and also to read it back. The feather format is becoming a standard not only in Pandas but also in Java, Apache Spark, etc.

os.makedirs('tmp', exist_ok=True)
df_raw.to_feather('tmp/bulldozers-raw')

We can read it back like so:

df_raw = pd.read_feather('tmp/bulldozers-raw')

Fixing missing values and categorical variable

proc_df makes a copy of the dataframe, separates out the dependent variable and then fixes the missing values and categorical variables. Once we execute it, df has all its columns in numeric form and y holds the SalePrice.

df, y, nas = proc_df(df_raw, 'SalePrice')

The proc_df function inside fastai.structured does the following:

  • Grabs a copy of the dataframe.
  • Grabs the dependent column and drops it from the dataframe.
  • Fixes missing values:
    - Numeric columns: if a column has missing values, create a new boolean column named Col_na marking where the values were missing, then replace the missing values with the column’s median.
    - Non-numeric (categorical) columns: replace the column with its category codes plus 1, so that missing values (coded as -1) become 0.
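A simplified sketch of those steps (simple_proc_df is a hypothetical name, and this is illustrative only; the real proc_df in fastai.structured handles more cases), assuming train_cats has already converted the string columns to categories:

def simple_proc_df(df, y_fld):
    # illustrative stand-in for fastai's proc_df, not the actual source
    df = df.copy()
    y = df[y_fld].values
    df.drop(y_fld, axis=1, inplace=True)
    nas = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            if df[col].isnull().any():
                df[col + '_na'] = df[col].isnull()   # boolean "was missing" column
                nas[col] = df[col].median()
                df[col] = df[col].fillna(nas[col])   # fill missing with the median
        else:
            df[col] = df[col].cat.codes + 1          # missing (-1) becomes 0
    return df, y, nas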

What is a random forest?

Random forest is a universal machine learning technique.

  • It can predict something of any kind — it could be a category (classification) or a continuous variable (regression).
  • It can predict with columns of any kind — pixels, zip codes, revenues, etc (i.e. both structured and unstructured data).
  • It does not generally overfit too badly, and it is very easy to stop it from overfitting.
  • You do not need a separate validation set in general. It can tell you how well it generalizes even if you only have one dataset.
  • It has few, if any, statistical assumptions. It does not assume that your data is normally distributed, the relationship is linear, or you have specified interactions.
  • It requires very few pieces of feature engineering. For many different types of situation, you do not have to take the log of the data or multiply interactions together.

Running Regressor

scikit-learn

The most popular and important package for machine learning in Python. It is not the best at everything (e.g. XGBoost is better than its gradient boosting trees), but it is pretty good at nearly everything. RandomForestRegressor is imported from the sklearn.ensemble module of scikit-learn.

m = RandomForestRegressor(n_jobs=-1)
m.fit(df, y)
m.score(df,y)
  • n_jobs=-1 runs the tree building on all available CPU cores, one process per core; with n_jobs=1 only a single core is used. On a 4-core Linux machine the CPU usage looks roughly like (100%, ~5%, ~5%, ~5%) with n_jobs=1 and (100%, 100%, 100%, 100%) with n_jobs=-1.
  • RandomForestRegressor : creates the model, and the fit method fits it to the data.
  • score : returns the coefficient of determination R² of the prediction.
def rmse(x,y): return math.sqrt(((x-y)**2).mean())

def print_score(m):
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
           m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

Checking Overfitting

Here we are only scoring on the training set, and on the same data the model was trained on. Usually we split train.csv into training, validation and test sets, and then use the model.predict method on the actual test set (test.csv). That will be covered later in the series.

  • We can create a validation set: with the data sorted by date, the most recent 12,000 rows become the validation set.
def split_vals(a,n): return a[:n].copy(), a[n:].copy()

n_valid = 12000  # same as Kaggle's test set size
n_trn = len(df)-n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)
m = RandomForestRegressor(n_jobs=-1)
%time m.fit(X_train, y_train)
print_score(m)
CPU times: user 1min 3s, sys: 356 ms, total: 1min 3s
Wall time: 8.46 s
[0.09044244804386327, 0.2508166961122146,
0.98290459302099709, 0.88765316048270615]

The last row gives us the scores: [training RMSE, validation RMSE, R² for the training set, R² for the validation set].
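One thing to keep in mind (a small illustrative note): because we trained on log(SalePrice), the model’s predictions are in log space, so to get actual dollar prices you would exponentiate them.

log_preds = m.predict(X_valid)
dollar_preds = np.exp(log_preds)   # back to actual sale prices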

Thus we can conclude that without any heavy thinking or intensive feature engineering, and without defining or worrying about any statistical assumptions, we get a decent score and land in Kaggle’s top 25% for this competition.

If you found this article useful and would like to stay in touch, you can find me on Twitter, LinkedIn.

Lessons : 1 , 2, 3, 4, 5, 6, 7, 8, 9, 10, 12
