Source: www.kaggle.com/learn/intro-to-machine-learning


Learn Intro to Machine Learning Tutorials

Learn the core ideas in machine learning, and build your first models.


Step 0: Setup

# Code you have previously used to load data
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'

home_data = pd.read_csv(iowa_file_path)
y = home_data.SalePrice
feature_columns = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[feature_columns]

# Specify Model
iowa_model = DecisionTreeRegressor()
# Fit Model
iowa_model.fit(X, y)

print("First in-sample predictions:", iowa_model.predict(X.head()))
print("Actual target values for those homes:", y.head().tolist())

# Set up code checking
from learntools.core import binder
from learntools.machine_learning.ex4 import *
print("Setup Complete")



Step 1: Split Your Data

from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)

random_state : int, RandomState instance or None, default=None

Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.


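train_test_split shuffles the inputs according to the random_state option and splits each one into two subsets, train and test.

The input can be a single array or several arrays of the same length. Allowed input types are lists, NumPy arrays, SciPy sparse matrices, and pandas DataFrames.


The example code passes X and y, so the return values come back in the order X_train, X_test, y_train, y_test, which the code above names train_X, val_X, train_y, val_y.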
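
The splitting behaviour described above can be sketched on a small made-up DataFrame. The column values here are illustrative assumptions, not the Iowa data. The sketch shows that two inputs produce four outputs, that the default split holds out 25% of the rows, and that reusing the same random_state reproduces the same split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical values, for illustration only)
X = pd.DataFrame({"LotArea": range(100, 120), "YearBuilt": range(1990, 2010)})
y = pd.Series(range(20), name="SalePrice")

# Two inputs (X, y) give four outputs: train/test for X, then train/test for y
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
print(len(train_X), len(val_X))

# Rows of X and y stay aligned after the shuffle
print(train_X.index.equals(train_y.index))

# The same random_state gives an identical split on a second call
again_X, _, _, _ = train_test_split(X, y, random_state=1)
print(train_X.index.equals(again_X.index))
```

With 20 rows and no test_size given, 5 rows go to the validation part and 15 to the training part.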



Step 2: Specify and Fit the Model

# The required library was already imported in Step 0.
# Specify the model
iowa_model = DecisionTreeRegressor(random_state=1)

# Fit iowa_model with the training data.
iowa_model.fit(train_X, train_y)


Step 3: Make Predictions with Validation data

val_predictions = iowa_model.predict(val_X)

predict(X, check_input=True)

Predict class or regression value for X.
For a classification model, the predicted class for each sample in X is returned. For a regression model, the predicted value based on X is returned.


def predict(self, X, check_input=True):
    """Predict class or regression value for X.

    For a classification model, the predicted class for each sample in X is
    returned. For a regression model, the predicted value based on X is
    returned.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples. Internally, it will be converted to
        ``dtype=np.float32`` and if a sparse matrix is provided
        to a sparse ``csr_matrix``.

    check_input : bool, default=True
        Allow to bypass several input checking.
        Don't use this parameter unless you know what you do.

    Returns
    -------
    y : array-like of shape (n_samples,) or (n_samples, n_outputs)
        The predicted classes, or the predict values.
    """
    check_is_fitted(self)
    X = self._validate_X_predict(X, check_input)
    proba = self.tree_.predict(X)
    n_samples = X.shape[0]

    # Classification
    if is_classifier(self):
        if self.n_outputs_ == 1:
            return self.classes_.take(np.argmax(proba, axis=1), axis=0)

        else:
            class_type = self.classes_[0].dtype
            predictions = np.zeros((n_samples, self.n_outputs_),
                                   dtype=class_type)
            for k in range(self.n_outputs_):
                predictions[:, k] = self.classes_[k].take(
                    np.argmax(proba[:, k], axis=1),
                    axis=0)

            return predictions

    # Regression
    else:
        if self.n_outputs_ == 1:
            return proba[:, 0]

        else:
            return proba[:, :, 0]
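

As a small illustration of the classification and regression cases of predict, the sketch below fits one tree of each kind on made-up data. The feature values, labels, and prices are assumptions chosen for illustration. The classifier returns one of the labels seen during fit, while the regressor returns a numeric value per sample.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy inputs (hypothetical values): one feature, four samples
X = np.array([[0.0], [1.0], [2.0], [3.0]])

# Classification: predict returns one of the class labels seen during fit
clf = DecisionTreeClassifier(random_state=1)
clf.fit(X, ["cheap", "cheap", "pricey", "pricey"])
class_pred = clf.predict([[0.2], [2.8]])
print(class_pred)

# Regression: predict returns one numeric value per sample
reg = DecisionTreeRegressor(random_state=1)
reg.fit(X, [10.0, 12.0, 30.0, 34.0])
value_pred = reg.predict([[0.2], [2.8]])
print(value_pred)
```

Both calls return a one-dimensional array with one entry per input row, which matches the single-output case (n_outputs_ == 1).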


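# Print only the first 5 validation predictions
print(val_predictions[:5])

# Print only the first 4 actual prices from the validation data
print(val_y.head(4).tolist())

To select n items, use val_predictions[:n], which returns only the first n items.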
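
A minimal sketch of this slicing, using made-up stand-ins for val_predictions (a NumPy array) and val_y (a pandas Series). All values and index labels are assumptions for illustration. "First n" here means the first n in row order, not the n largest values.

```python
import numpy as np
import pandas as pd

# Stand-ins for val_predictions and val_y; the numbers are made up
predictions = np.array([186500.0, 184000.0, 130000.0, 92000.0, 164500.0, 220000.0])
actual = pd.Series(
    [231500, 179500, 122000, 84500, 142000, 325624],
    index=[258, 267, 288, 649, 1233, 167],  # shuffled labels, as after a split
)

n = 3
first_n_predictions = predictions[:n]  # positional slice: first n elements
first_n_actual = actual.head(n)        # first n rows, whatever their index labels

print(first_n_predictions)
print(first_n_actual.tolist())
```

Because the split shuffles the rows, the Series index no longer starts at 0, so head(n) (or .iloc[:n]) is a safer way to take the first n rows than label-based lookups.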



Step 4: Calculate the Mean Absolute Error in Validation Data

from sklearn.metrics import mean_absolute_error
# The documented argument order is (y_true, y_pred); the MAE value is the same either way
val_mae = mean_absolute_error(val_y, val_predictions)

# Uncomment the following line to see the validation MAE
# print(val_mae)

# Check your answer
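
To make the metric concrete, the sketch below computes the mean absolute error by hand and compares it with mean_absolute_error. The prices are made up for illustration and are not taken from the Iowa data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up prices for illustration
y_true = np.array([200000.0, 150000.0, 320000.0, 95000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0, 100000.0])

# MAE by hand: the average of the absolute errors
manual_mae = np.mean(np.abs(y_true - y_pred))

# The same quantity via scikit-learn, with arguments in (y_true, y_pred) order
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

The absolute errors are 10000, 10000, 20000, and 5000, so both values come out to 11250.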
