Stephen's Blog

Python for Data Science Cheat Sheet with Scikit-Learn

Stephen Cheng

Introduction to Scikit-Learn

Scikit-learn is an open source Python library that implements a range of machine learning, data preprocessing, cross-validation and visualization algorithms using a unified interface.

A typical data science workflow with scikit-learn consists of the following steps:

Loading the data
Splitting the data into training and test sets
Preprocessing the data
Creating your model
Fitting the model
Making predictions
Evaluating your model’s performance
Tuning your model

Here is a basic example that runs through these steps.

>>> from sklearn import neighbors, datasets, preprocessing
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import accuracy_score
>>> iris = datasets.load_iris()
>>> X, y = iris.data[:, :2], iris.target
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> X_train = scaler.transform(X_train)
>>> X_test = scaler.transform(X_test)
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)
>>> knn.fit(X_train, y_train)
>>> y_pred = knn.predict(X_test)
>>> accuracy_score(y_test, y_pred)
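
The scaling and classification steps in the example above can also be chained into a single estimator. The following is a minimal sketch, assuming `make_pipeline` from `sklearn.pipeline` (which the original example does not use); it keeps the same data, split, and hyperparameters.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)

# Chain scaling and classification; fit() learns the scaling from X_train only
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

# score() applies the fitted scaler to X_test before predicting
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Bundling the scaler with the classifier makes it harder to accidentally fit the scaler on test data.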

Loading The Data

Your data needs to be numeric and stored as NumPy arrays or SciPy sparse matrices. Other types that can be converted to numeric arrays, such as pandas DataFrames, are also accepted.

>>> import numpy as np
>>> X = np.random.random((10,5))
>>> y = np.array(['M','M','F','F','M','F','M','M','F','F'])
>>> X[X < 0.7] = 0
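
Since most entries of `X` are set to zero above, the same data can be held as a SciPy sparse matrix. This is a small sketch, assuming `scipy.sparse.csr_matrix` as the sparse format and an arbitrary seed; the original does not show this step.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)   # the seed is arbitrary
X = rng.random((10, 5))
X[X < 0.7] = 0                   # zero out most entries, as in the snippet above

# CSR format stores only the non-zero entries
X_sparse = sparse.csr_matrix(X)
print(X_sparse.shape, X_sparse.nnz)
```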

Training And Test Data

>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
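
By default `train_test_split` holds out 25% of the samples. The sketch below shows two optional parameters, `test_size` and `stratify`, on the iris data; the choice of dataset and of a 20% test set is an assumption made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)   # 150 samples, 50 per class

# Hold out 20% of the data and keep the class proportions in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(X_train.shape, X_test.shape)
print(np.bincount(y_test))          # samples per class in the test set
```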

Preprocessing The Data

Imputing Missing Values

>>> from sklearn.impute import SimpleImputer
>>> imp = SimpleImputer(missing_values=0, strategy='mean')
>>> imp.fit_transform(X_train)
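
To see what mean imputation does, here is a tiny sketch that uses `SimpleImputer` from `sklearn.impute` with `NaN` as the missing-value marker. The array values are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One missing entry in the first column (illustrative values)
X_missing = np.array([[1.0, 2.0],
                      [np.nan, 3.0],
                      [7.0, 6.0]])

# Replace each NaN with the mean of the observed values in its column
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imp.fit_transform(X_missing)
print(X_filled)
```

The missing entry is filled with the mean of 1 and 7, the two observed values in that column.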

Standardization

>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler().fit(X_train)
>>> standardized_X = scaler.transform(X_train)
>>> standardized_X_test = scaler.transform(X_test)
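
As a sanity check, standardization can be reproduced by hand: subtract each column's mean and divide by its standard deviation. A minimal sketch with made-up values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 60.0]])   # illustrative values

scaler = StandardScaler().fit(X_train)
standardized_X = scaler.transform(X_train)

# Same result by hand: z = (x - column mean) / column standard deviation
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(standardized_X, manual))
```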

Normalization

>>> from sklearn.preprocessing import Normalizer
>>> scaler = Normalizer().fit(X_train)
>>> normalized_X = scaler.transform(X_train)
>>> normalized_X_test = scaler.transform(X_test)
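
Unlike standardization, `Normalizer` works row by row: with its default settings it rescales each sample to unit L2 norm. A short sketch with made-up values:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X_small = np.array([[3.0, 4.0],
                    [1.0, 0.0]])   # illustrative values

normalized = Normalizer().fit_transform(X_small)
print(normalized)

# Every row now has Euclidean length 1
row_norms = np.linalg.norm(normalized, axis=1)
print(row_norms)
```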

Binarization

>>> from sklearn.preprocessing import Binarizer
>>> binarizer = Binarizer(threshold=0.0).fit(X)
>>> binary_X = binarizer.transform(X)
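
The threshold decides which values map to 1: only values strictly greater than it do. A small sketch, assuming a threshold of 0.5 and made-up values:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X_small = np.array([[0.2, 0.7],
                    [0.5, 0.9]])   # illustrative values

# Values strictly greater than the threshold become 1, all others 0
binary = Binarizer(threshold=0.5).fit_transform(X_small)
print(binary)
```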

Encoding Categorical Features

>>> from sklearn.preprocessing import LabelEncoder
>>> enc = LabelEncoder()
>>> y = enc.fit_transform(y)
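
The fitted encoder remembers the mapping between labels and integers, so predictions can be turned back into the original labels. A brief sketch using a shortened version of the label list above:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["M", "M", "F", "F", "M"]

enc = LabelEncoder()
y_encoded = enc.fit_transform(labels)
print(enc.classes_)   # the distinct labels, in sorted order
print(y_encoded)      # each label replaced by its index in classes_

# Map the integers back to the original labels
y_decoded = enc.inverse_transform(y_encoded)
print(y_decoded)
```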

Generating Polynomial Features

>>> from sklearn.preprocessing import PolynomialFeatures
>>> poly = PolynomialFeatures(5)
>>> poly.fit_transform(X)
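
The number of generated columns grows quickly with the degree. The sketch below first expands a single two-feature sample to degree 2, where the output is easy to read, then counts the columns produced by degree 5 on five features. The sample values are made up.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Degree 2 on features (a, b) gives the columns 1, a, b, a^2, a*b, b^2
X_small = np.array([[2.0, 3.0]])
expanded = PolynomialFeatures(2).fit_transform(X_small)
print(expanded)

# Degree 5 on five input features, as in the snippet above
n_out = PolynomialFeatures(5).fit_transform(np.zeros((1, 5))).shape[1]
print(n_out)
```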

Create Your Model

Supervised Learning Estimators

Linear Regression

>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()

Support Vector Machines (SVM)

>>> from sklearn.svm import SVC
>>> svc = SVC(kernel='linear')

Naive Bayes

>>> from sklearn.naive_bayes import GaussianNB
>>> gnb = GaussianNB()

KNN

>>> from sklearn import neighbors
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)

Unsupervised Learning Estimators

Principal Component Analysis (PCA)

>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=0.95)
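
A float `n_components` between 0 and 1 asks PCA to keep as many components as are needed to explain at least that fraction of the variance. A minimal sketch on the iris data (the dataset choice is an assumption for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Keep enough components to explain at least 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(pca.n_components_)                     # number of components kept
explained = pca.explained_variance_ratio_.sum()
print(explained)                             # fraction of variance retained
```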

K Means

>>> from sklearn.cluster import KMeans
>>> k_means = KMeans(n_clusters=3, random_state=0)

Model Fitting

Supervised learning

Fit the model to the data.

>>> lr.fit(X, y)
>>> knn.fit(X_train, y_train)
>>> svc.fit(X_train, y_train)

Unsupervised Learning

>>> k_means.fit(X_train)                      # Fit the model to the data
>>> pca_model = pca.fit_transform(X_train)    # Fit to the data, then transform it

Prediction

Supervised Estimators

>>> y_pred = svc.predict(np.random.random((2,5)))   # Predict labels
>>> y_pred = lr.predict(X_test)                     # Predict labels
>>> y_proba = knn.predict_proba(X_test)             # Estimate probability of a label
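
`predict` and `predict_proba` return differently shaped results: one label per sample versus one probability per class per sample. The sketch below illustrates this on the iris data, which is an assumption made here so that the snippet is self-contained.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

y_pred = knn.predict(X_test)          # shape (n_samples,): one label each
y_proba = knn.predict_proba(X_test)   # shape (n_samples, n_classes)

print(y_pred[:3])
print(y_proba[:3])                    # columns follow the order of knn.classes_
```

Because the two outputs are not interchangeable, label-based metrics such as `accuracy_score` need the output of `predict`.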

Unsupervised Estimators

>>> y_pred = k_means.predict(X_test)   # Predict labels in clustering algorithms

Evaluate Your Model’s Performance

Classification Metrics

Accuracy Score

>>> knn.score(X_test, y_test)                    # Estimator score method
>>> from sklearn.metrics import accuracy_score
>>> y_pred = knn.predict(X_test)                 # Class labels, not probabilities
>>> accuracy_score(y_test, y_pred)               # Metric scoring function

Classification Report

Precision, recall, f1-score and support.

>>> from sklearn.metrics import classification_report
>>> print(classification_report(y_test, y_pred))

Confusion Matrix

>>> from sklearn.metrics import confusion_matrix
>>> print(confusion_matrix(y_test, y_pred))
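
To read a confusion matrix: rows are the true classes and columns are the predicted classes, so correct predictions lie on the diagonal. A small sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1]   # illustrative labels
y_pred = [0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)

# Accuracy is the diagonal (correct predictions) divided by the total count
accuracy = cm.trace() / cm.sum()
print(accuracy)
```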

Regression Metrics

Mean Absolute Error

>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [3, -0.5, 2]
>>> mean_absolute_error(y_true, y_pred)

Mean Squared Error

>>> from sklearn.metrics import mean_squared_error
>>> mean_squared_error(y_test, y_pred)

R² Score

>>> from sklearn.metrics import r2_score
>>> r2_score(y_true, y_pred)
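
The three regression metrics above are simple enough to compute by hand, which helps when interpreting them. The sketch below uses plain Python and made-up values (the `y_true` list above extended by one point, plus hypothetical predictions).

```python
y_true = [3.0, -0.5, 2.0, 7.0]   # illustrative targets
y_pred = [2.5, 0.0, 2.0, 8.0]    # hypothetical predictions
n = len(y_true)

# Mean absolute error: average size of the errors
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

# Mean squared error: average squared error, which penalizes large misses more
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
mse = ss_res / n

# R^2: 1 minus residual sum of squares over total sum of squares
mean_true = sum(y_true) / n
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(mae, mse, r2)
```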

Clustering Metrics

Adjusted Rand Index

>>> from sklearn.metrics import adjusted_rand_score
>>> adjusted_rand_score(y_true, y_pred)

Homogeneity

>>> from sklearn.metrics import homogeneity_score
>>> homogeneity_score(y_true, y_pred)

V-measure

>>> from sklearn.metrics import v_measure_score
>>> v_measure_score(y_true, y_pred)
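
Clustering metrics compare groupings, not label values, so renaming the clusters does not change the score. A short sketch with made-up assignments:

```python
from sklearn.metrics import (adjusted_rand_score, homogeneity_score,
                             v_measure_score)

y_true = [0, 0, 1, 1]

# Same grouping with swapped cluster names: still a perfect match
ari = adjusted_rand_score(y_true, [1, 1, 0, 0])

# Each point in its own cluster: every cluster is pure (homogeneous),
# but the classes are split up, so the V-measure drops below 1
over_split = [0, 1, 2, 3]
homogeneity = homogeneity_score(y_true, over_split)
v_measure = v_measure_score(y_true, over_split)

print(ari, homogeneity, v_measure)
```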

Cross-Validation

>>> from sklearn.model_selection import cross_val_score
>>> print(cross_val_score(knn, X_train, y_train, cv=4))
>>> print(cross_val_score(lr, X, y, cv=2))
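
`cross_val_score` returns one score per fold, and these are usually summarized by their mean and spread. A minimal sketch on the iris data (the dataset choice is an assumption for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# One accuracy value per fold; the estimator is refit for every fold
scores = cross_val_score(knn, X, y, cv=4)
print(scores)
print(scores.mean(), scores.std())
```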

Tune Your Model

Grid Search

>>> from sklearn.model_selection import GridSearchCV
>>> params = {"n_neighbors": np.arange(1,3), "metric": ["euclidean", "cityblock"]}
>>> grid = GridSearchCV(estimator=knn, param_grid=params)
>>> grid.fit(X_train, y_train)
>>> print(grid.best_score_)
>>> print(grid.best_estimator_.n_neighbors)
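
After fitting, the search object exposes the winning parameter combination and can itself be used as a model. The sketch below runs on the iris data with a hypothetical parameter grid; both are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hypothetical grid: 3 x 2 = 6 candidate combinations
params = {"n_neighbors": [1, 3, 5], "weights": ["uniform", "distance"]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid=params, cv=4)
grid.fit(X_train, y_train)

print(grid.best_params_)   # best combination found by cross-validation

# With the default refit=True, the best model is retrained on all of X_train
test_accuracy = grid.score(X_test, y_test)
print(test_accuracy)
```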

Randomized Parameter Optimization

>>> from sklearn.model_selection import RandomizedSearchCV
>>> params = {"n_neighbors": range(1,5), "weights": ["uniform", "distance"]}
>>> rsearch = RandomizedSearchCV(estimator=knn,
                                param_distributions=params,
                                cv=4, n_iter=8,random_state=5)
>>> rsearch.fit(X_train, y_train)
>>> print(rsearch.best_score_)
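
Randomized search can also draw values from a distribution instead of a fixed list. The following sketch assumes `scipy.stats.randint` for `n_neighbors` and uses the iris data; neither appears in the original snippet.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# n_neighbors is sampled uniformly from the integers 1..9 (upper bound excluded)
param_distributions = {"n_neighbors": randint(1, 10),
                       "weights": ["uniform", "distance"]}

rsearch = RandomizedSearchCV(KNeighborsClassifier(),
                             param_distributions=param_distributions,
                             n_iter=8, cv=4, random_state=5)
rsearch.fit(X, y)

print(rsearch.best_params_)
print(rsearch.best_score_)
```

Here `n_iter` fixes the number of sampled candidates, so the cost of the search does not depend on the size of the parameter space.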

Data-Science, Machine-Learning, Python, Scikit-Learn, Sklearn — Jun 18, 2019
