Python for Data Science Cheat Sheet with Scikit-Learn 
   
Stephen Cheng 
 
Introduction to Scikit-Learn

Scikit-learn is an open source Python library that implements a range of machine learning, data preprocessing, cross-validation and visualization algorithms using a unified interface.
A typical data science workflow includes the following steps:
Loading the data
Splitting it into training and test data
Preprocessing the data
Creating your model
Fitting the model
Making predictions
Evaluating your model's performance
Tuning your model
 
Here is a basic example for reference.
>>> from sklearn import neighbors, datasets, preprocessing
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import accuracy_score
>>> iris = datasets.load_iris()
>>> # Keep only the first two features
>>> X, y = iris.data[:, :2], iris.target
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)
>>> # Fit the scaler on the training data only, then apply it to both sets
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> X_train = scaler.transform(X_train)
>>> X_test = scaler.transform(X_test)
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)
>>> knn.fit(X_train, y_train)
>>> y_pred = knn.predict(X_test)
>>> accuracy_score(y_test, y_pred)
 
 
Loading The Data

Your data needs to be numeric and stored as NumPy arrays or SciPy sparse matrices. Other types that are convertible to numeric arrays, such as Pandas DataFrame, are also acceptable.
>>> import numpy as np
>>> X = np.random.random((10, 5))
>>> # One label per row of X (10 samples)
>>> y = np.array(['M', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F'])
>>> X[X < 0.7] = 0
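As a rough illustration of the accepted input types, the sketch below fits the same classifier once on a Pandas DataFrame and once on a SciPy sparse matrix. The column names and values are made up for this example and are not part of the original cheat sheet.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: column names and values are invented for illustration.
df = pd.DataFrame({
    "height": [1.0, 1.1, 0.9, 3.0, 3.2, 2.9],
    "weight": [1.0, 0.9, 1.2, 3.1, 2.8, 3.0],
})
labels = np.array([0, 0, 0, 1, 1, 1])

# A DataFrame with numeric columns can be passed directly to fit().
knn_df = KNeighborsClassifier(n_neighbors=3).fit(df, labels)

# The same values stored as a SciPy sparse matrix are accepted as well.
X_sparse = sparse.csr_matrix(df.to_numpy())
knn_sparse = KNeighborsClassifier(n_neighbors=3).fit(X_sparse, labels)

pred_df = knn_df.predict(df)
pred_sparse = knn_sparse.predict(X_sparse)
```

Both models see the same numbers, so they should agree on their predictions; only the container type differs.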
 
 
Training And Test Data

>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
 
 
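Preprocessing The Data

Imputing Missing Values
>>> # Imputer was removed from sklearn.preprocessing in scikit-learn 0.22.
>>> # SimpleImputer replaces it and always imputes column-wise (the old axis=0).
>>> from sklearn.impute import SimpleImputer
>>> imp = SimpleImputer(missing_values=0, strategy='mean')
>>> imp.fit_transform(X_train)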
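To make the effect of mean imputation concrete, here is a minimal sketch on a tiny array. It assumes a recent scikit-learn release, where the imputer is `SimpleImputer` in `sklearn.impute`, and it marks missing entries with `np.nan` instead of `0`; the array values are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative data: np.nan marks the missing entries.
X_missing = np.array([[1.0, 2.0],
                      [np.nan, 4.0],
                      [3.0, np.nan]])

# Each missing value is replaced by the mean of the observed values in its column.
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imp.fit_transform(X_missing)
# Column 0: mean of (1, 3) = 2; column 1: mean of (2, 4) = 3
```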
 
Standardization
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler().fit(X_train)
>>> standardized_X = scaler.transform(X_train)
>>> standardized_X_test = scaler.transform(X_test)
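A quick way to see what standardization does is to check the column statistics after the transform. The sketch below uses a small made-up array; after scaling, each column should have mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: two columns on very different scales.
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0],
                   [4.0, 40.0]])

scaler = StandardScaler().fit(X_demo)
X_std = scaler.transform(X_demo)

col_means = X_std.mean(axis=0)  # approximately 0 for every column
col_stds = X_std.std(axis=0)    # approximately 1 for every column
```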
 
Normalization
>>> from sklearn.preprocessing import Normalizer
>>> scaler = Normalizer().fit(X_train)
>>> normalized_X = scaler.transform(X_train)
>>> normalized_X_test = scaler.transform(X_test)
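Unlike standardization, which works per column, `Normalizer` rescales each row (sample) to unit length, using the L2 norm by default. A minimal sketch with invented values:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Illustrative data: the first row has length 5 (a 3-4-5 triangle).
X_demo = np.array([[3.0, 4.0],
                   [1.0, 0.0]])

X_norm = Normalizer().fit_transform(X_demo)
row_lengths = np.linalg.norm(X_norm, axis=1)  # every row now has length 1
```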
 
Binarization
>>> from sklearn.preprocessing import Binarizer
>>> binarizer = Binarizer(threshold=0.0).fit(X)
>>> binary_X = binarizer.transform(X)
 
Encoding Categorical Features
>>> from sklearn.preprocessing import LabelEncoder
>>> enc = LabelEncoder()
>>> y = enc.fit_transform(y)
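`LabelEncoder` maps each distinct label to an integer and is intended for the target `y`. For categorical columns inside `X`, `OneHotEncoder` is the usual alternative. The sketch below shows both on made-up values; the `colors` column is hypothetical and not part of the original example.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Target labels: classes are sorted, so 'F' -> 0 and 'M' -> 1.
labels = np.array(["M", "F", "F", "M"])
encoded = LabelEncoder().fit_transform(labels)

# Hypothetical categorical feature column (2-D, one column).
colors = np.array([["red"], ["green"], ["red"], ["blue"]])
# One output column per category, in sorted order: blue, green, red.
onehot = OneHotEncoder().fit_transform(colors).toarray()
```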
 
Generating Polynomial Features
>>> from sklearn.preprocessing import PolynomialFeatures
>>> poly = PolynomialFeatures(5)
>>> poly.fit_transform(X)
 
 
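Create Your Model

Supervised Learning Estimators

Linear Regression
>>> from sklearn.linear_model import LinearRegression
>>> # The normalize parameter was removed in scikit-learn 1.2;
>>> # scale the features beforehand (e.g. with StandardScaler) if needed.
>>> lr = LinearRegression()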
 
Support Vector Machines (SVM) 
>>> from sklearn.svm import SVC
>>> svc = SVC(kernel='linear')
 
Naive Bayes 
>>> from sklearn.naive_bayes import GaussianNB
>>> gnb = GaussianNB()
 
KNN 
>>> from sklearn import neighbors
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)
 
Unsupervised Learning Estimators

Principal Component Analysis (PCA)
>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=0.95)
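When `n_components` is a float between 0 and 1, PCA keeps as many components as are needed for the explained variance to exceed that fraction. The sketch below illustrates this on synthetic data; the data-generating setup (two latent directions mixed into five features plus a little noise) is an assumption made purely for the demonstration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Hypothetical data: 5 observed features driven by only 2 latent directions.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

# Keep enough components to explain more than 95% of the variance.
pca = PCA(n_components=0.95).fit(X_demo)

kept = pca.n_components_                          # number of components retained
explained = pca.explained_variance_ratio_.sum()   # fraction of variance they explain
```

Because almost all of the variance lies in two directions, at most two of the five possible components are retained here.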
 
K Means 
>>> from sklearn.cluster import KMeans
>>> k_means = KMeans(n_clusters=3, random_state=0)
 
 
Model Fitting

Supervised Learning
>>> # Fit the model to the data
>>> lr.fit(X, y)
>>> knn.fit(X_train, y_train)
>>> svc.fit(X_train, y_train)
 
Unsupervised Learning
>>> k_means.fit(X_train)                     # Fit the model to the data
>>> pca_model = pca.fit_transform(X_train)   # Fit to data, then transform it
 
 
Prediction

Supervised Estimators
>>> y_pred = svc.predict(np.random.random((2, 5)))   # Predict labels
>>> y_pred = lr.predict(X_test)                      # Predict labels
>>> y_proba = knn.predict_proba(X_test)              # Estimate probability of a label
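The difference between `predict` and `predict_proba` is easiest to see from the shapes of their outputs. The following sketch uses the iris data and the KNN settings from the basic example; it is meant as an illustration rather than a tuned model.

```python
import numpy as np
from sklearn import datasets, neighbors
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=33)

knn = neighbors.KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

labels = knn.predict(X_test)        # shape (n_samples,): one class label per row
proba = knn.predict_proba(X_test)   # shape (n_samples, n_classes): one column per class

# Columns follow the order of knn.classes_; each row sums to 1.
row_sums = proba.sum(axis=1)
# Picking the highest-probability column per row yields a class label
# (argmax resolves ties in favour of the first class).
labels_from_proba = knn.classes_[proba.argmax(axis=1)]
```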
 
Unsupervised Estimators
>>> y_pred = k_means.predict(X_test)   # Predict labels in clustering algos
 
 
Classification Metrics

Accuracy Score
>>> knn.score(X_test, y_test)        # Estimator score method
>>> from sklearn.metrics import accuracy_score
>>> y_pred = knn.predict(X_test)     # Class labels predicted by the classifier
>>> accuracy_score(y_test, y_pred)   # Metric scoring function
 
Classification Report 
Precision, recall, f1-score and support.
>>> from sklearn.metrics import classification_report
>>> print(classification_report(y_test, y_pred))
 
Confusion Matrix 
>>> from sklearn.metrics import confusion_matrix
>>> print(confusion_matrix(y_test, y_pred))
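Reading a confusion matrix is easier with a small worked case. In the sketch below, the true and predicted labels are invented for illustration; rows correspond to true classes and columns to predicted classes, in the order given by `labels`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative labels only, not output from any model in this post.
y_true_demo = ["F", "F", "M", "M", "M"]
y_pred_demo = ["F", "M", "M", "M", "F"]

cm = confusion_matrix(y_true_demo, y_pred_demo, labels=["F", "M"])
# cm[0] = [1, 1]: of the two true 'F', one was predicted 'F' and one 'M'
# cm[1] = [1, 2]: of the three true 'M', one was predicted 'F' and two 'M'

# Accuracy is the diagonal (correct predictions) over the total: 3 / 5.
acc = accuracy_score(y_true_demo, y_pred_demo)
```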
 
Regression Metrics

Mean Absolute Error
>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [3, -0.5, 2]
>>> y_pred = [2.5, 0.0, 2]   # Illustrative predictions, one per entry of y_true
>>> mean_absolute_error(y_true, y_pred)
 
Mean Squared Error 
>>> from sklearn.metrics import mean_squared_error
>>> # y_true and y_pred must be numeric regression targets of equal length
>>> mean_squared_error(y_true, y_pred)
 
R² Score 
>>> from sklearn.metrics import r2_score
>>> r2_score(y_true, y_pred)
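The three regression metrics above can be reproduced by hand, which helps when interpreting them. The sketch below computes each one with NumPy and compares it to the scikit-learn function; the target and prediction values are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative values only, not results from any model in this post.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred

mae_manual = np.abs(errors).mean()    # mean of |error|
mse_manual = (errors ** 2).mean()     # mean of error^2

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = (errors ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
r2_manual = 1.0 - ss_res / ss_tot

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
```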
 
Clustering Metrics

Adjusted Rand Index
>>> from sklearn.metrics import adjusted_rand_score
>>> adjusted_rand_score(y_true, y_pred)
 
Homogeneity 
>>> from sklearn.metrics import homogeneity_score
>>> homogeneity_score(y_true, y_pred)
 
V-measure 
>>> from sklearn.metrics import v_measure_score
>>> v_measure_score(y_true, y_pred)
 
Cross-Validation
>>> # sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
>>> from sklearn.model_selection import cross_val_score
>>> print(cross_val_score(knn, X_train, y_train, cv=4))
>>> print(cross_val_score(lr, X, y, cv=2))
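`cross_val_score` returns one score per fold, so `cv=4` yields an array of four values that are typically averaged. A minimal sketch on the iris data, assuming the import from `sklearn.model_selection` used in current scikit-learn releases:

```python
from sklearn import datasets, neighbors
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
knn = neighbors.KNeighborsClassifier(n_neighbors=5)

# cv=4 splits the data into 4 folds; each fold is held out once for scoring.
scores = cross_val_score(knn, iris.data, iris.target, cv=4)
mean_score = scores.mean()
```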
 
Tune Your Model

Grid Search
>>> # sklearn.grid_search was removed in scikit-learn 0.20; use model_selection
>>> from sklearn.model_selection import GridSearchCV
>>> params = {"n_neighbors": np.arange(1, 3), "metric": ["euclidean", "cityblock"]}
>>> grid = GridSearchCV(estimator=knn, param_grid=params)
>>> grid.fit(X_train, y_train)
>>> print(grid.best_score_)
>>> print(grid.best_estimator_.n_neighbors)
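Grid search tries every combination in the parameter grid and scores each one with cross-validation. The sketch below runs a small grid on the iris data and reads back the winning combination; the choice of dataset, `cv=4`, and the `manhattan` metric are assumptions made for this illustration.

```python
import numpy as np
from sklearn import datasets, neighbors
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
knn = neighbors.KNeighborsClassifier()

# 2 values of n_neighbors x 2 metrics = 4 candidate settings.
params = {"n_neighbors": np.arange(1, 3), "metric": ["euclidean", "manhattan"]}
grid = GridSearchCV(estimator=knn, param_grid=params, cv=4)
grid.fit(iris.data, iris.target)

best_params = grid.best_params_                  # dict with the best combination
best_score = grid.best_score_                    # its mean cross-validated score
n_candidates = len(grid.cv_results_["params"])   # number of combinations tried
```

`best_params_` gives all tuned settings at once, which is often more convenient than reading attributes off `best_estimator_` one by one.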
 
Randomized Parameter Optimization
>>> # sklearn.grid_search was removed in scikit-learn 0.20; use model_selection
>>> from sklearn.model_selection import RandomizedSearchCV
>>> params = {"n_neighbors": range(1, 5), "weights": ["uniform", "distance"]}
>>> rsearch = RandomizedSearchCV(estimator=knn,
...                              param_distributions=params,
...                              cv=4, n_iter=8, random_state=5)
>>> rsearch.fit(X_train, y_train)
>>> print(rsearch.best_score_)
 
Tags: Data-Science, Machine-Learning, Python, Scikit-Learn, Sklearn (Jun 18, 2019)