scikit-learn

scikit-learn (imported as sklearn) is a Python library implementing machine-learning algorithms through a uniform interface: regression, classification, clustering, dimensionality reduction, model selection, and evaluation metrics. It’s the default machine-learning toolkit for Python and the right place to start for any task that isn’t deep learning.

The library is organized into submodules by what kind of work each does:

sklearn.preprocessing — normalization, scaling, encoding (StandardScaler, MinMaxScaler, OneHotEncoder).
sklearn.decomposition — dimensionality reduction (PCA, factor analysis).
sklearn.manifold — non-linear dimensionality reduction (t-SNE, Isomap). UMAP is not part of scikit-learn; it comes from the external umap-learn package, which follows the same interface.
sklearn.linear_model — linear regression, logistic regression, regularized variants.
sklearn.tree, sklearn.ensemble — decision trees, random forests, gradient boosting.
sklearn.svm — support vector machines.
sklearn.neighbors — k-nearest-neighbors classifiers and regressors.
sklearn.model_selection — train/test split, cross-validation, hyperparameter search.
sklearn.metrics — accuracy, precision, recall, ROC curve, AUC, confusion matrix.
sklearn.pipeline — compose preprocessing and modelling steps into a single object.
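
A minimal sketch of how a few of these submodules combine in one script. The bundled iris dataset (sklearn.datasets.load_iris) and the specific settings (two PCA components, five neighbours) are assumptions chosen for illustration, not something these notes prescribe:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# sklearn.decomposition: learn a 2-D projection from the training rows only.
pca = PCA(n_components=2).fit(X_train)

# sklearn.neighbors: classify in the reduced space.
knn = KNeighborsClassifier(n_neighbors=5).fit(pca.transform(X_train), y_train)
y_pred = knn.predict(pca.transform(X_test))

# sklearn.metrics: score the predictions against the held-out labels.
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print(acc)
print(cm)
```

Every step follows the same pattern: import a class from the relevant submodule, fit it on training data, then apply it to held-out data.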

The interface is the same across estimators. Almost every estimator is built on the same three methods:

.fit(X, y) — learn parameters from training data. Preprocessors usually accept just .fit(X) (the target y is ignored).
.transform(X) (for preprocessors) or .predict(X) (for models) — apply the learned transformation or make predictions on new data.
.fit_transform(X) — fit, then transform, in one call (preprocessors).

The rule: never call any flavour of .fit on test data. That includes .fit, .fit_transform, and re-fitting a fresh estimator on the test set. Fit only on training data, then apply the fitted object to test data via .transform or .predict. See Data leakage.
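
A small sketch of the fit-on-train, transform-on-test pattern with StandardScaler. The synthetic arrays are made up for illustration; only the method calls matter:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 3 features with mean ~5 and std ~2.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

scaler = StandardScaler()

# Training data: learn the per-feature mean and std, then apply them.
X_train_scaled = scaler.fit_transform(X_train)

# Test data: only apply the statistics learned above. No .fit here.
X_test_scaled = scaler.transform(X_test)

# Learned parameters are stored in attributes with a trailing underscore.
print(scaler.mean_)
print(scaler.scale_)
```

The scaled training columns have mean 0 and standard deviation 1 by construction. The scaled test columns generally do not, and that is expected: they are measured against the training statistics.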

A complete pipeline looks like:

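from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X (feature matrix) and y (labels) are assumed to be loaded already.
# Hold out 30% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Scale the features, then fit a logistic regression on the scaled values.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
clf.fit(X_train, y_train)     # fits scaler and classifier on training data only
y_pred = clf.predict(X_test)  # scales X_test with training statistics, then predicts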

The make_pipeline(...) call chains preprocessing and modelling into a single object that runs the steps in order. On .fit(), it fits the scaler on the training data, transforms the training data, and fits the classifier on the transformed data. On .predict(), it transforms new data through the already-fitted scaler and passes the result to the classifier. Because the scaler only ever learns from the data passed to .fit(), this is the standard way to avoid data leakage in scikit-learn workflows.
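
To make the order of operations concrete, here is a sketch comparing the pipeline with the same steps written out by hand, followed by cross-validation on the pipeline. The synthetic dataset from make_classification and the choice of five folds are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Pipeline version: one object, one fit call.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
clf.fit(X_train, y_train)
pipe_pred = clf.predict(X_test)

# Manual version: the same steps, spelled out.
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=10000).fit(scaler.transform(X_train), y_train)
manual_pred = model.predict(scaler.transform(X_test))

# Both routes should give identical predictions.
same = np.array_equal(pipe_pred, manual_pred)
print(same)

# Cross-validation clones and refits the whole pipeline per fold,
# so each fold's scaler only sees that fold's training rows.
scores = cross_val_score(clf, X_train, y_train, cv=5)
print(scores.mean())
```

The cross-validation call is where the pipeline earns its keep: scaling X_train once up front and then cross-validating the bare classifier would let every fold's validation rows influence the scaler.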

Deep learning (large neural networks, transformers, convnets) lives in PyTorch and TensorFlow. For everything else (feature engineering, classical models, evaluation pipelines), scikit-learn is where the code goes.

Idriss Rami — Notes
