scikit-learn (imported as sklearn) is a Python library implementing a wide range of machine-learning algorithms — regression, classification, clustering, dimensionality reduction, model selection, evaluation metrics — through a uniform interface. It’s the default machine-learning toolkit for Python and, for any task that isn’t deep learning, the right place to start.

The library is organized into submodules by what kind of work each does:

  • sklearn.preprocessing — normalization, scaling, encoding (StandardScaler, MinMaxScaler, OneHotEncoder).
  • sklearn.decomposition — dimensionality reduction (PCA, factor analysis).
  • sklearn.manifold — non-linear dimensionality reduction (t-SNE, Isomap). UMAP is not part of scikit-learn; it comes from the external umap-learn package, which follows the same interface.
  • sklearn.linear_model — linear regression, logistic regression, regularized variants.
  • sklearn.tree, sklearn.ensemble — decision trees, random forests, gradient boosting.
  • sklearn.svm — support vector machines.
  • sklearn.neighbors — k-nearest-neighbors classifiers and regressors.
  • sklearn.model_selection — train/test split, cross-validation, hyperparameter search.
  • sklearn.metrics — accuracy, precision, recall, ROC curve, AUC, confusion matrix.
  • sklearn.pipeline — compose preprocessing and modelling steps into a single object.
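
As a small illustration of how one of these submodules is used, the sketch below calls a few functions from sklearn.metrics. The label lists are hypothetical values made up for the example, not data from any real model.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical true labels and predictions, invented for illustration.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)    # fraction of correct predictions
prec = precision_score(y_true, y_pred)  # TP / (TP + FP) for the positive class
rec = recall_score(y_true, y_pred)      # TP / (TP + FN) for the positive class
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class

print(acc, prec, rec)
print(cm)
```

Every metric takes the true labels first and the predictions second, so swapping one metric for another is usually a one-line change.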

The interface is uniform across estimators. Almost every estimator exposes the same small set of methods:

  • .fit(X, y) — learn parameters from training data. Preprocessors usually accept just .fit(X) (the target y is ignored).
  • .transform(X) (for preprocessors) or .predict(X) (for models) — apply the learned transformation or make predictions on new data.
  • .fit_transform(X) — fit then transform in one call (preprocessors).

The general rule: never call any flavour of .fit on test data. That includes .fit, .fit_transform, and re-fitting a fresh estimator on the test set. Fit only on training data; apply the fitted object to test data via .transform or .predict. See Data leakage.
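
The fit/transform split can be sketched with a single preprocessor. The toy arrays below are hypothetical; the point is only that the scaler's statistics come from the training data and are then reused, unchanged, on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one feature, three training rows, one test row.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training
# data and returns the scaled training data in one call.
X_train_scaled = scaler.fit_transform(X_train)

# transform reuses the training statistics; nothing is re-learned here.
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_)    # mean learned from X_train only
print(X_test_scaled)   # test row expressed in training-set units
```

Calling scaler.fit_transform(X_test) instead would silently replace the training statistics with test statistics, which is exactly the mistake the rule above guards against.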

A complete pipeline looks like:

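from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Any feature matrix X and label vector y will do; a built-in dataset
# is used here as a stand-in.
X, y = load_breast_cancer(return_X_y=True)

# Hold out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Scaling and classification are fitted together, on training data only.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)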

The make_pipeline(...) call chains preprocessing and modelling into a single object that handles them in order — fits the scaler on training data, transforms training data, fits the classifier on the transformed data; on .predict(), transforms new data through the fitted scaler and runs it through the classifier. This is the standard way to avoid data leakage in scikit-learn workflows.
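
One place this matters in practice is cross-validation. The sketch below assumes a synthetic dataset from make_classification (chosen only so the example is self-contained) and passes the whole pipeline to cross_val_score, so the scaler is re-fitted on the training portion of every fold rather than on the full dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; any X and y of matching length would work.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))

# Each of the 5 folds clones the pipeline and fits scaler + classifier
# on that fold's training rows only.
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())

# After fitting, individual steps are reachable by their lowercased class name.
clf.fit(X, y)
scaler = clf.named_steps["standardscaler"]
print(scaler.mean_)
```

Scaling X once up front and then cross-validating only the classifier would let every fold see statistics computed from its own held-out rows; keeping the scaler inside the pipeline avoids that.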

scikit-learn is the de facto starting point for tabular machine-learning tasks. Deep learning (large neural networks, transformers, convnets) lives in PyTorch and TensorFlow. For everything else — feature engineering, classical models, evaluation pipelines — scikit-learn is where the code goes.