scikit-learn (imported as sklearn) is a Python library that implements a wide range of machine-learning algorithms (regression, classification, clustering, dimensionality reduction) together with the tooling around them (model selection, evaluation metrics), all through a uniform interface. It’s the default machine-learning toolkit for Python and, for any task that isn’t deep learning, the right place to start.
The library is organized into submodules by what kind of work each does:
- sklearn.preprocessing: normalization, scaling, encoding (StandardScaler, MinMaxScaler, OneHotEncoder).
- sklearn.decomposition: dimensionality reduction (PCA, factor analysis).
- sklearn.manifold: non-linear dimensionality reduction (t-SNE, Isomap). UMAP is not part of scikit-learn itself; it is available through external packages.
- sklearn.linear_model: linear regression, logistic regression, regularized variants.
- sklearn.tree, sklearn.ensemble: decision trees, random forests, gradient boosting.
- sklearn.svm: support vector machines.
- sklearn.neighbors: k-nearest-neighbors classifiers and regressors.
- sklearn.model_selection: train/test split, cross-validation, hyperparameter search.
- sklearn.metrics: accuracy, precision, recall, ROC curve, AUC, confusion matrix.
- sklearn.pipeline: compose preprocessing and modelling steps into a single object.
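To make the module layout concrete, here is a minimal sketch that pulls one classifier from each of several submodules listed above and scores them all the same way. The iris dataset (sklearn.datasets.load_iris), the models dictionary, and the particular hyperparameters are assumptions chosen for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Illustration data (an assumption): the small iris dataset bundled with scikit-learn.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# One classifier per submodule; keys name the submodule each one comes from.
models = {
    "linear_model": LogisticRegression(max_iter=10000),
    "tree": DecisionTreeClassifier(random_state=0),
    "ensemble": RandomForestClassifier(n_estimators=50, random_state=0),
    "neighbors": KNeighborsClassifier(n_neighbors=5),
    "svm": SVC(),
}

# The same two calls, fit and predict, work for every model.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.3f}")
```

Swapping one model for another changes only the constructor line; the training and scoring code stays the same.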
The interface is uniform across estimators. Almost every estimator exposes the same small set of methods:
- .fit(X, y): learn parameters from training data. Preprocessors usually accept just .fit(X) (the target y is ignored).
- .transform(X) (for preprocessors) or .predict(X) (for models): apply the learned transformation or make predictions on new data.
- .fit_transform(X): fit, then transform, in one call (preprocessors).

The general rule: never call any flavour of .fit on test data. That includes .fit, .fit_transform, and re-fitting a fresh estimator on the test set. Fit only on training data; apply the fitted object to test data via .transform or .predict. See Data leakage.
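The fit-on-train, transform-on-test rule can be sketched with a single preprocessor. The synthetic arrays and the variable names below are assumptions for illustration; the point is only that the scaler's statistics come from the training set and are reused unchanged on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic illustration data (an assumption, not from any real dataset).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

scaler = StandardScaler()
# fit_transform learns the per-column mean and scale from the training data only.
X_train_scaled = scaler.fit_transform(X_train)
# transform reuses those training statistics; nothing is re-fitted on the test set.
X_test_scaled = scaler.transform(X_test)

# The stored means are exactly the training-set column means.
train_means_match = np.allclose(scaler.mean_, X_train.mean(axis=0))
print(train_means_match)
```

Calling scaler.fit_transform(X_test) instead would silently replace the training statistics with test-set ones, which is the leakage pattern the rule guards against.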
A complete pipeline looks like:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
The make_pipeline(...) call chains preprocessing and modelling into a single object that runs them in order. On .fit(), it fits the scaler on the training data, transforms the training data, and fits the classifier on the transformed data. On .predict(), it passes new data through the already-fitted scaler and then through the classifier. This is the standard way to avoid data leakage in scikit-learn workflows.
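The same pipeline object can be handed to the cross-validation helpers in sklearn.model_selection. The sketch below assumes synthetic data from sklearn.datasets.make_classification and a 5-fold split, both chosen only for illustration. Because cross_val_score fits a fresh clone of the whole pipeline on each fold, the scaler only ever sees that fold's training portion.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic illustration data (an assumption): 300 samples, 10 features, 2 classes.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))

# Each of the 5 folds re-fits scaler + classifier on the fold's training part
# and scores on the held-out part; the default score for classifiers is accuracy.
scores = cross_val_score(clf, X, y, cv=5)
print(f"mean accuracy over {len(scores)} folds: {scores.mean():.3f}")
```

Scaling X once up front and then cross-validating only the classifier would let each fold's held-out rows influence the scaler, so passing the full pipeline is the safer pattern.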
scikit-learn is the de facto starting point for tabular machine-learning tasks. Deep learning (large neural networks, transformers, convnets) lives in PyTorch and TensorFlow. For everything else — feature engineering, classical models, evaluation pipelines — scikit-learn is where the code goes.