Enhancing Productivity in ML Projects with Powerful Pipelines in Python and scikit-learn

Harshit Lohani
6 min read · Jun 21, 2024

A Beginner’s Guide to Building Custom Transformers and Integrating Them into Your Pipelines

Table Of Contents

  1. What are scikit-learn pipelines?
  2. Why should we use scikit-learn pipelines?
  3. Implementing a simple pipeline using scikit-learn
  4. Implementing a pipeline that handles numerical and categorical data
  5. Conclusion

What are scikit-learn Pipelines?

Scikit-learn pipelines are specific implementations of ML pipelines. They are a sequence of transformers (for pre-processing steps) and an estimator (for the model). Each step is defined using scikit-learn’s API, ensuring compatibility and ease of integration.

Why should we use scikit-learn pipelines?

Having an organized workflow is important for any project and makes developers more productive. Using scikit-learn pipelines offers several advantages that help streamline and enhance the machine learning workflow. The following are some of the reasons for using scikit-learn pipelines:

  1. Streamlined Workflow: Pipelines help create a seamless, end-to-end workflow that chains together multiple steps in a machine learning process, such as data preprocessing, feature extraction, and model training. This eliminates the need to manually handle intermediate data transformations.
  2. Code Cleanliness and Readability: By encapsulating the entire process within a pipeline, the code becomes cleaner, more readable, and easier to maintain.
  3. Consistency: Pipelines ensure that the same transformations are applied consistently to both training and testing datasets. This is crucial for maintaining the integrity of the machine learning model’s performance.
  4. Reusability: Once a pipeline is created, it can be reused with different datasets or as part of a different workflow without modification.

In short, using scikit-learn pipelines brings multiple benefits, including streamlined workflows, consistency in data transformation, code cleanliness, and reusability. This makes the machine learning process more efficient, reliable, and easier to manage.

Implementing a simple pipeline with scikit-learn

First, let's start by creating a simple, basic pipeline using scikit-learn.

Let’s say we have a dataset consisting entirely of numerical values, and we need to handle missing values, apply a standard scaler to normalize the distribution (so that each feature has a mean of 0 and a variance of 1), and then apply linear regression (assuming a linear relationship between the features and the target). So how do we create a pipeline for this case?

# Assuming that we have a dataset "train_data" and "test_data"

# Importing the classes
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
import numpy as np

# Assuming that the features in train_data have a linear relationship with the target.
from sklearn.linear_model import LinearRegression

# steps -> a list of (name, estimator) tuples, in the order they run in the Pipeline
steps = [
    ("imputer", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("scaler", StandardScaler()),
    ("linear_estimator", LinearRegression()),
]

# Creating the pipeline
pipe = Pipeline(steps)

Visualizing the pipeline we just created

# Visualizing the pipeline
from sklearn import set_config

set_config(display="diagram")  # for a more intuitive and graphical representation of the pipeline's structure

pipe
[Image: diagram of the created pipeline]

Congratulations!! You have created your first pipeline using sklearn.

The flow depicted in the image is fairly self-explanatory, but let's still decode how this pipeline works.

  • As the data matrix passes through the pipeline, it is transformed starting from the first step (SimpleImputer() in this case) and continuing through to the last (generally the model estimator).
  • The transformations are applied in the order in which they appear in the pipeline, so the order in which you write the steps list is important.

It is important to note that all intermediate steps of a pipeline must be transformers; only the final step may be any estimator, such as a model.
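To see this flow in action, here is a minimal sketch that fits the pipe we built above on small synthetic data; the arrays below are made up purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
X_train[::10, 0] = np.nan  # sprinkle in some missing values for the imputer to handle
y_train = 2 * X_train[:, 1] + rng.normal(scale=0.1, size=100)

pipe.fit(X_train, y_train)  # runs impute -> scale, then fits LinearRegression
X_new = rng.normal(size=(5, 3))
print(pipe.predict(X_new))  # the same impute -> scale is applied before predicting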

Implementing a pipeline that handles numerical as well as categorical data using scikit-learn

Let’s start off with the initial imports

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
# Steps to be applied to the numerical columns
steps_num_cols = [
    ("imputer", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("scaler", StandardScaler()),
]

# Steps to be applied to the object/categorical columns
steps_obj_cols = [
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("one_hot_encoder", OneHotEncoder(handle_unknown="ignore")),
]

Creating the pipelines from these two step lists

pipe1 = Pipeline(steps_num_cols) # Pipeline for numerical cols
pipe2 = Pipeline(steps_obj_cols) # Pipeline for obj/cat cols

We now have two different pipelines, one for numerical and one for object/categorical data. Next, we need to combine them.

# Combining the 2 pipelines with the help of ColumnTransformer
from sklearn.compose import ColumnTransformer

# Considering a dataset X
num_cols = X.select_dtypes(include=np.number).columns
cat_cols = X.select_dtypes(include=object).columns

# num_cols and cat_cols hold the names of the numeric and categorical columns respectively
combined_pipe = ColumnTransformer(transformers=[
    ("int_cols", pipe1, num_cols),
    ("obj_cols", pipe2, cat_cols),
])

How does the pipeline look right now?

[Image: diagram of the combined pipeline]

As demonstrated, our code configures the pipeline to transform the data appropriately by handling numerical and categorical columns in their respective manners. This ensures that missing values in numerical columns are imputed with the mean and the columns are then scaled and normalized, while missing values in categorical columns are filled with the constant “missing” and the columns are then one-hot encoded, leading to a more robust and effective structure for data transformations.
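As a quick sanity check, here is a hedged sketch of fitting the combined preprocessor on a tiny, made-up DataFrame; the column names and values are invented for illustration.

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer

# Hypothetical toy data, invented for this illustration
X_demo = pd.DataFrame({
    "age": [25.0, np.nan, 47.0],          # numeric column with a missing value
    "city": ["Delhi", np.nan, "Mumbai"],  # categorical column with a missing value
})

# sparse_threshold=0 forces a dense array so the result is easy to print
demo_pipe = ColumnTransformer(
    transformers=[("int_cols", pipe1, ["age"]), ("obj_cols", pipe2, ["city"])],
    sparse_threshold=0,
)
print(demo_pipe.fit_transform(X_demo))
# First column: age imputed with the mean, then standardized
# Remaining columns: city filled with "missing", then one-hot encoded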

Now we can add an estimator to finally make predictions. In this case, let's perform random forest regression on the preprocessed data.

# Importing Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor

# Normally we would create a regressor object, fit it on the training data,
# and then predict. With a pipeline, preprocessing and the model are chained instead.

final_pipe = Pipeline(steps=[
    ("preprocessor", combined_pipe),
    ("Estimator", RandomForestRegressor(n_estimators=1000, max_depth=5)),
])

'''
Alternatively, we can achieve the same thing with make_pipeline,
which names the steps automatically:

from sklearn.pipeline import make_pipeline

final_pipe = make_pipeline(combined_pipe, RandomForestRegressor(n_estimators=1000, max_depth=5))
'''
final_pipe # Displaying the final pipeline

This is what we have created in just a few lines of code. Beautiful, isn't it? Just by looking at it, you can tell how organized our workflow has become. We just need to fit the pipeline on the training data and then predict the outcomes on the test dataset, as sketched below.
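For completeness, here is a hedged sketch of that last step; X_train, y_train, and X_test are hypothetical names for your train/test splits, which are not defined in this post.

# X_train, y_train, and X_test are assumed to exist (hypothetical names)
final_pipe.fit(X_train, y_train)          # preprocessing and the forest are fitted in one call
predictions = final_pipe.predict(X_test)  # the same preprocessing is applied to the test data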

The pipelines created in this blog are basic ones and may not cover every data issue out of the box; real-world datasets often require additional or more tailored preprocessing steps.

Conclusion

Using pipelines in a machine learning workflow streamlines preprocessing and modeling by ensuring consistent and reproducible transformations. This modular approach enhances code readability, maintainability, and facilitates experimentation, ultimately leading to more efficient and effective model development. Embracing pipelines is a best practice for building robust and scalable machine learning solutions.

If you reached here, congratulations and thank you! I hope this blog fulfilled its purpose and helped you create an organized workflow.

You can find the final code on my GitHub repository.

In case of any doubts feel free to comment or reach out to me on LinkedIn.
