Monday, July 29, 2024

5 DIY Python Functions for Data Cleaning

 5 DIY Python Functions for Data Cleaning

Image by Author | Midjourney

Data cleaning: whether you love it or hate it, you likely spend a lot of time doing it.

It’s what we signed up for. There’s no understanding, analyzing, or modeling data without first cleaning it. Making sure we have reusable tools handy for data cleaning is essential. To that end, here are 5 DIY functions to give you a some examples and starting points for building up your own data cleaning tool chest.

The functions are well-documented, and include explicit descriptions or function parameters and return types. Type hinting is also employed to ensure both that the functions are called in the manner they were intended, and that they can be well understood by you, the reader.

Before we get started, let’s take care of the imports.

With that done, let’s get on to the functions.

1. Remove Multiple Spaces

Our first DIY function is meant to remove excessive whitespace from text. If we want neither multiple spaces within a string, nor excessive leading or trailing spaces, this single line function will take care of it for us. We make use of regular expressions for internal spaces, as well as strip() for trailing/leading whitespace.

Testing:

Output:

2. Standardize Date Formats

Do you have datasets with dates running the gamut of internationally acceptable formats? This function will standardize them all to our specified format (YYYY-MM-DD).

Testing:

Output:

3. Handle Missing Values

Let’s deal with those pesky missing values. We can specify our numeric data strategy to use (‘mean’, ‘median’, or ‘mode’), as well as our categorical data strategy (‘mode’ or ‘dummy’).

Testing:

Output:

4. Remove Outliers

Outliers causing you problems? Not any more. This DIY function uses the IQR method for removing outliers from our data. You just pass in the data and specify the columns to check for outliers, it returns an outlier-free dataframe.

Testing:

Output:

5. Normalize Text Data

Let’s get normal! When you want to convert all text to lowercase, strip of whitespace, and remove special characters, this DIY function will do the trick.

Testing:

Output:

Final Thoughts

Well that’s that. We went presented 5 different DIY functions that will perform specific data cleaning tasks. We test drove them all, and checked out their results. You should now have some idea of where to go on your own from here, and don’t forget to save these functions for use later.

5 DIY Python Functions for Data Cleaning

  Image by Author | Midjourney Data cleaning: whether you love it or hate it, you likely spend a lot of time doing it. It’s what we signed u...