Image by Author | Midjourney
Data cleaning: whether you love it or hate it, you likely spend a lot of time doing it.
It’s what we signed up for. There’s no understanding, analyzing, or modeling data without first cleaning it. Making sure we have reusable tools handy for data cleaning is essential. To that end, here are 5 DIY functions to give you some examples and starting points for building up your own data cleaning tool chest.
The functions are well-documented, and include explicit descriptions of function parameters and return types. Type hinting is also employed to ensure both that the functions are called in the manner they were intended, and that they can be well understood by you, the reader.
Before we get started, let’s take care of the imports.
```python
import re
from datetime import datetime
import pandas as pd
import numpy as np
from typing import List, Union, Optional
```
With that done, let’s get on to the functions.
1. Remove Multiple Spaces
Our first DIY function is meant to remove excessive whitespace from text. If we want neither multiple spaces within a string, nor excessive leading or trailing spaces, this single line function will take care of it for us. We make use of regular expressions for internal spaces, as well as strip() for trailing/leading whitespace.
```python
def clean_spaces(text: str) -> str:
    """
    Remove multiple spaces from a string and trim leading/trailing spaces.

    :param text: The input string to clean
    :returns: A string with multiple spaces removed and trimmed
    """
    return re.sub(' +', ' ', str(text).strip())
```
Testing:
```python
messy_text = "This   has   too   many   spaces"
clean_text = clean_spaces(messy_text)
print(clean_text)
```
Output:

```
This has too many spaces
```
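In practice you’ll usually want to clean a whole DataFrame column rather than a single string. Here’s a minimal sketch of doing that with `Series.map` (the `name` column and its values are just made-up illustration data):

```python
import re
import pandas as pd

def clean_spaces(text: str) -> str:
    """Collapse runs of spaces and trim leading/trailing whitespace."""
    return re.sub(' +', ' ', str(text).strip())

# Hypothetical column with inconsistent spacing
df = pd.DataFrame({'name': ['  Ada   Lovelace ', 'Alan    Turing']})
df['name'] = df['name'].map(clean_spaces)
print(df['name'].tolist())  # ['Ada Lovelace', 'Alan Turing']
```

Because `clean_spaces` casts its input with `str()`, this also survives the occasional non-string value in the column.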
2. Standardize Date Formats
Do you have datasets with dates running the gamut of internationally acceptable formats? This function will standardize them all to our specified format (YYYY-MM-DD).
```python
def standardize_date(date_string: str) -> Optional[str]:
    """
    Convert various date formats to YYYY-MM-DD.

    :param date_string: The input date string to standardize
    :returns: A standardized date string in YYYY-MM-DD format, or None if parsing fails
    """
    date_formats = ["%Y-%m-%d", "%d-%m-%Y", "%m/%d/%Y", "%d/%m/%Y", "%B %d, %Y"]
    for fmt in date_formats:
        try:
            return datetime.strptime(date_string, fmt).strftime("%Y-%m-%d")
        except ValueError:
            pass
    # Return None if no format matches
    return None
```
Testing:
```python
dates = ["2023-04-01", "01-04-2023", "04/01/2023", "April 1, 2023"]
standardized_dates = [standardize_date(date) for date in dates]
print(standardized_dates)
```
Output:
```
['2023-04-01', '2023-04-01', '2023-04-01', '2023-04-01']
```
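Since `standardize_date` returns `None` on failure, it’s worth collecting the strings that didn’t parse so you can inspect them rather than silently losing them. A small sketch (the function is restated so the snippet runs on its own; note that ambiguous numeric dates resolve by format order, so day-first formats win here):

```python
from datetime import datetime
from typing import Optional

def standardize_date(date_string: str) -> Optional[str]:
    """Try several common formats; return YYYY-MM-DD or None."""
    date_formats = ["%Y-%m-%d", "%d-%m-%Y", "%m/%d/%Y", "%d/%m/%Y", "%B %d, %Y"]
    for fmt in date_formats:
        try:
            return datetime.strptime(date_string, fmt).strftime("%Y-%m-%d")
        except ValueError:
            pass
    return None

dates = ["April 1, 2023", "not a date"]
results = [standardize_date(d) for d in dates]
# Keep the originals that failed to parse for manual review
unparsed = [d for d, r in zip(dates, results) if r is None]
print(results)   # ['2023-04-01', None]
print(unparsed)  # ['not a date']
```

This pattern lets you decide downstream whether to drop, impute, or hand-correct the failures.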
3. Handle Missing Values
Let’s deal with those pesky missing values. We can specify our numeric data strategy to use (‘mean’, ‘median’, or ‘mode’), as well as our categorical data strategy (‘mode’ or ‘dummy’).
```python
def handle_missing(df: pd.DataFrame,
                   numeric_strategy: str = 'mean',
                   categorical_strategy: str = 'mode') -> pd.DataFrame:
    """
    Fill missing values in a DataFrame.

    :param df: The input DataFrame
    :param numeric_strategy: Strategy for missing numeric values ('mean', 'median', or 'mode')
    :param categorical_strategy: Strategy for missing categorical values ('mode' or 'dummy')
    :returns: A DataFrame with missing values filled
    """
    df = df.copy()  # avoid mutating the caller's DataFrame
    for column in df.columns:
        if df[column].dtype in ['int64', 'float64']:
            if numeric_strategy == 'mean':
                df[column] = df[column].fillna(df[column].mean())
            elif numeric_strategy == 'median':
                df[column] = df[column].fillna(df[column].median())
            elif numeric_strategy == 'mode':
                df[column] = df[column].fillna(df[column].mode()[0])
        else:
            if categorical_strategy == 'mode':
                df[column] = df[column].fillna(df[column].mode()[0])
            elif categorical_strategy == 'dummy':
                df[column] = df[column].fillna('Unknown')
    return df
```
Testing:
```python
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': ['x', 'y', np.nan, 'z']})
cleaned_df = handle_missing(df)
print(cleaned_df)
```
Output:
```
          A  B
0  1.000000  x
1  2.000000  y
2  2.333333  x
3  4.000000  z
```
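To see the strategy parameters doing something different from the defaults, here’s a self-contained sketch (the function is restated so the snippet runs on its own) using `'median'` for numerics, which is robust to the large value, and `'dummy'` for categoricals:

```python
import numpy as np
import pandas as pd

def handle_missing(df: pd.DataFrame,
                   numeric_strategy: str = 'mean',
                   categorical_strategy: str = 'mode') -> pd.DataFrame:
    """Fill missing numeric and categorical values per the chosen strategies."""
    df = df.copy()
    for column in df.columns:
        if df[column].dtype in ['int64', 'float64']:
            if numeric_strategy == 'mean':
                df[column] = df[column].fillna(df[column].mean())
            elif numeric_strategy == 'median':
                df[column] = df[column].fillna(df[column].median())
            elif numeric_strategy == 'mode':
                df[column] = df[column].fillna(df[column].mode()[0])
        else:
            if categorical_strategy == 'mode':
                df[column] = df[column].fillna(df[column].mode()[0])
            elif categorical_strategy == 'dummy':
                df[column] = df[column].fillna('Unknown')
    return df

df = pd.DataFrame({'A': [1, 2, np.nan, 100], 'B': ['x', None, None, 'z']})
out = handle_missing(df, numeric_strategy='median', categorical_strategy='dummy')
# A's NaN becomes the median (2.0), unaffected by the outlying 100;
# B's missing entries become the 'Unknown' placeholder.
print(out)
```

Note the mean would have filled A with 34.33 here; the median is often the safer default for skewed columns.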
4. Remove Outliers
Outliers causing you problems? Not anymore. This DIY function uses the IQR method for removing outliers from our data. Just pass in the data and specify the columns to check for outliers, and it returns an outlier-free DataFrame.
```python
def remove_outliers_iqr(df: pd.DataFrame,
                        columns: List[str],
                        factor: float = 1.5) -> pd.DataFrame:
    """
    Remove outliers from specified columns using the Interquartile Range (IQR) method.

    :param df: The input DataFrame
    :param columns: List of column names to check for outliers
    :param factor: The IQR factor to use (default is 1.5)
    :returns: A DataFrame with outliers removed
    """
    mask = pd.Series(True, index=df.index)
    for col in columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - factor * IQR
        upper_bound = Q3 + factor * IQR
        mask &= (df[col] >= lower_bound) & (df[col] <= upper_bound)
    cleaned_df = df[mask]
    return cleaned_df
```
Testing:
```python
df = pd.DataFrame({'A': [1, 2, 3, 100, 4, 5], 'B': [10, 20, 30, 40, 50, 1000]})
print("Original DataFrame:")
print(df)
print("\nCleaned DataFrame:")
cleaned_df = remove_outliers_iqr(df, ['A', 'B'])
print(cleaned_df)
```
Output:
```
Original DataFrame:
     A     B
0    1    10
1    2    20
2    3    30
3  100    40
4    4    50
5    5  1000

Cleaned DataFrame:
   A   B
0  1  10
1  2  20
2  3  30
4  4  50
```
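The `factor` parameter controls how aggressive the fence is: 1.5 is the conventional Tukey value, while 3.0 only flags extreme outliers. A small self-contained sketch (restating the function, and using a made-up column with one mild outlier) shows the difference:

```python
import pandas as pd
from typing import List

def remove_outliers_iqr(df: pd.DataFrame,
                        columns: List[str],
                        factor: float = 1.5) -> pd.DataFrame:
    """Drop rows falling outside [Q1 - factor*IQR, Q3 + factor*IQR] in any column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        mask &= df[col].between(Q1 - factor * IQR, Q3 + factor * IQR)
    return df[mask]

df = pd.DataFrame({'A': [1, 2, 3, 10, 4, 5]})
# Standard fence (factor=1.5) treats the 10 as an outlier...
print(len(remove_outliers_iqr(df, ['A'], factor=1.5)))  # 5
# ...while a wider fence (factor=3.0) keeps it.
print(len(remove_outliers_iqr(df, ['A'], factor=3.0)))  # 6
```

Tuning `factor` is worth a look whenever the default seems to be trimming legitimate data.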
5. Normalize Text Data
Let’s get normal! When you want to convert all text to lowercase, strip extra whitespace, and remove special characters, this DIY function will do the trick.
```python
def normalize_text(text: str) -> str:
    """
    Normalize text data by converting to lowercase, removing special characters,
    and extra spaces.

    :param text: The input text to normalize
    :returns: Normalized text
    """
    # Convert to lowercase
    text = str(text).lower()
    # Remove special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text
```
Testing:
```python
messy_text = "This is MESSY!!!   Text with $pecial ch@racters."
clean_text = normalize_text(messy_text)
print(clean_text)
```
Output:
```
this is messy text with pecial chracters
```
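As the output shows, stripping everything outside `\w` and `\s` deletes characters like `$` and `@` outright rather than replacing them, so “$pecial” becomes “pecial”. Since `\w` is Unicode-aware in Python 3, accented letters survive. A quick self-contained sketch applying the function over a made-up column:

```python
import re
import pandas as pd

def normalize_text(text: str) -> str:
    """Lowercase, drop punctuation/symbols, and collapse whitespace."""
    text = str(text).lower()
    text = re.sub(r'[^\w\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

s = pd.Series(["Hello,   WORLD!", "Café #42"])
# Accented letters and digits are kept; punctuation and symbols are dropped.
print(s.map(normalize_text).tolist())  # ['hello world', 'café 42']
```

If you need to preserve symbols like `$` (say, for currency), swap the deletion for a replacement or whitelist those characters in the pattern.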
Final Thoughts
Well, that’s that. We presented 5 different DIY functions that each perform a specific data cleaning task. We test drove them all and checked out their results. You should now have some idea of where to go on your own from here, and don’t forget to save these functions for use later.