Dealing with data from multiple sources that have different column structures is a common challenge in data analysis. This guide shows you exactly how to combine different datasets into a single DataFrame using Python pandas, even when they have completely different columns.
Understanding Python Pandas
What is pandas? Pandas is a powerful data manipulation library that provides easy-to-use data structures for handling structured data, and it's the go-to tool for data analysis in Python.
Pandas offers two main data structures: Series (1D) and DataFrame (2D). Both can hold different data types, including numbers, text, and dates, which makes them well suited to real-world data.
What is pandas used for? It excels at data cleaning, merging, analysis, and transformation. Whether you're working with CSV files, databases, or APIs, pandas can handle them all.
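To make the two structures concrete, here's a minimal sketch (the variable and column names below are invented for illustration):

```python
import pandas as pd

# A Series is a labeled one-dimensional array
prices = pd.Series([150.50, 89.99, 234.75], name='purchase_total')

# A DataFrame is a two-dimensional table of labeled columns
customers = pd.DataFrame({
    'name': ['John Doe', 'Jane Smith', 'Bob Johnson'],
    'purchase_total': prices
})

print(customers.dtypes)  # mixed types: object for names, float64 for totals
```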
Installation and Setup
Installing Pandas
Install pandas with pip:
pip install pandas
Or, if you're using Anaconda:
conda install pandas
Importing Pandas
Import pandas using the standard convention:
import pandas as pd
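A quick way to confirm the installation worked is to print the installed version:

```python
import pandas as pd

# If the import succeeds, pandas is installed; __version__ shows which release
print(pd.__version__)
```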
Working with Different Data Sources: The Main Challenge
(A quick note on naming: the library is "pandas" with an 's', not "panda".) When you have multiple data sources with different column names and structures, combining them requires specific techniques.
Let's look at a practical example with three different data sources:
import pandas as pd

# Data Source 1: Website customers
website_data = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'full_name': ['John Doe', 'Jane Smith', 'Bob Johnson', 'Alice Brown'],
    'email_address': ['john@email.com', 'jane@email.com', 'bob@email.com', 'alice@email.com'],
    'purchase_total': [150.50, 89.99, 234.75, 67.80]
})

# Data Source 2: Mobile app users
mobile_data = pd.DataFrame({
    'user_id': [5, 6, 7, 8],
    'name': ['Charlie Wilson', 'Diana Prince', 'Eve Adams', 'Frank Miller'],
    'email': ['charlie@email.com', 'diana@email.com', 'eve@email.com', 'frank@email.com'],
    'app_spending': [45.99, 156.30, 78.45, 123.67],
    'device': ['iOS', 'Android', 'iOS', 'Android']
})

# Data Source 3: Social media leads
social_data = pd.DataFrame({
    'lead_id': [9, 10, 11, 12],
    'username': ['grace_h', 'henry_k', 'ivy_j', 'jack_l'],
    'contact_email': ['grace@email.com', 'henry@email.com', 'ivy@email.com', 'jack@email.com'],
    'followers': [1250, 890, 2340, 567],
    'engagement': [3.5, 4.2, 2.8, 5.1]
})
Solution 1: Simple Concatenation (Keeps All Columns)
Here's how to use pandas to combine datasets with different columns.
Important: No Common Columns Required!
Key Point: Pandas concat() can combine DataFrames with completely different columns. You don't need any shared columns - pandas automatically handles this by creating all unique columns and filling missing values with NaN.
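A minimal two-frame sketch of this behavior (toy column names, chosen only for this demonstration):

```python
import pandas as pd

a = pd.DataFrame({'id': [1, 2], 'email': ['a@x.com', 'b@x.com']})
b = pd.DataFrame({'id': [3], 'device': ['iOS']})

# concat takes the union of all columns; gaps are filled with NaN
both = pd.concat([a, b], ignore_index=True, sort=False)
print(both)
print(both.columns.tolist())  # ['id', 'email', 'device']
```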
# Method 1: Simple concatenation - keeps all columns
def combine_different_datasets(datasets, source_names):
    combined_data = []
    # enumerate() gives us both the index (i) and the DataFrame (df):
    # i=0 is the first DataFrame, i=1 the second, and so on
    for i, df in enumerate(datasets):
        # copy() protects the original data from being modified
        df_copy = df.copy()
        # Add a source column using the index: source_names[0], source_names[1], ...
        df_copy['data_source'] = source_names[i]
        combined_data.append(df_copy)
    # concat() combines all DataFrames, even with different columns;
    # ignore_index=True creates new row numbers 0, 1, 2, ...
    result = pd.concat(combined_data, ignore_index=True, sort=False)
    return result

# Combine the datasets
datasets = [website_data, mobile_data, social_data]
source_names = ['website', 'mobile_app', 'social_media']
combined_df = combine_different_datasets(datasets, source_names)
print("Combined DataFrame:")
print(combined_df)
print(f"\nShape: {combined_df.shape}")
print(f"Columns: {combined_df.columns.tolist()}")
# Combine the datasets
datasets = [website_data, mobile_data, social_data]
source_names = ['website', 'mobile_app', 'social_media']
combined_df = combine_different_datasets(datasets, source_names)
print("Combined DataFrame:")
print(combined_df)
print(f"\nShape: {combined_df.shape}")
print(f"Columns: {combined_df.columns.tolist()}")
What happens during concatenation:
- Website data has: customer_id, full_name, email_address, purchase_total
- Mobile data has: user_id, name, email, app_spending, device
- Social data has: lead_id, username, contact_email, followers, engagement
- The result has ALL columns from all sources
- Missing values are automatically filled with NaN
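Once frames are combined this way, it's worth checking how many values each source left empty. A small self-contained sketch (using two toy frames rather than the full example above):

```python
import pandas as pd

web = pd.DataFrame({'email': ['a@x.com'], 'purchase_total': [10.0]})
app = pd.DataFrame({'email': ['b@x.com'], 'device': ['iOS']})
web['data_source'] = 'website'
app['data_source'] = 'mobile_app'

combined = pd.concat([web, app], ignore_index=True, sort=False)

# Count missing values per source to see which columns each source lacks
missing = combined.isna().groupby(combined['data_source']).sum()
print(missing)
```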
Solution 2: Column Mapping and Standardization
# Method 2: Standardize column names before combining
def standardize_and_combine(datasets, column_mappings, source_names):
    standardized_data = []
    for i, df in enumerate(datasets):
        df_std = df.copy()
        # Apply the column mapping for this dataset, if one exists
        if source_names[i] in column_mappings:
            df_std = df_std.rename(columns=column_mappings[source_names[i]])
        # Add source information
        df_std['source'] = source_names[i]
        standardized_data.append(df_std)
    # Combine the standardized datasets
    result = pd.concat(standardized_data, ignore_index=True, sort=False)
    return result
# Define column mappings
column_mappings = {
    'website': {
        'customer_id': 'id',
        'full_name': 'name',
        'email_address': 'email',
        'purchase_total': 'amount'
    },
    'mobile_app': {
        'user_id': 'id',
        'name': 'name',
        'email': 'email',
        'app_spending': 'amount'
    },
    'social_media': {
        'lead_id': 'id',
        'username': 'name',
        'contact_email': 'email'
    }
}
standardized_df = standardize_and_combine(datasets, column_mappings, source_names)
print("Standardized DataFrame:")
print(standardized_df)
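After standardization, you may want to keep only the shared schema and drop leftover source-specific columns. A hedged sketch, assuming the renamed columns id, name, email, amount and a source tag as in the mappings above (the toy frame here stands in for the real standardized result):

```python
import pandas as pd

# A toy standardized frame; 'device' is a leftover source-specific column
std = pd.DataFrame({
    'id': [1, 5],
    'name': ['John Doe', 'Charlie Wilson'],
    'email': ['john@email.com', 'charlie@email.com'],
    'amount': [150.50, 45.99],
    'source': ['website', 'mobile_app'],
    'device': [None, 'iOS'],
})

core_columns = ['id', 'name', 'email', 'amount', 'source']

# reindex keeps only the shared schema (and would add NaN columns
# for any core column that happened to be absent)
core = std.reindex(columns=core_columns)
print(core)
```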
Key Benefits of These Approaches
- Flexibility: Handle any number of data sources with different structures
- Data Preservation: Keep all original data while adding source tracking
- Scalability: Easy to add new data sources without changing core logic
- Error Handling: the robust version below adds checks for empty datasets and mismatched inputs
Best Practices
# Best practices for data integration
def robust_data_combination(datasets, source_names):
    """Robust combination function with best practices built in."""
    if len(datasets) != len(source_names):
        raise ValueError("Number of datasets must match number of source names")
    combined_data = []
    for i, df in enumerate(datasets):
        # Skip empty DataFrames instead of concatenating them
        if df.empty:
            print(f"Warning: {source_names[i]} dataset is empty")
            continue
        # Always work on a copy to protect the original data
        df_copy = df.copy()
        # Add consistent metadata
        df_copy['data_source'] = source_names[i]
        df_copy['original_row_count'] = len(df)
        df_copy['processing_timestamp'] = pd.Timestamp.now()
        combined_data.append(df_copy)
        print(f"Processed {source_names[i]}: {len(df)} rows")
    if not combined_data:
        print("No valid datasets to combine")
        return pd.DataFrame()
    # Combine all datasets
    result = pd.concat(combined_data, ignore_index=True, sort=False)
    # Optional: fill NaN values with a sentinel. Note that filling numeric
    # columns with a string converts them to object dtype, so pick a default
    # that suits your downstream use.
    result = result.fillna('Not Available')
    print(f"\nFinal combined dataset: {len(result)} rows, {len(result.columns)} columns")
    return result
# Example usage
robust_combined = robust_data_combination(datasets, source_names)
print("\nFirst few rows of robust combined data:")
print(robust_combined.head())
Important Guidelines:
- Always create copies of original DataFrames to avoid modifying source data
- Add source identification columns to track data origin
- Handle missing values appropriately for your use case
- Use consistent naming conventions for standardized columns
- Document your column mapping logic for future reference
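As a variation on the source-column approach above, concat's keys parameter can tag each source through a MultiIndex instead of an added column. A sketch with toy frames (the frame and key names are invented for this example):

```python
import pandas as pd

web = pd.DataFrame({'email': ['a@x.com'], 'purchase_total': [10.0]})
app = pd.DataFrame({'email': ['b@x.com'], 'app_spending': [5.0]})

# keys= labels each block of rows; the result gets a (source, row) MultiIndex
tagged = pd.concat([web, app], keys=['website', 'mobile_app'], sort=False)
print(tagged)

# Pull a single source back out with .loc
print(tagged.loc['website'])
```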
Conclusion
As shown in this guide, pandas is the essential tool for combining heterogeneous data sources in Python. The techniques demonstrated here provide practical solutions for real-world data integration challenges.
These methods allow you to:
- Combine datasets with completely different column structures
- Maintain data integrity while adding useful metadata
- Create flexible, reusable functions for ongoing data integration needs
- Handle various data source formats and structures
Whether you're working with customer data, sales information, or any other type of structured data from multiple sources, these pandas techniques will help you create unified datasets efficiently and reliably.
Remember: Pandas concat() doesn't require any common columns between DataFrames. It automatically creates a union of all columns and handles missing data gracefully, making it perfect for real-world heterogeneous data scenarios.