Dealing with data from multiple sources that have different column structures is a common challenge in data analysis. This guide shows you exactly how to combine different datasets into a single DataFrame using Python pandas, even when they have completely different columns.
Understanding Python Pandas
What is pandas? Pandas is a powerful data manipulation library that provides easy-to-use data structures for handling structured data, and it's the go-to tool for data analysis in Python.
Pandas offers two main data structures: Series (1D) and DataFrame (2D). Both can hold different data types, including numbers, text, and dates, which makes them well suited to real-world data.
What is pandas used for? It excels at data cleaning, merging, analysis, and transformation. Whether you're working with CSV files, databases, or APIs, pandas can handle them all.
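To make the two structures concrete, here's a minimal sketch (the variable and column names below are invented for illustration):

```python
import pandas as pd

# A Series is a labeled one-dimensional array
prices = pd.Series([150.50, 89.99, 234.75], name='purchase_total')

# A DataFrame is a two-dimensional table of labeled columns
customers = pd.DataFrame({
    'name': ['John Doe', 'Jane Smith', 'Bob Johnson'],
    'purchase_total': prices
})

print(customers.dtypes)  # mixed types: object for names, float64 for totals
```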
Installation and Setup
Installing Pandas
Install pandas with pip:
pip install pandas
Or, if you're using Anaconda:
conda install pandas
Importing Pandas
Import pandas using the standard convention:
import pandas as pd
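A quick way to confirm the installation worked is to print the installed version:

```python
import pandas as pd

# If the import succeeds, pandas is installed; __version__ shows which release
print(pd.__version__)
```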
Working with Different Data Sources: The Main Challenge
(A quick note on naming: the library is "pandas" with an 's', not "panda".) When you have multiple data sources with different column names and structures, combining them requires specific techniques.
Let's look at a practical example with three different data sources:
import pandas as pd

# Data Source 1: Website customers
website_data = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'full_name': ['John Doe', 'Jane Smith', 'Bob Johnson', 'Alice Brown'],
    'email_address': ['john@email.com', 'jane@email.com', 'bob@email.com', 'alice@email.com'],
    'purchase_total': [150.50, 89.99, 234.75, 67.80]
})

# Data Source 2: Mobile app users
mobile_data = pd.DataFrame({
    'user_id': [5, 6, 7, 8],
    'name': ['Charlie Wilson', 'Diana Prince', 'Eve Adams', 'Frank Miller'],
    'email': ['charlie@email.com', 'diana@email.com', 'eve@email.com', 'frank@email.com'],
    'app_spending': [45.99, 156.30, 78.45, 123.67],
    'device': ['iOS', 'Android', 'iOS', 'Android']
})

# Data Source 3: Social media leads
social_data = pd.DataFrame({
    'lead_id': [9, 10, 11, 12],
    'username': ['grace_h', 'henry_k', 'ivy_j', 'jack_l'],
    'contact_email': ['grace@email.com', 'henry@email.com', 'ivy@email.com', 'jack@email.com'],
    'followers': [1250, 890, 2340, 567],
    'engagement': [3.5, 4.2, 2.8, 5.1]
})
Solution 1: Simple Concatenation (Keeps All Columns)
Here's how to use pandas to combine datasets with different columns.
Important: No Common Columns Required!
Key Point: Pandas concat() can combine DataFrames with completely different columns. You don't need any shared columns - pandas automatically handles this by creating all unique columns and filling missing values with NaN.
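A minimal two-frame sketch of this behavior (toy column names, chosen only for this demonstration):

```python
import pandas as pd

a = pd.DataFrame({'id': [1, 2], 'email': ['a@x.com', 'b@x.com']})
b = pd.DataFrame({'id': [3], 'device': ['iOS']})

# concat takes the union of all columns; gaps are filled with NaN
both = pd.concat([a, b], ignore_index=True, sort=False)
print(both)
print(both.columns.tolist())  # ['id', 'email', 'device']
```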
# Method 1: Simple concatenation - keeps all columns
def combine_different_datasets(datasets, source_names):
    combined_data = []
    # enumerate() gives us both the index (i) and the DataFrame (df):
    # i=0 is the first DataFrame, i=1 the second, and so on
    for i, df in enumerate(datasets):
        # copy() protects the original data from being modified
        df_copy = df.copy()
        # Add a source column using the index: source_names[0], source_names[1], ...
        df_copy['data_source'] = source_names[i]
        combined_data.append(df_copy)
    # concat() combines all DataFrames, even with different columns;
    # ignore_index=True creates new row numbers 0, 1, 2, ...
    result = pd.concat(combined_data, ignore_index=True, sort=False)
    return result

# Combine the datasets
datasets = [website_data, mobile_data, social_data]
source_names = ['website', 'mobile_app', 'social_media']
combined_df = combine_different_datasets(datasets, source_names)
print("Combined DataFrame:")
print(combined_df)
print(f"\nShape: {combined_df.shape}")
print(f"Columns: {combined_df.columns.tolist()}")
# Combine the datasets
datasets = [website_data, mobile_data, social_data]
source_names = ['website', 'mobile_app', 'social_media']
combined_df = combine_different_datasets(datasets, source_names)
print("Combined DataFrame:")
print(combined_df)
print(f"\nShape: {combined_df.shape}")
print(f"Columns: {combined_df.columns.tolist()}")
What happens during concatenation:
- Website data has: customer_id, full_name, email_address, purchase_total
- Mobile data has: user_id, name, email, app_spending, device
- Social data has: lead_id, username, contact_email, followers, engagement
- The result has ALL columns from all sources
- Missing values are automatically filled with NaN
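Once frames are combined this way, it's worth checking how many values each source left empty. A small self-contained sketch (using two toy frames rather than the full example above):

```python
import pandas as pd

web = pd.DataFrame({'email': ['a@x.com'], 'purchase_total': [10.0]})
app = pd.DataFrame({'email': ['b@x.com'], 'device': ['iOS']})
web['data_source'] = 'website'
app['data_source'] = 'mobile_app'

combined = pd.concat([web, app], ignore_index=True, sort=False)

# Count missing values per source to see which columns each source lacks
missing = combined.isna().groupby(combined['data_source']).sum()
print(missing)
```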
Solution 2: Column Mapping and Standardization
# Method 2: Standardize column names before combining
def standardize_and_combine(datasets, column_mappings, source_names):
    standardized_data = []
    for i, df in enumerate(datasets):
        df_std = df.copy()
        # Apply the column mapping for this dataset, if one exists
        if source_names[i] in column_mappings:
            df_std = df_std.rename(columns=column_mappings[source_names[i]])
        # Add source information
        df_std['source'] = source_names[i]
        standardized_data.append(df_std)
    # Combine the standardized datasets
    result = pd.concat(standardized_data, ignore_index=True, sort=False)
    return result
# Define column mappings
column_mappings = {
    'website': {
        'customer_id': 'id',
        'full_name': 'name',
        'email_address': 'email',
        'purchase_total': 'amount'
    },
    'mobile_app': {
        'user_id': 'id',
        'name': 'name',
        'email': 'email',
        'app_spending': 'amount'
    },
    'social_media': {
        'lead_id': 'id',
        'username': 'name',
        'contact_email': 'email'
    }
}
standardized_df = standardize_and_combine(datasets, column_mappings, source_names)
print("Standardized DataFrame:")
print(standardized_df)
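After standardization, you may want to keep only the shared schema and drop leftover source-specific columns. A hedged sketch, assuming the renamed columns id, name, email, amount and a source tag as in the mappings above (the toy frame here stands in for the real standardized result):

```python
import pandas as pd

# A toy standardized frame; 'device' is a leftover source-specific column
std = pd.DataFrame({
    'id': [1, 5],
    'name': ['John Doe', 'Charlie Wilson'],
    'email': ['john@email.com', 'charlie@email.com'],
    'amount': [150.50, 45.99],
    'source': ['website', 'mobile_app'],
    'device': [None, 'iOS'],
})

core_columns = ['id', 'name', 'email', 'amount', 'source']

# reindex keeps only the shared schema (and would add NaN columns
# for any core column that happened to be absent)
core = std.reindex(columns=core_columns)
print(core)
```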
Key Benefits of These Approaches
- Flexibility: Handle any number of data sources with different structures
- Data Preservation: Keep all original data while adding source tracking
- Scalability: Easy to add new data sources without changing core logic
- Error Handling: the robust version below adds checks for empty datasets and mismatched inputs
Best Practices
# Best practices for data integration
def robust_data_combination(datasets, source_names):
    """Robust combination function with best practices built in."""
    if len(datasets) != len(source_names):
        raise ValueError("Number of datasets must match number of source names")
    combined_data = []
    for i, df in enumerate(datasets):
        # Skip empty DataFrames instead of concatenating them
        if df.empty:
            print(f"Warning: {source_names[i]} dataset is empty")
            continue
        # Always work on a copy to protect the original data
        df_copy = df.copy()
        # Add consistent metadata
        df_copy['data_source'] = source_names[i]
        df_copy['original_row_count'] = len(df)
        df_copy['processing_timestamp'] = pd.Timestamp.now()
        combined_data.append(df_copy)
        print(f"Processed {source_names[i]}: {len(df)} rows")
    if not combined_data:
        print("No valid datasets to combine")
        return pd.DataFrame()
    # Combine all datasets
    result = pd.concat(combined_data, ignore_index=True, sort=False)
    # Optional: fill NaN values with a sentinel. Note that filling numeric
    # columns with a string converts them to object dtype, so pick a default
    # that suits your downstream use.
    result = result.fillna('Not Available')
    print(f"\nFinal combined dataset: {len(result)} rows, {len(result.columns)} columns")
    return result
# Example usage
robust_combined = robust_data_combination(datasets, source_names)
print("\nFirst few rows of robust combined data:")
print(robust_combined.head())
Important Guidelines:
- Always create copies of original DataFrames to avoid modifying source data
- Add source identification columns to track data origin
- Handle missing values appropriately for your use case
- Use consistent naming conventions for standardized columns
- Document your column mapping logic for future reference
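As a variation on the source-column approach above, concat's keys parameter can tag each source through a MultiIndex instead of an added column. A sketch with toy frames (the frame and key names are invented for this example):

```python
import pandas as pd

web = pd.DataFrame({'email': ['a@x.com'], 'purchase_total': [10.0]})
app = pd.DataFrame({'email': ['b@x.com'], 'app_spending': [5.0]})

# keys= labels each block of rows; the result gets a (source, row) MultiIndex
tagged = pd.concat([web, app], keys=['website', 'mobile_app'], sort=False)
print(tagged)

# Pull a single source back out with .loc
print(tagged.loc['website'])
```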
Conclusion
As shown in this guide, pandas is the essential tool for combining heterogeneous data sources in Python. The techniques demonstrated here provide practical solutions for real-world data integration challenges.
These methods allow you to:
- Combine datasets with completely different column structures
- Maintain data integrity while adding useful metadata
- Create flexible, reusable functions for ongoing data integration needs
- Handle various data source formats and structures
Whether you're working with customer data, sales information, or any other type of structured data from multiple sources, these pandas techniques will help you create unified datasets efficiently and reliably.
Remember: Pandas concat() doesn't require any common columns between DataFrames. It automatically creates a union of all columns and handles missing data gracefully, making it perfect for real-world heterogeneous data scenarios.