SQL vs. Pandas: A Head-to-Head for Data Manipulation

Introduction: Your Data, Two Powerful Tools

In the world of data, getting your hands on raw information is just the first step. The real magic happens when you start to clean, transform, and analyze that data to extract meaningful insights. Two of the most ubiquitous and powerful tools for this job are SQL (Structured Query Language) and Python's Pandas library.

You've likely encountered data stored in databases (where SQL shines) or found yourself wrestling with spreadsheets and CSVs (a perfect playground for Pandas). While they come from different ecosystems—SQL from the realm of relational databases and Pandas from the flexible world of Python programming—they often tackle the very same data manipulation tasks.

This blog post will take you on a journey through six common data operations, demonstrating how you can achieve similar results using both SQL and Pandas. Whether you're a SQL wizard looking to learn Python, a Pandas enthusiast curious about database queries, or just starting your data journey, understanding both approaches will significantly boost your data skills and help you choose the right tool for the right task.

Let's dive in and see how these two data titans compare!

________________________________________

Data Setup

Before we dive into the operations, let's establish the sample data we'll be using for both SQL and Pandas examples.

SQL Data Creation

First, we'll create two tables: Departments and Employees.

SQL

-- Create Departments table
CREATE TABLE Departments (
    DepartmentID INT PRIMARY KEY,
    DepartmentName VARCHAR(50)
);

-- Insert data into Departments
INSERT INTO Departments (DepartmentID, DepartmentName) VALUES
(101, 'Sales'),
(102, 'Marketing'),
(103, 'Engineering'),
(104, 'HR');

-- Create Employees table
CREATE TABLE Employees (
    EmployeeID INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    DepartmentID INT,
    Salary DECIMAL(10, 2),
    HireDate DATE,
    Email VARCHAR(100) UNIQUE,
    CONSTRAINT FK_Department FOREIGN KEY (DepartmentID) REFERENCES Departments(DepartmentID)
);

-- Insert data into Employees
INSERT INTO Employees (EmployeeID, FirstName, LastName, DepartmentID, Salary, HireDate, Email) VALUES
(1, 'Alice', 'Smith', 101, 60000.00, '2020-01-15', 'alice.smith@example.com'),
(2, 'Bob', 'Johnson', 102, 75000.00, '2019-03-20', 'bob.johnson@example.com'),
(3, 'Charlie', 'Brown', 101, 62000.00, '2021-07-01', 'charlie.brown@example.com'),
(4, 'Diana', 'Prince', 103, 90000.00, '2018-11-10', 'diana.prince@example.com'),
(5, 'Eve', 'Adams', 102, 70000.00, '2022-05-22', 'eve.adams@example.com'),
(6, 'Frank', 'White', 103, 95000.00, '2017-09-01', 'frank.white@example.com'),
(7, 'Grace', 'Lee', NULL, 55000.00, '2023-01-30', 'grace.lee@example.com'); -- Employee with no department

Pandas Data Loading

For Pandas, we'll simulate this data by creating DataFrames directly using Python.

Python

import pandas as pd
import numpy as np  # For NaN values

# Create Departments DataFrame
departments_data = {
    'DepartmentID': [101, 102, 103, 104],
    'DepartmentName': ['Sales', 'Marketing', 'Engineering', 'HR']
}
df_departments = pd.DataFrame(departments_data)

# Create Employees DataFrame
employees_data = {
    'EmployeeID': [1, 2, 3, 4, 5, 6, 7],
    'FirstName': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace'],
    'LastName': ['Smith', 'Johnson', 'Brown', 'Prince', 'Adams', 'White', 'Lee'],
    'DepartmentID': [101, 102, 101, 103, 102, 103, np.nan],  # Using np.nan for missing department
    'Salary': [60000.00, 75000.00, 62000.00, 90000.00, 70000.00, 95000.00, 55000.00],
    'HireDate': pd.to_datetime(['2020-01-15', '2019-03-20', '2021-07-01', '2018-11-10', '2022-05-22', '2017-09-01', '2023-01-30']),
    'Email': ['alice.smith@example.com', 'bob.johnson@example.com', 'charlie.brown@example.com', 'diana.prince@example.com', 'eve.adams@example.com', 'frank.white@example.com', 'grace.lee@example.com']
}
df_employees = pd.DataFrame(employees_data)

print("df_employees:")
print(df_employees)
print("\ndf_departments:")
print(df_departments)
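
In practice you would rarely type the data in by hand: Pandas can load the same tables from a CSV export or straight from the database the SQL examples run against. A minimal sketch, assuming a hypothetical employees.csv file and SQLite database:

Python

import pandas as pd
from sqlalchemy import create_engine

# Load from a CSV export (file name is an assumption for illustration)
df_employees = pd.read_csv('employees.csv', parse_dates=['HireDate'])

# Or query the database directly (connection string is hypothetical)
engine = create_engine('sqlite:///company.db')
df_employees = pd.read_sql('SELECT * FROM Employees', engine)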

________________________________________

Task 1: Filtering Data

Goal: Select rows from a dataset based on one or more conditions. For instance, finding all employees who earn more than $70,000.

SQL Approach: Using the WHERE Clause

In SQL, the WHERE clause is your go-to for filtering. You specify conditions using comparison operators (=, !=, >, <, >=, <=) and logical operators (AND, OR, NOT).

Code Example (SQL):

Let's find all employees who work in DepartmentID 101 and have a Salary greater than or equal to $60,000.

SQL

SELECT
    EmployeeID,
    FirstName,
    LastName,
    DepartmentID,
    Salary
FROM
    Employees
WHERE
    DepartmentID = 101 AND Salary >= 60000.00;

Explanation:

  • SELECT ... FROM Employees: This part indicates we want to retrieve columns from the Employees table.
  • WHERE DepartmentID = 101: This is our first condition, filtering for rows where the DepartmentID is exactly 101.
  • AND Salary >= 60000.00: This is our second condition. The AND operator means both conditions must be true for a row to be included in the result.

Pandas Approach: Using Boolean Indexing

Pandas uses what's called boolean indexing (or boolean masking) to filter DataFrames. You create a series of True/False values based on your conditions, and then use that series to select rows.

Code Example (Pandas):

Using our df_employees DataFrame, let's achieve the same filtering as the SQL example.

Python

import pandas as pd
import numpy as np

# Re-creating df_employees for context (see the Data Setup section)
employees_data = {
    'EmployeeID': [1, 2, 3, 4, 5, 6, 7],
    'FirstName': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace'],
    'LastName': ['Smith', 'Johnson', 'Brown', 'Prince', 'Adams', 'White', 'Lee'],
    'DepartmentID': [101, 102, 101, 103, 102, 103, np.nan],
    'Salary': [60000.00, 75000.00, 62000.00, 90000.00, 70000.00, 95000.00, 55000.00],
    'HireDate': pd.to_datetime(['2020-01-15', '2019-03-20', '2021-07-01', '2018-11-10', '2022-05-22', '2017-09-01', '2023-01-30']),
    'Email': ['alice.smith@example.com', 'bob.johnson@example.com', 'charlie.brown@example.com', 'diana.prince@example.com', 'eve.adams@example.com', 'frank.white@example.com', 'grace.lee@example.com']
}
df_employees = pd.DataFrame(employees_data)

# Filter employees
filtered_employees_df = df_employees[
    (df_employees['DepartmentID'] == 101) &
    (df_employees['Salary'] >= 60000.00)
]
print(filtered_employees_df)

Explanation:

  • df_employees['DepartmentID'] == 101: This part creates a boolean Series where each element is True if the DepartmentID for that row is 101, and False otherwise.
  • df_employees['Salary'] >= 60000.00: Similarly, this creates another boolean Series for the salary condition.
  • &: The & symbol is the bitwise AND operator in Python. When used with boolean Series, it performs an element-wise AND operation. Both conditions must be True for the combined result to be True.
  • df_employees[...]: We then pass this combined boolean Series inside the square brackets [] of the DataFrame. Pandas returns only the rows where the corresponding boolean value is True.

Important note: In Pandas, you must wrap each condition in parentheses () when combining them with & (AND) or | (OR) because of operator precedence.
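
The same element-wise logic extends to | (OR) and ~ (NOT). If you prefer something closer to SQL's phrasing, Pandas also offers .query(), which takes the whole condition as a string. A small sketch on the same DataFrame:

Python

# OR: employees in department 101 or earning at least $90,000
either_df = df_employees[
    (df_employees['DepartmentID'] == 101) | (df_employees['Salary'] >= 90000.00)
]

# NOT: everyone not recorded in department 101 (this includes Grace,
# whose DepartmentID is missing, since NaN == 101 evaluates to False)
not_sales_df = df_employees[~(df_employees['DepartmentID'] == 101)]

# .query() reads closer to SQL's WHERE clause
same_filter = df_employees.query('DepartmentID == 101 and Salary >= 60000')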

Both SQL's WHERE clause and Pandas' boolean indexing are powerful ways to filter data. SQL's syntax might feel more like natural language, while Pandas uses a more programmatic approach with boolean Series.

________________________________________

Task 2: Selecting Specific Columns

Goal: Retrieve only a subset of columns from a dataset. For example, getting just the FirstName, LastName, and Email of employees.

SQL Approach: Naming Columns in the SELECT Statement

In SQL, you explicitly list the column names you want to retrieve right after the SELECT keyword. If you want all columns, you use *.

Code Example (SQL):

Let's select the FirstName, LastName, and Salary for all employees.

SQL

SELECT
    FirstName,
    LastName,
    Salary
FROM
    Employees;

Explanation:

  • SELECT FirstName, LastName, Salary: Here, we're simply telling the database exactly which columns we want to see in our result set.
  • FROM Employees: Specifies the table from which these columns should be retrieved.

If you wanted all columns, you would write SELECT * FROM Employees;.

Pandas Approach: Using a List of Column Names

In Pandas, you can select specific columns by passing a list of column names to your DataFrame using double square brackets [[]].

Code Example (Pandas):

Using our df_employees DataFrame, let's select the same columns (FirstName, LastName, Salary).

Python

import pandas as pd
import numpy as np

# Re-creating df_employees for context
employees_data = {
    'EmployeeID': [1, 2, 3, 4, 5, 6, 7],
    'FirstName': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace'],
    'LastName': ['Smith', 'Johnson', 'Brown', 'Prince', 'Adams', 'White', 'Lee'],
    'DepartmentID': [101, 102, 101, 103, 102, 103, np.nan],
    'Salary': [60000.00, 75000.00, 62000.00, 90000.00, 70000.00, 95000.00, 55000.00],
    'HireDate': pd.to_datetime(['2020-01-15', '2019-03-20', '2021-07-01', '2018-11-10', '2022-05-22', '2017-09-01', '2023-01-30']),
    'Email': ['alice.smith@example.com', 'bob.johnson@example.com', 'charlie.brown@example.com', 'diana.prince@example.com', 'eve.adams@example.com', 'frank.white@example.com', 'grace.lee@example.com']
}
df_employees = pd.DataFrame(employees_data)

# Select specific columns
selected_columns_df = df_employees[['FirstName', 'LastName', 'Salary']]
print(selected_columns_df)

Explanation:

  • df_employees[['FirstName', 'LastName', 'Salary']]:

o The outer [] are for indexing the DataFrame.

o The inner list, ['FirstName', 'LastName', 'Salary'], is a standard Python list containing the names of the columns you want to select.

o When you pass a list of column names to the DataFrame's indexing operator, Pandas returns a new DataFrame containing only those columns in the order you specified.

If you wanted to select a single column in Pandas, you could use df_employees['FirstName'], which would return a Pandas Series (a single column). To keep it as a DataFrame even with one column, you'd still use the double brackets: df_employees[['FirstName']].
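
The difference is easy to see by checking the result types; a quick sketch:

Python

single_series = df_employees['FirstName']    # one column as a Series
single_frame = df_employees[['FirstName']]   # the same column kept as a DataFrame

print(type(single_series))  # <class 'pandas.core.series.Series'>
print(type(single_frame))   # <class 'pandas.core.frame.DataFrame'>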

Both methods are straightforward: SQL lists the columns directly after SELECT, while Pandas uses a list within its indexing syntax.

________________________________________

Task 3: Aggregating Data

Goal: Summarize numerical data, often by grouping it based on categorical columns. For example, finding the average salary for each department, or the total number of employees in each department.

SQL Approach: Using GROUP BY and Aggregate Functions

SQL provides built-in aggregate functions (like COUNT(), SUM(), AVG(), MIN(), MAX()) that work in conjunction with the GROUP BY clause. The GROUP BY clause divides the rows into groups, and then the aggregate function is applied to each group.

Code Example (SQL):

Let's find the average salary for each DepartmentName. This will require a JOIN to get the department names, and then a GROUP BY clause.

SQL

SELECT
    d.DepartmentName,
    COUNT(e.EmployeeID) AS NumberOfEmployees,
    AVG(e.Salary) AS AverageSalary,
    SUM(e.Salary) AS TotalSalary
FROM
    Employees e
JOIN
    Departments d ON e.DepartmentID = d.DepartmentID
GROUP BY
    d.DepartmentName
ORDER BY
    d.DepartmentName;

Explanation:

  • SELECT d.DepartmentName, COUNT(e.EmployeeID) AS NumberOfEmployees, AVG(e.Salary) AS AverageSalary, SUM(e.Salary) AS TotalSalary: We're selecting the DepartmentName (from the Departments table, aliased as d) and then applying three aggregate functions:

o COUNT(e.EmployeeID): Counts the number of employees in each group.

o AVG(e.Salary): Calculates the average Salary in each group.

o SUM(e.Salary): Calculates the total Salary in each group.

o AS NumberOfEmployees, AS AverageSalary, AS TotalSalary: These create aliases for the calculated columns, making the output more readable.

  • FROM Employees e JOIN Departments d ON e.DepartmentID = d.DepartmentID: This performs an INNER JOIN between the Employees and Departments tables on their common DepartmentID column. This is necessary to link employees to their department names. We use e and d as aliases for brevity.
  • GROUP BY d.DepartmentName: This is the crucial part. It groups all rows that have the same DepartmentName together. The aggregate functions then operate on these distinct groups.
  • ORDER BY d.DepartmentName: This just sorts the final result by department name for clarity.

Pandas Approach: Using .groupby() and .agg()

Pandas has a very similar and intuitive way to handle aggregations using the .groupby() method, often followed by .agg() (or direct aggregation methods like .mean(), .sum(), .count()).

Code Example (Pandas):

Let's perform the same aggregation using our DataFrames. First, we need to merge the DataFrames to get department names.

Python

import pandas as pd
import numpy as np

# Re-creating DataFrames for context
departments_data = {
    'DepartmentID': [101, 102, 103, 104],
    'DepartmentName': ['Sales', 'Marketing', 'Engineering', 'HR']
}
df_departments = pd.DataFrame(departments_data)

employees_data = {
    'EmployeeID': [1, 2, 3, 4, 5, 6, 7],
    'FirstName': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace'],
    'LastName': ['Smith', 'Johnson', 'Brown', 'Prince', 'Adams', 'White', 'Lee'],
    'DepartmentID': [101, 102, 101, 103, 102, 103, np.nan],
    'Salary': [60000.00, 75000.00, 62000.00, 90000.00, 70000.00, 95000.00, 55000.00],
    'HireDate': pd.to_datetime(['2020-01-15', '2019-03-20', '2021-07-01', '2018-11-10', '2022-05-22', '2017-09-01', '2023-01-30']),
    'Email': ['alice.smith@example.com', 'bob.johnson@example.com', 'charlie.brown@example.com', 'diana.prince@example.com', 'eve.adams@example.com', 'frank.white@example.com', 'grace.lee@example.com']
}
df_employees = pd.DataFrame(employees_data)

# Merge DataFrames first (similar to SQL JOIN)
merged_df = pd.merge(df_employees, df_departments, on='DepartmentID', how='inner')

# Perform aggregation
department_summary = merged_df.groupby('DepartmentName').agg(
    NumberOfEmployees=('EmployeeID', 'count'),
    AverageSalary=('Salary', 'mean'),
    TotalSalary=('Salary', 'sum')
).reset_index()  # .reset_index() turns the grouped column back into a regular column

print(department_summary)

Explanation:

  • merged_df = pd.merge(df_employees, df_departments, on='DepartmentID', how='inner'): Before grouping, we need to combine the employee and department information. We use pd.merge (which we'll cover more in the "Joining" section) to do an inner join on DepartmentID.
  • merged_df.groupby('DepartmentName'): This is the equivalent of SQL's GROUP BY clause. It creates a GroupBy object, which is a temporary structure that groups rows by unique values in the DepartmentName column.
  • .agg(...): The .agg() method allows you to apply one or more aggregation functions to one or more columns.

o NumberOfEmployees=('EmployeeID', 'count'): This is named aggregation, introduced in pandas 0.25. It creates a new column named NumberOfEmployees by counting the EmployeeID values in each group.

o AverageSalary=('Salary', 'mean'): Creates AverageSalary by calculating the mean of Salary.

o TotalSalary=('Salary', 'sum'): Creates TotalSalary by calculating the sum of Salary.

o Alternatively, you could use a dictionary format:

Python

# Note: passing a list of functions for a column produces MultiIndex column labels
merged_df.groupby('DepartmentName').agg({
    'EmployeeID': 'count',
    'Salary': ['mean', 'sum']
})

Or apply direct methods:

Python

merged_df.groupby('DepartmentName')['Salary'].mean()
merged_df.groupby('DepartmentName')['EmployeeID'].count()

  • .reset_index(): After grouping, the DepartmentName becomes the index of the resulting DataFrame. Calling .reset_index() converts it back into a regular column, which is often desired for further analysis or display.
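
One subtlety worth noting: the inner merge silently drops Grace, whose DepartmentID is missing, so she lands in no group at all. If you want the unassigned employees counted too, a left merge combined with groupby's dropna=False flag (available since pandas 1.1) keeps them as their own NaN group; a sketch:

Python

# Keep employees without a department; Grace's DepartmentName becomes NaN
left_merged = pd.merge(df_employees, df_departments, on='DepartmentID', how='left')

# dropna=False gives the missing department its own group
summary_with_missing = left_merged.groupby('DepartmentName', dropna=False).agg(
    NumberOfEmployees=('EmployeeID', 'count'),
    AverageSalary=('Salary', 'mean')
).reset_index()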

Both SQL and Pandas offer robust ways to aggregate data. SQL's GROUP BY is very explicit, while Pandas' .groupby() provides a flexible object-oriented approach that integrates seamlessly with other DataFrame operations.

________________________________________

Task 4: Joining Tables/DataFrames

Goal: Combine rows from two or more tables (in SQL) or DataFrames (in Pandas) based on a common column or set of columns. This is essential when your data is spread across multiple related sources. For example, linking employee records to their respective department names.

SQL Approach: Using JOIN Clauses

SQL uses various JOIN types (e.g., INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL OUTER JOIN) to combine tables. The INNER JOIN is the most common and returns only the rows where there's a match in both tables.

Code Example (SQL - Inner Join):

Let's combine our Employees and Departments tables to show each employee's first name, last name, salary, and their department's name.

SQL

SELECT
    e.FirstName,
    e.LastName,
    e.Salary,
    d.DepartmentName
FROM
    Employees e
INNER JOIN
    Departments d ON e.DepartmentID = d.DepartmentID;

Explanation:

  • SELECT e.FirstName, e.LastName, e.Salary, d.DepartmentName: We're selecting columns from both tables. e. and d. are aliases that make the query shorter and clearer, specifying which table each column comes from.
  • FROM Employees e: We start by selecting from the Employees table, aliased as e.
  • INNER JOIN Departments d ON e.DepartmentID = d.DepartmentID: This is the JOIN clause.

o INNER JOIN Departments d: Specifies that we want to join with the Departments table, aliased as d.

o ON e.DepartmentID = d.DepartmentID: This is the join condition. It tells SQL to match rows where the DepartmentID in the Employees table is equal to the DepartmentID in the Departments table. Only rows where this condition is true in both tables will be included in the result.

Briefly on other JOIN types:

  • LEFT JOIN (or LEFT OUTER JOIN): Returns all rows from the left table, and the matched rows from the right table. If there's no match on the right, NULL values appear for right table columns.
  • RIGHT JOIN (or RIGHT OUTER JOIN): Similar to LEFT JOIN, but returns all rows from the right table and matched rows from the left table.
  • FULL OUTER JOIN: Returns all rows when there is a match in one of the tables. If there's no match, NULL values are used. (Note: Not supported by all SQL databases, like MySQL).

Pandas Approach: Using .merge()

Pandas provides the powerful pd.merge() function (or the DataFrame method .merge()) to combine DataFrames. It's very flexible and supports various join types, similar to SQL.

Code Example (Pandas - Inner Merge):

Let's perform the same inner join operation to link df_employees with df_departments.

Python

import pandas as pd
import numpy as np

# Re-creating DataFrames for context
departments_data = {
    'DepartmentID': [101, 102, 103, 104],
    'DepartmentName': ['Sales', 'Marketing', 'Engineering', 'HR']
}
df_departments = pd.DataFrame(departments_data)

employees_data = {
    'EmployeeID': [1, 2, 3, 4, 5, 6, 7],
    'FirstName': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace'],
    'LastName': ['Smith', 'Johnson', 'Brown', 'Prince', 'Adams', 'White', 'Lee'],
    'DepartmentID': [101, 102, 101, 103, 102, 103, np.nan],
    'Salary': [60000.00, 75000.00, 62000.00, 90000.00, 70000.00, 95000.00, 55000.00],
    'HireDate': pd.to_datetime(['2020-01-15', '2019-03-20', '2021-07-01', '2018-11-10', '2022-05-22', '2017-09-01', '2023-01-30']),
    'Email': ['alice.smith@example.com', 'bob.johnson@example.com', 'charlie.brown@example.com', 'diana.prince@example.com', 'eve.adams@example.com', 'frank.white@example.com', 'grace.lee@example.com']
}
df_employees = pd.DataFrame(employees_data)

# Perform the inner merge
merged_employees_departments = pd.merge(
    df_employees,       # Left DataFrame
    df_departments,     # Right DataFrame
    on='DepartmentID',  # Column(s) to join on
    how='inner'         # Type of join
)

# Select desired columns after merge
final_merged_df = merged_employees_departments[['FirstName', 'LastName', 'Salary', 'DepartmentName']]
print(final_merged_df)

Explanation:

  • pd.merge(df_employees, df_departments, ...): We call the merge function, passing the two DataFrames we want to combine.

o df_employees: This is our "left" DataFrame.

o df_departments: This is our "right" DataFrame.

  • on='DepartmentID': This specifies the key column(s) for merging. Pandas will look for matching values in the DepartmentID column in both DataFrames. If the column names were different (e.g., EmpDeptID in df_employees and DeptID in df_departments), you'd use left_on='EmpDeptID', right_on='DeptID'.
  • how='inner': This argument dictates the type of merge, mirroring SQL's JOIN types:

o 'inner' (default): Returns only rows where the key exists in both DataFrames.

o 'left': Returns all rows from the left DataFrame and matching rows from the right.

o 'right': Returns all rows from the right DataFrame and matching rows from the left.

o 'outer': Returns all rows from both DataFrames, filling NaN where no match exists.

  • final_merged_df = merged_employees_departments[['FirstName', 'LastName', 'Salary', 'DepartmentName']]: After the merge, the resulting DataFrame merged_employees_departments will contain all columns from both original DataFrames. We then select only the specific columns we want for the final output, just like in the "Selecting Columns" task.
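
The how argument is easiest to appreciate on our sample data: the inner merge drops Grace (no DepartmentID) and never shows HR (no employees), while left and outer merges keep them with NaN gaps. A quick sketch:

Python

# Left merge: all 7 employees survive; Grace's DepartmentName is NaN
left_df = pd.merge(df_employees, df_departments, on='DepartmentID', how='left')

# Outer merge: additionally keeps HR, which has no matching employees
outer_df = pd.merge(df_employees, df_departments, on='DepartmentID', how='outer')

print(len(left_df), len(outer_df))  # 7 and 8 rows, respectively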

Both SQL's JOIN and Pandas' .merge() are incredibly powerful for integrating disparate datasets. While the syntax differs, the underlying concepts of matching records on common keys are very similar.

________________________________________

Task 5: Sorting Data

Goal: Arrange the rows of a dataset in a specific order, either ascending (A-Z, 0-9) or descending (Z-A, 9-0), based on the values in one or more columns. For example, listing employees by salary from highest to lowest.

SQL Approach: Using ORDER BY

In SQL, the ORDER BY clause is used to sort the result set of a query. You specify the column(s) to sort by and whether the order should be ASC (ascending, which is the default) or DESC (descending).

Code Example (SQL):

Let's retrieve employees, sorted by Salary in descending order, and then by LastName in ascending order for employees with the same salary.

SQL

SELECT
    EmployeeID,
    FirstName,
    LastName,
    Salary,
    HireDate
FROM
    Employees
ORDER BY
    Salary DESC,
    LastName ASC;

Explanation:

  • SELECT ... FROM Employees: Standard selection from the Employees table.
  • ORDER BY Salary DESC, LastName ASC: This is the sorting clause.

o Salary DESC: The primary sort is by the Salary column in descending order (highest salary first).

o LastName ASC: If two employees have the exact same salary, they will then be sorted by their LastName in ascending order (alphabetical). This demonstrates multi-column sorting.

Pandas Approach: Using .sort_values()

Pandas DataFrames have a very convenient method called .sort_values() that allows you to sort rows by one or more columns, specifying the ascending or descending order.

Code Example (Pandas):

Using our df_employees DataFrame, let's replicate the sorting from the SQL example.

Python

import pandas as pd
import numpy as np

# Re-creating df_employees for context
employees_data = {
    'EmployeeID': [1, 2, 3, 4, 5, 6, 7],
    'FirstName': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace'],
    'LastName': ['Smith', 'Johnson', 'Brown', 'Prince', 'Adams', 'White', 'Lee'],
    'DepartmentID': [101, 102, 101, 103, 102, 103, np.nan],
    'Salary': [60000.00, 75000.00, 62000.00, 90000.00, 70000.00, 95000.00, 55000.00],
    'HireDate': pd.to_datetime(['2020-01-15', '2019-03-20', '2021-07-01', '2018-11-10', '2022-05-22', '2017-09-01', '2023-01-30']),
    'Email': ['alice.smith@example.com', 'bob.johnson@example.com', 'charlie.brown@example.com', 'diana.prince@example.com', 'eve.adams@example.com', 'frank.white@example.com', 'grace.lee@example.com']
}
df_employees = pd.DataFrame(employees_data)

# Sort employees
sorted_employees_df = df_employees.sort_values(
    by=['Salary', 'LastName'],
    ascending=[False, True]
)
print(sorted_employees_df)

Explanation:

  • df_employees.sort_values(...): This method is called directly on the DataFrame.
  • by=['Salary', 'LastName']: The by argument takes a list of column names that you want to sort by. The order in this list determines the priority of sorting (primary, secondary, etc.).
  • ascending=[False, True]: This is a list of booleans that corresponds to the by list.

o False for Salary means sort in descending order.

o True for LastName means sort in ascending order.

o If you only had one column to sort by, you could pass a single column name and a single boolean: df.sort_values(by='Salary', ascending=False).
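
A related detail: when the sort column contains missing values, .sort_values() puts them last by default; the na_position argument controls this. A small sketch using DepartmentID, where Grace's value is NaN:

Python

# NaN rows go last by default; na_position='first' moves them to the top
by_dept = df_employees.sort_values(by='DepartmentID', na_position='first')
print(by_dept[['FirstName', 'DepartmentID']])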

Both SQL's ORDER BY and Pandas' .sort_values() provide intuitive ways to order your data. The key is remembering to specify the order (ASC/DESC in SQL, True/False in Pandas) for each column you're sorting by.

________________________________________

Task 6: Handling Missing Values

Goal: Identify, understand, and manage data points that are absent or 'null'. This can involve finding them, removing rows/columns with missing data, or filling them with a suitable value. In our df_employees data, Grace Lee's DepartmentID is np.nan (Not a Number), Pandas' representation of a missing value.

SQL Approach: Using IS NULL / IS NOT NULL and COALESCE / IFNULL

SQL typically uses IS NULL and IS NOT NULL to identify missing values. For replacing them, functions like COALESCE (standard SQL) or IFNULL (MySQL/SQLite) are common.

Code Example (SQL):

1. Finding rows with missing values:

SQL

SELECT
    EmployeeID,
    FirstName,
    LastName,
    DepartmentID,
    Salary
FROM
    Employees
WHERE
    DepartmentID IS NULL;

Explanation: The WHERE DepartmentID IS NULL clause precisely identifies rows where the DepartmentID column has no value.

2. Replacing missing values (e.g., with a default 'Unknown' department ID):

SQL

SELECT
    EmployeeID,
    FirstName,
    LastName,
    COALESCE(DepartmentID, 999) AS DepartmentID_Cleaned, -- 999 could be an 'Unknown' department
    Salary
FROM
    Employees;

-- For MySQL/SQLite, IFNULL is equivalent:
-- SELECT
--     EmployeeID,
--     FirstName,
--     LastName,
--     IFNULL(DepartmentID, 999) AS DepartmentID_Cleaned,
--     Salary
-- FROM
--     Employees;


Explanation:

  • COALESCE(DepartmentID, 999): This function takes multiple arguments and returns the first non-NULL expression in the list. If DepartmentID is not NULL, it returns DepartmentID. If DepartmentID is NULL, it returns 999. This is a common way to impute (fill in) missing values with a default.

Pandas Approach: Using .isna(), .dropna(), and .fillna()

Pandas provides a rich set of methods for detecting, dropping, and filling missing values (represented as NaN or None).

Code Example (Pandas):

Python

import pandas as pd
import numpy as np

# Re-creating df_employees for context
employees_data = {
    'EmployeeID': [1, 2, 3, 4, 5, 6, 7],
    'FirstName': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace'],
    'LastName': ['Smith', 'Johnson', 'Brown', 'Prince', 'Adams', 'White', 'Lee'],
    'DepartmentID': [101, 102, 101, 103, 102, 103, np.nan],  # Grace has NaN
    'Salary': [60000.00, 75000.00, 62000.00, 90000.00, 70000.00, 95000.00, 55000.00],
    'HireDate': pd.to_datetime(['2020-01-15', '2019-03-20', '2021-07-01', '2018-11-10', '2022-05-22', '2017-09-01', '2023-01-30']),
    'Email': ['alice.smith@example.com', 'bob.johnson@example.com', 'charlie.brown@example.com', 'diana.prince@example.com', 'eve.adams@example.com', 'frank.white@example.com', 'grace.lee@example.com']
}
df_employees = pd.DataFrame(employees_data)

print("Original DataFrame:")
print(df_employees)

print("\n--- Finding rows with missing DepartmentID ---")
# Using .isna() (or .isnull()) to find missing values
missing_dept_df = df_employees[df_employees['DepartmentID'].isna()]
print(missing_dept_df)

print("\n--- Dropping rows with any missing values ---")
# Creates a new DataFrame with rows containing any NaN dropped
# Be cautious: this can remove a lot of data if many NaNs exist
df_no_nan_rows = df_employees.dropna()
print(df_no_nan_rows)

print("\n--- Filling missing DepartmentID with 999 ---")
# Fills NaN values in the 'DepartmentID' column with 999
# .fillna() returns a new object, so assign the result
df_filled_dept = df_employees.copy()  # Work on a copy to preserve the original
df_filled_dept['DepartmentID'] = df_filled_dept['DepartmentID'].fillna(999).astype(int)  # .astype(int) converts float to int after filling
print(df_filled_dept)

print("\n--- Filling missing values using a strategy (e.g., forward fill) ---")
# Example: fill missing values with the previous valid observation
# (.ffill() replaces the deprecated fillna(method='ffill') spelling)
df_ffill = df_employees.ffill()
# print(df_ffill)  # Uncomment to see the result

Explanation:

  • Identifying:

o df_employees['DepartmentID'].isna(): This method returns a boolean Series (True for NaN, False otherwise). You can then use this Series for boolean indexing to filter rows, just like we did for filtering.

o df_employees.isna().sum(): A useful way to count missing values per column.

  • Dropping:

o df_employees.dropna(axis=0): Drops rows where any NaN values are present (axis=0 is default for rows).

o df_employees.dropna(axis=1): Drops columns where any NaN values are present.

o df_employees.dropna(subset=['DepartmentID']): Drops rows only if NaN is in the DepartmentID column.

  • Filling:

o df_employees['DepartmentID'].fillna(999): This fills all NaN values in the DepartmentID Series with the value 999.

o df_employees.ffill(): Forward fill replaces a NaN with the last valid observation. (The older fillna(method='ffill') spelling is deprecated in recent Pandas versions.)

o df_employees.bfill(): Backward fill replaces a NaN with the next valid observation.

o You can also fill with calculated values, like the mean or median: df_employees['Salary'].fillna(df_employees['Salary'].mean()).

  • .astype(int): After filling np.nan (which makes the column a float type), if your original column was intended to be an integer (like DepartmentID), you'll need to cast it back to int using .astype(int).
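
If you'd rather keep the missing value and an integer dtype at the same time, recent Pandas versions offer the nullable Int64 extension type, which stores pd.NA alongside integers; a short sketch:

Python

# Nullable integer dtype: keeps Grace's missing department as <NA> instead of forcing a fill
df_nullable = df_employees.copy()
df_nullable['DepartmentID'] = df_nullable['DepartmentID'].astype('Int64')
print(df_nullable['DepartmentID'])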

Both SQL and Pandas offer robust functionalities for handling missing data, but their approaches differ. SQL uses NULL as a distinct state and provides operators and functions to work with it, while Pandas uses NaN (Not a Number) and offers methods that integrate seamlessly with DataFrame operations.

________________________________________

Conclusion

We've explored six fundamental data manipulation tasks, comparing how they are performed using SQL and Python Pandas.

  • Filtering: SQL's WHERE clause versus Pandas' boolean indexing.
  • Selecting Columns: SQL's explicit SELECT list versus Pandas' list of column names within [[]].
  • Aggregating Data: SQL's GROUP BY with aggregate functions versus Pandas' .groupby().agg().
  • Joining Data: SQL's various JOIN types versus Pandas' flexible .merge() function.
  • Sorting Data: SQL's ORDER BY clause versus Pandas' .sort_values() method.
  • Handling Missing Values: SQL's IS NULL and COALESCE versus Pandas' .isna(), .dropna(), and .fillna() methods.

When to Use Which?

  • SQL: Ideal for interacting directly with relational databases, especially for querying large datasets where performance is critical. It's the language of choice for database administrators and backend data operations. SQL queries are often easier to read for simple selections and aggregations.
  • Pandas: Excellent for in-memory data manipulation, especially when working with smaller to medium-sized datasets that fit into RAM. It excels at complex transformations, statistical analysis, data cleaning, and integrates perfectly with the broader Python ecosystem for machine learning, visualization, and scripting. Pandas provides a more programmatic and flexible approach to data wrangling.

Ultimately, mastering both SQL and Pandas equips you with a versatile toolkit for almost any data challenge. Many data professionals find themselves using both tools in their daily workflows, leveraging SQL for initial data extraction and aggregation, and then switching to Pandas for deeper analysis, cleaning, and modeling.