SQL to Pandas: A Comprehensive Guide for Data Manipulation

For data professionals familiar with SQL, transitioning to pandas in Python can significantly enhance data manipulation and analysis workflows. This guide bridges the gap between SQL and pandas by demonstrating how common SQL operations are performed using pandas. Whether you’re looking to leverage Python’s rich ecosystem for data science or seeking more efficient ways to handle data in memory, knowing the pandas equivalents of SQL queries is invaluable.

If you are new to pandas, it’s recommended to first go through a basic introduction to pandas to familiarize yourself with its fundamental data structures and operations.

Let’s begin by importing pandas and NumPy, which are essential libraries for data manipulation in Python:

import pandas as pd
import numpy as np

Throughout this guide, we will use the tips dataset, a common dataset in pandas examples, to illustrate the concepts. We will load this dataset into a pandas DataFrame named tips, mimicking a SQL table named tips with the same structure.

url = (
    "https://raw.githubusercontent.com/pandas-dev"
    "/pandas/main/pandas/tests/io/data/csv/tips.csv"
)
tips = pd.read_csv(url)
tips

Understanding Copies vs. In-place Operations in Pandas

A crucial concept to grasp when moving from SQL to pandas is how operations are handled in pandas. Most pandas operations are designed to return copies of the DataFrame or Series, rather than modifying the original data in place. This behavior is different from some SQL operations and requires a shift in mindset.

To retain changes made by a pandas operation, you typically need to assign the result to a new variable or overwrite the original DataFrame.

For example, to sort a DataFrame by a column, you would do:

sorted_tips = tips.sort_values("total_bill") # Creates a new sorted DataFrame
tips = tips.sort_values("total_bill") # Overwrites the original DataFrame with the sorted one

It’s important to note that while some pandas methods offer inplace=True or copy=False parameters, their use is being actively discouraged in favor of the copy-on-write mechanism for better performance and predictability. For most methods, including dropna, these parameters might be deprecated in future versions of pandas. Methods like replace may retain inplace. You can find more details on this proposal in the pandas development discussions.
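A minimal sketch of the two styles, using a small throwaway frame rather than the tips data; the reassignment form is the one that stays compatible with copy-on-write:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0]})

# Discouraged: mutating the frame in place via a keyword flag
# df.dropna(inplace=True)

# Preferred: assign the returned copy back to the name
df = df.dropna()
```

Either form removes the NaN row; the difference is only in how the result is retained.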

SELECT Statements: Column Selection in Pandas

In SQL, the SELECT statement is used to choose specific columns from a table.

SELECT total_bill, tip, smoker, time FROM tips;

In pandas, you select columns by passing a list of column names to the DataFrame using bracket notation.

tips[["total_bill", "tip", "smoker", "time"]]

To select all columns in SQL, you use SELECT *. In pandas, simply calling the DataFrame name without specifying columns displays all columns, similar to SELECT *.

Calculated Columns: Enhancing Selection with New Data

SQL allows creating calculated columns within a SELECT statement, adding new information derived from existing columns.

SELECT *, tip / total_bill AS tip_rate FROM tips;

Pandas offers the powerful assign() method to create new columns based on existing ones. This is a clean and efficient way to add calculated fields to your DataFrame.

tips.assign(tip_rate=tips["tip"] / tips["total_bill"])
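assign() also accepts a callable, which is handy in method chains where the intermediate DataFrame has no name; the lambda receives the DataFrame built up so far. A small sketch on an illustrative frame (not the tips data):

```python
import pandas as pd

bills = pd.DataFrame({"total_bill": [10.0, 20.0], "tip": [1.0, 5.0]})

# The callable form works even when the frame is an unnamed chain result
result = bills.assign(tip_rate=lambda df: df["tip"] / df["total_bill"])
```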

WHERE Clause: Filtering Data in Pandas

The SQL WHERE clause is used to filter rows based on specified conditions.

SELECT * FROM tips WHERE time = 'Dinner';

Pandas offers several ways to filter DataFrames, with boolean indexing being the most common and intuitive. This method involves creating a boolean Series from a condition and using it to select rows from the DataFrame. For example, the equivalent of SELECT * FROM tips WHERE total_bill > 10 is:

tips[tips["total_bill"] > 10]

This works because tips["total_bill"] > 10 returns a Series of True and False values, which is then used to index the DataFrame, selecting only the rows where the condition is True.

Returning to the Dinner query above, the boolean mask can also be stored in a variable and inspected before use:

is_dinner = tips["time"] == "Dinner"
is_dinner.value_counts()
tips[is_dinner]

Just like SQL’s AND and OR operators, pandas uses & (AND) and | (OR) to combine multiple conditions. Remember to enclose each condition in parentheses to ensure correct operator precedence.

Combining Conditions: Advanced Filtering Techniques

To find tips greater than $5 at Dinner:

SELECT * FROM tips WHERE time = 'Dinner' AND tip > 5.00;

In pandas:

tips[(tips["time"] == "Dinner") & (tips["tip"] > 5.00)]

To find tips from parties of 5 or more OR where the total bill exceeded $45:

SELECT * FROM tips WHERE size >= 5 OR total_bill > 45;

In pandas:

tips[(tips["size"] >= 5) | (tips["total_bill"] > 45)]

Handling NULL Values: Checking for Missing Data

SQL uses IS NULL and IS NOT NULL to check for missing values (NULL). Pandas uses the isna() and notna() methods, which are essential for data cleaning and manipulation.

Let’s create a DataFrame with some missing values:

frame = pd.DataFrame({"col1": ["A", "B", np.nan, "C", "D"], "col2": ["F", np.nan, "G", "H", "I"]})
frame

To select rows where col2 is NULL in SQL:

SELECT * FROM frame WHERE col2 IS NULL;

In pandas:

frame[frame["col2"].isna()]

To select rows where col1 is NOT NULL in SQL:

SELECT * FROM frame WHERE col1 IS NOT NULL;

In pandas:

frame[frame["col1"].notna()]

GROUP BY Clause: Aggregating Data in Pandas

SQL’s GROUP BY clause is used to group rows with the same values in one or more columns and then perform aggregate functions on each group. Pandas achieves this using the groupby() method, a cornerstone of data analysis in pandas.

A common SQL operation is to count records within each group. For example, to count tips by sex:

SELECT sex, COUNT(*) FROM tips GROUP BY sex;
/*
Female 87
Male 157
*/

The pandas equivalent using groupby() and size():

tips.groupby("sex").size()

It’s important to use size() instead of count() here. count() in pandas, when used with groupby(), applies the count function to each column, returning the count of non-NA values in each column for each group.

tips.groupby("sex").count()

To get the count for a specific column, like total_bill, within each group:

tips.groupby("sex")["total_bill"].count()
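The size()/count() distinction only shows up when a column contains missing values: size() counts rows, while count() skips NaN. A small illustrative frame (not the tips data) makes this concrete:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"grp": ["x", "x", "y"], "val": [1.0, np.nan, 2.0]})

rows_per_group = df.groupby("grp").size()          # counts every row per group
vals_per_group = df.groupby("grp")["val"].count()  # counts only non-NA values
```

Group "x" has two rows but only one non-NA value, so the two results differ.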

Applying Multiple Aggregations: Gaining Deeper Insights

To apply multiple aggregation functions at once, pandas uses the agg() method. This allows for flexible and powerful data summarization.

For example, to see the average tip amount and the number of records for each day:

SELECT day, AVG(tip), COUNT(*) FROM tips GROUP BY day;
/*
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76
Thu 2.771452 62
*/

In pandas:

tips.groupby("day").agg({"tip": "mean", "day": "size"})

Grouping by multiple columns is straightforward in pandas; just pass a list of column names to groupby().

SELECT smoker, day, COUNT(*), AVG(tip) FROM tips GROUP BY smoker, day;
/*
smoker day
No Fri 4 2.812500
 Sat 45 3.102889
 Sun 57 3.167895
 Thu 45 2.673778
Yes Fri 15 2.714000
 Sat 42 2.875476
 Sun 19 3.516842
 Thu 17 3.030000
*/

In pandas, to get size and mean of tip, grouped by smoker and day:

tips.groupby(["smoker", "day"]).agg({"tip": ["size", "mean"]})

JOIN Operations: Merging DataFrames in Pandas

SQL JOIN operations combine rows from two or more tables based on a related column. Pandas offers the merge() and join() methods to perform similar operations on DataFrames; merge() is the more versatile and commonly used of the two.

By default, join() joins DataFrames based on their indices. merge() is more flexible, allowing joins on columns or indices, and supports different types of joins (INNER, LEFT, RIGHT, FULL OUTER).

Warning: Be mindful of null values in join keys. Pandas merge() will match rows where the key is null, which differs from standard SQL behavior and can lead to unexpected results if not handled carefully.
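A quick demonstration of this pitfall with two tiny frames: pandas matches the NaN keys to each other, whereas SQL’s NULL = NULL comparison is not true and an inner join would drop those rows.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"key": ["A", np.nan], "lval": [1, 2]})
right = pd.DataFrame({"key": ["A", np.nan], "rval": [3, 4]})

# Inner join: the NaN row survives because pandas matches NaN keys to each other
matched = pd.merge(left, right, on="key")
```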

Let’s create two DataFrames for join examples:

df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)})

INNER JOIN: Matching Rows

SQL INNER JOIN returns rows only when there is a match in both tables based on the join condition.

SELECT * FROM df1 INNER JOIN df2 ON df1.key = df2.key;

Pandas merge() performs an INNER JOIN by default when you specify the on column:

pd.merge(df1, df2, on="key")

merge() also supports joining on an index. To join df1’s “key” column against df2’s index:

indexed_df2 = df2.set_index("key")
pd.merge(df1, indexed_df2, left_on="key", right_index=True)
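For completeness, the same inner join can be written with join(), which aligns on indices by default; since both frames share a “value” column, suffixes are required. A sketch with deterministic stand-in frames (the df1/df2 above use random values):

```python
import pandas as pd

d1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [1, 2, 3, 4]})
d2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": [5, 6, 7, 8]})

# join() aligns on the index, so set the join key as the index on both sides
joined = d1.set_index("key").join(
    d2.set_index("key"), how="inner", lsuffix="_x", rsuffix="_y"
)
```

Only keys present on both sides survive, with D appearing twice because it occurs twice in d2.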

LEFT OUTER JOIN: Preserving Left Table Rows

SQL LEFT OUTER JOIN returns all rows from the left table (df1) and the matching rows from the right table (df2). If there’s no match in the right table, NULL values are returned for the right table’s columns.

SELECT * FROM df1 LEFT OUTER JOIN df2 ON df1.key = df2.key;

In pandas, use how="left" in merge():

pd.merge(df1, df2, on="key", how="left")

RIGHT JOIN: Preserving Right Table Rows

SQL RIGHT OUTER JOIN is similar to LEFT JOIN, but it preserves all rows from the right table (df2) and matches from the left table (df1).

SELECT * FROM df1 RIGHT OUTER JOIN df2 ON df1.key = df2.key;

In pandas, use how="right" in merge():

pd.merge(df1, df2, on="key", how="right")

FULL OUTER JOIN: Preserving All Rows

SQL FULL OUTER JOIN (or FULL JOIN) returns all rows from both tables, with matches where possible, and NULLs where there’s no match. Note that not all database systems support FULL OUTER JOIN (MySQL, for example, does not).

SELECT * FROM df1 FULL OUTER JOIN df2 ON df1.key = df2.key;

Pandas supports FULL OUTER JOIN using how="outer" in merge():

pd.merge(df1, df2, on="key", how="outer")

UNION Operations: Combining DataFrames

SQL UNION and UNION ALL combine the result-sets of two or more SELECT statements. UNION ALL includes all rows, including duplicates, while UNION removes duplicate rows. The pandas concat() function is analogous to UNION ALL.

Let’s create two DataFrames for UNION examples:

df1 = pd.DataFrame({"city": ["Chicago", "San Francisco", "New York City"], "rank": range(1, 4)})
df2 = pd.DataFrame({"city": ["Chicago", "Boston", "Los Angeles"], "rank": [1, 4, 5]})

UNION ALL: Combining All Rows

SQL UNION ALL simply stacks result sets together.

SELECT city, rank FROM df1 UNION ALL SELECT city, rank FROM df2;
/*
 city rank
 Chicago 1
San Francisco 2
New York City 3
 Chicago 1
 Boston 4
 Los Angeles 5
*/

Pandas concat() performs UNION ALL by default:

pd.concat([df1, df2])

UNION: Combining Unique Rows

SQL UNION combines result sets and removes duplicate rows.

SELECT city, rank FROM df1 UNION SELECT city, rank FROM df2;
-- notice that there is only one Chicago record this time
/*
 city rank
 Chicago 1
San Francisco 2
New York City 3
 Boston 4
 Los Angeles 5
*/

To achieve UNION in pandas (removing duplicates), use concat() followed by drop_duplicates():

pd.concat([df1, df2]).drop_duplicates()

LIMIT Clause: Selecting a Subset of Rows

SQL LIMIT clause restricts the number of rows returned by a query.

SELECT * FROM tips LIMIT 10;

The pandas head() method returns the first n rows of a DataFrame, effectively mimicking LIMIT. This is useful for quickly inspecting data.

tips.head(10)

Pandas Equivalents for SQL Analytic and Aggregate Functions

Pandas provides powerful functions that mirror SQL’s analytic and aggregate capabilities, allowing for complex data manipulations beyond basic queries.

Top N Rows with Offset: Finding Specific Rank Ranges

Getting the top N rows with an offset mirrors SQL’s LIMIT with OFFSET (useful, for example, for pagination or ranking).

-- MySQL
SELECT * FROM tips ORDER BY tip DESC LIMIT 10 OFFSET 5;

In pandas, use nlargest() to take the top N plus offset rows, then tail() to keep only the last N:

tips.nlargest(10 + 5, columns="tip").tail(10)
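An alternative sketch: sort descending and slice with iloc, which maps one-to-one onto OFFSET and LIMIT (illustrated on a small frame rather than the full tips data):

```python
import pandas as pd

df = pd.DataFrame({"tip": [9, 8, 7, 6, 5, 4, 3, 2, 1]})

# ORDER BY tip DESC LIMIT 3 OFFSET 2  ->  skip 2 rows, take the next 3
page = df.sort_values("tip", ascending=False).iloc[2:5]
```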

Top N Rows per Group: Ranking Within Categories

Getting the top N rows within each group is similar to using SQL’s analytic functions ROW_NUMBER() or RANK().

-- Oracle's ROW_NUMBER() analytic function
SELECT * FROM (
  SELECT t.*, ROW_NUMBER() OVER (PARTITION BY day ORDER BY total_bill DESC) AS rn
  FROM tips t
) WHERE rn < 3 ORDER BY day, rn;

Pandas achieves this using a combination of groupby(), sort_values(), cumcount(), and filtering, showcasing the flexibility of pandas for complex data tasks.

(
    tips
    .assign(rn=tips.sort_values(["total_bill"], ascending=False)
            .groupby(["day"])
            .cumcount() + 1)
    .query("rn < 3")
    .sort_values(["day", "rn"])
)

Alternatively, use rank(method='first') for a more concise approach:

(
    tips
    .assign(rnk=tips.groupby(["day"])["total_bill"]
            .rank(method='first', ascending=False))
    .query("rnk < 3")
    .sort_values(["day", "rnk"])
)

Using RANK() equivalent in pandas with rank(method='min'):

-- Oracle's RANK() analytic function
SELECT * FROM (
  SELECT t.*, RANK() OVER (PARTITION BY sex ORDER BY tip) AS rnk
  FROM tips t
  WHERE tip < 2
) WHERE rnk < 3 ORDER BY sex, rnk;

In pandas:

(
    tips[tips["tip"] < 2]
    .assign(rnk_min=tips.groupby(["sex"])["tip"]
            .rank(method='min'))
    .query("rnk_min < 3")
    .sort_values(["sex", "rnk_min"])
)

UPDATE Statements: Modifying Data in Pandas

SQL UPDATE statement modifies existing data in a table based on conditions.

UPDATE tips SET tip = tip * 2 WHERE tip < 2;

In pandas, you can update values using .loc with boolean indexing, directly modifying the DataFrame.

tips.loc[tips["tip"] < 2, "tip"] *= 2

DELETE Statements: Removing Data in Pandas

SQL DELETE statement removes rows from a table based on conditions.

DELETE FROM tips WHERE tip > 9;

The typical pandas approach to deleting rows is to select the rows you want to keep (the inverse of the DELETE condition) and reassign the DataFrame, in line with the principle of creating copies rather than modifying in place.

tips = tips.loc[tips["tip"] <= 9]
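Equivalently, drop() can remove the offending rows by index; a minimal sketch on a throwaway frame:

```python
import pandas as pd

df = pd.DataFrame({"tip": [1.0, 10.0, 3.0]})

# DELETE FROM df WHERE tip > 9: drop the rows whose index matches the condition
df = df.drop(df[df["tip"] > 9].index)
```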

This guide provides a solid foundation for SQL users transitioning to pandas. By understanding these equivalent operations, you can effectively leverage the power and flexibility of pandas in your data analysis workflows.
