pandera

A light-weight, flexible, and expressive statistical data testing library

unionai-oss

Python

The Open-source Framework for Validating DataFrame-like Objects

📊 🔎 ✅

Data validation for scientists, engineers, and analysts seeking correctness.

Pandera is a Union.ai open-source
project that provides a flexible and expressive API for performing data
validation on dataframe-like objects. The goal of Pandera is to make data
processing pipelines more readable and robust with statistically typed
dataframes.

Install

Pandera supports multiple dataframe libraries, including pandas, polars, pyspark, and more. To validate pandas DataFrames, install Pandera with the pandas extra:

With pip:

pip install 'pandera[pandas]'

With uv:

uv pip install 'pandera[pandas]'

With conda:

conda install -c conda-forge pandera-pandas

Get started

First, create a dataframe:

import pandas as pd
import pandera.pandas as pa

# data to validate
df = pd.DataFrame({
    "column1": [1, 2, 3],
    "column2": [1.1, 1.2, 1.3],
    "column3": ["a", "b", "c"],
})

Validate the data using the object-based API:

# define a schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, pa.Check.ge(0)),
    "column2": pa.Column(float, pa.Check.lt(10)),
    "column3": pa.Column(
        str,
        [
            pa.Check.isin([*"abc"]),
            pa.Check(lambda series: series.str.len() == 1),
        ]
    ),
})

print(schema.validate(df))
#    column1  column2 column3
# 0        1      1.1       a
# 1        2      1.2       b
# 2        3      1.3       c

Or validate the data using the class-based API:

# define a schema
class Schema(pa.DataFrameModel):
    column1: int = pa.Field(ge=0)
    column2: float = pa.Field(lt=10)
    column3: str = pa.Field(isin=[*"abc"])

    @pa.check("column3")
    def custom_check(cls, series: pd.Series) -> pd.Series:
        return series.str.len() == 1

print(Schema.validate(df))
#    column1  column2 column3
# 0        1      1.1       a
# 1        2      1.2       b
# 2        3      1.3       c

[!WARNING]
Pandera v0.24.0 introduces the pandera.pandas module, which is now the
(highly) recommended way of defining DataFrameSchemas and DataFrameModels
for pandas data structures like DataFrames. Defining a dataframe schema from
the top-level pandera module will produce a FutureWarning:

import pandera as pa

schema = pa.DataFrameSchema({"col": pa.Column(str)})

Update your import to:

import pandera.pandas as pa

The rest of your pandera code should then work unchanged. Using the top-level
pandera module to access DataFrameSchema and the other pandera classes
or functions will be deprecated in version 0.29.0.

Next steps

See the official documentation to learn more.