A light-weight, flexible, and expressive statistical data testing library
📊 🔎 ✅
Data validation for scientists, engineers, and analysts seeking correctness.
Pandera is a Union.ai open-source project that provides a flexible and
expressive API for performing data validation on dataframe-like objects. The
goal of Pandera is to make data processing pipelines more readable and robust
with statistically typed dataframes.
Pandera supports multiple dataframe libraries, including pandas, polars, pyspark, and more. To validate pandas DataFrames, install Pandera with the pandas extra:
With pip:
pip install 'pandera[pandas]'
With uv:
uv pip install 'pandera[pandas]'
With conda:
conda install -c conda-forge pandera-pandas
First, create a dataframe:
import pandas as pd
import pandera.pandas as pa
# data to validate
df = pd.DataFrame({
    "column1": [1, 2, 3],
    "column2": [1.1, 1.2, 1.3],
    "column3": ["a", "b", "c"],
})
Validate the data using the object-based API:
# define a schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, pa.Check.ge(0)),
    "column2": pa.Column(float, pa.Check.lt(10)),
    "column3": pa.Column(
        str,
        [
            pa.Check.isin([*"abc"]),
            pa.Check(lambda series: series.str.len() == 1),
        ]
    ),
})
print(schema.validate(df))
#    column1  column2 column3
# 0        1      1.1       a
# 1        2      1.2       b
# 2        3      1.3       c
Or validate the data using the class-based API:
# define a schema
class Schema(pa.DataFrameModel):
    column1: int = pa.Field(ge=0)
    column2: float = pa.Field(lt=10)
    column3: str = pa.Field(isin=[*"abc"])

    @pa.check("column3")
    def custom_check(cls, series: pd.Series) -> pd.Series:
        return series.str.len() == 1

print(Schema.validate(df))
#    column1  column2 column3
# 0        1      1.1       a
# 1        2      1.2       b
# 2        3      1.3       c
[!WARNING]
Pandera v0.24.0 introduces the pandera.pandas module, which is now the
(highly) recommended way of defining DataFrameSchemas and DataFrameModels
for pandas data structures like DataFrames. Defining a dataframe schema from
the top-level pandera module will produce a FutureWarning:
import pandera as pa

schema = pa.DataFrameSchema({"col": pa.Column(str)})
Update your import to:
import pandera.pandas as pa
After that, the rest of your pandera code should work unchanged. Using the
top-level pandera module to access DataFrameSchema and the other pandera
classes or functions will be deprecated in a future version.
See the official documentation to learn more.