Probability Distributions

Learn how to create a codebook by adding variables directly, without starting from a DataFrame.

Overview

Sometimes you want to document variables that aren't in a DataFrame—for example, simulated data, calculated values, or data from external sources. BookIt's add_variable method lets you add individual variables with raw data arrays.

In this tutorial, we'll create a codebook documenting several common probability distributions, each with 1,000 randomly sampled observations.

Generating the Data

First, let's generate samples from various probability distributions using NumPy:

import numpy as np

# Set seed for reproducibility
np.random.seed(42)
n = 1000  # Number of observations per distribution

# Normal distribution: μ=0, σ=1
normal_data = np.random.normal(loc=0, scale=1, size=n)

# Exponential distribution: λ=1.5
exponential_data = np.random.exponential(scale=1/1.5, size=n)

# Uniform distribution: a=0, b=10
uniform_data = np.random.uniform(low=0, high=10, size=n)

# Poisson distribution: λ=5
poisson_data = np.random.poisson(lam=5, size=n)

# Binomial distribution: n=20, p=0.3
binomial_data = np.random.binomial(n=20, p=0.3, size=n)

# Beta distribution: α=2, β=5
beta_data = np.random.beta(a=2, b=5, size=n)

# Gamma distribution: k=2, θ=2
gamma_data = np.random.gamma(shape=2, scale=2, size=n)

Creating the Codebook

Now we use add_variable to add each distribution to the codebook. For each variable, we include the distribution parameters in the context field:

from bookit_df import BookIt

with BookIt(
    "Probability Distributions Codebook",
    output="distributions_codebook.pdf",
    author="Data Science Team"
) as book:

    # Normal distribution
    book.add_variable(
        name="normal",
        description="Standard normal distribution",
        context="Parameters: μ (mean) = 0, σ (std dev) = 1. "
                "The normal distribution is symmetric and bell-shaped, "
                "commonly used to model natural phenomena.",
        data=normal_data
    )

    # Exponential distribution
    book.add_variable(
        name="exponential",
        description="Exponential distribution",
        context="Parameters: λ (rate) = 1.5. "
                "Models time between events in a Poisson process. "
                "Commonly used for survival analysis and reliability engineering.",
        data=exponential_data
    )

    # Uniform distribution
    book.add_variable(
        name="uniform",
        description="Continuous uniform distribution",
        context="Parameters: a (min) = 0, b (max) = 10. "
                "Every value in the interval has equal probability. "
                "Used for random sampling and Monte Carlo simulations.",
        data=uniform_data
    )

    # Poisson distribution
    book.add_variable(
        name="poisson",
        description="Poisson distribution",
        context="Parameters: λ (rate) = 5. "
                "Models count of events in a fixed interval. "
                "Common in queueing theory and epidemiology.",
        data=poisson_data
    )

    # Binomial distribution
    book.add_variable(
        name="binomial",
        description="Binomial distribution",
        context="Parameters: n (trials) = 20, p (success probability) = 0.3. "
                "Models number of successes in a fixed number of trials. "
                "Used in quality control and clinical trials.",
        data=binomial_data
    )

    # Beta distribution
    book.add_variable(
        name="beta",
        description="Beta distribution",
        context="Parameters: α (shape) = 2, β (shape) = 5. "
                "Bounded between 0 and 1, useful for modeling proportions. "
                "Common as a prior in Bayesian statistics.",
        data=beta_data
    )

    # Gamma distribution
    book.add_variable(
        name="gamma",
        description="Gamma distribution",
        context="Parameters: k (shape) = 2, θ (scale) = 2. "
                "Generalizes the exponential distribution. "
                "Used to model waiting times and insurance claims.",
        data=gamma_data
    )

# PDF saved automatically on exit!

Output

Here's what the generated codebook looks like:

Can't see the PDF?

If the embedded viewer doesn't work, you can download the PDF directly.

Key Concepts

Using `add_variable` Directly

The add_variable method allows you to add variables without a DataFrame:

book.add_variable(
    name="variable_name",     # Variable identifier
    description="...",        # What this variable represents
    context="...",           # Additional notes (e.g., parameters)
    data=array_or_list       # Raw data values
)

Context for Parameters

Use the context parameter to document important metadata like distribution parameters:

Distribution	Parameters
Normal	μ (mean), σ (std dev)
Exponential	λ (rate)
Uniform	a (min), b (max)
Poisson	λ (rate)
Binomial	n (trials), p (probability)
Beta	α (shape), β (shape)
Gamma	k (shape), θ (scale)

When to Use `add_variable` vs `from_dataframe`

Use `add_variable` when...	Use `from_dataframe` when...
Data isn't in a DataFrame	Your data is already in a DataFrame
Adding simulated/generated data	Documenting survey or tabular data
Combining data from multiple sources	Processing all columns at once
You need complete control per variable	You want batch processing

Next Steps

See the Getting Started tutorial for DataFrame-based usage
See the Working with mtcars tutorial for a real dataset example
Check the API Reference for all available options