Reference

`overview_na(df)`

Plots an overview of missing values by variable.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input data frame	required

Returns:

Type	Description
`Figure`	matplotlib.figure.Figure: Bar plot visualizing the number of missing values per variable
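
The bars in the plot correspond to the per-column count of missing values, `df.isna().sum()`. As a minimal sketch of that quantity (the toy data frame and its column names below are hypothetical and only serve as illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
toy = pd.DataFrame({
    "country": ["FRA", "FRA", "DEU", None],
    "year": [1990, 1991, np.nan, 1993],
    "gdp": [np.nan, np.nan, 2.1, 3.4],
})

# Per-column count of missing values: the series that is drawn as a
# horizontal bar chart via .plot(kind="barh").
na_counts = toy.isna().sum()
print(na_counts)
```

Here `gdp` has two missing entries while `country` and `year` have one each, so the `gdp` bar would be the longest.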

Source code in overviewpy/overviewpy.py

import matplotlib.figure
import matplotlib.pyplot as plt
import pandas as pd


def overview_na(df: pd.DataFrame) -> matplotlib.figure.Figure:
    """Plots an overview of missing values by variable.

    Args:
        df (pd.DataFrame): Input data frame

    Returns:
        matplotlib.figure.Figure: Bar plot visualizing the number of missing values per variable
    """
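    # Count missing values per column and draw them as horizontal bars
    ax = df.isna().sum().plot(kind="barh")
    ax.set_xlabel("Count")
    ax.set_ylabel("Columns")
    ax.set_title("Missing Values Overview")
    # Keep a handle on the figure so it can be returned as documented
    fig = ax.get_figure()
    plt.show()
    return fig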

`overview_tab(df, id, time)`

Generates a tabular overview of the sample and returns it as a data frame. The overview is a two-column table: the left column holds the id, and the right column holds the time frame covered for that id.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input data frame	required
`id`	`str`	Identifies the id column (for instance, country)	required
`time`	`int`	Identifies the time column (for instance, years). This argument can currently handle simple integer values (YYYY or YY, for instance). Support for more complex dates (YYYY-MM-DD, for instance) is planned as a future feature.	required
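
Because the time column is compared as integers, full dates need to be reduced to a plain year first. One possible workaround, shown here only as a sketch (the `date` and `year` column names are hypothetical, and this is not a feature of the package), is to derive an integer year column with pandas before building the overview:

```python
import pandas as pd

# Hypothetical toy data with full ISO dates instead of plain years.
toy = pd.DataFrame({
    "country": ["FRA", "FRA", "DEU"],
    "date": ["1990-05-01", "1991-07-15", "1993-01-31"],
})

# Parse the date strings and keep only the integer year, which is the
# kind of value the time column is described as handling.
toy["year"] = pd.to_datetime(toy["date"]).dt.year
print(toy[["country", "year"]])
```

The resulting `year` column can then serve as the time column in place of the original date strings.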

Returns:

Type	Description
`DataFrame`	pd.DataFrame: A reduced data frame that shows a cohesive overview of the data frame

Source code in overviewpy/overviewpy.py

def overview_tab(df: pd.DataFrame, id: str, time: int) -> pd.DataFrame:
    """Generates a tabular overview of the sample and returns it as a data frame.
    The overview is a two-column table with the id in the left column and the
    time frame covered for that id in the right column.

    Args:
        df (pd.DataFrame): Input data frame
        id (str): Identifies the id column (for instance, country)
        time (int): Identifies the time column (for instance, years). 
                    This argument can currently handle simple digits 
                    (YYYY or YY, for instance).
                    Support for more complex dates (YYYY-MM-DD, for
                    instance) is planned as a future feature.

    Returns:
        pd.DataFrame: Returns a reduced data frame that shows a cohesive
        overview of the data frame
    """

    df2 = df.dropna(subset=[id]).copy()
    if len(df2) != len(df):
        print("There is at least one missing value in your id variable. The missing value is automatically deleted.")

    df_no_dup = df2.filter(items=[id, time]).drop_duplicates()

    if len(df_no_dup) != len(df2):
        print("There are some duplicates. We aggregate the data before proceeding.")

    df_sorted = df_no_dup.sort_values([id, time])

    # Group the DataFrame by the ID column
    grouped = df_sorted.groupby(id)

    # Initialize the combined column on the de-duplicated copy,
    # so the caller's input data frame is not modified
    df_no_dup['time_frame'] = df_no_dup[time].astype(str)

    # Check if numbers within each group are consecutive and combine them
    for _, group_df in grouped:
        numbers = group_df[time].tolist()

        combined_str = ""

        if len(numbers) > 1:
            consecutive_ranges = []
            current_range = [numbers[0]]

            for i in range(1, len(numbers)):
                if numbers[i] == numbers[i-1] + 1:
                    current_range.append(numbers[i])
                else:
                    if len(current_range) > 1:
                        consecutive_ranges.append(f'{current_range[0]}-{current_range[-1]}')
                    else:
                        consecutive_ranges.append(str(current_range[0]))
                    current_range = [numbers[i]]

            if len(current_range) > 1:
                consecutive_ranges.append(f'{current_range[0]}-{current_range[-1]}')
            else:
                consecutive_ranges.append(str(current_range[0]))

            combined_str = ', '.join(consecutive_ranges)
        else:
            combined_str = str(numbers[0])

        df_no_dup.loc[group_df.index, 'time_frame'] = combined_str

    return df_no_dup[[id, 'time_frame']].sort_values([id]).drop_duplicates()
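
The core of the function is the step that collapses consecutive time values into ranges. The following is a standalone sketch of that idea, not the package's own code: `collapse_ranges` is a hypothetical helper name, and the toy data and column names are made up for illustration.

```python
import pandas as pd


def collapse_ranges(numbers):
    """Collapse a sorted, non-empty list of integers into a string such as '1990-1992, 1995'."""
    ranges = []
    start = prev = numbers[0]
    for n in numbers[1:]:
        if n == prev + 1:
            # Still inside a consecutive run: extend it
            prev = n
            continue
        # Run ended: record it as a single value or as 'start-end'
        ranges.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = n
    ranges.append(str(start) if start == prev else f"{start}-{prev}")
    return ", ".join(ranges)


# Hypothetical toy data, including one duplicated row (DEU, 2000).
toy = pd.DataFrame({
    "country": ["FRA", "FRA", "FRA", "FRA", "DEU", "DEU"],
    "year": [1990, 1991, 1992, 1995, 2000, 2000],
})

# Same steps as described above: drop duplicates, sort, group by id,
# then collapse each group's years into one string.
overview = (
    toy.drop_duplicates()
    .sort_values(["country", "year"])
    .groupby("country")["year"]
    .agg(lambda s: collapse_ranges(s.tolist()))
    .rename("time_frame")
    .reset_index()
)
print(overview)
```

In this sketch, `FRA` ends up with the time frame `1990-1992, 1995` and `DEU` with `2000`, one row per id.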
