Reference

overview_na(df)

Plots an overview of missing values by variable.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | Input data frame | required |

Returns:

| Type | Description |
| --- | --- |
| Figure | matplotlib.figure.Figure: Bar plot visualizing the number of missing values per variable |

Source code in overviewpy/overviewpy.py
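import matplotlib.figure
import matplotlib.pyplot as plt
import pandas as pd


def overview_na(df: pd.DataFrame) -> matplotlib.figure.Figure:
    """Plots an overview of missing values by variable.

    Args:
        df (pd.DataFrame): Input data frame

    Returns:
        matplotlib.figure.Figure: Bar plot visualizing the number of missing values per variable
    """
    # Count missing values per column and draw them as horizontal bars
    ax = df.isna().sum().plot(kind="barh")
    ax.set_xlabel("Count")
    ax.set_ylabel("Columns")
    ax.set_title("Missing Values Overview")
    fig = ax.get_figure()
    plt.show()
    # Return the figure so the result matches the documented return type
    return fig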
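
The quantity shown in the bar plot is the per-column count of missing values. As a minimal sketch, the same counts can be inspected directly with pandas before plotting. The sample data and column names below are hypothetical and only serve as illustration.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame(
    {
        "country": ["AT", "AT", "DE", None],
        "year": [2000, 2001, 2000, 2001],
        "gdp": [1.0, None, None, 4.0],
    }
)

# Number of missing values per column, i.e. the bar lengths in the plot.
na_counts = df.isna().sum()
print(na_counts.to_dict())
```

Here "country" has one missing value, "year" has none, and "gdp" has two, so the plot would show three bars of lengths 1, 0, and 2.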

overview_tab(df, id, time)

Generates a tabular overview of the sample (and returns a data frame). The overview is a two-column table that shows each id in the left column and the time frame covered by that id in the right column.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | Input data frame | required |
| id | str | Identifies the id column (for instance, country) | required |
| time | int | Identifies the time column (for instance, years). This argument can currently handle simple integer values (YYYY or YY, for instance). Support for more complex dates (YYYY-MM-DD, for instance) is planned as a future feature. | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | pd.DataFrame: Returns a reduced data frame that shows a cohesive overview of the data frame |
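
Before the time frames are built, the function reduces the input to unique, sorted id/time pairs. The following is a minimal sketch of that preparation using plain pandas calls, not a call into the package itself. The sample data and the column names "country" and "year" are hypothetical.

```python
import pandas as pd

# Hypothetical panel data with a duplicate (AT, 2000) and a missing id.
df = pd.DataFrame(
    {
        "country": ["AT", "AT", "AT", "DE", "DE", None],
        "year": [2001, 2000, 2000, 1999, 2003, 2000],
        "gdp": [1.0, 2.0, 2.0, 3.0, 4.0, 5.0],
    }
)

# Drop rows without an id, keep only the id and time columns,
# remove duplicate pairs, and sort by id and time.
pairs = (
    df.dropna(subset=["country"])
    .filter(items=["country", "year"])
    .drop_duplicates()
    .sort_values(["country", "year"])
)

records = list(pairs.itertuples(index=False, name=None))
print(records)
```

For this input, the remaining pairs are AT with 2000 and 2001, and DE with 1999 and 2003. Collapsing consecutive years would then give "2000-2001" for AT and "1999, 2003" for DE.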

Source code in overviewpy/overviewpy.py
import pandas as pd


def overview_tab(df: pd.DataFrame, id: str, time: int) -> pd.DataFrame:
    """Generates a tabular overview of the sample (and returns a data frame).
    The overview is a two-column table that shows each id in the left column
    and the time frame covered by that id in the right column.

    Args:
        df (pd.DataFrame): Input data frame
        id (str): Identifies the id column (for instance, country)
        time (int): Identifies the time column (for instance, years).
                    This argument can currently handle simple integer
                    values (YYYY or YY, for instance).
                    Support for more complex dates (YYYY-MM-DD, for
                    instance) is planned as a future feature.

    Returns:
        pd.DataFrame: Returns a reduced data frame that shows a cohesive
        overview of the data frame
    """

    df2 = df.dropna(subset=[id]).copy()
    if len(df2) != len(df):
        print("There is at least one missing value in your id variable. The missing value is automatically deleted.")

    df_no_dup = df2.filter(items=[id, time]).drop_duplicates()

    if len(df_no_dup) != len(df2):
        print("There are some duplicates. We aggregate the data before proceeding.")

    df_sorted = df_no_dup.sort_values([id, time])

    # Group the DataFrame by the ID column
    grouped = df_sorted.groupby(id)

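    # Initialize the combined column on a copy of the de-duplicated frame,
    # so the caller's input frame is left unmodified
    df_no_dup = df_no_dup.copy()
    df_no_dup['time_frame'] = df_no_dup[time].astype(str)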

    # Check if numbers within each group are consecutive and combine them
    for _, group_df in grouped:
        numbers = group_df[time].tolist()

        combined_str = ""

        if len(numbers) > 1:
            consecutive_ranges = []
            current_range = [numbers[0]]

            for i in range(1, len(numbers)):
                if numbers[i] == numbers[i-1] + 1:
                    current_range.append(numbers[i])
                else:
                    if len(current_range) > 1:
                        consecutive_ranges.append(f'{current_range[0]}-{current_range[-1]}')
                    else:
                        consecutive_ranges.append(str(current_range[0]))
                    current_range = [numbers[i]]

            if len(current_range) > 1:
                consecutive_ranges.append(f'{current_range[0]}-{current_range[-1]}')
            else:
                consecutive_ranges.append(str(current_range[0]))

            combined_str = ', '.join(consecutive_ranges)
        else:
            combined_str = str(numbers[0])

        df_no_dup.loc[group_df.index, 'time_frame'] = combined_str

    return df_no_dup[[id, 'time_frame']].sort_values([id]).drop_duplicates()
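
The loop that merges consecutive time values into ranges can be illustrated in isolation. The sketch below is a standalone restatement of that step, not part of the package: `collapse_ranges` is a hypothetical helper name, and it assumes the input list is already sorted and free of duplicates, as it is after the sorting and de-duplication in `overview_tab`.

```python
from typing import List


def collapse_ranges(numbers: List[int]) -> str:
    """Join sorted, unique integers into a string of ranges.

    Runs of consecutive values become "first-last"; isolated values
    are kept as single numbers. (Hypothetical helper for illustration.)
    """
    if not numbers:
        return ""

    ranges = []
    start = prev = numbers[0]

    for n in numbers[1:]:
        if n == prev + 1:
            # Still inside a consecutive run: extend it.
            prev = n
            continue
        # The run ended: record it, then start a new one at n.
        ranges.append(f"{start}-{prev}" if prev > start else str(start))
        start = prev = n

    # Record the final run.
    ranges.append(f"{start}-{prev}" if prev > start else str(start))
    return ", ".join(ranges)


result = collapse_ranges([1990, 1991, 1992, 1995, 1997, 1998])
print(result)
```

Tracking only the first and last value of the current run avoids building an intermediate list for each range, while producing the same strings as the loop in the source.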