Getting started

This page demonstrates how to use overviewpy in a project.

Import libraries

In [1]:

from overviewpy.overviewpy import overview_tab, overview_na
import pandas as pd
import numpy as np

Generate data

First, we generate some example data to use in the following steps.

In [2]:

# Generate full data

data = {
    'id': ['RWA', 'RWA', 'RWA', 'GAB', 'GAB', 'FRA',
           'FRA', 'BEL', 'BEL', 'ARG'],
    'year': [2022, 2023, 2021, 2023, 2020, 2019, 2015,
             2014, 2013, 2002]
}

df = pd.DataFrame(data)

df.head()

Out[2]:

	id	year
0	RWA	2022
1	RWA	2023
2	RWA	2021
3	GAB	2023
4	GAB	2020

In [3]:

# Generate data with missing values

data_na = {
    'id': ['RWA', 'RWA', 'RWA', np.nan, 'GAB', 'GAB',
           'FRA', 'FRA', 'BEL', 'BEL', 'ARG', np.nan, np.nan],
    'year': [2022, 2001, 2000, 2023, 2021, 2023, 2020,
             2019, np.nan, 2015, 2014, 2013, 2002]
}

df_na = pd.DataFrame(data_na)

df_na.head()

Out[3]:

	id	year
0	RWA	2022.0
1	RWA	2001.0
2	RWA	2000.0
3	NaN	2023.0
4	GAB	2021.0
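
The year column above is displayed as 2022.0 rather than 2022 because a NaN in a numeric column makes pandas store it as float64. As an aside, and only if integer years are preferred, pandas' nullable Int64 dtype is one option. The short sketch below uses a made-up three-value series; whether overviewpy handles this dtype is not covered here and would need to be checked.

```python
import numpy as np
import pandas as pd

# A numeric column with a missing value is stored as float64
years = pd.Series([2022, np.nan, 2015])
print(years.dtype)

# The nullable integer dtype keeps whole numbers and marks the gap as <NA>
years_int = years.astype("Int64")
print(years_int.dtype)
print(years_int)
```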

Get an overview of the time distribution in your data

Use overview_tab to get a general overview of the data set based on its time and scope conditions. The resulting data frame collapses the time points for each id into a single entry: consecutive years are shown as a range (for example, 2021-2023), while gaps stay visible as comma-separated values (for example, 2015, 2019).

In [4]:

df_overview = overview_tab(df=df, id='id', time='year')

print(df_overview)

    id  time_frame
9  ARG        2002
7  BEL   2013-2014
5  FRA  2015, 2019
3  GAB  2020, 2023
0  RWA   2021-2023

/opt/hostedtoolcache/Python/3.11.5/x64/lib/python3.11/site-packages/overviewpy/overviewpy.py:70: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '2002' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df_no_dup.loc[group_df.index, 'time_frame'] = combined_str
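
To make the collapsing step more concrete, here is a minimal sketch of the idea in plain pandas. It is an illustration only, not overviewpy's actual implementation, and the helper name collapse_years is hypothetical. It assumes the time variable holds whole-number years.

```python
import pandas as pd


def collapse_years(years):
    """Collapse years into a string of ranges, e.g. '2013-2014' or '2015, 2019'.

    Hypothetical helper for illustration; not part of overviewpy.
    """
    # Drop missing values, remove duplicates, and sort ascending
    ys = sorted({int(y) for y in pd.Series(years).dropna()})
    if not ys:
        return ""

    runs = []
    start = prev = ys[0]
    for y in ys[1:]:
        if y == prev + 1:
            # Still inside a run of consecutive years
            prev = y
            continue
        # Gap found: close the current run and start a new one
        runs.append((start, prev))
        start = prev = y
    runs.append((start, prev))

    return ", ".join(str(a) if a == b else f"{a}-{b}" for a, b in runs)


df = pd.DataFrame({
    "id": ["RWA", "RWA", "RWA", "GAB", "GAB", "FRA", "FRA", "BEL", "BEL", "ARG"],
    "year": [2022, 2023, 2021, 2023, 2020, 2019, 2015, 2014, 2013, 2002],
})

# One row per id with its collapsed time frame
sketch = df.groupby("id")["year"].apply(collapse_years).reset_index(name="time_frame")
print(sketch)
```

For the example data this yields the same time frames as the overview_tab output above, such as 2021-2023 for RWA and 2015, 2019 for FRA.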

Get an overview of missing data in your data frame

overview_na is a simple function that summarizes missing data across all variables in your data, not only the time and scope conditions. It returns a horizontal ggplot bar plot that shows the amount of missing data (NAs) for each variable, with the variables on the y-axis. You can choose whether to display the share of NAs for each variable as a percentage (the default) or the total number of NAs.

In [5]:

overview_na(df_na)

[Figure: horizontal bar plot of the share of missing values (NAs) for each variable in df_na]
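
The numbers behind such a plot can also be checked directly with pandas. The sketch below is not how overview_na is implemented; it only computes the total and relative number of NAs per variable for the same example data, using standard pandas calls.

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame({
    "id": ["RWA", "RWA", "RWA", np.nan, "GAB", "GAB",
           "FRA", "FRA", "BEL", "BEL", "ARG", np.nan, np.nan],
    "year": [2022, 2001, 2000, 2023, 2021, 2023, 2020,
             2019, np.nan, 2015, 2014, 2013, 2002],
})

# Total number of missing values per variable
na_count = df_na.isna().sum()

# Share of missing values per variable, in percent
na_pct = df_na.isna().mean() * 100

print(na_count)
print(na_pct.round(2))
```

With 13 rows, id has 3 missing values (about 23.08 %) and year has 1 (about 7.69 %).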