{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Example Usage\n",
    "\n",
    "Welcome to the `salesanalyzer_mds` package! This package is designed to help small-sized businesses analyze their retail sales data efficiently, without needing extensive data analytics expertise. If you've ever felt overwhelmed by tools like Pandas or Scikit-learn, or wished for more retail-specific functions, you're in the right place.\n",
    "\n",
    "In this notebook, we'll walk through how to use the `salesanalyzer_mds` package to extract valuable insights from your sales data. We’ll demonstrate key functionalities using real-world examples, so you can start improving your business decisions right away!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports\n",
    "\n",
    "Let us begin by setting up all our imports for this demonstration, which includes all 3 `salesanalyzer_mds` functions:\n",
    "- `sales_summary_statistics`: Calculates a variety of summary statistics that provide insights into overall sales performance, customer behavior, and product performance.\n",
    "- `segment_revenue_share`: Segments products into three categories: cheap, medium, expensive, based on price, and calculates their respective share in total revenue.\n",
    "- `predict_sales`: Predicts future sales based on the provided historical data and the target.\n",
    "sales_summary_statistics: Calculates a variety of summary statistics that provide insights into overall sales performance, customer behavior, and product performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "from salesanalyzer_mds.sales_summary_statistics import sales_summary_statistics\n",
    "from salesanalyzer_mds.segment_revenue_share import segment_revenue_share\n",
    "from salesanalyzer_mds.predict_sales import predict_sales"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create a sample data\n",
    "\n",
    "Next, let us create a sample data to work with. \n",
    "> Note:\n",
    "> `salesanalyzer_mds` package is not limited to the sample data columns and can be customized to suit your specific requirements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>InvoiceNo</th>\n",
       "      <th>Description</th>\n",
       "      <th>Quantity</th>\n",
       "      <th>InvoiceDate</th>\n",
       "      <th>UnitPrice</th>\n",
       "      <th>CustomerID</th>\n",
       "      <th>Country</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>INV-240891</td>\n",
       "      <td>Laptop</td>\n",
       "      <td>2</td>\n",
       "      <td>2023-06-09</td>\n",
       "      <td>1500</td>\n",
       "      <td>85732</td>\n",
       "      <td>USA</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>INV-240892</td>\n",
       "      <td>Headphones</td>\n",
       "      <td>3</td>\n",
       "      <td>2023-07-11</td>\n",
       "      <td>300</td>\n",
       "      <td>70179</td>\n",
       "      <td>Singapoore</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>INV-240893</td>\n",
       "      <td>Headphones</td>\n",
       "      <td>1</td>\n",
       "      <td>2023-08-21</td>\n",
       "      <td>250</td>\n",
       "      <td>85673</td>\n",
       "      <td>Germany</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>INV-240894</td>\n",
       "      <td>Monitor</td>\n",
       "      <td>3</td>\n",
       "      <td>2023-08-25</td>\n",
       "      <td>500</td>\n",
       "      <td>22367</td>\n",
       "      <td>USA</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>INV-240895</td>\n",
       "      <td>Headphones</td>\n",
       "      <td>5</td>\n",
       "      <td>2023-09-10</td>\n",
       "      <td>420</td>\n",
       "      <td>57682</td>\n",
       "      <td>Geramny</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    InvoiceNo Description  Quantity InvoiceDate  UnitPrice  CustomerID  \\\n",
       "0  INV-240891      Laptop         2  2023-06-09       1500       85732   \n",
       "1  INV-240892  Headphones         3  2023-07-11        300       70179   \n",
       "2  INV-240893  Headphones         1  2023-08-21        250       85673   \n",
       "3  INV-240894     Monitor         3  2023-08-25        500       22367   \n",
       "4  INV-240895  Headphones         5  2023-09-10        420       57682   \n",
       "\n",
       "      Country  \n",
       "0         USA  \n",
       "1  Singapoore  \n",
       "2     Germany  \n",
       "3         USA  \n",
       "4     Geramny  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample_data = pd.DataFrame({\n",
    "    'InvoiceNo' : ['INV-240891','INV-240892', 'INV-240893', 'INV-240894', 'INV-240895', 'INV-240896', 'INV-240898'],\n",
    "    'Description': ['Laptop', 'Headphones', 'Headphones', 'Monitor', 'Headphones', 'Laptop', 'Monitor'],\n",
    "    'Quantity' : [2, 3, 1, 3, 5, 2, 1],\n",
    "    'InvoiceDate' : ['2023-06-09', '2023-07-11', '2023-08-21', '2023-08-25', '2023-09-10', '2023-10-30', '2023-10-30'],\n",
    "    'UnitPrice' : [1500, 300, 250, 500, 420, 2000, 700],\n",
    "    'CustomerID' : [85732, 70179, 85673, 22367, 57682, 99123, 45612],\n",
    "    'Country' : ['USA', 'Singapoore', 'Germany', 'USA', 'Geramny', 'Singapoore', 'USA']\n",
    "})\n",
    "\n",
    "sample_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get Summary Statistics\n",
    "\n",
    "One of the key features of `salesanalyzer_mds` is its ability to quickly generate sales summary. Use the `analyze_sales_trends()` function to generate insights like total revenue, average order value, and top selling products.\n",
    "> Use help(sales_summary_statistics) for more information about the function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Value</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>total_revenue</th>\n",
       "      <td>12450.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>unique_customers</th>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>average_order_value</th>\n",
       "      <td>1778.571429</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>top_selling_product_quantity</th>\n",
       "      <td>Headphones</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>top_selling_product_revenue</th>\n",
       "      <td>Laptop</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>average_revenue_per_customer</th>\n",
       "      <td>1778.571429</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                    Value\n",
       "total_revenue                     12450.0\n",
       "unique_customers                        7\n",
       "average_order_value           1778.571429\n",
       "top_selling_product_quantity   Headphones\n",
       "top_selling_product_revenue        Laptop\n",
       "average_revenue_per_customer  1778.571429"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sales_summary_statistics(sample_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get Revenue Share for each Product Category\n",
    "\n",
    "Another feature of `saleanalyzer_mds`, the `segment_revenue_share()` function, segments products into three categories (cheap < medium < expensive) — based on their price, and calculates the respective share of total revenue contributed by each segment. By default, the price thresholds are set automatically, but users can define custom thresholds to categorize products according to their specific business needs. This function is particularly useful for analyzing product sales data and understanding revenue distribution across different pricing tiers.\n",
    "> Use help(sales_summary_statistics) for more information about the function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PriceSegment</th>\n",
       "      <th>TotalRevenue</th>\n",
       "      <th>RevenueShare (%)</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>cheap</td>\n",
       "      <td>1150</td>\n",
       "      <td>9.24</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>medium</td>\n",
       "      <td>4300</td>\n",
       "      <td>34.54</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>expensive</td>\n",
       "      <td>7000</td>\n",
       "      <td>56.22</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  PriceSegment  TotalRevenue  RevenueShare (%)\n",
       "0        cheap          1150              9.24\n",
       "1       medium          4300             34.54\n",
       "2    expensive          7000             56.22"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Using default price thresholds\n",
    "revenue_share = segment_revenue_share(sample_data, price_col='UnitPrice', quantity_col='Quantity')\n",
    "revenue_share"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PriceSegment</th>\n",
       "      <th>TotalRevenue</th>\n",
       "      <th>RevenueShare (%)</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>cheap</td>\n",
       "      <td>1150</td>\n",
       "      <td>9.24</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>medium</td>\n",
       "      <td>3600</td>\n",
       "      <td>28.92</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>expensive</td>\n",
       "      <td>7700</td>\n",
       "      <td>61.85</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  PriceSegment  TotalRevenue  RevenueShare (%)\n",
       "0        cheap          1150              9.24\n",
       "1       medium          3600             28.92\n",
       "2    expensive          7700             61.85"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Using user-defined price thresholds\n",
    "revenue_share = segment_revenue_share(sample_data, price_col='UnitPrice', quantity_col='Quantity', price_thresholds=(300, 500))\n",
    "revenue_share"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Predict Future Sales\n",
    "\n",
    "Now that you have a good summary of your **past** sales, say, you want to peek into the **future** and predict how your products will sell in a month, 2 months or even a year? You can do this with `predict_sales()` function. This function uses a Random Forest machine learning model to make predictions on your specified target (e.g. quantity sold). The output will be a data frame with predicted values, and the model's performance score (Mean Squared Error and R Squared).\n",
    "\n",
    "> **Important** <br>\n",
    "> `predict_sales()` checks for duplicate entries, and only considers unique data points <br>\n",
    "> By default the function uses 70% data for training and 30% for testing, to change that you can pass test_size = 0.2 increase the ratio, if your data size is small \n",
    "<br><br>\n",
    "> **Model Performance Scores:**<br>\n",
    "> - Mean Squared Error: average squared difference between predicted values and the actual values \n",
    "> - Coefficient of Determination $(R^2)$: how well-observed results are reproduced by the model, depending on the ratio of total deviation of results described by the model.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MSE of the model: 6.7\n",
      "R_squared of the model: -6.54\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Predicted values</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.77</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.33</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Predicted values\n",
       "0              1.77\n",
       "1              1.33"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "new_data = pd.DataFrame({\n",
    "    'InvoiceNo' : ['INV-250891','INV-250892'],\n",
    "    'Description': ['Laptop', 'Headphones'],\n",
    "    'InvoiceDate' : ['2025-01-30', '2025-02-01'],\n",
    "    'UnitPrice' : [2000, 300],\n",
    "    'CustomerID' : [85732, 70179],\n",
    "    'Country' : ['USA', 'Singapoore']\n",
    "})\n",
    "\n",
    "predict_sales(sample_data, \n",
    "              new_data, \n",
    "              numeric_features = ['UnitPrice'], \n",
    "              categorical_features = ['Description', 'Country'], \n",
    "              target = 'Quantity', \n",
    "              date_feature = 'InvoiceDate')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you don't want to include a date feature into your analysis, you can omit it from your arguments.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MSE of the model: 1.72\n",
      "R_squared of the model: 0.0\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Predicted values</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.89</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.88</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Predicted values\n",
       "0              1.89\n",
       "1              1.88"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "predict_sales(sample_data, new_data, ['UnitPrice'], ['Description', 'Country'], 'Quantity', test_size=0.2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is the end of the tutorial, where you have seen how to get sales data insights using our package."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "salesanalyzser",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}