{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# SMIPP 21/22 - Exercise Sheet 8\n", "\n", "## Prof. Dr. K. Reygers, Dr. R. Stamen, Dr. M. Völkl\n", "\n", "## Hand in by: Thursday, December 16 th: 12:00\n", "### Submit the file(s) through the Übungsgruppenverwaltung\n", "\n", "\n", "### Names (up to two):\n", "### Points: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8.1 Upper limit on counting experiment (10 points)\n", "\n", "Assume a simple counting experiment where there are contributions from signal ($S$) and background ($B$) processes. The number of signal ($n_S$) and background events ($n_B$) can be treated as independent random variables distributed according to a Poisson p.d.f. with means $\\nu_S$ and $\\nu_B$, respectively. The total number of events $n = n_S +n_B$ is a Poisson random variable with mean $\\nu = \\nu_S + \\nu_B$. \n", "In an experiment, $n_{\\rm obs} = 5$ events are observed, while $\\nu_B = 1.8$ background events are expected.\n", "\n", "\n", "1. Determine an upper limit $\\nu_S^{\\rm max}$ for the number of signal events at $95\\%$ confidence level. Such a limit is defined by the expected number of signal events $\\nu_S^{\\rm max}$ for which the probability of measuring $n_{\\rm obs}$ or fewer events reaches $5\\%$ assuming a Poisson statistic with mean $\\nu_B + \\nu_S^{\\rm max}$.\n", "\n", "2. Verify the limit determined above with toy MC experiments. In each toy experiment generate a random number according to a Poisson p.d.f with a mean value of $\\nu_B + \\nu_S^{\\rm max}$, where $\\nu_S^{\\rm max}$ is the upper limit you determined above. Then, count the number of experiments in which this random number is less or equal to $n_{\\rm obs}$. By construction, the fraction of these events should be $5\\%$. Provide a plot for the $n_{\\rm obs}$ distribution found in the experiments. Perform $10^6$ experiments.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Solution:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8.2 Search for a particle with unknown yield (25 points)\n", "\n", "Consider the experiment of exercise 7.3 (*The 750 GeV resonance*).\n", "We want to test the same theoretical model but without making any assumption on the number of signal events we have in the recorded dataset (signal yield). Our aim will be to put an upper limit on the signal yield at 95% confidence level. We will use the modified frequentist limits from the so-called ${\\rm CL}_s$ method.\n", "\n", "The model now dependends on the average number of signal counts $s$. Use the following $s$-dependent test statistic:\n", "\n", "$$ t_s = -2 \\ln \\frac{P(\\text{data} | s+b)}{P(\\text{data} | b)} $$\n", "\n", "Here $b$ stands for the background-only model (with an average number of background counts $b$), and $s+b$ for the signal+background model with (on average) $s$ signal counts.\n", "\n", "${\\rm CL}_s$ is defined as\n", "\n", "$$ {\\rm CL}_s = \\frac{{\\rm CL}_{s+b}}{{\\rm CL}_{b}} $$\n", "\n", "where ${\\rm CL}_{s+b}$ is the probability to find the test statistic for the model $s+b$ above the observed test statistic $t_s^{\\rm obs}$, while ${\\rm CL}_{b}$ is the probability to find the test statistic for the background-only model above $t_s^{\\rm obs}$. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 8.2 Search for a particle with unknown yield (25 points)\n", "\n", "Consider the experiment of exercise 7.3 (*The 750 GeV resonance*).\n", "We want to test the same theoretical model, but without making any assumption on the number of signal events in the recorded dataset (the signal yield). Our aim is to set an upper limit on the signal yield at 95% confidence level. We will use the modified frequentist limit from the so-called ${\\rm CL}_s$ method.\n", "\n", "The model now depends on the average number of signal counts $s$. Use the following $s$-dependent test statistic:\n", "\n", "$$ t_s = -2 \\ln \\frac{P(\\text{data} | s+b)}{P(\\text{data} | b)} $$\n", "\n", "Here $b$ stands for the background-only model (with an average number of background counts $b$), and $s+b$ for the signal+background model with (on average) $s$ signal counts.\n", "\n", "${\\rm CL}_s$ is defined as\n", "\n", "$$ {\\rm CL}_s = \\frac{{\\rm CL}_{s+b}}{{\\rm CL}_{b}} $$\n", "\n", "where ${\\rm CL}_{s+b}$ is the probability to find the test statistic for the model $s+b$ above the observed test statistic $t_s^{\\rm obs}$, while ${\\rm CL}_{b}$ is the probability to find the test statistic for the background-only model above $t_s^{\\rm obs}$. Note that both ${\\rm CL}_{s+b}$ and ${\\rm CL}_{b}$ depend on $s$.\n", "\n", "a) Calculate ${\\rm CL}_s$ as a function of $s$ (for $s$ between 0 and 50).\n", "\n", "b) The values of $s$ excluded at 95% confidence level are the ones for which ${\\rm CL}_s \\leq 5\\%$. Plot ${\\rm CL}_s$ versus $s$ and calculate the upper limit for $s$.\n", "\n", "c) Compute the expected upper limit for the background-only hypothesis. To do so, you first need to calculate the expected value $t_s^{\\rm exp}$ of the test statistic under the background-only hypothesis.\n", "\n", "\n", "Hint: Below you will find code snippets that might be useful (of course, different implementations are accepted, too). You can add your code after the lines marked by \"*your code here*\"." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Step 1: functions for the likelihoods and the test statistic (nothing else to be done here)" ] },
{ "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import expon, norm, poisson\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "# pdfs for the background and signal distributions\n", "bkg = expon.freeze(loc=500, scale=100)\n", "sig = norm.freeze(loc=750, scale=30)\n", "\n", "def model(m, fsig):\n", "    \"\"\"\n", "    m: invariant mass\n", "    fsig: signal fraction (0 <= fsig <= 1)\n", "    \"\"\"\n", "    return (1-fsig)*bkg.pdf(m) + fsig*sig.pdf(m)\n", "\n", "def LL_sb(evts, s):\n", "    \"\"\"\n", "    log-likelihood for the signal + background model\n", "    evts: array of measured masses ('events')\n", "    s: number of signal events\n", "    \"\"\"\n", "    fsig = s / len(evts)\n", "    return np.sum(np.log(model(evts, fsig)))\n", "\n", "def LL_b(evts):\n", "    \"\"\"log-likelihood for the background-only model\"\"\"\n", "    return np.sum(np.log(bkg.pdf(evts)))\n", "\n", "def t_s(evts, s):\n", "    \"\"\"\n", "    returns the test statistic\n", "    evts: array of invariant masses\n", "    s: average number of signal counts in the s+b model\n", "    \"\"\"\n", "    return -2 * (LL_sb(evts, s) - LL_b(evts))\n", "\n", "import pandas as pd\n", "d = pd.read_csv('https://www.physi.uni-heidelberg.de/~reygers/lectures/2020/smipp/two_photon_inv_masses.csv')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Step 2: Define a function that generates a toy data set (nothing else to be done here)" ] },
{ "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [], "source": [ "def generate_toy_dataset(n_tot, s):\n", "    \"\"\"generate a toy data set (numpy array) with n_tot mass values where\n", "    on average s masses are drawn from the signal distribution\"\"\"\n", "\n", "    # number of signal events\n", "    n_s = poisson.rvs(mu=s)\n", "\n", "    evts_s = sig.rvs(n_s)\n", "    evts_b = bkg.rvs(n_tot - n_s)\n", "    return np.concatenate((evts_s, evts_b))" ] },
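{ "cell_type": "markdown", "metadata": {}, "source": [ "A short usage sketch of the helper functions above, meant only to illustrate how they fit together (assumptions made here: the measured invariant masses are in the first column of the DataFrame `d`, and `s_test = 10` is an arbitrary example value). It evaluates the test statistic for the observed data and for one toy data set." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# usage sketch (assumption: the masses are in the first column of d; s_test = 10 is arbitrary)\n", "evts_obs = d.iloc[:, 0].to_numpy()  # observed invariant masses\n", "print(f'number of observed events: {len(evts_obs)}')\n", "\n", "s_test = 10\n", "print(f'observed test statistic for s = {s_test}: {t_s(evts_obs, s_test):.2f}')\n", "\n", "# one toy data set with the same total number of events and, on average, s_test signal events\n", "toy = generate_toy_dataset(len(evts_obs), s_test)\n", "print(f'test statistic for one toy data set: {t_s(toy, s_test):.2f}')" ] },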
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Step 3: define a function that returns a distribution of the test statistic $t_s$ for the signal+background model or for the background-only model" ] },
{ "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [], "source": [ "def generate_test_statistic_distr(s, model):\n", "    \"\"\"\n", "    generate values of the test statistic for the background-only model (model = 'b')\n", "    or for the signal+background model (model = 's+b') and return them as a numpy array.\n", "    s: number of signal counts used in the test statistic (the test statistic depends on s)\n", "    s_gen: number of signal counts used in the generation of the toy data\n", "           (s_gen = 0 corresponds to the background-only model)\n", "    \"\"\"\n", "    s_gen = 0\n", "    if model == \"s+b\": s_gen = s\n", "\n", "    # your code here ...\n", "    # generate, e.g., 2000 toy data sets and calculate the test statistic for each data set" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Step 4: define a function that returns $\\mathrm{CL}_s$" ] },
{ "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [], "source": [ "def CL_s(t_obs, t_distr_sb, t_distr_b):\n", "    \"\"\"\n", "    return the CLs value\n", "    t_obs: observed value of the test statistic\n", "    t_distr_sb: numpy array with values of the test statistic for the signal+background model\n", "    t_distr_b: numpy array with values of the test statistic for the background-only model\n", "    \"\"\"\n", "\n", "    # your code here ..." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Step 5: loop over values of $s$ and calculate $\\mathrm{CL}_s$ for each $s$. Calculate also the expected $\\mathrm{CL}_s$ under the assumption that the background-only model is true." ] },
{ "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [], "source": [ "# your code here" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Step 6: Plot $\\mathrm{CL}_s$ as a function of $s$, along with the expected $\\mathrm{CL}_s$ under the assumption that the background-only model is true" ] },
{ "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [], "source": [ "# your code here" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Step 7: Determine the upper limit by finding the value of $s$ at which $\\mathrm{CL}_s \\le 5\\%$. Determine also the expected upper limit for the background-only model." ] },
{ "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [], "source": [ "# your code here" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 8.3 Credible Interval for a Branching Ratio (5 points)\n", "\n", "In exercise 2.3, the branching ratio estimate for two alternative decays A and B was discussed. In this exercise you will calculate the corresponding credible intervals.\n", "\n", "a) Explain why, for a prior that is flat in the branching ratio $f_A$, the posterior probability distribution for the observed decay sequence AABBA is $\\sim f_A^3 (1-f_A)^2$, which is in fact the beta distribution $\\mathrm{Beta}(4,3)$.\n", "\n", "b) Calculate the following 80% credible intervals and give the lower and upper edge of $f_A$:\n", "\n", "   i) The credible interval with equal probability on both sides of the median of the posterior\n", "   ii) The credible interval symmetric around the mean of the posterior\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Solution" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }