{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# SMIPP 21/22 - Exercise Sheet 11\n", "\n", "## Prof. Dr. K. Reygers, Dr. R. Stamen, Dr. M. Völkl\n", "\n", "## Hand in by: Thursday, January 27th: 12:00\n", "### Submit the file(s) through the Übungsgruppenverwaltung\n", "\n", "\n", "### Names (up to two):\n", "### Points: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 11.1 Quiz questions (40 points)\n", "### Please mark the correct answer\n", "\n", "1. According to Occam's razor\n", "\n", "\tA. one should not assign subjective probabilities to competing hypotheses
\n", "\tB. one should select the hypothesis with the fewest assumptions among competing hypotheses that make the same prediction
\n", "\tC. one should not reject plausible hypotheses even if they disagree with the data
\n", "\tD. one should accept only those hypotheses that can be mathematically proven
\n", "\n", "\n", "\n", "2. In Bayesian inference, an improper prior is a prior distribution\n", "\n", "\tA. whose integral is not finite
\n", "\tB. which is a delta distribution
\n", "\tC. which is not a Gaussian
\n", "\tD. which is a uniform distribution between 0 and a maximum value $a_{\\max}$
\n", "\n", "\n", "\n", "3. According to the reproducibility crisis ...\n", "\n", "\tA. ... old research papers often are not accessible online
\n", "\tB. ... proofs in pure math often are too abstract and hard to understand
\n", "\tC. ... many emperical research findings in medicine and the social sciences are wrong
\n", "\tC. ... many empirical research findings in medicine and the social sciences are wrong
\n", "\n", "\n", "\n", "4. If the posterior distribution p(θ | x) is in the same probability distribution family as the prior probability distribution p(θ) for a likelihood L(x | θ) then\n", "\n", "\tA. p(θ) is called an improper prior for L(x | θ)
\n", "\tB. p(θ) is called an uninformed prior for L(x | θ)
\n", "\tC. p(θ) is called a conjugate prior for L(x | θ)
\n", "\tD. p(θ) is called a Jeffreys prior for L(x | θ)
\n", "\n", "\n", "\n", "5. The Central Limit Theorem states that \n", "\n", "\tA. histograms approach the underlying PDF for $n$ → ∞
\n", "\tB. $n!$ can be calculated as Γ($n$+1)
\n", "\tC. a binomial distribution can be approximated by a Poisson distribution under certain conditions
\n", "\tD. the sum of $n$ independent random variables approaches a Gaussian distribution for $n$ → ∞
\n", "\n", "\n", "\n", "6. The ±1σ interval around the mean of a Gaussian corresponds to a probability of about\n", "\n", "\tA. 32%
\n", "\tB. 36%
\n", "\tC. 68%
\n", "\tD. 95%
\n", "\n", "\n", "\n", "7. The standard deviation of the mean of $n$ independent measurements ...\n", "\n", "\tA. decreases as $1/\\ln n$
\n", "\tB. decreases as $1/\\sqrt{n}$
\n", "\tC. decreases as $1/n$
\n", "\tD. is independent of $n$
\n", "\n", "\n", "\n", "8. Using linear error propagation, the relative uncertainty of the product z = x × y of two uncorrelated variables x and y is given by\n", "\n", "\tA. $\\sqrt{\\sigma_x^2 + \\sigma_y^2}$
\n", "\tB. $\\sqrt{(\\sigma_x/x)^2 + (\\sigma_y/y)^2}$
\n", "\tC. $\\sigma_x/x + \\sigma_y/y$
\n", "\tD. $1/\\sqrt{1/\\sigma_x^2 + 1/\\sigma_y^2}$
\n", "\n", "\n", "\n", "9. Suppose the average number of proton-proton collisions per bunch crossing at an interaction point of the LHC is 25. What is the variance of the number of collisions per bunch crossing?\n", "\n", "\tA. 5
\n", "\tB. 12.5
\n", "\tC. 25
\n", "\tD. 625
\n", "\n", "\n", "\n", "10. To generate random samples from a multi-dimensional probability distribution $f(\\vec x) = p(\\vec x) / A$ using the Metropolis-Hastings algorithm one needs to know\n", "\n", "\tA. just $p(\\vec x)$
\n", "\tB. $p(\\vec x)$, and the normalization constant $A$
\n", "\tC. $p(\\vec x)$, the first derivative of $p(\\vec x)$, and the normalization constant $A$
\n", "\tD. $p(\\vec x)$, the first and second derivative of $p(\\vec x)$, and the normalization constant $A$
\n", "\n", "\n", "\n", "11. To obtain random points uniformly distributed on the surface of a sphere one needs to uniformly distribute\n", "\n", "\tA. $\\varphi$ and $\\theta$
\n", "\tB. $\\sin \\varphi$ and $\\theta$
\n", "\tC. $\\varphi$ and $\\cos \\theta$
\n", "\tD. $\\varphi^2$ and $\\theta$
\n", "\n", "\n", "\n", "12. Let $r$ be a random variable uniformly distributed in [0, 1]. To draw random numbers from the PDF f($x$) = $2x$ one can transform $r$ as\n", "\n", "\tA. $\\sqrt{r}$
\n", "\tB. $r^2$
\n", "\tC. $\\ln r$
\n", "\tD. $r^4$
\n", "\n", "\n", "\n", "13. What is an extended maximum likelihood fit?\n", "\n", "\tA. A fit in which the number of events (i.e. the normalization) is a random variable
\n", "\tB. A fit that takes the correlations of the measurements into account.
\n", "\tC. A fit with more than 10 fit parameters
\n", "\tD. A fit which includes systematic uncertainties in form of nuisance parameters
\n", "\n", "\n", "\n", "14. The Rao-Cramér-Frechet bound expresses a ...\n", "\n", "\tA. lower bound on the expectation value of an estimator
\n", "\tB. lower bound on the variance of an estimator
\n", "\tC. upper bound on the standard deviation of an estimator
\n", "\tD. upper bound on the variance of an estimator
\n", "\n", "\n", "\n", "15. In the large sample limit the likelihood function $L$ approaches a\n", "\n", "\tA. Gaussian
\n", "\tB. parabolic function
\n", "\tC. chi-squared distribution
\n", "\tD. gamma distribution
\n", "\n", "\n", "\n", "16. A closed-form solution for a least squares parameter estimate exists if the fit function\n", "\n", "\tA. has less than three parameters
\n", "\tB. is monotonic
\n", "\tC. is differentiable
\n", "\tD. is linear in the fit parameters
\n", "\n", "\n", "\n", "17. The 1-sigma uncertainty of a parameter $\\theta$ estimated with a least-squares fit corresponds to a $\\chi^2(\\theta)$ value of\n", "\n", "\tA. $1.64/\\chi^2_\\mathrm{min}$
\n", "\tB. $\\chi^2_\\mathrm{min} + 1$
\n", "\tC. $1.64 \\cdot \\chi^2_\\mathrm{min}$
\n", "\tD. $1.64 \\cdot \\chi^2_\\mathrm{min} + \\ln 2$
\n", "\n", "\n", "\n", "18. What is a generalized least squares fit?\n", "\n", "\tA. A fit for which the allowed range of the parameter values is limited
\n", "\tB. A fit which takes uncertainties of the $x$ values in addition to the $y$ uncertainties into account
\n", "\tC. A fit that combines the maximum-likelihood and the least-squares method
\n", "\tD. A fit that takes the correlation of the measured $y$ values into account
\n", "\n", "\n", "\n", "19. The p-value is the probability\n", "\n", "\tA. that an alternative hypothesis H1 is false
\n", "\tB. of a model being true
\n", "\tC. of observing an equal or larger deviation of the data from a model given that the model is true
\n", "\tD. of rejecting a true hypothesis
\n", "\n", "\n", "\n", "20. A statement about the optimal choice of the test statistic when comparing two simple hypotheses is made in \n", "\n", "\tA. the Neyman-Pearson lemma
\n", "\tB. the central limit theorem
\n", "\tC. the Wald–Wolfowitz proposition
\n", "\tD. Fisher's law
\n", "\n", "\n", "\n", "21. A variable that is a function of the data alone and that can be used to test a hypothesis is called\n", "\n", "\tA. run test
\n", "\tB. test statistic
\n", "\tC. Kolmogorov-Smirnov variable
\n", "\tD. Neyman-Pearson variable
\n", "\n", "\n", "\n", "22. In the Feldman-Cousins method points are added to the acceptance region\n", "\n", "\tA. based on the maximum entropy principle
\n", "\tB. requiring a symmetric interval around the maximum
\n", "\tC. based on the likelihood ratio ordering principle
\n", "\tD. by maximizing the Fisher information
\n", "\n", "\n", "\n", "23. In frequentist statistics, the fraction of the time that a confidence interval contains the true value of interest is called\n", "\n", "\tA. coverage
\n", "\tB. credibility
\n", "\tC. power
\n", "\tD. likelihood
\n", "\n", "\n", "\n", "24. In Bayesian statistics, an interval in the domain of a posterior probability distribution corresponding to a certain probability is called \n", "\n", "\tA. CLs interval
\n", "\tB. confidence interval
\n", "\tC. confidential interval
\n", "\tD. credible interval
\n", "\n", "\n", "\n", "25. The \"curse of dimensionality\" refers to the fact that \n", "\n", "\tA. models with a large number of parameters are hard to train
\n", "\tB. when the dimensionality of the feature vectors increases, training data in feature space become sparse
\n", "\tC. classification problems become hard for a large number of classes
\n", "\tD. even for large training samples the parameter optimization never converges
\n", "\n", "\n", "\n", "26. In supervised learning, \"over-training\" means that\n", "\n", "\tA. the classifier learns statistical fluctuations of the training sample
\n", "\tB. sometimes classification works better than theoretically expected
\n", "\tC. the classification performance becomes worse when the training sample is too large
\n", "\tD. classification becomes slow for a too large training sample
\n", "\n", "\n", "\n", "27. The naive Bayes classifier is called \"naive\" because \n", "\n", "\tA. in many situations Bayes' theorem does not apply
\n", "\tB. it approximates PDFs as multivariate Gaussians
\n", "\tC. it is based on a linear approximation of Bayes' formula
\n", "\tD. it ignores correlations between the input variables
\n", "\n", "\n", "\n", "28. In the random forest algorithm, ...\n", "\n", "\tA. the total number of trees is chosen randomly
\n", "\tB. one adds random values to the components of the feature vector
\n", "\tC. one uses random subsets of the features when building each tree
\n", "\tD. the separation measure of each split is chosen randomly
\n", "\n", "\n", "\n", "29. The Gini index measures \n", "\n", "\tA. the dimension of the feature space
\n", "\tB. the error rate of a classifier
\n", "\tC. the performance of boosted decision trees
\n", "\tD. the separation between signal and background in a sample
\n", "\n", "\n", "\n", "30. \"Boosting\" in machine learning is a technique\n", "\n", "\tA. to use special relativity in particle identification
\n", "\tB. to combine many weak classifiers into a strong one
\n", "\tC. to increase the performance by using more input variables
\n", "\tD. to use GPUs to speed up the learning process
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 10.4 Data Challenge: Searching for exotic particles in high-energy physics (40 extra points)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is an optional exercise which you can solve voluntarily to earn extra points. We will discuss the problem during the tutorial.\n", "\n", "In this exercise we want to explore various techniques to optimise the event selection in the search for supersymmetric Higgs bosons at the LHC. In supersymmetry the Higgs sector consists of five Higgs bosons, in contrast to the single Higgs boson in the standard model. Here we deal with a heavy Higgs boson which decays into two W bosons and a standard Higgs boson ($H^0 \\to W^+ W^- h$), which decay further into leptons ($W^\\pm \\to l^\\pm \\nu$) and b quarks ($h\\to b \\bar{b}$), respectively. Based on the signals deposited in the detector, low level quantities like the momenta of particles and high level quantities like invariant masses (i.e. combinations of information from more than one of the final state particles) are reconstructed. Various machine learning algorithms should be run and their performance should be quantified. We will use the model accuracy, the AUC score and the ROC curve to evaluate the performance.\n", "\n", "This exercise is based on a [Nature Communications paper](https://www.nature.com/articles/ncomms5308) which contains much more information, such as general background, details about the selection variables and links to large sets of simulated events. You might also use the paper as inspiration for the solution of this exercise. \n", "\n", "The two datasets consist of 10k and 100k events, respectively. 
For each event 29 variables are stored:\n", "\n", " 0: classification (1 = signal, 0 = background) \n", " 1 - 21 : low level quantities (var1 - var21)\n", " 22 - 28 : high level quantities (var22 - var28)\n", " \n", "In the lecture you have met several different machine learning techniques (logistic regression, k-nearest neighbors classifier, linear discriminant analysis, BDT, NN, ...). You should implement four of them (at least one NN algorithm) and perform the following studies:\n", "\n", "1) Use the low level quantities and determine for each algorithm the [model accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.score)\n", "and the [AUC score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html).\n", "Also plot the ROC curve. Perform these studies for the 10k as well as for the 100k data sample.\n", "\n", "2) Repeat these studies using the high level quantities.\n", "\n", "3) Answer the following questions:\n", "* Which differences do you observe when using the high level quantities in contrast to the low level quantities?\n", "* Which differences do you observe when using the 100k dataset in contrast to the low statistics dataset?\n", "* What do you conclude for the implementation of a machine learning technique at the LHC?\n", " \n", "Hint 1: In order to compare ROC curves it is advisable to plot them in the same figure.\n", "\n", "Hint 2: In the aforementioned paper you will also find plots using an extremely large set of simulated events. 
These might support your conclusions.\n", " \n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from sklearn.model_selection import train_test_split\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "100000" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "##############################################################\n", "# \n", "# read the data\n", "# switch between datasets by commenting and uncommenting the filename lines\n", "#\n", "#############################################################\n", "\n", "#filename = \"https://www.physi.uni-heidelberg.de/~reygers/lectures/2020/smipp/HIGGS_10k.csv\" # 10000 events \n", "filename = \"https://www.physi.uni-heidelberg.de/~reygers/lectures/2020/smipp/HIGGS_100k.csv\" # 100000 events \n", "\n", "df = pd.read_csv(filename, engine='python')\n", "len(df)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n",
"<tr><th></th><th>class</th><th>var1</th><th>var2</th><th>var3</th><th>var4</th><th>var5</th><th>var6</th><th>var7</th><th>var8</th><th>var9</th><th>...</th><th>var19</th><th>var20</th><th>var21</th><th>var22</th><th>var23</th><th>var24</th><th>var25</th><th>var26</th><th>var27</th><th>var28</th></tr>\n",
"</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>1.0</td><td>0.907542</td><td>0.329147</td><td>0.359412</td><td>1.497970</td><td>-0.313010</td><td>1.095531</td><td>-0.557525</td><td>-1.588230</td><td>2.173076</td><td>...</td><td>-1.138930</td><td>-0.000819</td><td>0.000000</td><td>0.302220</td><td>0.833048</td><td>0.985700</td><td>0.978098</td><td>0.779732</td><td>0.992356</td><td>0.798343</td></tr>\n",
"<tr><th>1</th><td>1.0</td><td>0.798835</td><td>1.470639</td><td>-1.635975</td><td>0.453773</td><td>0.425629</td><td>1.104875</td><td>1.282322</td><td>1.381664</td><td>0.000000</td><td>...</td><td>1.128848</td><td>0.900461</td><td>0.000000</td><td>0.909753</td><td>1.108330</td><td>0.985692</td><td>0.951331</td><td>0.803252</td><td>0.865924</td><td>0.780118</td></tr>\n",
"<tr><th>2</th><td>0.0</td><td>1.344385</td><td>-0.876626</td><td>0.935913</td><td>1.992050</td><td>0.882454</td><td>1.786066</td><td>-1.646778</td><td>-0.942383</td><td>0.000000</td><td>...</td><td>-0.678379</td><td>-1.360356</td><td>0.000000</td><td>0.946652</td><td>1.028704</td><td>0.998656</td><td>0.728281</td><td>0.869200</td><td>1.026736</td><td>0.957904</td></tr>\n",
"<tr><th>3</th><td>1.0</td><td>1.105009</td><td>0.321356</td><td>1.522401</td><td>0.882808</td><td>-1.205349</td><td>0.681466</td><td>-1.070464</td><td>-0.921871</td><td>0.000000</td><td>...</td><td>-0.373566</td><td>0.113041</td><td>0.000000</td><td>0.755856</td><td>1.361057</td><td>0.986610</td><td>0.838085</td><td>1.133295</td><td>0.872245</td><td>0.808487</td></tr>\n",
"<tr><th>4</th><td>0.0</td><td>1.595839</td><td>-0.607811</td><td>0.007075</td><td>1.818450</td><td>-0.111906</td><td>0.847550</td><td>-0.566437</td><td>1.581239</td><td>2.173076</td><td>...</td><td>-0.654227</td><td>-1.274345</td><td>3.101961</td><td>0.823761</td><td>0.938191</td><td>0.971758</td><td>0.789176</td><td>0.430553</td><td>0.961357</td><td>0.957818</td></tr>\n",
"</tbody>\n", "</table>\n", "<p>5 rows × 29 columns</p>\n", "</div>
" ], "text/plain": [ " class var1 var2 var3 var4 var5 var6 \\\n", "0 1.0 0.907542 0.329147 0.359412 1.497970 -0.313010 1.095531 \n", "1 1.0 0.798835 1.470639 -1.635975 0.453773 0.425629 1.104875 \n", "2 0.0 1.344385 -0.876626 0.935913 1.992050 0.882454 1.786066 \n", "3 1.0 1.105009 0.321356 1.522401 0.882808 -1.205349 0.681466 \n", "4 0.0 1.595839 -0.607811 0.007075 1.818450 -0.111906 0.847550 \n", "\n", " var7 var8 var9 ... var19 var20 var21 var22 \\\n", "0 -0.557525 -1.588230 2.173076 ... -1.138930 -0.000819 0.000000 0.302220 \n", "1 1.282322 1.381664 0.000000 ... 1.128848 0.900461 0.000000 0.909753 \n", "2 -1.646778 -0.942383 0.000000 ... -0.678379 -1.360356 0.000000 0.946652 \n", "3 -1.070464 -0.921871 0.000000 ... -0.373566 0.113041 0.000000 0.755856 \n", "4 -0.566437 1.581239 2.173076 ... -0.654227 -1.274345 3.101961 0.823761 \n", "\n", " var23 var24 var25 var26 var27 var28 \n", "0 0.833048 0.985700 0.978098 0.779732 0.992356 0.798343 \n", "1 1.108330 0.985692 0.951331 0.803252 0.865924 0.780118 \n", "2 1.028704 0.998656 0.728281 0.869200 1.026736 0.957904 \n", "3 1.361057 0.986610 0.838085 1.133295 0.872245 0.808487 \n", "4 0.938191 0.971758 0.789176 0.430553 0.961357 0.957818 \n", "\n", "[5 rows x 29 columns]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#\n", "# define the names of the variables\n", "# the actual physical meaning can be looked up in the paper\n", "#\n", "#\n", "\n", "df.columns = ['class', 'var1','var2','var3','var4','var5',\n", " 'var6','var7','var8','var9','var10',\n", " 'var11','var12','var13','var14','var15',\n", " 'var16','var17','var18','var19','var20',\n", " 'var21','var22','var23','var24','var25',\n", " 'var26','var27','var28']\n", "\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "#\n", "# create two subsets of the data\n", "# dfall - 'class' + low level quantities \n", "# dfahl - 'class' + high level quantities\n", 
"#\n", "# two different techniques to create these subsets are used for demonstration purposes \n", "#\n", "\n", "\n", "dfall = df.iloc[:,0:22] \n", "dfahl = df[['class', 'var22','var23','var24','var25','var26',\n", " 'var27','var28']]\n", " " ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "#\n", "# prepare data sets to be used in the various machine learning algorithms\n", "#\n", "# split data set in training and test data set\n", "#\n", "#\n", "\n", "y = dfall['class'].values\n", "X = dfall[[col for col in dfall.columns if col!=\"class\"]]\n", "\n", "#y = dfahl['class'].values\n", "#X = dfahl[[col for col in dfahl.columns if col!=\"class\"]]\n", "\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=True)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DecisionTreeClassifier()" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn import linear_model\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n", "from sklearn.neural_network import MLPClassifier\n", "from sklearn import tree\n", "\n", "###################################################################\n", "#\n", "# Your Code starts here\n", "#\n", "###################################################################\n", "\n", "# define at least four different machine learning techniques\n", "\n", "### YOUR CODE ###\n", "\n", "\n", "#\n", "# fit training data\n", "#\n", "\n", "### YOUR CODE ###\n", "\n", "\n", "#\n", "# calculate predictions for test data\n", "#\n", "\n", "### YOUR CODE ###\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import roc_auc_score\n", "\n", "#\n", "# Determine the Model Accuracy and the AUC score for each algorithm\n", "#\n", 
"\n", "### YOUR CODE ###\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "from sklearn.metrics import roc_curve\n", "\n", "#\n", "# Plot the ROC curves\n", "# for each setup, plot the ROC curves of the different algorithms in one figure\n", "#\n", "\n", "\n", "\n", "\n", "### YOUR CODE ###\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3) Results and Interpretation\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }