{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# SMIPP 21/22 - Exercise Sheet 10\n", "\n", "## Prof. Dr. K. Reygers, Dr. R. Stamen, Dr. M. Völkl\n", "\n", "## Hand in by: Thursday, January 20th, 12:00\n", "### Submit the file(s) through the Übungsgruppenverwaltung\n", "\n", "\n", "### Names (up to two):\n", "### Points: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 10.1 Overfitting (15 points)\n", "\n", "With sufficiently many free parameters in the fit function, any dataset can be fitted perfectly. To assess the usefulness of an interpolation, one can check how well it performs on an independent test sample. Here, a training and a test sample of equal size are provided. Each point can be assumed to be subject to the same type of independent fluctuations, drawn from a normal distribution in $y$, with no uncertainty in $x$.\n", "\n", "#### a)\n", "\n", "Define a function which computes the $\\chi^2=\\sum_i \\frac{(y_i-f(x_i))^2}{\\sigma_i^2}$ (use $\\sigma_i=1$) and minimize it to fit the training data with a polynomial of arbitrary degree.\n", "\n", "Hint: You can make use of the fact that iminuit automatically deduces the number of free parameters from the input array.\n", "\n", "\n", "#### b)\n", "\n", "Fit polynomials of degree 0 to 14 to the training sample provided. Plot them together with the training and test data. In a separate plot, show the $\\chi^2/\\mathrm{dof}$ for the training and test sample as a function of the number of free parameters.\n", "\n", "Hint: The test sample always has 20 degrees of freedom." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from iminuit import Minuit\n", "import matplotlib.pyplot as plt\n", "\n", "xtrain = np.array([6.04, 3.62, 6.77, 9.92, 2.23, 2.51, 9.84, 5.14, 0.335, 5.9, 1.05, 1.86, 4.71, 9.3, 1.59, 9.14, 8.3, 7.39, 9.64, 0.0284])\n", "ytrain = np.array([-0.135, 1.49, 0.283, -1.6, 5.63, 5.08, -1.05, -0.554, 7.65, -1.97, 8.76, 8.56, -0.887, -1.88, 8.54, -2.51, -0.496, -1.71, -0.582, 8.75])\n", "xtest = np.array([6.82, 0.842, 9.71, 7.61, 0.425, 3.23, 8.09, 7.28, 0.602, 9.54, 7.55, 9.21, 6.76, 3.92, 5.91, 7.82, 9.23, 4.36, 6.31, 2.59])\n", "ytest = np.array([-1.24, 6.67, -1.49, -1.16, 6.83, 2.88, -1.16, -2.05, 7.09, -2.31, -1.89, -3.06, -0.685, -0.436, -1.19, 0.565, -2.22, -0.289, 0.595, 6.03])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Solution" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 10.2 Decision Boundary (10 points)\n", "\n", "When long-lived neutral particles decay into a pair of charged particles, the decay has a clear signature, as the charged tracks appear out of nowhere. When only the momenta of the decay products are known but not their masses, the Armenteros-Podolanski plot is a useful way to distinguish the different particles. The training data contains the values\n", "$$ q_t $$\n", "(the momentum of one decay product transverse to the flight direction of the mother particle) and\n", "$$ \\alpha = \\frac{p_{l,1}-p_{l,2}}{p_{l,1}+p_{l,2}}$$\n", "(the asymmetry in the longitudinal momenta) for the decays $K^0_S \\rightarrow \\pi^{+} \\pi^{-}$ and $\\Lambda \\rightarrow p \\pi^{-}$.\n", "\n", "#### a)\n", "\n", "Read in the data and plot the points in ($\\alpha, q_t$) for the two classes of particles. Create a _numpy_ array of the vectors ($\\alpha, q_t$) as well as one with the classification.\n", "\n",
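"A minimal sketch for part a), using the `Kaons` and `Lambda` dataframes from the hint cell below (the column names `alpha` and `qt` are an assumption; check `Kaons.columns` after reading the files):\n", "\n", "```python\n", "import numpy as np\n", "\n", "# one row per decay: (alpha, qt); the column names are an assumption\n", "X = np.concatenate([Kaons[['alpha', 'qt']].to_numpy(), Lambda[['alpha', 'qt']].to_numpy()])\n", "# class labels: 0 for kaons, 1 for lambdas\n", "y = np.concatenate([np.zeros(len(Kaons)), np.ones(len(Lambda))])\n", "```\n", "\n",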
"#### b)\n", "\n", "Create a grid in $\\alpha\\in (-1,1)$, $q_t\\in (0,0.3)$ and evaluate the classifier at each grid point. Plot the result and draw the points from a) on top. Do this for the following classifiers:\n", "\n", "- k-nearest neighbors ($n=4$)\n", "- A linear discriminant\n", "- Decision trees or a random forest\n", "- A neural network\n", "- A Gaussian process\n", "\n", "#### c)\n", "\n", "Of the ones you tried, which seem particularly suited or unsuited for this dataset?\n", "\n", "Hint: The training data can be read in as follows:\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import sklearn\n", "import matplotlib.pyplot as plt\n", "from iminuit import Minuit\n", "from sklearn import *\n", "\n", "# The files can be read in this way\n", "Kaons = pd.read_table(\"https://www.physi.uni-heidelberg.de/~reygers/lectures/2020/smipp/K0.txt\", delimiter=' ')\n", "Lambda = pd.read_table(\"https://www.physi.uni-heidelberg.de/~reygers/lectures/2020/smipp/L0.txt\", delimiter=' ')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Solution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 10.3 MAGIC air showers: comparison of different classifiers (15 points)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this problem we use the [MAGIC machine learning dataset](https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope) of problem 9.2 to compare the performance of different classifiers as implemented in [scikit-learn](https://scikit-learn.org). Consider the following algorithms:\n", "\n", "* [Logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) (from problem 9.2)\n", "* [AdaBoost](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier)\n", "* [Random forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)\n", "* [Gradient boosting](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)\n", "\n", "a) Determine the model accuracy, the AUC score and the time needed for training for all classifiers (one way to collect these numbers is sketched below).\n", "\n", "b) Plot the ROC curves of all classifiers in one figure.\n", "\n", "c) Which classifier shows the best performance (with default parameters for AdaBoost, random forest and gradient boosting)?\n", "\n", "d) Determine the importance of each feature for the random forest classifier following this [example](https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html) from the scikit-learn webpage. Does this agree with the expectations from the plots of the feature distributions for signal and background in problem 9.2?\n", "\n",
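"A minimal sketch for timing and scoring a single classifier, here the random forest (the others follow the same pattern; `X_train`, `X_test`, `y_train` and `y_test` are defined in the cells below):\n", "\n", "```python\n", "import time\n", "\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.metrics import accuracy_score, roc_auc_score\n", "\n", "clf = RandomForestClassifier()  # default parameters\n", "start = time.time()\n", "clf.fit(X_train, y_train)\n", "print(f'training time: {time.time() - start:.2f} s')\n", "\n", "y_pred = clf.predict(X_test)               # hard labels for the accuracy\n", "y_score = clf.predict_proba(X_test)[:, 1]  # signal probability for the AUC\n", "print('accuracy:', accuracy_score(y_test, y_pred))\n", "print('AUC:', roc_auc_score(y_test, y_score))\n", "```\n",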
"\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "filename = \"https://www.physi.uni-heidelberg.de/~reygers/lectures/2020/smipp/magic04_data.txt\"\n", "df = pd.read_csv(filename, engine='python')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# use categories 1 and 0 instead of \"g\" and \"h\"\n", "df['class'] = df['class'].map({'g': 1, 'h': 0})" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " fLength fWidth fSize fConc fConc1 fAsym fM3Long fM3Trans \\\n", "0 28.7967 16.0021 2.6449 0.3918 0.1982 27.7004 22.0110 -8.2027 \n", "1 31.6036 11.7235 2.5185 0.5303 0.3773 26.2722 23.8238 -9.9574 \n", "2 162.0520 136.0310 4.0612 0.0374 0.0187 116.7410 -64.8580 -45.2160 \n", "3 23.8172 9.5728 2.3385 0.6147 0.3922 27.2107 -6.4633 -7.1513 \n", "4 75.1362 30.9205 3.1611 0.3168 0.1832 -5.5277 28.5525 21.8393 \n", "\n", " fAlpha fDist class \n", "0 40.0920 81.8828 1 \n", "1 6.3609 205.2610 1 \n", "2 76.9600 256.7880 1 \n", "3 10.4490 116.7370 1 \n", "4 4.6480 356.4620 1 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# create training and test data set\n", "y = df['class'].values\n", "X = df[[col for col in df.columns if col!=\"class\"]]\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, shuffle=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Solution:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }