|
1 | 1 | {
|
2 | 2 | "cells": [
|
3 | 3 | {
|
4 |
| - "metadata": {}, |
5 | 4 | "cell_type": "markdown",
|
| 5 | + "id": "700d8667b7cb899c", |
| 6 | + "metadata": {}, |
6 | 7 | "source": [
|
7 | 8 | "\n",
|
8 | 9 | "# Local Outlier Factor Example\n",
|
9 |
| - "Local Outlier Factor (LOF) is a common algorithm for identifying data points that are inliers/outliers relative to their neighbors. The algorithm works by comparing how close an element is to its neighbors vs how close they are to their neighbors. The number of neighbors to use, k, is set by the user.\n", |
10 |
| - "Scores much less than one are inliers, scores much greater are outliers, and those near one are neither.\n", |
| 10 | + "Local Outlier Factor (LOF) is a common algorithm for identifying data points that are inliers or outliers relative to their neighbors. The algorithm generates an outlier score by comparing the local density around a data point with the local density around each of its neighbors. The number of neighbors to use, k, is defined by the user.\n", |
| 11 | + "\n", |
| 12 | + "Points with scores much less than 1 are inliers, points with scores much greater than 1 are outliers, and points with scores near 1 are neither.\n", |
11 | 13 | "This demo is derived from the [scikit-learn Local Outlier Detection demo](https://scikit-learn.org/stable/auto_examples/neighbors/plot_lof_outlier_detection.html)."
|
12 |
| - ], |
13 |
| - "id": "700d8667b7cb899c" |
| 14 | + ] |
14 | 15 | },
|
15 | 16 | {
|
16 | 17 | "cell_type": "markdown",
|
|
22 | 23 | },
|
23 | 24 | {
|
24 | 25 | "cell_type": "code",
|
| 26 | + "execution_count": null, |
25 | 27 | "id": "3a51adb8-f89f-4cb3-9a41-24a36d8f1fcf",
|
26 | 28 | "metadata": {},
|
| 29 | + "outputs": [], |
27 | 30 | "source": [
|
28 | 31 | "from sedona.spark import SedonaContext\n",
|
29 | 32 | "\n",
|
30 | 33 | "config = SedonaContext.builder().getOrCreate()\n",
|
31 | 34 | "sedona = SedonaContext.create(config)"
|
32 |
| - ], |
33 |
| - "outputs": [], |
34 |
| - "execution_count": null |
| 35 | + ] |
35 | 36 | },
|
36 | 37 | {
|
37 | 38 | "cell_type": "markdown",
|
38 | 39 | "id": "cff198142e2ebced",
|
39 | 40 | "metadata": {},
|
40 | 41 | "source": [
|
41 | 42 | "# Data Generation\n",
|
42 |
| - "We generate some data. Most of it is random, but some data is explicitly designed to be outliers" |
| 43 | + "\n", |
| 44 | + "The following code generates data with two clusters of inliers and some outliers." |
43 | 45 | ]
|
44 | 46 | },
|
45 | 47 | {
|
46 | 48 | "cell_type": "code",
|
| 49 | + "execution_count": null, |
47 | 50 | "id": "99f8c27a-9c0b-4f8f-a388-54a781d892e3",
|
48 | 51 | "metadata": {},
|
| 52 | + "outputs": [], |
49 | 53 | "source": [
|
50 | 54 | "import numpy as np\n",
|
51 | 55 | "import pyspark.sql.functions as f\n",
|
|
58 | 62 | "X_inliers = 0.3 * np.random.randn(100, 2)\n",
|
59 | 63 | "X_inliers = np.r_[X_inliers + 2, X_inliers - 2]\n",
|
60 | 64 | "X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))\n",
|
61 |
| - "X = np.r_[X_inliers, X_outliers]\n" |
62 |
| - ], |
63 |
| - "outputs": [], |
64 |
| - "execution_count": null |
| 65 | + "X = np.r_[X_inliers, X_outliers]" |
| 66 | + ] |
65 | 67 | },
|
66 | 68 | {
|
67 | 69 | "cell_type": "markdown",
|
68 | 70 | "id": "372ae8aee6714c8c",
|
69 | 71 | "metadata": {},
|
70 | 72 | "source": [
|
71 | 73 | "## Generating LOF\n",
|
72 |
| - "We use the LOF implementation in Wherobots to generate this statistic on the data. We set k to 20.\n" |
| 74 | + "The following code uses the LOF implementation in Wherobots to generate an outlier score on the selected data. We set k to 20." |
73 | 75 | ]
|
74 | 76 | },
|
75 | 77 | {
|
76 | 78 | "cell_type": "code",
|
| 79 | + "execution_count": null, |
77 | 80 | "id": "fe5af921-957c-48cc-942a-e6c744c72bc5",
|
78 | 81 | "metadata": {},
|
| 82 | + "outputs": [], |
79 | 83 | "source": [
|
80 | 84 | "df = sedona.createDataFrame(X).select(ST_MakePoint(f.col(\"_1\"), f.col(\"_2\")).alias(\"geometry\"))\n",
|
81 |
| - "outliers_df = local_outlier_factor(df, 20)\n" |
82 |
| - ], |
83 |
| - "outputs": [], |
84 |
| - "execution_count": null |
| 85 | + "outliers_df = local_outlier_factor(df, 20)" |
| 86 | + ] |
85 | 87 | },
|
86 | 88 | {
|
87 | 89 | "cell_type": "code",
|
| 90 | + "execution_count": null, |
88 | 91 | "id": "d060243f-c00b-436d-99d4-e0fbaba89930",
|
89 | 92 | "metadata": {},
|
| 93 | + "outputs": [], |
90 | 94 | "source": [
|
91 | 95 | "outliers_df.show()"
|
92 |
| - ], |
93 |
| - "outputs": [], |
94 |
| - "execution_count": null |
| 96 | + ] |
95 | 97 | },
|
96 | 98 | {
|
97 | 99 | "cell_type": "markdown",
|
98 | 100 | "id": "5a475ce35250afea",
|
99 | 101 | "metadata": {},
|
100 | 102 | "source": [
|
101 | 103 | "## Visualization\n",
|
102 |
| - "We visualize the results using geopandas. Some manipulations are made to the data to improve the clarity of the visualization." |
| 104 | + "Finally, we visualize the results using geopandas. Some manipulations are made to the data to improve the clarity of the visualization." |
103 | 105 | ]
|
104 | 106 | },
|
105 | 107 | {
|
106 | 108 | "cell_type": "code",
|
| 109 | + "execution_count": null, |
107 | 110 | "id": "244c8d55-0c69-4922-b769-666c056098c4",
|
108 | 111 | "metadata": {},
|
| 112 | + "outputs": [], |
109 | 113 | "source": [
|
110 | 114 | "import geopandas as gpd\n",
|
111 | 115 | "\n",
|
|
126 | 130 | "\n",
|
127 | 131 | "ax.set_title('LOF Scores')\n",
|
128 | 132 | "ax.legend(['Outlier Scores', 'Data points'])"
|
129 |
| - ], |
130 |
| - "outputs": [], |
131 |
| - "execution_count": null |
| 133 | + ] |
132 | 134 | },
|
133 | 135 | {
|
134 | 136 | "cell_type": "code",
|
| 137 | + "execution_count": null, |
135 | 138 | "id": "057ef39f-0d20-464d-9891-590e5aff6d72",
|
136 | 139 | "metadata": {},
|
137 |
| - "source": [], |
138 | 140 | "outputs": [],
|
139 |
| - "execution_count": null |
| 141 | + "source": [] |
140 | 142 | }
|
141 | 143 | ],
|
142 | 144 | "metadata": {
|
|