Commit 5507552

[DOCS] editing lof notebook (apache#286)
1 parent db26c0e commit 5507552

1 file changed: +29 -27 lines changed

example_notebooks/lof_example.ipynb

Lines changed: 29 additions & 27 deletions
@@ -1,16 +1,17 @@
 {
 "cells": [
 {
-"metadata": {},
 "cell_type": "markdown",
+"id": "700d8667b7cb899c",
+"metadata": {},
 "source": [
 "![](https://wherobots.com/wp-content/uploads/2023/12/[email protected])\n",
 "# Local Outlier Factor Example\n",
-"Local Outlier Factor (LOF) is a common algorithm for identifying data points that are inliers/outliers relative to their neighbors. The algorithm works by comparing how close an element is to its neighbors vs how close they are to their neighbors. The number of neighbors to use, k, is set by the user.\n",
-"Scores much less than one are inliers, scores much greater are outliers, and those near one are neither.\n",
+"Local Outlier Factor (LOF) is a common algorithm for identifying data points that are inliers or outliers relative to their neighbors. The algorithm generates an outlier score that compares the proximity of a data point's density relative to its neighbors. The number of neighbors to use, k, is defined by the user.\n",
+"\n",
+"Scores much less than 1 are inliers, scores much greater than 1 are outliers, and those near 1 are neither.\n",
 "This demo is derived from the [scikit-learn Local Outlier Detection demo](https://scikit-learn.org/stable/auto_examples/neighbors/plot_lof_outlier_detection.html)."
-],
-"id": "700d8667b7cb899c"
+]
 },
 {
 "cell_type": "markdown",
@@ -22,30 +23,33 @@
 },
 {
 "cell_type": "code",
+"execution_count": null,
 "id": "3a51adb8-f89f-4cb3-9a41-24a36d8f1fcf",
 "metadata": {},
+"outputs": [],
 "source": [
 "from sedona.spark import SedonaContext\n",
 "\n",
 "config = SedonaContext.builder().getOrCreate()\n",
 "sedona = SedonaContext.create(config)"
-],
-"outputs": [],
-"execution_count": null
+]
 },
 {
 "cell_type": "markdown",
 "id": "cff198142e2ebced",
 "metadata": {},
 "source": [
 "# Data Generation\n",
-"We generate some data. Most of it is random, but some data is explicitly designed to be outliers"
+"\n",
+"The following code generates data with two clusters of inliers and some outliers."
 ]
 },
 {
 "cell_type": "code",
+"execution_count": null,
 "id": "99f8c27a-9c0b-4f8f-a388-54a781d892e3",
 "metadata": {},
+"outputs": [],
 "source": [
 "import numpy as np\n",
 "import pyspark.sql.functions as f\n",
@@ -58,54 +62,54 @@
 "X_inliers = 0.3 * np.random.randn(100, 2)\n",
 "X_inliers = np.r_[X_inliers + 2, X_inliers - 2]\n",
 "X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))\n",
-"X = np.r_[X_inliers, X_outliers]\n"
-],
-"outputs": [],
-"execution_count": null
+"X = np.r_[X_inliers, X_outliers]"
+]
 },
 {
 "cell_type": "markdown",
 "id": "372ae8aee6714c8c",
 "metadata": {},
 "source": [
 "## Generation LOF\n",
-"We use the LOF implementation in Wherobots to generate this statistic on the data. We set k to 20.\n"
+"The following code uses the LOF implementation in Wherobots to generate an outlier score on the selected data. We set k to 20."
 ]
 },
 {
 "cell_type": "code",
+"execution_count": null,
 "id": "fe5af921-957c-48cc-942a-e6c744c72bc5",
 "metadata": {},
+"outputs": [],
 "source": [
 "df = sedona.createDataFrame(X).select(ST_MakePoint(f.col(\"_1\"), f.col(\"_2\")).alias(\"geometry\"))\n",
-"outliers_df = local_outlier_factor(df, 20)\n"
-],
-"outputs": [],
-"execution_count": null
+"outliers_df = local_outlier_factor(df, 20)"
+]
 },
 {
 "cell_type": "code",
+"execution_count": null,
 "id": "d060243f-c00b-436d-99d4-e0fbaba89930",
 "metadata": {},
+"outputs": [],
 "source": [
 "outliers_df.show()"
-],
-"outputs": [],
-"execution_count": null
+]
 },
 {
 "cell_type": "markdown",
 "id": "5a475ce35250afea",
 "metadata": {},
 "source": [
 "## Visualization\n",
-"We visualize the results using geopandas. Some manipulations are made to the data to improve the clarity of the visualization."
+"Finally, we visualize the results using geopandas. Some manipulations are made to the data to improve the clarity of the visualization."
 ]
 },
 {
 "cell_type": "code",
+"execution_count": null,
 "id": "244c8d55-0c69-4922-b769-666c056098c4",
 "metadata": {},
+"outputs": [],
 "source": [
 "import geopandas as gpd\n",
 "\n",
@@ -126,17 +130,15 @@
 "\n",
 "ax.set_title('LOF Scores')\n",
 "ax.legend(['Outlier Scores', 'Data points'])"
-],
-"outputs": [],
-"execution_count": null
+]
 },
 {
 "cell_type": "code",
+"execution_count": null,
 "id": "057ef39f-0d20-464d-9891-590e5aff6d72",
 "metadata": {},
-"source": [],
 "outputs": [],
-"execution_count": null
+"source": []
 }
 ],
 "metadata": {

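The notebook text in this diff describes LOF as a score that compares a point's local density with that of its k nearest neighbors, with values near 1 being unremarkable and values much greater than 1 indicating outliers. The following is a minimal, standard-library sketch of the textbook LOF computation, included only to illustrate that description. It is not the Wherobots or Sedona `local_outlier_factor` implementation; the function name `local_outlier_factor_sketch`, the sample data, and the choice of `k=10` are assumptions made for this example.

```python
import math
import random


def local_outlier_factor_sketch(points, k):
    """Return one LOF score per point using brute-force neighbor search.

    Hypothetical helper for illustration; ties in distance are not
    treated specially, and duplicate points are assumed to be absent.
    """
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < len(points)")

    neighbors = []   # indices of the k nearest neighbors of each point
    k_distance = []  # distance from each point to its k-th nearest neighbor
    for i, p in enumerate(points):
        ranked = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        neighbors.append([j for _, j in ranked])
        k_distance.append(ranked[-1][0])

    def reach_dist(i, j):
        # Reachability distance of point i with respect to neighbor j.
        return max(k_distance[j], math.dist(points[i], points[j]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [
        k / sum(reach_dist(i, j) for j in neighbors[i])
        for i in range(n)
    ]

    # LOF: mean density of the neighbors divided by the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (k * lrd[i])
        for i in range(n)
    ]


# Assumed sample data: one tight cluster plus a single distant point.
random.seed(0)
cluster = [(random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(50)]
scores = local_outlier_factor_sketch(cluster + [(5.0, 5.0)], k=10)
```

In this sketch the last entry of `scores` belongs to the distant point, so it should be the largest score by a wide margin. The brute-force neighbor search is quadratic in the number of points, which is acceptable for a small illustration but is not how a distributed implementation would be expected to work.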