@@ -8,6 +8,13 @@ Post uniqueness in an RSS feed can be determined by 4 different methods:
- The url of the first enclosure (common amongst podcasts)
- A hash of the title, link, description and enclosures
+ Note that the guid shouldn't change. The link, enclosure url and hash, however, can change when the post is updated.
+ So using the first approach (the guid) is preferable.
+
+ While not available in the RSS feed itself, you could also consider the following:
+
+ - Canonical URL on the page
+
## RSS feed Uniqueness ##
Determining the uniqueness for an RSS feed is harder.
@@ -29,3 +36,17 @@ Many sites have dropped support for this tag though. We could add special cases
- WordPress blogs
As it's easy to determine the feed location
+
+ ## How Winds handles uniqueness ##
+
+ For every feed, Winds evaluates which one of these fields is unique:
+
+ ['guid', 'link', 'enclosure[0].url', 'hash']
+
+ Note that the hash is computed before any enrichment is done on the feed content.
+ Winds then stores the unique value in `article.fingerprint`, prefixed with the field it came from, for example `guid:123` or `hash:123`.
+ After selecting the best algorithm, it uses a batch select and update to update the feed articles.
+
+ The unique values (fingerprints) of the last 20 articles are used to compute a hash for the RSS feed.
+ We use this information to occasionally merge RSS feeds.
+ After merging, the alternative URLs are stored to prevent people from submitting the same feed under a different URL.
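
The selection described under "How Winds handles uniqueness" can be sketched roughly as follows. This is a minimal Python illustration of the idea, not the Winds implementation: the function names, the dict-based post shape, the use of SHA-256, and the sorting of fingerprints before hashing are all assumptions made for the example.

```python
import hashlib

# Candidate fields in order of preference, mirroring the list in the diff.
CANDIDATES = ["guid", "link", "enclosure[0].url", "hash"]


def content_hash(post):
    """Hash the title, link, description and enclosure URLs of a raw post."""
    urls = [e.get("url") or "" for e in post.get("enclosures") or []]
    parts = [post.get("title") or "", post.get("link") or "",
             post.get("description") or ""] + urls
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


def field_value(post, field):
    """Return the value of one candidate field, or None if it is missing."""
    if field == "guid":
        return post.get("guid")
    if field == "link":
        return post.get("link")
    if field == "enclosure[0].url":
        enclosures = post.get("enclosures") or []
        return enclosures[0].get("url") if enclosures else None
    return content_hash(post)


def pick_unique_field(posts):
    """Pick the first candidate present on every post and distinct across posts."""
    for field in CANDIDATES:
        values = [field_value(p, field) for p in posts]
        if all(values) and len(set(values)) == len(values):
            return field
    return "hash"  # assumed fallback when nothing is unique


def fingerprints(posts):
    """Build fingerprints such as 'guid:123' or 'hash:ab12...' for a feed."""
    field = pick_unique_field(posts)
    return [f"{field}:{field_value(p, field)}" for p in posts]


def feed_hash(posts, window=20):
    """Hash the fingerprints of the last `window` posts (oldest first assumed)."""
    recent = sorted(fingerprints(posts)[-window:])
    return hashlib.sha256("\n".join(recent).encode("utf-8")).hexdigest()
```

In this sketch two feed URLs that serve the same recent posts produce the same `feed_hash`, which is the kind of signal that could be used to decide whether feeds are candidates for merging. Sorting the fingerprints makes the result independent of post order; whether Winds does this is not stated in the diff.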