This repository was archived by the owner on Oct 12, 2021. It is now read-only.

Commit 39085bd

more docs on rss
1 parent 49d1bae commit 39085bd

1 file changed: +21 -0 lines changed
RSS.md

@@ -8,6 +8,13 @@ Post uniqueness in an RSS feed can be determined by 4 different methods:
- The url of the first enclosure (common amongst podcasts)
- A hash of the title, link, description and enclosures

Note that the guid shouldn't change, whereas the link, enclosure url and hash can all change when the post is updated.
Using the first approach (the guid) is therefore preferable.
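
To illustrate why the hash is the least stable option, here is a minimal sketch of a hash over the title, link, description and enclosures. The function name `post_hash`, the separator and the choice of SHA-256 are assumptions made for this example; the text above does not say which digest is actually used.

```python
import hashlib


def post_hash(title, link, description, enclosure_urls):
    """Hash the four fields into one hex digest (hypothetical helper)."""
    parts = [title or "", link or "", description or ""] + list(enclosure_urls)
    # Join with a separator that is unlikely to occur in the fields, so that
    # ("ab", "c") and ("a", "bc") do not produce the same input.
    payload = "\x1f".join(parts).encode("utf-8")
    # SHA-256 is an assumption; any stable digest works for deduplication.
    return hashlib.sha256(payload).hexdigest()


original = post_hash("Episode 1", "https://example.com/ep1", "Pilot",
                     ["https://example.com/ep1.mp3"])
edited = post_hash("Episode 1 (updated)", "https://example.com/ep1", "Pilot",
                   ["https://example.com/ep1.mp3"])
print(original == edited)
```

Editing only the title already yields a different digest, so the updated post would look like a new one. A guid that stays fixed avoids this.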

Although it is not available in the RSS feed itself, you could also consider:

- Canonical URL on the page
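
As a rough sketch of how the canonical URL could be read from a post's page, the snippet below looks for a `<link rel="canonical">` tag using only the Python standard library. The names `CanonicalLinkParser` and `canonical_url` are hypothetical, and fetching the page is left out.

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens, so split before matching.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attributes.get("href")


def canonical_url(html):
    """Return the canonical URL declared in the HTML, or None if absent."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical


page = ('<html><head><link rel="stylesheet" href="a.css">'
        '<link rel="canonical" href="https://example.com/post"/></head></html>')
print(canonical_url(page))
```

Because this requires downloading and parsing every post's page, it costs considerably more than reading a field from the feed.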

## RSS feed Uniqueness ##

Determining the uniqueness for an RSS feed is harder.
@@ -29,3 +36,17 @@ Many sites have dropped support for this tag though. We could add special cases
- WordPress blogs

As it's easy to determine the feed location

## How Winds handles uniqueness ##

For every feed, Winds evaluates which one of these fields is unique across the feed's articles:

`['guid', 'link', 'enclosure[0].url', 'hash']`

Note that the hash is computed before any enrichment is done on the feed content.
Winds then stores the unique value in `article.fingerprint`, prefixed with the field name, in the format `guid:123` or `hash:123` etc.
Once the best algorithm has been selected, it uses a batch select and update to update the feed articles.
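
The selection step described above might look roughly like the sketch below. It is not Winds' actual code: the article dict shape, the helper names, the preference for earlier fields and the fallback to `hash` are all assumptions made for illustration.

```python
import hashlib

# Candidate fields in the order listed above; earlier entries are tried first
# (treating that order as a preference is an assumption).
CANDIDATES = ["guid", "link", "enclosure[0].url", "hash"]


def field_value(article, field):
    """Read one candidate field from an article dict (hypothetical shape)."""
    enclosures = article.get("enclosures") or []
    if field == "enclosure[0].url":
        return enclosures[0].get("url") if enclosures else None
    if field == "hash":
        parts = [article.get("title") or "", article.get("link") or "",
                 article.get("description") or ""]
        parts += [e.get("url") or "" for e in enclosures]
        return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()
    return article.get(field)


def pick_unique_field(articles):
    """Return the first field that is present and distinct on every article."""
    for field in CANDIDATES:
        values = [field_value(a, field) for a in articles]
        if all(values) and len(set(values)) == len(values):
            return field
    return "hash"  # assumed fallback when no field is strictly unique


def fingerprint(article, field):
    """Build the `field:value` string stored in article.fingerprint."""
    return f"{field}:{field_value(article, field)}"


articles = [
    {"guid": "1", "link": "https://example.com/a", "title": "A"},
    {"guid": "1", "link": "https://example.com/b", "title": "B"},
]
field = pick_unique_field(articles)
print(field, [fingerprint(a, field) for a in articles])
```

In this example both articles share the guid `1`, so the guid is rejected and `link` is selected instead.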

The fingerprints of the last 20 articles are used to compute a hash for the RSS feed.
We use this information to occasionally merge RSS feeds.
After merging, the alternative URLs are stored to prevent people from submitting the same feed under a different url.
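
A feed-level hash of this kind could be sketched as follows. The function name `feed_hash`, the newest-first ordering, the sorting step and the use of SHA-256 are assumptions for this example, not a description of the real implementation.

```python
import hashlib


def feed_hash(fingerprints, window=20):
    """Hash the most recent `window` article fingerprints into one digest."""
    recent = fingerprints[:window]  # assumes the list is ordered newest first
    # Sorting makes the digest independent of article order (an assumption).
    payload = "\n".join(sorted(recent)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Two hypothetical feed URLs that serve the same articles.
feed_a = ["guid:3", "guid:2", "guid:1"]
feed_b = ["guid:3", "guid:2", "guid:1"]
print(feed_hash(feed_a) == feed_hash(feed_b))
```

Feeds whose hashes match are candidates for merging. An exact match on all 20 fingerprints is strict, so two copies of a feed fetched at slightly different times can briefly disagree, which may be one reason merging only happens occasionally.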
