Description
Many suffixes, both public and private, have been added to the list, and some of them were removed at a later date. While an up-to-date list is great for assessing the live web, the removed suffixes cause issues when we need to process historical URLs. It would be really helpful to maintain an add-only list (in addition to the existing one) from which no historically valid suffix is ever removed. Deletions could instead be marked with an inline note containing the date of and reason for removal.
One prominent use-case for such a historically accumulated PSL comes from web archives like the Wayback Machine of the Internet Archive and many other web archives around the world.
In the past I have tried two approaches to generate such an accumulated list:
- From the Wayback Machine: I downloaded all the historical captures of the published PSL file from the Wayback Machine, merged them all, removed any comments and empty lines, and sorted them to create a list of unique suffixes.
- From the Git history of this repo: I checked out every version of the list file from this repo iteratively and merged them to create the most comprehensive PSL (which resulted in more suffixes than the ones collected from the Wayback Machine).
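Both approaches reduce to the same merge step, which can be sketched roughly as follows. This is a minimal illustration, not the script I actually used: the function names `parse_rules` and `accumulate` and the sample snapshots are made up for the example, and it assumes the usual PSL conventions that `//` starts a comment line and that a rule line is read only up to the first whitespace.

```python
from __future__ import annotations

from typing import Iterable


def parse_rules(psl_text: str) -> set[str]:
    """Return the rules in one PSL snapshot, skipping comments and blank lines."""
    rules = set()
    for line in psl_text.splitlines():
        line = line.strip()
        if not line or line.startswith("//"):
            continue
        # Assumption: anything after the first whitespace is not part of the rule.
        rules.add(line.split()[0])
    return rules


def accumulate(snapshots: Iterable[str]) -> list[str]:
    """Merge any number of snapshots into one sorted list of unique rules."""
    merged: set[str] = set()
    for text in snapshots:
        merged |= parse_rules(text)
    return sorted(merged)


# Hypothetical snapshots: "example.invalid" exists only in the older one.
old = "// a comment\ncom\nexample.invalid\n\n*.ck\n"
new = "com\n*.ck\n!www.ck\n"
print(accumulate([old, new]))
# ['!www.ck', '*.ck', 'com', 'example.invalid']
```

Because the result is a plain set union, all comments, section markers, and ordering from the original files are discarded, which is exactly the loss of context described next.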
While these approaches work for the most part, they lose a great deal of context and annotations. There is also the risk that such automatically generated comprehensive lists include wrong or misspelled suffixes that were added by mistake in one revision and reverted later.
It would be helpful to provide one or more of the following, if practical:
- Include scripts to automatically generate comprehensive lists from the repo history while excluding any known bad entries.
- Maintain such a comprehensive list manually with the necessary comments and annotations.
- Publish the alternate comprehensive PSL at a well-known URL.
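As a rough sketch of what the first option might look like, the following walks the Git history of the list file, skips a set of known bad entries, and annotates each removed suffix with the date of the first commit in which it was missing. Everything here is an assumption made for illustration: the function names, the `known_bad` parameter, the `// removed <date>` note format, and the file path `public_suffix_list.dat`. It also does not follow file renames and records no removal reason, which would still need manual curation.

```python
from __future__ import annotations

import subprocess


def parse_rules(psl_text: str) -> set[str]:
    """Return the rules in one PSL snapshot, skipping comments and blank lines."""
    rules = set()
    for line in psl_text.splitlines():
        line = line.strip()
        if line and not line.startswith("//"):
            rules.add(line.split()[0])
    return rules


def git_versions(repo: str, path: str = "public_suffix_list.dat"):
    """Yield (commit date, file text) for each commit touching path, oldest first."""
    log = subprocess.run(
        ["git", "-C", repo, "log", "--reverse", "--date=short",
         "--format=%H %cd", "--", path],
        check=True, capture_output=True, encoding="utf-8",
    ).stdout
    for entry in log.splitlines():
        commit, date = entry.split()
        shown = subprocess.run(
            ["git", "-C", repo, "show", f"{commit}:{path}"],
            capture_output=True, encoding="utf-8",
        )
        if shown.returncode == 0:  # skip commits where the file is absent
            yield date, shown.stdout


def accumulate_with_removals(versions, known_bad=frozenset()):
    """Map every rule ever seen to its removal date, or None if still present."""
    removed_on: dict[str, str | None] = {}
    for date, text in versions:
        current = parse_rules(text) - set(known_bad)
        for rule in removed_on:
            if rule in current:
                removed_on[rule] = None      # present (or re-added)
            elif removed_on[rule] is None:
                removed_on[rule] = date      # first version where it is missing
        for rule in current:
            removed_on.setdefault(rule, None)
    return removed_on


def render(removed_on):
    """Produce list lines, with an inline note on removed rules."""
    return [
        rule if date is None else f"{rule} // removed {date}"
        for rule, date in sorted(removed_on.items())
    ]


# Hypothetical two-version history; "typo.exmple" stands in for a known bad entry.
history = [
    ("2020-01-01", "com\nold.example\ntypo.exmple\n"),
    ("2021-06-01", "// trimmed\ncom\n"),
]
print(render(accumulate_with_removals(history, known_bad={"typo.exmple"})))
# ['com', 'old.example // removed 2021-06-01']
```

Against a local clone, the same pipeline would be `render(accumulate_with_removals(git_versions("/path/to/clone"), known_bad))`. The trailing note relies on parsers reading each rule line only up to the first whitespace; a consumer that does not do this would need the notes on separate comment lines instead.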