Releases: medialab/hyphe
Hot summer 2025, crawling from within
ChangeLog:
- Allow starting individual or multiple crawls directly from the NETWORK page
- Make the MANAGE TAGS page also apply to UNDECIDED webentities (#506 #441)
- Hide www from DEFINE WEBENTITY prefix slider (#510)
- Minor frontend fixes (correct crawl statuses for startpages in WebEntity's list of pages, fixed durations for crawls in Monitor all crawls page, proper permalinks to wayback machine for INA Web Archives)
Full Changelog: v1.12.1...v1.12.2
Hot 2025, the post Skybox era
ChangeLog:
- Add a button to start a crawl directly from the Network page
- Add frontend ways to cancel all pending crawls and to cancel/recrawl individual crawls from the Monitor all crawls page (+ fix API's crawl.cancel_all route to also cancel crawls not yet scheduled within scrapy and set their crawl status appropriately)
- Improve reviewed crawls button in the Monitor all crawls page
- Add default webentity creation rules for Bluesky and X user accounts, as well as skyblogs for webarchives
- Fix Monitor latest crawls page not displaying the most recent ones on some servers due to misaligned timestamps
- Small fixes for BnF & INA Web Archives (proper permalinks, adapt to recent upstream changes)
- Minor fixes to installation doc and frontend display (make tags validation easier with an "Add" button, visually display the crawl status of each page of a webentity, handle total redirected pages missing from old hyphe corpus versions, autostop network spatialization, make some buttons more visible, fix duration displayed for canceled unscheduled crawls, autofocus input in Import page, etc.)
Full Changelog: v1.12.0...v1.12.1
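As an illustration of the `crawl.cancel_all` route mentioned above, here is a minimal sketch of how a client might build the request body. Only the method name comes from the changelog; the JSON-RPC style envelope, the single corpus-id parameter, the `build_rpc_request` helper and the `my-corpus` value are assumptions made for this example, so check the API documentation of your Hyphe instance before relying on them.

```python
import json
from itertools import count

# Hypothetical helper: incrementing ids let a client match responses to requests.
_request_ids = count(1)


def build_rpc_request(method, *params):
    """Return a JSON-RPC style request body as a JSON string.

    The envelope shape (method / params / id) is an assumption, not
    something confirmed by the changelog.
    """
    return json.dumps({
        "method": method,
        "params": list(params),
        "id": next(_request_ids),
    })


# "my-corpus" is a placeholder corpus id (assumed parameter).
payload = build_rpc_request("crawl.cancel_all", "my-corpus")
print(payload)
```

The sketch only builds the payload; sending it (for instance with an HTTP POST to the instance's API endpoint) is left out because the endpoint URL depends on each deployment.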
2025 up in the Skybox
ChangeLog:
- Fix Default WebEntityCreationRule not always applied when different from domain (upgrades to hyphe-traph v2.2) (#499)
- Add an option in the web interface to load tags from a CSV file along with importing new or existing WebEntities (#503)
- Add the possibility to set a crawl job as reviewed (#478)
- Allow renaming a corpus (#457)
- Better handle WebEntities with prefixes including special characters in the path (#447)
- Distinguish crawled pages in error from simply redirected ones (#492)
- Auto resolve more urls directly within crawler (#463)
- Fix automatic feeding of recent UserAgents, whether behind a proxy or not
- Small fixes for INA & BnF Web Archives (#502 + permalinks with misformatted dates)
- Minor fixes to lookups logic, config loading, manual installation doc, corpus landing page (#487) and backend logs display
Full Changelog: v1.11.0...v1.12.0
Early 2024
Back-to-school papercuts
ChangeLog:
- Add a button to export metadata from all pages of a webentity (#318)
- Explicitly separate startpages warnings regarding redirected pages and faulty ones (#379)
- Allow setting a specific User-Agent per crawl within the web interface (#461)
- Display hints on the meaning of the different possible statuses of a crawl (#474)
- Highlight corresponding webentities when hovering a status or a tag in the network legend (#459)
- Switch User-Agents list used within crawls to relying on https://www.useragents.me/ (#453)
- Various improvements (cleaner backend logs, remove empty traphs directories (#475), updated heuristics for webentity links calculation rhythm, visual fixes (#476, #477))
Hot Summer '23
ChangeLog:
- Migrated caching WELinks to (working) files instead of mongo to handle huge corpuses
- Allow setting archives pass as ENV variable for docker instances
- Display time required by links indexation on overview
Summer '23
ChangeLog:
- Added handling of more webarchives as sources (Arquivo.pt + INA DLWeb) + fixed various webarchives frontend info (#469, #471,
- Added a corpus setting "ignore internal links" to crawl but not record links within the currently crawled webentity, in order to drastically speed up indexation of entities with crazy amounts of links (at a cost in terms of functionalities, since the network of internal pages is then not available, and entities that are split after a crawl will need to be recrawled) (cf #371, #378, #433)
- Better handle frontend warning on pending actions when trying to close a tab (#465, #466)
- Minor fixes (#448, #460, #467, #468, #470, 50d97e8, 85decf2)
Better, faster, stronger traph, there it is!
ChangeLog:
- Switched to the breaking new version 2.1 of hyphe-traph, which should help speed up indexation on big networks, but requires rebuilding corpuses from scratch
- Make iterator traph calls less frequent to leave priority to quick user actions
- Fixed stack on calling empty callback in List Webentities
- Upgraded urllib3 to handle SSL deprecation
- Froze dependencies to maintain python2.7 compat
Summer '22
ChangeLog:
- Upgraded User Agents list
- Added extra default WebEntity CreationRules for Github, Instagram, TikTok, Reddit and a bunch of blog platforms
- Added perma.cc to list of default autofollowlinks
- Diverse fixes and extra features for webarchives (links to archive permalinks, etc.)
- Minor bugfixes
Spring '22
ChangeLog:
- Added a distinction between successful and errored crawled pages to identify Suspicious crawls (#425)
- Fixed frontend compatibility within Hyphe-Browser (medialab/hyphe-browser#212)
- Fixed WebArchives crawling interface (#431) and behavior from BNF's archives (#426)
- Improved network page's interaction using latest sigma.js v2.2 (node highlight etc & #367)
- Allowed frontend to automatically restart a closed corpus when reopening the frontend directly on a specific corpus link (#440)
- Allowed checking contiguous checkboxes in frontend's lists of webentities using the shift key (#438)
- Allowed tuning the frontend's header color from the config (#430)
- Published Hyphe on Zenodo & Software Heritage
- Minor fixes (#397, #388, #432, #429, #437, #343, #341, #444, #325)
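For readers curious about the shift-key selection mentioned in the list above (#438), the usual pattern behind such a feature can be sketched as follows. This is not Hyphe's frontend code: the `select_range` function, its arguments, and the use of row indices are all hypothetical, and the sketch is written in Python purely for illustration.

```python
def select_range(checked, last_clicked, clicked, shift_held):
    """Return the new set of checked row indices after a click.

    checked      -- iterable of currently checked row indices
    last_clicked -- index of the previously clicked row, or None
    clicked      -- index of the row just clicked
    shift_held   -- True if the shift key was held during the click
    """
    checked = set(checked)
    if shift_held and last_clicked is not None:
        # Shift-click: check every row between the two clicks, inclusive.
        low, high = sorted((last_clicked, clicked))
        checked.update(range(low, high + 1))
    elif clicked in checked:
        # Plain click on a checked row: uncheck it.
        checked.remove(clicked)
    else:
        # Plain click on an unchecked row: check it.
        checked.add(clicked)
    return checked


# Row 2 is already checked; shift-clicking row 5 checks rows 2 to 5.
print(sorted(select_range({2}, last_clicked=2, clicked=5, shift_held=True)))
```

Returning a new set rather than mutating the input keeps the function easy to test; a real frontend would additionally remember `clicked` as the next `last_clicked`.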