nyhetis-core

Description

nyhetis is an automatic news dossier that crawls newspaper websites and keeps track of relevant news.

nyhetis-core provides the retrieval functionality and an API so that a client can eventually use it.

Functionality

nyhetis-core uses Sinatra to provide the needed endpoints and request handlers, and Cobweb to crawl the newsfeeds.

Under the hood, it uses Resque to queue crawl jobs. Crawling is expensive and may take several minutes or even hours, so to keep the system smooth and responsive this task is delegated to separate worker processes.
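A minimal sketch of how a crawl could be handed off to a background worker. Resque jobs are plain Ruby classes with a `@queue` class-level instance variable and a `self.perform` method. The class name `CrawlJob`, the queue name `:crawl`, the `crawled` log, and the example URL are hypothetical and not taken from the nyhetis-core source. The real crawl (done by Cobweb in the project) is replaced by a placeholder so the sketch runs without Redis or any gem.

```ruby
# Hypothetical Resque-style job; names are illustrative, not from nyhetis-core.
class CrawlJob
  # Resque reads this class-level instance variable to pick the queue.
  @queue = :crawl

  class << self
    attr_reader :queue

    # Resque workers call perform with the arguments given at enqueue time.
    # Here the expensive crawl is replaced by a placeholder that records the URL.
    def perform(feed_url)
      crawled << feed_url
      feed_url
    end

    # In-memory record of processed URLs, only for this sketch.
    def crawled
      @crawled ||= []
    end
  end
end

# With the resque gem and a running Redis, the web process would enqueue
# the job and return immediately:
#   Resque.enqueue(CrawlJob, "http://example.com/feed")
# A worker process would then invoke the equivalent of:
CrawlJob.perform("http://example.com/feed")
```

Because the request handler only enqueues the job, the time spent crawling is paid by the worker process rather than by the web process.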

Since online newspapers do not share a common HTML structure, the system relies on per-newspaper strategies, which allow future newsfeeds to be added to the crawl.

Once the crawler downloads a page from a newsfeed, the page is validated and processed. Validation follows the criteria of the concrete strategy. The concrete strategy parses the HTML of the page and extracts two things: the text of the news item and its heading. HTML parsing is performed with Nokogiri.
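The validate-then-extract flow described above could look roughly like the following sketch. All class names, method names, and the HTML layout of the imaginary newspaper are assumptions made for illustration. The project parses HTML with Nokogiri; plain regular expressions are swapped in here only so the sketch runs without gems, and they are not a robust way to parse real pages.

```ruby
# Hypothetical strategy interface; class and method names are illustrative.
class NewspaperStrategy
  # True when the downloaded page is an article this strategy understands.
  def valid?(html)
    raise NotImplementedError
  end

  # Returns a hash with the :heading and :text of the news item.
  def extract(html)
    raise NotImplementedError
  end

  # Validates first, then extracts; returns nil for pages that fail validation.
  def process(html)
    valid?(html) ? extract(html) : nil
  end
end

# Concrete strategy for an imaginary newspaper whose articles use
# <h1 class="title"> and <div class="body">. Regular expressions stand in
# for Nokogiri here purely to keep the sketch dependency-free.
class ExampleNewsStrategy < NewspaperStrategy
  HEADING = %r{<h1 class="title">(.*?)</h1>}m
  BODY    = %r{<div class="body">(.*?)</div>}m

  def valid?(html)
    html.match?(HEADING) && html.match?(BODY)
  end

  def extract(html)
    {
      heading: html[HEADING, 1].strip,
      text:    strip_tags(html[BODY, 1])
    }
  end

  private

  # Removes inline tags and collapses whitespace.
  def strip_tags(fragment)
    fragment.gsub(/<[^>]+>/, " ").gsub(/\s+/, " ").strip
  end
end

page = <<~HTML
  <html><body>
    <h1 class="title">Election results announced</h1>
    <div class="body"><p>The final count was published today.</p></div>
  </body></html>
HTML

article = ExampleNewsStrategy.new.process(page)
```

Supporting a new newspaper would then mean adding one more subclass with its own validation criteria and selectors, leaving the crawler untouched.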

The relevance of a news item is calculated using a Bag of Words: a crawled news item is marked as relevant if it contains at least one "word" defined in the bag.
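As an illustration of the rule above, a relevance check might be sketched as follows. The class name `BagOfWords` and its `relevant?` method are hypothetical, and the case-insensitive, whole-word matching is an assumption; the original only states that one matching word is enough.

```ruby
require "set"

# Hypothetical relevance check; the class name and API are illustrative.
class BagOfWords
  def initialize(words)
    @words = words.map(&:downcase).to_set
  end

  # A news item is relevant if its heading or text contains at least one
  # word from the bag. Matching is case-insensitive and on whole words
  # (both are assumptions of this sketch).
  def relevant?(heading, text)
    tokens = "#{heading} #{text}".downcase.scan(/[[:alnum:]]+/)
    tokens.any? { |token| @words.include?(token) }
  end
end

bag = BagOfWords.new(%w[election budget])
bag.relevant?("Election results announced", "The final count was published today.") # => true
bag.relevant?("Local team wins", "The match ended two to one.")                    # => false
```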

Supported newspapers

So far the following newspapers are supported, though this list might be extended soon:

Use

After installing the dependencies, start up the rack server using the config.ru configuration file:

rackup config.ru

Testing

To test the whole system, start the workers with RACK_ENV = test and run nyhetis in testing mode, also with RACK_ENV = test.

Along with the tests, a Newspaper Mock is provided that serves a single HTML site. The code is available under test/newspaper_mock.

Copyright

MIT License. Luis Carlos Mateos. 2013

