ManiCrawl

This Python script uses Selenium to crawl the Manipal University Library Portal and extract links to PDF question papers organized by year and subject.

Features

  • Recursively navigates year and subject folders
  • Extracts all PDF links
  • Saves results in structured JSON format
  • Supports concurrent crawling for multiple years
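The concurrent-crawling feature can be sketched with a thread pool, one worker per year. Here `crawl_year` is a stub standing in for the script's actual Selenium-driven navigation; the function names and the result shape are assumptions for illustration:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def crawl_year(year):
    # Stub: in the real script this drives a Selenium WebDriver through
    # the year's folder tree and collects every PDF link it finds.
    return {"year": year, "pdfs": []}

def crawl_all(years, max_workers=4):
    # Submit one crawl job per year and gather results as they finish.
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(crawl_year, y): y for y in years}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results
```

Because each year is an independent folder tree, years can be crawled in parallel without coordinating between workers.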

Requirements

  • Python 3.7+
  • Google Chrome
  • ChromeDriver (compatible with your Chrome version)

Installation

pip install selenium

Download the ChromeDriver release that matches your installed Chrome version and place it in your PATH or in the same directory as the script.

Usage

python crawler.py

Collected PDF links will be saved in the pdf_results/ folder as {year}_pdfs.json.
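As a sketch of the output layout, the results can be read back with the standard library. The exact JSON schema is an assumption here (a flat list of URLs per year), and the example URL is hypothetical:

```python
import json
from pathlib import Path

out_dir = Path("pdf_results")
out_dir.mkdir(exist_ok=True)

# Hypothetical entry; real URLs come from the library portal.
links = ["https://libportal.example.edu/papers/2023/CS101.pdf"]

# Write one file per year, named {year}_pdfs.json.
out_file = out_dir / "2023_pdfs.json"
out_file.write_text(json.dumps(links, indent=2))

# Reading the results back for further processing.
loaded = json.loads(out_file.read_text())
```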

Configuration

To crawl specific years, edit the years list in main():

years = [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024]
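The same list can be written more compactly with `range` (note the stop value is exclusive):

```python
# Equivalent to listing 2010 through 2024 explicitly.
years = list(range(2010, 2025))
```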

GitHub Actions

You can automate crawling using GitHub Actions. Create a workflow file under .github/workflows/ and define the trigger (manual or scheduled).
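A minimal workflow might look like the following; the file name, schedule, and artifact name are all hypothetical, and this assumes the GitHub-hosted Ubuntu runner's preinstalled Chrome is compatible with your ChromeDriver setup:

```yaml
# .github/workflows/crawl.yml (hypothetical name)
name: Crawl question papers
on:
  workflow_dispatch:        # manual trigger
  schedule:
    - cron: "0 3 * * 0"     # weekly, Sundays 03:00 UTC
jobs:
  crawl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install selenium
      - run: python crawler.py
      - uses: actions/upload-artifact@v4
        with:
          name: pdf-results
          path: pdf_results/
```

Uploading `pdf_results/` as an artifact lets you download the collected JSON files from the workflow run page.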

Credits

Thanks to Epicguest97/crawler for the original inspiration and code structure.

About

This was inspired by a project belonging to a dear friend of mine.
