ManiCrawl

This Python script uses Selenium to crawl the Manipal University Library Portal and extract links to PDF question papers organized by year and subject.

Features

  • Recursively navigates year and subject folders
  • Extracts all PDF links
  • Saves results in structured JSON format
  • Supports concurrent crawling for multiple years
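The concurrent-crawling feature can be sketched with a thread pool, one worker per year. Here `crawl_year` is a stub standing in for the script's actual Selenium-driven navigation; the function names and the result shape are assumptions for illustration:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def crawl_year(year):
    # Stub: in the real script this drives a Selenium WebDriver through
    # the year's folder tree and collects every PDF link it finds.
    return {"year": year, "pdfs": []}

def crawl_all(years, max_workers=4):
    # Submit one crawl job per year and gather results as they finish.
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(crawl_year, y): y for y in years}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results
```

Because each year is an independent folder tree, years can be crawled in parallel without coordinating between workers.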

Requirements

  • Python 3.7+
  • Google Chrome
  • ChromeDriver (compatible with your Chrome version)

Installation

pip install selenium

Download the ChromeDriver release that matches your installed Chrome version and place it in your PATH or in the same directory as the script.

Usage

python crawler.py

Collected PDF links will be saved in the pdf_results/ folder as {year}_pdfs.json.
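As a sketch of the output layout, the results can be read back with the standard library. The exact JSON schema is an assumption here (a flat list of URLs per year), and the example URL is hypothetical:

```python
import json
from pathlib import Path

out_dir = Path("pdf_results")
out_dir.mkdir(exist_ok=True)

# Hypothetical entry; real URLs come from the library portal.
links = ["https://libportal.example.edu/papers/2023/CS101.pdf"]

# Write one file per year, named {year}_pdfs.json.
out_file = out_dir / "2023_pdfs.json"
out_file.write_text(json.dumps(links, indent=2))

# Reading the results back for further processing.
loaded = json.loads(out_file.read_text())
```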

Configuration

To crawl specific years, edit the years list in main():

years = [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024]
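The same list can be written more compactly with `range` (note the stop value is exclusive):

```python
# Equivalent to listing 2010 through 2024 explicitly.
years = list(range(2010, 2025))
```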

GitHub Actions

You can automate crawling using GitHub Actions. Create a workflow file under .github/workflows/ and define the trigger (manual or scheduled).
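A minimal workflow might look like the following; the file name, schedule, and artifact name are all hypothetical, and this assumes the GitHub-hosted Ubuntu runner's preinstalled Chrome is compatible with your ChromeDriver setup:

```yaml
# .github/workflows/crawl.yml (hypothetical name)
name: Crawl question papers
on:
  workflow_dispatch:        # manual trigger
  schedule:
    - cron: "0 3 * * 0"     # weekly, Sundays 03:00 UTC
jobs:
  crawl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install selenium
      - run: python crawler.py
      - uses: actions/upload-artifact@v4
        with:
          name: pdf-results
          path: pdf_results/
```

Uploading `pdf_results/` as an artifact lets you download the collected JSON files from the workflow run page.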

Credits

Thanks to Epicguest97/crawler for the original inspiration and code structure.

About

This was inspired by a project belonging to a dear friend of mine.
