URL Checker

A URL validation tool that scans the files in a codebase for links and checks that each one resolves. It handles both absolute URLs (http/https) and relative file paths across multiple file types, and reports the results as color-coded console output and detailed log files.

📋 Table of Contents

Features
Requirements
Installation
Basic Usage
Helper Tools
Output Format
Configuration
Troubleshooting
Testing Workflow
Exit Codes

✨ Features

Comprehensive Link Validation
- Absolute URLs (http/https)
- Relative file paths with smart path resolution
- Root-relative paths (starting with /)
- Image and SVG links
- Markdown header links (#section-name)
- Cross-file links with anchors

Multi-Language Support - Detects URLs in over 25 file types:
- Markdown (.md)
- HTML (.html, .htm)
- CSS/SCSS (.css, .scss)
- JavaScript/TypeScript (.js, .jsx, .ts, .tsx)
- Python (.py)
- Shell scripts (.sh, .bash, .zsh)
- PowerShell (.ps1, .psm1, .psd1)
- Configuration files (.json, .yaml, .yml, .toml, .env)
- And many more...

Smart Path Resolution
- Case-insensitive path matching
- Directory index detection (_index.md, README.md)
- Parent directory traversal
- Special character handling

Detailed Reporting
- Color-coded console output
- Categorized link results
- Comprehensive log files
- Runtime performance metrics
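The link types listed above map onto the categories that appear in the summary report. As a rough illustration only, the sketch below sorts a link string into one of those categories. The function name `classify_link`, the `IMAGE_EXTENSIONS` set, and the order of the checks are assumptions made for this example and are not taken from url_checker.py.

```python
from urllib.parse import urlsplit

# Assumed set of image extensions; the real tool may recognize others.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".webp"}


def classify_link(link: str) -> str:
    """Return a report category for a link (hypothetical helper)."""
    parts = urlsplit(link)
    path = parts.path.lower()
    # Assumption: SVG and image links are reported separately,
    # whether they are absolute or relative.
    if path.endswith(".svg"):
        return "svg"
    if any(path.endswith(ext) for ext in IMAGE_EXTENSIONS):
        return "image"
    if parts.scheme in ("http", "https"):
        return "absolute"
    if link.startswith("#"):
        return "header"
    if link.startswith("/"):
        return "root-relative"
    if parts.fragment:
        return "relative-with-anchor"
    return "relative-without-anchor"


for sample in ("https://example.com/page", "#section-name", "../docs/guide.md#setup"):
    print(sample, "->", classify_link(sample))
```

Checking the file extension first means that an absolute URL pointing at an image is counted as an image link; a different ordering would be equally defensible.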

🧰 Requirements

Python 3.8 or higher
Required packages:
- requests - For HTTP requests
- colorama - For colored terminal output

pip install requests colorama

🚀 Installation

Clone the repository or download the URL checker tool:

git clone https://github.com/username/jumpstart-sdk.git
cd jumpstart-sdk/tools/url-checker

💻 Basic Usage

Run the URL checker from the repository root to scan all supported files:

python tools/url-checker/url_checker.py

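Command Line Options

The examples below call url_checker.py directly, as you would from the tools/url-checker directory. From the repository root, use the path tools/url-checker/url_checker.py instead.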

# Check files in a specific directory
python url_checker.py --dir=docs

# Use a custom timeout for HTTP requests
python url_checker.py --timeout=30

# Exclude specific folders from being checked
python url_checker.py --exclude node_modules vendor

# Combine multiple options
python url_checker.py --dir=src --exclude tests temp --timeout=20

The --exclude option accepts one or more folder paths, separated by spaces, and skips them during URL checking. This is useful for:

- Excluding third-party code and vendor directories
- Skipping generated code folders
- Ignoring temporary or build directories
- Reducing execution time for large repositories
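To show how options of this shape are commonly parsed, here is a minimal argparse sketch. It is an assumption about how the command line could be defined, not the parser used in url_checker.py. The default of "." for --dir is a guess, while the default timeout of 15 matches the TIMEOUT value in the Configuration section.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the three documented options."""
    parser = argparse.ArgumentParser(description="Check URLs in a codebase.")
    # Assumed default: scan the current directory.
    parser.add_argument("--dir", default=".", help="Directory to scan")
    parser.add_argument("--timeout", type=int, default=15,
                        help="HTTP request timeout in seconds")
    # nargs="*" collects every value up to the next option into a list.
    parser.add_argument("--exclude", nargs="*", default=[],
                        help="Folders to skip")
    return parser


args = build_parser().parse_args(
    ["--dir=src", "--exclude", "tests", "temp", "--timeout=20"]
)
print(args.dir, args.exclude, args.timeout)  # src ['tests', 'temp'] 20
```

With `nargs="*"`, the folder names after --exclude are gathered until the next option begins, which is why `--exclude tests temp --timeout=20` parses as two folders followed by a timeout.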

🛠️ Helper Tools

The URL checker comes with two companion tools to help with testing and visualization:

1. Test File Generator (`create_test_files.py`)

Creates a realistic directory structure with various file types containing different kinds of URLs for testing purposes.

# Create test files with default settings
python create_test_files.py

# Create a more complex test environment
python create_test_files.py --complexity=5 --file-count=10

# Clean up test files when done
python create_test_files.py --clean

Options:

- --dir=NAME - Directory where test files will be created (default: "test_files")
- --clean - Remove existing test files instead of creating new ones
- --file-count=N - Base number of files per type (default: 5)
- --complexity=N - Directory structure complexity level 1-5 (default: 3)

2. Output Simulator (`simulate_output.py`)

Demonstrates what the URL checker's output will look like without actually checking any URLs.

python simulate_output.py

This is useful for:

- Testing the formatting and appearance of output
- Demonstrating the tool to others
- Testing terminal color compatibility

📊 Output Format

The URL checker provides categorized output in both the console and log files:

Console Output Example

═════════════════════════════════════════════════════════
📊  LINK VALIDATION SUMMARY (156 links checked)
═════════════════════════════════════════════════════════

❌  BROKEN LINKS: 8
   • Absolute URLs: 3
   • Relative URLs without anchors: 2
   • Relative URLs with anchors: 1
   • Image URLs: 2

📭  NO LINKS FOUND: 1
   • SVG URLs

🔍  CATEGORIES WITH NO BROKEN LINKS: 2
   • Root-relative URLs: 10 OK links
   • Header links: 7 OK links

✅  OK LINKS: 148

⏱️  RUNTIME: 3.70 minutes (0:03:42)

📄 FULL LOGS: logs/broken_urls_2023-10-20_15-30-45.log

❌  Broken links were found. Check the logs for details.
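The RUNTIME line in the example shows the same duration twice, as decimal minutes and as a clock value. The sketch below reproduces that format from a number of seconds. The helper name `format_runtime` is hypothetical, and the tool may build the string differently.

```python
from datetime import timedelta


def format_runtime(seconds: float) -> str:
    """Format a duration as decimal minutes plus H:MM:SS (hypothetical helper)."""
    minutes = seconds / 60
    clock = str(timedelta(seconds=int(seconds)))
    return f"{minutes:.2f} minutes ({clock})"


print(format_runtime(222))  # 3.70 minutes (0:03:42)
```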

Log Files

Detailed logs are saved to the logs directory with timestamps, containing:

- Full details of all checked URLs
- Status codes for broken absolute URLs
- File paths for broken relative URLs
- Categorized summaries
- Runtime statistics
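The log path in the console example (logs/broken_urls_2023-10-20_15-30-45.log) suggests a date and time stamp in the file name. A minimal sketch of building such a name follows. The `log_path` function is hypothetical, and the exact naming logic in the tool is assumed from that single example.

```python
from datetime import datetime
from pathlib import Path


def log_path(now: datetime, log_dir: str = "logs") -> Path:
    """Build a timestamped log file path (hypothetical helper)."""
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    return Path(log_dir) / f"broken_urls_{stamp}.log"


print(log_path(datetime(2023, 10, 20, 15, 30, 45)).as_posix())
# logs/broken_urls_2023-10-20_15-30-45.log
```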

⚙️ Configuration

Known Valid Domains

Add domains that should be considered valid without checking:

# In url_checker.py
KNOWN_VALID_DOMAINS = [
    "learn.microsoft.com",
    "icanhazip.com",
    # Add more domains to skip
]
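As an illustration of how such a list could be applied, the sketch below checks a URL's host against it before any request is made. The `is_known_valid` function is hypothetical, and treating subdomains of a listed domain as valid is an assumption; the tool may compare domains differently.

```python
from urllib.parse import urlsplit

KNOWN_VALID_DOMAINS = [
    "learn.microsoft.com",
    "icanhazip.com",
]


def is_known_valid(url: str, domains=KNOWN_VALID_DOMAINS) -> bool:
    """Return True if the URL's host is a listed domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    # Assumption: match the exact host or a dot-separated subdomain,
    # so "evil-icanhazip.com" does not match "icanhazip.com".
    return any(host == d or host.endswith("." + d) for d in domains)


print(is_known_valid("https://learn.microsoft.com/en-us/azure/"))  # True
print(is_known_valid("https://example.com/"))  # False
```

Comparing parsed host names rather than searching the whole URL for a substring avoids accidental matches on query strings or look-alike domains.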

Timeout Settings

Adjust the HTTP request timeout:

# In url_checker.py
TIMEOUT = 15  # Timeout in seconds

Or use the command-line option:

python url_checker.py --timeout=30

File Extensions

Modify the SUPPORTED_FILE_TYPES dictionary to control which file types are checked.
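The layout of SUPPORTED_FILE_TYPES is not shown here, so the sketch below assumes one plausible shape: a mapping from file extension to a descriptive label. Both that shape and the `is_supported` helper are assumptions for illustration; check url_checker.py for the actual structure before editing it.

```python
from pathlib import Path

# Hypothetical shape; the real dictionary in url_checker.py may differ.
SUPPORTED_FILE_TYPES = {
    ".md": "Markdown",
    ".html": "HTML",
    ".py": "Python",
    ".yaml": "Configuration",
}


def is_supported(path: str, file_types=SUPPORTED_FILE_TYPES) -> bool:
    """Return True if the file's extension is a key in the mapping."""
    return Path(path).suffix.lower() in file_types


print(is_supported("docs/Guide.MD"))  # True
print(is_supported("build/app.exe"))  # False
```

Under this assumed shape, removing an entry stops files with that extension from being scanned, and adding one brings a new file type into scope.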

🔍 Troubleshooting

Timeout Issues

If you encounter many timeout errors:

- Increase the timeout value: --timeout=30
- Add problematic domains to KNOWN_VALID_DOMAINS

False Positives

Some URLs may be incorrectly marked as broken due to:

- Server-side rate limiting
- Temporary server issues
- Authentication requirements

For trusted domains that may have connectivity issues, add them to KNOWN_VALID_DOMAINS.
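When reviewing a log, it can help to separate status codes that usually indicate a truly missing page from those that often point to one of the causes above. The sketch below is one possible triage rule, not the checker's own logic. The `classify_status` function and the choice of codes are assumptions.

```python
def classify_status(status_code: int) -> str:
    """Triage an HTTP status code as ok, review, or broken (hypothetical helper)."""
    if 200 <= status_code < 400:
        return "ok"
    # 401/403 often mean authentication is required, 429 means rate
    # limiting, and 5xx responses are frequently temporary server issues.
    if status_code in (401, 403, 429) or 500 <= status_code < 600:
        return "review"
    return "broken"


for code in (200, 404, 429, 503):
    print(code, classify_status(code))
```

Links that land in the "review" group are reasonable candidates for a manual check or, for trusted domains, for KNOWN_VALID_DOMAINS.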

Relative Path Issues

If relative URLs are incorrectly reported as broken:

- Check the case of file and directory names (important on Linux, where file systems are typically case-sensitive)
- Verify directory separators (/ not \)
- Check parent directory traversal notation (../)
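To make these checks concrete, the sketch below resolves a relative link one path segment at a time, with case-insensitive matching, parent directory traversal, and directory index detection as described under Features. It is an illustration under assumptions, not the resolver in url_checker.py; the function name and the `INDEX_FILES` order are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Index file names taken from the Features list; the order is assumed.
INDEX_FILES = ("_index.md", "README.md")


def resolve_case_insensitive(base: Path, relative: str) -> Optional[Path]:
    """Resolve a relative link from base, ignoring case (hypothetical helper)."""
    current = base
    for part in relative.split("/"):
        if part in ("", "."):
            continue
        if part == "..":
            current = current.parent
            continue
        if not current.is_dir():
            return None
        # Compare names case-insensitively against the directory listing.
        matches = [p for p in current.iterdir() if p.name.lower() == part.lower()]
        if not matches:
            return None
        current = matches[0]
    if current.is_dir():
        # A link to a directory resolves to its index file when one exists.
        for name in INDEX_FILES:
            if (current / name).is_file():
                return current / name
    return current
```

For example, `resolve_case_insensitive(Path("docs/guide"), "../setup.md")` would return the path to `docs/Setup.md` if that file exists, and `None` if no entry matches. Because only "/" is treated as a separator, a link written with backslashes will not resolve.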

🔄 Testing Workflow

A typical testing workflow using the helper tools:

1. Generate test files: python create_test_files.py
2. Check only those files: python url_checker.py --dir=test_files
3. Clean up when finished: python create_test_files.py --clean

📋 Exit Codes

The URL checker returns the following exit codes:

0 - All URLs are valid (no broken links found)
1 - At least one broken link was found

This makes the tool suitable for use in CI/CD pipelines where you might want to fail a build when broken links are detected.
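As a sketch of how a pipeline step might act on these exit codes, the example below runs a command and reports whether it exited with 0. The `links_ok` wrapper is hypothetical, and the two commands are stand-ins that only mimic the documented exit codes; in a real pipeline the command would be the url_checker.py invocation.

```python
import subprocess
import sys
from typing import Sequence


def links_ok(command: Sequence[str]) -> bool:
    """Run a checker command and return True if it exited with code 0."""
    result = subprocess.run(list(command))
    return result.returncode == 0


# Stand-in commands that only mimic the documented exit codes.
clean_run = [sys.executable, "-c", "raise SystemExit(0)"]
broken_run = [sys.executable, "-c", "raise SystemExit(1)"]

print(links_ok(clean_run), links_ok(broken_run))  # True False
```

Most CI systems already fail a step when a command returns a non-zero exit code, so a wrapper like this is only needed when the result feeds into further logic.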
