A robust URL validation tool that scans codebase files for links and verifies their validity. The tool handles both absolute URLs (http/https) and relative file paths across multiple file types, providing detailed reports with color-coded output.
- Features
- Requirements
- Installation
- Basic Usage
- Helper Tools
- Output Format
- Configuration
- Troubleshooting
- Comprehensive Link Validation
  - Absolute URLs (http/https)
  - Relative file paths with smart path resolution
  - Root-relative paths (starting with /)
  - Image and SVG links
  - Markdown header links (#section-name)
  - Cross-file links with anchors
- Multi-Language Support - Detects URLs in over 25 file types:
  - Markdown (.md)
  - HTML (.html, .htm)
  - CSS/SCSS (.css, .scss)
  - JavaScript/TypeScript (.js, .jsx, .ts, .tsx)
  - Python (.py)
  - Shell scripts (.sh, .bash, .zsh)
  - PowerShell (.ps1, .psm1, .psd1)
  - Configuration files (.json, .yaml, .yml, .toml, .env)
  - And many more...
- Smart Path Resolution
  - Case-insensitive path matching
  - Directory index detection (_index.md, README.md)
  - Parent directory traversal
  - Special character handling
- Detailed Reporting
  - Color-coded console output
  - Categorized link results
  - Comprehensive log files
  - Runtime performance metrics
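The link categories listed under Comprehensive Link Validation could be told apart roughly as in the sketch below. It is not taken from url_checker.py: the function name `classify_link`, the category labels, the order of the checks, and the `IMAGE_EXTENSIONS` set are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Assumed set of image extensions; the real tool may recognise others.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".webp")


def classify_link(link: str) -> str:
    """Return a coarse category label for a link (hypothetical helper)."""
    parts = urlsplit(link)
    if parts.scheme in ("http", "https"):
        return "absolute"
    if link.startswith("#"):
        return "header"
    path = parts.path.lower()
    if path.endswith(".svg"):
        return "svg"
    if path.endswith(IMAGE_EXTENSIONS):
        return "image"
    if link.startswith("/"):
        return "root-relative"
    if parts.fragment:
        return "relative-with-anchor"
    return "relative"
```

In this sketch an absolute URL that points at an image is still reported as "absolute"; whether the tool makes the same choice is not stated in this document.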
- Python 3.8 or higher
- Required packages:
  - requests - For HTTP requests
  - colorama - For colored terminal output
pip install requests colorama
Clone the repository or download the URL checker tool:
git clone https://github.com/username/jumpstart-sdk.git
cd jumpstart-sdk/tools/url-checker
Run the URL checker from the repository root to scan all supported files:
python tools/url-checker/url_checker.py
# Check files in a specific directory
python url_checker.py --dir=docs
# Use a custom timeout for HTTP requests
python url_checker.py --timeout=30
# Exclude specific folders from being checked
python url_checker.py --exclude node_modules vendor
# Combine multiple options
python url_checker.py --dir=src --exclude tests temp --timeout=20
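For readers curious how options of this shape are usually declared, here is a minimal argparse sketch. It only mirrors the flags shown in the commands above; `build_parser` is a hypothetical name, the default for `--dir` is an assumption, and the real url_checker.py may define its options differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the three options used in the examples (illustrative only)."""
    parser = argparse.ArgumentParser(description="Check links in a codebase.")
    # Assumption: scan the current directory when --dir is omitted.
    parser.add_argument("--dir", default=".", help="directory to scan")
    # 15 seconds matches the TIMEOUT value shown in the Configuration notes.
    parser.add_argument("--timeout", type=int, default=15,
                        help="HTTP timeout in seconds")
    # nargs="*" lets one flag take several space-separated folder names.
    parser.add_argument("--exclude", nargs="*", default=[],
                        help="folders to skip")
    return parser


args = build_parser().parse_args(
    ["--dir=src", "--exclude", "tests", "temp", "--timeout=20"]
)
```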
The --exclude option accepts multiple folder paths that will be skipped during URL checking. This is useful for:
- Excluding third-party code and vendor directories
- Skipping generated code folders
- Ignoring temporary or build directories
- Reducing execution time for large repositories
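The time saving comes from never descending into the excluded folders at all. A minimal sketch of that idea with os.walk follows; `iter_files` is a hypothetical helper, and matching an exclusion either by folder name or by path relative to the scan root is an assumption about how --exclude interprets its arguments.

```python
import os
from typing import Iterable, Iterator


def iter_files(root: str, exclude: Iterable[str] = ()) -> Iterator[str]:
    """Yield file paths under root, skipping excluded folders (sketch)."""
    excluded = {os.path.normpath(p) for p in exclude}
    for current, dirs, files in os.walk(root):
        # Assigning to dirs[:] prunes the walk in place, so os.walk never
        # enters a folder that was filtered out here.
        dirs[:] = [
            d for d in dirs
            if d not in excluded
            and os.path.normpath(
                os.path.relpath(os.path.join(current, d), root)
            ) not in excluded
        ]
        for name in files:
            yield os.path.join(current, name)
```

Pruning during the walk, rather than filtering the results afterwards, is what keeps large folders such as node_modules from costing any traversal time.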
The URL checker comes with two companion tools to help with testing and visualization:
create_test_files.py creates a realistic directory structure with various file types containing different kinds of URLs for testing purposes.
# Create test files with default settings
python create_test_files.py
# Create a more complex test environment
python create_test_files.py --complexity=5 --file-count=10
# Clean up test files when done
python create_test_files.py --clean
Options:
- --dir=NAME - Directory where test files will be created (default: "test_files")
- --clean - Remove existing test files instead of creating new ones
- --file-count=N - Base number of files per type (default: 5)
- --complexity=N - Directory structure complexity level 1-5 (default: 3)
simulate_output.py demonstrates what the URL checker's output will look like without actually checking any URLs.
python simulate_output.py
This is useful for:
- Testing the formatting and appearance of output
- Demonstrating the tool to others
- Testing terminal color compatibility
The URL checker provides categorized output in both the console and log files:
═════════════════════════════════════════════════════════
📊 LINK VALIDATION SUMMARY (156 links checked)
═════════════════════════════════════════════════════════
❌ BROKEN LINKS: 8
• Absolute URLs: 3
• Relative URLs without anchors: 2
• Relative URLs with anchors: 1
• Image URLs: 2
📭 NO LINKS FOUND: 1
• SVG URLs
🔍 CATEGORIES WITH NO BROKEN LINKS: 2
• Root-relative URLs: 10 OK links
• Header links: 7 OK links
✅ OK LINKS: 148
⏱️ RUNTIME: 3.70 minutes (0:03:42)
📄 FULL LOGS: logs/broken_urls_2023-10-20_15-30-45.log
❌ Broken links were found. Check the logs for details.
Detailed logs are saved to the logs directory with timestamps, containing:
- Full details of all checked URLs
- Status codes for broken absolute URLs
- File paths for broken relative URLs
- Categorized summaries
- Runtime statistics
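The runtime line and the timestamped log name in the sample summary can be reproduced with the standard library alone. The sketch below is inferred from the sample output only; `format_runtime` and `log_path` are hypothetical names, not functions known to exist in url_checker.py.

```python
from datetime import datetime, timedelta


def format_runtime(seconds: float) -> str:
    """Format elapsed seconds as decimal minutes plus an H:MM:SS clock."""
    clock = timedelta(seconds=round(seconds))
    return f"{seconds / 60:.2f} minutes ({clock})"


def log_path(started: datetime, directory: str = "logs") -> str:
    """Build a log file name of the shape shown in the sample summary."""
    stamp = started.strftime("%Y-%m-%d_%H-%M-%S")
    return f"{directory}/broken_urls_{stamp}.log"


print(format_runtime(222))
print(log_path(datetime(2023, 10, 20, 15, 30, 45)))
```

With 222 seconds of elapsed time this prints the same "3.70 minutes (0:03:42)" text as the sample, which shows that the two figures on that line describe the same duration.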
Add domains that should be considered valid without checking:
# In url_checker.py
KNOWN_VALID_DOMAINS = [
"learn.microsoft.com",
"icanhazip.com",
# Add more domains to skip
]
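A check against such a list might compare the URL's host name rather than the whole string, so that a path which merely contains a trusted domain is not skipped by accident. The sketch below assumes this behaviour; `is_known_valid` is a hypothetical helper, and treating subdomains as trusted is also an assumption.

```python
from urllib.parse import urlsplit

KNOWN_VALID_DOMAINS = [
    "learn.microsoft.com",
    "icanhazip.com",
]


def is_known_valid(url: str, domains=KNOWN_VALID_DOMAINS) -> bool:
    """Return True when the URL's host is a listed domain or a subdomain."""
    host = (urlsplit(url).hostname or "").lower()
    # Assumption: "api.icanhazip.com" should be trusted like "icanhazip.com".
    return any(host == d or host.endswith("." + d) for d in domains)
```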
Adjust the HTTP request timeout:
# In url_checker.py
TIMEOUT = 15 # Timeout in seconds
Or use the command-line option:
python url_checker.py --timeout=30
Modify the SUPPORTED_FILE_TYPES dictionary to control which file types are checked.
If you encounter many timeout errors:
- Increase the timeout value: --timeout=30
- Add problematic domains to KNOWN_VALID_DOMAINS
Some URLs may be incorrectly marked as broken due to:
- Server-side rate limiting
- Temporary server issues
- Authentication requirements
For trusted domains that may have connectivity issues, add them to KNOWN_VALID_DOMAINS.
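When reading the status codes recorded in the logs, one way to separate likely false positives from genuinely dead links is to group the codes by cause. The mapping below is a suggested reading aid, not the tool's own logic; `triage_status` and its three labels are hypothetical.

```python
def triage_status(status_code: int) -> str:
    """Group an HTTP status code into ok / inconclusive / broken (sketch)."""
    if 200 <= status_code < 400:
        return "ok"
    # 401/403/407 suggest authentication, 429 suggests rate limiting, and
    # 5xx suggests a server-side problem that may be temporary.
    if status_code in (401, 403, 407, 429) or status_code >= 500:
        return "inconclusive"
    return "broken"
```

Links that land in the "inconclusive" group are the natural candidates for a manual recheck or for KNOWN_VALID_DOMAINS.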
If relative URLs are incorrectly reported as broken:
- Check case sensitivity (important on Linux and on any other case-sensitive file system)
- Verify directory separators (/ not \)
- Check parent directory traversal notation (../)
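To diagnose a single relative link by hand, it can help to resolve it one segment at a time and see where the lookup fails. The sketch below does that while tolerating backslashes and case differences; `find_on_disk` is a hypothetical helper and does not reproduce the tool's actual resolution logic.

```python
import os
from typing import Optional


def find_on_disk(base_dir: str, link: str) -> Optional[str]:
    """Resolve link against base_dir segment by segment, ignoring case.

    Returns the path as it is really spelled on disk, or None if a
    segment cannot be found.
    """
    current = os.path.normpath(base_dir)
    for segment in link.replace("\\", "/").split("/"):
        if segment in ("", "."):
            continue
        if segment == "..":
            current = os.path.dirname(current)
            continue
        try:
            entries = os.listdir(current)
        except OSError:
            return None
        # Prefer an exact match, then fall back to a case-insensitive one.
        if segment in entries:
            match = segment
        else:
            match = next(
                (e for e in entries if e.lower() == segment.lower()), None
            )
        if match is None:
            return None
        current = os.path.join(current, match)
    return current
```

If the returned path differs from the link only in letter case, the link will work on a case-insensitive file system but may fail on a case-sensitive one.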
A typical testing workflow using the helper tools:
- Generate test files: python create_test_files.py
- Check only those files: python url_checker.py --dir=test_files
- Clean up when finished: python create_test_files.py --clean
The URL checker returns the following exit codes:
- 0 - All URLs are valid (no broken links found)
- 1 - At least one broken link was found
This makes the tool suitable for use in CI/CD pipelines where you might want to fail a build when broken links are detected.
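A build script can rely on nothing more than that exit code. The wrapper below is a minimal sketch; `links_are_valid` is a hypothetical name, and the command in the trailing comment assumes the script is run from the directory containing url_checker.py.

```python
import subprocess
import sys
from typing import Sequence


def links_are_valid(command: Sequence[str]) -> bool:
    """Run a link-check command and report whether it exited with 0."""
    result = subprocess.run(command)
    return result.returncode == 0


# Example use in a CI step (not executed here):
#   ok = links_are_valid([sys.executable, "url_checker.py", "--dir=docs"])
#   sys.exit(0 if ok else 1)
```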