Two Python scripts for recursively analyzing directories and generating comprehensive file metadata reports.
scan_directory.py: best for quick scans with both JSON and CSV output
- Lightweight and easy to use
- Outputs both JSON and CSV formats
- PDF-focused metadata extraction
- Date-time stamped output files
- Simple command-line interface
file_triage.py: best for in-depth analysis with detailed statistics
- Supports more file types (PDF, DOCX, ODT, TXT, MD, RTF)
- Advanced author/owner extraction
- Folder-level statistics
- Comprehensive error logging
- Modified date tracking
- Recursively scans directories for specified file extensions
- Extracts file metadata:
  - Filename and folder path
  - File extension and size (MB)
  - Creation date
  - Page count (PDF only)
  - Word count (PDF only)
  - Author (PDF only)
- Generates 4 output files per run:
  - Detailed log (JSON and CSV)
  - Summary log (JSON and CSV)
- All files include date-time stamps
- Filter by file extension
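The recursive scan with an extension filter described above can be sketched roughly as follows. This is a minimal illustration, not the code in scan_directory.py; the function name `find_files` is hypothetical, and it assumes extensions are passed without a leading dot, as in the command-line examples.

```python
from pathlib import Path


def find_files(input_folder, extensions):
    """Yield files under input_folder whose extension is in `extensions`.

    Hypothetical helper: extensions may be given with or without a leading
    dot ("pdf" or ".pdf"), and matching is case-insensitive.
    """
    wanted = {"." + ext.lower().lstrip(".") for ext in extensions}
    # rglob("*") walks the folder and all of its subfolders.
    for path in sorted(Path(input_folder).rglob("*")):
        if path.is_file() and path.suffix.lower() in wanted:
            yield path
```

Calling `list(find_files("./documents", ["pdf", "docx"]))` would return every PDF and DOCX file below `./documents`.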
python scan_directory.py <input_folder> <output_folder> <extensions>
- input_folder: Path to the directory to scan
- output_folder: Path where output files will be saved
- extensions: Space-separated list of file extensions (e.g., pdf docx txt)
# Scan for PDF files only
python scan_directory.py ./documents ./output_logs pdf
# Scan for multiple file types
python scan_directory.py ./documents ./output_logs pdf docx txt
# Windows example
python scan_directory.py "C:\Documents\Project Files" "C:\Reports" pdf docx
All files include matching date-time stamps (YYYYMMDD_HHMMSS):
- detailed_log_YYYYMMDD_HHMMSS.json - JSON array with all file details
- detailed_log_YYYYMMDD_HHMMSS.csv - CSV table with file details
- summary_log_YYYYMMDD_HHMMSS.json - Aggregated statistics
- summary_log_YYYYMMDD_HHMMSS.csv - Flattened summary with sections
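One way to produce four file names that share a single stamp is to format the time once and reuse it. The sketch below assumes the naming pattern listed above; `output_paths` and its dictionary keys are hypothetical names chosen for illustration.

```python
from datetime import datetime
from pathlib import Path


def output_paths(output_folder, now=None):
    """Return the four report paths, all sharing one YYYYMMDD_HHMMSS stamp."""
    # Format the timestamp once so every file of a run gets the same stamp.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    folder = Path(output_folder)
    return {
        f"{kind}_{fmt}": folder / f"{kind}_log_{stamp}.{fmt}"
        for kind in ("detailed", "summary")
        for fmt in ("json", "csv")
    }
```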
- filename
- folder_path
- file_extension
- size_in_MBs
- create_date
- page_count
- word_count
- author
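The fields listed above map naturally onto CSV columns. A minimal sketch using the standard `csv` module, assuming each file is represented as a dictionary keyed by those field names (the function name `write_detailed_csv` is hypothetical):

```python
import csv

# Column order follows the field list in this README.
FIELDS = ["filename", "folder_path", "file_extension", "size_in_MBs",
          "create_date", "page_count", "word_count", "author"]


def write_detailed_csv(records, csv_path):
    """Write one row per file; fields missing from a record are left empty."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS, restval="")
        writer.writeheader()
        writer.writerows(records)
```

Leaving missing values empty suits fields such as page count, which only apply to PDF files.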
- Log run date
- Input folder path
- Total files, size, pages, and words
- File counts by extension
- File sizes by extension
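The summary values above can be derived from the detailed records in a single pass. The following is a sketch under the assumption that records are dictionaries using the detailed-log field names; the output key names are illustrative and may not match the script's actual summary keys.

```python
from collections import defaultdict


def summarize(records):
    """Aggregate totals plus per-extension file counts and sizes."""
    counts = defaultdict(int)
    sizes = defaultdict(float)
    totals = {"total_files": 0, "total_size_mb": 0.0,
              "total_pages": 0, "total_words": 0}
    for rec in records:
        ext = rec["file_extension"]
        counts[ext] += 1
        sizes[ext] += rec["size_in_MBs"]
        totals["total_files"] += 1
        totals["total_size_mb"] += rec["size_in_MBs"]
        # Page and word counts may be missing or None for non-PDF files.
        totals["total_pages"] += rec.get("page_count") or 0
        totals["total_words"] += rec.get("word_count") or 0
    return {**totals,
            "count_by_extension": dict(counts),
            "size_by_extension": dict(sizes)}
```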
pip install PyMuPDF
- Recursively scans directories and subdirectories
- Extracts detailed metadata for each file:
  - Filename and path
  - File extension
  - Size in MB
  - Page count (for PDF, DOCX, TXT, ODT files)
  - Word count (for text-based files)
  - Creation and modified dates
  - Author/Owner information
- Generates two timestamped JSON reports:
  - Detailed Report: Complete list of all files with full metadata
  - Summary Report: Aggregated statistics by extension and folder
- Filter files by extension
- Comprehensive error handling and logging
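Most of the per-file metadata listed above is available from the file system without any third-party library. A minimal sketch, assuming the key names used in the detailed report; `basic_metadata` is a hypothetical helper, not a function from file_triage.py.

```python
from datetime import datetime
from pathlib import Path


def basic_metadata(path):
    """Collect name, folder, extension, size and dates for one file."""
    path = Path(path)
    info = path.stat()
    # st_birthtime exists on some platforms only. On Linux the fallback,
    # st_ctime, is the last metadata change rather than a true creation time.
    created = info.st_birthtime if hasattr(info, "st_birthtime") else info.st_ctime
    return {
        "filename": path.name,
        "folder_path": str(path.parent),
        "file_extension": path.suffix.lower(),
        "size_mb": round(info.st_size / (1024 * 1024), 2),
        "create_date": datetime.fromtimestamp(created).isoformat(timespec="seconds"),
        "modified_date": datetime.fromtimestamp(info.st_mtime).isoformat(timespec="seconds"),
    }
```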
pip install PyMuPDF
pip install -r requirements.txt
Optional dependencies for file_triage.py:
- PyMuPDF: PDF page and word counts (fast and efficient)
- python-docx: DOCX page and word counts
- odfpy: ODT file support
- pywin32: Windows file ownership (Windows only)
Install only what you need:
# For PDF support only
pip install PyMuPDF
# For DOCX support only
pip install python-docx
# For Windows file ownership
pip install pywin32
Analyze all files in a directory:
python file_triage.py -i /path/to/input -o /path/to/output
Analyze only specific file types:
python file_triage.py -i /path/to/input -o /path/to/output -e .pdf .docx .txt
python file_triage.py -i "C:\Documents\Project" -o "C:\Reports" -e .pdf .docx .xlsx
- -i, --input-folder (required): Input folder to scan recursively
- -o, --output-folder (required): Output folder for generated reports
- -e, --extensions (optional): File extensions to process (space-separated)
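A command-line interface with these three options could be declared with `argparse` roughly as shown below. This is a sketch of the documented interface only; the help texts are copied from this README, and the real script's defaults and validation may differ.

```python
import argparse


def build_parser():
    """Build a parser for the -i, -o and -e options described above."""
    parser = argparse.ArgumentParser(
        description="Scan a folder recursively and report file metadata.")
    parser.add_argument("-i", "--input-folder", required=True,
                        help="Input folder to scan recursively")
    parser.add_argument("-o", "--output-folder", required=True,
                        help="Output folder for generated reports")
    # nargs="*" collects space-separated values such as: -e .pdf .docx .txt
    parser.add_argument("-e", "--extensions", nargs="*", default=None,
                        help="File extensions to process (space-separated)")
    return parser
```

In this sketch, omitting `-e` leaves `extensions` as `None`, which a caller could treat as "process all files".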
The script generates two timestamped JSON files in the output folder:
Contains complete information for each file:
{
  "metadata": {
    "run_date": "2025-10-11T10:30:00",
    "input_folder": "C:\\Documents\\Project",
    "total_files": 150,
    "extensions_filter": [".pdf", ".docx"]
  },
  "files": [
    {
      "filename": "report.pdf",
      "folder_path": "C:\\Documents\\Project\\Reports",
      "file_extension": ".pdf",
      "size_mb": 2.5,
      "create_date": "2025-09-15T14:30:00",
      "modified_date": "2025-10-01T09:15:00",
      "author": "DOMAIN\\username",
      "page_count": 45,
      "word_count": 8500
    }
  ],
  "errors": []
}
Contains aggregated statistics:
{
  "metadata": {
    "run_date": "2025-10-11T10:30:00",
    "input_folder": "C:\\Documents\\Project",
    "extensions_filter": [".pdf", ".docx"]
  },
  "overall_summary": {
    "total_files": 150,
    "total_size_mb": 450.75,
    "total_pages": 3250,
    "total_words": 125000,
    "files_with_page_count": 120,
    "files_with_word_count": 145
  },
  "by_extension": {
    ".pdf": {
      "count": 80,
      "total_size_mb": 320.5
    },
    ".docx": {
      "count": 70,
      "total_size_mb": 130.25
    }
  },
  "by_folder": {
    "C:\\Documents\\Project\\Reports": {
      "file_count": 45,
      "total_size_mb": 125.5,
      "total_pages": 980,
      "total_words": 42000
    }
  }
}
- PDF files (requires PyMuPDF)
- DOCX files (requires python-docx)
- TXT files (estimated based on line count)
- ODT files (requires odfpy)
- TXT files
- PDF files (requires PyMuPDF)
- DOCX files (requires python-docx)
- Markdown files (.md)
- RTF files
- ODT files (requires odfpy)
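For plain-text files, both counts can be computed without extra libraries. The sketch below counts whitespace-separated words and estimates pages from the line count, as noted above for TXT files. The figure of 50 lines per page is an assumption made for illustration; the README does not state the ratio the script uses.

```python
import math

LINES_PER_PAGE = 50  # assumed value for illustration, not taken from the script


def text_counts(text):
    """Return a word count and a line-based page estimate for plain text."""
    words = len(text.split())
    lines = len(text.splitlines())
    # Round up so a partially filled last page still counts as a page.
    pages = math.ceil(lines / LINES_PER_PAGE)
    return {"word_count": words, "page_count": pages}
```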
- Windows: Extracts file owner from Windows security descriptors (requires pywin32)
- Unix/Linux: Extracts owner from file system metadata
- Note: For document-embedded author information, additional libraries would be needed
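The platform split described above might look like the following sketch. It is not the script's actual code: `file_owner` is a hypothetical name, the Windows branch uses the pywin32 `win32security` module, and the Unix branch uses the standard `pwd` module. Returning "Unknown" when the owner cannot be resolved is an assumption based on the troubleshooting notes.

```python
import os
import sys


def file_owner(path):
    """Return the file system owner of `path`, or "Unknown" if unavailable."""
    if sys.platform == "win32":
        try:
            import win32security  # provided by pywin32
        except ImportError:
            return "Unknown"
        descriptor = win32security.GetFileSecurity(
            str(path), win32security.OWNER_SECURITY_INFORMATION)
        sid = descriptor.GetSecurityDescriptorOwner()
        name, domain, _account_type = win32security.LookupAccountSid(None, sid)
        return f"{domain}\\{name}"
    import pwd  # Unix-only standard library module
    try:
        return pwd.getpwuid(os.stat(path).st_uid).pw_name
    except KeyError:
        # The numeric owner id has no entry in the user database.
        return "Unknown"
```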
The script includes comprehensive error handling:
- Files that cannot be read are logged in the errors section
- Processing continues even if individual files fail
- Errors are reported in the detailed report
- Progress updates are displayed during scanning
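The continue-on-failure behaviour described above is commonly implemented by wrapping each file in its own `try` block. A minimal sketch, where `extract` stands for any per-file metadata function and the shape of each error entry is hypothetical (the README only shows an empty `errors` list):

```python
def process_all(paths, extract):
    """Apply `extract` to every path, collecting failures instead of stopping."""
    files, errors = [], []
    for path in paths:
        try:
            files.append(extract(path))
        except Exception as exc:  # one bad file must not abort the whole run
            errors.append({"file": str(path),
                           "error": f"{type(exc).__name__}: {exc}"})
    return {"files": files, "errors": errors}
```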
- Large directories may take time to process
- PDF and DOCX processing is slower than plain text
- Progress is displayed every 100 files
- Consider using extension filters for faster processing
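Progress reporting at a fixed interval can be added to any scan loop with a small generator wrapper. This is an illustrative sketch; the name `with_progress` and the message wording are assumptions, and only the interval of 100 files comes from the notes above.

```python
def with_progress(paths, every=100, report=print):
    """Yield each path unchanged, reporting after every `every` items."""
    count = 0
    for count, path in enumerate(paths, start=1):
        yield path
        # This runs when the consumer asks for the next item, that is,
        # after the current file has been handled.
        if count % every == 0:
            report(f"Processed {count} files...")
    report(f"Done: {count} files processed.")
```

A loop such as `for path in with_progress(all_paths): ...` then prints one line per 100 files and a final total.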
python file_triage.py -i "C:\Documents" -o "C:\Reports" -e .pdf .docx .doc
python file_triage.py -i "/home/user/projects/myapp" -o "/home/user/reports"
python file_triage.py -i "./data" -o "./output" -e .txt .md .csv .log
- Page count for DOCX files is estimated based on character count
- Author information is limited to file system owner (not document metadata)
- Some file formats may not support page/word count extraction
- Very large files may cause memory issues
- Encrypted or password-protected files cannot be analyzed
Issue: "Module not found" error
- Solution: Install missing dependencies with pip install -r requirements.txt
Issue: Page counts showing as null
- Solution: Install the relevant library (PyMuPDF for PDFs, python-docx for DOCX files)
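To see which optional libraries are missing before a run, the import names can be probed without importing them. The sketch below assumes the usual import names for these packages (`fitz` for PyMuPDF, `docx` for python-docx, `odf` for odfpy, `win32security` for pywin32); the function name is hypothetical.

```python
import importlib.util

# Maps the pip package name to the module name it is imported as.
OPTIONAL_MODULES = {
    "PyMuPDF": "fitz",
    "python-docx": "docx",
    "odfpy": "odf",
    "pywin32": "win32security",
}


def missing_optional_packages(modules=OPTIONAL_MODULES):
    """Return the pip names of optional packages that are not installed."""
    return [package for package, module in modules.items()
            if importlib.util.find_spec(module) is None]
```

An empty list means every optional feature should be available; on non-Windows systems pywin32 is expected to appear in the result.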
Issue: Author showing as "Unknown"
- Solution: On Windows, install pywin32 with pip install pywin32
Issue: Permission denied errors
- Solution: Run the script with appropriate permissions or exclude protected directories
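If a scan should skip protected directories rather than fail on them, `os.walk` accepts an `onerror` callback that receives the `OSError` for each directory it cannot list. The following is a sketch of that approach, not a description of how the scripts currently behave; `walk_skipping_errors` is a hypothetical name.

```python
import os


def walk_skipping_errors(root):
    """List files under root, recording directories that could not be read."""
    skipped = []

    def on_error(error):
        # os.walk passes the OSError; its filename is the unreadable directory.
        skipped.append(error.filename)

    files = []
    for dirpath, _dirnames, filenames in os.walk(root, onerror=on_error):
        files.extend(os.path.join(dirpath, name) for name in filenames)
    return files, skipped
```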
This script is provided as-is for file analysis and reporting purposes.