File Triage Scripts

Two Python scripts for recursively analyzing directories and generating comprehensive file metadata reports.

Scripts Overview

1. scan_directory.py (Simple & Fast)

Best for: Quick scans with both JSON and CSV output

Lightweight and easy to use
Outputs both JSON and CSV formats
PDF-focused metadata extraction
Date-time stamped output files
Simple command-line interface

2. file_triage.py (Advanced & Comprehensive)

Best for: In-depth analysis with detailed statistics

Supports more file types (PDF, DOCX, ODT, TXT, MD, RTF)
Advanced author/owner extraction
Folder-level statistics
Comprehensive error logging
Modified date tracking

scan_directory.py

Features

Recursively scans directories for specified file extensions
Extracts file metadata:
- Filename and folder path
- File extension and size (MB)
- Creation date
- Page count (PDF only)
- Word count (PDF only)
- Author (PDF only)
Generates four output files per run:
- Detailed log (JSON and CSV)
- Summary log (JSON and CSV)
All files include date-time stamps
Filter by file extension
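The recursive scan and basic metadata collection described above could be sketched as follows. This is a minimal illustration, not the code in scan_directory.py: the function name `scan_folder`, the dot-free lowercase extension form, and the 1 MB = 1024 * 1024 bytes conversion are all assumptions. PDF-specific fields (page count, word count, author) are left out.

```python
import os
from datetime import datetime


def scan_folder(input_folder, extensions):
    """Walk input_folder and return one metadata dict per matching file."""
    # Normalise inputs such as "pdf" or ".PDF" to a lowercase, dot-free form.
    wanted = {ext.lower().lstrip(".") for ext in extensions}
    records = []
    for folder, _subdirs, filenames in os.walk(input_folder):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower().lstrip(".")
            if ext not in wanted:
                continue
            info = os.stat(os.path.join(folder, name))
            records.append({
                "filename": name,
                "folder_path": folder,
                "file_extension": ext,
                # Assumption: 1 MB is treated as 1024 * 1024 bytes.
                "size_in_MBs": round(info.st_size / (1024 * 1024), 4),
                # st_ctime is the creation time on Windows, but the last
                # metadata change time on most Unix systems.
                "create_date": datetime.fromtimestamp(info.st_ctime).isoformat(),
            })
    return records
```

Note that `os.stat` does not expose a true creation time on every platform, so the "creation date" may differ from what a file manager shows on Linux.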

Usage

Basic Syntax

python scan_directory.py <input_folder> <output_folder> <extensions>

Arguments

input_folder - Path to the directory to scan
output_folder - Path where output files will be saved
extensions - Space-separated list of file extensions (e.g., pdf docx txt)

Examples

# Scan for PDF files only
python scan_directory.py ./documents ./output_logs pdf

# Scan for multiple file types
python scan_directory.py ./documents ./output_logs pdf docx txt

# Windows example
python scan_directory.py "C:\Documents\Project Files" "C:\Reports" pdf docx

Output Files

All files include matching date-time stamps (YYYYMMDD_HHMMSS):

detailed_log_YYYYMMDD_HHMMSS.json - JSON array with all file details
detailed_log_YYYYMMDD_HHMMSS.csv - CSV table with file details
summary_log_YYYYMMDD_HHMMSS.json - Aggregated statistics
summary_log_YYYYMMDD_HHMMSS.csv - Flattened summary with sections
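One way to produce four file names that share a single stamp is to format the time once and reuse it. The sketch below assumes a hypothetical helper `output_paths`; only the file name patterns come from the list above.

```python
from datetime import datetime
from pathlib import Path


def output_paths(output_folder, now=None):
    """Return the four output paths for one run, all with the same stamp."""
    # Format the time once so every file in the run carries a matching stamp.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    folder = Path(output_folder)
    return {
        f"{kind}_{fmt}": folder / f"{kind}_log_{stamp}.{fmt}"
        for kind in ("detailed", "summary")
        for fmt in ("json", "csv")
    }


paths = output_paths("./output_logs", datetime(2025, 10, 11, 10, 30, 0))
print(paths["detailed_json"].name)  # detailed_log_20251011_103000.json
```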

Detailed Log Fields

filename
folder_path
file_extension
size_in_MBs
create_date
page_count
word_count
author
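Writing the same records to both JSON and CSV can be done with the standard library alone. The sketch below uses the field names listed above; the function name `write_detailed_log` and the choice to leave missing values blank in the CSV are assumptions.

```python
import csv
import json

FIELDS = ["filename", "folder_path", "file_extension", "size_in_MBs",
          "create_date", "page_count", "word_count", "author"]


def write_detailed_log(records, json_path, csv_path):
    """Write a list of record dicts as a JSON array and as a CSV table."""
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
    # newline="" prevents blank rows on Windows; restval fills fields that a
    # record does not have (for example page_count on a non-PDF file).
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, restval="")
        writer.writeheader()
        writer.writerows(records)
```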

Summary Log Contents

Log run date
Input folder path
Total files, size, pages, and words
File counts by extension
File sizes by extension
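The totals and per-extension figures can be derived from the detailed records in a single pass. This is a sketch only: the function name `summarize` and the key names in the returned dict are hypothetical, and the actual summary log layout may differ.

```python
from collections import defaultdict


def summarize(records):
    """Aggregate totals and per-extension counts and sizes."""
    counts = defaultdict(int)
    sizes = defaultdict(float)
    for rec in records:
        ext = rec["file_extension"]
        counts[ext] += 1
        sizes[ext] += rec["size_in_MBs"]
    return {
        "total_files": len(records),
        "total_size_MBs": round(sum(sizes.values()), 4),
        # Non-PDF files have no page or word count; treat those as zero.
        "total_pages": sum(rec.get("page_count") or 0 for rec in records),
        "total_words": sum(rec.get("word_count") or 0 for rec in records),
        "files_by_extension": dict(counts),
        "size_by_extension": {k: round(v, 4) for k, v in sizes.items()},
    }
```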

Requirements

pip install PyMuPDF

file_triage.py

Features

Recursively scans directories and subdirectories
Extracts detailed metadata for each file:
- Filename and path
- File extension
- Size in MB
- Page count (for PDF, DOCX, TXT, ODT files)
- Word count (for text-based files)
- Creation and modified dates
- Author/Owner information
Generates two timestamped JSON reports:
1. Detailed Report: Complete list of all files with full metadata
2. Summary Report: Aggregated statistics by extension and folder
Filter files by extension
Comprehensive error handling and logging

Requirements & Installation

For scan_directory.py

pip install PyMuPDF

For file_triage.py

pip install -r requirements.txt

Optional dependencies for file_triage.py:

PyMuPDF - PDF page and word counts (fast and efficient)
python-docx - DOCX page and word counts
odfpy - ODT file support
pywin32 - Windows file ownership (Windows only)

Install only what you need:

# For PDF support only
pip install PyMuPDF

# For DOCX support only
pip install python-docx

# For Windows file ownership
pip install pywin32
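To check which optional dependencies are present before a run, something like the following could be used. The helper name `available_features` is hypothetical; the import names (`fitz` for PyMuPDF, `docx` for python-docx, `odf` for odfpy, `win32security` for pywin32) are the modules those packages install.

```python
from importlib.util import find_spec

# Maps the pip package name to the module it provides.
OPTIONAL_MODULES = {
    "PyMuPDF": "fitz",
    "python-docx": "docx",
    "odfpy": "odf",
    "pywin32": "win32security",
}


def available_features():
    """Report which optional packages can be imported, without importing them."""
    return {pkg: find_spec(mod) is not None
            for pkg, mod in OPTIONAL_MODULES.items()}


for package, present in available_features().items():
    print(f"{package}: {'installed' if present else 'missing'}")
```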

Usage

Basic Usage

Analyze all files in a directory:

python file_triage.py -i /path/to/input -o /path/to/output

Filter by File Extensions

Analyze only specific file types:

python file_triage.py -i /path/to/input -o /path/to/output -e .pdf .docx .txt

Windows Example

python file_triage.py -i "C:\Documents\Project" -o "C:\Reports" -e .pdf .docx .xlsx

Command-Line Arguments

-i, --input-folder (required): Input folder to scan recursively
-o, --output-folder (required): Output folder for generated reports
-e, --extensions (optional): File extensions to process (space-separated)
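An `argparse` setup matching these three arguments might look like the sketch below. The flag names come from the list above; the function name `build_parser`, the help strings, and the `None` default for "no filter" are assumptions.

```python
import argparse


def build_parser():
    """Build a parser with the -i, -o and -e options described above."""
    parser = argparse.ArgumentParser(
        description="Recursively analyze files in a folder.")
    parser.add_argument("-i", "--input-folder", required=True,
                        help="Input folder to scan recursively")
    parser.add_argument("-o", "--output-folder", required=True,
                        help="Output folder for generated reports")
    # nargs="*" collects space-separated values; None means no filter.
    parser.add_argument("-e", "--extensions", nargs="*", default=None,
                        help="File extensions to process, e.g. .pdf .docx")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["-i", "./data", "-o", "./output", "-e", ".pdf", ".docx"])
print(args.input_folder, args.output_folder, args.extensions)
```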

Output Files

The script generates two timestamped JSON files in the output folder:

1. Detailed Report (`detailed_report_YYYYMMDD_HHMMSS.json`)

Contains complete information for each file:

{
  "metadata": {
    "run_date": "2025-10-11T10:30:00",
    "input_folder": "C:\\Documents\\Project",
    "total_files": 150,
    "extensions_filter": [".pdf", ".docx"]
  },
  "files": [
    {
      "filename": "report.pdf",
      "folder_path": "C:\\Documents\\Project\\Reports",
      "file_extension": ".pdf",
      "size_mb": 2.5,
      "create_date": "2025-09-15T14:30:00",
      "modified_date": "2025-10-01T09:15:00",
      "author": "DOMAIN\\username",
      "page_count": 45,
      "word_count": 8500
    }
  ],
  "errors": []
}

2. Summary Report (`summary_report_YYYYMMDD_HHMMSS.json`)

Contains aggregated statistics:

{
  "metadata": {
    "run_date": "2025-10-11T10:30:00",
    "input_folder": "C:\\Documents\\Project",
    "extensions_filter": [".pdf", ".docx"]
  },
  "overall_summary": {
    "total_files": 150,
    "total_size_mb": 450.75,
    "total_pages": 3250,
    "total_words": 125000,
    "files_with_page_count": 120,
    "files_with_word_count": 145
  },
  "by_extension": {
    ".pdf": {
      "count": 80,
      "total_size_mb": 320.5
    },
    ".docx": {
      "count": 70,
      "total_size_mb": 130.25
    }
  },
  "by_folder": {
    "C:\\Documents\\Project\\Reports": {
      "file_count": 45,
      "total_size_mb": 125.5,
      "total_pages": 980,
      "total_words": 42000
    }
  }
}
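The `by_folder` section can be recomputed from the `files` array of a detailed report, which is handy for cross-checking the two outputs. The sketch below assumes the field names shown in the samples above; `summarize_by_folder` is a hypothetical helper, not part of file_triage.py.

```python
def summarize_by_folder(files):
    """Group file entries by folder_path and total their size, pages, words."""
    by_folder = {}
    for entry in files:
        stats = by_folder.setdefault(entry["folder_path"], {
            "file_count": 0,
            "total_size_mb": 0.0,
            "total_pages": 0,
            "total_words": 0,
        })
        stats["file_count"] += 1
        stats["total_size_mb"] += entry["size_mb"]
        # page_count and word_count may be null (None) for some file types.
        stats["total_pages"] += entry.get("page_count") or 0
        stats["total_words"] += entry.get("word_count") or 0
    return by_folder
```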

Supported File Types

Page Count Extraction

PDF files (requires PyMuPDF)
DOCX files (requires python-docx)
TXT files (estimated based on line count)
ODT files (requires odfpy)
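As an illustration of the line-count estimate for TXT files, the sketch below divides the number of lines by a fixed page length. The value of 50 lines per page and the function name `estimate_txt_pages` are assumptions; the script may use a different constant.

```python
import math

LINES_PER_PAGE = 50  # Assumption: the script's actual value is not documented.


def estimate_txt_pages(path, lines_per_page=LINES_PER_PAGE):
    """Estimate the page count of a text file from its number of lines."""
    # errors="replace" keeps the count going on files with odd encodings.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        line_count = sum(1 for _ in f)
    # Round up, and report at least one page even for an empty file.
    return max(1, math.ceil(line_count / lines_per_page))
```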

Word Count Extraction

TXT files
PDF files (requires PyMuPDF)
DOCX files (requires python-docx)
Markdown files (.md)
RTF files
ODT files (requires odfpy)
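For plain-text formats such as TXT and Markdown, a whitespace split is a common way to count words. The sketch below shows that approach under the assumption that the file is readable as UTF-8; `count_words` is a hypothetical name. Applied to raw RTF it would also count control words, so RTF needs its markup stripped first.

```python
def count_words(path):
    """Count whitespace-separated words, reading the file line by line."""
    total = 0
    # Iterating by line avoids loading a very large file into memory at once.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            total += len(line.split())
    return total
```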

Author/Owner Information

Windows: Extracts file owner from Windows security descriptors (requires pywin32)
Unix/Linux: Extracts owner from file system metadata
Note: Author metadata embedded in the document itself is not read; extracting it would require additional libraries
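On Unix-like systems the file owner can be looked up from the numeric user ID stored with the file. The sketch below is one way to do that with the standard library; the function name `get_owner` and the "Unknown" fallback on platforms without the `pwd` module are assumptions, and the Windows path via pywin32 is not shown.

```python
import os

try:
    import pwd  # Only available on Unix-like systems.
except ImportError:
    pwd = None


def get_owner(path):
    """Return the owning user name of a file, or a fallback string."""
    if pwd is None:
        return "Unknown"
    uid = os.stat(path).st_uid
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # The uid has no entry in the user database (common in containers).
        return str(uid)
```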

Error Handling

The script includes comprehensive error handling:

Files that cannot be read are logged in the errors section of the detailed report
Processing continues even if individual files fail
Progress updates are displayed during scanning
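The continue-on-failure behaviour can be pictured as a loop that catches exceptions per file and records them instead of stopping. This is a sketch under assumptions: `process_all`, the shape of each error entry, and the `progress_every` parameter are hypothetical, since the samples above only show an empty `errors` list.

```python
def process_all(paths, process_file, progress_every=100):
    """Apply process_file to each path, collecting failures instead of raising."""
    results, errors = [], []
    for index, path in enumerate(paths, start=1):
        try:
            results.append(process_file(path))
        except Exception as exc:
            # Record the failure and move on to the next file.
            errors.append({"file": str(path),
                           "error": f"{type(exc).__name__}: {exc}"})
        if index % progress_every == 0:
            print(f"Processed {index} files...")
    return results, errors
```

Catching `Exception` broadly is a deliberate choice here: one unreadable or corrupt file should not abort a scan of thousands.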

Performance Notes

Large directories may take time to process
PDF and DOCX processing is slower than plain text
Progress is displayed every 100 files
Consider using extension filters for faster processing

Examples

Example 1: Analyze all PDF and Word documents

python file_triage.py -i "C:\Documents" -o "C:\Reports" -e .pdf .docx .doc

Example 2: Analyze all files in a project directory

python file_triage.py -i "/home/user/projects/myapp" -o "/home/user/reports"

Example 3: Analyze only text-based files

python file_triage.py -i "./data" -o "./output" -e .txt .md .csv .log

Limitations

Page count for DOCX files is estimated based on character count
Author information is limited to file system owner (not document metadata)
Some file formats may not support page/word count extraction
Very large files may cause memory issues
Encrypted or password-protected files cannot be analyzed

Troubleshooting

Issue: "Module not found" error

Solution: Install missing dependencies with pip install -r requirements.txt

Issue: Page counts showing as null

Solution: Install the relevant library (PyMuPDF for PDF, python-docx for DOCX, odfpy for ODT files)

Issue: Author showing as "Unknown"

Solution: On Windows, install pywin32: pip install pywin32

Issue: Permission denied errors

Solution: Run the script with appropriate permissions or exclude protected directories

License

These scripts are provided as-is for file analysis and reporting purposes.
