artnoricojr/scan-directory

File Triage Scripts

Two Python scripts for recursively analyzing directories and generating comprehensive file metadata reports.

Scripts Overview

1. scan_directory.py (Simple & Fast)

Best for: Quick scans with both JSON and CSV output

  • Lightweight and easy to use
  • Outputs both JSON and CSV formats
  • PDF-focused metadata extraction
  • Date-time stamped output files
  • Simple command-line interface

2. file_triage.py (Advanced & Comprehensive)

Best for: In-depth analysis with detailed statistics

  • Supports more file types (PDF, DOCX, ODT, TXT, MD, RTF)
  • Author/owner extraction (file system owner; see Limitations)
  • Folder-level statistics
  • Comprehensive error logging
  • Modified date tracking

scan_directory.py

Features

  • Recursively scans directories for specified file extensions
  • Extracts file metadata:
    • Filename and folder path
    • File extension and size (MB)
    • Creation date
    • Page count (PDF only)
    • Word count (PDF only)
    • Author (PDF only)
  • Generates 4 output files per run:
    • Detailed log (JSON and CSV)
    • Summary log (JSON and CSV)
  • All files include date-time stamps
  • Filter by file extension
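
The recursive scan with an extension filter can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name find_files is hypothetical, and normalizing "pdf", ".pdf", and "PDF" to the same form is an assumption about how the filter behaves.

```python
import os

def find_files(input_folder, extensions):
    """Yield paths under input_folder whose extension is in extensions."""
    # Assumption: treat "pdf", ".pdf" and "PDF" as the same extension.
    wanted = {"." + ext.lower().lstrip(".") for ext in extensions}
    # os.walk visits input_folder and every subdirectory below it.
    for folder, _subdirs, filenames in os.walk(input_folder):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in wanted:
                yield os.path.join(folder, name)
```

For example, list(find_files("./documents", ["pdf", "docx"])) would collect every matching file below ./documents.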

Usage

Basic Syntax

python scan_directory.py <input_folder> <output_folder> <extensions>

Arguments

  • input_folder - Path to the directory to scan
  • output_folder - Path where output files will be saved
  • extensions - Space-separated list of file extensions (e.g., pdf docx txt); the examples below write them without a leading dot

Examples

# Scan for PDF files only
python scan_directory.py ./documents ./output_logs pdf

# Scan for multiple file types
python scan_directory.py ./documents ./output_logs pdf docx txt

# Windows example
python scan_directory.py "C:\Documents\Project Files" "C:\Reports" pdf docx

Output Files

All files include matching date-time stamps (YYYYMMDD_HHMMSS):

  1. detailed_log_YYYYMMDD_HHMMSS.json - JSON array with all file details
  2. detailed_log_YYYYMMDD_HHMMSS.csv - CSV table with file details
  3. summary_log_YYYYMMDD_HHMMSS.json - Aggregated statistics
  4. summary_log_YYYYMMDD_HHMMSS.csv - Flattened summary with sections

Detailed Log Fields

  • filename
  • folder_path
  • file_extension
  • size_in_MBs
  • create_date
  • page_count
  • word_count
  • author
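
Writing the same records to both JSON and CSV might look like the sketch below. The field names are the ones listed above; the function name write_detailed_log and the choice of csv.DictWriter are assumptions made for illustration.

```python
import csv
import json

# Column order follows the field list in this README.
FIELDS = ["filename", "folder_path", "file_extension", "size_in_MBs",
          "create_date", "page_count", "word_count", "author"]

def write_detailed_log(records, json_path, csv_path):
    """Write a list of per-file dicts as a JSON array and as a CSV table."""
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
    # newline="" lets the csv module control line endings on every platform.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)  # None values become empty cells
```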

Summary Log Contents

  • Log run date
  • Input folder path
  • Total files, size, pages, and words
  • File counts by extension
  • File sizes by extension
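
The totals and per-extension figures can be derived from the detailed records in a single pass. The sketch below is illustrative only: the function name and the output key names are hypothetical, since the README does not show the summary log's exact schema.

```python
from collections import defaultdict

def summarize(records):
    """Aggregate detailed-log records into totals and per-extension figures."""
    counts = defaultdict(int)
    sizes = defaultdict(float)
    for rec in records:
        ext = rec["file_extension"]
        counts[ext] += 1
        sizes[ext] += rec["size_in_MBs"]
    return {
        "total_files": len(records),
        "total_size_in_MBs": round(sum(sizes.values()), 2),
        # Non-PDF files carry no page or word count, so treat None as 0.
        "total_pages": sum(rec.get("page_count") or 0 for rec in records),
        "total_words": sum(rec.get("word_count") or 0 for rec in records),
        "file_counts_by_extension": dict(counts),
        "file_sizes_by_extension": {e: round(s, 2) for e, s in sizes.items()},
    }
```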

Requirements

pip install PyMuPDF
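
For orientation, the PDF fields could be read with PyMuPDF roughly as shown below. This is a sketch under assumptions: pdf_metadata is a hypothetical name, and returning None when the library is missing or the file cannot be opened is a design choice made here, not confirmed behavior of scan_directory.py.

```python
try:
    import fitz  # PyMuPDF is imported under the name "fitz"
except ImportError:
    fitz = None

def pdf_metadata(path):
    """Return page count, word count and author for a PDF, or None values."""
    result = {"page_count": None, "word_count": None, "author": None}
    if fitz is None:
        return result  # library not installed: leave every field empty
    try:
        with fitz.open(path) as doc:
            result["page_count"] = doc.page_count
            # get_text("words") returns one tuple per word on the page.
            result["word_count"] = sum(len(page.get_text("words")) for page in doc)
            result["author"] = (doc.metadata or {}).get("author") or None
    except Exception:
        pass  # unreadable or missing file: keep whatever was collected
    return result
```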

file_triage.py

Features

  • Recursively scans directories and subdirectories
  • Extracts detailed metadata for each file:
    • Filename and path
    • File extension
    • Size in MB
    • Page count (for PDF, DOCX, TXT, ODT files; estimated for DOCX and TXT)
    • Word count (for text-based files)
    • Creation and modified dates
    • Author/Owner information
  • Generates two timestamped JSON reports:
    1. Detailed Report: Complete list of all files with full metadata
    2. Summary Report: Aggregated statistics by extension and folder
  • Filter files by extension
  • Comprehensive error handling and logging
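
The file-system part of each record can be gathered with os.stat alone. The sketch below uses the field names from the detailed report; the function name file_record and the 1024 * 1024 bytes-per-MB convention are assumptions.

```python
import os
from datetime import datetime

def file_record(path):
    """Collect the file-system metadata for one file."""
    info = os.stat(path)

    def iso(timestamp):
        return datetime.fromtimestamp(timestamp).isoformat(timespec="seconds")

    return {
        "filename": os.path.basename(path),
        "folder_path": os.path.dirname(os.path.abspath(path)),
        "file_extension": os.path.splitext(path)[1].lower(),
        # Assumption: 1 MB = 1024 * 1024 bytes.
        "size_mb": round(info.st_size / (1024 * 1024), 2),
        # On Unix, st_ctime is the last metadata change, not the creation
        # time; on Windows it has traditionally held the creation time.
        "create_date": iso(info.st_ctime),
        "modified_date": iso(info.st_mtime),
    }
```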

Requirements & Installation

For scan_directory.py

pip install PyMuPDF

For file_triage.py

pip install -r requirements.txt

Optional dependencies for file_triage.py:

  • PyMuPDF - PDF page and word counts (fast and efficient)
  • python-docx - DOCX page and word counts
  • odfpy - ODT file support
  • pywin32 - Windows file ownership (Windows only)

Install only what you need:

# For PDF support only
pip install PyMuPDF

# For DOCX support only
pip install python-docx

# For ODT support only
pip install odfpy

# For Windows file ownership
pip install pywin32

Usage

Basic Usage

Analyze all files in a directory:

python file_triage.py -i /path/to/input -o /path/to/output

Filter by File Extensions

Analyze only specific file types:

python file_triage.py -i /path/to/input -o /path/to/output -e .pdf .docx .txt

Windows Example

python file_triage.py -i "C:\Documents\Project" -o "C:\Reports" -e .pdf .docx .xlsx

Command-Line Arguments

  • -i, --input-folder (required): Input folder to scan recursively
  • -o, --output-folder (required): Output folder for generated reports
  • -e, --extensions (optional): File extensions to process (space-separated)
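
These three options map directly onto argparse. The sketch below shows one way to declare them; build_parser is a hypothetical name, and the help strings and the None default for an omitted -e are assumptions.

```python
import argparse

def build_parser():
    """Declare the -i, -o and -e options described above."""
    parser = argparse.ArgumentParser(
        description="Scan a folder recursively and report file metadata.")
    parser.add_argument("-i", "--input-folder", required=True,
                        help="Input folder to scan recursively")
    parser.add_argument("-o", "--output-folder", required=True,
                        help="Output folder for generated reports")
    # nargs="*" collects every space-separated value after -e into a list.
    parser.add_argument("-e", "--extensions", nargs="*", default=None,
                        help="File extensions to process, e.g. .pdf .docx")
    return parser
```

With this parser, args.extensions is None when -e is omitted, which a caller can read as "process all files".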

Output Files

The script generates two timestamped JSON files in the output folder:

1. Detailed Report (detailed_report_YYYYMMDD_HHMMSS.json)

Contains complete information for each file:

{
  "metadata": {
    "run_date": "2025-10-11T10:30:00",
    "input_folder": "C:\\Documents\\Project",
    "total_files": 150,
    "extensions_filter": [".pdf", ".docx"]
  },
  "files": [
    {
      "filename": "report.pdf",
      "folder_path": "C:\\Documents\\Project\\Reports",
      "file_extension": ".pdf",
      "size_mb": 2.5,
      "create_date": "2025-09-15T14:30:00",
      "modified_date": "2025-10-01T09:15:00",
      "author": "DOMAIN\\username",
      "page_count": 45,
      "word_count": 8500
    }
  ],
  "errors": []
}

2. Summary Report (summary_report_YYYYMMDD_HHMMSS.json)

Contains aggregated statistics:

{
  "metadata": {
    "run_date": "2025-10-11T10:30:00",
    "input_folder": "C:\\Documents\\Project",
    "extensions_filter": [".pdf", ".docx"]
  },
  "overall_summary": {
    "total_files": 150,
    "total_size_mb": 450.75,
    "total_pages": 3250,
    "total_words": 125000,
    "files_with_page_count": 120,
    "files_with_word_count": 145
  },
  "by_extension": {
    ".pdf": {
      "count": 80,
      "total_size_mb": 320.5
    },
    ".docx": {
      "count": 70,
      "total_size_mb": 130.25
    }
  },
  "by_folder": {
    "C:\\Documents\\Project\\Reports": {
      "file_count": 45,
      "total_size_mb": 125.5,
      "total_pages": 980,
      "total_words": 42000
    }
  }
}
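
The overall_summary and by_folder sections can be rebuilt from the "files" list of a detailed report, which is useful for checking a report or re-aggregating it. The sketch below uses the key names from the JSON above; the function name build_summary is hypothetical and the logic is an assumption about how the totals are formed.

```python
from collections import defaultdict

def build_summary(files):
    """Roll the detailed report's "files" list up into summary totals."""
    overall = {"total_files": len(files), "total_size_mb": 0.0,
               "total_pages": 0, "total_words": 0,
               "files_with_page_count": 0, "files_with_word_count": 0}
    by_folder = defaultdict(lambda: {"file_count": 0, "total_size_mb": 0.0,
                                     "total_pages": 0, "total_words": 0})
    for entry in files:
        folder = by_folder[entry["folder_path"]]
        folder["file_count"] += 1
        # Add each file to both the overall totals and its folder's totals.
        for totals in (overall, folder):
            totals["total_size_mb"] += entry["size_mb"]
            totals["total_pages"] += entry.get("page_count") or 0
            totals["total_words"] += entry.get("word_count") or 0
        # A null count means extraction was not possible for that file.
        overall["files_with_page_count"] += entry.get("page_count") is not None
        overall["files_with_word_count"] += entry.get("word_count") is not None
    return {"overall_summary": overall, "by_folder": dict(by_folder)}
```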

Supported File Types

Page Count Extraction

  • PDF files (requires PyMuPDF)
  • DOCX files (requires python-docx)
  • TXT files (estimated based on line count)
  • ODT files (requires odfpy)
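
A line-count estimate for TXT files could work as sketched below. The 50 lines-per-page figure and the function name are hypothetical; the README does not state which value the script uses.

```python
import math

LINES_PER_PAGE = 50  # hypothetical value, not taken from the script

def estimate_txt_pages(path, lines_per_page=LINES_PER_PAGE):
    """Estimate a page count for a plain-text file from its line count."""
    # errors="replace" keeps the count going on files with odd encodings.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        line_count = sum(1 for _ in f)
    # Round up, and report at least one page even for an empty file.
    return max(1, math.ceil(line_count / lines_per_page))
```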

Word Count Extraction

  • TXT files
  • PDF files (requires PyMuPDF)
  • DOCX files (requires python-docx)
  • Markdown files (.md)
  • RTF files
  • ODT files (requires odfpy)
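
For plain-text formats, a whitespace split is the simplest word count. The sketch below assumes that approach; count_words is a hypothetical name. Applied to Markdown or RTF it would also count markup tokens as words.

```python
def count_words(path):
    """Count whitespace-separated words in a text file, line by line."""
    # Reading line by line avoids loading a large file into memory at once.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return sum(len(line.split()) for line in f)
```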

Author/Owner Information

  • Windows: Extracts file owner from Windows security descriptors (requires pywin32)
  • Unix/Linux: Extracts owner from file system metadata
  • Note: For document-embedded author information, additional libraries would be needed
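
The two owner lookups can be sketched as follows. The function name file_owner is hypothetical, and falling back to "Unknown" when the lookup fails is an assumption based on the Troubleshooting section.

```python
import os
import sys

def file_owner(path):
    """Return the file system owner of path, or "Unknown" if unavailable."""
    if sys.platform == "win32":
        try:
            import win32security  # provided by pywin32
            descriptor = win32security.GetFileSecurity(
                path, win32security.OWNER_SECURITY_INFORMATION)
            sid = descriptor.GetSecurityDescriptorOwner()
            name, domain, _account_type = win32security.LookupAccountSid(None, sid)
            return f"{domain}\\{name}"
        except Exception:
            return "Unknown"  # pywin32 missing or lookup failed
    import pwd  # Unix only
    try:
        return pwd.getpwuid(os.stat(path).st_uid).pw_name
    except KeyError:
        return "Unknown"  # the uid has no entry in the password database
```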

Error Handling

The script includes comprehensive error handling:

  • Files that cannot be read are logged in the errors section
  • Processing continues even if individual files fail
  • Errors are reported in the detailed report
  • Progress updates are displayed during scanning
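
The continue-on-failure behavior can be sketched as a loop that records each exception instead of stopping. The names process_all and analyze, and the shape of each error entry, are hypothetical; the 100-file progress interval is the one mentioned under Performance Notes.

```python
def process_all(paths, analyze, progress_every=100):
    """Run analyze on every path, collecting failures instead of stopping."""
    files, errors = [], []
    for index, path in enumerate(paths, start=1):
        try:
            files.append(analyze(path))
        except Exception as exc:
            # Record the failure and carry on with the next file.
            errors.append({"file": path, "error": str(exc)})
        if index % progress_every == 0:
            print(f"Processed {index} files...")
    return {"files": files, "errors": errors}
```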

Performance Notes

  • Large directories may take time to process
  • PDF and DOCX processing is slower than plain text
  • Progress is displayed every 100 files
  • Consider using extension filters for faster processing

Examples

Example 1: Analyze all PDF and Word documents

python file_triage.py -i "C:\Documents" -o "C:\Reports" -e .pdf .docx .doc

Example 2: Analyze all files in a project directory

python file_triage.py -i "/home/user/projects/myapp" -o "/home/user/reports"

Example 3: Analyze only text-based files

python file_triage.py -i "./data" -o "./output" -e .txt .md .csv .log

Limitations

  • Page count for DOCX files is estimated based on character count
  • Author information is limited to file system owner (not document metadata)
  • Some file formats may not support page/word count extraction
  • Very large files may cause memory issues
  • Encrypted or password-protected files cannot be analyzed

Troubleshooting

Issue: "Module not found" error

  • Solution: Install missing dependencies with pip install -r requirements.txt

Issue: Page counts showing as null

  • Solution: Install the relevant library (PyMuPDF for PDF files, python-docx for DOCX files, odfpy for ODT files)

Issue: Author showing as "Unknown"

  • Solution: On Windows, install pywin32: pip install pywin32

Issue: Permission denied errors

  • Solution: Run the script with appropriate permissions or exclude protected directories

License

These scripts are provided as-is for file analysis and reporting purposes.
