# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a drug risk monitoring and data processing system for detecting controlled substances in text and image data from e-commerce platforms, darknet sources, and social media.

**Core Capabilities:**
1. **Image Recognition**: Batch image analysis using LLM APIs (OpenAI, Anthropic, DMX, Dify) for OCR and risk detection
2. **Keyword Matching**: Multi-mode keyword matching with CAS number extraction and exact matching
3. **LLM Verification**: Secondary verification of high-confidence unmatched records using LLM
4. **Data Collection**: Merge and consolidate results from batch processing

## Running Scripts

All scripts must be run from the `scripts/` directory:

```bash
cd scripts/

# Image batch recognition (mock mode for testing)
python3 image_batch_recognizer.py --mock --limit 5

# Image recognition with API
python3 image_batch_recognizer.py --api-type dify --limit 10

# Collect and merge xlsx files from batch output
python3 collect_xlsx.py

# Multi-mode keyword matching (default: cas + exact)
python3 keyword_matcher.py

# Single mode matching
python3 keyword_matcher.py -m cas      # CAS number only
python3 keyword_matcher.py -m exact    # Exact matching only

# Verify high-confidence unmatched records
python3 verify_high_confidence.py -o original.xlsx -m matched.xlsx --mock
```

## Dependencies

**Required:**
```bash
pip install pandas openpyxl
```

**Optional:**
```bash
pip install pyahocorasick  # 5x faster exact matching
pip install requests       # Required for Dify API
pip install tqdm           # Progress bars
pip install openai         # For OpenAI-compatible APIs in verify script
```

## Environment Configuration

Copy `.env.example` to `.env` and configure API keys:

```bash
# Default API type (openai | dmx | dify | ollama)
LLM_API_TYPE="dify"

# DMX API (OpenAI compatible)
DMX_API_KEY="your-key"
DMX_BASE_URL="https://www.dmxapi.cn"
DMX_MODEL="gpt-4o-mini"

# Dify API (used by image_batch_recognizer.py)
DIFY_API_KEY="app-xxx"
DIFY_BASE_URL="https://your-dify-server:4433"
DIFY_USER_ID="default-user"

# Separate config for verify_high_confidence.py (VERIFY_ prefix)
VERIFY_API_TYPE="dmx"
VERIFY_API_KEY="your-key"
VERIFY_BASE_URL="https://api.example.com"
VERIFY_MODEL="gpt-4o-mini"
```

## Data Flow Architecture

```
data/
├── input/                      # Source data
│   ├── clickin_text_img.xlsx   # Text + image paths
│   └── keywords.xlsx           # Keyword database
├── images/                     # Image files for recognition
├── batch_output/               # Per-folder recognition results
│   └── {name}/results.xlsx
├── data_all/                   # Original data by source
│   └── {name}_text_img.xlsx
├── collected_xlsx/             # Merged results (collect_xlsx.py output)
└── output/                     # Final processed results
```

**Processing Pipeline:**
```
1. image_batch_recognizer.py → batch_output/{name}/results.xlsx
2. collect_xlsx.py → Merge results.xlsx with {name}_text_img.xlsx → collected_xlsx/
3. keyword_matcher.py → Match keywords in text → output/keyword_matched_results.xlsx
4. verify_high_confidence.py → LLM verify unmatched high-confidence → *_llm_verified.xlsx
```

## Key Scripts

### keyword_matcher.py

Two detection modes with Strategy Pattern architecture:

1. **CAS Number Recognition (`-m cas`)**
   - Regex pattern: `\b(\d{2,7})[\s\-._]?(\d{2})[\s\-._]?(\d)\b`
   - Supports formats: `123-45-6`, `123 45 6`, `123456`, `123.45.6`
   - Auto-normalizes to standard `XXX-XX-X` format
   - Source column: `CAS号`

2. **Exact Matching (`-m exact`)**
   - Uses Aho-Corasick automaton (if pyahocorasick installed) or regex with word boundaries
   - Source columns: `中文名`, `英文名`, `CAS号`, `简称`, `可能名称`

**Multi-column text matching:**
- Automatically detects and combines `detected_text` and `文本` columns
- Use `-c col1 col2` to specify custom columns

**Class Hierarchy:**
```
KeywordMatcher (ABC)
├── CASRegexMatcher      # Regex CAS extraction + normalization
├── RegexExactMatcher    # Word-boundary exact matching
├── AhoCorasickMatcher   # Fast multi-pattern matching
└── SetMatcher           # Simple substring matching
```

### verify_high_confidence.py

Compares keyword_matcher output with original data to find high-confidence rows that weren't matched, then uses LLM for secondary verification.

- Uses `VERIFY_` prefixed env vars (separate from image_batch_recognizer.py)
- Supports: OpenAI, DMX, Dify, Ollama, Mock modes
- Input columns: `raw_response`, `文本`

### collect_xlsx.py

Merges batch recognition results with original data:
- Matches by image filename (handles both Windows `\` and Unix `/` paths)
- Adds original columns (`文本`, metadata) to recognition results

### image_batch_recognizer.py

Batch image recognition with multiple API backends:
- Supports: OpenAI, Anthropic, DMX, Dify, Mock
- Outputs: `detected_text`, `detected_objects`, `sensitive_items`, `summary`, `confidence`
- Parallel processing with `--max-workers`

## Excel File Schemas

**keywords.xlsx columns:**
- `中文名`, `英文名`, `CAS号`, `简称`, `备注`, `可能名称`
- `可能名称` uses `|||` separator for multiple values

**Recognition output columns:**
- `image_name`, `image_path`, `detected_text`, `detected_objects`
- `sensitive_items`, `summary`, `confidence`, `raw_response`

**Matched output adds:**
- `匹配到的关键词` (matched keywords, ` | ` separated)
- `匹配模式` (e.g., "CAS号识别 + 精确匹配")

## Key Conventions

- Triple pipe `|||` separator in keyword cells (avoids conflicts with chemical names)
- Match result separator: ` | `
- All scripts use relative paths from `scripts/` directory
- Configuration priority: command-line args > VERIFY_* env > general env > defaults
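
The configuration priority rule can be sketched as a small lookup helper. This is a minimal sketch under stated assumptions, not the repository's actual code: the function name `resolve_setting` and its parameters are hypothetical, and pairing `VERIFY_API_TYPE` with `LLM_API_TYPE` as its "general env" fallback is a guess based on the `.env` example.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    cli_value: Optional[str],
    verify_name: str,
    general_name: str,
    default: str,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical helper: CLI arg > VERIFY_* env > general env > default."""
    env = os.environ if env is None else env
    # Candidates are listed from highest to lowest priority.
    candidates = (cli_value, env.get(verify_name), env.get(general_name))
    for value in candidates:
        if value:  # skip None and empty strings
            return value
    return default


if __name__ == "__main__":
    # Assumed pairing of variable names, for illustration only.
    demo_env = {"VERIFY_API_TYPE": "dmx", "LLM_API_TYPE": "dify"}
    print(resolve_setting(None, "VERIFY_API_TYPE", "LLM_API_TYPE", "mock", demo_env))
```

Passing the environment as a plain mapping keeps the lookup easy to test without touching the real process environment.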
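
The two separator conventions (`|||` inside keyword cells, ` | ` between match results) might be handled roughly as follows. The helper names `split_keyword_cell` and `join_matches` are hypothetical, and dropping duplicate matches is an assumption rather than documented behavior.

```python
from typing import Iterable, List

KEYWORD_SEPARATOR = "|||"  # separator inside keyword cells such as `可能名称`
MATCH_SEPARATOR = " | "    # separator used when writing matched keywords


def split_keyword_cell(cell: object) -> List[str]:
    """Split one keyword cell into trimmed, non-empty aliases."""
    if not isinstance(cell, str):  # empty Excel cells arrive as None or NaN
        return []
    parts = (part.strip() for part in cell.split(KEYWORD_SEPARATOR))
    return [part for part in parts if part]


def join_matches(matches: Iterable[str]) -> str:
    """Join matched keywords, keeping first-seen order without duplicates."""
    unique = list(dict.fromkeys(matches))
    return MATCH_SEPARATOR.join(unique)


if __name__ == "__main__":
    # A name containing a comma survives intact because only `|||` splits.
    print(split_keyword_cell("N,N-Dimethylformamide|||DMF ||| dimethylformamide"))
    print(join_matches(["DMF", "68-12-2", "DMF"]))
```

Splitting only on the triple pipe is what lets names that contain commas, hyphens, or single pipes pass through unchanged.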
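
The CAS recognition mode described under `keyword_matcher.py` can be illustrated with the documented regex. This is a sketch only: the function name `extract_cas_numbers` is hypothetical, the de-duplication step is an assumption, and the real `CASRegexMatcher` may apply further checks.

```python
import re
from typing import List

# Pattern copied from the keyword_matcher.py description above.
CAS_PATTERN = re.compile(r"\b(\d{2,7})[\s\-._]?(\d{2})[\s\-._]?(\d)\b")


def extract_cas_numbers(text: str) -> List[str]:
    """Find CAS-like numbers and normalize them to the XXX-XX-X form."""
    normalized = (
        f"{m.group(1)}-{m.group(2)}-{m.group(3)}"
        for m in CAS_PATTERN.finditer(text)
    )
    # Keep first-seen order and drop repeats of the same normalized number.
    return list(dict.fromkeys(normalized))


if __name__ == "__main__":
    sample = "CAS 123-45-6, also written 123 45 6, 123.45.6 or 123456"
    print(extract_cas_numbers(sample))
```

Because the separators are optional, any bare run of five to ten digits also matches, so unrelated numbers such as order IDs can produce false positives unless candidates are checked against the `CAS号` column.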