
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a drug risk monitoring and data processing system for detecting controlled substances in text and image data from e-commerce platforms, darknet sources, and social media.

**Core Capabilities:**
1. **Image Recognition**: Batch image analysis using LLM APIs (OpenAI, Anthropic, DMX, Dify) for OCR and risk detection
2. **Keyword Matching**: Multi-mode keyword matching with CAS number extraction and exact matching
3. **LLM Verification**: Secondary verification of high-confidence unmatched records using an LLM
4. **Data Collection**: Merge and consolidate results from batch processing

## Running Scripts

All scripts must be run from the `scripts/` directory:

```bash
cd scripts/

# Image batch recognition (mock mode for testing)
python3 image_batch_recognizer.py --mock --limit 5

# Image recognition with API
python3 image_batch_recognizer.py --api-type dify --limit 10

# Collect and merge xlsx files from batch output
python3 collect_xlsx.py

# Multi-mode keyword matching (default: cas + exact)
python3 keyword_matcher.py

# Single mode matching
python3 keyword_matcher.py -m cas      # CAS number only
python3 keyword_matcher.py -m exact    # Exact matching only

# Verify high-confidence unmatched records
python3 verify_high_confidence.py -o original.xlsx -m matched.xlsx --mock
```

## Dependencies

**Required:**
```bash
pip install pandas openpyxl
```

**Optional:**
```bash
pip install pyahocorasick  # 5x faster exact matching
pip install requests       # Required for Dify API
pip install tqdm           # Progress bars
pip install openai         # For OpenAI-compatible APIs in verify script
```

## Environment Configuration

Copy `.env.example` to `.env` and configure API keys:

```bash
# Default API type (openai | dmx | dify | ollama)
LLM_API_TYPE="dify"

# DMX API (OpenAI compatible)
DMX_API_KEY="your-key"
DMX_BASE_URL="https://www.dmxapi.cn"
DMX_MODEL="gpt-4o-mini"

# Dify API (used by image_batch_recognizer.py)
DIFY_API_KEY="app-xxx"
DIFY_BASE_URL="https://your-dify-server:4433"
DIFY_USER_ID="default-user"

# Separate config for verify_high_confidence.py (VERIFY_ prefix)
VERIFY_API_TYPE="dmx"
VERIFY_API_KEY="your-key"
VERIFY_BASE_URL="https://api.example.com"
VERIFY_MODEL="gpt-4o-mini"
```

## Data Flow Architecture

```
data/
├── input/                      # Source data
│   ├── clickin_text_img.xlsx   # Text + image paths
│   └── keywords.xlsx           # Keyword database
├── images/                     # Image files for recognition
├── batch_output/               # Per-folder recognition results
│   └── {name}/results.xlsx
├── data_all/                   # Original data by source
│   └── {name}_text_img.xlsx
├── collected_xlsx/             # Merged results (collect_xlsx.py output)
└── output/                     # Final processed results
```

**Processing Pipeline:**
```
1. image_batch_recognizer.py → batch_output/{name}/results.xlsx
2. collect_xlsx.py → Merge results.xlsx with {name}_text_img.xlsx → collected_xlsx/
3. keyword_matcher.py → Match keywords in text → output/keyword_matched_results.xlsx
4. verify_high_confidence.py → LLM verification of unmatched high-confidence rows → *_llm_verified.xlsx
```

## Key Scripts

### keyword_matcher.py

Two detection modes, implemented with the Strategy pattern:

1. **CAS Number Recognition (`-m cas`)**
   - Regex pattern: `\b(\d{2,7})[\s\-._]?(\d{2})[\s\-._]?(\d)\b`
   - Supports formats: `123-45-6`, `123 45 6`, `123456`, `123.45.6`
   - Normalizes every match to the hyphenated form (e.g. `123456` becomes `123-45-6`); the first group may have 2 to 7 digits
   - Source column: `CAS号`

2. **Exact Matching (`-m exact`)**
   - Uses an Aho-Corasick automaton if `pyahocorasick` is installed; otherwise falls back to regex with word boundaries
   - Source columns: `中文名`, `英文名`, `CAS号`, `简称`, `可能名称`
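As a rough illustration of the CAS mode, the sketch below applies the regex quoted above and rebuilds each match in hyphenated form. The function name `extract_cas_numbers` and the order-preserving de-duplication are assumptions for this example, not necessarily how `keyword_matcher.py` is written.

```python
import re

# Pattern quoted from this document; the helper name below is hypothetical.
CAS_PATTERN = re.compile(r"\b(\d{2,7})[\s\-._]?(\d{2})[\s\-._]?(\d)\b")


def extract_cas_numbers(text: str) -> list[str]:
    """Return normalized CAS candidates in order of first appearance."""
    seen: dict[str, None] = {}
    for head, middle, check in CAS_PATTERN.findall(text):
        # Rebuild every accepted spelling as head-middle-check.
        seen.setdefault(f"{head}-{middle}-{check}", None)
    return list(seen)


if __name__ == "__main__":
    print(extract_cas_numbers("Ships 123456 and 123.45.6 plus 64 17 5"))
    # ['123-45-6', '64-17-5']
```

Note that the pattern by itself does not validate the CAS check digit, so any run of five or more digits (an order number, for instance) also matches and would need filtering elsewhere.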

**Multi-column text matching:**
- Automatically detects and combines the `detected_text` and `文本` (text) columns
- Use `-c col1 col2` to specify custom columns
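One possible way to combine several text columns into a single search string is sketched below. It works on plain dict rows to stay dependency-free; the helper name `combine_text`, the newline separator, and the NaN handling are assumptions, and the real script presumably operates on a pandas DataFrame.

```python
def combine_text(row: dict, columns=("detected_text", "文本")) -> str:
    """Join the non-empty text cells of one row into a single search string."""
    parts = []
    for name in columns:
        value = row.get(name)
        # Skip missing columns, None cells, and NaN cells (NaN != NaN).
        if value is None or value != value:
            continue
        text = str(value).strip()
        if text:
            parts.append(text)
    # Assumption: a newline keeps words from different columns from fusing.
    return "\n".join(parts)


if __name__ == "__main__":
    row = {"detected_text": "CAS 64-17-5", "文本": " ethanol "}
    print(combine_text(row))
```

Passing `columns=("col1", "col2")` mirrors what the `-c` option is described as doing.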

**Class Hierarchy:**
```
KeywordMatcher (ABC)
├── CASRegexMatcher      # Regex CAS extraction + normalization
├── RegexExactMatcher    # Word-boundary exact matching
├── AhoCorasickMatcher   # Fast multi-pattern matching
└── SetMatcher           # Simple substring matching
```
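The hierarchy above could look roughly like the following minimal sketch, which implements only two of the four strategies. The method name `match`, the constructor signature, the case-insensitive comparison, and the `run_matchers` helper are all assumptions made for illustration; only the class names come from this document.

```python
import re
from abc import ABC, abstractmethod


class KeywordMatcher(ABC):
    """Common interface; each subclass is one interchangeable strategy."""

    def __init__(self, keywords):
        self.keywords = list(keywords)

    @abstractmethod
    def match(self, text: str) -> list[str]:
        """Return the keywords found in text."""


class RegexExactMatcher(KeywordMatcher):
    """Word-boundary matching: 'eth' does not match inside 'ethanol'."""

    def __init__(self, keywords):
        super().__init__(keywords)
        self._patterns = {
            kw: re.compile(rf"\b{re.escape(kw)}\b", re.IGNORECASE)
            for kw in self.keywords
        }

    def match(self, text):
        return [kw for kw, pat in self._patterns.items() if pat.search(text)]


class SetMatcher(KeywordMatcher):
    """Plain substring matching, with no boundary check."""

    def match(self, text):
        lowered = text.lower()
        return [kw for kw in self.keywords if kw.lower() in lowered]


def run_matchers(matchers, text):
    """Apply every strategy and merge hits, keeping first-seen order."""
    hits = {}
    for matcher in matchers:
        for kw in matcher.match(text):
            hits.setdefault(kw, None)
    return list(hits)
```

Because every strategy exposes the same `match` method, a caller can select matchers from the `-m` option and pass them to one loop without branching on the mode.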

### verify_high_confidence.py

Compares `keyword_matcher.py` output with the original data to find high-confidence rows that were not matched, then sends those rows to an LLM for secondary verification.

- Uses `VERIFY_` prefixed env vars (separate from image_batch_recognizer.py)
- Supports: OpenAI, DMX, Dify, Ollama, Mock modes
- Input columns: `raw_response`, `文本`

### collect_xlsx.py

Merges batch recognition results with original data:
- Matches by image filename (handles both Windows `\` and Unix `/` paths)
- Adds original columns (`文本`, metadata) to recognition results
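The filename-based join can be approximated as below. This is a sketch on plain dict rows, not the script's actual code: the helper names, the `path_field` parameter, and the assumption that both files store the path in a column called `image_path` are hypothetical.

```python
from pathlib import PureWindowsPath


def image_key(path: str) -> str:
    """Return the bare filename for a Windows- or Unix-style path."""
    # PureWindowsPath accepts both '\\' and '/' as separators on any OS.
    return PureWindowsPath(path).name


def attach_original(results, originals, path_field="image_path"):
    """Copy original-row columns onto recognition rows with the same filename."""
    by_name = {image_key(row[path_field]): row for row in originals}
    merged = []
    for row in results:
        source = by_name.get(image_key(row[path_field]), {})
        # Recognition values win if both rows define the same column.
        merged.append({**source, **row})
    return merged


if __name__ == "__main__":
    results = [{"image_path": "images/a.jpg", "detected_text": "label"}]
    originals = [{"image_path": r"D:\raw\a.jpg", "文本": "listing text"}]
    print(attach_original(results, originals))
```

Keying on the bare filename is what makes the join independent of the operating system that produced each spreadsheet, at the cost of requiring filenames to be unique across folders.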

### image_batch_recognizer.py

Batch image recognition with multiple API backends:
- Supports: OpenAI, Anthropic, DMX, Dify, Mock
- Outputs: `detected_text`, `detected_objects`, `sensitive_items`, `summary`, `confidence`
- Parallel processing with `--max-workers`

## Excel File Schemas

**keywords.xlsx columns:**
- `中文名` (Chinese name), `英文名` (English name), `CAS号` (CAS number), `简称` (abbreviation), `备注` (notes), `可能名称` (possible names)
- `可能名称` uses `|||` separator for multiple values
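A cell using the `|||` separator could be split along these lines. The helper name `split_aliases` and the whitespace trimming and empty-cell handling are assumptions for this sketch.

```python
ALIAS_SEPARATOR = "|||"  # separator stated in this document


def split_aliases(cell) -> list[str]:
    """Split one keyword cell into trimmed, non-empty alias strings."""
    # Treat None and NaN (NaN != NaN) as an empty cell.
    if cell is None or cell != cell:
        return []
    return [part.strip() for part in str(cell).split(ALIAS_SEPARATOR) if part.strip()]


if __name__ == "__main__":
    print(split_aliases("ethanol|||ethyl alcohol ||| EtOH"))
    # ['ethanol', 'ethyl alcohol', 'EtOH']
```

A single `|` inside a name survives the split, so `split_aliases("a|b|||c")` yields `['a|b', 'c']`.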

**Recognition output columns:**
- `image_name`, `image_path`, `detected_text`, `detected_objects`
- `sensitive_items`, `summary`, `confidence`, `raw_response`

**Matched output adds:**
- `匹配到的关键词` (matched keywords, ` | ` separated)
- `匹配模式` (match mode, e.g., "CAS号识别 + 精确匹配", meaning CAS number recognition + exact matching)

## Key Conventions

- Triple pipe `|||` separator in keyword cells (avoids conflicts with chemical names)
- Match result separator: ` | `
- All scripts use paths relative to the `scripts/` directory
- Configuration priority: command-line args > VERIFY_* env > general env > defaults
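The priority order could be implemented with a small resolver such as the one sketched here. The function name `resolve_setting`, its parameters, and the pairing of `VERIFY_MODEL` with `DMX_MODEL` as its general counterpart are assumptions; the scripts may map the variables differently.

```python
import os


def resolve_setting(cli_value, verify_key, general_key, default, env=None):
    """Pick the first value that is set: CLI > VERIFY_* env > general env > default."""
    env = os.environ if env is None else env
    candidates = (cli_value, env.get(verify_key), env.get(general_key))
    for value in candidates:
        if value:  # None and "" both count as unset
            return value
    return default


if __name__ == "__main__":
    # Hypothetical pairing: VERIFY_MODEL overrides DMX_MODEL.
    fake_env = {"VERIFY_MODEL": "verify-model", "DMX_MODEL": "general-model"}
    print(resolve_setting(None, "VERIFY_MODEL", "DMX_MODEL", "gpt-4o-mini", fake_env))
    # verify-model
```

Passing `env` explicitly keeps the lookup testable without touching the real process environment.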