# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
This is a drug risk monitoring and data processing system for detecting controlled substances in text and image data from e-commerce platforms, darknet sources, and social media.
**Core Capabilities:**
1. **Image Recognition**: Batch image analysis using LLM APIs (OpenAI, Anthropic, DMX, Dify) for OCR and risk detection
2. **Keyword Matching**: Multi-mode keyword matching with CAS number extraction and exact matching
3. **LLM Verification**: Secondary verification of high-confidence unmatched records using an LLM
4. **Data Collection**: Merge and consolidate results from batch processing
## Running Scripts
All scripts must be run from the `scripts/` directory:
```bash
cd scripts/
# Image batch recognition (mock mode for testing)
python3 image_batch_recognizer.py --mock --limit 5
# Image recognition with API
python3 image_batch_recognizer.py --api-type dify --limit 10
# Collect and merge xlsx files from batch output
python3 collect_xlsx.py
# Multi-mode keyword matching (default: cas + exact)
python3 keyword_matcher.py
# Single mode matching
python3 keyword_matcher.py -m cas # CAS number only
python3 keyword_matcher.py -m exact # Exact matching only
# Verify high-confidence unmatched records
python3 verify_high_confidence.py -o original.xlsx -m matched.xlsx --mock
```
## Dependencies
**Required:**
```bash
pip install pandas openpyxl
```
**Optional:**
```bash
pip install pyahocorasick # 5x faster exact matching
pip install requests # Required for Dify API
pip install tqdm # Progress bars
pip install openai # For OpenAI-compatible APIs in verify script
```
## Environment Configuration
Copy `.env.example` to `.env` and configure API keys:
```bash
# Default API type (openai | dmx | dify | ollama)
LLM_API_TYPE="dify"
# DMX API (OpenAI compatible)
DMX_API_KEY="your-key"
DMX_BASE_URL="https://www.dmxapi.cn"
DMX_MODEL="gpt-4o-mini"
# Dify API (used by image_batch_recognizer.py)
DIFY_API_KEY="app-xxx"
DIFY_BASE_URL="https://your-dify-server:4433"
DIFY_USER_ID="default-user"
# Separate config for verify_high_confidence.py (VERIFY_ prefix)
VERIFY_API_TYPE="dmx"
VERIFY_API_KEY="your-key"
VERIFY_BASE_URL="https://api.example.com"
VERIFY_MODEL="gpt-4o-mini"
```
## Data Flow Architecture
```
data/
├── input/ # Source data
│ ├── clickin_text_img.xlsx # Text + image paths
│ └── keywords.xlsx # Keyword database
├── images/ # Image files for recognition
├── batch_output/ # Per-folder recognition results
│ └── {name}/results.xlsx
├── data_all/ # Original data by source
│ └── {name}_text_img.xlsx
├── collected_xlsx/ # Merged results (collect_xlsx.py output)
└── output/ # Final processed results
```
**Processing Pipeline:**
```
1. image_batch_recognizer.py → batch_output/{name}/results.xlsx
2. collect_xlsx.py → Merge results.xlsx with {name}_text_img.xlsx → collected_xlsx/
3. keyword_matcher.py → Match keywords in text → output/keyword_matched_results.xlsx
4. verify_high_confidence.py → LLM verify unmatched high-confidence → *_llm_verified.xlsx
```
## Key Scripts
### keyword_matcher.py
Two detection modes with Strategy Pattern architecture:
1. **CAS Number Recognition (`-m cas`)**
- Regex pattern: `\b(\d{2,7})[\s\-._]?(\d{2})[\s\-._]?(\d)\b`
- Supports formats: `123-45-6`, `123 45 6`, `123456`, `123.45.6`
- Auto-normalizes to standard `XXX-XX-X` format
- Source column: `CAS号`
2. **Exact Matching (`-m exact`)**
- Uses Aho-Corasick automaton (if pyahocorasick installed) or regex with word boundaries
- Source columns: `中文名`, `英文名`, `CAS号`, `简称`, `可能名称`
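
The CAS mode described above can be sketched as follows. This is a minimal illustration built only from the regex and the `XXX-XX-X` normalization stated in this file; the function name `extract_cas_numbers` and the de-duplication step are assumptions, not necessarily what `CASRegexMatcher` does.

```python
import re
from typing import List

# Pattern copied from the description above: 2-7 digits, 2 digits, 1 digit,
# with an optional whitespace, '-', '.', or '_' separator between the groups.
CAS_PATTERN = re.compile(r"\b(\d{2,7})[\s\-._]?(\d{2})[\s\-._]?(\d)\b")


def extract_cas_numbers(text: str) -> List[str]:
    """Return CAS-like numbers found in `text`, normalized to XXX-XX-X form."""
    found: List[str] = []
    for prefix, middle, check in CAS_PATTERN.findall(text):
        normalized = f"{prefix}-{middle}-{check}"
        if normalized not in found:  # keep first-seen order, drop duplicates
            found.append(normalized)
    return found


if __name__ == "__main__":
    for sample in ("123-45-6", "123 45 6", "123456", "123.45.6"):
        print(sample, "->", extract_cas_numbers(sample))
```

Because the separators are optional, any bare run of 5 to 10 digits also matches the pattern, so this sketch does no filtering beyond the regex itself.
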
**Multi-column text matching:**
- Automatically detects and combines `detected_text` and `文本` columns
- Use `-c col1 col2` to specify custom columns
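
Column combination could look roughly like the sketch below. It works on a plain dict standing in for one spreadsheet row, to stay dependency-free; the helper name `combine_text`, the newline join, and the NaN handling are assumptions for illustration only.

```python
from typing import Iterable, Mapping, Optional

# Columns that are combined automatically, per the description above.
DEFAULT_TEXT_COLUMNS = ("detected_text", "文本")


def combine_text(row: Mapping[str, object],
                 columns: Optional[Iterable[str]] = None) -> str:
    """Join the text columns of one row into a single string to search."""
    if columns is None:
        # Auto-detect: use whichever default columns the row actually has.
        columns = [c for c in DEFAULT_TEXT_COLUMNS if c in row]
    parts = []
    for col in columns:
        value = row.get(col)
        # Skip missing cells: None, or NaN (the only value not equal to itself).
        if value is None or value != value:
            continue
        text = str(value).strip()
        if text:
            parts.append(text)
    return "\n".join(parts)
```

Passing an explicit `columns` list plays the role of `-c col1 col2`: only the named columns are searched.
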
**Class Hierarchy:**
```
KeywordMatcher (ABC)
├── CASRegexMatcher # Regex CAS extraction + normalization
├── RegexExactMatcher # Word-boundary exact matching
├── AhoCorasickMatcher # Fast multi-pattern matching
└── SetMatcher # Simple substring matching
```
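
A minimal sketch of how such a Strategy Pattern hierarchy might be laid out. The class names come from the tree above, but the `match` method, the constructor signature, case-sensitive matching, and the `run_matchers` helper are all hypothetical; the real script may differ.

```python
import re
from abc import ABC, abstractmethod
from typing import Iterable, List


class KeywordMatcher(ABC):
    """Common interface: each strategy returns the keywords found in a text."""

    def __init__(self, keywords: Iterable[str]):
        self.keywords = [k for k in keywords if k]

    @abstractmethod
    def match(self, text: str) -> List[str]:
        ...


class RegexExactMatcher(KeywordMatcher):
    """Matches a keyword only when it is bounded by word boundaries."""

    def __init__(self, keywords: Iterable[str]):
        super().__init__(keywords)
        self._patterns = [
            (k, re.compile(r"\b" + re.escape(k) + r"\b")) for k in self.keywords
        ]

    def match(self, text: str) -> List[str]:
        return [k for k, pattern in self._patterns if pattern.search(text)]


class SetMatcher(KeywordMatcher):
    """Plain substring matching, with no boundary check."""

    def match(self, text: str) -> List[str]:
        return [k for k in self.keywords if k in text]


def run_matchers(matchers: Iterable[KeywordMatcher], text: str) -> List[str]:
    """Apply each strategy in turn and merge hits without duplicates."""
    hits: List[str] = []
    for matcher in matchers:
        for keyword in matcher.match(text):
            if keyword not in hits:
                hits.append(keyword)
    return hits
```

Keeping every strategy behind one `match` interface is what lets the default run combine modes (`cas` + `exact`) without the caller knowing which matcher produced a hit.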
### verify_high_confidence.py
Compares `keyword_matcher.py` output with the original data to find high-confidence rows that were not matched, then uses an LLM for secondary verification.
- Uses `VERIFY_` prefixed env vars (separate from image_batch_recognizer.py)
- Supports: OpenAI, DMX, Dify, Ollama, Mock modes
- Input columns: `raw_response`, `文本`
### collect_xlsx.py
Merges batch recognition results with original data:
- Matches by image filename (handles both Windows `\` and Unix `/` paths)
- Adds original columns (`文本`, metadata) to recognition results
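
The filename-based join can be illustrated with the sketch below. It uses lists of dicts in place of spreadsheets; `image_key`, `merge_by_image`, and the assumption that both inputs carry an `image_path` field are hypothetical, not taken from the script.

```python
from pathlib import PureWindowsPath
from typing import Dict, List


def image_key(path: str) -> str:
    """Return the bare filename for a Windows- or Unix-style path."""
    # PureWindowsPath accepts both "\" and "/" as separators on any OS.
    return PureWindowsPath(path.strip()).name


def merge_by_image(results: List[Dict], originals: List[Dict],
                   path_field: str = "image_path") -> List[Dict]:
    """Attach original columns to each recognition row with the same filename."""
    by_key = {image_key(row[path_field]): row for row in originals}
    merged = []
    for row in results:
        original = by_key.get(image_key(row[path_field]), {})
        # Recognition values win when both rows define the same column.
        merged.append({**original, **row})
    return merged
```

Keying on the filename alone means two images with the same name in different folders would collide; this sketch does not guard against that.
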
### image_batch_recognizer.py
Batch image recognition with multiple API backends:
- Supports: OpenAI, Anthropic, DMX, Dify, Mock
- Outputs: `detected_text`, `detected_objects`, `sensitive_items`, `summary`, `confidence`
- Parallel processing with `--max-workers`
## Excel File Schemas
**keywords.xlsx columns:**
- `中文名`, `英文名`, `CAS号`, `简称`, `备注`, `可能名称`
- `可能名称` uses `|||` separator for multiple values
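
Splitting a `可能名称` cell might look like this. The helper name `split_keyword_cell` and the handling of empty or NaN cells are assumptions for illustration.

```python
from typing import List

KEYWORD_SEPARATOR = "|||"


def split_keyword_cell(cell: object) -> List[str]:
    """Split one keyword cell on '|||' and drop empty entries."""
    # Treat None and NaN (the only value not equal to itself) as an empty cell.
    if cell is None or cell != cell:
        return []
    parts = (part.strip() for part in str(cell).split(KEYWORD_SEPARATOR))
    return [part for part in parts if part]
```

A single `|` or a comma inside a name survives untouched, for example `split_keyword_cell("N,N-dimethylformamide|||DMF")` keeps the comma in the first entry.
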
**Recognition output columns:**
- `image_name`, `image_path`, `detected_text`, `detected_objects`
- `sensitive_items`, `summary`, `confidence`, `raw_response`
**Matched output adds:**
- `匹配到的关键词` (matched keywords, ` | ` separated)
- `匹配模式` (e.g., "CAS号识别 + 精确匹配")
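
As a rough illustration of how these two columns could be filled, the sketch below joins keywords with ` | ` and mode labels with ` + `. The function `build_match_columns`, its input shape, and the choice to list only modes that produced a hit are assumptions; the script may record modes differently.

```python
from typing import Dict, List


def build_match_columns(hits_by_mode: Dict[str, List[str]]) -> Dict[str, str]:
    """Build the two extra output columns from per-mode keyword hits."""
    keywords: List[str] = []
    modes: List[str] = []
    for mode, hits in hits_by_mode.items():
        if not hits:
            continue  # assumption: a mode with no hits is not listed
        modes.append(mode)
        for hit in hits:
            if hit not in keywords:  # avoid repeating a keyword found twice
                keywords.append(hit)
    return {
        "匹配到的关键词": " | ".join(keywords),
        "匹配模式": " + ".join(modes),
    }
```
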
## Key Conventions
- Triple pipe `|||` separator in keyword cells (avoids conflicts with chemical names)
- Match result separator: ` | `
- All scripts use relative paths from `scripts/` directory
- Configuration priority: command-line args > VERIFY_* env > general env > defaults
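
The configuration priority can be sketched as a small lookup chain. `resolve_setting` and its parameters are hypothetical, and pairing `VERIFY_MODEL` with `DMX_MODEL` as its "general" counterpart is an assumption made only for this example.

```python
import os
from typing import Mapping, Optional


def resolve_setting(name: str,
                    cli_value: Optional[str] = None,
                    general_key: Optional[str] = None,
                    default: Optional[str] = None,
                    env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Resolve one setting: CLI arg > VERIFY_* env > general env > default."""
    env = os.environ if env is None else env
    if cli_value is not None:
        return cli_value
    for key in (f"VERIFY_{name}", general_key):
        # Empty strings are treated as unset so they fall through.
        if key and env.get(key):
            return env[key]
    return default


if __name__ == "__main__":
    # With nothing set on the command line, the environment decides.
    print(resolve_setting("MODEL", general_key="DMX_MODEL", default="gpt-4o-mini"))
```

Passing `env` explicitly keeps the function easy to exercise in tests without touching the real process environment.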