# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a drug risk monitoring and data processing system that detects controlled substances in text and image data from e-commerce platforms, darknet sources, and social media.

**Core Capabilities:**
1. **Image Recognition**: Batch image analysis using LLM APIs (OpenAI, Anthropic, DMX, Dify) for OCR and risk detection
2. **Keyword Matching**: Multi-mode keyword matching with CAS number extraction and exact matching
3. **LLM Verification**: Secondary LLM verification of high-confidence records that keyword matching did not match
4. **Data Collection**: Merge and consolidate results from batch processing

## Running Scripts

All scripts must be run from the `scripts/` directory:

```bash
cd scripts/

# Image batch recognition (mock mode for testing)
python3 image_batch_recognizer.py --mock --limit 5

# Image recognition with API
python3 image_batch_recognizer.py --api-type dify --limit 10

# Collect and merge xlsx files from batch output
python3 collect_xlsx.py

# Multi-mode keyword matching (default: cas + exact)
python3 keyword_matcher.py

# Single mode matching
python3 keyword_matcher.py -m cas    # CAS number only
python3 keyword_matcher.py -m exact  # Exact matching only

# Verify high-confidence unmatched records
python3 verify_high_confidence.py -o original.xlsx -m matched.xlsx --mock
```

## Dependencies

**Required:**
```bash
pip install pandas openpyxl
```

**Optional:**
```bash
pip install pyahocorasick  # 5x faster exact matching
pip install requests       # Required for Dify API
pip install tqdm           # Progress bars
pip install openai         # For OpenAI-compatible APIs in verify script
```

## Environment Configuration

Copy `.env.example` to `.env` and configure API keys:

```bash
# Default API type (openai | dmx | dify | ollama)
LLM_API_TYPE="dify"

# DMX API (OpenAI compatible)
DMX_API_KEY="your-key"
DMX_BASE_URL="https://www.dmxapi.cn"
DMX_MODEL="gpt-4o-mini"

# Dify API (used by image_batch_recognizer.py)
DIFY_API_KEY="app-xxx"
DIFY_BASE_URL="https://your-dify-server:4433"
DIFY_USER_ID="default-user"

# Separate config for verify_high_confidence.py (VERIFY_ prefix)
VERIFY_API_TYPE="dmx"
VERIFY_API_KEY="your-key"
VERIFY_BASE_URL="https://api.example.com"
VERIFY_MODEL="gpt-4o-mini"
```

## Data Flow Architecture

```
data/
├── input/                      # Source data
│   ├── clickin_text_img.xlsx   # Text + image paths
│   └── keywords.xlsx           # Keyword database
├── images/                     # Image files for recognition
├── batch_output/               # Per-folder recognition results
│   └── {name}/results.xlsx
├── data_all/                   # Original data by source
│   └── {name}_text_img.xlsx
├── collected_xlsx/             # Merged results (collect_xlsx.py output)
└── output/                     # Final processed results
```

**Processing Pipeline:**
```
1. image_batch_recognizer.py → batch_output/{name}/results.xlsx
2. collect_xlsx.py → Merge results.xlsx with {name}_text_img.xlsx → collected_xlsx/
3. keyword_matcher.py → Match keywords in text → output/keyword_matched_results.xlsx
4. verify_high_confidence.py → LLM verify unmatched high-confidence → *_llm_verified.xlsx
```

## Key Scripts

### keyword_matcher.py

Two detection modes, implemented with the Strategy pattern:

1. **CAS Number Recognition (`-m cas`)**
   - Regex pattern: `\b(\d{2,7})[\s\-._]?(\d{2})[\s\-._]?(\d)\b`
   - Supported formats: `123-45-6`, `123 45 6`, `123456`, `123.45.6`
   - Auto-normalizes to the standard `XXX-XX-X` format
   - Source column: `CAS号` (CAS number)

2. **Exact Matching (`-m exact`)**
   - Uses an Aho-Corasick automaton (if `pyahocorasick` is installed) or regex with word boundaries
   - Source columns: `中文名` (Chinese name), `英文名` (English name), `CAS号` (CAS number), `简称` (abbreviation), `可能名称` (possible names)

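The CAS mode described above can be sketched as follows. The regex is the one quoted in the list; the function name `extract_cas_numbers` and the de-duplication step are illustrative assumptions, not necessarily what `CASRegexMatcher` does.

```python
import re

# Pattern quoted from the list above: 2-7 digits, 2 digits, 1 digit,
# each group optionally separated by whitespace, '-', '.', or '_'.
CAS_PATTERN = re.compile(r"\b(\d{2,7})[\s\-._]?(\d{2})[\s\-._]?(\d)\b")


def extract_cas_numbers(text: str) -> list[str]:
    """Return normalized CAS numbers in order of first appearance.

    Hypothetical helper: duplicates are dropped after normalization.
    """
    seen: dict[str, None] = {}
    for match in CAS_PATTERN.finditer(text):
        # Join the three captured groups with hyphens (XXX-XX-X form)
        seen.setdefault("-".join(match.groups()), None)
    return list(seen)


sample = "CAS 123-45-6, also 123 45 6 and 123.45.6 / 123456"
print(extract_cas_numbers(sample))  # ['123-45-6']
```

Because the separators are optional and the pattern performs no check-digit validation, any standalone run of 5 to 10 digits also matches, so results from this mode may include false positives such as order numbers.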
**Multi-column text matching:**
- Automatically detects and combines the `detected_text` and `文本` (text) columns
- Use `-c col1 col2` to specify custom columns

**Class Hierarchy:**
```
KeywordMatcher (ABC)
├── CASRegexMatcher      # Regex CAS extraction + normalization
├── RegexExactMatcher    # Word-boundary exact matching
├── AhoCorasickMatcher   # Fast multi-pattern matching
└── SetMatcher           # Simple substring matching
```

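A minimal sketch of how this hierarchy could look as a Strategy pattern. The class names come from the tree above, but the method name `match`, the constructor signature, and the return type are assumptions made for illustration. `AhoCorasickMatcher` is omitted because it needs the optional `pyahocorasick` package.

```python
import re
from abc import ABC, abstractmethod


class KeywordMatcher(ABC):
    """Common interface for all strategies (method name is an assumption)."""

    def __init__(self, keywords: list[str]) -> None:
        # Drop empty entries so they never produce spurious matches
        self.keywords = [k for k in keywords if k]

    @abstractmethod
    def match(self, text: str) -> list[str]:
        """Return the keywords found in `text`."""


class RegexExactMatcher(KeywordMatcher):
    """Whole-word matching using regex word boundaries."""

    def __init__(self, keywords: list[str]) -> None:
        super().__init__(keywords)
        self._patterns = {
            k: re.compile(rf"\b{re.escape(k)}\b") for k in self.keywords
        }

    def match(self, text: str) -> list[str]:
        return [k for k, pattern in self._patterns.items() if pattern.search(text)]


class SetMatcher(KeywordMatcher):
    """Plain substring matching."""

    def match(self, text: str) -> list[str]:
        return [k for k in self.keywords if k in text]


keywords = ["ethanol", "acetone"]
text = "methanol and acetone"
print(RegexExactMatcher(keywords).match(text))  # ['acetone']
print(SetMatcher(keywords).match(text))         # ['ethanol', 'acetone']
```

The two outputs differ because `ethanol` occurs only inside `methanol`: substring matching reports it, word-boundary matching does not. Callers depend only on `KeywordMatcher.match`, so strategies can be swapped without touching the calling code.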
### verify_high_confidence.py

Compares `keyword_matcher.py` output with the original data to find high-confidence rows that were not matched, then sends them to an LLM for secondary verification.

- Uses `VERIFY_`-prefixed env vars (separate from those of `image_batch_recognizer.py`)
- Supports: OpenAI, DMX, Dify, Ollama, Mock modes
- Input columns: `raw_response`, `文本` (text)

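The selection step might look roughly like the sketch below. It rests on several assumptions that this file does not confirm: rows are compared by `image_name`, `confidence` is numeric, and the cut-off is `0.8`. The function name is hypothetical as well.

```python
def find_unverified_rows(original_rows, matched_rows, key="image_name", threshold=0.8):
    """Rows that are high-confidence in the original data but absent from the matches.

    Assumptions: `key` identifies a row in both inputs, and `confidence`
    can be converted to a float. Neither is confirmed by the script itself.
    """
    matched_keys = {row[key] for row in matched_rows}
    return [
        row
        for row in original_rows
        if row[key] not in matched_keys and float(row["confidence"]) >= threshold
    ]


original = [
    {"image_name": "a.jpg", "confidence": 0.95, "文本": "sample text A"},
    {"image_name": "b.jpg", "confidence": 0.90, "文本": "sample text B"},
    {"image_name": "c.jpg", "confidence": 0.30, "文本": "sample text C"},
]
matched = [{"image_name": "a.jpg"}]

pending = find_unverified_rows(original, matched)
print([row["image_name"] for row in pending])  # ['b.jpg']
```

Only `b.jpg` is selected: `a.jpg` was already matched by keywords and `c.jpg` falls below the assumed threshold.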
### collect_xlsx.py

Merges batch recognition results with original data:
- Matches rows by image filename (handles both Windows `\` and Unix `/` paths)
- Adds the original columns (`文本` (text), metadata) to the recognition results

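One way to match rows by filename regardless of separator style is sketched below. The helper names and the use of `image_path` as the path column in both inputs are assumptions; the actual script works on Excel files rather than lists of dicts.

```python
from pathlib import PureWindowsPath


def image_key(path: str) -> str:
    """File name of `path`, whichever separator style it uses."""
    # PureWindowsPath treats both '\\' and '/' as separators on every OS
    return PureWindowsPath(path).name


def merge_by_image(results, originals, path_field="image_path"):
    """Attach original columns to recognition rows that share a file name."""
    by_name = {image_key(row[path_field]): row for row in originals}
    merged = []
    for row in results:
        extra = by_name.get(image_key(row[path_field]), {})
        # Recognition values win if both sides define the same column
        merged.append({**extra, **row})
    return merged


results = [{"image_path": "data/images/a.jpg", "detected_text": "label A"}]
originals = [{"image_path": r"C:\export\images\a.jpg", "文本": "original text"}]
print(merge_by_image(results, originals))
```

Matching on the bare file name assumes names are unique within one source folder; if two folders contain the same file name, a longer key would be needed.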
### image_batch_recognizer.py

Batch image recognition with multiple API backends:
- Supports: OpenAI, Anthropic, DMX, Dify, Mock
- Outputs: `detected_text`, `detected_objects`, `sensitive_items`, `summary`, `confidence`
- Parallel processing with `--max-workers`

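As a rough illustration of what parallel processing with a worker limit can look like, the sketch below uses a thread pool and a stand-in for mock mode. Whether the script uses threads or processes is an assumption, and `recognize_mock`, `recognize_batch`, and the placeholder field values are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def recognize_mock(image_path: str) -> dict:
    """Stand-in for a backend call; returns the documented output fields."""
    return {
        "image_path": image_path,
        "detected_text": "",
        "detected_objects": "",
        "sensitive_items": "",
        "summary": "mock result",
        "confidence": 0.0,  # placeholder; the real value type is not specified here
    }


def recognize_batch(image_paths, max_workers=4):
    """Run recognition concurrently, keeping results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs
        return list(pool.map(recognize_mock, image_paths))


rows = recognize_batch(["a.jpg", "b.jpg", "c.jpg"], max_workers=2)
print([row["image_path"] for row in rows])  # ['a.jpg', 'b.jpg', 'c.jpg']
```

Threads suit this workload if most of the time is spent waiting on HTTP responses from the API backend.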
## Excel File Schemas

**keywords.xlsx columns:**
- `中文名` (Chinese name), `英文名` (English name), `CAS号` (CAS number), `简称` (abbreviation), `备注` (notes), `可能名称` (possible names)
- `可能名称` uses the `|||` separator for multiple values

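Splitting a `可能名称` cell on the `|||` separator could be done as in this sketch. The function name `split_aliases` and the handling of empty cells are assumptions.

```python
ALIAS_SEPARATOR = "|||"


def split_aliases(cell) -> list[str]:
    """Split one `可能名称` cell into trimmed, non-empty names."""
    # Empty Excel cells typically arrive as None or as a float NaN (NaN != NaN)
    if cell is None or (isinstance(cell, float) and cell != cell):
        return []
    return [part.strip() for part in str(cell).split(ALIAS_SEPARATOR) if part.strip()]


print(split_aliases("N,N-dimethylformamide ||| DMF|||"))  # ['N,N-dimethylformamide', 'DMF']
print(split_aliases("a|b|||c"))                           # ['a|b', 'c']
```

The second call shows why a triple pipe is used: a single `|` inside a name survives the split untouched.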
**Recognition output columns:**
- `image_name`, `image_path`, `detected_text`, `detected_objects`
- `sensitive_items`, `summary`, `confidence`, `raw_response`

**Matched output adds:**
- `匹配到的关键词` (matched keywords, ` | ` separated)
- `匹配模式` (match mode, e.g., "CAS号识别 + 精确匹配", meaning "CAS number recognition + exact matching")

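The two added columns could be assembled as in the sketch below. The column names, the ` | ` separator, and the mode labels come from this file; the function `format_match_row`, the de-duplication, and joining modes with ` + ` (inferred from the example value) are assumptions.

```python
RESULT_SEPARATOR = " | "


def format_match_row(matches_by_mode: dict[str, list[str]]) -> dict[str, str]:
    """Build the two output cells from per-mode match lists."""
    keywords: list[str] = []
    modes: list[str] = []
    for mode_label, found in matches_by_mode.items():
        if not found:
            continue  # a mode that matched nothing is not reported
        modes.append(mode_label)
        for keyword in found:
            if keyword not in keywords:
                keywords.append(keyword)
    return {
        "匹配到的关键词": RESULT_SEPARATOR.join(keywords),
        "匹配模式": " + ".join(modes),
    }


row = format_match_row({"CAS号识别": ["64-17-5"], "精确匹配": ["ethanol", "64-17-5"]})
print(row)
```

For the sample input this yields `64-17-5 | ethanol` in the keyword column and `CAS号识别 + 精确匹配` in the mode column.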
## Key Conventions

- Keyword cells use the triple-pipe `|||` separator (avoids conflicts with characters in chemical names)
- Match result separator: ` | `
- All scripts use paths relative to the `scripts/` directory
- Configuration priority: command-line args > `VERIFY_*` env vars > general env vars > defaults
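
The configuration priority above can be sketched as a simple fallback chain. The function `resolve_setting`, its parameters, the `"mock"` default, and the choice to treat empty strings as unset are all illustrative assumptions; only the ordering and the env var names `VERIFY_API_TYPE` and `LLM_API_TYPE` come from this file.

```python
import os


def resolve_setting(cli_value, verify_name, general_name, default, env=None):
    """Pick a value: command-line arg > VERIFY_* env > general env > default."""
    env = os.environ if env is None else env
    if cli_value is not None:
        return cli_value
    for name in (verify_name, general_name):
        value = env.get(name)
        if value:  # assumption: an empty string counts as "not set"
            return value
    return default


# An explicit dict stands in for the process environment in this demo
demo_env = {"VERIFY_API_TYPE": "dmx", "LLM_API_TYPE": "dify"}
print(resolve_setting(None, "VERIFY_API_TYPE", "LLM_API_TYPE", "mock", demo_env))    # dmx
print(resolve_setting("ollama", "VERIFY_API_TYPE", "LLM_API_TYPE", "mock", demo_env))  # ollama
```

Passing the environment as a parameter keeps the lookup easy to test without modifying `os.environ`.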