CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
This is a drug risk monitoring and data processing system for detecting controlled substances in text and image data from e-commerce platforms, darknet sources, and social media.
Core Capabilities:
- Image Recognition: Batch image analysis using LLM APIs (OpenAI, Anthropic, DMX, Dify) for OCR and risk detection
- Keyword Matching: Multi-mode keyword matching with CAS number extraction and exact matching
- LLM Verification: Secondary verification of high-confidence unmatched records using LLM
- Data Collection: Merge and consolidate results from batch processing
Running Scripts
All scripts must be run from the scripts/ directory:
cd scripts/
# Image batch recognition (mock mode for testing)
python3 image_batch_recognizer.py --mock --limit 5
# Image recognition with API
python3 image_batch_recognizer.py --api-type dify --limit 10
# Collect and merge xlsx files from batch output
python3 collect_xlsx.py
# Multi-mode keyword matching (default: cas + exact)
python3 keyword_matcher.py
# Single mode matching
python3 keyword_matcher.py -m cas # CAS number only
python3 keyword_matcher.py -m exact # Exact matching only
# Verify high-confidence unmatched records
python3 verify_high_confidence.py -o original.xlsx -m matched.xlsx --mock
Dependencies
Required:
pip install pandas openpyxl
Optional:
pip install pyahocorasick # 5x faster exact matching
pip install requests # Required for Dify API
pip install tqdm # Progress bars
pip install openai # For OpenAI-compatible APIs in verify script
Environment Configuration
Copy .env.example to .env and configure API keys:
# Default API type (openai | dmx | dify | ollama)
LLM_API_TYPE="dify"
# DMX API (OpenAI compatible)
DMX_API_KEY="your-key"
DMX_BASE_URL="https://www.dmxapi.cn"
DMX_MODEL="gpt-4o-mini"
# Dify API (used by image_batch_recognizer.py)
DIFY_API_KEY="app-xxx"
DIFY_BASE_URL="https://your-dify-server:4433"
DIFY_USER_ID="default-user"
# Separate config for verify_high_confidence.py (VERIFY_ prefix)
VERIFY_API_TYPE="dmx"
VERIFY_API_KEY="your-key"
VERIFY_BASE_URL="https://api.example.com"
VERIFY_MODEL="gpt-4o-mini"
Data Flow Architecture
data/
├── input/ # Source data
│ ├── clickin_text_img.xlsx # Text + image paths
│ └── keywords.xlsx # Keyword database
├── images/ # Image files for recognition
├── batch_output/ # Per-folder recognition results
│ └── {name}/results.xlsx
├── data_all/ # Original data by source
│ └── {name}_text_img.xlsx
├── collected_xlsx/ # Merged results (collect_xlsx.py output)
└── output/ # Final processed results
Processing Pipeline:
1. image_batch_recognizer.py → batch_output/{name}/results.xlsx
2. collect_xlsx.py → Merge results.xlsx with {name}_text_img.xlsx → collected_xlsx/
3. keyword_matcher.py → Match keywords in text → output/keyword_matched_results.xlsx
4. verify_high_confidence.py → LLM verify unmatched high-confidence → *_llm_verified.xlsx
Key Scripts
keyword_matcher.py
Two detection modes with Strategy Pattern architecture:
- CAS Number Recognition (-m cas)
  - Regex pattern: \b(\d{2,7})[\s\-._]?(\d{2})[\s\-._]?(\d)\b
  - Supports formats: 123-45-6, 123 45 6, 123456, 123.45.6
  - Auto-normalizes to standard XXX-XX-X format
  - Source column: CAS号
- Exact Matching (-m exact)
  - Uses Aho-Corasick automaton (if pyahocorasick is installed) or regex with word boundaries
  - Source columns: 中文名, 英文名, CAS号, 简称, 可能名称
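The CAS mode described above can be sketched as follows. This is a minimal illustration built around the regex quoted in this file, not the code in keyword_matcher.py; the function name extract_cas_numbers is hypothetical.

```python
import re

# Pattern copied from the CAS Number Recognition description above.
CAS_PATTERN = re.compile(r"\b(\d{2,7})[\s\-._]?(\d{2})[\s\-._]?(\d)\b")


def extract_cas_numbers(text: str) -> list[str]:
    """Return CAS-like candidates in text, normalized to XXX-XX-X.

    Hypothetical helper: duplicates are dropped and first-seen order is kept.
    """
    found: list[str] = []
    for match in CAS_PATTERN.finditer(text):
        # The three capture groups are the three CAS segments.
        normalized = "-".join(match.groups())
        if normalized not in found:
            found.append(normalized)
    return found


sample = "CAS 50-00-0 and 123.45.6 or 123456"
candidates = extract_cas_numbers(sample)
```

Because the separators are optional, any bare run of five to ten digits (an order number, for example) also matches this pattern, so results are best treated as candidates to check against the CAS号 column.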
Multi-column text matching:
- Automatically detects and combines the detected_text and 文本 columns
- Use -c col1 col2 to specify custom columns
Class Hierarchy:
KeywordMatcher (ABC)
├── CASRegexMatcher # Regex CAS extraction + normalization
├── RegexExactMatcher # Word-boundary exact matching
├── AhoCorasickMatcher # Fast multi-pattern matching
└── SetMatcher # Simple substring matching
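To make the hierarchy concrete, here is a rough sketch of how an abstract base class and two of the strategies could fit together. Only the class names come from this file; the constructor, the match() method, and the case-insensitive comparison are assumptions made for illustration.

```python
import re
from abc import ABC, abstractmethod


class KeywordMatcher(ABC):
    """Common interface; each strategy implements match() (assumed name)."""

    def __init__(self, keywords):
        self.keywords = list(keywords)

    @abstractmethod
    def match(self, text: str) -> list[str]:
        """Return the keywords found in text."""


class SetMatcher(KeywordMatcher):
    """Simple substring matching (case-insensitive here by assumption)."""

    def match(self, text: str) -> list[str]:
        lowered = text.lower()
        return [kw for kw in self.keywords if kw.lower() in lowered]


class RegexExactMatcher(KeywordMatcher):
    """Word-boundary exact matching using one compiled pattern per keyword."""

    def __init__(self, keywords):
        super().__init__(keywords)
        self._patterns = {
            kw: re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
            for kw in self.keywords
        }

    def match(self, text: str) -> list[str]:
        return [kw for kw, pattern in self._patterns.items() if pattern.search(text)]


substring_hits = SetMatcher(["ether"]).match("ethernet cable")
exact_hits = RegexExactMatcher(["ether"]).match("ethernet cable")
```

The two strategies disagree on this input: substring matching reports "ether" inside "ethernet", while word-boundary matching does not. In Python, \b treats CJK characters as word characters, so a Chinese keyword embedded in unspaced Chinese text will not match on word boundaries.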
verify_high_confidence.py
Compares keyword_matcher.py output with the original data to find high-confidence rows that were not matched, then sends them to an LLM for secondary verification.
- Uses VERIFY_ prefixed env vars (separate from image_batch_recognizer.py)
- Supports: OpenAI, DMX, Dify, Ollama, Mock modes
- Input columns: raw_response, 文本
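The comparison step amounts to an anti-join between the original rows and the matched rows. The sketch below shows the idea on plain dictionaries. It is heavily assumption-based: the join key (image_name), a numeric confidence column, and the 0.8 threshold are all hypothetical, and the script may define "high confidence" differently.

```python
def find_unmatched_high_confidence(original_rows, matched_rows,
                                   key="image_name", threshold=0.8):
    """Return original rows absent from matched_rows whose confidence >= threshold.

    key and threshold are illustrative assumptions, not values from the script.
    """
    matched_keys = {row[key] for row in matched_rows}
    return [
        row for row in original_rows
        if row[key] not in matched_keys
        and float(row.get("confidence", 0)) >= threshold
    ]


original = [
    {"image_name": "a.jpg", "confidence": 0.95},
    {"image_name": "b.jpg", "confidence": 0.40},
    {"image_name": "c.jpg", "confidence": 0.90},
]
matched = [{"image_name": "c.jpg"}]
to_verify = find_unmatched_high_confidence(original, matched)
```

Only the rows in to_verify would then be passed to the LLM backend selected by the VERIFY_ settings.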
collect_xlsx.py
Merges batch recognition results with original data:
- Matches by image filename (handles both Windows \ and Unix / paths)
- Adds original columns (文本, metadata) to recognition results
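One way to match on filename regardless of path style is to reduce every path to its final component before joining. The sketch below assumes both files carry an image_path column and that recognition values win when column names collide; the helper names are hypothetical.

```python
from pathlib import PureWindowsPath


def image_key(path: str) -> str:
    """Return the bare filename for a Windows- or Unix-style path."""
    # PureWindowsPath accepts both "\" and "/" as separators on any OS.
    return PureWindowsPath(path).name


def attach_original_columns(recognition_rows, original_rows,
                            path_column="image_path"):
    """Merge original columns into recognition rows, joined on filename."""
    originals = {image_key(row[path_column]): row for row in original_rows}
    merged = []
    for row in recognition_rows:
        extra = originals.get(image_key(row[path_column]), {})
        # Recognition values take precedence on shared column names (assumption).
        merged.append({**extra, **row})
    return merged


recognized = [{"image_path": "data/images/a.jpg", "detected_text": "hello"}]
source = [{"image_path": r"D:\imgs\a.jpg", "文本": "sample"}]
merged_rows = attach_original_columns(recognized, source)
```

Joining on the bare filename assumes filenames are unique across source folders; if they are not, the key would need to include the folder name as well.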
image_batch_recognizer.py
Batch image recognition with multiple API backends:
- Supports: OpenAI, Anthropic, DMX, Dify, Mock
- Outputs: detected_text, detected_objects, sensitive_items, summary, confidence
- Parallel processing with --max-workers
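As an illustration of what a --max-workers option typically controls, the sketch below fans a list of image paths out over a thread pool. Whether the script uses threads is not stated here, so treat this as an assumption; mock_recognize is a hypothetical stand-in that returns the documented output fields with empty values.

```python
from concurrent.futures import ThreadPoolExecutor


def mock_recognize(image_path: str) -> dict:
    """Hypothetical stand-in for an API call; returns the documented fields."""
    return {
        "image_path": image_path,
        "detected_text": "",
        "detected_objects": [],
        "sensitive_items": [],
        "summary": "mock result",
        "confidence": 0.0,
    }


def recognize_batch(image_paths, recognize=mock_recognize, max_workers=4):
    """Run recognize() over image_paths in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(recognize, image_paths))


results = recognize_batch(["a.jpg", "b.jpg", "c.jpg"], max_workers=2)
```

Threads suit this kind of workload because each call spends most of its time waiting on the network rather than on the CPU.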
Excel File Schemas
keywords.xlsx columns:
- 中文名 (Chinese name), 英文名 (English name), CAS号 (CAS number), 简称 (abbreviation), 备注 (remarks), 可能名称 (possible names)
- 可能名称 uses ||| separator for multiple values
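Reading a 可能名称 cell then comes down to splitting on the triple pipe. A minimal sketch follows; the function name split_aliases and the handling of empty or NaN cells are assumptions.

```python
def split_aliases(cell) -> list[str]:
    """Split a '|||'-separated cell into a clean list of names."""
    # Empty Excel cells usually arrive as None or float NaN (NaN != NaN).
    if cell is None or cell != cell:
        return []
    return [part.strip() for part in str(cell).split("|||") if part.strip()]


aliases = split_aliases("acetone|||propanone ||| 2-propanone|||")
```

Here aliases holds the three names with surrounding whitespace removed, and the empty trailing segment is dropped.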
Recognition output columns:
- image_name, image_path, detected_text, detected_objects
- sensitive_items, summary, confidence, raw_response
Matched output adds:
- 匹配到的关键词 (matched keywords, | separated)
- 匹配模式 (match mode, e.g., "CAS号识别 + 精确匹配", meaning "CAS number recognition + exact matching")
Key Conventions
- Triple pipe ||| separator in keyword cells (avoids conflicts with chemical names)
- Match result separator: |
- All scripts use relative paths from the scripts/ directory
- Configuration priority: command-line args > VERIFY_* env > general env > defaults
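The configuration priority can be read as a first-non-empty lookup. The sketch below is one possible reading, not the scripts' actual logic: resolve_setting is a hypothetical helper, and treating an empty string as unset is an assumption.

```python
import os


def resolve_setting(cli_value, verify_name, general_name, default=None, env=None):
    """Pick the first set value: CLI arg > VERIFY_* env > general env > default."""
    env = os.environ if env is None else env
    candidates = (cli_value, env.get(verify_name), env.get(general_name))
    for value in candidates:
        # None and "" both count as "not set" in this sketch.
        if value:
            return value
    return default


fake_env = {"VERIFY_MODEL": "gpt-4o-mini", "DMX_MODEL": "general-model"}
model = resolve_setting(None, "VERIFY_MODEL", "DMX_MODEL", "fallback", env=fake_env)
```

Passing the environment as a parameter keeps the lookup easy to test without touching the real process environment.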