fix: update keywords_match
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
This is a drug risk monitoring and data processing system for detecting controlled substances in text and image data from e-commerce platforms, darknet sources, and social media.

**Core Capabilities:**

1. **Image Recognition**: Batch image analysis using LLM APIs (OpenAI, Anthropic, DMX, Dify) for OCR and risk detection
2. **Keyword Matching**: Multi-mode keyword matching with CAS number extraction and exact matching
3. **LLM Verification**: Secondary verification of high-confidence unmatched records using an LLM
4. **Data Collection**: Merge and consolidate results from batch processing

## Running Scripts

All scripts must be run from the `scripts/` directory:

```bash
cd scripts/

# Image batch recognition (mock mode for testing)
python3 image_batch_recognizer.py --mock --limit 5

# Image recognition with API
python3 image_batch_recognizer.py --api-type dify --limit 10

# Collect and merge xlsx files from batch output
python3 collect_xlsx.py

# Multi-mode keyword matching (default: cas + exact)
python3 keyword_matcher.py

# Single-mode matching
python3 keyword_matcher.py -m cas    # CAS number only
python3 keyword_matcher.py -m exact  # Exact matching only

# Use larger keyword database
python3 keyword_matcher.py -k ../data/input/keyword_all.xlsx

# Keyword expansion (mock mode, no API)
python3 expand_keywords_with_llm.py -m

# Keyword expansion (with OpenAI API)
export OPENAI_API_KEY="sk-..."
python3 expand_keywords_with_llm.py ../data/input/keywords.xlsx

# Verify high-confidence unmatched records
python3 verify_high_confidence.py -o original.xlsx -m matched.xlsx --mock
```

## Dependencies

**Required:**
```bash
pip install pandas openpyxl
```

**Optional:**
```bash
pip install pyahocorasick  # 5x faster exact matching
pip install requests       # Required for Dify API
pip install tqdm           # Progress bars
pip install openai         # For OpenAI-compatible APIs in verify script
```

## Environment Configuration

Copy `.env.example` to `.env` and configure API keys:

```bash
# Default API type (openai | dmx | dify | ollama)
LLM_API_TYPE="dify"

# DMX API (OpenAI compatible)
DMX_API_KEY="your-key"
DMX_BASE_URL="https://www.dmxapi.cn"
DMX_MODEL="gpt-4o-mini"

# Dify API (used by image_batch_recognizer.py)
DIFY_API_KEY="app-xxx"
DIFY_BASE_URL="https://your-dify-server:4433"
DIFY_USER_ID="default-user"

# Separate config for verify_high_confidence.py (VERIFY_ prefix)
VERIFY_API_TYPE="dmx"
VERIFY_API_KEY="your-key"
VERIFY_BASE_URL="https://api.example.com"
VERIFY_MODEL="gpt-4o-mini"
```
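A minimal sketch of how the `VERIFY_` prefix fallback might be resolved. The `resolve` helper and the bare `MODEL` variable names are illustrative assumptions, not the scripts' actual implementation:

```python
import os

def resolve(name, cli_value=None, default=None):
    """Hypothetical helper: CLI arguments beat VERIFY_* env vars,
    which beat general env vars, which beat defaults."""
    if cli_value is not None:
        return cli_value
    return os.environ.get(f"VERIFY_{name}", os.environ.get(name, default))

os.environ["MODEL"] = "gpt-4o"           # general setting
os.environ["VERIFY_MODEL"] = "gpt-4o-mini"  # verify-specific override

print(resolve("MODEL"))                      # VERIFY_ prefix wins: gpt-4o-mini
print(resolve("MODEL", cli_value="llama3"))  # CLI argument wins: llama3
```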

## Data Flow Architecture

All scripts use relative paths from the `scripts/` directory:

```
data/
├── input/                      # Source data
│   ├── clickin_text_img.xlsx   # Text + image paths
│   └── keywords.xlsx           # Keyword database
├── images/                     # Image files for recognition
├── batch_output/               # Per-folder recognition results
│   └── {name}/results.xlsx
├── data_all/                   # Original data by source
│   └── {name}_text_img.xlsx
├── collected_xlsx/             # Merged results (collect_xlsx.py output)
└── output/                     # Final processed results
```

**Processing Pipeline:**

```
1. image_batch_recognizer.py  → batch_output/{name}/results.xlsx
2. collect_xlsx.py            → merge results.xlsx with {name}_text_img.xlsx → collected_xlsx/
3. keyword_matcher.py         → match keywords in text → output/keyword_matched_results.xlsx
4. verify_high_confidence.py  → LLM-verify unmatched high-confidence rows → *_llm_verified.xlsx
```

## Key Scripts

### keyword_matcher.py

Two detection modes:

1. **CAS Number Recognition (`-m cas`)**
   - Regex pattern: `\b(\d{2,7})[\s\-._]?(\d{2})[\s\-._]?(\d)\b`
   - Supported formats: `123-45-6`, `123 45 6`, `123456`, `123.45.6`
   - Auto-normalizes to the standard `XXX-XX-X` format
   - Source column: `CAS号`

2. **Exact Matching (`-m exact`)**
   - Uses an Aho-Corasick automaton (if pyahocorasick is installed) or a regex with word boundaries
   - Source columns: `中文名`, `英文名`, `CAS号`, `简称`, `可能名称`
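The CAS extraction above can be sketched directly from the documented pattern. This is a minimal sketch; `extract_cas` is an illustrative name, not the script's actual function:

```python
import re

# Pattern from this file: 2-7 digits, 2 digits, and a final check digit,
# separated by an optional space, dash, dot, or underscore.
CAS_PATTERN = re.compile(r"\b(\d{2,7})[\s\-._]?(\d{2})[\s\-._]?(\d)\b")

def extract_cas(text):
    """Extract CAS-like numbers and normalize them to XXX-XX-X."""
    return [f"{a}-{b}-{c}" for a, b, c in CAS_PATTERN.findall(text)]

print(extract_cas("Methamphetamine CAS 537-46-2, also written 537 46 2"))
# ['537-46-2', '537-46-2']
```

Because each component is captured separately, normalization is just re-joining the groups with dashes, so `537462` and `537.46.2` normalize to the same key as `537-46-2`.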
**Multi-column text matching:**
- Automatically detects and combines the `detected_text` and `文本` columns
- Use `-c col1 col2` to specify custom columns

**Class Hierarchy:**

```
KeywordMatcher (ABC)
├── CASRegexMatcher      # Regex CAS extraction + normalization
├── RegexExactMatcher    # Word-boundary exact matching
├── AhoCorasickMatcher   # Fast multi-pattern matching
└── SetMatcher           # Simple substring matching
```
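A minimal sketch of the Strategy + Factory structure behind this hierarchy. The class and method names follow the tree above, but the bodies and the `"set"` registry entry are simplified stand-ins for the real matchers:

```python
from abc import ABC, abstractmethod

class KeywordMatcher(ABC):
    def __init__(self, keywords):
        self.keywords = keywords

    def match(self, texts):
        # Template method: prepare once, then match each row.
        self._prepare()
        return [self._match_single_text(t) for t in texts]

    def _prepare(self):
        pass  # e.g. build an automaton in AhoCorasickMatcher

    @abstractmethod
    def _match_single_text(self, text):
        ...

class SetMatcher(KeywordMatcher):
    def _match_single_text(self, text):
        # Simple substring matching, as in the hierarchy above.
        return sorted(kw for kw in self.keywords if kw in text)

def create_matcher(mode, keywords):
    registry = {"set": SetMatcher}  # the real factory maps cas/exact modes
    return registry[mode](keywords)

matcher = create_matcher("set", ["冰毒", "MA"])
print(matcher.match(["含冰毒的文本", "无关文本"]))  # [['冰毒'], []]
```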

**Data Flow:**

```
1. Load keywords  -> load_keywords_for_mode()
2. Create matcher -> create_matcher()
3. Match text     -> matcher.match()
   ├── _prepare() (build automaton, etc.)
   └── For each row:
       ├── _match_single_text()
       └── _format_matches()
4. Save results   -> save_results()
5. If multiple modes -> merge_mode_results()
```
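For exact matching, the `_prepare()` step is where the automaton gets built when pyahocorasick is available. A hedged sketch of that fallback logic; `build_exact_matcher` is an illustrative helper, the real `AhoCorasickMatcher`/`RegexExactMatcher` classes differ, and word-boundary handling is omitted here:

```python
import re

try:
    import ahocorasick  # optional dependency: pip install pyahocorasick
except ImportError:
    ahocorasick = None

def build_exact_matcher(keywords):
    """Return a callable text -> sorted list of matched keywords.
    Prefers an Aho-Corasick automaton; falls back to a regex alternation."""
    if ahocorasick is not None:
        automaton = ahocorasick.Automaton()
        for kw in keywords:
            automaton.add_word(kw, kw)
        automaton.make_automaton()  # one pass over the text finds all keywords
        return lambda text: sorted({kw for _, kw in automaton.iter(text)})
    # Fallback: longest-first alternation so overlapping keywords prefer
    # the longer match (word boundaries omitted for simplicity).
    pattern = re.compile(
        "|".join(re.escape(kw) for kw in sorted(keywords, key=len, reverse=True))
    )
    return lambda text: sorted(set(pattern.findall(text)))

match = build_exact_matcher(["冰毒", "Methamphetamine", "MA"])
print(match("出售 Methamphetamine（冰毒）"))  # ['Methamphetamine', '冰毒']
```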

### verify_high_confidence.py

Compares keyword_matcher output with the original data to find high-confidence rows that weren't matched, then uses an LLM for secondary verification.

- Uses `VERIFY_`-prefixed env vars (separate from image_batch_recognizer.py)
- Supports: OpenAI, DMX, Dify, Ollama, and Mock modes
- Input columns: `raw_response`, `文本`

### collect_xlsx.py

Merges batch recognition results with the original data:
- Matches by image filename (handles both Windows `\` and Unix `/` paths)
- Adds original columns (`文本`, metadata) to the recognition results

### image_batch_recognizer.py

Batch image recognition with multiple API backends:
- Supports: OpenAI, Anthropic, DMX, Dify, and Mock
- Outputs: `detected_text`, `detected_objects`, `sensitive_items`, `summary`, `confidence`
- Parallel processing with `--max-workers`
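The filename-keyed merge in collect_xlsx.py might look roughly like this. A sketch with made-up sample data; only the column names come from the schemas in this file:

```python
import pandas as pd

def basename(path: str) -> str:
    # Handle both Windows "\" and Unix "/" separators.
    return path.replace("\\", "/").rsplit("/", 1)[-1]

# Recognition results keyed by image_name.
results = pd.DataFrame({
    "image_name": ["a.jpg", "b.jpg"],
    "detected_text": ["text a", "text b"],
})
# Original data keyed by full image_path (mixed path styles).
original = pd.DataFrame({
    "image_path": [r"C:\imgs\a.jpg", "imgs/b.jpg"],
    "文本": ["原文A", "原文B"],
})
original["image_name"] = original["image_path"].map(basename)

merged = results.merge(original[["image_name", "文本"]],
                       on="image_name", how="left")
print(merged["文本"].tolist())  # ['原文A', '原文B']
```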

## Excel File Schemas

**keywords.xlsx columns:**
- `中文名`, `英文名`, `CAS号`, `简称`, `备注`, `可能名称`
- `可能名称` uses the `|||` separator for multiple values

**Recognition output columns:**
- `image_name`, `image_path`, `detected_text`, `detected_objects`
- `sensitive_items`, `summary`, `confidence`, `raw_response`

**Matched output adds:**
- `匹配到的关键词` (matched keywords, ` | `-separated)
- `匹配模式` (e.g., "CAS号识别 + 精确匹配")
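How the ` | ` keyword separator and `匹配模式` column might be assembled when several modes hit the same row. An illustrative sketch: `merge_mode_results` is a real function in keyword_matcher.py, but the signature and data shape used here are assumptions:

```python
def merge_mode_results(per_mode):
    """per_mode maps mode name -> {row index -> matched keywords}."""
    merged = {}
    for mode, rows in per_mode.items():
        for idx, kws in rows.items():
            entry = merged.setdefault(idx, {"keywords": [], "modes": []})
            # Deduplicate keywords across modes, preserving order.
            entry["keywords"].extend(k for k in kws if k not in entry["keywords"])
            entry["modes"].append(mode)
    return {
        idx: {
            "匹配到的关键词": " | ".join(e["keywords"]),
            "匹配模式": " + ".join(e["modes"]),
        }
        for idx, e in merged.items()
    }

out = merge_mode_results({
    "CAS号识别": {3: ["537-46-2"]},
    "精确匹配": {3: ["冰毒"], 7: ["MA"]},
})
print(out[3])
# {'匹配到的关键词': '537-46-2 | 冰毒', '匹配模式': 'CAS号识别 + 精确匹配'}
```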

## Key Conventions

- Triple-pipe `|||` separator in keyword cells (avoids conflicts with chemical names, which often contain commas, hyphens, and slashes)
- Match result separator: ` | `
- All scripts use relative paths from the `scripts/` directory
- Configuration priority: command-line args > `VERIFY_*` env > general env > defaults