fix: update keywords_match

2026-01-18 18:25:36 +08:00
parent 29f6e25f70
commit 4ed90734df
7 changed files with 1406 additions and 269 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -7,9 +7,10 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 This is a drug risk monitoring and data processing system for detecting controlled substances in text and image data from e-commerce platforms, darknet sources, and social media.

 **Core Capabilities:**
-1. **CAS Number Matching**: Extract and match chemical CAS numbers from text using regex patterns (supports multiple formats)
-2. **Keyword Matching**: High-performance multi-mode keyword matching (fuzzy, CAS)
-3. **Keyword Expansion**: LLM-powered expansion of chemical/drug names to include variants, abbreviations, and aliases
+1. **Image Recognition**: Batch image analysis using LLM APIs (OpenAI, Anthropic, DMX, Dify) for OCR and risk detection
+2. **Keyword Matching**: Multi-mode keyword matching with CAS number extraction and exact matching
+3. **LLM Verification**: Secondary verification of high-confidence unmatched records using an LLM
+4. **Data Collection**: Merge and consolidate results from batch processing

 ## Running Scripts

@@ -18,28 +19,24 @@ All scripts must be run from the `scripts/` directory:
 ```bash
 cd scripts/

-# Quick start (recommended for testing)
-python3 quick_start.py
+# Image batch recognition (mock mode for testing)
+python3 image_batch_recognizer.py --mock --limit 5

-# CAS number matching
-python3 match_cas_numbers.py
+# Image recognition with API
+python3 image_batch_recognizer.py --api-type dify --limit 10

-# Multi-mode keyword matching (default: both modes)
+# Collect and merge xlsx files from batch output
+python3 collect_xlsx.py
+
+# Multi-mode keyword matching (default: cas + exact)
 python3 keyword_matcher.py

 # Single mode matching
-python3 keyword_matcher.py -m cas                    # CAS number only
-python3 keyword_matcher.py -m fuzzy --threshold 90   # Fuzzy matching only
+python3 keyword_matcher.py -m cas      # CAS number only
+python3 keyword_matcher.py -m exact    # Exact matching only

-# Use larger keyword database
-python3 keyword_matcher.py -k ../data/input/keyword_all.xlsx
-
-# Keyword expansion (mock mode, no API)
-python3 expand_keywords_with_llm.py -m
-
-# Keyword expansion (with OpenAI API)
-export OPENAI_API_KEY="sk-..."
-python3 expand_keywords_with_llm.py ../data/input/keywords.xlsx
+# Verify high-confidence unmatched records
+python3 verify_high_confidence.py -o original.xlsx -m matched.xlsx --mock
 ```

 ## Dependencies
@@ -49,221 +46,130 @@ python3 expand_keywords_with_llm.py ../data/input/keywords.xlsx
 pip install pandas openpyxl
 ```

-**Optional (for fuzzy keyword matching):**
+**Optional:**
 ```bash
-pip install rapidfuzz
+pip install pyahocorasick  # 5x faster exact matching
+pip install requests       # Required for Dify API
+pip install tqdm           # Progress bars
+pip install openai         # For OpenAI-compatible APIs in verify script
 ```

-**Optional (for LLM keyword expansion):**
+## Environment Configuration
+
+Copy `.env.example` to `.env` and configure API keys:
+
 ```bash
-pip install openai anthropic
+# Default API type (openai | dmx | dify | ollama)
+LLM_API_TYPE="dify"
+
+# DMX API (OpenAI compatible)
+DMX_API_KEY="your-key"
+DMX_BASE_URL="https://www.dmxapi.cn"
+DMX_MODEL="gpt-4o-mini"
+
+# Dify API (used by image_batch_recognizer.py)
+DIFY_API_KEY="app-xxx"
+DIFY_BASE_URL="https://your-dify-server:4433"
+DIFY_USER_ID="default-user"
+
+# Separate config for verify_high_confidence.py (VERIFY_ prefix)
+VERIFY_API_TYPE="dmx"
+VERIFY_API_KEY="your-key"
+VERIFY_BASE_URL="https://api.example.com"
+VERIFY_MODEL="gpt-4o-mini"
 ```

 ## Data Flow Architecture

-All scripts use relative paths from `scripts/` directory:
-
 ```
-Input:  ../data/input/
-     clickin_text_img.xlsx  (2779 rows: text + image paths)
-     keywords.xlsx           (22 rows, basic keyword list)
-     keyword_all.xlsx        (1659 rows, 1308 unique CAS numbers)
-
-Output: ../data/output/
-     keyword_matched_results.xlsx  (multi-mode merged results)
-     cas_matched_results_final.xlsx
-     test_keywords_expanded_rows.xlsx
-
-Images: ../data/images/  (1955 JPG files, 84MB)
+data/
+├── input/                      # Source data
+│   ├── clickin_text_img.xlsx   # Text + image paths
+│   └── keywords.xlsx           # Keyword database
+├── images/                     # Image files for recognition
+├── batch_output/               # Per-folder recognition results
+│   └── {name}/results.xlsx
+├── data_all/                   # Original data by source
+│   └── {name}_text_img.xlsx
+├── collected_xlsx/             # Merged results (collect_xlsx.py output)
+└── output/                     # Final processed results
 ```

 **Processing Pipeline:**
 ```
-Raw data collection -> Text extraction (OCR/LLM) ->
-Feature matching (CAS/keywords) -> Data cleaning ->
-Risk determination
+1. image_batch_recognizer.py → batch_output/{name}/results.xlsx
+2. collect_xlsx.py → Merge results.xlsx with {name}_text_img.xlsx → collected_xlsx/
+3. keyword_matcher.py → Match keywords in text → output/keyword_matched_results.xlsx
+4. verify_high_confidence.py → LLM verify unmatched high-confidence → *_llm_verified.xlsx
 ```

-## Key Technical Details
+## Key Scripts

-### 1. CAS Number Matching (`match_cas_numbers.py`)
- Supports multiple formats: `123-45-6`, `123 45 6`, `123 - 45 - 6`
- Auto-normalizes to standard format `XXX-XX-X`
- Uses regex pattern: `\b\d{2,7}[\s\-]+\d{2}[\s\-]+\d\b`
- Dual-mode: `"regex"` for CAS matching, `"keywords"` for keyword matching
+### keyword_matcher.py

-### 2. Keyword Matching (`keyword_matcher.py`) - REFACTORED
+Two detection modes with Strategy Pattern architecture:

-**Architecture:**
- Strategy Pattern with `KeywordMatcher` base class
- Concrete matchers: `CASRegexMatcher`, `FuzzyMatcher`
- Factory Pattern for matcher creation
- Dataclass-based result handling
+1. **CAS Number Recognition (`-m cas`)**
+   - Regex pattern: `\b(\d{2,7})[\s\-._]?(\d{2})[\s\-._]?(\d)\b`
+   - Supports formats: `123-45-6`, `123 45 6`, `123456`, `123.45.6`
+   - Auto-normalizes to standard `XXX-XX-X` format
+   - Source column: `CAS号`

-**Two Detection Modes:**
-
-1. **CAS Number Recognition (CAS号识别)**
-   - Uses `CASRegexMatcher` with comprehensive regex pattern
-   - Supports formats: `123-45-6`, `123 45 6`, `12345 6`, `123456`, `123.45.6`, `123_45_6`
-   - Auto-normalizes all formats to standard `XXX-XX-X`
-   - Regex: `\b(\d{2,7})[\s\-._]?(\d{2})[\s\-._]?(\d)\b`
-   - Extracts CAS from text, normalizes, compares with keyword database
-   - Source columns: `CAS号`
-
-2. **Fuzzy Matching (模糊匹配)**
-   - Uses `FuzzyMatcher` with RapidFuzz library
-   - Default threshold: 85 (configurable via `--threshold`)
-   - Scoring function: `partial_ratio`
+2. **Exact Matching (`-m exact`)**
+   - Uses an Aho-Corasick automaton (if pyahocorasick is installed) or a regex with word boundaries
   - Source columns: `中文名`, `英文名`, `CAS号`, `简称`, `可能名称`
-   - **Note**: Fuzzy matching covers all cases that exact matching would find, making exact mode redundant

-**Multi-Mode Result Merging:**
- Automatically merges results from multiple modes
- Deduplicates by row index
- Combines matched keywords with ` | ` separator
- Adds `匹配模式` column showing which modes matched (e.g., "CAS号识别 + 模糊匹配")
-
-**Command-Line Options:**
-```bash
-k, --keywords      # Path to keywords file (default: ../data/input/keywords.xlsx)
-t, --text          # Path to text file (default: ../data/input/clickin_text_img.xlsx)
-o, --output        # Output file path (default: ../data/output/keyword_matched_results.xlsx)
-c, --text-column   # Column containing text to search (default: "文本")
-m, --modes         # Modes to run: cas, fuzzy (default: both)
--threshold         # Fuzzy matching threshold 0-100 (default: 85)
--separator         # Keyword separator in cells (default: "|||")
-```
-
-**Performance:**
- With keyword_all.xlsx (1308 CAS numbers):
-  - CAS mode: 255 rows matched (9.18%)
-  - Fuzzy mode: 513 rows matched (18.46%)
-  - Merged (both modes): ~516 unique rows
-
-**Uses `|||` separator:**
- Chemical names contain commas, hyphens, slashes, semicolons
- Triple pipe avoids conflicts with chemical nomenclature
- Example: `甲基苯丙胺|||冰毒|||Methamphetamine|||MA`
-
-### 3. Keyword Expansion (`expand_keywords_with_llm.py`)
- Expands Chinese names, English names, abbreviations
- Supports OpenAI and Anthropic APIs
- Mock mode available for testing without API costs
- Output formats: compact (single row with `|||` separators) or expanded (one name per row)
-
-## Configuration Patterns
-
-Scripts use command-line arguments (keyword_matcher.py) or in-file configuration blocks:
-
-```python
-# ========== Configuration ==========
-keywords_file = "../data/input/keywords.xlsx"
-text_file = "../data/input/clickin_text_img.xlsx"
-keywords_column = "中文名"
-text_column = "文本"
-separator = "|||"
-output_file = "../data/output/results.xlsx"
-# =============================
-```
-
-## Excel File Schemas
-
-**Input - clickin_text_img.xlsx:**
- Columns: `文本` (text), image paths, metadata
- 2779 rows of scraped e-commerce/social media data
-
-**Input - keywords.xlsx:**
- Columns: `中文名`, `英文名`, `CAS号`, `简称`, `备注`, `可能名称`
- `可能名称` contains multiple keywords separated by `|||`
- 22 rows (small test dataset)
-
-**Input - keyword_all.xlsx:**
- Same schema as keywords.xlsx
- 1659 rows with 1308 unique CAS numbers
- Production keyword database
-
-**Output - Multi-mode matched (keyword_matched_results.xlsx):**
- Adds columns:
-  - `匹配到的关键词` (matched keywords, separated by ` | `)
-  - `匹配模式` (matching modes, e.g., "CAS号识别 + 模糊匹配")
- Preserves all original columns
- Deduplicated across all modes
-
-**Output - CAS matched:**
- Adds column: `匹配到的CAS号` (matched CAS numbers)
- Preserves all original columns
- Typical match rate: ~9-11% (255-303/2779 rows)
-
-## Common Modifications
-
-**To change input/output paths:**
-Use command-line arguments for `keyword_matcher.py`:
-```bash
-python3 keyword_matcher.py -k /path/to/keywords.xlsx -t /path/to/text.xlsx -o /path/to/output.xlsx
-```
-
-Or edit the configuration block in other scripts' `main()` function.
-
-**To switch between CAS and keyword matching:**
-In `match_cas_numbers.py`, change `match_mode = "regex"` to `match_mode = "keywords"`.
-
-In `keyword_matcher.py`, use `-m` flag:
-```bash
-python3 keyword_matcher.py -m cas        # CAS only
-python3 keyword_matcher.py -m fuzzy      # Fuzzy only
-```
-
-**To adjust fuzzy matching sensitivity:**
-```bash
-python3 keyword_matcher.py -m fuzzy --threshold 90  # Stricter (fewer matches)
-python3 keyword_matcher.py -m fuzzy --threshold 70  # More lenient (more matches)
-```
-
-**To use different LLM APIs:**
-```bash
-# OpenAI (default)
-python3 expand_keywords_with_llm.py input.xlsx
-
-# Anthropic
-python3 expand_keywords_with_llm.py input.xlsx -a anthropic
-```
-
-## Code Architecture Highlights
-
-### keyword_matcher.py Design Patterns
-
-1. **Strategy Pattern**: Different matching algorithms (`KeywordMatcher` subclasses)
-2. **Template Method**: Common matching workflow in base class `match()` method
-3. **Factory Pattern**: `create_matcher()` selects appropriate matcher
-4. **Dependency Injection**: Optional dependency (rapidfuzz) handled gracefully
+**Multi-column text matching:**
+- Automatically detects and combines `detected_text` and `文本` columns
+- Use `-c col1 col2` to specify custom columns

 **Class Hierarchy:**
 ```
 KeywordMatcher (ABC)
-├── CASRegexMatcher          # Regex-based CAS number extraction
-└── FuzzyMatcher             # RapidFuzz partial_ratio matching
+├── CASRegexMatcher      # Regex CAS extraction + normalization
+├── RegexExactMatcher    # Word-boundary exact matching
+├── AhoCorasickMatcher   # Fast multi-pattern matching
+└── SetMatcher           # Simple substring matching
 ```

-**Data Flow:**
-```
-1. Load keywords -> load_keywords_for_mode()
-2. Create matcher -> create_matcher()
-3. Match text -> matcher.match()
-   ├── _prepare() (build automaton, etc.)
-   └── For each row:
-       ├── _match_single_text()
-       └── _format_matches()
-4. Save results -> save_results()
-5. If multiple modes -> merge_mode_results()
-```
+### verify_high_confidence.py

-## Data Sensitivity
+Compares keyword_matcher output with the original data to find high-confidence rows that weren't matched, then uses an LLM for secondary verification.

-This codebase handles sensitive data related to controlled substances monitoring. The data includes:
- Chemical compound names (Chinese and English)
- CAS registry numbers
- Image data from suspected illegal substance trading platforms
- All data is for legitimate law enforcement/research purposes
+- Uses `VERIFY_` prefixed env vars (separate from image_batch_recognizer.py)
+- Supports: OpenAI, DMX, Dify, Ollama, Mock modes
+- Input columns: `raw_response`, `文本`

-Do not commit actual data files or API keys to version control.
- to memorize
+### collect_xlsx.py
+
+Merges batch recognition results with original data:
+- Matches by image filename (handles both Windows `\` and Unix `/` paths)
+- Adds original columns (`文本`, metadata) to recognition results
+
+### image_batch_recognizer.py
+
+Batch image recognition with multiple API backends:
+- Supports: OpenAI, Anthropic, DMX, Dify, Mock
+- Outputs: `detected_text`, `detected_objects`, `sensitive_items`, `summary`, `confidence`
+- Parallel processing with `--max-workers`
+
+## Excel File Schemas
+
+**keywords.xlsx columns:**
+- `中文名`, `英文名`, `CAS号`, `简称`, `备注`, `可能名称`
+- `可能名称` uses `|||` separator for multiple values
+
+**Recognition output columns:**
+- `image_name`, `image_path`, `detected_text`, `detected_objects`
+- `sensitive_items`, `summary`, `confidence`, `raw_response`
+
+**Matched output adds:**
+- `匹配到的关键词` (matched keywords, ` | ` separated)
+- `匹配模式` (e.g., "CAS号识别 + 精确匹配")
+
+## Key Conventions
+
+- Triple pipe `|||` separator in keyword cells (avoids conflicts with chemical names)
+- Match result separator: ` | `
+- All scripts use relative paths from `scripts/` directory
+- Configuration priority: command-line args > VERIFY_* env > general env > defaults
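
The configuration priority in the last convention (command-line args > VERIFY_* env > general env > defaults) can be read as a first-non-empty lookup. Below is a minimal sketch under stated assumptions: `resolve_setting` is a hypothetical helper, not a function from the committed scripts, and the `"mock"` fallback is only an illustrative default. The env var names `VERIFY_API_TYPE` and `LLM_API_TYPE` are taken from the `.env` example in the diff. How the real scripts resolve settings is not shown in this commit.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    cli_value: Optional[str],
    verify_key: str,
    general_key: str,
    default: str,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical helper: return the first non-empty value in priority order.

    Order: command-line value > VERIFY_* env var > general env var > default.
    """
    env = os.environ if env is None else env
    candidates = (cli_value, env.get(verify_key), env.get(general_key))
    for value in candidates:
        # Assumption: both None and "" count as "not set".
        if value:
            return value
    return default


# Example with an explicit mapping instead of the real environment.
fake_env = {"VERIFY_API_TYPE": "dmx", "LLM_API_TYPE": "dify"}
print(resolve_setting(None, "VERIFY_API_TYPE", "LLM_API_TYPE", "mock", fake_env))
print(resolve_setting("ollama", "VERIFY_API_TYPE", "LLM_API_TYPE", "mock", fake_env))
```

Passing the environment as a mapping keeps the lookup easy to test without touching `os.environ`. Whether an empty string should fall through to the next level is a design choice that would need to be checked against the actual scripts.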