init: first upload

2026-01-04 09:07:25 +08:00
commit 29f6e25f70
9 changed files with 2598 additions and 0 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -0,0 +1,269 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+This is a drug risk monitoring and data processing system for detecting controlled substances in text and image data from e-commerce platforms, darknet sources, and social media.
+
+**Core Capabilities:**
+1. **CAS Number Matching**: Extract and match chemical CAS numbers from text using regex patterns (supports multiple formats)
+2. **Keyword Matching**: High-performance multi-mode keyword matching (fuzzy, CAS)
+3. **Keyword Expansion**: LLM-powered expansion of chemical/drug names to include variants, abbreviations, and aliases
+
+## Running Scripts
+
+All scripts must be run from the `scripts/` directory:
+
+```bash
+cd scripts/
+
+# Quick start (recommended for testing)
+python3 quick_start.py
+
+# CAS number matching
+python3 match_cas_numbers.py
+
+# Multi-mode keyword matching (default: both modes)
+python3 keyword_matcher.py
+
+# Single mode matching
+python3 keyword_matcher.py -m cas                    # CAS number only
+python3 keyword_matcher.py -m fuzzy --threshold 90   # Fuzzy matching only
+
+# Use larger keyword database
+python3 keyword_matcher.py -k ../data/input/keyword_all.xlsx
+
+# Keyword expansion (mock mode, no API)
+python3 expand_keywords_with_llm.py -m
+
+# Keyword expansion (with OpenAI API)
+export OPENAI_API_KEY="sk-..."
+python3 expand_keywords_with_llm.py ../data/input/keywords.xlsx
+```
+
+## Dependencies
+
+**Required:**
+```bash
+pip install pandas openpyxl
+```
+
+**Optional (for fuzzy keyword matching):**
+```bash
+pip install rapidfuzz
+```
+
+**Optional (for LLM keyword expansion):**
+```bash
+pip install openai anthropic
+```
+
+## Data Flow Architecture
+
+All scripts resolve paths relative to the `scripts/` directory:
+
+```
+Input:  ../data/input/
+     clickin_text_img.xlsx  (2779 rows: text + image paths)
+     keywords.xlsx           (22 rows, basic keyword list)
+     keyword_all.xlsx        (1659 rows, 1308 unique CAS numbers)
+
+Output: ../data/output/
+     keyword_matched_results.xlsx  (multi-mode merged results)
+     cas_matched_results_final.xlsx
+     test_keywords_expanded_rows.xlsx
+
+Images: ../data/images/  (1955 JPG files, 84MB)
+```
+
+**Processing Pipeline:**
+```
+Raw data collection -> Text extraction (OCR/LLM) ->
+Feature matching (CAS/keywords) -> Data cleaning ->
+Risk determination
+```
+
+## Key Technical Details
+
+### 1. CAS Number Matching (`match_cas_numbers.py`)
+- Supports multiple formats: `123-45-6`, `123 45 6`, `123 - 45 - 6`
+- Auto-normalizes to the standard hyphenated form (e.g. `123-45-6`)
+- Uses regex pattern: `\b\d{2,7}[\s\-]+\d{2}[\s\-]+\d\b`
+- Two modes, selected via `match_mode`: `"regex"` for CAS matching, `"keywords"` for keyword matching
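The extraction and normalization steps listed above can be sketched as follows. This is a minimal illustration built around the regex quoted in this section; the helper name `extract_cas_numbers` is hypothetical and may not match what `match_cas_numbers.py` actually defines.

```python
import re

# Pattern quoted from the section above: three digit groups separated by
# one or more spaces or hyphens.
CAS_PATTERN = re.compile(r"\b\d{2,7}[\s\-]+\d{2}[\s\-]+\d\b")


def extract_cas_numbers(text: str) -> list:
    """Return CAS-like numbers found in text, normalized to hyphenated form.

    Hypothetical helper, for illustration only.
    """
    results = []
    for raw in CAS_PATTERN.findall(text):
        # Collapse each run of spaces/hyphens into a single hyphen.
        normalized = re.sub(r"[\s\-]+", "-", raw)
        if normalized not in results:  # keep first-seen order, drop repeats
            results.append(normalized)
    return results


print(extract_cas_numbers("CAS 537-46-2, also written 537 46 2 or 537 - 46 - 2"))
```

All three spellings in the sample string normalize to the same value, so the list contains a single entry.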
+
+### 2. Keyword Matching (`keyword_matcher.py`) - REFACTORED
+
+**Architecture:**
+- Strategy Pattern with `KeywordMatcher` base class
+- Concrete matchers: `CASRegexMatcher`, `FuzzyMatcher`
+- Factory Pattern for matcher creation
+- Dataclass-based result handling
+
+**Two Detection Modes:**
+
+1. **CAS Number Recognition (CAS号识别)**
+   - Uses `CASRegexMatcher` with comprehensive regex pattern
+   - Supports formats: `123-45-6`, `123 45 6`, `12345 6`, `123456`, `123.45.6`, `123_45_6`
+   - Auto-normalizes all formats to standard `XXX-XX-X`
+   - Regex: `\b(\d{2,7})[\s\-._]?(\d{2})[\s\-._]?(\d)\b`
+   - Extracts CAS from text, normalizes, compares with keyword database
+   - Source columns: `CAS号`
+
+2. **Fuzzy Matching (模糊匹配)**
+   - Uses `FuzzyMatcher` with RapidFuzz library
+   - Default threshold: 85 (configurable via `--threshold`)
+   - Scoring function: `partial_ratio`
+   - Source columns: `中文名`, `英文名`, `CAS号`, `简称`, `可能名称`
+   - **Note**: An exact substring match scores 100 under `partial_ratio`, so fuzzy mode finds everything an exact mode would, making a separate exact mode redundant
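A minimal sketch of threshold-based fuzzy matching with `partial_ratio`, assuming the optional `rapidfuzz` package. The helper name `fuzzy_hits` is hypothetical, and the real `FuzzyMatcher` may be organized differently. The import guard shows one possible way to degrade gracefully when `rapidfuzz` is not installed.

```python
try:
    from rapidfuzz import fuzz  # optional dependency
except ImportError:
    fuzz = None


def fuzzy_hits(text, keywords, threshold=85):
    """Return the keywords whose partial_ratio against text meets the threshold.

    Hypothetical helper; returns None when rapidfuzz is unavailable.
    """
    if fuzz is None:
        return None
    return [kw for kw in keywords if fuzz.partial_ratio(kw, text) >= threshold]


hits = fuzzy_hits("selling methamphetamine, fast shipping",
                  ["methamphetamine", "fentanyl"])
print(hits)
```

Raising `threshold` trades recall for precision, which is what the `--threshold` option exposes.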
+
+**Multi-Mode Result Merging:**
+- Automatically merges results from multiple modes
+- Deduplicates by row index
+- Combines matched keywords with ` | ` separator
+- Adds `匹配模式` column showing which modes matched (e.g., "CAS号识别 + 模糊匹配")
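The merging rules above can be sketched with plain dictionaries. The function name `merge_mode_results` appears later in this file, but the body and the input shape used here (mode label mapped to `{row_index: [keywords]}`) are assumptions made for illustration, not the actual implementation.

```python
def merge_mode_results(results_by_mode):
    """Merge per-mode matches into one record per row index.

    results_by_mode maps a mode label to {row_index: [matched keywords]}.
    This input shape is an assumption of the sketch.
    """
    merged = {}
    for mode, rows in results_by_mode.items():
        for row_index, keywords in rows.items():
            entry = merged.setdefault(row_index, {"keywords": [], "modes": []})
            for kw in keywords:
                if kw not in entry["keywords"]:  # deduplicate keywords per row
                    entry["keywords"].append(kw)
            if mode not in entry["modes"]:
                entry["modes"].append(mode)
    return {
        row_index: {
            "匹配到的关键词": " | ".join(entry["keywords"]),
            "匹配模式": " + ".join(entry["modes"]),
        }
        for row_index, entry in sorted(merged.items())
    }


merged = merge_mode_results({
    "CAS号识别": {0: ["537-46-2"]},
    "模糊匹配": {0: ["冰毒"], 5: ["MA"]},
})
print(merged[0])
```

Row 0 is hit by both modes, so it appears once with both keywords and both mode labels; row 5 carries only the fuzzy match.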
+
+**Command-Line Options:**
+```bash
+-k, --keywords      # Path to keywords file (default: ../data/input/keywords.xlsx)
+-t, --text          # Path to text file (default: ../data/input/clickin_text_img.xlsx)
+-o, --output        # Output file path (default: ../data/output/keyword_matched_results.xlsx)
+-c, --text-column   # Column containing text to search (default: "文本")
+-m, --modes         # Modes to run: cas, fuzzy (default: both)
+--threshold         # Fuzzy matching threshold 0-100 (default: 85)
+--separator         # Keyword separator in cells (default: "|||")
+```
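For orientation, the options above could be declared with `argparse` roughly as follows. This is a sketch only: the flag names and defaults come from the list above, while `build_parser`, the use of `nargs="+"` for `--modes`, and the integer type for `--threshold` are assumptions about how the script parses them.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(description="Multi-mode keyword matching")
    parser.add_argument("-k", "--keywords", default="../data/input/keywords.xlsx")
    parser.add_argument("-t", "--text", default="../data/input/clickin_text_img.xlsx")
    parser.add_argument("-o", "--output",
                        default="../data/output/keyword_matched_results.xlsx")
    parser.add_argument("-c", "--text-column", default="文本")
    # Assumption: one or more mode names may follow -m.
    parser.add_argument("-m", "--modes", nargs="+", choices=["cas", "fuzzy"],
                        default=["cas", "fuzzy"])
    parser.add_argument("--threshold", type=int, default=85)
    parser.add_argument("--separator", default="|||")
    return parser


args = build_parser().parse_args(["-m", "fuzzy", "--threshold", "90"])
print(args.modes, args.threshold, args.text_column)
```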
+
+**Performance (match rates over the 2779 input rows):**
+- With `keyword_all.xlsx` (1308 unique CAS numbers):
+  - CAS mode: 255 rows matched (9.18%)
+  - Fuzzy mode: 513 rows matched (18.46%)
+  - Merged (both modes): ~516 unique rows
+
+**Why the `|||` separator:**
+- Chemical names contain commas, hyphens, slashes, semicolons
+- Triple pipe avoids conflicts with chemical nomenclature
+- Example: `甲基苯丙胺|||冰毒|||Methamphetamine|||MA`
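A short sketch of splitting a `|||`-separated cell into individual keywords. The helper name `split_keywords` is hypothetical; the NaN check is included on the assumption that cells arrive from pandas, where empty cells are read as NaN.

```python
def split_keywords(cell, separator="|||"):
    """Split one spreadsheet cell into individual keywords."""
    # Empty cells: None, or NaN (the only value that is not equal to itself).
    if cell is None or cell != cell:
        return []
    parts = (part.strip() for part in str(cell).split(separator))
    return [part for part in parts if part]


print(split_keywords("甲基苯丙胺|||冰毒|||Methamphetamine|||MA"))
print(split_keywords("N,N-Dimethyltryptamine|||DMT"))
```

The second call shows the point of the triple pipe: the comma and hyphen inside the chemical name survive the split intact.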
+
+### 3. Keyword Expansion (`expand_keywords_with_llm.py`)
+- Expands Chinese names, English names, abbreviations
+- Supports OpenAI and Anthropic APIs
+- Mock mode available for testing without API costs
+- Output formats: compact (single row with `|||` separators) or expanded (one name per row)
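The two output formats can be illustrated as follows. Both helper names are hypothetical, and placing the expanded names in the `可能名称` column is an assumption based on the keyword schema described in this file; the script's real output columns may differ.

```python
def to_compact(record, names, separator="|||"):
    """Compact format: one row, all names joined into a single cell."""
    return {**record, "可能名称": separator.join(names)}


def to_expanded(record, names):
    """Expanded format: one row per name."""
    return [{**record, "可能名称": name} for name in names]


record = {"中文名": "甲基苯丙胺", "英文名": "Methamphetamine"}
names = ["冰毒", "MA"]

print(to_compact(record, names))
print(to_expanded(record, names))
```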
+
+## Configuration Patterns
+
+Scripts take command-line arguments (`keyword_matcher.py`) or use in-file configuration blocks:
+
+```python
+# ========== Configuration ==========
+keywords_file = "../data/input/keywords.xlsx"
+text_file = "../data/input/clickin_text_img.xlsx"
+keywords_column = "中文名"
+text_column = "文本"
+separator = "|||"
+output_file = "../data/output/results.xlsx"
+# =============================
+```
+
+## Excel File Schemas
+
+**Input - clickin_text_img.xlsx:**
+- Columns: `文本` (text), image paths, metadata
+- 2779 rows of scraped e-commerce/social media data
+
+**Input - keywords.xlsx:**
+- Columns: `中文名` (Chinese name), `英文名` (English name), `CAS号` (CAS number), `简称` (abbreviation), `备注` (notes), `可能名称` (possible names)
+- `可能名称` contains multiple keywords separated by `|||`
+- 22 rows (small test dataset)
+
+**Input - keyword_all.xlsx:**
+- Same schema as keywords.xlsx
+- 1659 rows with 1308 unique CAS numbers
+- Production keyword database
+
+**Output - Multi-mode matched (keyword_matched_results.xlsx):**
+- Adds columns:
+  - `匹配到的关键词` (matched keywords, separated by ` | `)
+  - `匹配模式` (matching modes, e.g., "CAS号识别 + 模糊匹配")
+- Preserves all original columns
+- Deduplicated across all modes
+
+**Output - CAS matched:**
+- Adds column: `匹配到的CAS号` (matched CAS numbers)
+- Preserves all original columns
+- Typical match rate: ~9-11% (255-303/2779 rows)
+
+## Common Modifications
+
+**To change input/output paths:**
+Use command-line arguments for `keyword_matcher.py`:
+```bash
+python3 keyword_matcher.py -k /path/to/keywords.xlsx -t /path/to/text.xlsx -o /path/to/output.xlsx
+```
+
+For the other scripts, edit the configuration block in each script's `main()` function.
+
+**To switch between CAS and keyword matching:**
+In `match_cas_numbers.py`, change `match_mode = "regex"` to `match_mode = "keywords"`.
+
+In `keyword_matcher.py`, use `-m` flag:
+```bash
+python3 keyword_matcher.py -m cas        # CAS only
+python3 keyword_matcher.py -m fuzzy      # Fuzzy only
+```
+
+**To adjust fuzzy matching sensitivity:**
+```bash
+python3 keyword_matcher.py -m fuzzy --threshold 90  # Stricter (fewer matches)
+python3 keyword_matcher.py -m fuzzy --threshold 70  # More lenient (more matches)
+```
+
+**To use different LLM APIs:**
+```bash
+# OpenAI (default)
+python3 expand_keywords_with_llm.py input.xlsx
+
+# Anthropic
+python3 expand_keywords_with_llm.py input.xlsx -a anthropic
+```
+
+## Code Architecture Highlights
+
+### keyword_matcher.py Design Patterns
+
+1. **Strategy Pattern**: Different matching algorithms (`KeywordMatcher` subclasses)
+2. **Template Method**: Common matching workflow in base class `match()` method
+3. **Factory Pattern**: `create_matcher()` selects appropriate matcher
+4. **Optional Dependency Handling**: `rapidfuzz` is optional; its absence is handled gracefully
+
+**Class Hierarchy:**
+```
+KeywordMatcher (ABC)
+├── CASRegexMatcher          # Regex-based CAS number extraction
+└── FuzzyMatcher             # RapidFuzz partial_ratio matching
+```
+
+**Data Flow:**
+```
+1. Load keywords -> load_keywords_for_mode()
+2. Create matcher -> create_matcher()
+3. Match text -> matcher.match()
+   ├── _prepare() (build automaton, etc.)
+   └── For each row:
+       ├── _match_single_text()
+       └── _format_matches()
+4. Save results -> save_results()
+5. If multiple modes -> merge_mode_results()
+```
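The class hierarchy and data flow above can be sketched as a small skeleton. The class and method names are taken from this file and the CAS pattern is the one quoted in the keyword matching section, but every method body, the `MatchResult` dataclass, and the constructor signature are assumptions made for illustration.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class MatchResult:
    """Hypothetical result container; the real dataclass may differ."""
    row_index: int
    matches: list = field(default_factory=list)


class KeywordMatcher(ABC):
    def __init__(self, keywords):
        self.keywords = set(keywords)

    def match(self, texts):
        """Template method: shared workflow, subclass-specific steps."""
        self._prepare()
        results = []
        for row_index, text in enumerate(texts):
            found = self._match_single_text(text)
            if found:
                results.append(MatchResult(row_index, self._format_matches(found)))
        return results

    def _prepare(self):
        """Optional one-time setup before matching (no-op by default)."""

    @abstractmethod
    def _match_single_text(self, text):
        ...

    def _format_matches(self, found):
        return sorted(set(found))


class CASRegexMatcher(KeywordMatcher):
    PATTERN = re.compile(r"\b(\d{2,7})[\s\-._]?(\d{2})[\s\-._]?(\d)\b")

    def _match_single_text(self, text):
        # Rebuild each candidate from its three captured groups, then keep
        # only those present in the keyword database.
        candidates = ("-".join(m.groups()) for m in self.PATTERN.finditer(text))
        return [cas for cas in candidates if cas in self.keywords]


def create_matcher(mode, keywords):
    """Factory: map a mode name to a matcher class (only "cas" sketched here)."""
    matchers = {"cas": CASRegexMatcher}
    if mode not in matchers:
        raise ValueError(f"unknown mode: {mode}")
    return matchers[mode](keywords)


matcher = create_matcher("cas", ["537-46-2"])
results = matcher.match(["no match here", "CAS 537462 in stock", "cas: 537.46.2"])
print(results)
```

Comparing candidates against the keyword set matters with this pattern: because the separators are optional, it also matches any plain run of 5 to 10 digits.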
+
+## Data Sensitivity
+
+This codebase handles sensitive data related to controlled substances monitoring. The data includes:
+- Chemical compound names (Chinese and English)
+- CAS registry numbers
+- Image data from suspected illegal substance trading platforms
+- All data is for legitimate law enforcement/research purposes
+
+Do not commit actual data files or API keys to version control.
+- to memorize