# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a drug risk monitoring and data processing system for detecting controlled substances in text and image data from e-commerce platforms, darknet sources, and social media.

**Core Capabilities:**

1. **CAS Number Matching**: Extract and match chemical CAS numbers from text using regex patterns (supports multiple formats)
2. **Keyword Matching**: High-performance multi-mode keyword matching (fuzzy, CAS)
3. **Keyword Expansion**: LLM-powered expansion of chemical/drug names to include variants, abbreviations, and aliases

## Running Scripts

All scripts must be run from the `scripts/` directory:

```bash
cd scripts/

# Quick start (recommended for testing)
python3 quick_start.py

# CAS number matching
python3 match_cas_numbers.py

# Multi-mode keyword matching (default: both modes)
python3 keyword_matcher.py

# Single mode matching
python3 keyword_matcher.py -m cas                    # CAS number only
python3 keyword_matcher.py -m fuzzy --threshold 90   # Fuzzy matching only

# Use larger keyword database
python3 keyword_matcher.py -k ../data/input/keyword_all.xlsx

# Keyword expansion (mock mode, no API)
python3 expand_keywords_with_llm.py -m

# Keyword expansion (with OpenAI API)
export OPENAI_API_KEY="sk-..."
python3 expand_keywords_with_llm.py ../data/input/keywords.xlsx
```

## Dependencies

**Required:**

```bash
pip install pandas openpyxl
```

**Optional (for fuzzy keyword matching):**

```bash
pip install rapidfuzz
```

**Optional (for LLM keyword expansion):**

```bash
pip install openai anthropic
```

## Data Flow Architecture

All scripts use relative paths from `scripts/` directory:

```
Input:  ../data/input/
  clickin_text_img.xlsx (2779 rows: text + image paths)
  keywords.xlsx (22 rows, basic keyword list)
  keyword_all.xlsx (1659 rows, 1308 unique CAS numbers)

Output: ../data/output/
  keyword_matched_results.xlsx (multi-mode merged results)
  cas_matched_results_final.xlsx
  test_keywords_expanded_rows.xlsx

Images: ../data/images/ (1955 JPG files, 84MB)
```

**Processing Pipeline:**

```
Raw data collection -> Text extraction (OCR/LLM) -> Feature matching (CAS/keywords) -> Data cleaning -> Risk determination
```

## Key Technical Details

### 1. CAS Number Matching (`match_cas_numbers.py`)

- Supports multiple formats: `123-45-6`, `123 45 6`, `123 - 45 - 6`
- Auto-normalizes to standard format `XXX-XX-X`
- Uses regex pattern: `\b\d{2,7}[\s\-]+\d{2}[\s\-]+\d\b`
- Dual-mode: `"regex"` for CAS matching, `"keywords"` for keyword matching

### 2. Keyword Matching (`keyword_matcher.py`) - REFACTORED

**Architecture:**

- Strategy Pattern with `KeywordMatcher` base class
- Concrete matchers: `CASRegexMatcher`, `FuzzyMatcher`
- Factory Pattern for matcher creation
- Dataclass-based result handling

**Two Detection Modes:**

1. **CAS Number Recognition (CAS号识别)**
   - Uses `CASRegexMatcher` with comprehensive regex pattern
   - Supports formats: `123-45-6`, `123 45 6`, `12345 6`, `123456`, `123.45.6`, `123_45_6`
   - Auto-normalizes all formats to standard `XXX-XX-X`
   - Regex: `\b(\d{2,7})[\s\-._]?(\d{2})[\s\-._]?(\d)\b`
   - Extracts CAS from text, normalizes, compares with keyword database
   - Source columns: `CAS号`

2. **Fuzzy Matching (模糊匹配)**
   - Uses `FuzzyMatcher` with RapidFuzz library
   - Default threshold: 85 (configurable via `--threshold`)
   - Scoring function: `partial_ratio`
   - Source columns: `中文名`, `英文名`, `CAS号`, `简称`, `可能名称`
   - **Note**: Fuzzy matching covers all cases that exact matching would find, making exact mode redundant

**Multi-Mode Result Merging:**

- Automatically merges results from multiple modes
- Deduplicates by row index
- Combines matched keywords with ` | ` separator
- Adds `匹配模式` column showing which modes matched (e.g., "CAS号识别 + 模糊匹配")

**Command-Line Options:**

```bash
-k, --keywords      # Path to keywords file (default: ../data/input/keywords.xlsx)
-t, --text          # Path to text file (default: ../data/input/clickin_text_img.xlsx)
-o, --output        # Output file path (default: ../data/output/keyword_matched_results.xlsx)
-c, --text-column   # Column containing text to search (default: "文本")
-m, --modes         # Modes to run: cas, fuzzy (default: both)
--threshold         # Fuzzy matching threshold 0-100 (default: 85)
--separator         # Keyword separator in cells (default: "|||")
```

**Performance:**

- With keyword_all.xlsx (1308 CAS numbers):
  - CAS mode: 255 rows matched (9.18%)
  - Fuzzy mode: 513 rows matched (18.46%)
  - Merged (both modes): ~516 unique rows

**Uses `|||` separator:**

- Chemical names contain commas, hyphens, slashes, semicolons
- Triple pipe avoids conflicts with chemical nomenclature
- Example: `甲基苯丙胺|||冰毒|||Methamphetamine|||MA`

### 3. Keyword Expansion (`expand_keywords_with_llm.py`)

- Expands Chinese names, English names, abbreviations
- Supports OpenAI and Anthropic APIs
- Mock mode available for testing without API costs
- Output formats: compact (single row with `|||` separators) or expanded (one name per row)

## Configuration Patterns

Scripts use command-line arguments (keyword_matcher.py) or in-file configuration blocks:

```python
# ========== Configuration ==========
keywords_file = "../data/input/keywords.xlsx"
text_file = "../data/input/clickin_text_img.xlsx"
keywords_column = "中文名"
text_column = "文本"
separator = "|||"
output_file = "../data/output/results.xlsx"
# =============================
```

## Excel File Schemas

**Input - clickin_text_img.xlsx:**

- Columns: `文本` (text), image paths, metadata
- 2779 rows of scraped e-commerce/social media data

**Input - keywords.xlsx:**

- Columns: `中文名`, `英文名`, `CAS号`, `简称`, `备注`, `可能名称`
- `可能名称` contains multiple keywords separated by `|||`
- 22 rows (small test dataset)

**Input - keyword_all.xlsx:**

- Same schema as keywords.xlsx
- 1659 rows with 1308 unique CAS numbers
- Production keyword database

**Output - Multi-mode matched (keyword_matched_results.xlsx):**

- Adds columns:
  - `匹配到的关键词` (matched keywords, separated by ` | `)
  - `匹配模式` (matching modes, e.g., "CAS号识别 + 模糊匹配")
- Preserves all original columns
- Deduplicated across all modes

**Output - CAS matched:**

- Adds column: `匹配到的CAS号` (matched CAS numbers)
- Preserves all original columns
- Typical match rate: ~9-11% (255-303/2779 rows)

## Common Modifications

**To change input/output paths:**

Use command-line arguments for `keyword_matcher.py`:

```bash
python3 keyword_matcher.py -k /path/to/keywords.xlsx -t /path/to/text.xlsx -o /path/to/output.xlsx
```

Or edit the configuration block in other scripts' `main()` function.

**To switch between CAS and keyword matching:**

In `match_cas_numbers.py`, change `match_mode = "regex"` to `match_mode = "keywords"`.

In `keyword_matcher.py`, use `-m` flag:

```bash
python3 keyword_matcher.py -m cas     # CAS only
python3 keyword_matcher.py -m fuzzy   # Fuzzy only
```

**To adjust fuzzy matching sensitivity:**

```bash
python3 keyword_matcher.py -m fuzzy --threshold 90   # Stricter (fewer matches)
python3 keyword_matcher.py -m fuzzy --threshold 70   # More lenient (more matches)
```

**To use different LLM APIs:**

```bash
# OpenAI (default)
python3 expand_keywords_with_llm.py input.xlsx

# Anthropic
python3 expand_keywords_with_llm.py input.xlsx -a anthropic
```

## Code Architecture Highlights

### keyword_matcher.py Design Patterns

1. **Strategy Pattern**: Different matching algorithms (`KeywordMatcher` subclasses)
2. **Template Method**: Common matching workflow in base class `match()` method
3. **Factory Pattern**: `create_matcher()` selects appropriate matcher
4. **Dependency Injection**: Optional dependency (rapidfuzz) handled gracefully

**Class Hierarchy:**

```
KeywordMatcher (ABC)
├── CASRegexMatcher   # Regex-based CAS number extraction
└── FuzzyMatcher      # RapidFuzz partial_ratio matching
```

**Data Flow:**

```
1. Load keywords -> load_keywords_for_mode()
2. Create matcher -> create_matcher()
3. Match text -> matcher.match()
   ├── _prepare() (build automaton, etc.)
   └── For each row:
       ├── _match_single_text()
       └── _format_matches()
4. Save results -> save_results()
5. If multiple modes -> merge_mode_results()
```

## Data Sensitivity

This codebase handles sensitive data related to controlled substances monitoring. The data includes:

- Chemical compound names (Chinese and English)
- CAS registry numbers
- Image data from suspected illegal substance trading platforms
- All data is for legitimate law enforcement/research purposes

Do not commit actual data files or API keys to version control.
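For quick reference, the CAS Number Recognition behaviour described under Key Technical Details (extract with the documented regex, then normalize to `XXX-XX-X`) can be sketched as below. This is a minimal illustration, not the code in `keyword_matcher.py`: the function name `extract_cas_numbers` and the order-preserving deduplication are assumptions made for the example; only the regex itself is taken from this file.

```python
import re

# Regex quoted from the CAS Number Recognition notes in this file.
CAS_PATTERN = re.compile(r"\b(\d{2,7})[\s\-._]?(\d{2})[\s\-._]?(\d)\b")


def extract_cas_numbers(text: str) -> list[str]:
    """Hypothetical helper: return normalized CAS numbers found in `text`.

    Every supported spelling (`537 46 2`, `537.46.2`, `537462`, ...) is
    rewritten to the standard hyphenated form. Results keep first-seen
    order and contain no duplicates.
    """
    found: list[str] = []
    for match in CAS_PATTERN.finditer(text):
        # The three capture groups are the three CAS segments.
        normalized = "-".join(match.groups())
        if normalized not in found:
            found.append(normalized)
    return found


if __name__ == "__main__":
    print(extract_cas_numbers("batch 537 46 2, also listed as 537.46.2"))
```

One caveat worth checking against the real implementation: in Python 3, `\b` treats CJK characters as word characters, so a CAS number written directly after Chinese text with no space or punctuation (for example `号537-46-2`) is not matched by this pattern. Because the separators are optional, any bare run of 5 to 10 digits also matches, so comparing against the keyword database remains the step that filters false positives.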
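The Multi-Mode Result Merging rules listed earlier (deduplicate by row index, join keywords with ` | `, record the contributing modes in `匹配模式`) can be illustrated with plain dictionaries. This is a sketch under stated assumptions: the input shape (`mode label -> {row index -> matched keywords}`), the function name `merge_results_sketch`, and the `MergedRow` dataclass are hypothetical and are not claimed to match the project's `merge_mode_results()`.

```python
from dataclasses import dataclass, field


@dataclass
class MergedRow:
    """Hypothetical container for one merged row."""
    keywords: list[str] = field(default_factory=list)
    modes: list[str] = field(default_factory=list)


def merge_results_sketch(
    results_by_mode: dict[str, dict[int, list[str]]],
) -> dict[int, dict[str, str]]:
    """Merge per-mode matches into one entry per row index.

    Assumed input: {mode label: {row index: [matched keywords]}}.
    """
    merged: dict[int, MergedRow] = {}
    for mode_label, rows in results_by_mode.items():
        for row_index, keywords in rows.items():
            entry = merged.setdefault(row_index, MergedRow())
            entry.modes.append(mode_label)
            for keyword in keywords:
                # Keep each keyword once per row, in first-seen order.
                if keyword not in entry.keywords:
                    entry.keywords.append(keyword)
    # Column names follow the output schema documented in this file.
    return {
        index: {
            "匹配到的关键词": " | ".join(entry.keywords),
            "匹配模式": " + ".join(entry.modes),
        }
        for index, entry in sorted(merged.items())
    }


if __name__ == "__main__":
    demo = {
        "CAS号识别": {0: ["537-46-2"]},
        "模糊匹配": {0: ["冰毒"], 5: ["MA"]},
    }
    for index, columns in merge_results_sketch(demo).items():
        print(index, columns)
```

Keying on the row index is what makes the merged count smaller than the sum of the per-mode counts: a row matched by both modes appears once, with both labels in `匹配模式`.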