CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
This is a drug risk monitoring and data processing system for detecting controlled substances in text and image data from e-commerce platforms, darknet sources, and social media.
Core Capabilities:
- CAS Number Matching: Extract and match chemical CAS numbers from text using regex patterns (supports multiple formats)
- Keyword Matching: High-performance multi-mode keyword matching (fuzzy, CAS)
- Keyword Expansion: LLM-powered expansion of chemical/drug names to include variants, abbreviations, and aliases
Running Scripts
All scripts must be run from the scripts/ directory:
cd scripts/
# Quick start (recommended for testing)
python3 quick_start.py
# CAS number matching
python3 match_cas_numbers.py
# Multi-mode keyword matching (default: both modes)
python3 keyword_matcher.py
# Single mode matching
python3 keyword_matcher.py -m cas # CAS number only
python3 keyword_matcher.py -m fuzzy --threshold 90 # Fuzzy matching only
# Use larger keyword database
python3 keyword_matcher.py -k ../data/input/keyword_all.xlsx
# Keyword expansion (mock mode, no API)
python3 expand_keywords_with_llm.py -m
# Keyword expansion (with OpenAI API)
export OPENAI_API_KEY="sk-..."
python3 expand_keywords_with_llm.py ../data/input/keywords.xlsx
Dependencies
Required:
pip install pandas openpyxl
Optional (for fuzzy keyword matching):
pip install rapidfuzz
Optional (for LLM keyword expansion):
pip install openai anthropic
Data Flow Architecture
All scripts use relative paths from scripts/ directory:
Input: ../data/input/
  clickin_text_img.xlsx (2779 rows: text + image paths)
  keywords.xlsx (22 rows, basic keyword list)
  keyword_all.xlsx (1659 rows, 1308 unique CAS numbers)
Output: ../data/output/
  keyword_matched_results.xlsx (multi-mode merged results)
  cas_matched_results_final.xlsx
  test_keywords_expanded_rows.xlsx
Images: ../data/images/ (1955 JPG files, 84MB)
Processing Pipeline:
Raw data collection -> Text extraction (OCR/LLM) ->
Feature matching (CAS/keywords) -> Data cleaning ->
Risk determination
Key Technical Details
1. CAS Number Matching (match_cas_numbers.py)
- Supports multiple formats: 123-45-6, 123 45 6, 123 - 45 - 6
- Auto-normalizes to standard format XXX-XX-X
- Uses regex pattern: \b\d{2,7}[\s\-]+\d{2}[\s\-]+\d\b
- Dual-mode: "regex" for CAS matching, "keywords" for keyword matching
2. Keyword Matching (keyword_matcher.py) - REFACTORED
Architecture:
- Strategy Pattern with KeywordMatcher base class
- Concrete matchers: CASRegexMatcher, FuzzyMatcher
- Factory Pattern for matcher creation
- Dataclass-based result handling
Two Detection Modes:
- CAS Number Recognition (CAS号识别)
  - Uses CASRegexMatcher with comprehensive regex pattern
  - Supports formats: 123-45-6, 123 45 6, 12345 6, 123456, 123.45.6, 123_45_6
  - Auto-normalizes all formats to standard XXX-XX-X
  - Regex: \b(\d{2,7})[\s\-._]?(\d{2})[\s\-._]?(\d)\b
  - Extracts CAS from text, normalizes, compares with keyword database
  - Source columns: CAS号
- Fuzzy Matching (模糊匹配)
  - Uses FuzzyMatcher with RapidFuzz library
  - Default threshold: 85 (configurable via --threshold)
  - Scoring function: partial_ratio
  - Source columns: 中文名, 英文名, CAS号, 简称, 可能名称
  - Note: A keyword that appears verbatim in the text scores 100 under partial_ratio, so fuzzy matching covers every case that exact matching would find, making a separate exact mode redundant
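The two CAS patterns quoted in this file behave differently at the edges, which a small sketch can make concrete. The code below is illustrative only, not the repository's implementation: the names STRICT, PERMISSIVE, find_cas, and has_valid_check_digit are hypothetical, capture groups were added to the stricter pattern so one helper can normalize both, and the check-digit test is an optional extra that this file does not describe.

```python
import re

# Pattern quoted for match_cas_numbers.py (groups added here): separators required.
STRICT = re.compile(r"\b(\d{2,7})[\s\-]+(\d{2})[\s\-]+(\d)\b")
# Pattern quoted for CASRegexMatcher: at most one separator character, or none.
PERMISSIVE = re.compile(r"\b(\d{2,7})[\s\-._]?(\d{2})[\s\-._]?(\d)\b")


def find_cas(text, pattern=PERMISSIVE):
    """Return every candidate found in text, normalized to XXX-XX-X."""
    return ["-".join(m.groups()) for m in pattern.finditer(text)]


def has_valid_check_digit(cas):
    """Optional filter (an assumption, not described in this file).

    The last digit of a CAS number equals the weighted sum of the other
    digits, counted from the right with weights 1, 2, 3, ..., modulo 10.
    """
    digits = cas.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    total = sum(weight * int(d) for weight, d in enumerate(reversed(body), start=1))
    return total % 10 == check


if __name__ == "__main__":
    for sample in ["123-45-6", "123 45 6", "12345 6", "123456", "123.45.6", "123_45_6"]:
        print(sample, "->", find_cas(sample))
    print(find_cas("123 - 45 - 6", STRICT))
    print(has_valid_check_digit("537-46-2"))
```

Because the permissive pattern accepts bare digit runs such as 123456, it can also pick up order numbers or phone fragments, which is where a check-digit filter might help. It also allows only a single separator character, so the spaced form 123 - 45 - 6 is matched by the stricter pattern but not by the permissive one.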
Multi-Mode Result Merging:
- Automatically merges results from multiple modes
- Deduplicates by row index
- Combines matched keywords with | separator
- Adds 匹配模式 column showing which modes matched (e.g., "CAS号识别 + 模糊匹配")
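The merge rules above can be sketched without pandas. This is a minimal illustration under stated assumptions: merge_mode_matches is a hypothetical name (the real script exposes merge_mode_results(), whose signature is not shown here), and the input shape, a dict of per-mode row matches, is invented for the example.

```python
def merge_mode_matches(results_by_mode):
    """Merge per-mode matches into one record per row index.

    results_by_mode maps a mode label to {row_index: [matched keywords]}.
    This input shape is an assumption made for the sketch.
    """
    merged = {}
    for mode, rows in results_by_mode.items():
        for row_index, keywords in rows.items():
            entry = merged.setdefault(row_index, {"keywords": [], "modes": []})
            for keyword in keywords:
                if keyword not in entry["keywords"]:  # deduplicate, keep order
                    entry["keywords"].append(keyword)
            if mode not in entry["modes"]:
                entry["modes"].append(mode)
    return {
        row_index: {
            "匹配到的关键词": "|".join(entry["keywords"]),
            "匹配模式": " + ".join(entry["modes"]),
        }
        for row_index, entry in sorted(merged.items())
    }


if __name__ == "__main__":
    sample = {
        "CAS号识别": {0: ["537-46-2"], 3: ["50-37-3"]},
        "模糊匹配": {0: ["冰毒"], 5: ["MA"]},
    }
    for row_index, record in merge_mode_matches(sample).items():
        print(row_index, record)
```

Row 0 is matched by both modes, so it appears once with both keywords joined by | and both mode labels joined by " + ".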
Command-Line Options:
-k, --keywords # Path to keywords file (default: ../data/input/keywords.xlsx)
-t, --text # Path to text file (default: ../data/input/clickin_text_img.xlsx)
-o, --output # Output file path (default: ../data/output/keyword_matched_results.xlsx)
-c, --text-column # Column containing text to search (default: "文本")
-m, --modes # Modes to run: cas, fuzzy (default: both)
--threshold # Fuzzy matching threshold 0-100 (default: 85)
--separator # Keyword separator in cells (default: "|||")
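For orientation, the options listed above could be declared with argparse roughly as follows. This is a sketch, not the script's actual parser: build_parser is a hypothetical name, and accepting several values after -m (nargs="+") is an assumption about how "both" modes are expressed.

```python
import argparse


def build_parser():
    """Declare the documented options with their documented defaults."""
    parser = argparse.ArgumentParser(description="Multi-mode keyword matching")
    parser.add_argument("-k", "--keywords", default="../data/input/keywords.xlsx",
                        help="Path to keywords file")
    parser.add_argument("-t", "--text", default="../data/input/clickin_text_img.xlsx",
                        help="Path to text file")
    parser.add_argument("-o", "--output",
                        default="../data/output/keyword_matched_results.xlsx",
                        help="Output file path")
    parser.add_argument("-c", "--text-column", default="文本",
                        help="Column containing text to search")
    # Assumption: one or more mode names may follow -m; both run by default.
    parser.add_argument("-m", "--modes", nargs="+", choices=["cas", "fuzzy"],
                        default=["cas", "fuzzy"], help="Modes to run")
    parser.add_argument("--threshold", type=float, default=85,
                        help="Fuzzy matching threshold 0-100")
    parser.add_argument("--separator", default="|||",
                        help="Keyword separator in cells")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-m", "fuzzy", "--threshold", "90"])
    print(args.modes, args.threshold, args.text_column)
```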
Performance:
- With keyword_all.xlsx (1308 CAS numbers):
  - CAS mode: 255 rows matched (9.18%)
  - Fuzzy mode: 513 rows matched (18.46%)
  - Merged (both modes): ~516 unique rows
Why the ||| separator:
- Chemical names can contain commas, hyphens, slashes, and semicolons
- A triple pipe avoids conflicts with chemical nomenclature
- Example: 甲基苯丙胺|||冰毒|||Methamphetamine|||MA
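A short sketch shows why a comma would be a poor delimiter here. The helper name split_keywords is hypothetical, and the second sample cell is an invented illustration rather than a row from the keyword files.

```python
SEPARATOR = "|||"


def split_keywords(cell, separator=SEPARATOR):
    """Split one spreadsheet cell into clean keywords.

    Non-string cells (for example the float NaN pandas uses for empty
    cells) yield an empty list.
    """
    if not isinstance(cell, str):
        return []
    return [part.strip() for part in cell.split(separator) if part.strip()]


if __name__ == "__main__":
    print(split_keywords("甲基苯丙胺|||冰毒|||Methamphetamine|||MA"))
    # A name with an internal comma survives a ||| split intact ...
    print(split_keywords("N,N-Dimethyltryptamine|||DMT"))
    # ... but would be torn apart by a comma split.
    print("N,N-Dimethyltryptamine".split(","))
```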
3. Keyword Expansion (expand_keywords_with_llm.py)
- Expands Chinese names, English names, abbreviations
- Supports OpenAI and Anthropic APIs
- Mock mode available for testing without API costs
- Output formats: compact (single row with ||| separators) or expanded (one name per row)
Configuration Patterns
Scripts use command-line arguments (keyword_matcher.py) or in-file configuration blocks:
# ========== Configuration ==========
keywords_file = "../data/input/keywords.xlsx"
text_file = "../data/input/clickin_text_img.xlsx"
keywords_column = "中文名"
text_column = "文本"
separator = "|||"
output_file = "../data/output/results.xlsx"
# =============================
Excel File Schemas
Input - clickin_text_img.xlsx:
- Columns: 文本 (text), image paths, metadata
- 2779 rows of scraped e-commerce/social media data
Input - keywords.xlsx:
- Columns: 中文名 (Chinese name), 英文名 (English name), CAS号 (CAS number), 简称 (abbreviation), 备注 (notes), 可能名称 (possible names)
- 可能名称 contains multiple keywords separated by |||
- 22 rows (small test dataset)
Input - keyword_all.xlsx:
- Same schema as keywords.xlsx
- 1659 rows with 1308 unique CAS numbers
- Production keyword database
Output - Multi-mode matched (keyword_matched_results.xlsx):
- Adds columns:
  - 匹配到的关键词 (matched keywords, separated by |)
  - 匹配模式 (matching modes, e.g., "CAS号识别 + 模糊匹配")
- Preserves all original columns
- Deduplicated across all modes
Output - CAS matched:
- Adds column: 匹配到的CAS号 (matched CAS numbers)
- Preserves all original columns
- Typical match rate: ~9-11% (255-303/2779 rows)
Common Modifications
To change input/output paths:
Use command-line arguments for keyword_matcher.py:
python3 keyword_matcher.py -k /path/to/keywords.xlsx -t /path/to/text.xlsx -o /path/to/output.xlsx
For the other scripts, edit the configuration block in the script's main() function.
To switch between CAS and keyword matching:
In match_cas_numbers.py, change match_mode = "regex" to match_mode = "keywords".
In keyword_matcher.py, use -m flag:
python3 keyword_matcher.py -m cas # CAS only
python3 keyword_matcher.py -m fuzzy # Fuzzy only
To adjust fuzzy matching sensitivity:
python3 keyword_matcher.py -m fuzzy --threshold 90 # Stricter (fewer matches)
python3 keyword_matcher.py -m fuzzy --threshold 70 # More lenient (more matches)
To use different LLM APIs:
# OpenAI (default)
python3 expand_keywords_with_llm.py input.xlsx
# Anthropic
python3 expand_keywords_with_llm.py input.xlsx -a anthropic
Code Architecture Highlights
keyword_matcher.py Design Patterns
- Strategy Pattern: Different matching algorithms (KeywordMatcher subclasses)
- Template Method: Common matching workflow in base class match() method
- Factory Pattern: create_matcher() selects appropriate matcher
- Dependency Injection: Optional dependency (rapidfuzz) handled gracefully
Class Hierarchy:
KeywordMatcher (ABC)
├── CASRegexMatcher # Regex-based CAS number extraction
└── FuzzyMatcher # RapidFuzz partial_ratio matching
Data Flow:
1. Load keywords -> load_keywords_for_mode()
2. Create matcher -> create_matcher()
3. Match text -> matcher.match()
├── _prepare() (build automaton, etc.)
└── For each row:
├── _match_single_text()
└── _format_matches()
4. Save results -> save_results()
5. If multiple modes -> merge_mode_results()
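The patterns and call flow above might fit together roughly as in the following sketch. It reuses the class and method names mentioned in this file, but the bodies, the MatchResult dataclass, and the constructor signatures are assumptions made for illustration, not the repository's code.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

try:  # rapidfuzz is optional; only the fuzzy matcher needs it
    from rapidfuzz import fuzz
except ImportError:
    fuzz = None


@dataclass
class MatchResult:  # hypothetical result container
    row_index: int
    keywords: list = field(default_factory=list)


class KeywordMatcher(ABC):
    def __init__(self, keywords):
        self.keywords = list(keywords)

    def match(self, texts):
        """Template method: shared loop, subclasses supply the per-text step."""
        self._prepare()
        results = []
        for row_index, text in enumerate(texts):
            found = self._match_single_text(text)
            if found:
                results.append(MatchResult(row_index, self._format_matches(found)))
        return results

    def _prepare(self):
        """Hook for one-time setup; the default does nothing."""

    @abstractmethod
    def _match_single_text(self, text):
        """Return the keywords found in one text."""

    def _format_matches(self, found):
        return sorted(set(found))


class CASRegexMatcher(KeywordMatcher):
    PATTERN = re.compile(r"\b(\d{2,7})[\s\-._]?(\d{2})[\s\-._]?(\d)\b")

    def _prepare(self):
        self._known = set(self.keywords)  # normalized CAS numbers to look for

    def _match_single_text(self, text):
        candidates = ("-".join(m.groups()) for m in self.PATTERN.finditer(text))
        return [c for c in candidates if c in self._known]


class FuzzyMatcher(KeywordMatcher):
    def __init__(self, keywords, threshold=85):
        if fuzz is None:
            raise RuntimeError("Fuzzy mode needs rapidfuzz: pip install rapidfuzz")
        super().__init__(keywords)
        self.threshold = threshold

    def _match_single_text(self, text):
        return [k for k in self.keywords
                if fuzz.partial_ratio(k, text) >= self.threshold]


def create_matcher(mode, keywords, **options):
    """Factory: map a mode name to its matcher class."""
    matchers = {"cas": CASRegexMatcher, "fuzzy": FuzzyMatcher}
    if mode not in matchers:
        raise ValueError(f"Unknown mode: {mode!r}")
    return matchers[mode](keywords, **options)


if __name__ == "__main__":
    matcher = create_matcher("cas", ["537-46-2"])
    for result in matcher.match(["sell 537 46 2 fast", "nothing here", "CAS:537462"]):
        print(result)
```

In this arrangement a missing rapidfuzz install only surfaces when fuzzy mode is requested, so CAS-only runs keep working with the required dependencies alone.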
Data Sensitivity
This codebase handles sensitive data related to controlled substances monitoring. All data is for legitimate law enforcement/research purposes. The data includes:
- Chemical compound names (Chinese and English)
- CAS registry numbers
- Image data from suspected illegal substance trading platforms
Do not commit actual data files or API keys to version control.