chem-risk-detect/CLAUDE.md
2026-01-04 09:07:25 +08:00

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

This is a drug risk monitoring and data processing system that detects references to controlled substances in text and image data collected from e-commerce platforms, darknet sources, and social media.

Core Capabilities:

  1. CAS Number Matching: Extract and match chemical CAS numbers from text using regex patterns (supports multiple formats)
  2. Keyword Matching: High-performance multi-mode keyword matching (fuzzy, CAS)
  3. Keyword Expansion: LLM-powered expansion of chemical/drug names to include variants, abbreviations, and aliases

Running Scripts

All scripts must be run from the scripts/ directory:

cd scripts/

# Quick start (recommended for testing)
python3 quick_start.py

# CAS number matching
python3 match_cas_numbers.py

# Multi-mode keyword matching (default: both modes)
python3 keyword_matcher.py

# Single mode matching
python3 keyword_matcher.py -m cas                    # CAS number only
python3 keyword_matcher.py -m fuzzy --threshold 90   # Fuzzy matching only

# Use larger keyword database
python3 keyword_matcher.py -k ../data/input/keyword_all.xlsx

# Keyword expansion (mock mode, no API)
python3 expand_keywords_with_llm.py -m

# Keyword expansion (with OpenAI API)
export OPENAI_API_KEY="sk-..."
python3 expand_keywords_with_llm.py ../data/input/keywords.xlsx

Dependencies

Required:

pip install pandas openpyxl

Optional (for fuzzy keyword matching):

pip install rapidfuzz

Optional (for LLM keyword expansion):

pip install openai anthropic

Data Flow Architecture

All scripts use relative paths from scripts/ directory:

Input:  ../data/input/
     clickin_text_img.xlsx  (2779 rows: text + image paths)
     keywords.xlsx           (22 rows, basic keyword list)
     keyword_all.xlsx        (1659 rows, 1308 unique CAS numbers)

Output: ../data/output/
     keyword_matched_results.xlsx  (multi-mode merged results)
     cas_matched_results_final.xlsx
     test_keywords_expanded_rows.xlsx

Images: ../data/images/  (1955 JPG files, 84MB)

Processing Pipeline:

Raw data collection -> Text extraction (OCR/LLM) ->
Feature matching (CAS/keywords) -> Data cleaning ->
Risk determination

Key Technical Details

1. CAS Number Matching (match_cas_numbers.py)

  • Supports multiple formats: 123-45-6, 123 45 6, 123 - 45 - 6
  • Auto-normalizes to the standard hyphenated form (e.g. 123-45-6)
  • Uses regex pattern: \b\d{2,7}[\s\-]+\d{2}[\s\-]+\d\b
  • Dual-mode: "regex" for CAS matching, "keywords" for keyword matching
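
The extraction and normalization steps listed above can be sketched as follows. This is a minimal illustration that reuses the regex quoted in the list; the function names normalize_cas and extract_cas_numbers are hypothetical and are not taken from match_cas_numbers.py.

```python
import re

# Pattern quoted from the list above: separators between groups are required.
CAS_PATTERN = re.compile(r"\b\d{2,7}[\s\-]+\d{2}[\s\-]+\d\b")


def normalize_cas(raw: str) -> str:
    """Collapse every run of spaces/hyphens into a single hyphen."""
    return re.sub(r"[\s\-]+", "-", raw.strip())


def extract_cas_numbers(text: str) -> list[str]:
    """Return normalized CAS-like numbers, deduplicated, in order of first appearance."""
    seen: dict[str, None] = {}  # dict keeps insertion order
    for match in CAS_PATTERN.finditer(text):
        seen.setdefault(normalize_cas(match.group(0)), None)
    return list(seen)


if __name__ == "__main__":
    sample = "Offers 123-45-6, also written 123 45 6 or 123 - 45 - 6; see 50-00-0"
    print(extract_cas_numbers(sample))
```

All three spellings of 123-45-6 collapse to one normalized entry, so the result can be compared directly with the CAS号 column of the keyword file.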

2. Keyword Matching (keyword_matcher.py) - REFACTORED

Architecture:

  • Strategy Pattern with KeywordMatcher base class
  • Concrete matchers: CASRegexMatcher, FuzzyMatcher
  • Factory Pattern for matcher creation
  • Dataclass-based result handling

Two Detection Modes:

  1. CAS Number Recognition (CAS号识别)

    • Uses CASRegexMatcher with comprehensive regex pattern
    • Supports formats: 123-45-6, 123 45 6, 12345 6, 123456, 123.45.6, 123_45_6
    • Auto-normalizes all formats to the standard hyphenated form (e.g. 123-45-6)
    • Regex: \b(\d{2,7})[\s\-._]?(\d{2})[\s\-._]?(\d)\b
    • Extracts CAS from text, normalizes, compares with keyword database
    • Source columns: CAS号
  2. Fuzzy Matching (模糊匹配)

    • Uses FuzzyMatcher with RapidFuzz library
    • Default threshold: 85 (configurable via --threshold)
    • Scoring function: partial_ratio
    • Source columns: 中文名, 英文名, CAS号, 简称, 可能名称
    • Note: Because partial_ratio scores an exact substring occurrence as 100, fuzzy matching finds every case that exact matching would, making a separate exact mode redundant
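
The two detection modes can be sketched roughly as below. The regex is the one quoted above and fuzz.partial_ratio is the RapidFuzz scoring function named in the list. The helper names (extract_cas_loose, fuzzy_score, fuzzy_hit) and the substring fallback used when rapidfuzz is missing are assumptions made for illustration, not the behavior of keyword_matcher.py.

```python
import re

try:  # rapidfuzz is an optional dependency
    from rapidfuzz import fuzz
except ImportError:
    fuzz = None

# Pattern quoted from the list above: separators between groups are optional.
LOOSE_CAS = re.compile(r"\b(\d{2,7})[\s\-._]?(\d{2})[\s\-._]?(\d)\b")


def extract_cas_loose(text: str) -> set[str]:
    """Extract CAS-like numbers and rebuild each one in hyphenated form."""
    return {"-".join(m.groups()) for m in LOOSE_CAS.finditer(text)}


def fuzzy_score(keyword: str, text: str) -> float:
    """Score from 0 to 100 for how well keyword matches some part of text."""
    if fuzz is not None:
        return fuzz.partial_ratio(keyword, text)
    # Assumed fallback (not from the source): plain substring test.
    return 100.0 if keyword in text else 0.0


def fuzzy_hit(keyword: str, text: str, threshold: float = 85.0) -> bool:
    """Apply the default threshold of 85 described above."""
    return fuzzy_score(keyword, text) >= threshold
```

Because every separator is optional, any bare run of 5 to 10 digits also matches the pattern (for example, 13800 is read as 13-80-0), which is why extracted values still need to be compared with the keyword database.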

Multi-Mode Result Merging:

  • Automatically merges results from multiple modes
  • Deduplicates by row index
  • Combines matched keywords with | separator
  • Adds 匹配模式 column showing which modes matched (e.g., "CAS号识别 + 模糊匹配")

Command-Line Options:

-k, --keywords      # Path to keywords file (default: ../data/input/keywords.xlsx)
-t, --text          # Path to text file (default: ../data/input/clickin_text_img.xlsx)
-o, --output        # Output file path (default: ../data/output/keyword_matched_results.xlsx)
-c, --text-column   # Column containing text to search (default: "文本")
-m, --modes         # Modes to run: cas, fuzzy (default: both)
--threshold         # Fuzzy matching threshold 0-100 (default: 85)
--separator         # Keyword separator in cells (default: "|||")
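
For orientation, the options above could be declared with argparse along these lines. The flags and defaults are copied from the list; the build_parser name, the float type for --threshold, and accepting --modes as a space-separated list are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented options with their documented defaults."""
    parser = argparse.ArgumentParser(description="Multi-mode keyword matching")
    parser.add_argument("-k", "--keywords", default="../data/input/keywords.xlsx")
    parser.add_argument("-t", "--text", default="../data/input/clickin_text_img.xlsx")
    parser.add_argument("-o", "--output",
                        default="../data/output/keyword_matched_results.xlsx")
    parser.add_argument("-c", "--text-column", default="文本")
    # Assumption: one or more modes may follow -m; both run when omitted.
    parser.add_argument("-m", "--modes", nargs="+", choices=["cas", "fuzzy"],
                        default=["cas", "fuzzy"])
    parser.add_argument("--threshold", type=float, default=85)
    parser.add_argument("--separator", default="|||")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```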

Performance:

  • With keyword_all.xlsx (1308 unique CAS numbers) against clickin_text_img.xlsx (2779 rows):
    • CAS mode: 255 rows matched (9.18%)
    • Fuzzy mode: 513 rows matched (18.46%)
    • Merged (both modes): ~516 unique rows (~18.6%)

Why the ||| separator:

  • Chemical names contain commas, hyphens, slashes, semicolons
  • Triple pipe avoids conflicts with chemical nomenclature
  • Example: 甲基苯丙胺|||冰毒|||Methamphetamine|||MA
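
A small sketch of how a cell can be split on the triple pipe. The split_keywords helper is hypothetical; it also guards against empty spreadsheet cells, which pandas typically reads as non-string values.

```python
SEPARATOR = "|||"


def split_keywords(cell, separator: str = SEPARATOR) -> list[str]:
    """Split one spreadsheet cell into trimmed, non-empty keywords."""
    if not isinstance(cell, str):  # empty cells arrive as None or NaN
        return []
    return [part.strip() for part in cell.split(separator) if part.strip()]


if __name__ == "__main__":
    print(split_keywords("甲基苯丙胺|||冰毒|||Methamphetamine|||MA"))
    # A comma inside a chemical name survives, unlike with a comma separator.
    print(split_keywords("N,N-Dimethyltryptamine|||DMT"))
```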

3. Keyword Expansion (expand_keywords_with_llm.py)

  • Expands Chinese names, English names, abbreviations
  • Supports OpenAI and Anthropic APIs
  • Mock mode available for testing without API costs
  • Output formats: compact (single row with ||| separators) or expanded (one name per row)
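
The difference between the two output formats can be shown with a short sketch. The function names are hypothetical, and the use of the 中文名 and 可能名称 columns for the expansion output is an assumption based on the keyword file schema, not on the script itself.

```python
def to_compact_row(name: str, variants: list[str], separator: str = "|||") -> dict[str, str]:
    """Compact format: one row, all variants joined with the separator."""
    return {"中文名": name, "可能名称": separator.join(variants)}


def to_expanded_rows(name: str, variants: list[str]) -> list[dict[str, str]]:
    """Expanded format: one row per variant."""
    return [{"中文名": name, "可能名称": variant} for variant in variants]


if __name__ == "__main__":
    variants = ["冰毒", "Methamphetamine", "MA"]
    print(to_compact_row("甲基苯丙胺", variants))
    print(to_expanded_rows("甲基苯丙胺", variants))
```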

Configuration Patterns

Scripts use command-line arguments (keyword_matcher.py) or in-file configuration blocks:

# ========== Configuration ==========
keywords_file = "../data/input/keywords.xlsx"
text_file = "../data/input/clickin_text_img.xlsx"
keywords_column = "中文名"
text_column = "文本"
separator = "|||"
output_file = "../data/output/results.xlsx"
# =============================

Excel File Schemas

Input - clickin_text_img.xlsx:

  • Columns: 文本 (text), image paths, metadata
  • 2779 rows of scraped e-commerce/social media data

Input - keywords.xlsx:

  • Columns: 中文名 (Chinese name), 英文名 (English name), CAS号 (CAS number), 简称 (abbreviation), 备注 (notes), 可能名称 (possible names)
  • 可能名称 contains multiple keywords separated by |||
  • 22 rows (small test dataset)

Input - keyword_all.xlsx:

  • Same schema as keywords.xlsx
  • 1659 rows with 1308 unique CAS numbers
  • Production keyword database

Output - Multi-mode matched (keyword_matched_results.xlsx):

  • Adds columns:
    • 匹配到的关键词 (matched keywords, separated by |)
    • 匹配模式 (matching modes, e.g., "CAS号识别 + 模糊匹配")
  • Preserves all original columns
  • Deduplicated across all modes

Output - CAS matched:

  • Adds column: 匹配到的CAS号 (matched CAS numbers)
  • Preserves all original columns
  • Typical match rate: ~9-11% (255-303/2779 rows)

Common Modifications

To change input/output paths: Use command-line arguments for keyword_matcher.py:

python3 keyword_matcher.py -k /path/to/keywords.xlsx -t /path/to/text.xlsx -o /path/to/output.xlsx

Or edit the configuration block in other scripts' main() function.

To switch between CAS and keyword matching: In match_cas_numbers.py, change match_mode = "regex" to match_mode = "keywords".

In keyword_matcher.py, use -m flag:

python3 keyword_matcher.py -m cas        # CAS only
python3 keyword_matcher.py -m fuzzy      # Fuzzy only

To adjust fuzzy matching sensitivity:

python3 keyword_matcher.py -m fuzzy --threshold 90  # Stricter (fewer matches)
python3 keyword_matcher.py -m fuzzy --threshold 70  # More lenient (more matches)

To use different LLM APIs:

# OpenAI (default)
python3 expand_keywords_with_llm.py input.xlsx

# Anthropic
python3 expand_keywords_with_llm.py input.xlsx -a anthropic

Code Architecture Highlights

keyword_matcher.py Design Patterns

  1. Strategy Pattern: Different matching algorithms (KeywordMatcher subclasses)
  2. Template Method: Common matching workflow in base class match() method
  3. Factory Pattern: create_matcher() selects appropriate matcher
  4. Optional Dependency Handling: the optional rapidfuzz dependency is handled gracefully when it is not installed

Class Hierarchy:

KeywordMatcher (ABC)
├── CASRegexMatcher          # Regex-based CAS number extraction
└── FuzzyMatcher             # RapidFuzz partial_ratio matching

Data Flow:

1. Load keywords -> load_keywords_for_mode()
2. Create matcher -> create_matcher()
3. Match text -> matcher.match()
   ├── _prepare() (build automaton, etc.)
   └── For each row:
       ├── _match_single_text()
       └── _format_matches()
4. Save results -> save_results()
5. If multiple modes -> merge_mode_results()
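
The class hierarchy and data flow above can be condensed into a runnable sketch. Class and method names follow this file, but the bodies are illustrative assumptions. In particular, SubstringMatcher is a stand-in for FuzzyMatcher so that the example needs no third-party imports.

```python
import re
from abc import ABC, abstractmethod


class KeywordMatcher(ABC):
    """Base strategy; match() is the template method shared by subclasses."""

    def __init__(self, keywords: list[str]):
        self.keywords = keywords

    def match(self, texts: list[str]) -> dict[int, str]:
        self._prepare()
        results: dict[int, str] = {}
        for row_index, text in enumerate(texts):
            hits = self._match_single_text(text)
            if hits:
                results[row_index] = self._format_matches(hits)
        return results

    def _prepare(self) -> None:
        """Optional one-time setup hook; does nothing by default."""

    @abstractmethod
    def _match_single_text(self, text: str) -> list[str]:
        """Return the keywords found in one text."""

    def _format_matches(self, hits: list[str]) -> str:
        return "|".join(dict.fromkeys(hits))  # dedupe, keep order


class CASRegexMatcher(KeywordMatcher):
    PATTERN = re.compile(r"\b(\d{2,7})[\s\-._]?(\d{2})[\s\-._]?(\d)\b")

    def _prepare(self) -> None:
        self._known = set(self.keywords)

    def _match_single_text(self, text: str) -> list[str]:
        found = ("-".join(m.groups()) for m in self.PATTERN.finditer(text))
        return [cas for cas in found if cas in self._known]


class SubstringMatcher(KeywordMatcher):
    """Stand-in for FuzzyMatcher (assumption): exact substring test only."""

    def _match_single_text(self, text: str) -> list[str]:
        return [kw for kw in self.keywords if kw in text]


def create_matcher(mode: str, keywords: list[str]) -> KeywordMatcher:
    """Factory: map a mode name to its matcher class."""
    registry = {"cas": CASRegexMatcher, "fuzzy": SubstringMatcher}
    if mode not in registry:
        raise ValueError(f"unknown mode: {mode}")
    return registry[mode](keywords)
```

Adding a new mode then means writing one subclass and registering it in the factory, without touching the shared match() loop.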

Data Sensitivity

This codebase handles sensitive data related to controlled substances monitoring. The data includes:

  • Chemical compound names (Chinese and English)
  • CAS registry numbers
  • Image data from suspected illegal substance trading platforms

All data is for legitimate law enforcement/research purposes.

Do not commit actual data files or API keys to version control.
