ntnt/chem-risk-detect

ntnt 4ed90734df fix: update keywords_match

2026-01-18 18:25:36 +08:00

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

This is a drug risk monitoring and data processing system for detecting controlled substances in text and image data from e-commerce platforms, darknet sources, and social media.

Core Capabilities:

Image Recognition: Batch image analysis using LLM APIs (OpenAI, Anthropic, DMX, Dify) for OCR and risk detection
Keyword Matching: Multi-mode keyword matching with CAS number extraction and exact matching
LLM Verification: Secondary verification, using an LLM, of high-confidence records that the keyword matcher did not match
Data Collection: Merge and consolidate results from batch processing

Running Scripts

All scripts must be run from the scripts/ directory:

cd scripts/

# Image batch recognition (mock mode for testing)
python3 image_batch_recognizer.py --mock --limit 5

# Image recognition with API
python3 image_batch_recognizer.py --api-type dify --limit 10

# Collect and merge xlsx files from batch output
python3 collect_xlsx.py

# Multi-mode keyword matching (default: cas + exact)
python3 keyword_matcher.py

# Single mode matching
python3 keyword_matcher.py -m cas      # CAS number only
python3 keyword_matcher.py -m exact    # Exact matching only

# Verify high-confidence unmatched records
python3 verify_high_confidence.py -o original.xlsx -m matched.xlsx --mock

Dependencies

Required:

pip install pandas openpyxl

Optional:

pip install pyahocorasick  # 5x faster exact matching
pip install requests       # Required for Dify API
pip install tqdm           # Progress bars
pip install openai         # For OpenAI-compatible APIs in verify script

Environment Configuration

Copy .env.example to .env and configure API keys:

# Default API type (openai | dmx | dify | ollama)
LLM_API_TYPE="dify"

# DMX API (OpenAI compatible)
DMX_API_KEY="your-key"
DMX_BASE_URL="https://www.dmxapi.cn"
DMX_MODEL="gpt-4o-mini"

# Dify API (used by image_batch_recognizer.py)
DIFY_API_KEY="app-xxx"
DIFY_BASE_URL="https://your-dify-server:4433"
DIFY_USER_ID="default-user"

# Separate config for verify_high_confidence.py (VERIFY_ prefix)
VERIFY_API_TYPE="dmx"
VERIFY_API_KEY="your-key"
VERIFY_BASE_URL="https://api.example.com"
VERIFY_MODEL="gpt-4o-mini"

Data Flow Architecture

data/
├── input/                      # Source data
│   ├── clickin_text_img.xlsx   # Text + image paths
│   └── keywords.xlsx           # Keyword database
├── images/                     # Image files for recognition
├── batch_output/               # Per-folder recognition results
│   └── {name}/results.xlsx
├── data_all/                   # Original data by source
│   └── {name}_text_img.xlsx
├── collected_xlsx/             # Merged results (collect_xlsx.py output)
└── output/                     # Final processed results

Processing Pipeline:

1. image_batch_recognizer.py → batch_output/{name}/results.xlsx
2. collect_xlsx.py → Merge results.xlsx with {name}_text_img.xlsx → collected_xlsx/
3. keyword_matcher.py → Match keywords in text → output/keyword_matched_results.xlsx
4. verify_high_confidence.py → LLM-verify unmatched high-confidence rows → *_llm_verified.xlsx
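
The four steps above can be chained from a small driver. This is only a sketch: it reuses the command lines from "Running Scripts", and `pipeline_commands`, `run_pipeline`, and their parameters are hypothetical names, not part of the repository. The `original.xlsx` and `matched.xlsx` arguments are the placeholders used earlier in this file and must be replaced with real paths.

```python
import subprocess


def pipeline_commands(original="original.xlsx", matched="matched.xlsx",
                      mock=True, limit=5):
    """Build the four pipeline commands in execution order."""
    recognize = ["python3", "image_batch_recognizer.py", "--limit", str(limit)]
    verify = ["python3", "verify_high_confidence.py", "-o", original, "-m", matched]
    if mock:
        # --mock avoids real API calls in both LLM-backed steps.
        recognize.append("--mock")
        verify.append("--mock")
    return [
        recognize,
        ["python3", "collect_xlsx.py"],
        ["python3", "keyword_matcher.py"],
        verify,
    ]


def run_pipeline(scripts_dir="scripts", **kwargs):
    """Run each step from scripts/, stopping at the first failure."""
    for command in pipeline_commands(**kwargs):
        subprocess.run(command, cwd=scripts_dir, check=True)
```

Running with `cwd=scripts_dir` keeps the rule that every script is started from the scripts/ directory.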

Key Scripts

keyword_matcher.py

Two detection modes, implemented with the Strategy pattern:

CAS Number Recognition (-m cas)
- Regex pattern: \b(\d{2,7})[\s\-._]?(\d{2})[\s\-._]?(\d)\b
- Supports formats: 123-45-6, 123 45 6, 123456, 123.45.6
- Auto-normalizes to the hyphenated form (2 to 7 digits, 2 digits, 1 check digit), e.g. 123456 → 123-45-6
- Source column: CAS号

Exact Matching (-m exact)
- Uses an Aho-Corasick automaton if pyahocorasick is installed; otherwise falls back to regex with word boundaries
- Source columns: 中文名, 英文名, CAS号, 简称, 可能名称
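
A minimal sketch of the CAS recognition step, using the regex quoted above. The function names are illustrative and not taken from keyword_matcher.py. The checksum filter is an extra assumption: this file does not say whether the script validates CAS check digits, but since the pattern also matches any bare run of 5 to 10 digits, such a filter is one way to cut false positives.

```python
import re

# Pattern quoted from this section: 2-7 digits, 2 digits, 1 check digit,
# each pair optionally separated by whitespace, '-', '.', or '_'.
CAS_PATTERN = re.compile(r"\b(\d{2,7})[\s\-._]?(\d{2})[\s\-._]?(\d)\b")


def extract_cas_numbers(text):
    """Return normalized CAS candidates in order of appearance, de-duplicated."""
    candidates = (f"{a}-{b}-{c}" for a, b, c in CAS_PATTERN.findall(text))
    return list(dict.fromkeys(candidates))


def has_valid_check_digit(cas):
    """Assumed extra filter: verify the standard CAS checksum.

    The last digit must equal the weighted sum of the preceding digits
    (weights 1, 2, 3, ... counted from the right) modulo 10.
    """
    digits = cas.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    total = sum(weight * int(d) for weight, d in enumerate(reversed(body), start=1))
    return total % 10 == check
```

For example, `58-08-2`, `7732 18 5`, and `58082` all normalize to hyphenated form, and the first and third collapse into a single entry.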

Multi-column text matching:

Automatically detects and combines the detected_text and 文本 (text) columns when present
Use -c col1 col2 to specify custom columns
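
One way to picture the column combination, sketched on plain dict rows so it runs without pandas. `combine_text_columns` and the newline separator are assumptions; the script's actual joining logic is not documented here.

```python
import math

# Default columns named in this section; -c would replace this tuple.
DEFAULT_TEXT_COLUMNS = ("detected_text", "文本")


def combine_text_columns(row, columns=DEFAULT_TEXT_COLUMNS, sep="\n"):
    """Join the non-empty text columns of one row into a single string."""
    parts = []
    for column in columns:
        value = row.get(column)
        # Skip missing cells; spreadsheet readers often yield NaN for blanks.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        text = str(value).strip()
        if text:
            parts.append(text)
    return sep.join(parts)
```

Matching then runs once per row on the combined string instead of once per column.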

Class Hierarchy:

KeywordMatcher (ABC)
├── CASRegexMatcher      # Regex CAS extraction + normalization
├── RegexExactMatcher    # Word-boundary exact matching
├── AhoCorasickMatcher   # Fast multi-pattern matching
└── SetMatcher           # Simple substring matching
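
The hierarchy can be sketched as follows. Only the class names come from this file; the `find` method, the constructor, and the matching details are assumptions. The Aho-Corasick variant is left out so the sketch needs only the standard library.

```python
import re
from abc import ABC, abstractmethod


class KeywordMatcher(ABC):
    """Common interface; each subclass is one interchangeable strategy."""

    def __init__(self, keywords):
        self.keywords = [k for k in keywords if k]

    @abstractmethod
    def find(self, text):
        """Return the keywords that occur in text."""


class RegexExactMatcher(KeywordMatcher):
    """Whole-word match: the keyword must not touch other word characters."""

    def find(self, text):
        return [
            k for k in self.keywords
            if re.search(rf"(?<!\w){re.escape(k)}(?!\w)", text, re.IGNORECASE)
        ]


class SetMatcher(KeywordMatcher):
    """Plain case-insensitive substring match."""

    def find(self, text):
        lowered = text.lower()
        return [k for k in self.keywords if k.lower() in lowered]
```

Because callers depend only on `KeywordMatcher.find`, a faster backend can replace a slower one without changes elsewhere. Note that `\w` also matches CJK characters in Python, so word-boundary matching behaves differently for Chinese keywords embedded in Chinese text.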

verify_high_confidence.py

Compares keyword_matcher.py output with the original data to find high-confidence rows that were not matched, then sends those rows to an LLM for secondary verification.

Uses VERIFY_ prefixed env vars (separate from image_batch_recognizer.py)
Supports: OpenAI, DMX, Dify, Ollama, Mock modes
Input columns: raw_response, 文本
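
The row selection can be sketched like this on plain dict rows. Everything specific here is an assumption: this file does not say which column identifies a row or how confidence is encoded, so the sketch keys on `image_name` and treats the string "high" as high confidence.

```python
def is_high_confidence(row):
    """Assumed encoding: confidence is stored as a label such as 'high'."""
    return str(row.get("confidence", "")).strip().lower() == "high"


def unmatched_high_confidence(original_rows, matched_rows, key="image_name"):
    """Return high-confidence original rows that are absent from the matched set."""
    matched_keys = {row[key] for row in matched_rows}
    return [
        row for row in original_rows
        if is_high_confidence(row) and row[key] not in matched_keys
    ]
```

Only the rows returned here would be sent to the LLM, which keeps the verification cost proportional to the gap left by keyword matching.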

collect_xlsx.py

Merges batch recognition results with original data:

Matches by image filename (handles both Windows \ and Unix / paths)
Adds original columns (文本, metadata) to recognition results
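
A sketch of filename-based matching that tolerates both path styles. `image_key`, `merge_rows`, and the use of an `image_path` column on both sides are assumptions; the real column names in the source files may differ.

```python
from pathlib import PureWindowsPath


def image_key(path):
    """Return the bare file name of a Windows- or Unix-style path."""
    # PureWindowsPath accepts both '\\' and '/' as separators on every OS.
    return PureWindowsPath(str(path).strip()).name


def merge_rows(recognized, originals, path_column="image_path"):
    """Attach original columns to each recognition row with the same file name."""
    by_name = {image_key(row[path_column]): row for row in originals}
    merged = []
    for rec in recognized:
        extra = by_name.get(image_key(rec[path_column]), {})
        # Recognition values take precedence when column names clash.
        merged.append({**extra, **rec})
    return merged
```

Keying on the file name alone means two images with the same name in different folders would collide, which is worth checking before relying on this approach.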

image_batch_recognizer.py

Batch image recognition with multiple API backends:

Supports: OpenAI, Anthropic, DMX, Dify, Mock
Outputs: detected_text, detected_objects, sensitive_items, summary, confidence
Parallel processing with --max-workers
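
Parallel processing of this kind is commonly built on a thread pool, since the work is dominated by waiting on API responses. The sketch below assumes that design; `recognize_batch` and `mock_recognize` are hypothetical, and the placeholder values (including `None` for confidence) do not reflect the script's real output format.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import PureWindowsPath


def mock_recognize(image_path):
    """Stand-in for an API call; returns the documented output fields, empty."""
    return {
        "image_name": PureWindowsPath(image_path).name,
        "image_path": image_path,
        "detected_text": "",
        "detected_objects": "",
        "sensitive_items": "",
        "summary": "",
        "confidence": None,  # real encoding is not documented here
    }


def recognize_batch(image_paths, recognize=mock_recognize, max_workers=4):
    """Run recognize over all images concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(recognize, image_paths))
```

`Executor.map` returns results in input order, so the output rows line up with the input list regardless of which request finishes first.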

Excel File Schemas

keywords.xlsx columns:

中文名 (Chinese name), 英文名 (English name), CAS号 (CAS number), 简称 (abbreviation), 备注 (notes), 可能名称 (possible names)
可能名称 uses ||| separator for multiple values

Recognition output columns:

image_name, image_path, detected_text, detected_objects
sensitive_items, summary, confidence, raw_response

Matched output adds:

匹配到的关键词 (matched keywords, | separated)
匹配模式 (match mode, e.g., "CAS号识别 + 精确匹配", meaning "CAS number recognition + exact matching")
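
The two separators in these schemas can be handled with a pair of small helpers. This is a sketch with illustrative names; whitespace trimming and dropping of empty entries are assumptions about how cells should be cleaned.

```python
KEYWORD_CELL_SEPARATOR = "|||"  # multiple values inside one 可能名称 cell
MATCH_SEPARATOR = "|"           # matched keywords in the output column


def split_keyword_cell(cell):
    """Split a keyword cell into clean, non-empty values."""
    if cell is None:
        return []
    parts = (part.strip() for part in str(cell).split(KEYWORD_CELL_SEPARATOR))
    return [part for part in parts if part]


def join_matches(matches):
    """Format matched keywords for the 匹配到的关键词 column."""
    return MATCH_SEPARATOR.join(matches)
```

A consequence of the single-pipe output separator is that a keyword containing `|` cannot be recovered unambiguously from the joined column.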

Key Conventions

Triple pipe ||| separator in keyword cells (avoids conflicts with chemical names)
Match result separator: |
All scripts use relative paths from scripts/ directory
Configuration priority: command-line args > VERIFY_* env > general env > defaults
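
The configuration priority can be sketched as a first-non-empty lookup. `resolve_setting` is a hypothetical helper, and which general variable backs which `VERIFY_*` variable (for example `DMX_MODEL` for `VERIFY_MODEL`) is an assumption, not something this file specifies.

```python
import os


def resolve_setting(name, cli_value=None, general_env=None, default=None,
                    environ=None):
    """Resolve one setting: CLI > VERIFY_<name> env > general env > default."""
    env = os.environ if environ is None else environ
    candidates = [cli_value, env.get(f"VERIFY_{name}")]
    if general_env:
        candidates.append(env.get(general_env))
    candidates.append(default)
    for value in candidates:
        # Treat unset and empty values alike so a blank .env entry falls through.
        if value not in (None, ""):
            return value
    return None
```

For example, `resolve_setting("MODEL", args.model, general_env="DMX_MODEL", default="gpt-4o-mini")` would prefer a command-line model, then `VERIFY_MODEL`, then `DMX_MODEL`, where `args.model` stands for a parsed command-line option.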
