jef.copyrights.fingerprints module
Fingerprint-based reference storage for copyright detection.
This module provides utilities to generate and use pre-computed fingerprints for copyright detection, eliminating the need to ship raw copyrighted text.
Fingerprints are stored as gzip-compressed JSON to keep the files compact. The original copyrighted text cannot be recovered from the fingerprints.
- class jef.copyrights.fingerprints.ReferenceFingerprints(name, ngram_hashes=<factory>)
Bases: object
Compact pre-computed fingerprints for a reference text.
Contains n-gram hashes for detecting copied phrases.
- classmethod from_dict(data)
Create from dictionary (JSON deserialization).
- Return type:
ReferenceFingerprints
- classmethod from_gzip(filepath)
Load fingerprints from a gzip-compressed JSON file.
- Return type:
ReferenceFingerprints
- classmethod from_json(json_str)
Deserialize from JSON string.
- Return type:
ReferenceFingerprints
- name: str
- ngram_hashes: List[int]
- to_dict()
Convert to dictionary for JSON serialization.
- Return type:
dict
- to_gzip(filepath)
Save fingerprints to a gzip-compressed JSON file.
- Return type:
int
- to_json()
Serialize to JSON string.
- Return type:
str
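The serialization methods above can be illustrated with a minimal, self-contained sketch. `FingerprintSketch` is a hypothetical stand-in written for this example, not the library class. The dictionary keys (`name`, `ngram_hashes`) are assumed to mirror the attribute names, and the integer returned by `to_gzip` is assumed here to be the compressed file size in bytes, which this page does not confirm.

```python
import gzip
import json
import os
import tempfile
from dataclasses import dataclass, field
from typing import List


@dataclass
class FingerprintSketch:
    """Hypothetical stand-in for ReferenceFingerprints (not the library class)."""

    name: str
    ngram_hashes: List[int] = field(default_factory=list)

    def to_dict(self) -> dict:
        # Assumption: the keys mirror the attribute names.
        return {"name": self.name, "ngram_hashes": self.ngram_hashes}

    @classmethod
    def from_dict(cls, data: dict) -> "FingerprintSketch":
        return cls(name=data["name"], ngram_hashes=list(data.get("ngram_hashes", [])))

    def to_gzip(self, filepath: str) -> int:
        # Assumption: the returned int is the size of the compressed file in bytes.
        payload = json.dumps(self.to_dict()).encode("utf-8")
        with gzip.open(filepath, "wb") as fh:
            fh.write(payload)
        return os.path.getsize(filepath)

    @classmethod
    def from_gzip(cls, filepath: str) -> "FingerprintSketch":
        # Text mode lets json.load read the decompressed stream directly.
        with gzip.open(filepath, "rt", encoding="utf-8") as fh:
            return cls.from_dict(json.load(fh))


def roundtrip_demo() -> bool:
    """Write a fingerprint to a temporary file and read it back."""
    original = FingerprintSketch("chapter_one", [11, 22, 33])
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "chapter_one.json.gz")
        size = original.to_gzip(path)
        restored = FingerprintSketch.from_gzip(path)
    return size > 0 and restored == original
```

Only the name and integer hashes are written to disk, which is consistent with the module's statement that the reference text itself is not shipped.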
- jef.copyrights.fingerprints.calculate_overlap(submission, fingerprints, min_ngram_size=5, max_ngram_size=7)
Calculate n-gram hash overlap between submission and reference.
- Parameters:
submission (str) – The text to check
fingerprints (ReferenceFingerprints) – Reference fingerprints to compare against
min_ngram_size (int) – Minimum n-gram size
max_ngram_size (int) – Maximum n-gram size
- Return type:
dict
- Returns:
Dict with ‘score’ (0-1) and ‘percentage’ (0-100)
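As a rough illustration of n-gram hash overlap, the sketch below hashes every word n-gram of the submission and counts how many stored reference hashes it hits. The helper names (`ngram_hashes`, `overlap_sketch`), the SHA-256-based hash, the lowercase whitespace tokenization, and the choice of the reference hash count as the denominator are all assumptions made for this example. The library's actual hashing and scoring may differ.

```python
import hashlib
from typing import Dict, Iterable, Set


def ngram_hashes(text: str, min_n: int = 5, max_n: int = 7) -> Set[int]:
    """Hash every word n-gram with min_n <= n <= max_n (hypothetical helper)."""
    words = text.lower().split()
    hashes: Set[int] = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            digest = hashlib.sha256(gram.encode("utf-8")).digest()
            # Keep the first 8 bytes as a compact 64-bit integer.
            hashes.add(int.from_bytes(digest[:8], "big"))
    return hashes


def overlap_sketch(
    submission: str,
    reference_hashes: Iterable[int],
    min_n: int = 5,
    max_n: int = 7,
) -> Dict[str, float]:
    """Return the fraction of reference hashes that also occur in the submission."""
    ref = set(reference_hashes)
    if not ref:
        return {"score": 0.0, "percentage": 0.0}
    sub = ngram_hashes(submission, min_n, max_n)
    # Assumption: the score is normalized by the number of reference hashes.
    score = len(sub & ref) / len(ref)
    return {"score": score, "percentage": score * 100}
```

Because the comparison uses only hashes, the check can run without access to the reference text.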
- jef.copyrights.fingerprints.generate_fingerprints(reference, name, min_ngram_size=5, max_ngram_size=7, max_hashes=2000)
Generate fingerprints from a reference text.
- Parameters:
reference (str) – The raw reference text
name (str) – Name identifier (e.g., “page_one”, “chapter_one”)
min_ngram_size (int) – Minimum n-gram size
max_ngram_size (int) – Maximum n-gram size
max_hashes (int) – Maximum number of hashes to store. The default of 2000 provides good coverage for typical chapter-length text (~5000 words) while keeping fingerprint files compact (<20 KB compressed).
- Return type:
ReferenceFingerprints
- Returns:
ReferenceFingerprints object
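One way to picture fingerprint generation with a `max_hashes` cap is sketched below. The function `fingerprint_sketch`, the SHA-256-based hash, and the capping rule (keep the smallest hash values, a deterministic bottom-k selection) are assumptions for illustration only. This page does not say how the library selects which hashes to keep.

```python
import hashlib
from typing import List


def _hash_ngram(words: List[str]) -> int:
    """Map one word n-gram to a 64-bit integer (hypothetical helper)."""
    digest = hashlib.sha256(" ".join(words).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fingerprint_sketch(
    reference: str,
    min_n: int = 5,
    max_n: int = 7,
    max_hashes: int = 2000,
) -> List[int]:
    """Return at most max_hashes n-gram hashes for the reference text."""
    words = reference.lower().split()
    hashes = {
        _hash_ngram(words[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(words) - n + 1)
    }
    # Assumption: cap the fingerprint by keeping the smallest hash values,
    # so the same text always yields the same subset.
    return sorted(hashes)[:max_hashes]
```

A text of W words yields at most W - n + 1 n-grams for each size n, so a text of about 5000 words produces far more than 2000 candidate hashes across sizes 5 to 7. The cap therefore decides which subset is stored.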