jef.copyrights.fingerprints module

Fingerprint-based reference storage for copyright detection.

This module provides utilities to generate and use pre-computed fingerprints for copyright detection, eliminating the need to ship raw copyrighted text.

Fingerprints are stored as gzip-compressed JSON to keep files compact. Because they hold n-gram hashes rather than text, the original copyrighted text cannot be recovered from the fingerprints.
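The storage format described above can be illustrated with the standard library alone. This is a minimal sketch, not this module's implementation: the helper names `save_gzip_json` and `load_gzip_json`, the file name, and the payload layout (a `name` plus a list of `ngram_hashes`) are assumptions made for illustration.

```python
import gzip
import json
import os
import tempfile


def save_gzip_json(payload: dict, filepath: str) -> int:
    """Write payload as gzip-compressed JSON (hypothetical helper).

    Returns the size of the compressed file in bytes.
    """
    raw = json.dumps(payload).encode("utf-8")
    with gzip.open(filepath, "wb") as fh:
        fh.write(raw)
    return os.path.getsize(filepath)


def load_gzip_json(filepath: str) -> dict:
    """Read a gzip-compressed JSON file back into a dict (hypothetical helper)."""
    with gzip.open(filepath, "rt", encoding="utf-8") as fh:
        return json.load(fh)


# Assumed payload layout: a name plus a list of integer hashes.
payload = {"name": "chapter_one", "ngram_hashes": [101, 202, 303]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chapter_one.json.gz")
    size = save_gzip_json(payload, path)
    restored = load_gzip_json(path)
```

Only the hash integers and the name reach the disk in this sketch; no reference text is written.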

class jef.copyrights.fingerprints.ReferenceFingerprints(name, ngram_hashes=<factory>)

Bases: object

Compact pre-computed fingerprints for a reference text.

Contains n-gram hashes for detecting copied phrases.

classmethod from_dict(data)

Create from dictionary (JSON deserialization).

Return type:

ReferenceFingerprints

classmethod from_gzip(filepath)

Load fingerprints from a gzip-compressed JSON file.

Return type:

ReferenceFingerprints

classmethod from_json(json_str)

Deserialize from JSON string.

Return type:

ReferenceFingerprints

name: str

ngram_hashes: List[int]

to_dict()

Convert to dictionary for JSON serialization.

Return type:

dict

to_gzip(filepath)

Save fingerprints to a gzip-compressed JSON file.

Return type:

int

to_json()

Serialize to JSON string.

Return type:

str
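The constructor signature (`ngram_hashes=<factory>`) suggests a dataclass whose hash list defaults to an empty list. The stand-in below sketches the dict and JSON round trip under that assumption. `FingerprintsSketch` is a hypothetical class written for illustration, not the library's `ReferenceFingerprints`, and the real serialized layout may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class FingerprintsSketch:
    """Hypothetical stand-in mirroring the documented attributes."""

    name: str
    ngram_hashes: List[int] = field(default_factory=list)

    def to_dict(self) -> dict:
        # Plain dict, ready for json.dumps.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "FingerprintsSketch":
        # A missing hash list is treated as empty (an assumption of this sketch).
        return cls(name=data["name"], ngram_hashes=list(data.get("ngram_hashes", [])))

    def to_json(self) -> str:
        return json.dumps(self.to_dict())

    @classmethod
    def from_json(cls, json_str: str) -> "FingerprintsSketch":
        return cls.from_dict(json.loads(json_str))


original = FingerprintsSketch(name="page_one", ngram_hashes=[11, 22, 33])
round_tripped = FingerprintsSketch.from_json(original.to_json())
```

Using `default_factory=list` gives each instance its own list, so two fingerprint objects never share one mutable default.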

jef.copyrights.fingerprints.calculate_overlap(submission, fingerprints, min_ngram_size=5, max_ngram_size=7)

Calculate n-gram hash overlap between submission and reference.

Parameters:
  • submission (str) – The text to check

  • fingerprints (ReferenceFingerprints) – Reference fingerprints to compare against

  • min_ngram_size (int) – Minimum n-gram size

  • max_ngram_size (int) – Maximum n-gram size

Return type:

dict

Returns:

Dict with the keys ‘score’ (range 0-1) and ‘percentage’ (range 0-100)
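One way an n-gram hash overlap could be computed is sketched below. It is an illustration under stated assumptions, not the behaviour of `calculate_overlap`: the whitespace tokenization, lowercasing, SHA-256 hashing truncated to 64 bits, and the choice of the reference hash count as the denominator are all assumptions, and `ngram_hash_set` and `overlap_sketch` are hypothetical names.

```python
import hashlib
from typing import Dict, Iterable, Set


def ngram_hash_set(text: str, min_n: int = 5, max_n: int = 7) -> Set[int]:
    """Hash every word n-gram of size min_n..max_n (assumed tokenization)."""
    words = text.lower().split()
    hashes: Set[int] = set()
    for n in range(min_n, max_n + 1):
        # range() is empty when the text has fewer than n words.
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            digest = hashlib.sha256(gram.encode("utf-8")).digest()
            hashes.add(int.from_bytes(digest[:8], "big"))
    return hashes


def overlap_sketch(
    submission: str,
    reference_hashes: Iterable[int],
    min_n: int = 5,
    max_n: int = 7,
) -> Dict[str, float]:
    """Fraction of reference hashes that also occur in the submission."""
    ref = set(reference_hashes)
    sub = ngram_hash_set(submission, min_n, max_n)
    if not ref or not sub:
        return {"score": 0.0, "percentage": 0.0}
    score = len(sub & ref) / len(ref)  # assumed denominator: reference size
    return {"score": score, "percentage": score * 100.0}


reference_text = "the quick brown fox jumps over the lazy dog again"
reference_hashes = ngram_hash_set(reference_text)

same = overlap_sketch(reference_text, reference_hashes)
partial = overlap_sketch("the quick brown fox jumps high today", reference_hashes)
different = overlap_sketch("nothing in this sentence matches the source at all", reference_hashes)
```

In this sketch a submission shorter than `min_n` words produces no n-grams and therefore scores zero.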

jef.copyrights.fingerprints.generate_fingerprints(reference, name, min_ngram_size=5, max_ngram_size=7, max_hashes=2000)

Generate fingerprints from a reference text.

Parameters:
  • reference (str) – The raw reference text

  • name (str) – Name identifier (e.g., “page_one”, “chapter_one”)

  • min_ngram_size (int) – Minimum n-gram size

  • max_ngram_size (int) – Maximum n-gram size

  • max_hashes (int) – Maximum number of hashes to store. Default 2000 provides good coverage for typical chapter-length text (~5000 words) while keeping fingerprint files compact (<20KB compressed).

Return type:

ReferenceFingerprints

Returns:

A ReferenceFingerprints object holding at most max_hashes n-gram hashes
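The effect of `max_hashes` can be sketched as follows. This is a hedged illustration only: how the module actually tokenizes, hashes, and selects which hashes to keep is not documented here. The sketch assumes whitespace tokenization, SHA-256 truncated to 64 bits, and a cap that keeps the numerically smallest hashes; `generate_sketch` is a hypothetical name.

```python
import hashlib
from typing import List


def generate_sketch(
    reference: str,
    min_n: int = 5,
    max_n: int = 7,
    max_hashes: int = 2000,
) -> List[int]:
    """Return up to max_hashes distinct n-gram hashes for the reference text."""
    words = reference.lower().split()
    seen = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            digest = hashlib.sha256(gram.encode("utf-8")).digest()
            seen.add(int.from_bytes(digest[:8], "big"))
    # Assumed selection rule: keep the smallest hashes so the result is
    # deterministic regardless of set iteration order.
    return sorted(seen)[:max_hashes]


# 50 distinct words yield 46 + 45 + 44 n-grams for sizes 5, 6, and 7.
text = " ".join(f"word{i}" for i in range(50))
full = generate_sketch(text)
capped = generate_sketch(text, max_hashes=10)
```

Sorting before truncating means the same reference text always yields the same fingerprint list, which keeps stored files reproducible.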