jef.copyrights.utils module
- jef.copyrights.utils.calculate_ast_similarity(text1, text2)
Calculate similarity using Abstract Syntax Tree comparison, measuring what percentage of the reference AST nodes appear in the submission.
- Return type:
float
- jef.copyrights.utils.calculate_fingerprint_similarity(submission, reference, k=5)
Calculate similarity using Rabin-Karp fingerprinting, measuring what percentage of the reference fingerprints appear in the submission.
- Return type:
float
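The fingerprinting idea described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the names `sketch_fingerprints` and `sketch_fingerprint_similarity`, the modulus, the character-level windows, and the choice to return a fraction in [0, 1] are all assumptions.

```python
from typing import Set


def sketch_fingerprints(text: str, k: int = 5, base: int = 101,
                        mod: int = 2**61 - 1) -> Set[int]:
    """Hash every k-character window of text with a polynomial rolling hash.

    Hypothetical helper: base, mod, and character windows are assumptions.
    """
    if k <= 0 or len(text) < k:
        return set()
    high = pow(base, k - 1, mod)  # weight of the character leaving the window
    h = 0
    for ch in text[:k]:
        h = (h * base + ord(ch)) % mod
    prints = {h}
    for i in range(k, len(text)):
        # Drop the oldest character, then shift and append the newest one.
        h = (h - ord(text[i - k]) * high) % mod
        h = (h * base + ord(text[i])) % mod
        prints.add(h)
    return prints


def sketch_fingerprint_similarity(submission: str, reference: str,
                                  k: int = 5) -> float:
    """Fraction of reference fingerprints that also occur in the submission."""
    ref = sketch_fingerprints(reference, k)
    if not ref:
        return 0.0
    sub = sketch_fingerprints(submission, k)
    return len(ref & sub) / len(ref)
```

Dividing by the number of reference fingerprints, rather than by the size of the union, makes the score asymmetric: extra text in the submission does not lower it.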
- jef.copyrights.utils.calculate_ngram_overlap(submission, reference, min_ngram_size=3, max_ngram_size=7)
Calculate n-gram overlap percentages for each n-gram size from min_ngram_size to max_ngram_size.
- Return type:
Dict[int, float]
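As a rough sketch of what a per-size overlap dictionary could look like, the example below builds word n-grams and reports, for each size, the share of reference n-grams found in the submission. The names `sketch_ngrams` and `sketch_ngram_overlap`, the whitespace tokenization, the inclusive upper bound, and the choice of the reference as denominator are assumptions, not confirmed behavior of the module.

```python
from typing import Dict, List


def sketch_ngrams(words: List[str], n: int) -> List[str]:
    """Join every run of n consecutive words into one n-gram string."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def sketch_ngram_overlap(submission: str, reference: str,
                         min_n: int = 3, max_n: int = 7) -> Dict[int, float]:
    """Map each n-gram size to the fraction of reference n-grams matched."""
    sub_words = submission.lower().split()
    ref_words = reference.lower().split()
    overlap: Dict[int, float] = {}
    for n in range(min_n, max_n + 1):  # assumption: max_n is inclusive
        ref_grams = set(sketch_ngrams(ref_words, n))
        sub_grams = set(sketch_ngrams(sub_words, n))
        if ref_grams:
            overlap[n] = len(ref_grams & sub_grams) / len(ref_grams)
        else:
            overlap[n] = 0.0  # reference too short for this size
    return overlap
```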
- jef.copyrights.utils.calculate_sentence_similarity(submission, reference)
Calculate sentence-level similarity using candidate selection for speed.
Instead of comparing all O(n*m) sentence pairs, it selects the top-k candidates for each submission sentence based on token overlap, reducing the work to O(n*k) comparisons.
- Return type:
float
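The candidate-selection strategy described above might look roughly like the sketch below. It is an illustration under stated assumptions: the function name `sketch_sentence_similarity`, the `top_k` parameter, the inverted index, the use of `difflib.SequenceMatcher` for the detailed comparison, and averaging the best match per submission sentence are all hypothetical choices, and the function takes pre-split sentence lists rather than raw text.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher
from typing import Dict, List, Set


def sketch_sentence_similarity(sub_sentences: List[str],
                               ref_sentences: List[str],
                               top_k: int = 3) -> float:
    """Average, over submission sentences, of the best candidate match."""
    if not sub_sentences or not ref_sentences:
        return 0.0
    # Inverted index: token -> indices of reference sentences containing it.
    index: Dict[str, Set[int]] = defaultdict(set)
    for j, ref in enumerate(ref_sentences):
        for token in set(ref.lower().split()):
            index[token].add(j)
    total = 0.0
    for sub in sub_sentences:
        # Count shared tokens per reference sentence (cheap pre-filter).
        votes: Counter = Counter()
        for token in set(sub.lower().split()):
            for j in index.get(token, ()):
                votes[j] += 1
        best = 0.0
        # Run the expensive comparison only on the top_k candidates.
        for j, _ in votes.most_common(top_k):
            ratio = SequenceMatcher(None, sub.lower(),
                                    ref_sentences[j].lower()).ratio()
            best = max(best, ratio)
        total += best
    return total / len(sub_sentences)
```

Reference sentences that share no token with a submission sentence are never compared in detail, which is where the saving over an all-pairs comparison comes from.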
- jef.copyrights.utils.find_exact_phrases(submission, reference, min_length=5)
Find exact matching phrases of at least the minimum length.
- Return type:
List[str]
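One way to find shared phrases is to look for runs of identical consecutive words. The sketch below uses `difflib.SequenceMatcher.get_matching_blocks` on word lists. It assumes `min_length` counts words, which the documentation above does not state, and the name `sketch_find_exact_phrases` is hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def sketch_find_exact_phrases(submission: str, reference: str,
                              min_length: int = 5) -> List[str]:
    """Return word runs shared by both texts with at least min_length words."""
    sub_words = submission.split()
    ref_words = reference.split()
    # autojunk=False keeps frequent words from being ignored in long inputs.
    matcher = SequenceMatcher(None, sub_words, ref_words, autojunk=False)
    return [
        " ".join(sub_words[block.a:block.a + block.size])
        for block in matcher.get_matching_blocks()
        if block.size >= min_length
    ]
```

Note that `get_matching_blocks` returns a non-overlapping, in-order alignment, so a phrase that appears in a different position in the two texts can be missed. An exhaustive search would need a different approach.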
- jef.copyrights.utils.get_ast_structure(text)
Return a dictionary describing the AST structure of the given text.
- Return type:
dict
- jef.copyrights.utils.get_fingerprints(text, k)
- Return type:
tuple
- jef.copyrights.utils.get_ngrams(words, n)
Generate n-grams from a list of words.
- Return type:
List[str]
- jef.copyrights.utils.get_sentences(text)
Split text into sentences while preserving common abbreviations and enforcing a minimum length.
- Return type:
List[str]
- jef.copyrights.utils.get_words(text)
Split text into words.
- Return type:
List[str]
- jef.copyrights.utils.jaccard_similarity(set1, set2)
Calculate the Jaccard similarity between two sets.
- Return type:
float
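For reference, Jaccard similarity is the size of the intersection divided by the size of the union. A minimal sketch follows. The name `sketch_jaccard` is hypothetical, and returning 0.0 when both sets are empty is an assumption, since the documentation above does not say how that case is handled.

```python
from typing import Hashable, Set


def sketch_jaccard(set1: Set[Hashable], set2: Set[Hashable]) -> float:
    """Return |set1 & set2| / |set1 | set2|."""
    union = set1 | set2
    if not union:
        return 0.0  # assumption: two empty sets score 0.0
    return len(set1 & set2) / len(union)


# {2, 3} are shared out of {1, 2, 3, 4}, so the score is 2 / 4.
score = sketch_jaccard({1, 2, 3}, {2, 3, 4})
```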
- jef.copyrights.utils.normalize_text(text)
Normalize text by removing special characters and standardizing the format.
- Return type:
str
- jef.copyrights.utils.rolling_hash(text, base=101)
Calculate a rolling hash for a string using the Rabin-Karp algorithm.
- Return type:
int
- jef.copyrights.utils.string_similarity(a, b)
Calculate similarity ratio between two strings using SequenceMatcher.
- Return type:
float
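The standard library's `difflib.SequenceMatcher` computes its ratio as 2 * M / T, where M is the number of matched characters and T is the combined length of both strings. The short example below calls it directly. The wrapper name `sketch_string_similarity` is hypothetical, and whether the module preprocesses its inputs first is not stated above.

```python
from difflib import SequenceMatcher


def sketch_string_similarity(a: str, b: str) -> float:
    """Return SequenceMatcher's similarity ratio, a float in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


# "bcd" is the longest common block: 2 * 3 matched / 8 total characters.
ratio = sketch_string_similarity("abcd", "bcde")
```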
- jef.copyrights.utils.truncate_submission(sub, ref)
- Return type:
str