jef.copyrights.utils module

jef.copyrights.utils.calculate_ast_similarity(text1, text2)

Calculate similarity using Abstract Syntax Tree comparison, measuring what percentage of reference AST nodes appear in submission.

Return type:

float

jef.copyrights.utils.calculate_fingerprint_similarity(submission, reference, k=5)

Calculate similarity using Rabin-Karp fingerprinting, measuring what percentage of reference fingerprints appear in submission.

Return type:

float
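The idea behind this measure can be sketched as follows. This is a minimal illustration, not the module's implementation: the helper names (`window_hash`, `fingerprint_set`, `fingerprint_similarity_sketch`), the word-level windows, the modulus, and the 0.0 to 1.0 result scale are all assumptions.

```python
def window_hash(window: str, base: int = 101, mod: int = 2**61 - 1) -> int:
    """Polynomial hash of one window (the modulus is an assumed value)."""
    h = 0
    for ch in window:
        h = (h * base + ord(ch)) % mod
    return h


def fingerprint_set(text: str, k: int = 5) -> set[int]:
    """Hash every run of k consecutive words (word-level windows are assumed)."""
    words = text.lower().split()
    return {
        window_hash(" ".join(words[i:i + k]))
        for i in range(len(words) - k + 1)
    }


def fingerprint_similarity_sketch(submission: str, reference: str, k: int = 5) -> float:
    """Fraction of reference fingerprints that also occur in the submission."""
    ref = fingerprint_set(reference, k)
    if not ref:
        return 0.0  # reference shorter than k words: nothing to match
    sub = fingerprint_set(submission, k)
    return len(ref & sub) / len(ref)
```

Because the denominator is the number of reference fingerprints, the score is asymmetric: padding the submission with unrelated text does not lower it.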

jef.copyrights.utils.calculate_ngram_overlap(submission, reference, min_ngram_size=3, max_ngram_size=7)

Calculate n-gram overlap percentages for each n-gram size from min_ngram_size to max_ngram_size.

Return type:

Dict[int, float]
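A rough sketch of how a per-size overlap table might be computed. The function name `ngram_overlap_sketch`, the whitespace tokenisation, the choice of the reference n-grams as the denominator, and the 0.0 to 1.0 scale are assumptions for illustration, not the documented behaviour.

```python
def ngram_overlap_sketch(
    submission: str,
    reference: str,
    min_ngram_size: int = 3,
    max_ngram_size: int = 7,
) -> dict[int, float]:
    """Map each n-gram size to the share of reference n-grams found in the submission."""
    sub_words = submission.lower().split()
    ref_words = reference.lower().split()
    overlaps: dict[int, float] = {}
    for n in range(min_ngram_size, max_ngram_size + 1):
        sub_ngrams = {
            " ".join(sub_words[i:i + n]) for i in range(len(sub_words) - n + 1)
        }
        ref_ngrams = {
            " ".join(ref_words[i:i + n]) for i in range(len(ref_words) - n + 1)
        }
        # An empty reference set (text shorter than n words) scores 0.0.
        overlaps[n] = len(sub_ngrams & ref_ngrams) / len(ref_ngrams) if ref_ngrams else 0.0
    return overlaps
```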

jef.copyrights.utils.calculate_sentence_similarity(submission, reference)

Calculate sentence-level similarity using candidate selection for speed.

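Instead of comparing all sentence pairs, which costs O(n*m) comparisons for n submission sentences and m reference sentences, it selects the top-k candidates for each submission sentence based on token overlap, reducing the work to O(n*k) comparisons.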

Return type:

float
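The candidate-selection idea described above can be sketched like this. The sketch takes pre-split sentence lists rather than raw text, and the name `sentence_similarity_sketch`, the `top_k` parameter, the use of `difflib.SequenceMatcher` for the detailed comparison, and the averaging of best matches are all assumptions.

```python
from difflib import SequenceMatcher


def sentence_similarity_sketch(
    sub_sentences: list[str],
    ref_sentences: list[str],
    top_k: int = 3,
) -> float:
    """Average, over submission sentences, of the best match among top-k candidates."""
    if not sub_sentences or not ref_sentences:
        return 0.0
    ref_tokens = [set(s.lower().split()) for s in ref_sentences]
    total = 0.0
    for sentence in sub_sentences:
        tokens = set(sentence.lower().split())
        # Cheap filter: rank reference sentences by the number of shared tokens.
        candidates = sorted(
            range(len(ref_sentences)),
            key=lambda j: len(tokens & ref_tokens[j]),
            reverse=True,
        )[:top_k]
        # Expensive comparison runs only on the selected candidates.
        total += max(
            (
                SequenceMatcher(None, sentence.lower(), ref_sentences[j].lower()).ratio()
                for j in candidates
            ),
            default=0.0,
        )
    return total / len(sub_sentences)
```

The set intersection is much cheaper than a full sequence comparison, so the ranking step costs little compared with the comparisons it avoids.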

jef.copyrights.utils.find_exact_phrases(submission, reference, min_length=5)

Find exact matching phrases of at least min_length.

Return type:

List[str]
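One way to find such phrases is a longest-common-substring scan over words, sketched below. It assumes that `min_length` counts words and that only maximal runs are reported; neither is stated in this reference, and `find_exact_phrases_sketch` is a hypothetical name.

```python
def find_exact_phrases_sketch(
    submission: str, reference: str, min_length: int = 5
) -> list[str]:
    """Return maximal word runs shared by both texts with at least min_length words."""
    sub = submission.lower().split()
    ref = reference.lower().split()
    phrases: list[str] = []
    prev = [0] * (len(ref) + 1)
    for i in range(1, len(sub) + 1):
        cur = [0] * (len(ref) + 1)
        for j in range(1, len(ref) + 1):
            if sub[i - 1] != ref[j - 1]:
                continue
            # Length of the common run ending at sub[i-1] and ref[j-1].
            cur[j] = prev[j - 1] + 1
            # The run is maximal when the next pair of words does not match.
            run_ends = i == len(sub) or j == len(ref) or sub[i] != ref[j]
            if run_ends and cur[j] >= min_length:
                phrase = " ".join(sub[i - cur[j]:i])
                if phrase not in phrases:
                    phrases.append(phrase)
        prev = cur
    return phrases
```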

jef.copyrights.utils.get_ast_structure(text)

Return a dictionary describing the AST structure of the given text.

Return type:

dict

jef.copyrights.utils.get_fingerprints(text, k)

Return type:

tuple

jef.copyrights.utils.get_ngrams(words, n)

Generate n-grams from a list of words.

Return type:

List[str]
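A minimal sketch of word n-gram generation. Since the return type is a list of strings, the sketch assumes each n-gram is its words joined by a single space; `get_ngrams_sketch` is a hypothetical name.

```python
def get_ngrams_sketch(words: list[str], n: int) -> list[str]:
    """Slide a window of n words across the list and join each window into a string."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
```

A list of fewer than `n` words yields an empty list, because the range of window start positions is empty.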

jef.copyrights.utils.get_sentences(text)

Split text into sentences, keeping common abbreviations intact and enforcing a minimum sentence length.

Return type:

List[str]
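Abbreviation-aware splitting could be approached as in the sketch below. The abbreviation list, the `min_words` threshold, the `<DOT>` placeholder, and the name `get_sentences_sketch` are illustrative assumptions; the module's actual rules are not described here.

```python
import re

# Assumed sample of abbreviations whose periods must not end a sentence.
ABBREVIATIONS = ("Mr.", "Mrs.", "Dr.", "e.g.", "i.e.")


def get_sentences_sketch(text: str, min_words: int = 3) -> list[str]:
    """Split on sentence-ending punctuation, then drop very short fragments."""
    protected = text
    for abbr in ABBREVIATIONS:
        # Hide the abbreviation's periods so the splitter ignores them.
        protected = protected.replace(abbr, abbr.replace(".", "<DOT>"))
    parts = re.split(r"(?<=[.!?])\s+", protected)
    sentences = [part.replace("<DOT>", ".").strip() for part in parts]
    return [s for s in sentences if len(s.split()) >= min_words]
```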

jef.copyrights.utils.get_words(text)

Split text into words.

Return type:

List[str]

jef.copyrights.utils.jaccard_similarity(set1, set2)

Calculate the Jaccard similarity between two sets.

Return type:

float
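Jaccard similarity is the size of the intersection divided by the size of the union. A minimal sketch, assuming that two empty sets score 0.0 (the reference does not say how that case is handled) and using the hypothetical name `jaccard_sketch`:

```python
def jaccard_sketch(set1: set, set2: set) -> float:
    """Return |set1 & set2| / |set1 | set2|, or 0.0 when both sets are empty."""
    union = set1 | set2
    if not union:
        return 0.0  # assumed convention for the empty case
    return len(set1 & set2) / len(union)
```

For example, `{1, 2, 3}` and `{2, 3, 4}` share two of four distinct elements, giving 0.5.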

jef.copyrights.utils.normalize_text(text)

Normalize text by removing special characters and standardizing the format.

Return type:

str
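What "standardizing format" covers is not spelled out here. The sketch below shows one plausible reading, assuming lowercasing, replacing punctuation with spaces, and collapsing whitespace; `normalize_text_sketch` is a hypothetical name.

```python
import re


def normalize_text_sketch(text: str) -> str:
    """Lowercase, replace non-word characters with spaces, collapse whitespace."""
    lowered = text.lower()
    # Anything that is not a word character or whitespace becomes a space.
    cleaned = re.sub(r"[^\w\s]", " ", lowered)
    return " ".join(cleaned.split())
```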

jef.copyrights.utils.rolling_hash(text, base=101)

Calculate a rolling hash for a string using the Rabin-Karp algorithm.

Return type:

int
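A Rabin-Karp hash treats the string as a polynomial in `base`, which lets the hash of the next window be derived from the previous one in constant time. The sketch below illustrates both steps; the modulus `MOD` and the names `poly_hash` and `roll` are assumptions, not part of this module.

```python
MOD = 2**61 - 1  # assumed modulus; the documented signature exposes only `base`


def poly_hash(text: str, base: int = 101, mod: int = MOD) -> int:
    """Polynomial hash: sum of ord(c) * base**(position from the right), mod `mod`."""
    h = 0
    for ch in text:
        h = (h * base + ord(ch)) % mod
    return h


def roll(prev_hash: int, old_char: str, new_char: str, window_len: int,
         base: int = 101, mod: int = MOD) -> int:
    """Slide the window one character: drop old_char on the left, append new_char."""
    high = pow(base, window_len - 1, mod)  # weight of the leftmost character
    return ((prev_hash - ord(old_char) * high) * base + ord(new_char)) % mod
```

Rolling the hash of `"abc"` forward with `roll(poly_hash("abc"), "a", "d", 3)` gives the same value as hashing `"bcd"` from scratch.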

jef.copyrights.utils.string_similarity(a, b)

Calculate similarity ratio between two strings using SequenceMatcher.

Return type:

float
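For orientation, this is how a ratio is obtained from the standard library's `difflib.SequenceMatcher`. The wrapper name `string_similarity_sketch` is hypothetical, and any preprocessing the module applies before the comparison is not shown.

```python
from difflib import SequenceMatcher


def string_similarity_sketch(a: str, b: str) -> float:
    """Return 2 * M / T, where M is the matched character count and T the total length."""
    return SequenceMatcher(None, a, b).ratio()
```

For `"abcd"` and `"bcde"` the longest shared block is `"bcd"`, so the ratio is 2 * 3 / 8 = 0.75.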

jef.copyrights.utils.truncate_submission(sub, ref)

Return type:

str