
6. Algorithms

Given source-segment and target-segment databases, we have designed and developed a library that outputs a chosen number of candidates from the target data for each source-segment entry. The TeAM algorithm allows for a number of further options, including, among others, boosting the strength of a matched term using its lemma or soundex and matching possible abbreviations. To achieve this, we build on existing libraries such as pandas, fuzzywuzzy, spacy and ftfy. Moreover, we compare our approach with a simpler one mostly based on Gensim. Below we explain them both.

6.1 Gensim+

In order to have a baseline for comparing the TeAM approach, we chose the off-the-shelf library Gensim, proposed for text search. We implement it first by strictly following the instructions presented here. Since it is meant for exact term matching and is therefore unable to deal with terms that do not occur (exactly) in the target data, the results at first appear quite poor for our complex data. One way to deal with this issue is to convert the source and target data into their respective stems. This indeed improves the results, since matches are now approximated via the stems. Still, this does not account for misspellings or spelling variations. An attempt to address the latter issue led us to multiply the original query into a number of reconstructed queries, from which the best result is selected. To reconstruct a query, n approximations are computed for each stemmed term in the query. From these, new queries are generated as term combinations by n-ary Cartesian product. For example, a query of 5 terms will result in 32 new queries if 2 candidates are found for each term. This obviously does not scale in the event of (i) long queries and/or (ii) many candidates per term (even 2). However, the use of a single candidate does scale and provides at least one alternative for terms that do not exist in the target data, meaning that misspellings or spelling variations are now covered to a certain extent.
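To make the query-reconstruction step concrete, the sketch below (with a hypothetical helper name reconstruct_queries and made-up candidate lists) shows how the n-ary Cartesian product turns one stemmed query into a set of alternative queries:

from itertools import product

def reconstruct_queries(stemmed_query, candidates_per_term):
    # For each stemmed term, take its list of approximations (or the term itself
    # if none were found) and combine them by n-ary Cartesian product.
    options = [candidates_per_term.get(term, [term]) for term in stemmed_query]
    return [list(combo) for combo in product(*options)]

candidates = {'portret': ['portret', 'portrait'], 'vrouw': ['vrouw', 'vrou']}
print(len(reconstruct_queries(['een', 'portret', 'vrouw'], candidates)))  # 1 * 2 * 2 = 4 reconstructed queries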

To summarise, using the Gensim library we have developed and implemented a text-matching approach that also accounts for terms that do not exist in the target data.

6.2 TeAM

This section presents TeAM, the proposed algorithm for text matching. First, its parameters are described; then its four macro steps are explained.

Function Definition

results = TeAM.run(
    queries=montias_data, target=inventories, stop_words=dutch_stop_word, language='nl',
    max_candidates=5, boost=True, preprocess=True, normalise_text=True, find_abbreviated=True,
    logarithm=False, remove_number=True, no_ties=True, x_best_tf_idfs=50, bests=1,  sample=0)

 Parameter Description
    :param queries          : The source data. It can be a plain text or a list of text segments.
    :param target           : The target data. It can be a plain text or a list of text segments.
    :param stop_words       : List of common terms (stop words) that should not be indexed.
    :param language         : A string value determining the language (e.g en, de...) in which
                              the source and target data are written.
    :param max_candidates   : The number of candidate terms that can be returned for a given query-term.
    :param boost            : A boolean value for deciding whether to boost the strength of a matched
                              candidate below a lower bound of 75 using the soundex and lemma of the terms.
    :param preprocess       : A boolean value for deciding whether to apply the fix_text_segment function for fixing
                              inconsistencies and glitches in the source and target data
    :param normalise_text   : Defaulted to [True], it is applied to Dutch-written source and target data to
                              normalise certain character sequences. For example, 'sch' and 'kx' become
                              respectively 'se' and 'x'.
    :param normalize_number : Defaulted to [True], it determines whether to convert numbers to words
                              using num2words.
    :param logarithm        : Defaulted to [False], it uses the log function in computing tf-idf if set to [True]
    :param find_abbreviated : Defaulted to [True], it ensures that abbreviated terms are recognised and tagged,
                              and that possible full terms the abbreviation may stand for are searched for in the data.
    :param remove_number    : Defaulted to True, it ensures that numbers are not indexed
    :param no_ties          : Defaulted to [True]. When set to [True], only the first best candidate is
                              returned for a given query, even in the event of ties.
    :param x_best_tf_idfs   : Defaulted to 50. The tf-idf vector enables the detection of potential candidate
                              segments; the integer value indicates the number of best candidates to look into.
    :param bests            : By default, for a given query, the predicted best is returned (bests=1).
    :param sample           : If the integer value is greater than 0, it prints pandas tables of the source
                              index, the target index and the tf-idf vector of the target data.
    :return                 : a dictionary with a list of results for each query segment

Overview

Steps Overview
def run(queries: (str, list), target: (str, list), stop_words: list, language: str = 'nl',
        preprocess: bool = True, normalise_text: bool = True, normalize_number: bool = True,
        remove_number=True, find_abbreviated=True, no_ties: bool = True, max_candidates: int = 5,
        boost: bool = True, logarithm=False, x_best_tf_idfs: int = 50, bests: int = 1):

    global nl, en
    nl, en = nl_core_news_sm.load(), en_core_web_sm.load()

    # STEP 1: TEXT SEGMENTATION
    src_segments, source_ids_map = text_segments(queries)
    trg_segments, target_ids_map = text_segments(target)

    # 2.1 INDEXING THE SOURCE
    src_index, src_vectors, src_indexed_terms_per_doc = text_indexer(
        src_segments, stop_words=stop_words, language=language, normalize_text=normalise_text,
        normalize_number=normalize_number, preprocess=preprocess, find_abbreviated=find_abbreviated,
        logarithm=logarithm, boost=boost, remove_numbers=remove_number)

    # 2.2 INDEXING THE TARGET
    trg_index, trg_vectors, trg_indexed_terms_per_doc = text_indexer(
        trg_segments, stop_words=stop_words, language=language, normalize_text=normalise_text,
        normalize_number=normalize_number, preprocess=preprocess, find_abbreviated=find_abbreviated,
        logarithm=logarithm, boost=boost, remove_numbers=remove_number,
        build_tfidf=True)

    # 3. UPDATE THE SOURCE'S INDEXES WITH CANDIDATES
    # For each indexed term in the source index, find a maximum of 5 candidate terms in the target index
    find_candidates(src_index, trg_index, max_candidates=max_candidates, boost=boost)

    # 4. FOR A GIVEN SOURCE-SEGMENT FIND ONE OR MORE SIMILAR TARGET-SEGMENTS
    results = find_match(
        src_index, trg_index, trg_vectors, src_segments, trg_segments,
        src_indexed_terms_per_doc, x_best_tf_idfs, x_bests=bests, no_ties=no_ties)

    return results

STEP 1: Segmenting

The ideal input scenario for “text-segment” matching with the proposed library is for the source and target data to each be provided as a list of segments. However, if the data is provided as plain text instead, the library's basic text-segmentation function converts the plain-text document into a list of segments by splitting it on the \n character.
In short, for now, the default implementation is line-based segmentation when the input comes as plain text. In the future we expect to enable other types of segmentation, such as phrase-based or paragraph-based segmentation. Naturally, for the algorithm to produce sensible results, the source and target segment sets should follow the same segmentation criteria, or should at least be minimally compatible. For example, a source segmented as phrases and a target segmented as lines is not ideal, but can still provide meaningful results.
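A minimal sketch of this default behaviour is shown below; the helper name segment_lines and the shape of the returned id map are illustrative assumptions, not the library's actual text_segments implementation.

def segment_lines(data):
    # If the input is plain text, split on '\n' and treat each non-empty line as a segment;
    # if it is already a list of segments, keep it as is.
    if isinstance(data, str):
        segments = [line.strip() for line in data.split('\n') if line.strip()]
    else:
        segments = list(data)
    ids_map = {i: segment for i, segment in enumerate(segments)}  # assumed id-to-segment map
    return segments, ids_map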

STEP 2: Indexing

The function iterates over each segment of the input segment list. First, if the preprocess argument is set to true (default), the fix_text_segment function from ftfy is applied to each input segment. This fixes some inconsistencies and glitches. Then, if the normalize_number parameter is set to true (default), the numbers in the segments are converted into words, using regular expressions to detect numbers within the segment and the Python library num2words to convert them into word(s) according to the chosen language. Moreover, if the normalize_text parameter is set to true (default) and the language is set to Dutch, a language normalisation is applied to the segments using the following list of tuples (in future versions this list could be provided as a parameter):

('qu', 'kw'), ('sch', 'se'), ('ks', 'x'), ('kx', 'x'), ('kc', 'k'), ('ck', 'k'),
('dt', 't'), ('td', 't'), ('ch', 'g'), ('sz', 's'), ('ij', 'y')
Finally, we run spacy over the fixed segment for language-based tokenisation. Then, for each token, its inclusion in the index is ruled by (i) not punctuation; (ii) not a stop word, if such a list is provided; (iii) not number-like, if normalize_number and/or remove_numbers is set to true (default). If these conditions are met, the lowercased version of the token is indexed, and the property values of the term are documented accordingly.
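The sketch below illustrates this per-segment pipeline under a few assumptions: the helper name index_segment is hypothetical, ftfy.fix_text stands in for the library's use of fix_text_segment, and all flags are treated as enabled.

import re
import ftfy
import spacy
from num2words import num2words

DUTCH_NORMALISATION = [('qu', 'kw'), ('sch', 'se'), ('ks', 'x'), ('kx', 'x'), ('kc', 'k'),
                       ('ck', 'k'), ('dt', 't'), ('td', 't'), ('ch', 'g'), ('sz', 's'), ('ij', 'y')]

nlp = spacy.load('nl_core_news_sm')  # assumes the Dutch spacy model is installed

def index_segment(segment, stop_words=(), language='nl'):
    segment = ftfy.fix_text(segment)                                        # fix glitches and inconsistencies
    segment = re.sub(r'\d+', lambda m: num2words(int(m.group()), lang=language), segment)  # numbers to words
    for old, new in DUTCH_NORMALISATION:                                    # Dutch character normalisation
        segment = segment.replace(old, new)
    indexed_terms = []
    for token in nlp(segment):                                              # language-based tokenisation
        if token.is_punct or token.like_num or token.lower_ in stop_words:  # filtering rules (i)-(iii)
            continue
        indexed_terms.append(token.lower_)                                  # index the lowercased token
    return indexed_terms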

STEP 3: Candidate Terms

1. Approximations to a query-term. The input term (query-term) is of type string and represents either a full term or an abbreviated term. The task of finding m candidate-terms (m given by the max_candidates parameter, defaulted to 5) for each of the given query-terms is performed using our fuzzy_match function, implemented on top of the fuzzywuzzy library. For a given query-term, fuzzy_match searches for m candidates among the indexed target terms; only the ones with a minimum matching strength of 60 are returned. The function returns a dictionary of candidate-terms (abbreviated or not) with their respective matching strengths.
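As an illustration, a comparable lookup can be written directly on top of fuzzywuzzy; the sketch below is a hedged approximation of the idea, not the library's actual fuzzy_match:

from fuzzywuzzy import process

def candidate_terms(query_term, indexed_target_terms, max_candidates=5):
    # Keep at most max_candidates target terms whose matching strength is at least 60.
    matches = process.extractBests(query_term, indexed_target_terms,
                                   score_cutoff=60, limit=max_candidates)
    return {term: strength for term, strength in matches}

print(candidate_terms('schilderij', ['schilderije', 'schilder', 'paneel', 'portret']))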

2. Candidate strength boosting. If the boost parameter is set to true (default), the strengths of the previously selected candidate-terms can be boosted based on how similar their respective lemmas (and/or soundex) are to the corresponding query-terms. The boost is only applied to candidates for which the strength is below the lower bound of 75. For these relatively “weak” terms, the strength obtained by comparing lemmas and/or soundex is averaged with the approximation strength. If the resulting new strength is above a preset threshold of 75, an average of the original strength and the new one is computed; otherwise, it is left unmodified.
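The exact boosting formula is not spelled out above, so the following is only a rough sketch of the idea, using fuzz.ratio on lemmas and the jellyfish library's soundex as stand-ins for the library's own lemma and soundex comparison:

from fuzzywuzzy import fuzz
import jellyfish

def boost_strength(strength, query_lemma, candidate_lemma, threshold=75):
    # Similarity of the lemmas and, as a coarse proxy, of their soundex codes.
    lemma_strength = fuzz.ratio(query_lemma, candidate_lemma)
    same_sound = jellyfish.soundex(query_lemma) == jellyfish.soundex(candidate_lemma)
    similarity = max(lemma_strength, 100 if same_sound else 0)
    boosted = (strength + similarity) / 2          # average with the approximation strength
    # Only keep the boost if it lifts the candidate above the preset threshold.
    return (strength + boosted) / 2 if boosted > threshold else strength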

3. Abbreviations or extensions to a query token. Roughly, terms are considered possible abbreviations if they are followed by a dot that has not been classified as punctuation by spacy.

The input here is the pandas frame generated in the previous section. The idea is to find within the frame potential extensions (full tokens) for an abbreviated token (a shortened form of a token). For example, the full token civic can extend the abbreviated cvc., whereas civilian does not. The goal is to speed up the (later) process of matching possible abbreviations, if the parameter find_abbreviated is set to true (default).
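As a hypothetical illustration (not necessarily the library's exact rule), a full token can be taken to extend an abbreviation when the abbreviation's letters occur, in the same order, inside the full token:

def can_extend(full_token, abbreviated_token):
    # 'cvc.' can be extended by 'civic' because c, v, c appear in that order in 'civic';
    # 'civilian' fails because no second 'c' follows the 'v'.
    position = 0
    for letter in abbreviated_token.rstrip('.'):
        position = full_token.find(letter, position)
        if position == -1:
            return False
        position += 1
    return True

print(can_extend('civic', 'cvc.'), can_extend('civilian', 'cvc.'))  # True False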

STEP 4: Candidate Segments

1. Collect candidate-terms for the query-terms. After the indexing of both source and target, each entry in the source index is populated with up to m=5 candidates from the target index, as described in step 3. This allows us to quickly retrieve candidates for each of the query-terms. For example, given a query-segment with 6 terms and max_candidates set to 5, we can quickly retrieve at most 30 candidate-terms from various target-segments, which will be further evaluated in combination.
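Assuming the source index maps each indexed source term to its candidate target terms and strengths (as step 3 suggests), collecting the candidates for one query-segment amounts to a simple lookup, sketched below:

def collect_candidates(query_terms, src_index):
    # src_index is assumed to map each source term to a {candidate_term: strength} dict.
    return {term: src_index.get(term, {}) for term in query_terms}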

2. Retrieve all candidate-segments based on their tf-idf

A tf-idf matrix computed for the target segments is filtered based on the previously collected candidate-terms, so that it contains only the documents/segments in which at least one of the candidate-terms occurs. The candidate-segments are then sorted in descending order based on the sum of their respective terms' tf-idf weights, meaning the higher the score the better. We then select the first n best candidate-segments, according to the parameter x_best_tf_idfs (default 50).
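The sketch below illustrates this pre-selection, assuming trg_vectors maps each target-segment id to a dictionary of tf-idf weights (the real data structure in the library may differ):

def best_tfidf_segments(candidate_terms, trg_vectors, x_best_tf_idfs=50):
    scores = {}
    for segment_id, weights in trg_vectors.items():
        score = sum(weights[term] for term in candidate_terms if term in weights)
        if score > 0:                                 # keep only segments containing a candidate-term
            scores[segment_id] = score
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)  # higher is better
    return ranked[:x_best_tf_idfs]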

3. Compute the number of hits obtained per document. At this step, for each of the n candidate-segments, we sum up the strengths of its composing candidate-terms. This gives an idea of the overall matching strength for a given candidate-segment.

Then, we compute their delta-hit, i.e. the difference between the maximum possible hit and the observed hit. Consequently, the candidates are reordered in ascending order based on their delta-hits, to which an extra penalty and reward are applied. The reward counts the number of intersecting terms between source and target, while the penalty looks at the length offset of the target candidate-segment with respect to the query-segment. Here, the smaller the score the better, since a small delta implies being closer to the goal.
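Because the exact reward and penalty formulas are not given here, the following is only an indicative sketch of the re-ordering by delta-hit, with simple stand-ins for both:

def rank_by_delta_hit(candidate_hits, query_terms, max_possible_hit):
    # candidate_hits is assumed to map a segment id to (observed_hit, target_terms).
    ranked = []
    for segment_id, (hit, target_terms) in candidate_hits.items():
        delta = max_possible_hit - hit
        reward = len(set(query_terms) & set(target_terms))       # intersecting terms
        penalty = abs(len(target_terms) - len(query_terms))      # length offset of the candidate
        ranked.append((delta - reward + penalty, segment_id))
    return sorted(ranked, key=lambda item: item[0])              # ascending: smaller is better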

In the end, we return the candidate-segment(s) according to the parameterised argument bests (default set to 1). In the default setting, the function returns the first best even in the event of ties. If the value assigned to bests is greater than 1, the function returns the number of candidates requested. However, in case of ties, it returns all of them, even if the requested number is less than the number of ties.
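One possible reading of this selection logic is sketched below, where ranked is the ascending list produced in the previous step and the names are illustrative:

def select_bests(ranked, bests=1, no_ties=True):
    if not ranked:
        return []
    if bests == 1 and no_ties:
        return ranked[:1]                      # default: first best, even in the event of ties
    selected = list(ranked[:bests])
    cutoff = selected[-1][0]
    for score, segment_id in ranked[bests:]:
        if score != cutoff:
            break
        selected.append((score, segment_id))   # ties with the last selected score are all returned
    return selected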