7. TESTS.

In this section we present the results of the conducted experiments. First, we compare results using both approaches previously described. Then we present results obtained by using different settings for the TeAM approach.

For all experiments we use the stop-words for Dutch text listed below. They were obtained here, except for the terms ‘noch’, ‘den’ and ‘vant’, which we added.

Stop-words
dutch_stop_words = 
    ['aan',       'af',       'al',       'alles',        'als',      'altijd',   'andere',
     'ben',       'bij',      'daar',     'dan',          'dat',      'de',       'der',
     'deze',      'die',      'dit',      'doch',         'doen',     'door',     'dus',
     'een',       'eens',     'en',       'er',           'ge',       'geen',     'geweest',
     'haar',      'had',      'heb',      'hebben',       'heeft',    'hem',      'het',
     'hier',      'hij',      'hoe',      'hun',          'iemand',   'iets',     'ik',
     'in',        'is',       'ja',       'je',           'kan',      'kon',      'kunnen',
     'maar',      'me',       'meer',     'men',          'met',      'mij',      'mijn',
     'moet',      'na',       'naar',     'niet',         'niets',    'nog',      'nu',
     'of',        'om',       'omdat',    'ons',          'ook',      'op',       'over',
     'reeds',     'te',       'tegen',    'toch',         'toen',     'tot',      'u',
     'uit',       'uw',       'van',      'veel',         'voor',     'want',     'waren',
     'was',       'wat',      'we',       'wel',          'werd',     'wezen',    'wie',
     'wij',       'wil',      'worden',   'zal',          'ze',       'zei',      'zelf',
     'zich',      'zij',      'zijn',     'zo',           'zonder',   'zou',      'noch', 
     'den',       'vant']

7.1 Gensim++ versus TeAM

In this section, we compare both approaches by increasing the parameter that defines the maximum number m of candidate approximations allowed for each term in a query. This parameter can have a big impact on the algorithm because of its combinatorial nature: in principle, the 0 <= n <= m candidates for each term need to be combined. The table below presents the execution times and F1 scores for m in the interval [1, 10]. It shows that the proposed approach (TeAM) outperforms our implementation of Gensim++, which appears fast at first but degrades quickly as m grows. TeAM's success is largely because the combinatorial nature of this parameter is addressed in a way that does not affect performance, in spite of all the extra features offered compared to the simpler approach based on Gensim++. In this experiment, TeAM is at its best (F1 = 0.95) with both 6 and 7 candidate options per query term.
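To illustrate the combinatorial nature of this parameter, the sketch below (with hypothetical names and toy data, not TeAM's actual code) enumerates the naive search space of one-candidate-per-term selections, which grows as m^n for n query terms:

```python
from itertools import product

def candidate_combinations(candidates_per_term):
    """Enumerate every way of picking one candidate per query term.

    `candidates_per_term` maps each query term to its (at most m)
    approximate candidates; the naive search space is the Cartesian
    product of these lists, i.e. m^n combinations for n terms.
    """
    terms = list(candidates_per_term)
    return [dict(zip(terms, combo))
            for combo in product(*candidates_per_term.values())]

# With m = 3 candidates for each of 2 terms, 3^2 = 9 combinations arise.
combos = candidate_combinations({
    'nederlandse': ['nederlandse', 'neederlandsche', 'nederlantse'],
    'historien':   ['historien', 'histoorien', 'historiën'],
})
```

This is exactly the blow-up TeAM must sidestep: evaluating all m^n combinations explicitly would dominate the running time as m increases.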

Candidates Per Term |        Gensim++         |  TeAM - Best 50 TF_IDF  | TeAM - Best 100 TF_IDF
         1          | 0:00:36.943996 (0.870)  | 0:00:39.220609 (0.890)  | 0:00:42.995636 (0.890)
         2          | 0:01:39.538036 (0.768)  | 0:00:39.690304 (0.900)  | 0:00:42.633038 (0.900)
         3          |                         | 0:00:40.725126 (0.925)  | 0:00:42.508065 (0.920)
         4          |                         | 0:00:39.612479 (0.945)  | 0:00:43.408359 (0.940)
         5          |                         | 0:00:39.980332 (0.945)  | 0:00:42.566867 (0.940)
         6          |                         | 0:00:40.150912 (0.950)  | 0:00:42.716441 (0.940)
         7          |                         | 0:00:40.094593 (0.950)  | 0:00:43.328592 (0.935)
         8          |                         | 0:00:39.921235 (0.950)  | 0:00:43.198330 (0.940)
         9          |                         | 0:00:39.811426 (0.935)  |
        10          |                         | 0:00:40.319461 (0.930)  |

Result Discussion

We observe that increasing the number of candidates per query term, or the number of best tf_idf candidate segments, negatively impacts the quality of the results produced by TeAM once past its optimum. This contradicts our expectation. As the table above shows, past eight candidates per query term the F1 score starts decreasing.

The right answer is not always the best match, which explains the fluctuation of the F1 scores.

This feature vector favours rare words, which means it flushes out documents with uncommon misspellings: for example, ‘Neederlandsche histoorien’ gets a much higher tf_idf than ‘Nederlandse historien’ because the correct spellings of ‘Nederlandse’ and ‘historien’ are more frequent than the misspellings. To compensate for this, our final choice for the best candidate(s) requires that “the best one(s)” (i) be in the pool of best tf_idfs and (ii) have the best hit.
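The effect can be reproduced with a minimal inverse-document-frequency computation. The smoothed formula and the toy corpus below are illustrative assumptions, not TeAM's exact weighting:

```python
import math

def idf(term, docs):
    """Smoothed inverse document frequency: rare terms score higher."""
    df = sum(term in doc for doc in docs)
    return math.log((1 + len(docs)) / (1 + df)) + 1

# A toy corpus where the misspelling occurs once but the correct
# spelling occurs often: the misspelling gets the higher idf weight,
# so a query matching it scores higher than one matching the
# well-spelled, more frequent form.
docs = [{'neederlandsche', 'histoorien'}] + [{'nederlandse', 'historien'}] * 9
```

Here idf('neederlandsche', docs) exceeds idf('nederlandse', docs), mirroring how the rare misspelled segment outranks the correctly spelled one.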

TEST-1: Default Settings

The first experiment uses the default setting of each parameter, as shown in the code below, followed by the result of the execution. The obtained F1 measure is 0.945 and the execution time is about 40s.

Default Settings
results = TeAM.run(
            queries=source_data, target=target_data, stop_words=TeAM.dutch_stop_words, language=language,
            preprocess=True, normalise_text=True, normalize_number=True, remove_number=True,
            find_abbreviated=True, boost_candidates=True, logarithm=False, ties=False,
            max_candidates=7, n_best_tf_idfs=50, n_bests=1, threshold=0)
Default Settings Results
1. Segmenting the source and target Text                   200 | 4179
2.1 Indexing the sources segments                       0:00:03.035731

    257      .................petrus.................   in line    200 / 200

    1. So far with a dictionary of 257 indexes          0:00:01.639557
    2. So far with Frames                               0:00:01.641369
    3. So far with abbreviation                         0:00:01.644959

2.2 Indexing the targets segments                       0:00:04.680766

    3617     ...............beukelaere...............   in line   4178 / 4179

    1. So far with a dictionary of 3617 indexes         0:00:29.802926
        ->     3617 / 3617     beukelaere               0:00:00.034006
    2. So far with Frames                               0:00:29.845476
    3. So far with abbreviation                         0:00:29.900189

3. Finding 5 candidates per sources term                0:00:34.581303

    257      ................petrus.................    0:00:05.248847 [Approx: 5 s] [Boost: 5 s] [Abbrev: 5 s]

4. Given a src-seg find a similar trg-seg               0:00:39.849065

    Document [..199] een cleijn bortje met een ebben lijstje van den apostel Petrus

5. It took a total of                                  0:00:40.736017


                   -----------------------------------
                   |       4179 GROUND TRUTHS        |
                   -----------------------------------
                   |  GT Positive   |  GT Negative   |
                   |      200       |      3979      |
---------------------------------------------------------------------------------------------------------
        | Positive | True Positive  | False Positive |      Precision      | False discovery rate (FDR) |
        |   200    |      189       |       11       |        0.945        |           0.055            |
PREDICT -------------------------------------------------------------------------------------------------
  4179  | Negative | False Negative | True Negative  | False omission rate | Negative predictive value  |
        |   3979   |       11       |      3968      |        0.003        |           0.997            |
---------------------------------------------------------------------------------------------------------
                   |     Recall     |    Fall-out    | P. Likelihood Ratio |   F1 score     Accuracy    |
                   |     0.945      |     0.003      |       315.0         |     0.945        0.995     |
                   --------------------------------------------------------------------------------------
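The scores in the table above follow directly from the confusion-matrix counts. A minimal sketch using the default-settings counts (189 true positives, 11 false positives, 11 false negatives):

```python
def f1_from_counts(tp, fp, fn):
    """Compute precision, recall and F1 from raw confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Default-settings run: 200 predicted positives, 189 of them correct.
precision, recall, f1 = f1_from_counts(tp=189, fp=11, fn=11)
```

Because the number of predicted positives equals the number of ground-truth positives (200), precision and recall coincide here, and F1 equals both: 189/200 = 0.945.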

TEST-2: stop_words=None

In this experiment, the result shows that not applying stop words appears detrimental to TeAM's performance compared to the default settings, producing an F1 score of 0.925 against the original 0.945, with a negligible execution-time difference.

Stop_words=None
                   -----------------------------------
                   |       4179 GROUND TRUTHS        |
                   -----------------------------------
                   |  GT Positive   |  GT Negative   |
                   |      200       |      3979      |
---------------------------------------------------------------------------------------------------------
        | Positive | True Positive  | False Positive |      Precision      | False discovery rate (FDR) |
        |   200    |      185       |       15       |        0.925        |           0.075            |
PREDICT -------------------------------------------------------------------------------------------------
  4179  | Negative | False Negative | True Negative  | False omission rate | Negative predictive value  |
        |   3979   |       15       |      3964      |        0.004        |           0.996            |
---------------------------------------------------------------------------------------------------------
                   |     Recall     |    Fall-out    | P. Likelihood Ratio |   F1 score     Accuracy    |
                   |     0.925      |     0.004      |       231.25        |     0.925        0.993     |
                   --------------------------------------------------------------------------------------
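As a rough sketch of what stop-word filtering removes before matching (the exact TeAM pipeline may differ), using an excerpt of the Dutch list above:

```python
def remove_stop_words(tokens, stop_words):
    """Drop high-frequency function words that carry little matching signal."""
    stop = set(stop_words)
    return [t for t in tokens if t.lower() not in stop]

stop_excerpt = ['een', 'van', 'den', 'met', 'het']  # excerpt of the full list
tokens = 'een cleijn bortje van den apostel Petrus'.split()
content_tokens = remove_stop_words(tokens, stop_excerpt)
```

Only the content-bearing tokens survive, which is why dropping this step lets frequent function words dilute the match and costs two F1 points in this experiment.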

TEST-3: boost=False

Here again, compared to the default settings, the result shows that not applying the candidate boost appears detrimental to TeAM's performance, producing an F1 score of 0.925 against the original 0.945, with a negligible execution-time difference.

Boost=False
                   -----------------------------------
                   |       4179 GROUND TRUTHS        |
                   -----------------------------------
                   |  GT Positive   |  GT Negative   |
                   |      200       |      3979      |
---------------------------------------------------------------------------------------------------------
        | Positive | True Positive  | False Positive |      Precision      | False discovery rate (FDR) |
        |   200    |      185       |       15       |        0.925        |           0.075            |
PREDICT -------------------------------------------------------------------------------------------------
  4179  | Negative | False Negative | True Negative  | False omission rate | Negative predictive value  |
        |   3979   |       15       |      3964      |        0.004        |           0.996            |
---------------------------------------------------------------------------------------------------------
                   |     Recall     |    Fall-out    | P. Likelihood Ratio |   F1 score     Accuracy    |
                   |     0.925      |     0.004      |       231.25        |     0.925        0.993     |
                   --------------------------------------------------------------------------------------

TEST-4: preprocess=False

Removing the text preprocessing, which consists of fixes for inconsistencies and glitches, appears not to affect TeAM's result in this experiment: the F1 score of 0.945 is identical to the original. This may suggest that the data used is largely free of inconsistencies and glitches. Since clean data cannot always be assumed, and the execution-time difference is negligible, this feature is enabled by default.

preprocess=False
                   -----------------------------------
                   |       4179 GROUND TRUTHS        |
                   -----------------------------------
                   |  GT Positive   |  GT Negative   |
                   |      200       |      3979      |
---------------------------------------------------------------------------------------------------------
        | Positive | True Positive  | False Positive |      Precision      | False discovery rate (FDR) |
        |   200    |      189       |       11       |        0.945        |           0.055            |
PREDICT -------------------------------------------------------------------------------------------------
  4179  | Negative | False Negative | True Negative  | False omission rate | Negative predictive value  |
        |   3979   |       11       |      3968      |        0.003        |           0.997            |
---------------------------------------------------------------------------------------------------------
                   |     Recall     |    Fall-out    | P. Likelihood Ratio |   F1 score     Accuracy    |
                   |     0.945      |     0.003      |       315.0         |     0.945        0.995     |
                   --------------------------------------------------------------------------------------

TEST-5: normalise=False

Compared to the default settings, the result shows that not applying text normalisation appears detrimental to TeAM's performance in this experiment, producing an F1 score of 0.93 against the original 0.945. The execution-time difference is again negligible. Keep in mind that the current code provides only Dutch normalisation.

normalise=False
                   -----------------------------------
                   |       4179 GROUND TRUTHS        |
                   -----------------------------------
                   |  GT Positive   |  GT Negative   |
                   |      200       |      3979      |
---------------------------------------------------------------------------------------------------------
        | Positive | True Positive  | False Positive |      Precision      | False discovery rate (FDR) |
        |   200    |      186       |       14       |        0.93         |            0.07            |
PREDICT -------------------------------------------------------------------------------------------------
  4179  | Negative | False Negative | True Negative  | False omission rate | Negative predictive value  |
        |   3979   |       14       |      3965      |        0.004        |           0.996            |
---------------------------------------------------------------------------------------------------------
                   |     Recall     |    Fall-out    | P. Likelihood Ratio |   F1 score     Accuracy    |
                   |      0.93      |     0.004      |       232.5         |     0.93         0.993     |
                   --------------------------------------------------------------------------------------
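A hypothetical normaliser in the spirit of this step is sketched below. The specific substitution rules are illustrative assumptions only; TeAM's actual Dutch normalisation rules are not reproduced here:

```python
import unicodedata

def normalise(text):
    """Hypothetical normaliser: lowercase, strip diacritics, and collapse
    a few archaic Dutch spellings. Illustrative rules, not TeAM's own."""
    text = unicodedata.normalize('NFKD', text.lower())
    text = ''.join(c for c in text if not unicodedata.combining(c))
    for old, new in (('ae', 'a'), ('gh', 'g'), ('ck', 'k')):
        text = text.replace(old, new)
    return text

# Spelling variants collapse to the same key before matching.
key_a = normalise('Beukelaere')
key_b = normalise('beukelare')
```

Collapsing historical spelling variants onto one key is what makes otherwise near-miss terms comparable, which is consistent with the F1 drop observed when the step is disabled.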

TEST-6: find_abbreviated=False

Finding abbreviations helps approximate an abbreviated term to its full form in either the source or the target data. This is expected to help match documents known to contain abbreviations. However, the results of this experiment show that it has no impact on our data, meaning that matching the non-abbreviated terms is enough to find the right segment match. Since this may not always be the case, and the execution-time difference is negligible, this feature is enabled by default.

find_abbreviated=False
                   -----------------------------------
                   |       4179 GROUND TRUTHS        |
                   -----------------------------------
                   |  GT Positive   |  GT Negative   |
                   |      200       |      3979      |
---------------------------------------------------------------------------------------------------------
        | Positive | True Positive  | False Positive |      Precision      | False discovery rate (FDR) |
        |   200    |      189       |       11       |        0.945        |           0.055            |
PREDICT -------------------------------------------------------------------------------------------------
  4179  | Negative | False Negative | True Negative  | False omission rate | Negative predictive value  |
        |   3979   |       11       |      3968      |        0.003        |           0.997            |
---------------------------------------------------------------------------------------------------------
                   |     Recall     |    Fall-out    | P. Likelihood Ratio |   F1 score     Accuracy    |
                   |     0.945      |     0.003      |       315.0         |     0.945        0.995     |
                    --------------------------------------------------------------------------------------
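One plausible way such an expansion step can work is sketched below; the trigger (a trailing period) and the prefix-matching rule are assumptions for illustration, not TeAM's documented behaviour:

```python
def expand_abbreviation(abbrev, vocabulary):
    """Hypothetical expansion: treat a token ending in '.' as an
    abbreviation and match it as a prefix of known full forms."""
    if not abbrev.endswith('.'):
        return [abbrev]  # not abbreviated: keep the token unchanged
    stem = abbrev.rstrip('.').lower()
    matches = [w for w in vocabulary if w.lower().startswith(stem)]
    return matches or [abbrev]  # fall back to the original token

vocab = ['petrus', 'apostel', 'beukelaere']
expanded = expand_abbreviation('petr.', vocab)
```

When the surrounding non-abbreviated terms already pin down the right segment, as in this experiment, the expansion adds candidates without changing the final match.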

TEST-7: normalize_number, remove_number=False

In general, we expect the removal of numbers to be good practice, especially when they add no value to the search; doing so decreases space usage and computation. As the result shows, in our experiment removing or normalising numbers does not impact the result. A benefit of keeping numbers, though, is that they may help break ties between matches when relevant.

normalize_number, remove_number=False
                   -----------------------------------
                   |       4179 GROUND TRUTHS        |
                   -----------------------------------
                   |  GT Positive   |  GT Negative   |
                   |      200       |      3979      |
---------------------------------------------------------------------------------------------------------
        | Positive | True Positive  | False Positive |      Precision      | False discovery rate (FDR) |
        |   200    |      189       |       11       |        0.945        |           0.055            |
PREDICT -------------------------------------------------------------------------------------------------
  4179  | Negative | False Negative | True Negative  | False omission rate | Negative predictive value  |
        |   3979   |       11       |      3968      |        0.003        |           0.997            |
---------------------------------------------------------------------------------------------------------
                   |     Recall     |    Fall-out    | P. Likelihood Ratio |   F1 score     Accuracy    |
                   |     0.945      |     0.003      |       315.0         |     0.945        0.995     |
                   --------------------------------------------------------------------------------------
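The two treatments of numbers can be sketched with a single regex pass. The placeholder token and the word-boundary rule are illustrative assumptions, not TeAM's exact implementation:

```python
import re

def normalise_numbers(text, remove=True):
    """Either strip standalone digit runs or map them all to one
    placeholder, so e.g. '1652' and '1653' no longer distinguish
    otherwise identical segments."""
    replacement = '' if remove else '<num>'
    text = re.sub(r'\b\d+\b', replacement, text)
    return ' '.join(text.split())  # squeeze leftover whitespace

removed = normalise_numbers('anno 1652 te Amsterdam')
mapped = normalise_numbers('anno 1652 te Amsterdam', remove=False)
```

Removal shrinks the index, while mapping to a placeholder keeps the fact that a number was present; keeping raw numbers instead would preserve their tie-breaking value noted above.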