Tesa (text sanitizer)


The library contains a small collection of helper classes to support the sanitization of text or string elements of arbitrary length, with the aim of improving search match confidence during query execution. It was written for the Semantic MediaWiki project but is deployed independently.

Requirements


  • PHP 5.3 / HHVM 3.5 or later
  • Recommended to enable the ICU extension

Installation


The recommended installation method for this library is to add the following dependency to your composer.json.

```json
{
	"require": {
		"onoi/tesa": "~0.1"
	}
}
```


Usage

```php
use Onoi\Tesa\SanitizerFactory;
use Onoi\Tesa\Transliterator;
use Onoi\Tesa\Sanitizer;

$sanitizerFactory = new SanitizerFactory();

$sanitizer = $sanitizerFactory->newSanitizer( 'A string that contains ...' );

$sanitizer->reduceLengthTo( 200 );
$sanitizer->toLowercase();

$sanitizer->replace(
	array( "'", "http://", "https://", "mailto:", "tel:" ),
	array( '' )
);

$sanitizer->setOption( Sanitizer::MIN_LENGTH, 4 );
$sanitizer->setOption( Sanitizer::WHITELIST, array( 'that' ) );

$sanitizer->applyTransliteration(
	Transliterator::DIACRITICS | Transliterator::GREEK
);

$text = $sanitizer->sanitizeWith(
	$sanitizerFactory->newGenericTokenizer(),
	$sanitizerFactory->newNullStopwordAnalyzer(),
	$sanitizerFactory->newNullSynonymizer()
);
```


  • SanitizerFactory is expected to be the sole entry point for services and instances when used outside of this library
  • IcuWordBoundaryTokenizer is the preferred tokenizer when the ICU extension is available
  • NGramTokenizer is provided to increase CJK match confidence in case the back-end does not provide an explicit ngram tokenizer
  • StopwordAnalyzer together with a LanguageDetector is provided as a means to reduce ambiguity of frequent "noise" words from a possible search index
  • Synonymizer currently only provides an interface
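
The n-gram idea mentioned above can be sketched in plain PHP. This is an illustration of the technique only and does not use the library's API; the function name `ngrams` and the default size of 2 are assumptions made for the example (the mbstring extension is required for the multibyte calls):

```php
<?php

// Minimal n-gram sketch: split a string into overlapping
// substrings of $size characters (multibyte-safe for CJK input).
function ngrams( $text, $size = 2 ) {
	$length = mb_strlen( $text, 'UTF-8' );
	$grams = array();

	for ( $i = 0; $i <= $length - $size; $i++ ) {
		$grams[] = mb_substr( $text, $i, $size, 'UTF-8' );
	}

	return $grams;
}

// "東京都" yields the overlapping bigrams "東京" and "京都",
// so a query for "京都" can still match the longer string.
print_r( ngrams( '東京都' ) );
```

Because CJK text has no word-separating whitespace, indexing such overlapping bigrams lets a back-end match substrings without a language-aware word segmenter.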

Contribution and support

If you want to contribute work to the project, please subscribe to the developers mailing list and have a look at the contribution guidelines. A list of people who have made contributions in the past can be found here.

Tests


The library provides unit tests that cover the core functionality and are normally run by the continuous integration platform. Tests can also be executed manually using the composer phpunit command from the root directory.

Release notes

  • 0.1.0 Initial release (2016-08-07)
    • Added SanitizerFactory with support for a Tokenizer, LanguageDetector, Synonymizer, and StopwordAnalyzer interface

Notes


  • The Transliterator uses the same diacritics conversion table as (except the German diaeresis ä, ü, and ö)
  • The stopwords used by the StopwordAnalyzer have been collected from different sources, each json file identifies its origin
  • CdbStopwordAnalyzer relies on wikimedia/cdb to avoid using an external database or cache layer (with extra stopwords being available here)
  • JaTinySegmenterTokenizer is based on the work of Taku Kudo and his tiny_segmenter.js
  • TextCatLanguageDetector uses the `wikimedia/textcat` library to make predictions about a language
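
The role of a stopword analyzer can likewise be sketched in plain PHP. Again, this is a conceptual illustration and not the library's API; the function name `removeStopwords` and the word list are made up for the example:

```php
<?php

// Toy stopword filter: drop high-frequency "noise" words from
// a token list before it reaches a search index.
function removeStopwords( array $tokens, array $stopwords ) {
	// Flip the list so lookups are O(1) key checks.
	$stopwords = array_flip( $stopwords );

	return array_values( array_filter(
		$tokens,
		function ( $token ) use ( $stopwords ) {
			return !isset( $stopwords[strtolower( $token )] );
		}
	) );
}

print_r( removeStopwords(
	array( 'A', 'string', 'that', 'contains', 'noise' ),
	array( 'a', 'that' )
) );
```

In the library itself the word lists come from the per-language json files mentioned above, and a language detector can pick the appropriate list before filtering.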


License

GNU General Public License 2.0 or later.