Tesa (text sanitizer)

![Build Status](https://secure.travis-ci.org/onoi/tesa.svg?branch=master) ![Code Coverage](https://scrutinizer-ci.com/g/onoi/tesa/badges/coverage.png?b=master) ![Scrutinizer Code Quality](https://scrutinizer-ci.com/g/onoi/tesa/badges/quality-score.png?b=master) ![Latest Stable Version](https://poser.pugx.org/onoi/tesa/version.png) ![Packagist download count](https://poser.pugx.org/onoi/tesa/d/total.png) ![Dependency Status](https://www.versioneye.com/php/onoi:tesa/badge.png)

The library contains a small collection of helper classes that support the sanitization of text or string elements of arbitrary length, with the aim of improving search match confidence during query execution. It is required by the Semantic MediaWiki project but is deployed independently.

Requirements

  • PHP 5.3 / HHVM 3.5 or later
  • Enabling the ICU (`intl`) extension is recommended

Installation

The recommended installation method for this library is by adding the following dependency to your composer.json.

```json
{
    "require": {
        "onoi/tesa": "~0.1"
    }
}
```

Usage

```php
use Onoi\Tesa\SanitizerFactory;
use Onoi\Tesa\Transliterator;
use Onoi\Tesa\Sanitizer;

$sanitizerFactory = new SanitizerFactory();

$sanitizer = $sanitizerFactory->newSanitizer( 'A string that contains ...' );

// Shorten and normalize the text before tokenization
$sanitizer->reduceLengthTo( 200 );
$sanitizer->toLowercase();

// Strip quotes and common URI scheme prefixes
$sanitizer->replace(
    array( "'", "http://", "https://", "mailto:", "tel:" ),
    array( '' )
);

$sanitizer->setOption( Sanitizer::MIN_LENGTH, 4 );
$sanitizer->setOption( Sanitizer::WHITELIST, array( 'that' ) );

$sanitizer->applyTransliteration(
    Transliterator::DIACRITICS | Transliterator::GREEK
);

// Tokenize, then apply the stopword and synonym services
$text = $sanitizer->sanitizeWith(
    $sanitizerFactory->newGenericTokenizer(),
    $sanitizerFactory->newNullStopwordAnalyzer(),
    $sanitizerFactory->newNullSynonymizer()
);
```
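To make the `MIN_LENGTH` and `WHITELIST` options in the example above more concrete, the following standalone sketch shows one plausible reading of how they could interact with a stopword list. It is an assumption for illustration and not the library's implementation. The `filterTokens` function and its parameters are hypothetical names, and it needs the mbstring extension.

```php
<?php
// Hypothetical helper, not part of onoi/tesa: keeps a token when it is
// whitelisted, or when it is long enough and not listed as a stopword.
function filterTokens( array $tokens, $minLength, array $whitelist, array $stopwords ) {
    $kept = array();

    foreach ( $tokens as $token ) {
        $isExempt = in_array( $token, $whitelist, true );
        $tooShort = mb_strlen( $token, 'UTF-8' ) < $minLength;
        $isStopword = in_array( $token, $stopwords, true );

        // Whitelisted tokens bypass both the length and the stopword check
        if ( $isExempt || ( !$tooShort && !$isStopword ) ) {
            $kept[] = $token;
        }
    }

    return $kept;
}

$tokens = explode( ' ', 'a string that contains some text' );

// "a" is too short, "some" is a stopword, "that" is kept by the whitelist
echo implode( ' ', filterTokens( $tokens, 4, array( 'that' ), array( 'that', 'some' ) ) );
```

Under these assumptions the whitelist acts as an exemption, so a frequent word that carries meaning in a given wiki can still reach the search index.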

  • SanitizerFactory is expected to be the sole entry point for services and instances when used outside of this library
  • IcuWordBoundaryTokenizer is a preferred tokenizer in case the ICU extension is available
  • NGramTokenizer is provided to increase CJK match confidence in case the back-end does not provide an explicit ngram tokenizer
  • StopwordAnalyzer together with a LanguageDetector is provided as a means to reduce ambiguity of frequent "noise" words from a possible search index
  • Synonymizer currently only provides an interface
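As an illustration of why n-grams can help with CJK text, which has no whitespace between words, the sketch below splits a string into overlapping character bigrams. It is a simplified, standalone example and not the code of NGramTokenizer. The `characterNgrams` function is a hypothetical name, and it needs the mbstring extension.

```php
<?php
// Hypothetical helper, not part of onoi/tesa: splits a string into
// overlapping character n-grams of size $n.
function characterNgrams( $text, $n = 2 ) {
    $length = mb_strlen( $text, 'UTF-8' );

    // Text no longer than a single n-gram is returned as one token
    if ( $length <= $n ) {
        return $length > 0 ? array( $text ) : array();
    }

    $ngrams = array();

    for ( $i = 0; $i <= $length - $n; $i++ ) {
        $ngrams[] = mb_substr( $text, $i, $n, 'UTF-8' );
    }

    return $ngrams;
}

// The indexed text and the query share a bigram although the
// indexed text offers no word boundary to split on
$indexed = characterNgrams( '東京都庁' );
$query = characterNgrams( '都庁' );

echo implode( ' ', array_intersect( $indexed, $query ) );
```

The trade-off of this approach is a larger index and occasional accidental matches on adjacent characters that do not form a word.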

Contribution and support

If you want to contribute work to the project, please subscribe to the developers mailing list and have a look at the [contribution guidelines](/CONTRIBUTING.md). A list of people who have made contributions in the past is available in the project's contributors list.

Tests

The library provides unit tests that cover the core functionality and are normally run by the continuous integration platform. Tests can also be executed manually using the `composer phpunit` command from the root directory.

Release notes

  • 0.1.0 Initial release (2016-08-07)
    • Added SanitizerFactory with support for a Tokenizer, LanguageDetector, Synonymizer, and StopwordAnalyzer interface

Acknowledgments

  • The Transliterator uses the same diacritics conversion table as http://jsperf.com/latinize (except the German diaeresis ä, ü, and ö)
  • The stopwords used by the StopwordAnalyzer have been collected from different sources; each JSON file identifies its origin
  • CdbStopwordAnalyzer relies on wikimedia/cdb to avoid using an external database or cache layer (with extra stopwords being available here)
  • JaTinySegmenterTokenizer is based on the work of Taku Kudo and his tiny_segmenter.js
  • TextCatLanguageDetector uses the `wikimedia/textcat` library to make predictions about a language
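The diacritics conversion mentioned in the first item can be pictured as a lookup table applied to the text. The sketch below uses a tiny, hypothetical excerpt of such a table and mirrors the stated exception by leaving the German ä, ö, and ü untouched. It is not the Transliterator's code or its actual table, and it assumes the input uses precomposed (NFC) characters.

```php
<?php
// Hypothetical excerpt of a diacritics table; a real table is far larger.
// German ä, ö, and ü are deliberately left out.
$diacritics = array(
    'à' => 'a', 'á' => 'a', 'â' => 'a',
    'é' => 'e', 'è' => 'e', 'ê' => 'e',
    'û' => 'u', 'ñ' => 'n', 'ç' => 'c',
);

// Hypothetical helper, not part of onoi/tesa
function removeDiacritics( $text, array $map ) {
    // strtr() replaces every key in a single pass and never rescans
    // text that has already been replaced
    return strtr( $text, $map );
}

echo removeDiacritics( 'crème brûlée für zwei', $diacritics );
```

With this excerpt, "crème brûlée" becomes "creme brulee" while "für" keeps its umlaut.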

License

GNU General Public License 2.0 or later.