Reference

spaczz.matcher

Module for matchers.

class spaczz.matcher.FuzzyMatcher(vocab, **defaults)

spaCy-like matcher for finding fuzzy phrase matches in Doc objects.

Fuzzy matches patterns against the Doc it is called on. Accepts labeled patterns in the form of Doc objects with optional, per-pattern match settings.

name

Class attribute - the name of the matcher.

Type

str

defaults

Keyword arguments to be used as default match settings. Per-pattern match settings take precedence over defaults.

Type

dict[str, bool|int|str|Literal['default', 'min', 'max']]

Match Settings
  • ignore_case (bool) – Whether to lower-case text before matching. Default is True.

  • min_r (int) – Minimum match ratio required.

  • thresh (int) – If this ratio is exceeded in initial scan, and flex > 0, no optimization will be attempted. If flex == 0, thresh has no effect. Default is 100.

  • fuzzy_func (str) – Key name of fuzzy matching function to use. All rapidfuzz matching functions with default settings are available. Additional fuzzy matching functions can be registered by users. Included functions are:

    • “simple” = ratio

    • “partial” = partial_ratio

    • “token” = token_ratio

    • “token_set” = token_set_ratio

    • “token_sort” = token_sort_ratio

    • “partial_token” = partial_token_ratio

    • “partial_token_set” = partial_token_set_ratio

    • “partial_token_sort” = partial_token_sort_ratio

    • “weighted” = WRatio

    • “quick” = QRatio

    • “partial_alignment” = partial_ratio_alignment (Requires rapidfuzz>=2.0.3)

    Default is “simple”.

  • flex (int|Literal['default', 'min', 'max']) – Number of tokens to move match boundaries left and right during optimization. Can be an int with a max of len(pattern) and a min of 0 (a value outside this range triggers a warning and is changed). “max”, “min”, or “default” are also valid. Default is “default”: len(pattern) // 2.

  • min_r1 (int|None) – Optional granular control over the minimum match ratio required for selection during the initial scan. If flex == 0, min_r1 will be overwritten by min_r2. If flex > 0, min_r1 must be lower than min_r2 and “low” in general because match boundaries are not flexed initially. Default is None, which will result in min_r1 being set to round(min_r / 1.5).

  • min_r2 (int|None) – Optional granular control over the minimum match ratio required for selection during match optimization. Needs to be higher than min_r1 and “high” in general to ensure only quality matches are returned. Default is None, which will result in min_r2 being set to min_r.
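The way the documented defaults interact can be sketched in plain Python. This is a minimal illustration, not spaczz code: `resolve_settings` is a hypothetical helper, the mapping of “max” and “min” to len(pattern) and 0 is an assumption based on the documented int bounds, and clamping an out-of-range flex to the nearest bound is also an assumption.

```python
import warnings
from typing import Dict, Optional, Union


def resolve_settings(
    pattern_len: int,
    min_r: int,
    flex: Union[int, str] = "default",
    min_r1: Optional[int] = None,
    min_r2: Optional[int] = None,
) -> Dict[str, int]:
    """Hypothetical helper that mirrors the documented defaults; not spaczz code."""
    # Resolve the string shortcuts. Mapping "max" to len(pattern) and "min" to 0
    # is an assumption based on the documented int bounds.
    shortcuts = {"default": pattern_len // 2, "max": pattern_len, "min": 0}
    if isinstance(flex, str):
        flex = shortcuts[flex]
    # Out-of-range ints trigger a warning and are changed; clamping to the
    # nearest bound is an assumption.
    clamped = max(0, min(flex, pattern_len))
    if clamped != flex:
        warnings.warn(f"flex changed from {flex} to {clamped}")
        flex = clamped
    # min_r2 falls back to min_r, and min_r1 to round(min_r / 1.5).
    if min_r2 is None:
        min_r2 = min_r
    if min_r1 is None:
        min_r1 = round(min_r / 1.5)
    # With no flexing there is no optimization step, so min_r2 overwrites min_r1.
    if flex == 0:
        min_r1 = min_r2
    return {"flex": flex, "min_r1": min_r1, "min_r2": min_r2}


print(resolve_settings(pattern_len=4, min_r=75))
# {'flex': 2, 'min_r1': 50, 'min_r2': 75}
```

With flex=0 the same call yields min_r1 == min_r2, which matches the note that min_r1 is overwritten when no boundary flexing takes place.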

__call__(doc)

Finds matches in doc given the matcher's patterns.

Parameters

doc (Doc) – The Doc object to match over.

Return type

List[Tuple[str, int, int, int, str]]

Returns

A list of MatchResult tuples, (label, start index, end index, match ratio, pattern).

Example

>>> import spacy
>>> from spaczz.matcher import FuzzyMatcher
>>> nlp = spacy.blank("en")
>>> matcher = FuzzyMatcher(nlp.vocab)
>>> doc = nlp("Rdley Scott was the director of Alien.")
>>> matcher.add("NAME", [nlp.make_doc("Ridley Scott")])
>>> matcher(doc)
[('NAME', 0, 2, 96, 'Ridley Scott')]
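Because each result is a plain tuple, it can be unpacked and sorted with ordinary Python. The sketch below works on hand-written data in the documented shape: the first tuple is copied from the example above, the second is made up for illustration, and the token list is a hand-typed stand-in for the Doc. It assumes the end index is exclusive, as the (0, 2) span over two tokens suggests.

```python
from typing import List, Tuple

# (label, start index, end index, match ratio, pattern)
MatchResult = Tuple[str, int, int, int, str]

# Hand-written sample data; only the first entry comes from the example above.
matches: List[MatchResult] = [
    ("NAME", 0, 2, 96, "Ridley Scott"),
    ("FILM", 6, 7, 100, "Alien"),  # hypothetical second match
]

# Stand-in for the tokens of the Doc that was matched over.
tokens = ["Rdley", "Scott", "was", "the", "director", "of", "Alien", "."]


def best_first(results: List[MatchResult]) -> List[MatchResult]:
    """Sort results by match ratio, highest first."""
    return sorted(results, key=lambda result: result[3], reverse=True)


for label, start, end, ratio, pattern in best_first(matches):
    # Slice the token list with the start and (exclusive) end indices.
    print(label, tokens[start:end], ratio, pattern)
```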
__contains__(label)

Whether the matcher contains patterns for a label.

Return type

bool

__len__()

The number of labels added to the matcher.

Return type

int

__reduce__()

Interface for pickling the matcher.

Return type

Tuple[Any, Any]

add(label, patterns, kwargs=None, on_match=None)

Add a rule to the matcher, consisting of a label and one or more patterns.

Patterns must be a list of Doc objects and if kwargs is not None, kwargs must be a list of dicts.

Parameters
  • label (str) – Name of the rule added to the matcher.

  • patterns (List[Doc]) – Doc objects that will be matched against the Doc object the matcher is called on.

  • kwargs (Optional[List[Dict[str, Any]]]) – Optional settings to modify the matching behavior. If supplying kwargs, one per pattern should be included. Empty dicts will use the matcher instance's default settings. Default is None.

  • on_match (Optional[Callable[[TypeVar(PMT, bound= PhraseMatcher), Doc, int, List[Tuple[str, int, int, int, str]]], None]]) – Optional callback function to modify the Doc object the matcher is called on after matching. Default is None.

Warning

KwargsWarning:
  • If there are more patterns than kwargs, default matching settings will be used for extra patterns.

  • If there are more kwargs dicts than patterns, the extra kwargs will be ignored.
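The pairing rules above, together with per-pattern settings taking precedence over defaults, can be sketched as follows. This is an illustration only: `pair_patterns_with_kwargs` is a hypothetical helper, plain strings stand in for Doc patterns, and a generic UserWarning stands in for KwargsWarning.

```python
import warnings
from typing import Any, Dict, List, Optional, Tuple


def pair_patterns_with_kwargs(
    patterns: List[str],
    kwargs: Optional[List[Dict[str, Any]]],
    defaults: Dict[str, Any],
) -> List[Tuple[str, Dict[str, Any]]]:
    """Hypothetical helper: pair each pattern with its effective settings."""
    if kwargs is None:
        kwargs = []
    elif len(kwargs) != len(patterns):
        # Stand-in for KwargsWarning.
        warnings.warn("Number of kwargs dicts does not match number of patterns.")
    paired = []
    for i, pattern in enumerate(patterns):
        # Extra patterns fall back to an empty override; extra kwargs are never read.
        overrides = kwargs[i] if i < len(kwargs) else {}
        # Per-pattern settings take precedence over the defaults.
        paired.append((pattern, {**defaults, **overrides}))
    return paired


defaults = {"min_r": 75, "ignore_case": True}
print(pair_patterns_with_kwargs(["mooo", "quack"], None, defaults))
```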

Example

>>> import spacy
>>> from spaczz.matcher import FuzzyMatcher
>>> nlp = spacy.blank("en")
>>> matcher = FuzzyMatcher(nlp.vocab)
>>> matcher.add("SOUND", [nlp.make_doc("mooo")])
>>> "SOUND" in matcher
True
Return type

None

property labels: Tuple[str, ...]

All labels present in the matcher.

Return type

Tuple[str, ...]

Returns

The unique labels as a tuple of strings.

Example

>>> import spacy
>>> from spaczz.matcher import FuzzyMatcher
>>> nlp = spacy.blank("en")
>>> matcher = FuzzyMatcher(nlp.vocab)
>>> matcher.add("AUTHOR", [nlp.make_doc("Kerouac")])
>>> matcher.labels
('AUTHOR',)
property patterns: List[Dict[str, Any]]

Get all patterns and match settings that were added to the matcher.

Return type

List[Dict[str, Any]]

Returns

The patterns and their respective match settings as a list of dicts.

Example

>>> import spacy
>>> from spaczz.matcher import FuzzyMatcher
>>> nlp = spacy.blank("en")
>>> matcher = FuzzyMatcher(nlp.vocab)
>>> matcher.add("AUTHOR", [nlp.make_doc("Kerouac")],
...     [{"ignore_case": False}])
>>> matcher.patterns == [
...     {
...         "label": "AUTHOR",
...         "pattern": "Kerouac",
...         "type": "fuzzy",
...         "kwargs": {"ignore_case": False},
...     },
... ]
True
remove(label)

Remove a label and its respective patterns from the matcher.

Parameters

label (str) – Name of the rule added to the matcher.

Example

>>> import spacy
>>> from spaczz.matcher import FuzzyMatcher
>>> nlp = spacy.blank("en")
>>> matcher = FuzzyMatcher(nlp.vocab)
>>> matcher.add("SOUND", [nlp.make_doc("mooo")])
>>> matcher.remove("SOUND")
>>> "SOUND" in matcher
False
Return type

None

property type: Literal['fuzzy', 'regex', 'token', 'similarity', 'phrase']

Getter for the matcher's SpaczzType.

Return type

Literal['fuzzy', 'regex', 'token', 'similarity', 'phrase']

property vocab: spacy.vocab.Vocab

Getter for the matcher's Vocab.

Return type

Vocab

class spaczz.matcher.RegexMatcher(vocab, **defaults)

spaCy-like matcher for finding regex phrase matches in Doc objects.

Regex matches patterns against the Doc it is called on. Accepts labeled patterns in the form of strings with optional, per-pattern match settings.

To utilize regex flags, use inline flags.
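As a rough illustration of inline flags, the sketch below uses Python's standard-library re module rather than spaczz itself. It assumes that regex strings passed to the matcher accept the same inline syntax, where a flag such as (?i) or (?x) is written at the start of the pattern instead of being passed as a separate argument.

```python
import re

# (?i) switches on case-insensitive matching for the whole pattern.
case_insensitive = re.search(r"(?i)united states", "I live in the UNITED STATES")
print(case_insensitive.group(0))  # UNITED STATES

# (?x) switches on verbose mode: whitespace in the pattern is ignored
# and "#" starts a comment inside the pattern.
verbose = re.search(
    r"(?x) \d{3} - \d{4}  # three digits, a dash, four digits",
    "call 555-1234 now",
)
print(verbose.group(0))  # 555-1234
```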

name

Class attribute - the name of the matcher.

Type

str

defaults

Keyword arguments to be used as default match settings. Per-pattern match settings take precedence over defaults.

Type

dict[str, bool|int|str]

Match Settings
  • ignore_case (bool) – Whether to lower-case text before matching. Default is True.

  • min_r (int) – Minimum match ratio required.

  • fuzzy_weights (str) – Name of weighting method for regex insertion, deletion, and substitution counts. Additional weighting methods can be registered by users. Included weighting methods are:

    • “indel” = (1, 1, 2)

    • “lev” = (1, 1, 1)

    Default is “indel”.

  • partial (bool) – Whether partial matches should be extended to Token or Span boundaries in doc. This applies, for example, when the regex only matches part of a Token or Span in doc. Default is True.

  • predef (bool) – Whether the regex string should be interpreted as a key to a predefined regex pattern or not. Additional predefined regex patterns can be registered by users. The included predefined regex patterns are:

    • “dates”

    • “times”

    • “phones”

    • “phones_with_exts”

    • “links”

    • “emails”

    • “ips”

    • “ipv6s”

    • “prices”

    • “hex_colors”

    • “credit_cards”

    • “btc_addresses”

    • “street_addresses”

    • “zip_codes”

    • “po_boxes”

    • “ssn_numbers”

    Default is False.
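To make the two included fuzzy_weights methods concrete, the sketch below applies each weight tuple to a set of edit counts. It assumes the tuples are ordered (insertion, deletion, substitution), following the order in which the description names the counts. `weighted_cost` is a hypothetical helper, and how spaczz converts such a cost into a 0-100 match ratio is not covered here.

```python
from typing import Dict, Tuple

# Assumed order: (insertion weight, deletion weight, substitution weight).
FUZZY_WEIGHTS: Dict[str, Tuple[int, int, int]] = {
    "indel": (1, 1, 2),  # a substitution costs as much as a deletion plus an insertion
    "lev": (1, 1, 1),    # every edit costs the same
}


def weighted_cost(counts: Tuple[int, int, int], weights: str = "indel") -> int:
    """Hypothetical helper: total cost of (insertions, deletions, substitutions)."""
    return sum(count * weight for count, weight in zip(counts, FUZZY_WEIGHTS[weights]))


# One deletion and one substitution.
edits = (0, 1, 1)
print(weighted_cost(edits, "indel"))  # 3
print(weighted_cost(edits, "lev"))    # 2
```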

__call__(doc)

Finds matches in doc given the matcher's patterns.

Parameters

doc (Doc) – The Doc object to match over.

Return type

List[Tuple[str, int, int, int, str]]

Returns

A list of MatchResult tuples, (label, start index, end index, match ratio, pattern).

Example

>>> import spacy
>>> from spaczz.matcher import RegexMatcher
>>> nlp = spacy.blank("en")
>>> matcher = RegexMatcher(nlp.vocab)
>>> doc = nlp.make_doc("I live in the united states, or the US")
>>> matcher.add("GPE", [r"[Uu](nited|\.?) ?[Ss](tates|\.?)"])
>>> matcher(doc)[0]
('GPE', 4, 6, 100, '[Uu](nited|\\.?) ?[Ss](tates|\\.?)')
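The partial setting concerns regex matches that start or end in the middle of a token. The sketch below shows one way a character-level match can be widened to whole tokens. It is an illustration under stated assumptions, not spaczz's implementation: whitespace splitting stands in for spaCy tokenization, and `token_offsets` and `expand_to_tokens` are hypothetical helpers.

```python
import re
from typing import List, Optional, Tuple


def token_offsets(text: str) -> List[Tuple[int, int]]:
    """Character (start, end) offsets of whitespace-delimited tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def expand_to_tokens(text: str, pattern: str) -> Optional[Tuple[int, int]]:
    """Return the (start, end) token span covering the first regex match, if any."""
    match = re.search(pattern, text)
    if match is None:
        return None
    start_char, end_char = match.span()
    # Keep every token that overlaps the matched characters.
    covered = [
        i
        for i, (tok_start, tok_end) in enumerate(token_offsets(text))
        if tok_start < end_char and tok_end > start_char
    ]
    if not covered:
        return None
    return covered[0], covered[-1] + 1  # end index is exclusive


# The regex only matches part of "united" and part of "states".
print(expand_to_tokens("I live in the united states", r"nited stat"))  # (4, 6)
```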
__contains__(label)

Whether the matcher contains patterns for a label.

Return type

bool

__len__()

The number of labels added to the matcher.

Return type

int

__reduce__()

Interface for pickling the matcher.

Return type

Tuple[Any, Any]

add(label, patterns, kwargs=None, on_match=None)

Add a rule to the matcher, consisting of a label and one or more patterns.

Patterns must be a list of strings and if kwargs is not None, kwargs must be a list of dicts.

Parameters
  • label (str) – Name of the rule added to the matcher.

  • patterns (List[str]) – Regex strings that will be matched against the Doc object the matcher is called on.

  • kwargs (Optional[List[Dict[str, Any]]]) – Optional settings to modify the matching behavior. If supplying kwargs, one per pattern should be included. Empty dicts will use the matcher instance's default settings. Default is None.

  • on_match (Optional[Callable[[RegexMatcher, Doc, int, List[Tuple[str, int, int, int, str]]], None]]) – Optional callback function to modify the Doc object the matcher is called on after matching. Default is None.

Raises
  • TypeError – If patterns is not a list of strings.

  • TypeError – If kwargs is not a list of dictionaries.

Warning

KwargsWarning:
  • If there are more patterns than kwargs, default matching settings will be used for extra patterns.

  • If there are more kwargs dicts than patterns, the extra kwargs will be ignored.

Example

>>> import spacy
>>> from spaczz.matcher import RegexMatcher
>>> nlp = spacy.blank("en")
>>> matcher = RegexMatcher(nlp.vocab)
>>> matcher.add("GPE", [r"[Uu](nited|\.?) ?[Ss](tates|\.?)"])
>>> "GPE" in matcher
True
Return type

None

property labels: Tuple[str, ...]

All labels present in the matcher.

Return type

Tuple[str, ...]

Returns

The unique labels as a tuple of strings.

Example

>>> import spacy
>>> from spaczz.matcher import RegexMatcher
>>> nlp = spacy.blank("en")
>>> matcher = RegexMatcher(nlp.vocab)
>>> matcher.add("ZIP", ["zip_codes"], [{"predef": True}])
>>> matcher.labels
('ZIP',)
property patterns: List[Dict[str, Any]]

Get all patterns and match settings that were added to the matcher.

Return type

List[Dict[str, Any]]

Returns

The patterns and their respective match settings as a list of dicts.

Example

>>> import spacy
>>> from spaczz.matcher import RegexMatcher
>>> nlp = spacy.blank("en")
>>> matcher = RegexMatcher(nlp.vocab)
>>> matcher.add("ZIP", ["zip_codes"], [{"predef": True}])
>>> matcher.patterns == [
...     {
...         "label": "ZIP",
...         "pattern": "zip_codes",
...         "type": "regex",
...         "kwargs": {"predef": True},
...     },
... ]
True
remove(label)

Remove a label and its respective patterns from the matcher.

Parameters

label (str) – Name of the rule added to the matcher.

Raises

ValueError – If label does not exist in the matcher.

Example

>>> import spacy
>>> from spaczz.matcher import RegexMatcher
>>> nlp = spacy.blank("en")
>>> matcher = RegexMatcher(nlp.vocab)
>>> matcher.add("GPE", [r"[Uu](nited|\.?) ?[Ss](tates|\.?)"])
>>> matcher.remove("GPE")
>>> "GPE" in matcher
False
Return type

None

property type: Literal['fuzzy', 'regex', 'token', 'similarity', 'phrase']

Getter for the matcher's SpaczzType.

Return type

Literal['fuzzy', 'regex', 'token', 'similarity', 'phrase']

property vocab: spacy.vocab.Vocab

Getter for the matcher's Vocab.

Return type

Vocab

class spaczz.matcher.SimilarityMatcher(vocab, **defaults)

spaCy-like matcher for finding phrase similarity matches in Doc objects.

Similarity matches patterns against the Doc it is called on. Accepts labeled patterns in the form of Doc objects with optional, per-pattern match settings.

Similarity matching uses spaCy word vectors if available, so spaCy vocabs without word vectors may not produce useful results. The spaCy medium and large English models provide word vectors that will work for this purpose.

Searching over/with Doc objects that do not have vectors will always return a similarity score of 0.

Warnings from spaCy about the above two scenarios are suppressed for convenience. However, spaczz will still warn about the former.
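Why missing vectors lead to a score of 0 can be seen from cosine similarity, which spaCy uses by default to compare vectors. The sketch below is a plain-Python illustration with made-up two-dimensional vectors. `similarity_ratio` is a hypothetical helper, and scaling the cosine to a 0-100 integer is an assumption about how the match ratio is derived.

```python
import math
from typing import Sequence


def similarity_ratio(a: Sequence[float], b: Sequence[float]) -> int:
    """Hypothetical helper: cosine similarity scaled to an integer ratio."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # An all-zero vector (no word vectors available) has no direction,
    # so the similarity is reported as 0.
    if norm_a == 0 or norm_b == 0:
        return 0
    cosine = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    return round(cosine * 100)


print(similarity_ratio([0.0, 0.0], [1.0, 2.0]))  # 0
print(similarity_ratio([1.0, 0.0], [1.0, 0.0]))  # 100
```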

name

Class attribute - the name of the matcher.

Type

str

defaults

Keyword arguments to be used as default match settings. Per-pattern match settings take precedence over defaults.

Type

dict[str, bool|int|str|Literal['default', 'min', 'max']]

Match Settings
  • ignore_case (bool) – Whether to lower-case text before fuzzy matching. Default is True.

  • min_r (int) – Minimum match ratio required.

  • thresh (int) – If this ratio is exceeded in initial scan, and flex > 0, no optimization will be attempted. If flex == 0, thresh has no effect. Default is 100.

  • flex (int|Literal['default', 'min', 'max']) – Number of tokens to move match boundaries left and right during optimization. Can be an int with a max of len(pattern) and a min of 0 (a value outside this range triggers a warning and is changed). “max”, “min”, or “default” are also valid. Default is “default”: len(pattern) // 2.

  • min_r1 (int|None) – Optional granular control over the minimum match ratio required for selection during the initial scan. If flex == 0, min_r1 will be overwritten by min_r2. If flex > 0, min_r1 must be lower than min_r2 and “low” in general because match boundaries are not flexed initially. Default is None, which will result in min_r1 being set to round(min_r / 1.5).

  • min_r2 (int|None) – Optional granular control over the minimum match ratio required for selection during match optimization. Needs to be higher than min_r1 and “high” in general to ensure only quality matches are returned. Default is None, which will result in min_r2 being set to min_r.

Warning

MissingVectorsWarning:

If vocab does not contain any word vectors.

__call__(doc)

Finds matches in doc given the matcher's patterns.

Parameters

doc (Doc) – The Doc object to match over.

Return type

List[Tuple[str, int, int, int, str]]

Returns

A list of MatchResult tuples, (label, start index, end index, match ratio, pattern).

Example

>>> import spacy
>>> from spaczz.matcher import SimilarityMatcher
>>> nlp = spacy.load("en_core_web_md")
>>> matcher = SimilarityMatcher(nlp.vocab)
>>> doc = nlp("I like apples.")
>>> matcher.add("FRUIT", [nlp("fruit")], [{'min_r': 60}])
>>> matcher(doc)
[('FRUIT', 2, 3, 70, 'fruit')]
__contains__(label)

Whether the matcher contains patterns for a label.

Return type

bool

__len__()

The number of labels added to the matcher.

Return type

int

__reduce__()

Interface for pickling the matcher.

Return type

Tuple[Any, Any]

add(label, patterns, kwargs=None, on_match=None)

Add a rule to the matcher, consisting of a label and one or more patterns.

Patterns must be a list of Doc objects and if kwargs is not None, kwargs must be a list of dicts.

Parameters
  • label (str) – Name of the rule added to the matcher.

  • patterns (List[Doc]) – Doc objects that will be matched against the Doc object the matcher is called on.

  • kwargs (Optional[List[Dict[str, Any]]]) – Optional settings to modify the matching behavior. If supplying kwargs, one per pattern should be included. Empty dicts will use the matcher instance's default settings. Default is None.

  • on_match (Optional[Callable[[TypeVar(PMT, bound= PhraseMatcher), Doc, int, List[Tuple[str, int, int, int, str]]], None]]) – Optional callback function to modify the Doc object the matcher is called on after matching. Default is None.

Warning

KwargsWarning:
  • If there are more patterns than kwargs, default matching settings will be used for extra patterns.

  • If there are more kwargs dicts than patterns, the extra kwargs will be ignored.

Example

>>> import spacy
>>> from spaczz.matcher import SimilarityMatcher
>>> nlp = spacy.load("en_core_web_md")
>>> matcher = SimilarityMatcher(nlp.vocab)
>>> matcher.add("SOUND", [nlp("mooo")])
>>> "SOUND" in matcher
True
Return type

None

property labels: Tuple[str, ...]

All labels present in the matcher.

Return type

Tuple[str, ...]

Returns

The unique labels as a tuple of strings.

Example

>>> import spacy
>>> from spaczz.matcher import SimilarityMatcher
>>> nlp = spacy.load("en_core_web_md")
>>> matcher = SimilarityMatcher(nlp.vocab)
>>> matcher.add("AUTHOR", [nlp("Kerouac")])
>>> matcher.labels
('AUTHOR',)
property patterns: List[Dict[str, Any]]

Get all patterns and match settings that were added to the matcher.

Return type

List[Dict[str, Any]]

Returns

The patterns and their respective match settings as a list of dicts.

Example

>>> import spacy
>>> from spaczz.matcher import SimilarityMatcher
>>> nlp = spacy.load("en_core_web_md")
>>> matcher = SimilarityMatcher(nlp.vocab)
>>> matcher.add("AUTHOR", [nlp("Kerouac")],
...     [{"ignore_case": False}])
>>> matcher.patterns == [
...     {
...         "label": "AUTHOR",
...         "pattern": "Kerouac",
...         "type": "similarity",
...         "kwargs": {"ignore_case": False},
...     },
... ]
True
remove(label)

Remove a label and its respective patterns from the matcher.

Parameters

label (str) – Name of the rule added to the matcher.

Example

>>> import spacy
>>> from spaczz.matcher import SimilarityMatcher
>>> nlp = spacy.load("en_core_web_md")
>>> matcher = SimilarityMatcher(nlp.vocab)
>>> matcher.add("SOUND", [nlp("mooo")])
>>> matcher.remove("SOUND")
>>> "SOUND" in matcher
False
Return type

None

property type: Literal['fuzzy', 'regex', 'token', 'similarity', 'phrase']

Getter for the matcher's SpaczzType.

Return type

Literal['fuzzy', 'regex', 'token', 'similarity', 'phrase']

property vocab: spacy.vocab.Vocab

Getter for the matcher's Vocab.

Return type

Vocab

class spaczz.matcher.TokenMatcher(vocab, **defaults)

spaCy-like matcher for finding fuzzy token matches in Doc objects.

Fuzzy matches added patterns against the Doc object it is called on. Accepts labeled patterns in the form of lists of dictionaries where each list describes an individual pattern and each dictionary describes an individual token.

Uses extended spaCy token matching patterns. “FUZZY” and “FREGEX” are the two additional spaCy token pattern options.

For example:

[
    {"TEXT": {"FREGEX": "(database){e<=1}"}},
    {"LOWER": {"FUZZY": "access", "MIN_R": 85, "FUZZY_FUNC": "partial"}},
]

Make sure to use uppercase dictionary keys in patterns.
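A quick way to catch lowercase keys before adding a pattern is to walk the nested dictionaries. The sketch below is a rough check only: `non_uppercase_keys` is a hypothetical helper, not part of spaczz, and it does not special-case spaCy's "_" key for custom extension attributes, whose nested names are legitimately lowercase.

```python
from typing import Any, Dict, List


def non_uppercase_keys(token_pattern: Dict[str, Any], path: str = "") -> List[str]:
    """Hypothetical helper: dotted paths of keys that are not fully uppercase."""
    problems = []
    for key, value in token_pattern.items():
        if key != key.upper():
            problems.append(path + key)
        # Descend into nested option dicts such as {"FUZZY": ..., "MIN_R": ...}.
        if isinstance(value, dict):
            problems.extend(non_uppercase_keys(value, f"{path}{key}."))
    return problems


pattern = [
    {"TEXT": {"FREGEX": "(database){e<=1}"}},
    {"LOWER": {"FUZZY": "access", "MIN_R": 85, "FUZZY_FUNC": "partial"}},
]
print([non_uppercase_keys(token) for token in pattern])  # [[], []]
print(non_uppercase_keys({"text": {"fuzzy": "access"}}))  # ['text', 'text.fuzzy']
```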

name

Class attribute - the name of the matcher.

Type

str

defaults

Keyword arguments to be used as default match settings. Per-pattern match settings take precedence over defaults.

Type

dict[str, bool|int|str]

Match Settings
  • ignore_case (bool) – Whether to lower-case text before matching. Can only be set at the pattern level. For “FUZZY” and “FREGEX” patterns. Default is True.

  • min_r (int) – Minimum match ratio required. For “FUZZY” and “FREGEX” patterns.

  • fuzzy_func (str) – Key name of fuzzy matching function to use. Can only be set at the pattern level. For “FUZZY” patterns only. All rapidfuzz matching functions with default settings are available, however any token-based functions provide no utility at the individual token level. Additional fuzzy matching functions can be registered by users. Included, and useful, functions are:

    • “simple” = ratio

    • “partial” = partial_ratio

    • “quick” = QRatio

    • “partial_alignment” = partial_ratio_alignment (Requires rapidfuzz>=2.0.3)

    Default is “simple”.

  • fuzzy_weights (str) – Name of weighting method for regex insertion, deletion, and substitution counts. Can only be set at the pattern level. For “FREGEX” patterns only. Included weighting methods are:

    • “indel” = (1, 1, 2)

    • “lev” = (1, 1, 1)

    Default is “indel”.

  • predef (bool) – Whether regex should be interpreted as a key to a predefined regex pattern or not. Can only be set at the pattern level. For “FREGEX” patterns only. Default is False.

__call__(doc)

Finds matches in doc given the matcher's patterns.

Parameters

doc (Doc) – The Doc object to match over.

Return type

List[Tuple[str, int, int, int, str]]

Returns

A list of MatchResult tuples, (label, start index, end index, match ratio, pattern).

Example

>>> import spacy
>>> from spaczz.matcher import TokenMatcher
>>> nlp = spacy.blank("en")
>>> matcher = TokenMatcher(nlp.vocab)
>>> doc = nlp("Rdley Scot was the director of Alien.")
>>> matcher.add("NAME", [
...     [{"TEXT": {"FUZZY": "Ridley"}},
...      {"TEXT": {"FUZZY": "Scott"}}]
... ])
>>> matcher(doc)[0][:4]
('NAME', 0, 2, 90)
__contains__(label)

Whether the matcher contains patterns for a label.

Return type

bool

__len__()

The number of labels added to the matcher.

Return type

int

__reduce__()

Interface for pickling the matcher.

Return type

Tuple[Any, Any]

add(label, patterns, on_match=None)

Add a rule to the matcher, consisting of a label and one or more patterns.

Patterns must be a list of lists of dicts where each list of dicts represents an individual pattern and each dictionary represents an individual token.

Uses extended spaCy token matching patterns. “FUZZY” and “FREGEX” are the two additional spaCy token pattern options.

For example:

[
    {"TEXT": {"FREGEX": "(database){e<=1}"}},
    {"LOWER": {"FUZZY": "access", "MIN_R": 85, "FUZZY_FUNC": "partial"}},
]

Make sure to use uppercase dictionary keys in patterns.

Parameters
  • label (str) – Name of the rule added to the matcher.

  • patterns (List[List[Dict[str, Any]]]) – List of lists of dicts that will be matched against the Doc object the matcher is called on.

  • on_match (Optional[Callable[[TokenMatcher, Doc, int, List[Tuple[str, int, int, int, str]]], None]]) – Optional callback function to modify the Doc object the matcher is called on after matching. Default is None.

Raises
  • TypeError – If patterns is not a list of lists of dicts.

  • ValueError – Patterns cannot have zero tokens.

Example

>>> import spacy
>>> from spaczz.matcher import TokenMatcher
>>> nlp = spacy.blank("en")
>>> matcher = TokenMatcher(nlp.vocab)
>>> matcher.add("AUTHOR", [[{"TEXT": {"FUZZY": "Kerouac"}}]])
>>> "AUTHOR" in matcher
True
Return type

None

property labels: Tuple[str, ...]

All labels present in the matcher.

Return type

Tuple[str, ...]

Returns

The unique labels as a tuple of strings.

Example

>>> import spacy
>>> from spaczz.matcher import TokenMatcher
>>> nlp = spacy.blank("en")
>>> matcher = TokenMatcher(nlp.vocab)
>>> matcher.add("AUTHOR", [[{"TEXT": {"FUZZY": "Kerouac"}}]])
>>> matcher.labels
('AUTHOR',)
property patterns: List[Dict[str, Any]]

Get all patterns and match settings that were added to the matcher.

Return type

List[Dict[str, Any]]

Returns

The patterns and their respective match settings as a list of dicts.

Example

>>> import spacy
>>> from spaczz.matcher import TokenMatcher
>>> nlp = spacy.blank("en")
>>> matcher = TokenMatcher(nlp.vocab)
>>> matcher.add("AUTHOR", [[{"TEXT": {"FUZZY": "Kerouac"}}]])
>>> matcher.patterns == [
...     {
...         "label": "AUTHOR",
...         "pattern": [{"TEXT": {"FUZZY": "Kerouac"}}],
...         "type": "token",
...     },
... ]
True
remove(label)

Remove a label and its respective patterns from the matcher.

Parameters

label (str) – Name of the rule added to the matcher.

Raises

ValueError – If label does not exist in the matcher.

Example

>>> import spacy
>>> from spaczz.matcher import TokenMatcher
>>> nlp = spacy.blank("en")
>>> matcher = TokenMatcher(nlp.vocab)
>>> matcher.add("AUTHOR", [[{"TEXT": {"FUZZY": "Kerouac"}}]])
>>> matcher.remove("AUTHOR")
>>> "AUTHOR" in matcher
False
Return type

None

property type: Literal['fuzzy', 'regex', 'token', 'similarity', 'phrase']

Getter for the matcher's SpaczzType.

Return type

Literal['fuzzy', 'regex', 'token', 'similarity', 'phrase']

property vocab: spacy.vocab.Vocab

Getter for the matcher's Vocab.

Return type

Vocab

spaczz.pipeline

Module for pipeline components.

class spaczz.pipeline.SpaczzRuler(nlp, name='spaczz_ruler', *, overwrite_ents=False, ent_id_sep='||', fuzzy_defaults={}, regex_defaults={}, token_defaults={}, patterns=None, scorer=<function spaczz_ruler_scorer>)

The SpaczzRuler adds fuzzy matches to spaCy Doc.ents.

It can be combined with other spaCy NER components like the statistical EntityRecognizer, and/or the EntityRuler it is inspired by, to boost accuracy. After initialization, the component is typically added to the pipeline using nlp.add_pipe.

nlp

The shared Language object that passes its Vocab to the matchers (not currently used by spaczz matchers) and processes fuzzy patterns.

Type

Language

name

Instance name of the current pipeline component. Typically passed in automatically from the factory when the component is added. Used to disable the current entity ruler while creating phrase patterns with the nlp object.

Type

str

overwrite_ents

If existing entities are present, e.g. entities added by the model, overwrite them with matches if necessary.

Type

bool

ent_id_sep

Separator used internally for entity IDs.

Type

str

scorer

The scoring method for the ruler.

Type

Optional[Callable]

fuzzy_matcher

The FuzzyMatcher instance the spaczz ruler will use for fuzzy phrase matching.

Type

FuzzyMatcher

regex_matcher

The RegexMatcher instance the spaczz ruler will use for regex phrase matching.

Type

RegexMatcher

token_matcher

The TokenMatcher instance the spaczz ruler will use for fuzzy token matching.

Type

TokenMatcher

defaults

Default match settings for their respective matchers.

Type

Dict[str, Any]

__call__(doc)

Find matches in document and add them as entities.

Parameters

doc (Doc) – The Doc object in the pipeline.

Return type

Doc

Returns

The Doc with added entities, if available.

Example

>>> import spacy
>>> from spaczz.pipeline import SpaczzRuler
>>> nlp = spacy.blank("en")
>>> ruler = SpaczzRuler(nlp)
>>> doc = nlp.make_doc("My name is Anderson, Grunt")
>>> ruler.add_patterns([{"label": "NAME", "pattern": "Grant Andersen",
...     "type": "fuzzy", "kwargs": {"fuzzy_func": "token_sort"}}])
>>> doc = ruler(doc)
>>> "Anderson, Grunt" in [ent.text for ent in doc.ents]
True
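The example above relies on the “token_sort” function to match a name whose words appear in a different order. The idea can be sketched with the standard library: difflib.SequenceMatcher stands in for rapidfuzz here, so the scores differ from what spaczz reports, and `simple_ratio` and `token_sort_ratio` are hypothetical helpers written for this illustration.

```python
import re
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Character-level similarity on a 0-100 scale (difflib stand-in)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_ratio(a: str, b: str) -> int:
    """Lower-case, drop punctuation, sort the words, then compare."""

    def normalize(text: str) -> str:
        return " ".join(sorted(re.findall(r"\w+", text.lower())))

    return simple_ratio(normalize(a), normalize(b))


plain_score = simple_ratio("grant andersen", "anderson, grunt")
sorted_score = token_sort_ratio("Grant Andersen", "Anderson, Grunt")
# Sorting the words first removes the penalty for the swapped name order.
print(plain_score, sorted_score)
```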
__contains__(label)

Whether a label is present in the patterns.

Return type

bool

__len__()

The number of all patterns added to the ruler.

Return type

int

add_patterns(patterns)

Add patterns to the ruler.

A pattern must be a spaczz pattern: {label (str), pattern (str or list), type (str), optional kwargs (dict[str, Any]), and optional id (str)}.

For example, a fuzzy phrase pattern:

{
    'label': 'ORG',
    'pattern': 'Apple',
    'kwargs': {'min_r2': 90},
    'type': 'fuzzy',
}

Or, a token pattern:

{
    'label': 'ORG',
    'pattern': [{'TEXT': {'FUZZY': 'Apple'}}],
    'type': 'token',
}

To utilize regex flags, use inline flags.

Parameters

patterns (List[Dict[str, Union[str, Dict[str, Any], List[Dict[str, Any]]]]]) – The spaczz patterns to add.

Raises
  • TypeError – If patterns is not a list of dicts.

  • ValueError – If one or more patterns do not conform to the spaczz pattern structure.
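The pattern structure described above can be checked before calling add_patterns. The sketch below encodes only what this section states: required label, pattern, and type entries, plus optional kwargs and id. `looks_like_spaczz_pattern` is a hypothetical helper, not part of spaczz, and the ruler's own validation may be stricter.

```python
from typing import Any, Dict

REQUIRED = {"label": str, "pattern": (str, list), "type": str}
OPTIONAL = {"kwargs": dict, "id": str}


def looks_like_spaczz_pattern(entry: Any) -> bool:
    """Hypothetical helper: check an entry against the documented structure."""
    if not isinstance(entry, dict):
        return False
    # Required keys must be present with the documented types.
    for key, expected in REQUIRED.items():
        if not isinstance(entry.get(key), expected):
            return False
    # Optional keys are checked only when present.
    for key, expected in OPTIONAL.items():
        if key in entry and not isinstance(entry[key], expected):
            return False
    return True


fuzzy_pattern = {"label": "ORG", "pattern": "Apple", "kwargs": {"min_r2": 90}, "type": "fuzzy"}
token_pattern = {"label": "ORG", "pattern": [{"TEXT": {"FUZZY": "Apple"}}], "type": "token"}
print(looks_like_spaczz_pattern(fuzzy_pattern))  # True
print(looks_like_spaczz_pattern({"label": "ORG", "pattern": "Apple"}))  # False
```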

Example

>>> import spacy
>>> from spaczz.pipeline import SpaczzRuler
>>> nlp = spacy.blank("en")
>>> ruler = SpaczzRuler(nlp)
>>> ruler.add_patterns([{"label": "AUTHOR", "pattern": "Kerouac",
...     "type": "fuzzy"}])
>>> "AUTHOR" in ruler.labels
True
Return type

None

clear()

Reset all patterns.

Return type

None

property ent_ids: Tuple[Optional[str], ...]

All entity ids present in the match patterns' id properties.

Return type

Tuple[Optional[str], ...]

Returns

The unique string entity ids as a tuple.

Example

>>> import spacy
>>> from spaczz.pipeline import SpaczzRuler
>>> nlp = spacy.blank("en")
>>> ruler = SpaczzRuler(nlp)
>>> ruler.add_patterns([{"label": "AUTHOR", "pattern": "Kerouac",
...     "type": "fuzzy", "id": "BEAT"}])
>>> ruler.ent_ids
('BEAT',)
from_bytes(patterns_bytes, *, exclude=[])

Load the spaczz ruler from a bytestring.

Parameters
  • patterns_bytes (bytes) – The bytestring to load.

  • exclude (Iterable[str]) – For spaCy consistency.

Return type

SpaczzRuler

Returns

The loaded spaczz ruler.

Example

>>> import spacy
>>> from spaczz.pipeline import SpaczzRuler
>>> nlp = spacy.blank("en")
>>> ruler = SpaczzRuler(nlp)
>>> ruler.add_patterns([{"label": "AUTHOR", "pattern": "Kerouac",
...     "type": "fuzzy"}])
>>> ruler_bytes = ruler.to_bytes()
>>> new_ruler = SpaczzRuler(nlp)
>>> new_ruler = new_ruler.from_bytes(ruler_bytes)
>>> "AUTHOR" in new_ruler
True
from_disk(path, *, exclude=[])

Load the spaczz ruler from a file.

Expects a file containing newline-delimited JSON (JSONL) with one entry per line.

Parameters
  • path (Union[str, Path]) – The JSONL file to load.

  • exclude (Iterable[str]) – For spaCy consistency.

Return type

SpaczzRuler

Returns

The loaded spaczz ruler.

Raises

ValueError – If path does not exist or cannot be accessed.

Example

>>> import tempfile
>>> import spacy
>>> from spaczz.pipeline import SpaczzRuler
>>> nlp = spacy.blank("en")
>>> ruler = SpaczzRuler(nlp)
>>> ruler.add_patterns([{"label": "AUTHOR", "pattern": "Kerouac",
...     "type": "fuzzy"}])
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     ruler.to_disk(f"{tmpdir}/ruler")
...     new_ruler = SpaczzRuler(nlp)
...     new_ruler = new_ruler.from_disk(f"{tmpdir}/ruler")
>>> "AUTHOR" in new_ruler
True
initialize(get_examples, *, nlp=None, patterns=None)

Initialize the pipe for training.

Parameters
  • get_examples (Callable[[], Iterable[Example]]) – Function that returns a representative sample of gold-standard Example objects.

  • nlp (Optional[Language]) – The current nlp object the component is part of.

  • patterns (Optional[Sequence[Dict[str, Union[str, Dict[str, Any], List[Dict[str, Any]]]]]]) – The list of patterns.

Return type

None

property labels: Tuple[str, ...]

All labels present in the ruler.

Return type

Tuple[str, ...]

Returns

The unique string labels as a tuple.

Example

>>> import spacy
>>> from spaczz.pipeline import SpaczzRuler
>>> nlp = spacy.blank("en")
>>> ruler = SpaczzRuler(nlp)
>>> ruler.add_patterns([{"label": "AUTHOR", "pattern": "Kerouac",
...     "type": "fuzzy"}])
>>> ruler.labels
('AUTHOR',)
match(doc)

Used in __call__ to find matches in doc.

Return type

List[Tuple[str, int, int, int, str, Literal['fuzzy', 'regex', 'token', 'similarity', 'phrase']]]

property patterns: List[Dict[str, Union[str, Dict[str, Any], List[Dict[str, Any]]]]]

Get all patterns and kwargs that were added to the ruler.

Return type

List[Dict[str, Union[str, Dict[str, Any], List[Dict[str, Any]]]]]

Returns

The original patterns and kwargs, one dictionary for each combination.

Example

>>> import spacy
>>> from spaczz.pipeline import SpaczzRuler
>>> nlp = spacy.blank("en")
>>> ruler = SpaczzRuler(nlp)
>>> ruler.add_patterns([{"label": "STREET", "pattern": "street_addresses",
...     "type": "regex", "kwargs": {"predef": True}}])
>>> ruler.patterns == [
...     {
...         "label": "STREET",
...         "pattern": "street_addresses",
...         "type": "regex",
...         "kwargs": {"predef": True},
...     },
... ]
True
remove(ent_id)

Remove patterns by their ent_id.

Return type

None

score(examples, **kwargs)

Pipeline scoring for spaCy >= 3.0, < 3.2 compatibility.

Return type

Any

set_annotations(doc, matches)

Modify the document in place.

Return type

None

to_bytes(*, exclude=[])

Serialize the spaczz ruler patterns to a bytestring.

Parameters

exclude (Iterable[str]) – For spaCy consistency.

Return type

bytes

Returns

The serialized patterns.

Example

>>> import spacy
>>> from spaczz.pipeline import SpaczzRuler
>>> nlp = spacy.blank("en")
>>> ruler = SpaczzRuler(nlp)
>>> ruler.add_patterns([{"label": "AUTHOR", "pattern": "Kerouac",
...     "type": "fuzzy"}])
>>> ruler_bytes = ruler.to_bytes()
>>> isinstance(ruler_bytes, bytes)
True
to_disk(path, *, exclude=[])

Save the spaczz ruler patterns to a directory.

The patterns will be saved as newline-delimited JSON (JSONL).
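For readers unfamiliar with JSONL, the sketch below writes pattern dicts one per line and reads them back using only the standard library. It illustrates the file format, not the ruler's serialization code: `write_jsonl`, `read_jsonl`, and the file name patterns.jsonl are hypothetical, and the exact files to_disk creates are not described here.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List

patterns: List[Dict[str, Any]] = [
    {"label": "AUTHOR", "pattern": "Kerouac", "type": "fuzzy"},
    {"label": "ORG", "pattern": [{"TEXT": {"FUZZY": "Apple"}}], "type": "token"},
]


def write_jsonl(path: Path, entries: List[Dict[str, Any]]) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf8") as handle:
        for entry in entries:
            handle.write(json.dumps(entry) + "\n")


def read_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Read one JSON object per non-empty line."""
    with path.open(encoding="utf8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as tmpdir:
    target = Path(tmpdir) / "patterns.jsonl"  # hypothetical file name
    write_jsonl(target, patterns)
    restored = read_jsonl(target)

print(restored == patterns)  # True
```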

Parameters
  • path (Union[str, Path]) – The JSONL file to save.

  • exclude (Iterable[str]) – For spaCy consistency.

Example

>>> import os
>>> import tempfile
>>> import spacy
>>> from spaczz.pipeline import SpaczzRuler
>>> nlp = spacy.blank("en")
>>> ruler = SpaczzRuler(nlp)
>>> ruler.add_patterns([{"label": "AUTHOR", "pattern": "Kerouac",
...     "type": "fuzzy"}])
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     ruler.to_disk(f"{tmpdir}/ruler")
...     isdir = os.path.isdir(f"{tmpdir}/ruler")
>>> isdir
True
Return type

None

spaczz.registry

Function and object registries.
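A registry in this sense maps a string key, such as the “simple” value accepted by fuzzy_func, to an object that can be looked up later by name. The toy sketch below shows that idea with a plain dict. It is not spaczz's registry implementation: `register_fuzzy_func` and `get_func` are hypothetical names, and difflib stands in for rapidfuzz.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict

# Toy name-to-function mapping; a stand-in for a real registry.
_FUZZY_FUNCS: Dict[str, Callable[[str, str], float]] = {}


def register_fuzzy_func(name: str) -> Callable:
    """Hypothetical decorator that stores a function under a name."""

    def decorator(func: Callable[[str, str], float]) -> Callable[[str, str], float]:
        _FUZZY_FUNCS[name] = func
        return func

    return decorator


def get_func(name: str) -> Callable[[str, str], float]:
    """Look up a previously registered function by name."""
    try:
        return _FUZZY_FUNCS[name]
    except KeyError:
        raise KeyError(f"No function registered under {name!r}") from None


@register_fuzzy_func("simple")
def simple(a: str, b: str) -> float:
    # difflib stand-in for a ratio function, scaled to 0-100.
    return 100 * SequenceMatcher(None, a, b).ratio()


print(get_func("simple")("Kerouac", "Kerouac"))  # 100.0
```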

spaczz.registry.get_fuzzy_func(name)

Get the registered function for a given name.

Parameters

name (str) – The name.

Return type

Any

Returns

The registered function.

spaczz.registry.get_re_pattern(name)

Get the registered predefined regex pattern for a given name.

Parameters

name (str) – The name.

Return type

Any

Returns

The registered pattern.

spaczz.registry.get_re_weights(name)

Get the registered weighting method for a given name.

Parameters

name (str) – The name.

Return type

Any

Returns

The registered weights.

spaczz.customattrs

Custom spaCy attributes for spaczz.

class spaczz.customattrs.SpaczzAttrs

Adds spaczz custom attributes to spaCy.

static get_doc_types(doc)

Getter for spaczz_types Doc attribute.

Return type

Set[Literal['fuzzy', 'regex', 'token', 'similarity', 'phrase']]

classmethod get_pattern(span)

Getter for spaczz_pattern Span attribute.

Return type

Optional[str]

classmethod get_ratio(span)

Getter for spaczz_ratio Span attribute.

Return type

Optional[int]

static get_spaczz_doc(doc)

Getter for spaczz_doc Doc attribute.

Return type

bool

static get_spaczz_ent(span)

Getter for spaczz_ent Span attribute.

Return type

bool

classmethod get_span_type(span)

Getter for spaczz_type Span attribute.

Return type

Optional[Literal['fuzzy', 'regex', 'token', 'similarity', 'phrase']]

static get_span_types(span)

Getter for spaczz_types Span attribute.

Return type

Set[Literal['fuzzy', 'regex', 'token', 'similarity', 'phrase']]

classmethod initialize()

Initializes and registers custom attributes.

Return type

None

spaczz.customtypes

Custom spaczz types.

spaczz.exceptions

Module for custom exceptions and warnings.

exception spaczz.exceptions.AttrOverwriteWarning

It warns if custom attributes are being overwritten.

exception spaczz.exceptions.FlexWarning

It warns if the flex value is changed because it is too large or too small.

exception spaczz.exceptions.KwargsWarning

It warns if there are more kwargs than patterns or vice versa.

exception spaczz.exceptions.MissingVectorsWarning

It warns if the spaCy Vocab does not have word vectors.

exception spaczz.exceptions.PatternTypeWarning

It warns if the spaczz pattern does not have a valid pattern type.

exception spaczz.exceptions.RatioWarning

It warns if match ratio values are incompatible with each other.

exception spaczz.exceptions.RegexParseError

General error for errors that may happen during regex compilation.