Reference¶
spaczz.matcher¶
Module for matchers.
- class spaczz.matcher.FuzzyMatcher(vocab, **defaults)¶
spaCy-like matcher for finding fuzzy phrase matches in Doc objects.
Fuzzy matches patterns against the Doc it is called on. Accepts labeled patterns in the form of Doc objects with optional, per-pattern match settings.
- name¶
Class attribute - the name of the matcher.
- Type
str
- defaults¶
Keyword arguments to be used as default match settings. Per-pattern match settings take precedence over defaults.
- Type
dict[str, bool|int|str|Literal['default', 'min', 'max']]
- Match Settings
ignore_case (bool) – Whether to lower-case text before matching. Default is True.
min_r (int) – Minimum match ratio required.
thresh (int) – If this ratio is exceeded in initial scan, and flex > 0, no optimization will be attempted. If flex == 0, thresh has no effect. Default is 100.
fuzzy_func (str) – Key name of fuzzy matching function to use. All rapidfuzz matching functions with default settings are available. Additional fuzzy matching functions can be registered by users. Included functions are:
“simple” = ratio
“partial” = partial_ratio
“token” = token_ratio
“token_set” = token_set_ratio
“token_sort” = token_sort_ratio
“partial_token” = partial_token_ratio
“partial_token_set” = partial_token_set_ratio
“partial_token_sort” = partial_token_sort_ratio
“weighted” = WRatio
“quick” = QRatio
“partial_alignment” = partial_ratio_alignment (Requires rapidfuzz>=2.0.3)
Default is “simple”.
flex (int|Literal[‘default’, ‘min’, ‘max’]) – Number of tokens to move match boundaries left and right during optimization. Can be an int with a max of len(pattern) and a min of 0, (will warn and change if higher or lower). “max”, “min”, or “default” are also valid. Default is “default”: len(pattern) // 2.
min_r1 (int|None) – Optional granular control over the minimum match ratio required for selection during the initial scan. If flex == 0, min_r1 will be overwritten by min_r2. If flex > 0, min_r1 must be lower than min_r2 and “low” in general because match boundaries are not flexed initially. Default is None, which will result in min_r1 being set to round(min_r / 1.5).
min_r2 (int|None) – Optional granular control over the minimum match ratio required for selection during match optimization. Needs to be higher than min_r1 and “high” in general to ensure only quality matches are returned. Default is None, which will result in min_r2 being set to min_r.
- __call__(doc)¶
Finds matches in doc, given the matcher's patterns.
- Parameters
doc (Doc) – The Doc object to match over.
- Return type
List[Tuple[str, int, int, int, str]]
- Returns
A list of MatchResult tuples, (label, start index, end index, match ratio, pattern).
Example
>>> import spacy
>>> from spaczz.matcher import FuzzyMatcher
>>> nlp = spacy.blank("en")
>>> matcher = FuzzyMatcher(nlp.vocab)
>>> doc = nlp("Rdley Scott was the director of Alien.")
>>> matcher.add("NAME", [nlp.make_doc("Ridley Scott")])
>>> matcher(doc)
[('NAME', 0, 2, 96, 'Ridley Scott')]
- __contains__(label)¶
Whether the matcher contains patterns for a label.
- Return type
bool
- __len__()¶
The number of labels added to the matcher.
- Return type
int
- __reduce__()¶
Interface for pickling the matcher.
- Return type
Tuple[Any, Any]
- add(label, patterns, kwargs=None, on_match=None)¶
Add a rule to the matcher, consisting of a label and one or more patterns.
Patterns must be a list of Doc objects, and if kwargs is not None, it must be a list of dicts.
- Parameters
label (str) – Name of the rule added to the matcher.
patterns (List[Doc]) – Doc objects that will be matched against the Doc object the matcher is called on.
kwargs (Optional[List[Dict[str, Any]]]) – Optional settings to modify the matching behavior. If supplying kwargs, one per pattern should be included. Empty dicts will use the matcher instance's default settings. Default is None.
on_match (Optional[Callable[[TypeVar(PMT, bound=PhraseMatcher), Doc, int, List[Tuple[str, int, int, int, str]]], None]]) – Optional callback function to modify the Doc object the matcher is called on after matching. Default is None.
Warning
- KwargsWarning:
If there are more patterns than kwargs dicts, default match settings will be used for the extra patterns.
If there are more kwargs dicts than patterns, the extra kwargs will be ignored.
Example
>>> import spacy
>>> from spaczz.matcher import FuzzyMatcher
>>> nlp = spacy.blank("en")
>>> matcher = FuzzyMatcher(nlp.vocab)
>>> matcher.add("SOUND", [nlp.make_doc("mooo")])
>>> "SOUND" in matcher
True
- Return type
None
- property labels: Tuple[str, ...]¶
All labels present in the matcher.
- Return type
Tuple[str, ...]
- Returns
The unique labels as a tuple of strings.
Example
>>> import spacy
>>> from spaczz.matcher import FuzzyMatcher
>>> nlp = spacy.blank("en")
>>> matcher = FuzzyMatcher(nlp.vocab)
>>> matcher.add("AUTHOR", [nlp.make_doc("Kerouac")])
>>> matcher.labels
('AUTHOR',)
- property patterns: List[Dict[str, Any]]¶
Get all patterns and match settings that were added to the matcher.
- Return type
List[Dict[str, Any]]
- Returns
The patterns and their respective match settings as a list of dicts.
Example
>>> import spacy
>>> from spaczz.matcher import FuzzyMatcher
>>> nlp = spacy.blank("en")
>>> matcher = FuzzyMatcher(nlp.vocab)
>>> matcher.add("AUTHOR", [nlp.make_doc("Kerouac")], [{"ignore_case": False}])
>>> matcher.patterns == [
...     {
...         "label": "AUTHOR",
...         "pattern": "Kerouac",
...         "type": "fuzzy",
...         "kwargs": {"ignore_case": False},
...     },
... ]
True
- remove(label)¶
Remove a label and its respective patterns from the matcher.
- Parameters
label (str) – Name of the rule added to the matcher.
Example
>>> import spacy
>>> from spaczz.matcher import FuzzyMatcher
>>> nlp = spacy.blank("en")
>>> matcher = FuzzyMatcher(nlp.vocab)
>>> matcher.add("SOUND", [nlp.make_doc("mooo")])
>>> matcher.remove("SOUND")
>>> "SOUND" in matcher
False
- Return type
None
- property type: Literal['fuzzy', 'regex', 'token', 'similarity', 'phrase']¶
Getter for the matcher's SpaczzType.
- Return type
Literal['fuzzy', 'regex', 'token', 'similarity', 'phrase']
- property vocab: spacy.vocab.Vocab¶
Getter for the matcher's Vocab.
- Return type
Vocab
- class spaczz.matcher.RegexMatcher(vocab, **defaults)¶
spaCy-like matcher for finding regex phrase matches in Doc objects.
Regex matches patterns against the Doc it is called on. Accepts labeled patterns in the form of strings with optional, per-pattern match settings.
To utilize regex flags, use inline flags.
- name¶
Class attribute - the name of the matcher.
- Type
str
- defaults¶
Keyword arguments to be used as default match settings. Per-pattern match settings take precedence over defaults.
- Type
dict[str, bool|int|str]
- Match Settings
ignore_case (bool) – Whether to lower-case text before matching. Default is True.
min_r (int) – Minimum match ratio required.
fuzzy_weights (str) – Name of weighting method for regex insertion, deletion, and substitution counts. Additional weighting methods can be registered by users. Included weighting methods are:
“indel” = (1, 1, 2)
“lev” = (1, 1, 1)
Default is “indel”.
partial (bool) – Whether partial matches should be extended to Token or Span boundaries in doc or not, e.g. when the regex only matches part of a Token or Span in doc. Default is True.
predef (bool) – Whether the regex string should be interpreted as a key to a predefined regex pattern or not. Additional predefined regex patterns can be registered by users. The included predefined regex patterns are:
“dates”
“times”
“phones”
“phones_with_exts”
“links”
“emails”
“ips”
“ipv6s”
“prices”
“hex_colors”
“credit_cards”
“btc_addresses”
“street_addresses”
“zip_codes”
“po_boxes”
“ssn_numbers”
Default is False.
- __call__(doc)¶
Finds matches in doc, given the matcher's patterns.
- Parameters
doc (Doc) – The Doc object to match over.
- Return type
List[Tuple[str, int, int, int, str]]
- Returns
A list of MatchResult tuples, (label, start index, end index, match ratio, pattern).
Example
>>> import spacy
>>> from spaczz.matcher import RegexMatcher
>>> nlp = spacy.blank("en")
>>> matcher = RegexMatcher(nlp.vocab)
>>> doc = nlp.make_doc("I live in the united states, or the US")
>>> matcher.add("GPE", ["[Uu](nited|\.?) ?[Ss](tates|\.?)"])
>>> matcher(doc)[0]
('GPE', 4, 6, 100, '[Uu](nited|\\.?) ?[Ss](tates|\\.?)')
- __contains__(label)¶
Whether the matcher contains patterns for a label.
- Return type
bool
- __len__()¶
The number of labels added to the matcher.
- Return type
int
- __reduce__()¶
Interface for pickling the matcher.
- Return type
Tuple[Any, Any]
- add(label, patterns, kwargs=None, on_match=None)¶
Add a rule to the matcher, consisting of a label and one or more patterns.
Patterns must be a list of strings, and if kwargs is not None, it must be a list of dicts.
- Parameters
label (str) – Name of the rule added to the matcher.
patterns (List[str]) – Regex strings that will be matched against the Doc object the matcher is called on.
kwargs (Optional[List[Dict[str, Any]]]) – Optional settings to modify the matching behavior. If supplying kwargs, one per pattern should be included. Empty dicts will use the matcher instance's default settings. Default is None.
on_match (Optional[Callable[[RegexMatcher, Doc, int, List[Tuple[str, int, int, int, str]]], None]]) – Optional callback function to modify the Doc object the matcher is called on after matching. Default is None.
- Raises
TypeError – If patterns is not a list of strings.
TypeError – If kwargs is not a list of dictionaries.
Warning
- KwargsWarning:
If there are more patterns than kwargs dicts, default match settings will be used for the extra patterns.
If there are more kwargs dicts than patterns, the extra kwargs will be ignored.
Example
>>> import spacy
>>> from spaczz.matcher import RegexMatcher
>>> nlp = spacy.blank("en")
>>> matcher = RegexMatcher(nlp.vocab)
>>> matcher.add("GPE", ["[Uu](nited|\.?) ?[Ss](tates|\.?)"])
>>> "GPE" in matcher
True
- Return type
None
- property labels: Tuple[str, ...]¶
All labels present in the matcher.
- Return type
Tuple[str, ...]
- Returns
The unique labels as a tuple of strings.
Example
>>> import spacy
>>> from spaczz.matcher import RegexMatcher
>>> nlp = spacy.blank("en")
>>> matcher = RegexMatcher(nlp.vocab)
>>> matcher.add("ZIP", ["zip_codes"], [{"predef": True}])
>>> matcher.labels
('ZIP',)
- property patterns: List[Dict[str, Any]]¶
Get all patterns and match settings that were added to the matcher.
- Return type
List[Dict[str, Any]]
- Returns
The patterns and their respective match settings as a list of dicts.
Example
>>> import spacy
>>> from spaczz.matcher import RegexMatcher
>>> nlp = spacy.blank("en")
>>> matcher = RegexMatcher(nlp.vocab)
>>> matcher.add("ZIP", ["zip_codes"], [{"predef": True}])
>>> matcher.patterns == [
...     {
...         "label": "ZIP",
...         "pattern": "zip_codes",
...         "type": "regex",
...         "kwargs": {"predef": True},
...     },
... ]
True
- remove(label)¶
Remove a label and its respective patterns from the matcher.
- Parameters
label (str) – Name of the rule added to the matcher.
- Raises
ValueError – If label does not exist in the matcher.
Example
>>> import spacy
>>> from spaczz.matcher import RegexMatcher
>>> nlp = spacy.blank("en")
>>> matcher = RegexMatcher(nlp.vocab)
>>> matcher.add("GPE", ["[Uu](nited|\.?) ?[Ss](tates|\.?)"])
>>> matcher.remove("GPE")
>>> "GPE" in matcher
False
- Return type
None
- property type: Literal['fuzzy', 'regex', 'token', 'similarity', 'phrase']¶
Getter for the matcher's SpaczzType.
- Return type
Literal['fuzzy', 'regex', 'token', 'similarity', 'phrase']
- property vocab: spacy.vocab.Vocab¶
Getter for the matcher's Vocab.
- Return type
Vocab
- class spaczz.matcher.SimilarityMatcher(vocab, **defaults)¶
spaCy-like matcher for finding phrase similarity matches in Doc objects.
Similarity matches patterns against the Doc it is called on. Accepts labeled patterns in the form of Doc objects with optional, per-pattern match settings.
Similarity matching uses spaCy word vectors if available, therefore spaCy vocabs without word vectors may not produce useful results. The spaCy medium and large English models provide word vectors that will work for this purpose.
Searching over/with Doc objects that do not have vectors will always return a similarity score of 0.
Warnings from spaCy about the above two scenarios are suppressed for convenience. However, spaczz will still warn about the former.
- name¶
Class attribute - the name of the matcher.
- Type
str
- defaults¶
Keyword arguments to be used as default match settings. Per-pattern match settings take precedence over defaults.
- Type
dict[str, bool|int|str|Literal['default', 'min', 'max']]
- Match Settings
ignore_case (bool) – Whether to lower-case text before fuzzy matching. Default is True.
min_r (int) – Minimum match ratio required.
thresh (int) – If this ratio is exceeded in initial scan, and flex > 0, no optimization will be attempted. If flex == 0, thresh has no effect. Default is 100.
flex (int|Literal[‘default’, ‘min’, ‘max’]) – Number of tokens to move match boundaries left and right during optimization. Can be an int with a max of len(pattern) and a min of 0, (will warn and change if higher or lower). “max”, “min”, or “default” are also valid. Default is “default”: len(pattern) // 2.
min_r1 (int|None) – Optional granular control over the minimum match ratio required for selection during the initial scan. If flex == 0, min_r1 will be overwritten by min_r2. If flex > 0, min_r1 must be lower than min_r2 and “low” in general because match boundaries are not flexed initially. Default is None, which will result in min_r1 being set to round(min_r / 1.5).
min_r2 (int|None) – Optional granular control over the minimum match ratio required for selection during match optimization. Needs to be higher than min_r1 and “high” in general to ensure only quality matches are returned. Default is None, which will result in min_r2 being set to min_r.
Warning
- MissingVectorsWarning:
If vocab does not contain any word vectors.
- __call__(doc)¶
Finds matches in doc, given the matcher's patterns.
- Parameters
doc (Doc) – The Doc object to match over.
- Return type
List[Tuple[str, int, int, int, str]]
- Returns
A list of MatchResult tuples, (label, start index, end index, match ratio, pattern).
Example
>>> import spacy
>>> from spaczz.matcher import SimilarityMatcher
>>> nlp = spacy.load("en_core_web_md")
>>> matcher = SimilarityMatcher(nlp.vocab)
>>> doc = nlp("I like apples.")
>>> matcher.add("FRUIT", [nlp("fruit")], [{'min_r': 60}])
>>> matcher(doc)
[('FRUIT', 2, 3, 70, 'fruit')]
- __contains__(label)¶
Whether the matcher contains patterns for a label.
- Return type
bool
- __len__()¶
The number of labels added to the matcher.
- Return type
int
- __reduce__()¶
Interface for pickling the matcher.
- Return type
Tuple[Any, Any]
- add(label, patterns, kwargs=None, on_match=None)¶
Add a rule to the matcher, consisting of a label and one or more patterns.
Patterns must be a list of Doc objects, and if kwargs is not None, it must be a list of dicts.
- Parameters
label (str) – Name of the rule added to the matcher.
patterns (List[Doc]) – Doc objects that will be matched against the Doc object the matcher is called on.
kwargs (Optional[List[Dict[str, Any]]]) – Optional settings to modify the matching behavior. If supplying kwargs, one per pattern should be included. Empty dicts will use the matcher instance's default settings. Default is None.
on_match (Optional[Callable[[TypeVar(PMT, bound=PhraseMatcher), Doc, int, List[Tuple[str, int, int, int, str]]], None]]) – Optional callback function to modify the Doc object the matcher is called on after matching. Default is None.
Warning
- KwargsWarning:
If there are more patterns than kwargs dicts, default match settings will be used for the extra patterns.
If there are more kwargs dicts than patterns, the extra kwargs will be ignored.
Example
>>> import spacy
>>> from spaczz.matcher import SimilarityMatcher
>>> nlp = spacy.load("en_core_web_md")
>>> matcher = SimilarityMatcher(nlp.vocab)
>>> matcher.add("SOUND", [nlp("mooo")])
>>> "SOUND" in matcher
True
- Return type
None
- property labels: Tuple[str, ...]¶
All labels present in the matcher.
- Return type
Tuple[str, ...]
- Returns
The unique labels as a tuple of strings.
Example
>>> import spacy
>>> from spaczz.matcher import SimilarityMatcher
>>> nlp = spacy.load("en_core_web_md")
>>> matcher = SimilarityMatcher(nlp.vocab)
>>> matcher.add("AUTHOR", [nlp("Kerouac")])
>>> matcher.labels
('AUTHOR',)
- property patterns: List[Dict[str, Any]]¶
Get all patterns and match settings that were added to the matcher.
- Return type
List[Dict[str, Any]]
- Returns
The patterns and their respective match settings as a list of dicts.
Example
>>> import spacy
>>> from spaczz.matcher import SimilarityMatcher
>>> nlp = spacy.load("en_core_web_md")
>>> matcher = SimilarityMatcher(nlp.vocab)
>>> matcher.add("AUTHOR", [nlp("Kerouac")], [{"ignore_case": False}])
>>> matcher.patterns == [
...     {
...         "label": "AUTHOR",
...         "pattern": "Kerouac",
...         "type": "similarity",
...         "kwargs": {"ignore_case": False},
...     },
... ]
True
- remove(label)¶
Remove a label and its respective patterns from the matcher.
- Parameters
label (str) – Name of the rule added to the matcher.
Example
>>> import spacy
>>> from spaczz.matcher import SimilarityMatcher
>>> nlp = spacy.load("en_core_web_md")
>>> matcher = SimilarityMatcher(nlp.vocab)
>>> matcher.add("SOUND", [nlp("mooo")])
>>> matcher.remove("SOUND")
>>> "SOUND" in matcher
False
- Return type
None
- property type: Literal['fuzzy', 'regex', 'token', 'similarity', 'phrase']¶
Getter for the matcher's SpaczzType.
- Return type
Literal['fuzzy', 'regex', 'token', 'similarity', 'phrase']
- property vocab: spacy.vocab.Vocab¶
Getter for the matcher's Vocab.
- Return type
Vocab
- class spaczz.matcher.TokenMatcher(vocab, **defaults)¶
spaCy-like matcher for finding fuzzy token matches in Doc objects.
Fuzzy matches added patterns against the Doc object it is called on. Accepts labeled patterns in the form of lists of dictionaries where each list describes an individual pattern and each dictionary describes an individual token.
Uses extended spaCy token matching patterns. “FUZZY” and “FREGEX” are the two additional spaCy token pattern options.
For example:
[
    {"TEXT": {"FREGEX": "(database){e<=1}"}},
    {"LOWER": {"FUZZY": "access", "MIN_R": 85, "FUZZY_FUNC": "partial"}},
]
Make sure to use uppercase dictionary keys in patterns.
- name¶
Class attribute - the name of the matcher.
- Type
str
- defaults¶
Keyword arguments to be used as default match settings. Per-pattern match settings take precedence over defaults.
- Type
dict[str, bool|int|str]
- Match Settings
ignore_case (bool) – Whether to lower-case text before matching. Can only be set at the pattern level. For “FUZZY” and “FREGEX” patterns. Default is True.
min_r (int) – Minimum match ratio required. For “FUZZY” and “FREGEX” patterns.
fuzzy_func (str) – Key name of fuzzy matching function to use. Can only be set at the pattern level. For “FUZZY” patterns only. All rapidfuzz matching functions with default settings are available, however any token-based functions provide no utility at the individual token level. Additional fuzzy matching functions can be registered by users. Included, and useful, functions are:
“simple” = ratio
“partial” = partial_ratio
“quick” = QRatio
“partial_alignment” = partial_ratio_alignment (Requires rapidfuzz>=2.0.3)
Default is “simple”.
fuzzy_weights – Name of weighting method for regex insertion, deletion, and substitution counts. Can only be set at the pattern level. For “FREGEX” patterns only. Included weighting methods are:
“indel” = (1, 1, 2)
“lev” = (1, 1, 1)
Default is “indel”.
predef – Whether regex should be interpreted as a key to a predefined regex pattern or not. Can only be set at the pattern level. For “FREGEX” patterns only. Default is False.
- __call__(doc)¶
Finds matches in doc, given the matcher's patterns.
- Parameters
doc (Doc) – The Doc object to match over.
- Return type
List[Tuple[str, int, int, int, str]]
- Returns
A list of MatchResult tuples, (label, start index, end index, match ratio, pattern).
Example
>>> import spacy
>>> from spaczz.matcher import TokenMatcher
>>> nlp = spacy.blank("en")
>>> matcher = TokenMatcher(nlp.vocab)
>>> doc = nlp("Rdley Scot was the director of Alien.")
>>> matcher.add("NAME", [
...     [{"TEXT": {"FUZZY": "Ridley"}}, {"TEXT": {"FUZZY": "Scott"}}]
... ])
>>> matcher(doc)[0][:4]
('NAME', 0, 2, 90)
- __contains__(label)¶
Whether the matcher contains patterns for a label.
- Return type
bool
- __len__()¶
The number of labels added to the matcher.
- Return type
int
- __reduce__()¶
Interface for pickling the matcher.
- Return type
Tuple[Any, Any]
- add(label, patterns, on_match=None)¶
Add a rule to the matcher, consisting of a label and one or more patterns.
Patterns must be a list of lists of dicts, where each list of dicts represents an individual pattern and each dict represents an individual token.
Uses extended spaCy token matching patterns. “FUZZY” and “FREGEX” are the two additional spaCy token pattern options.
For example:
[
    {"TEXT": {"FREGEX": "(database){e<=1}"}},
    {"LOWER": {"FUZZY": "access", "MIN_R": 85, "FUZZY_FUNC": "partial"}},
]
Make sure to use uppercase dictionary keys in patterns.
- Parameters
label (str) – Name of the rule added to the matcher.
patterns (List[List[Dict[str, Any]]]) – List of lists of dicts that will be matched against the Doc object the matcher is called on.
on_match (Optional[Callable[[TokenMatcher, Doc, int, List[Tuple[str, int, int, int, str]]], None]]) – Optional callback function to modify the Doc object the matcher is called on after matching. Default is None.
- Raises
TypeError – If patterns is not a list of lists of dicts.
ValueError – Patterns cannot have zero tokens.
Example
>>> import spacy
>>> from spaczz.matcher import TokenMatcher
>>> nlp = spacy.blank("en")
>>> matcher = TokenMatcher(nlp.vocab)
>>> matcher.add("AUTHOR", [[{"TEXT": {"FUZZY": "Kerouac"}}]])
>>> "AUTHOR" in matcher
True
- Return type
None
- property labels: Tuple[str, ...]¶
All labels present in the matcher.
- Return type
Tuple[str, ...]
- Returns
The unique labels as a tuple of strings.
Example
>>> import spacy
>>> from spaczz.matcher import TokenMatcher
>>> nlp = spacy.blank("en")
>>> matcher = TokenMatcher(nlp.vocab)
>>> matcher.add("AUTHOR", [[{"TEXT": {"FUZZY": "Kerouac"}}]])
>>> matcher.labels
('AUTHOR',)
- property patterns: List[Dict[str, Any]]¶
Get all patterns and match settings that were added to the matcher.
- Return type
List[Dict[str, Any]]
- Returns
The patterns and their respective match settings as a list of dicts.
Example
>>> import spacy
>>> from spaczz.matcher import TokenMatcher
>>> nlp = spacy.blank("en")
>>> matcher = TokenMatcher(nlp.vocab)
>>> matcher.add("AUTHOR", [[{"TEXT": {"FUZZY": "Kerouac"}}]])
>>> matcher.patterns == [
...     {
...         "label": "AUTHOR",
...         "pattern": [{"TEXT": {"FUZZY": "Kerouac"}}],
...         "type": "token",
...     },
... ]
True
- remove(label)¶
Remove a label and its respective patterns from the matcher.
- Parameters
label (str) – Name of the rule added to the matcher.
- Raises
ValueError – If label does not exist in the matcher.
Example
>>> import spacy
>>> from spaczz.matcher import TokenMatcher
>>> nlp = spacy.blank("en")
>>> matcher = TokenMatcher(nlp.vocab)
>>> matcher.add("AUTHOR", [[{"TEXT": {"FUZZY": "Kerouac"}}]])
>>> matcher.remove("AUTHOR")
>>> "AUTHOR" in matcher
False
- Return type
None
- property type: Literal['fuzzy', 'regex', 'token', 'similarity', 'phrase']¶
Getter for the matcher's SpaczzType.
- Return type
Literal['fuzzy', 'regex', 'token', 'similarity', 'phrase']
- property vocab: spacy.vocab.Vocab¶
Getter for the matcher's Vocab.
- Return type
Vocab
spaczz.pipeline¶
Module for pipeline components.
- class spaczz.pipeline.SpaczzRuler(nlp, name='spaczz_ruler', *, overwrite_ents=False, ent_id_sep='||', fuzzy_defaults={}, regex_defaults={}, token_defaults={}, patterns=None, scorer=<function spaczz_ruler_scorer>)¶
The SpaczzRuler adds fuzzy matches to spaCy Doc.ents.
It can be combined with other spaCy NER components like the statistical EntityRecognizer, and/or the EntityRuler it is inspired by, to boost accuracy. After initialization, the component is typically added to the pipeline using nlp.add_pipe.
- nlp¶
The shared Language object that passes its Vocab to the matchers (not currently used by spaczz matchers) and processes fuzzy patterns.
- Type
Language
- name¶
Instance name of the current pipeline component. Typically passed in automatically from the factory when the component is added. Used to disable the current entity ruler while creating phrase patterns with the nlp object.
- Type
str
- overwrite_ents¶
If existing entities are present, e.g. entities added by the model, overwrite them with matches when necessary.
- Type
bool
- ent_id_sep¶
Separator used internally for entity IDs.
- Type
str
- scorer¶
The scoring method for the ruler.
- Type
Optional[Callable]
- fuzzy_matcher¶
The FuzzyMatcher instance the spaczz ruler will use for fuzzy phrase matching.
- Type
FuzzyMatcher
- regex_matcher¶
The RegexMatcher instance the spaczz ruler will use for regex phrase matching.
- Type
RegexMatcher
- token_matcher¶
The TokenMatcher instance the spaczz ruler will use for fuzzy token matching.
- Type
TokenMatcher
- defaults¶
Default match settings for their respective matchers.
- Type
Dict[str, Any]
- __call__(doc)¶
Find matches in document and add them as entities.
- Parameters
doc (Doc) – The Doc object in the pipeline.
- Return type
Doc
- Returns
The Doc with added entities, if available.
Example
>>> import spacy
>>> from spaczz.pipeline import SpaczzRuler
>>> nlp = spacy.blank("en")
>>> ruler = SpaczzRuler(nlp)
>>> doc = nlp.make_doc("My name is Anderson, Grunt")
>>> ruler.add_patterns([{"label": "NAME", "pattern": "Grant Andersen",
...     "type": "fuzzy", "kwargs": {"fuzzy_func": "token_sort"}}])
>>> doc = ruler(doc)
>>> "Anderson, Grunt" in [ent.text for ent in doc.ents]
True
- __contains__(label)¶
Whether a label is present in the patterns.
- Return type
bool
- __len__()¶
The number of all patterns added to the ruler.
- Return type
int
- add_patterns(patterns)¶
Add patterns to the ruler.
A pattern must be a spaczz pattern: {label (str), pattern (str or list), type (str), optional kwargs (dict[str, Any]), and optional id (str)}.
For example, a fuzzy phrase pattern:
{
    'label': 'ORG',
    'pattern': 'Apple',
    'kwargs': {'min_r2': 90},
    'type': 'fuzzy',
}
Or, a token pattern:
{
    'label': 'ORG',
    'pattern': [{'TEXT': {'FUZZY': 'Apple'}}],
    'type': 'token',
}
To utilize regex flags, use inline flags.
- Parameters
patterns (List[Dict[str, Union[str, Dict[str, Any], List[Dict[str, Any]]]]]) – The spaczz patterns to add.
- Raises
TypeError – If patterns is not a list of dicts.
ValueError – If one or more patterns do not conform to the spaczz pattern structure.
Example
>>> import spacy
>>> from spaczz.pipeline import SpaczzRuler
>>> nlp = spacy.blank("en")
>>> ruler = SpaczzRuler(nlp)
>>> ruler.add_patterns([{"label": "AUTHOR", "pattern": "Kerouac", "type": "fuzzy"}])
>>> "AUTHOR" in ruler.labels
True
- Return type
None
- clear()¶
Reset all patterns.
- Return type
None
- property ent_ids: Tuple[Optional[str], ...]¶
All entity IDs present in the match patterns' id properties.
- Return type
Tuple[Optional[str], ...]
- Returns
The unique string entity ids as a tuple.
Example
>>> import spacy
>>> from spaczz.pipeline import SpaczzRuler
>>> nlp = spacy.blank("en")
>>> ruler = SpaczzRuler(nlp)
>>> ruler.add_patterns([{"label": "AUTHOR", "pattern": "Kerouac",
...     "type": "fuzzy", "id": "BEAT"}])
>>> ruler.ent_ids
('BEAT',)
- from_bytes(patterns_bytes, *, exclude=[])¶
Load the spaczz ruler from a bytestring.
- Parameters
patterns_bytes (bytes) – The bytestring to load.
exclude (Iterable[str]) – For spaCy consistency.
- Return type
SpaczzRuler
- Returns
The loaded spaczz ruler.
Example
>>> import spacy
>>> from spaczz.pipeline import SpaczzRuler
>>> nlp = spacy.blank("en")
>>> ruler = SpaczzRuler(nlp)
>>> ruler.add_patterns([{"label": "AUTHOR", "pattern": "Kerouac", "type": "fuzzy"}])
>>> ruler_bytes = ruler.to_bytes()
>>> new_ruler = SpaczzRuler(nlp)
>>> new_ruler = new_ruler.from_bytes(ruler_bytes)
>>> "AUTHOR" in new_ruler
True
- from_disk(path, *, exclude=[])¶
Load the spaczz ruler from a file.
Expects a file containing newline-delimited JSON (JSONL) with one entry per line.
- Parameters
path (Union[str, Path]) – The JSONL file to load.
exclude (Iterable[str]) – For spaCy consistency.
- Return type
SpaczzRuler
- Returns
The loaded spaczz ruler.
- Raises
ValueError – If path does not exist or cannot be accessed.
Example
>>> import os
>>> import tempfile
>>> import spacy
>>> from spaczz.pipeline import SpaczzRuler
>>> nlp = spacy.blank("en")
>>> ruler = SpaczzRuler(nlp)
>>> ruler.add_patterns([{"label": "AUTHOR", "pattern": "Kerouac", "type": "fuzzy"}])
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     ruler.to_disk(f"{tmpdir}/ruler")
...     new_ruler = SpaczzRuler(nlp)
...     new_ruler = new_ruler.from_disk(f"{tmpdir}/ruler")
>>> "AUTHOR" in new_ruler
True
- initialize(get_examples, *, nlp=None, patterns=None)¶
Initialize the pipe for training.
- Parameters
get_examples (Callable[[], Iterable[Example]]) – Function that returns a representative sample of gold-standard Example objects.
nlp (Optional[Language]) – The current nlp object the component is part of.
patterns (Optional[Sequence[Dict[str, Union[str, Dict[str, Any], List[Dict[str, Any]]]]]]) – The list of patterns.
- Return type
None
- property labels: Tuple[str, ...]¶
All labels present in the ruler.
- Return type
Tuple[str, ...]
- Returns
The unique string labels as a tuple.
Example
>>> import spacy
>>> from spaczz.pipeline import SpaczzRuler
>>> nlp = spacy.blank("en")
>>> ruler = SpaczzRuler(nlp)
>>> ruler.add_patterns([{"label": "AUTHOR", "pattern": "Kerouac", "type": "fuzzy"}])
>>> ruler.labels
('AUTHOR',)
- match(doc)¶
Used in __call__ to find matches in doc.
- Return type
List[Tuple[str, int, int, int, str, Literal['fuzzy', 'regex', 'token', 'similarity', 'phrase']]]
- property patterns: List[Dict[str, Union[str, Dict[str, Any], List[Dict[str, Any]]]]]¶
Get all patterns and kwargs that were added to the ruler.
- Return type
List[Dict[str, Union[str, Dict[str, Any], List[Dict[str, Any]]]]]
- Returns
The original patterns and kwargs, one dictionary for each combination.
Example
>>> import spacy
>>> from spaczz.pipeline import SpaczzRuler
>>> nlp = spacy.blank("en")
>>> ruler = SpaczzRuler(nlp)
>>> ruler.add_patterns([{"label": "STREET", "pattern": "street_addresses",
...     "type": "regex", "kwargs": {"predef": True}}])
>>> ruler.patterns == [
...     {
...         "label": "STREET",
...         "pattern": "street_addresses",
...         "type": "regex",
...         "kwargs": {"predef": True},
...     },
... ]
True
- remove(ent_id)¶
Remove patterns by their ent_id.
- Return type
None
- score(examples, **kwargs)¶
Pipeline scoring for spaCy >= 3.0, < 3.2 compatibility.
- Return type
Any
- set_annotations(doc, matches)¶
Modify the document in place.
- Return type
None
- to_bytes(*, exclude=[])¶
Serialize the spaczz ruler patterns to a bytestring.
- Parameters
exclude (Iterable[str]) – For spaCy consistency.
- Return type
bytes
- Returns
The serialized patterns.
Example
>>> import spacy
>>> from spaczz.pipeline import SpaczzRuler
>>> nlp = spacy.blank("en")
>>> ruler = SpaczzRuler(nlp)
>>> ruler.add_patterns([{"label": "AUTHOR", "pattern": "Kerouac", "type": "fuzzy"}])
>>> ruler_bytes = ruler.to_bytes()
>>> isinstance(ruler_bytes, bytes)
True
- to_disk(path, *, exclude=[])¶
Save the spaczz ruler patterns to a directory.
The patterns will be saved as newline-delimited JSON (JSONL).
- Parameters
path (Union[str, Path]) – The directory to save the patterns to.
exclude (Iterable[str]) – For spaCy consistency.
Example
>>> import os
>>> import tempfile
>>> import spacy
>>> from spaczz.pipeline import SpaczzRuler
>>> nlp = spacy.blank("en")
>>> ruler = SpaczzRuler(nlp)
>>> ruler.add_patterns([{"label": "AUTHOR", "pattern": "Kerouac", "type": "fuzzy"}])
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     ruler.to_disk(f"{tmpdir}/ruler")
...     isdir = os.path.isdir(f"{tmpdir}/ruler")
>>> isdir
True
- Return type
None
spaczz.registry¶
Function and object registries.
- spaczz.registry.get_fuzzy_func(name)¶
Get the registered fuzzy matching function for a given name.
- Parameters
name (str) – The name.
- Returns
The registered function.
- Return type
Any
- spaczz.registry.get_re_pattern(name)¶
Get the registered predefined regex pattern for a given name.
- Parameters
name (str) – The name.
- Returns
The registered pattern.
- Return type
Any
- spaczz.registry.get_re_weights(name)¶
Get the registered regex fuzzy weights for a given name.
- Parameters
name (str) – The name.
- Returns
The registered weights.
- Return type
Any
spaczz.customattrs¶
Custom spaCy attributes for spaczz.
- class spaczz.customattrs.SpaczzAttrs¶
Adds spaczz custom attributes to spaCy.
- static get_doc_types(doc)¶
Getter for spaczz_types Doc attribute.
- Return type
Set[Literal['fuzzy', 'regex', 'token', 'similarity', 'phrase']]
- classmethod get_pattern(span)¶
Getter for spaczz_pattern Span attribute.
- Return type
Optional[str]
- classmethod get_ratio(span)¶
Getter for spaczz_ratio Span attribute.
- Return type
Optional[int]
- static get_spaczz_doc(doc)¶
Getter for spaczz_doc Doc attribute.
- Return type
bool
- static get_spaczz_ent(span)¶
Getter for spaczz_ent Span attribute.
- Return type
bool
- classmethod get_span_type(span)¶
Getter for spaczz_type Span attribute.
- Return type
Optional[Literal['fuzzy', 'regex', 'token', 'similarity', 'phrase']]
- static get_span_types(span)¶
Getter for spaczz_types Span attribute.
- Return type
Set[Literal['fuzzy', 'regex', 'token', 'similarity', 'phrase']]
- classmethod initialize()¶
Initializes and registers custom attributes.
- Return type
None
spaczz.customtypes¶
Custom spaczz types.
spaczz.exceptions¶
Module for custom exceptions and warnings.
- exception spaczz.exceptions.AttrOverwriteWarning¶
Warns if custom spaCy attributes are being overwritten.
- exception spaczz.exceptions.FlexWarning¶
Warns if the flex value was out of bounds and had to be adjusted.
- exception spaczz.exceptions.KwargsWarning¶
Warns if there are more kwargs dicts than patterns, or vice versa.
- exception spaczz.exceptions.MissingVectorsWarning¶
Warns if the spaCy Vocab does not contain word vectors.
- exception spaczz.exceptions.PatternTypeWarning¶
Warns if a spaczz pattern does not have a valid pattern type.
- exception spaczz.exceptions.RatioWarning¶
Warns if match ratio values are incompatible with each other.
- exception spaczz.exceptions.RegexParseError¶
General error for problems that arise during regex compilation.