API Reference

A package for reading and manipulating word embeddings.

Reach

class reach.Reach(vectors: Union[ndarray, List[ndarray]], items: List[Hashable], name: str = '', unk_index: Optional[int] = None)

Work with vector representations of items.

Supports functions for calculating fast batched similarity between items or composite representations of items.

Parameters:

vectors (numpy array) – The vector space.
items (list) – A list of items. Length must be equal to the number of vectors, and aligned with the vectors.
name (string, optional, default '') – A string giving the name of the current reach. Only useful if you have multiple spaces and want to keep track of them.
unk_index (int or None, optional, default None) – The index of the UNK item. If this is None, any attempts at vectorizing OOV items will throw an error.

unk_index

The integer index of your unknown glyph. This glyph will be inserted into your BoW space whenever an unknown item is encountered.

Type:: int

name

The name of the Reach instance.

Type:: string

property items: Dict[Hashable, int]: A mapping from item ids to their indices.

property indices: Dict[int, Hashable]: A mapping from indices to their items.

property sorted_items: Iterable[Hashable]: The items, sorted by index.

property size: int: The dimensionality of the vectors.

property vectors: ndarray: The vectors themselves.

property norm_vectors: ndarray: Vectors, but normalized to unit length.
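
The relationship between the aligned item list and the items, indices and sorted_items properties can be sketched in plain Python. This is only an illustration of the mappings described above, not the library's own code; the example words and the variable item_list are made up.

```python
# Illustrative only: plain-Python stand-ins for the mappings described above.
item_list = ["cat", "dog", "fish"]  # hypothetical items, aligned with the rows of the vector matrix

# items: item -> row index
items = {item: index for index, item in enumerate(item_list)}
# indices: row index -> item
indices = {index: item for item, index in items.items()}
# sorted_items: the items ordered by their index
sorted_items = [item for item, _ in sorted(items.items(), key=lambda pair: pair[1])]

print(items["dog"])   # 1
print(indices[2])     # fish
print(sorted_items)   # ['cat', 'dog', 'fish']
```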

classmethod load(vector_file: Union[Path, TextIOWrapper, str], wordlist: Optional[Tuple[str, ...]] = None, num_to_load: Optional[int] = None, truncate_embeddings: Optional[int] = None, unk_word: Optional[str] = None, sep: str = ' ', recover_from_errors: bool = False, desired_dtype: Union[str, dtype] = 'float32', **kwargs: Any) → Reach

Read a file in word2vec .txt format.

The load function will raise a ValueError when a line in the file does not conform to the expected line length.

Parameters:

vector_file (string, Path or file handle) – The path to the vector file, or an opened vector file.
header (bool) – Whether the vector file has a header of the type (NUMBER OF ITEMS, SIZE OF VECTOR).
wordlist (iterable, optional, default None) – A list of words you want loaded from the vector file. If this is None (default), all words will be loaded.
num_to_load (int, optional, default None) – The number of items to load from the file. Because loading can take some time, it is sometimes useful to only load the first n items from a vector file for quick inspection.
truncate_embeddings (int, optional, default None) – If this value is not None, the vectors in the vector space will be truncated to the number of dimensions indicated by this value.
unk_word (object) – The object to treat as UNK in your vector space. If this is not in your items dictionary after loading, we add it with a zero vector.
recover_from_errors (bool) – If this flag is True, the model will continue after encountering duplicates or other errors.

Returns:

r – An initialized Reach instance.

Return type:

Reach
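
For readers unfamiliar with the word2vec .txt layout, the following is a simplified, dependency-free sketch of how such a file can be parsed. It is not the library's loader: the function name read_word2vec_text and the header auto-detection are assumptions made for illustration. It does mirror the documented behaviour of raising a ValueError when a line has the wrong length.

```python
import io

def read_word2vec_text(handle, sep=" "):
    """Parse word2vec .txt content into (items, vectors); a simplified sketch."""
    first = handle.readline().rstrip("\n").split(sep)
    # Assumption: a header has exactly two integer fields, (NUMBER OF ITEMS, SIZE OF VECTOR).
    has_header = len(first) == 2 and all(field.isdigit() for field in first)
    lines = [] if has_header else [first]
    lines.extend(line.rstrip("\n").split(sep) for line in handle if line.strip())

    size = int(first[1]) if has_header else len(first) - 1
    items, vectors = [], []
    for fields in lines:
        # The first field is the item, the remaining fields are the vector.
        if len(fields) - 1 != size:
            raise ValueError(f"expected {size} dimensions, got {len(fields) - 1}")
        items.append(fields[0])
        vectors.append([float(value) for value in fields[1:]])
    return items, vectors

example = io.StringIO("2 3\ncat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
items, vectors = read_word2vec_text(example)
print(items)  # ['cat', 'dog']
```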

vectorize(tokens: Iterable[Hashable], remove_oov: bool = False, norm: bool = False) → ndarray

Vectorize a sentence by replacing all items with their vectors.

Parameters:

tokens (object or list of objects) – The tokens to vectorize.
remove_oov (bool, optional, default False) – Whether to remove OOV items. If False, OOV items are replaced by the UNK glyph. If this is True, the returned sequence might have a different length than the original sequence.
norm (bool, optional, default False) – Whether to return the unit vectors, or the regular vectors.

Returns:

s – An M * N matrix, where every item has been replaced by its vector. OOV items are either removed, or replaced by the value of the UNK glyph.

Return type:

numpy array
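
The OOV handling described above can be sketched without the library. The function vectorize_sketch and the example data are hypothetical, and raising a ValueError when no unk_index is set is an assumption based on the constructor documentation.

```python
def vectorize_sketch(tokens, items, vectors, unk_index=None, remove_oov=False):
    """Replace each token by its vector, following the OOV rules described above."""
    rows = []
    for token in tokens:
        if token in items:
            rows.append(vectors[items[token]])
        elif remove_oov:
            continue  # OOV items are dropped, so the output may be shorter than the input
        elif unk_index is not None:
            rows.append(vectors[unk_index])  # OOV items become the UNK vector
        else:
            raise ValueError(f"{token!r} is OOV and no unk_index is set")
    return rows

# Hypothetical two-dimensional space with an UNK item at index 0.
example_items = {"<UNK>": 0, "cat": 1, "dog": 2}
example_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

print(vectorize_sketch(["cat", "bird"], example_items, example_vectors, unk_index=0))
# [[1.0, 0.0], [0.0, 0.0]]
print(vectorize_sketch(["cat", "bird"], example_items, example_vectors, remove_oov=True))
# [[1.0, 0.0]]
```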

mean_pool(tokens: Iterable[Hashable], remove_oov: bool = False, safeguard: bool = True) → ndarray

Mean pool a list of tokens.

Parameters:

tokens (list) – The list of items to vectorize and then mean pool.
remove_oov (bool, optional, default False) – Whether to remove OOV items from the input. If this is False and an unknown item is encountered, the <UNK> symbol is inserted if it is set. If it is not set, the function throws a ValueError.
safeguard (bool, optional, default True) –
There are a variety of reasons why we can’t vectorize a list of tokens:
- The list might be empty after removing OOV
- We don’t remove OOV but haven’t set <UNK>
- The list of tokens is empty
If safeguard is False, we simply supply a zero vector instead of erroring.

Returns:

vector – a vector of the correct size, which is the mean of all tokens in the sentence.

Return type:

np.ndarray
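
The pooling step and the safeguard fallback can be sketched as follows. This is a minimal illustration under the assumption that the tokens have already been turned into vectors; mean_pool_sketch is a hypothetical helper, not a library function.

```python
def mean_pool_sketch(rows, size, safeguard=True):
    """Average a list of equal-length vectors; `rows` may be empty."""
    if not rows:
        if safeguard:
            raise ValueError("no vectors left to pool")
        return [0.0] * size  # safeguard=False: fall back to a zero vector
    # zip(*rows) iterates over columns, so each mean is taken per dimension.
    return [sum(column) / len(rows) for column in zip(*rows)]

print(mean_pool_sketch([[1.0, 0.0], [0.0, 1.0]], size=2))  # [0.5, 0.5]
print(mean_pool_sketch([], size=2, safeguard=False))       # [0.0, 0.0]
```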

mean_pool_corpus(corpus: List[Iterable[Hashable]], remove_oov: bool = False, safeguard: bool = True) → ndarray

Mean pool a list of lists of tokens.

Parameters:

corpus (list of lists of tokens) – The lists of items to vectorize and then mean pool.
remove_oov (bool, optional, default False) – Whether to remove OOV items from the input. If this is False and an unknown item is encountered, the <UNK> symbol is inserted if it is set. If it is not set, the function throws a ValueError.
safeguard (bool, optional, default True) –
There are a variety of reasons why we can’t vectorize a list of tokens:
- The list might be empty after removing OOV
- We don’t remove OOV but haven’t set <UNK>
- The list of tokens is empty
If safeguard is False, we simply supply a zero vector instead of erroring.

Returns:

vector – a matrix with number of rows n, where n is the number of input lists, and columns s, which is the number of columns of a single vector.

Return type:

np.ndarray

bow(tokens: Iterable[Hashable], remove_oov: bool = False) → List[int]

Create a bow representation of a list of tokens.

Parameters:

tokens (list) – The list of items to change into a bag of words representation.
remove_oov (bool, optional, default False) – Whether to remove OOV items from the input. If this is True, the length of the returned BOW representation might not be the length of the original representation.

Returns:

bow – A BOW representation of the list of items.

Return type:

list

transform(corpus: List[Iterable[Hashable]], remove_oov: bool = False, norm: bool = False) → List[ndarray]

Transform a corpus by repeated calls to vectorize, defined above.

Parameters:

corpus (list of lists of strings) – Represents a corpus as a list of sentences, where a sentence is a list of tokens.
remove_oov (bool, optional, default False) – If True, removes OOV items from the input before vectorization.
norm (bool, optional, default False) – If True, this will return normalized vectors.

Returns:

c – A list of numpy arrays, where each array represents the transformed sentence in the original list. The list is guaranteed to be the same length as the input list, but the arrays in the list may be of different lengths, depending on whether remove_oov is True.

Return type:

list

most_similar(items: Iterable[Hashable], num: int = 10, batch_size: int = 100, show_progressbar: bool = False) → List[List[Tuple[Hashable, float]]]

Return the num most similar items to a given list of items.

Parameters:

items (list of objects or a single object) – The items to get the most similar items to.
num (int, optional, default 10) – The number of most similar items to retrieve.
batch_size (int, optional, default 100) – The batch size to use. 100 is a good default option. Increasing the batch size may increase the speed.
show_progressbar (bool, optional, default False) – Whether to show a progressbar.

Returns:

sim – For each item in the input, the num most similar items are returned in the form of (NAME, SIMILARITY) tuples.

Return type:

list
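
As a rough picture of what a top-num lookup does, the sketch below ranks a tiny hypothetical space by cosine similarity. It assumes cosine similarity as the measure and that the query item itself is excluded from its own results; neither the helper names nor the batching-free approach reflect the library's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def most_similar_sketch(query, space, num=10):
    """Return the `num` items in `space` closest to `query` as (NAME, SIMILARITY) tuples."""
    scored = [
        (name, cosine(space[query], vector))
        for name, vector in space.items()
        if name != query  # assumption: the query is not its own neighbor
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num]

# Hypothetical item -> vector space.
space = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
result = most_similar_sketch("cat", space, num=2)
print([name for name, _ in result])  # ['dog', 'car']
```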

threshold(items: Iterable[Hashable], threshold: float = 0.5, batch_size: int = 100, show_progressbar: bool = False) → List[List[Tuple[Hashable, float]]]

Return all items whose similarity is higher than threshold.

Parameters:

items (list of objects or a single object) – The items for which to retrieve similar items.
threshold (float, optional, default .5) – The similarity an item must exceed in order to be retrieved.
batch_size (int, optional, default 100) – The batch size to use. 100 is a good default option. Increasing the batch size may increase the speed.
show_progressbar (bool, optional, default False) – Whether to show a progressbar.

Returns:

sim – For each item in the input, all items whose similarity is higher than threshold are returned in the form of (NAME, SIMILARITY) tuples.

Return type:

list

nearest_neighbor(vectors: ndarray, num: int = 10, batch_size: int = 100, show_progressbar: bool = False) → List[List[Tuple[Hashable, float]]]

Find the nearest neighbors to some arbitrary vector.

This function is meant to be used in composition operations. The most_similar function can only handle items that are in vocab, and looks up their vector through a dictionary. Compositions, e.g. “King - man + woman” are necessarily not in the vocabulary.

Parameters:

vectors (list of arrays or numpy array) – The vectors to find the nearest neighbors to.
num (int, optional, default 10) – The number of most similar items to retrieve.
batch_size (int, optional, default 100) – The batch size to use. 100 is a good default option. Increasing the batch size may increase speed.
show_progressbar (bool, optional, default False) – Whether to show a progressbar.

Returns:

sim – For each item in the input the num most similar items are returned in the form of (NAME, SIMILARITY) tuples.

Return type:

list of tuples.
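
The composition case mentioned above can be made concrete with a toy example. The two-dimensional vectors below are invented so that the arithmetic works out exactly; real embeddings are only approximately this regular, and the ranking by cosine similarity is an assumption about the measure used.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

# Hypothetical space: dimension 0 encodes gender, dimension 1 encodes royalty.
vocab = {
    "king": [1.0, 1.0],
    "queen": [-1.0, 1.0],
    "man": [1.0, 0.0],
    "woman": [-1.0, 0.0],
}

# "King - man + woman" yields a vector that is not tied to any vocabulary entry.
composed = [k - m + w for k, m, w in zip(vocab["king"], vocab["man"], vocab["woman"])]

# Rank every vocabulary item by its similarity to the composed vector.
ranked = sorted(vocab, key=lambda name: cosine(composed, vocab[name]), reverse=True)
print(ranked[0])  # queen
```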

nearest_neighbor_threshold(vectors: ndarray, threshold: float = 0.5, batch_size: int = 100, show_progressbar: bool = False) → List[List[Tuple[Hashable, float]]]

Find all items whose similarity to some arbitrary vector is higher than threshold.

This function is meant to be used in composition operations. The most_similar function can only handle items that are in vocab, and looks up their vector through a dictionary. Compositions, e.g. “King - man + woman” are necessarily not in the vocabulary.

Parameters:

vectors (list of arrays or numpy array) – The vectors to find the nearest neighbors to.
threshold (float, optional, default .5) – The similarity an item must exceed in order to be retrieved.
batch_size (int, optional, default 100) – The batch size to use. 100 is a good default option. Increasing the batch size may increase speed.
show_progressbar (bool, optional, default False) – Whether to show a progressbar.

Returns:

sim – For each item in the input, all items whose similarity is higher than threshold are returned in the form of (NAME, SIMILARITY) tuples.

Return type:

list of tuples.

static normalize(vectors: ndarray) → ndarray

Normalize a matrix of row vectors to unit length.

Contains a shortcut if there are no zero vectors in the matrix. If there are zero vectors, we do some indexing tricks to avoid dividing by 0.

Parameters:: vectors (np.array) – The vectors to normalize.
Returns:: vectors – The input vectors, normalized to unit length.
Return type:: np.array
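
The zero-vector concern can be illustrated with a dependency-free sketch. The library works on numpy matrices and uses indexing rather than a loop; normalize_rows below is a hypothetical stand-in that only shows the intended result.

```python
import math

def normalize_rows(rows):
    """Scale each row to unit length, leaving zero vectors untouched."""
    normalized = []
    for row in rows:
        length = math.sqrt(sum(value * value for value in row))
        # A zero vector has no direction; keep it as-is instead of dividing by 0.
        normalized.append(list(row) if length == 0 else [value / length for value in row])
    return normalized

print(normalize_rows([[3.0, 4.0], [0.0, 0.0]]))  # [[0.6, 0.8], [0.0, 0.0]]
```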

vector_similarity(vector: ndarray, items: Iterable[Hashable]) → ndarray: Compute the similarity between a vector and a set of items.

similarity(items_1: Iterable[Hashable], items_2: Iterable[Hashable]) → ndarray

Compute the similarity between two collections of items.

Parameters:

items_1 (iterable of items) – The first collection of items.
items_2 (iterable of items) – The second collection of items.

Returns:

sim – An array of similarity scores between -1 and 1.

Return type:

array of floats

intersect(itemlist: Iterable[Hashable]) → Reach

Intersect a reach instance with a list of items.

Parameters:: itemlist (list of hashables) – A list of items to keep. Note that this itemlist need not include all words in the Reach instance. Any words which are in the itemlist, but not in the reach instance, are ignored.

union(other: Reach, check: bool = True) → Reach

Union a reach with another reach. If items are in both reach instances, the current instance gets precedence.

Parameters:

other (Reach) – Another Reach instance.
check (bool) – Whether to check if duplicates are the same vector.
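
The semantics of intersect and union can be sketched on plain item-to-vector dictionaries. Both helpers are hypothetical; in particular, raising a ValueError when check finds differing duplicates is an assumption, since the text above only says that the check takes place.

```python
def intersect_sketch(space, itemlist):
    """Keep only the items of `space` that also occur in `itemlist`."""
    wanted = set(itemlist)
    # Items in `itemlist` that are missing from `space` are silently ignored.
    return {item: vector for item, vector in space.items() if item in wanted}

def union_sketch(current, other, check=True):
    """Merge two item -> vector dicts; `current` takes precedence on shared items."""
    merged = dict(current)
    for item, vector in other.items():
        if item in merged:
            if check and merged[item] != vector:
                # Assumption: a failed check is reported as an error.
                raise ValueError(f"{item!r} has different vectors in the two spaces")
            continue  # keep the vector from `current`
        merged[item] = vector
    return merged

print(intersect_sketch({"cat": [1.0], "dog": [2.0]}, ["dog", "bird"]))  # {'dog': [2.0]}
print(union_sketch({"cat": [1.0]}, {"cat": [9.0], "dog": [2.0]}, check=False))
# {'cat': [1.0], 'dog': [2.0]}
```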

save(path: str, write_header: bool = True) → None

Save the current vector space in word2vec format.

Parameters:

path (str) – The path to save the vector file to.
write_header (bool, optional, default True) – Whether to write a word2vec-style header as the first line of the file

save_fast_format(filename: str) → None

Save a reach instance in a fast format.

The reach fast format stores the words and vectors of a Reach instance separately in a JSON and numpy format, respectively.

Parameters:: filename (str) – The prefix to add to the saved filename. Note that this is not the real filename under which these items are stored. The words and unk_index are stored under “{filename}_words.json”, and the numpy matrix is saved under “{filename}_vectors.npy”.

classmethod load_fast_format(filename: Union[str, Path], desired_dtype: Union[str, dtype] = 'float32') → Reach

Load a reach instance in fast format.

As described above, the fast format stores the words and vectors of the Reach instance separately, and is drastically faster than loading from .txt files.

Parameters:: filename (str) – The filename prefix from which to load. Note that this is not a real filepath as such, but a shared prefix for both files. In order for this to work, both {filename}_words.json and {filename}_vectors.npy should be present.
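
Because filename is a prefix rather than a path to a single file, it can help to see which two files it expands to. The helper fast_format_paths and the prefix models/my_space are made up for illustration; only the two suffixes come from the description above.

```python
from pathlib import Path

def fast_format_paths(filename):
    """Return the (words, vectors) paths that share the given prefix."""
    prefix = str(filename)
    return Path(f"{prefix}_words.json"), Path(f"{prefix}_vectors.npy")

words_path, vectors_path = fast_format_paths("models/my_space")
print(words_path.name)    # my_space_words.json
print(vectors_path.name)  # my_space_vectors.npy
```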

AutoReach

class reach.AutoReach(vectors: Union[ndarray, List[ndarray]], items: List[Hashable], lowercase: Union[str, bool] = 'auto', name: str = '', unk_index: Optional[int] = None)

A Reach variant that does not require tokenization.

It uses the Aho-Corasick algorithm to build an automaton, which is then used to find candidates in strings. These candidates are then selected using a “word rule” (see is_valid_token). This rule is currently geared to languages that delimit words using spaces. If this is not the case for your language, please subclass this class and write rules that fit your language of choice.

Parameters:

vectors (numpy array) – The vector space.
items (list) – A list of items. Length must be equal to the number of vectors, and aligned with the vectors.
lowercase (bool or str) – This determines whether the string should be lowercased or not before searching it. If this is set to ‘auto’, the items in the vector space are used to determine whether this attribute should be true or false.
name (string, optional, default '') – A string giving the name of the current reach. Only useful if you have multiple spaces and want to keep track of them.
unk_index (int or None, optional, default None) – The index of the UNK item. If this is None, any attempts at vectorizing OOV items will throw an error.

unk_index

The integer index of your unknown glyph. This glyph will be inserted into your BoW space whenever an unknown item is encountered.

Type:: int

name

The name of the Reach instance.

Type:: string

property lowercase: bool: Whether to lowercase a string before searching it.

is_valid_token(token: str, tokens: str, end_index: int) → bool: Checks whether a token is valid in the current context.
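
To give an idea of what a space-based word rule might look like, here is a hypothetical check with the same argument order as is_valid_token. It is not the library's rule: the boundary set and the assumption that end_index points at the last character of the match (inclusive) are both illustrative choices.

```python
import string

# Assumption: whitespace and ASCII punctuation count as word boundaries.
BOUNDARY = set(string.whitespace) | set(string.punctuation)

def is_word_bounded(token, text, end_index):
    """Return True if `token`, ending at `end_index` (inclusive) in `text`, is delimited on both sides."""
    start = end_index - len(token) + 1
    before_ok = start == 0 or text[start - 1] in BOUNDARY
    after_ok = end_index == len(text) - 1 or text[end_index + 1] in BOUNDARY
    return before_ok and after_ok

text = "the cat concatenates"
print(is_word_bounded("cat", text, 6))   # True: "cat" stands on its own
print(is_word_bounded("cat", text, 13))  # False: "cat" inside "concatenates"
```

A subclass for a language without space-delimited words would replace this kind of check with its own rule.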

bow(tokens: Iterable[Hashable], remove_oov: bool = True) → List[int]

Create a bow representation from a string.

Parameters:

tokens (str) – The string from which to extract in-vocabulary tokens.
remove_oov (bool) – Not used.

Returns:

bow – A BOW representation of the list of items.

Return type:

list