🗣 Retrieval Vocabulary

Chunk: A piece of data or information, often a subset of a larger document, that will be handled as a single unit by a retrieval system
Chunking: The act of splitting your long body of text into smaller parts. Similar to text splitting.
Cosine Similarity: A metric used to measure the similarity between two vectors (often representing text embeddings) in a multi-dimensional space, by calculating the cosine of the angle between them
Dimension Size: The number of components (axes) in an embedding vector. It is fixed by the embedding model, so every document, chunk, and query embedded with the same model has the same dimension size
Document Loader: A component of a retrieval system that is responsible for importing documents into the system and preparing them for indexing and retrieval
Document Store (DocStore): A specialized database for storing, managing, and retrieving documents within a retrieval system
Document: A unit of data or information that can be text, image, audio, or video, which the system can retrieve and present in response to a query
Embedding: A numerical representation of a document or chunk as a vector, often in a high-dimensional space, arranged so that semantically similar content ends up close together. In sparse representations a dimension can correspond to a word or phrase; in dense learned embeddings the individual dimensions are generally not interpretable on their own. Similar to vector
Full Stack Retrieval: The entire retrieval system, which handles everything from data ingestion and processing to query handling and information delivery
Index: A data structure that allows for fast retrieval of documents or chunks within a large dataset. It maps key terms or features to their locations in the dataset
Knowledge Base: A structured database of facts, information, and rules that a retrieval system can draw upon to answer queries or perform tasks
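Maximal Marginal Relevance (MMR): An algorithm that selects a set of search results that are both relevant to the query and diverse. Each candidate is scored by its relevance to the query minus its similarity to results already selected, which reduces content overlap and broadens the range of information returned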
Reranker: A model that improves the precision of document retrieval by reevaluating and scoring the relevance of a pre-selected set of documents to a specific query, aiming to refine the results for higher accuracy.
Retriever: A component that fetches relevant documents from a corpus or database based on a query, often using embeddings and similarity measures.
Sentiment: The emotional tone or meaning behind a series of words, used to understand the attitudes, opinions, and emotions expressed in a chunk of text
Text Splitting: Another way to say chunking. The act of splitting up your long body of text into smaller parts
Vector Store (VectorStore): A database or storage system where vectors are kept. It allows for efficient retrieval and comparison of vectors for operations like similarity searching
Vector: An ordered list of numbers. In retrieval, a vector usually holds the embedding of a document, chunk, or query, which is why the two terms are often used interchangeably. Similar to embedding
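
Example: to make the Cosine Similarity and MMR entries above more concrete, here is a minimal sketch in plain Python. It assumes vectors are plain lists of floats and that MMR uses cosine similarity for both relevance and redundancy. The function names and the lambda_mult parameter are illustrative choices for this sketch, not taken from any particular library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedy MMR selection; returns the indices of up to k documents.

    lambda_mult (hypothetical name) trades relevance against diversity:
    1.0 means pure relevance, 0.0 means pure diversity.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))

    def score(i):
        relevance = cosine_similarity(query_vec, doc_vecs[i])
        # Redundancy is the highest similarity to anything already picked.
        redundancy = max(
            (cosine_similarity(doc_vecs[i], doc_vecs[j]) for j in selected),
            default=0.0,
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: documents 0 and 1 are exact duplicates, document 2 is different.
query = [1.0, 0.5]
docs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
balanced = mmr(query, docs, k=2)                     # skips the duplicate
relevance_only = mmr(query, docs, k=2, lambda_mult=1.0)  # keeps the duplicate
```

With this toy data, the balanced setting picks one of the duplicate documents and then the different one, while the relevance-only setting returns both duplicates. A real system would typically run this over the top results returned by a vector store instead of the whole corpus.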