The high-performance .NET search engine based on pattern recognition
Infidex is a search engine based on pattern recognition. Learning from your data, Infidex automatically extracts features like frequency and rarity and embeds them into a multi-dimensional hypersphere for intelligent matching. This enables fuzzy querying with unparalleled typo tolerance, without any manual tuning. Zero dependencies, blazingly fast, built for developers who need search that just works.
- Blazingly Fast - Index thousands of documents per second, search in milliseconds
- Intelligent Matching - Finds what you're looking for even with typos and variations
- Per-Term Coverage - Ranks documents by how many query terms they match (more terms = higher rank)
- Rich Filtering - SQL-like query language (Infiscript) for complex filters
- Faceted Search - Build dynamic filters and aggregations
- Smart Ranking - Lexicographic (coverage, quality) scoring for principled relevance
- Multi-Field Search - Search across multiple fields with configurable weights
- Incremental Indexing - Add or update documents without rebuilding the entire index
- Fully Thread-Safe - Multiple concurrent readers, writers block readers and other writers
- Production Ready - Comprehensive test coverage, clean API, zero dependencies
- Easy Integration - Embeds directly into your .NET application
Install via NuGet:
```bash
dotnet add package Infidex
```

```csharp
using Infidex;
using Infidex.Core;
// Create search engine
var engine = SearchEngine.CreateDefault();
// Index documents
var documents = new[]
{
new Document(1L, "The quick brown fox jumps over the lazy dog"),
new Document(2L, "A journey of a thousand miles begins with a single step"),
new Document(3L, "To be or not to be that is the question")
};
engine.IndexDocuments(documents);
// Search with typos - still finds matches!
var results = engine.Search("quik fox", maxResults: 10);
foreach (var result in results.Records)
{
Console.WriteLine($"Doc {result.DocumentId}: Score {result.Score}");
}
```

```csharp
using Infidex.Api;
// Define fields with weights
var matrix = new DocumentFields();
matrix.AddField("title", "The Matrix", Weight.High);
matrix.AddField("description", "A computer hacker learns about the true nature of reality", Weight.Low);
var inception = new DocumentFields();
inception.AddField("title", "Inception", Weight.High);
inception.AddField("description", "A thief who steals corporate secrets through dream-sharing", Weight.Low);
var movies = new[]
{
new Document(1L, matrix),
new Document(2L, inception)
};
engine.IndexDocuments(movies);
```

Infiscript is a simple filtering language used to write intuitive filters that compile to optimized bytecode:

```csharp
using Infidex.Api;
// Simple comparison
var filter = Filter.Parse("genre = 'Sci-Fi'");
// Boolean logic
var filter = Filter.Parse("genre = 'Sci-Fi' AND year >= 2000");
// Complex expressions with grouping
var filter = Filter.Parse("(genre = 'Fantasy' AND year >= 2000) OR (genre = 'Horror' AND year >= 1980)");
// String operations
var filter = Filter.Parse("title CONTAINS 'matrix'");
var filter = Filter.Parse("title STARTS WITH 'The'");
var filter = Filter.Parse("description LIKE '%dream%'");
// Range checks
var filter = Filter.Parse("year BETWEEN 2000 AND 2020");
var filter = Filter.Parse("rating >= 8.0");
// List membership
var filter = Filter.Parse("genre IN ('Sci-Fi', 'Fantasy', 'Adventure')");
// Null checks
var filter = Filter.Parse("director IS NOT NULL");
// Regex matching
var filter = Filter.Parse("email MATCHES '^[\\w\\.-]+@[\\w\\.-]+\\.\\w+$'");
// Ternary expressions (conditional logic)
var filter = Filter.Parse("age >= 18 ? 'adult' : 'minor'");
var filter = Filter.Parse("score >= 90 ? 'A' : score >= 80 ? 'B' : 'C'");
// Use filters in queries
var query = new Query("matrix", maxResults: 20)
{
Filter = Filter.Parse("year >= 2000 AND rating > 7.0")
};
var results = engine.Search(query);
```

The full EBNF specification is available here.
- Comparison: `=`, `!=`, `<`, `<=`, `>`, `>=`
- Boolean: `AND` (or `&&`), `OR` (or `||`), `NOT` (or `!`)
- String: `CONTAINS`, `STARTS WITH`, `ENDS WITH`, `LIKE` (`%` wildcard)
- Special: `IN`, `BETWEEN`, `IS NULL`, `IS NOT NULL`, `MATCHES` (regex)
- Conditional: `? :` (ternary operator)
All operators are case-insensitive. Use parentheses for grouping.
Filters compile to portable bytecode for performance and serialization:
```csharp
// Compile once, use many times
var filter = Filter.Parse("genre = 'Sci-Fi' AND year >= 2000");
var bytecode = filter.CompileToBytes();
// Save to disk
File.WriteAllBytes("filter.bin", bytecode);
// Load and use later
var loaded = Filter.FromBytecode(File.ReadAllBytes("filter.bin"));
var query = new Query("space") { CompiledFilterBytecode = bytecode };Build dynamic filters and navigate your data:
var query = new Query("science fiction", maxResults: 50)
{
EnableFacets = true
};
var results = engine.Search(query);
// Get facet counts
if (results.Facets != null)
{
foreach (var (fieldName, values) in results.Facets)
{
Console.WriteLine($"\n{fieldName}:");
foreach (var (value, count) in values)
{
Console.WriteLine($" {value}: {count} documents");
}
}
}
// Output:
// genre:
// Sci-Fi: 15 documents
// Fantasy: 8 documents
// Action: 5 documents
//
// year:
// 2020: 10 documents
// 2019: 8 documents
// 2018: 10 documents
```

Increase relevance scores for specific documents:

```csharp
// Boost recent movies
var recentBoost = new Boost(
Filter.Parse("year >= 2020"),
BoostStrength.Large // +20 to score
);
// Boost highly-rated content
var ratingBoost = new Boost(
Filter.Parse("rating >= 8.0"),
BoostStrength.Medium // +10 to score
);
var query = new Query("action movie", maxResults: 20)
{
EnableBoost = true,
Boosts = new[] { recentBoost, ratingBoost }
};
var results = engine.Search(query);
```

Boost strengths: Small (+5), Medium (+10), Large (+20), Extreme (+40).
Sort results by any field:
```csharp
// Sort by year (descending)
var query = new Query("thriller", maxResults: 20)
{
SortBy = fields.GetField("year"),
SortAscending = false
};
// Sort by rating (ascending)
var query = new Query("comedy", maxResults: 20)
{
SortBy = fields.GetField("rating"),
SortAscending = true
};
```

Infidex uses a lexicographic ranking model where:
- Precedence is driven by structural and positional properties (coverage, phrase runs, anchor positions, etc.).
- Semantic score is refined using corpus-derived weights (inverse document frequency over character n‑grams), without any per-dataset manual tuning.
Concretely, each query term is assigned an information weight derived from its inverse document frequency over character n-grams, so rare terms carry more information mass than common ones.
Stage 1: BM25+ Candidate Generation

- Tokenizes text into character n-grams (2-grams + 3-grams; see the sketch below)
- Builds inverted index with document frequencies
- Scores candidates with a BM25+ backbone over L2-normalized term weights. BM25+ follows the standard formulation

  $$ w(t, d) = \mathrm{IDF}(t) \cdot \left( \frac{f(t,d)\,(k_1 + 1)}{f(t,d) + k_1 \left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)} + \delta \right) $$

  where $f(t,d)$ is the frequency of n-gram $t$ in document $d$, $|d|$ is the document length, $\mathrm{avgdl}$ is the average document length, and $k_1$, $b$, $\delta$ are the usual BM25+ parameters
- Evaluates prefix and suffix queries over the FST term index in time proportional to the query length (plus the number of matching terms)
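For instance, with the default start pad of 2 and n-gram sizes {2, 3}, a single token decomposes roughly like the following sketch. The exact padding character and scheme are an assumption for illustration:

```csharp
using System.Collections.Generic;

// Character n-grams of one token with a 2-character start pad.
static IEnumerable<string> NGrams(string word, int n, int startPad = 2)
{
    string padded = new string('_', startPad) + word;
    for (int i = 0; i + n <= padded.Length; i++)
        yield return padded.Substring(i, n);
}

// NGrams("fox", 2) → "__", "_f", "fo", "ox"
// NGrams("fox", 3) → "__f", "_fo", "fox"
```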
Stage 2: Lexical Coverage Analysis
- Applied to top-K candidates from Stage 1
- Tracks per-term coverage for each query word using 5 algorithms:
  - Exact whole-word matching
  - Fuzzy word matching (Damerau–Levenshtein with an edit radius adapted from a binomial typo model)
  - Joined/split word detection
  - Prefix/suffix matching (prefixes weighted higher than suffixes)
  - LCS (Longest Common Subsequence) fallback when no word-level match exists
- For each query term $q_i$, computes a per-term coverage score $c_i \in [0,1]$ measuring how completely $q_i$ is matched in the document (1 = fully matched, 0 = unmatched)
- Derives coordination coverage $C_{\text{coord}}$ across all $n$ query terms: the fraction of query terms that are matched at all
- Extracts structural features: phrase runs, anchor token positions, lexical perfection
On top of raw per-term coverage, Infidex tracks how much information mass from the query is actually matched. For each query term $q_i$, we compute a coverage score $c_i \in [0,1]$ and an information weight $I_i$ as above:

$$ C_{\text{info}} = \frac{\sum_i c_i I_i}{\sum_i I_i} $$
This information view is used for two key behaviors:

- Type-ahead detection: the last query term is treated as “still being typed” when its information share is small:

  $$ \frac{I_{\text{last}}}{\sum_i I_i} \le \frac{1}{n+1} $$

  where $n$ is the number of unique query terms. Intuitively, the suffix is informationally weaker than an average term, so we avoid over-committing to it.

- Position-independent precedence boost: when exactly one term is unmatched, we compare the fraction of missing terms ($1 - C_{\text{coord}}$) to the fraction of missing information (derived from $C_{\text{info}}$). If we have lost fewer bits of information than raw term coverage suggests, a precedence bit is set so that documents matching the rarer, more informative term outrank those matching only common terms. A worked example follows.
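The information weights below are invented for illustration; only the shape of the computation is taken from the formulas above:

```csharp
using System.Linq;

// Two query terms: a rare one (I = 4.0) and a common one (I = 1.0);
// only the rare term is matched (c = 1.0 vs 0.0).
double[] I = { 4.0, 1.0 };
double[] c = { 1.0, 0.0 };

double cCoord = c.Count(x => x > 0) / (double)c.Length;      // 0.5
double cInfo  = I.Zip(c, (w, x) => w * x).Sum() / I.Sum();   // 4.0 / 5.0 = 0.8

// Half the terms are missing, but only 20% of the information is:
// exactly the case where the precedence bit rewards the rarer match.
```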
Stage 3: Lexicographic Score Fusion
The final ordering is a lexicographic triple $(\text{Precedence}, \text{Semantic}, \tau)$. Precedence and Semantic are packed into a 16-bit ushort:

$$ S_{16} = (\text{Precedence} \ll 8) \;|\; \text{Semantic} $$

where:
- Precedence (high byte, 8 bits): Discrete match-quality tiers
- Semantic (low byte, 8 bits): Continuous similarity within a tier
- Tiebreaker $\tau \in [0,255]$: Additional ordering signal used only when both Precedence and Semantic are equal (e.g., derived from word hits or LCS features)
Documents are ranked by a strict precedence order—higher bits always dominate lower bits:
| Bit | Value | Condition | Rationale |
|---|---|---|---|
| 7 | 128 | All query terms found in document | Completeness is paramount |
| 6 | 64 | All terms fully matched (whole word OR exact prefix) | Clean matches beat partial |
| 5 | 32 | Single-term: strict whole-word match<br>Multi-term: lexically perfect doc (every doc token explained by query) | Exact intent signal |
| 4 | 16 | Single-term: perfect doc<br>Multi-term: strict whole-word match | Secondary exactness |
| 3 | 8 | First match at position 0 (title starts with query) | Positional primacy |
| 2 | 4 | Precise prefix match (query is prefix of token) | Partial but anchored |
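A sketch of how such a precedence byte could be assembled. The flag names are hypothetical, not Infidex's actual API; only the bit values come from the table above:

```csharp
[Flags]
enum PrecedenceBits : byte
{
    AllTermsFound  = 1 << 7, // 128: every query term matched somewhere
    AllTermsFull   = 1 << 6, //  64: whole-word or exact-prefix matches only
    ExactIntent    = 1 << 5, //  32: strict whole-word / lexically perfect
    SecondaryExact = 1 << 4, //  16: secondary exactness signal
    StartsAtZero   = 1 << 3, //   8: first match at position 0
    PrecisePrefix  = 1 << 2, //   4: query is a prefix of a token
}

// 128 | 8 = 136: outranks any document lacking bit 7, whatever its lower bits.
var p = PrecedenceBits.AllTermsFound | PrecedenceBits.StartsAtZero;
```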
The semantic component provides smooth differentiation within precedence tiers:
For single-term queries, Infidex uses a heuristic blend of:

- Per-term coverage $C_{\text{avg}}$ (how completely the query token is matched), and
- A lexical similarity score $L_{\text{lex}}$ that takes the maximum over several signals:
  - $L_{\text{substr}}$: substring containment (with a bias toward matches earlier in the query),
  - $L_{\text{prefix}}$: overlap between query suffix and token prefix,
  - $L_{\text{fuzzy}}$: Damerau–Levenshtein similarity (with transpositions),
  - $L_{\text{2seg}}$: a simple two-segment check for concatenated queries.

The final single-term semantic score is a convex combination of $C_{\text{avg}}$ and $L_{\text{lex}}$.
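As a minimal sketch of that blend (the function name and the weight `w` are assumptions; the actual constant is not published here):

```csharp
// Hypothetical helper: single-term semantic as a convex blend of coverage
// and the best lexical signal. w is an assumed constant in [0,1].
static double SingleTermSemantic(
    double coverage,               // C_avg
    double substr, double prefix,  // L_substr, L_prefix
    double fuzzy, double twoSeg,   // L_fuzzy, L_2seg
    double w = 0.5)
{
    double lex = Math.Max(Math.Max(substr, prefix), Math.Max(fuzzy, twoSeg));
    return w * coverage + (1 - w) * lex;   // convex combination
}
```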
For multi-term queries:

$$ \text{Semantic} = \alpha\, C_{\text{avg}} + \beta\, T_{\text{tfidf}} + \gamma\, R_{\text{phrase}} $$

where:

- $C_{\text{avg}}$ = average per-term coverage
- $T_{\text{tfidf}}$ = normalized TF-IDF score from Stage 1
- $R_{\text{phrase}}$ = phrase run bonus (consecutive query terms in document order)
- $\alpha + \beta + \gamma = 1$ ($\alpha$, $\beta$, $\gamma$ are adjustable constants)
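For intuition, a worked example with invented constants (Infidex's actual $\alpha$, $\beta$, $\gamma$ values are not specified here):

```csharp
// Assumed constants for illustration only; they must sum to 1.
double alpha = 0.5, beta = 0.3, gamma = 0.2;
double cAvg = 0.9;        // terms are almost fully covered
double tfidf = 0.6;       // normalized Stage-1 score
double phraseRun = 1.0;   // query terms appear as a consecutive phrase
double semantic = alpha * cAvg + beta * tfidf + gamma * phraseRun;
// 0.45 + 0.18 + 0.20 = 0.83 → scaled into the low byte of the 16-bit score
```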
For queries that appear to be concatenated words, Infidex detects two-segment alignment. Given query $q$:

- Prefix fragment: $q[0:\ell]$ where $\ell = \min(3, |q|/2)$
- Suffix fragment: $q[|q|-\ell:|q|]$

If distinct document tokens $t_i \ne t_j$ exist such that:

- $t_i$ starts with (or is a prefix of) the prefix fragment, and
- $t_j$ ends with (or is a suffix of) the suffix fragment,

then the query is treated as a concatenation of the two tokens and $L_{\text{2seg}}$ contributes to the lexical similarity score.
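For example, a concatenated query like `"greenlake"` (invented data) would be handled roughly like this sketch:

```csharp
using System;
using System.Linq;

string q = "greenlake";
int len = Math.Min(3, q.Length / 2);   // ℓ = min(3, |q|/2) = 3
string prefixFrag = q[..len];          // "gre"
string suffixFrag = q[^len..];         // "ake"

string[] docTokens = { "green", "lake" };
bool twoSegment =
    docTokens.Any(t => t.StartsWith(prefixFrag)) &&  // "green" starts with "gre"
    docTokens.Any(t => t.EndsWith(suffixFrag));      // "lake" ends with "ake"
// The real rule additionally requires the two tokens to be distinct.
```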
For multi-term queries, the first query token acts as an anchor when:
- It has length ≥ 3 characters
- It appears as a substring in the document text
- There is a phrase run of ≥ 2 consecutive query terms
This is a pure positional/string rule—no rarity or IDF is used.
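A minimal sketch of that rule (the helper name is illustrative, and lowercased/normalized text is assumed):

```csharp
// Pure string/position logic: no IDF or rarity involved.
static bool IsAnchor(string firstQueryToken, string docText, int phraseRunLength) =>
    firstQueryToken.Length >= 3 &&        // long enough to be meaningful
    docText.Contains(firstQueryToken) &&  // appears as a substring
    phraseRunLength >= 2;                 // part of a run of consecutive terms
```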
The 16-bit score component is computed as:

$$ S_{16} = (\text{Precedence} \ll 8) \;|\; \text{Semantic} $$

where Precedence occupies the high byte and Semantic the low byte. The overall ranking key is then the triple $(\text{Precedence}, \text{Semantic}, \tau)$, compared lexicographically.
This ensures:
- Completeness dominates (all terms found beats partial matches)
- Match quality refines (whole words beat prefixes beat fuzzy)
- Similarity differentiates (smooth ordering within tiers)
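Conceptually, the comparison behaves like the following sketch (the record type is hypothetical, not Infidex's internal representation):

```csharp
// Higher keys rank first. Because Precedence sits in the high byte and
// Semantic in the low byte of one ushort, comparing the ushort compares
// (Precedence, Semantic) lexicographically in a single step.
record RankKey(ushort Score16, byte Tiebreaker);

static int CompareDescending(RankKey a, RankKey b)
{
    int byScore = b.Score16.CompareTo(a.Score16);
    return byScore != 0 ? byScore : b.Tiebreaker.CompareTo(a.Tiebreaker);
}
```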
Infidex supports efficient binary serialization for persisting and loading search indexes:
```csharp
var engine = SearchEngine.CreateDefault();
// Index your documents
engine.IndexDocuments(documents);
// Save to disk (binary format)
engine.Save("my-index.bin");
```

```csharp
// Load from disk with the same configuration
var engine = SearchEngine.Load(
filePath: "my-index.bin",
indexSizes: new[] { 3 },
startPadSize: 2,
stopPadSize: 0,
enableCoverage: true
);
// Search immediately (no re-indexing needed)
var results = engine.Search("query text", maxResults: 10);// Save asynchronously
await engine.SaveAsync("my-index.bin");
// Load asynchronously
var engine = await SearchEngine.LoadAsync(
filePath: "my-index.bin",
indexSizes: new[] { 3 },
startPadSize: 2,
stopPadSize: 0
);
```

The binary format includes:
- All indexed documents (text, metadata, fields)
- Complete inverted index (terms and postings)
- TF-IDF weights (pre-normalized, byte-quantized)
- Document frequencies and statistics
- The FST term index for exact/prefix/suffix lookups
- The short-query prefix index (positional postings for 1–3 character prefixes)
- WordMatcher indices (exact words, LD1 deletions, and affix FST)
On disk, the file is laid out as:
- A fixed-size header (magic, format version, flags, document count, term count, header checksum computed via a simple CRC-style function)
- A length-prefixed data block containing documents, terms, FST, and short-query index, plus a CRC32-style checksum for that block
- A trailing section with the serialized `WordMatcher` data

The header and data block are sufficient to fully reconstruct the `VectorModel`; the trailing section is only used to restore the `WordMatcher`.
Index Size: Typically much smaller than the source data due to byte-quantized weights and compressed postings. Example: 40k movie titles produce an index under 5 MB. See this test.
- `SearchEngine.Search` is thread-safe and can be called from many threads concurrently.
- Indexing operations (`IndexDocuments`, `IndexDocumentsAsync`, `IndexDocument`, `CalculateWeights`, `Save`, `Load`) acquire an exclusive writer lock and block other operations while they run.
- Filters and boosts are compiled once and cached in a thread-safe dictionary; execution uses thread-local virtual machines to avoid shared state.
- Share a single `SearchEngine` instance per application/service and let multiple threads use it concurrently, as in the sketch below.
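A short sketch of that pattern, reusing the API calls shown in the quick start (`documents` is the collection indexed earlier):

```csharp
using System;
using System.Threading.Tasks;

var engine = SearchEngine.CreateDefault();
engine.IndexDocuments(documents);   // writer lock: blocks other operations

// Reads scale across threads; no external locking is needed.
Parallel.For(0, Environment.ProcessorCount, _ =>
{
    var results = engine.Search("fox", maxResults: 5);
});
```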
```csharp
var engine = SearchEngine.CreateDefault();
// - Multi-size n-grams: [2, 3]
// - Coverage: Enabled (multi-algorithm model: exact, LD1 fuzzy, prefix/suffix, join/split, LCS)
// - Balanced for speed and accuracy
```

```csharp
var engine = SearchEngine.CreateMinimal();
// - Single n-gram: [3]
// - Coverage: Disabled
// - Faster but less accurate
```

```csharp
var engine = new SearchEngine(
indexSizes: new[] { 2, 3 }, // Character n-gram sizes
startPadSize: 2, // Pad start of words
stopPadSize: 0, // Pad end of words
enableCoverage: true, // Enable lexical matching
textNormalizer: TextNormalizer.CreateDefault(),
tokenizerSetup: TokenizerSetup.CreateDefault(),
coverageSetup: CoverageSetup.CreateDefault(),
wordMatcherSetup: new WordMatcherSetup
{
SupportLD1 = true, // Enable fuzzy matching
SupportAffix = true, // Enable prefix/suffix
MinimumWordSizeExact = 2,
MaximumWordSizeExact = 50,
MinimumWordSizeLD1 = 4,
MaximumWordSizeLD1 = 20
},
fieldWeights: new[] { 1.0f, 2.0f } // Per-field weights
);
```

Infidex ships with a comprehensive test suite of 400+ tests, including:
- Multilingual query relevancy tests
- Concurrency tests exercising parallel search, indexing, and save/load patterns
- Persistence, performance, and core API behavior tests
```bash
dotnet test
```

Contributions of additional query relevancy tests from speakers of other languages are very welcome, especially for languages with rich morphology or non-Latin scripts. These tests help keep the ranking model language-agnostic and robust without per-dataset tweaks.
- `SearchEngine` - Main search engine
- `Document` - Document with ID and fields
- `Query` - Search query configuration
- `Filter` - Base class for filters
- `DocumentFields` - Field schema definition
- `Boost` - Boost configuration
```csharp
// Indexing
engine.IndexDocuments(IEnumerable<Document> docs)
await engine.IndexDocumentsAsync(IEnumerable<Document> docs)
// Searching
SearchResult Search(string text, int maxResults = 10)
SearchResult Search(Query query)
// Document management
Document? GetDocument(long documentKey)
List<Document> GetDocuments(long documentKey) // All segments
// Schema
engine.DocumentFieldSchema = fields;
```

Check out Infidex.Example for complete working examples:
- MovieExample.cs - Multi-field movie search with faceting
- FilterExample.cs - Advanced Infiscript filtering
- BoostExample.cs - Document boosting strategies
This library is licensed under the MIT license. 💜