Reduce string copies cow #2075

eyalleshem · 2025-10-27T09:57:11Z

This continues the preparation for a borrowed Tokenizer (#2036) and adds more internal functions that borrow strings during tokenization.
This commit also handles places where the Tokenizer could modify the original strings. In such cases, the strategy is to create functions that return Cow<'a, str> and create an owned version of the string when modification is needed.
This commit is rebased on PR #2073 and will remain in draft until #2073 is reviewed.

alamb · 2025-10-29T12:22:33Z

FYI @iffyio and @yoavcloud

eyalsatori · 2025-11-25T13:57:19Z

Hi @iffyio , could you mark this PR as ready for review? (This option is only available to those with write access to the repo)

iffyio · 2025-11-27T14:11:29Z

@eyalleshem I updated the target branch for the PR could you update the PR to only include the new diff? I think its still picking up some of the earlier tokenizer changes after all

eyalleshem · 2025-11-29T21:58:56Z

rebased :)

iffyio · 2025-12-01T12:44:40Z

src/tokenizer.rs

+        chars: &mut State<'a>,
+        tag_prefix: Option<&'a str>,
+    ) -> Result<(Cow<'a, str>, Option<Cow<'a, str>>), TokenizerError> {
+        chars.next(); // consume $ after tag (or second $ for $$)


Can we explicitly check here that we have a $ token being consumed?i.e. something like

if chars.peek() != Some('$') { return Err(...) } chars.next(); // Consume '$'

It looks incorrect otherwise from the method's pov that its potentially blindly skipping a token

iffyio · 2025-12-01T12:49:16Z

src/tokenizer.rs

+        &self,
+        chars: &mut State<'a>,
+    ) -> Result<Token, TokenizerError> {
+        chars.next(); // consume first $


Similar comment as below that we can explicitly check what the token is before consuming it

iffyio · 2025-12-01T13:10:32Z

src/tokenizer.rs

+                        chars.next(); // consume final $
+                                      // content_end is before the first $ of $$
+                        let content_end = chars.byte_pos - 2;
+                        let value = &chars.source[content_start..content_end];


can we do a pass through the places where we index into the buffer, to ensure that the indexing is safe and we return an error otherwise. Maybe we could have a helper function for it for reuse

Added a safe_slice method that returns an error when the index is out of bounds, and replaced otherquery slicing operations to use it

iffyio · 2025-12-01T13:15:04Z

src/tokenizer.rs

+/// Scans a single-quoted string and returns either a borrowed slice or an unescaped owned string.
+///
+/// Strategy: Scan once to find the end and detect escape sequences.
+/// - If no escapes exist (or unescape=false), return Cow::Borrowed


Suggested change

/// - If no escapes exist (or unescape=false), return Cow::Borrowed

/// - If no escapes exist (or unescape=false), return [Cow::Borrowed]

iffyio · 2025-12-01T13:15:22Z

src/tokenizer.rs

+///
+/// Strategy: Scan once to find the end and detect escape sequences.
+/// - If no escapes exist (or unescape=false), return Cow::Borrowed
+/// - If escapes exist and unescape=true, reprocess using existing Unescape logic


Suggested change

/// - If escapes exist and unescape=true, reprocess using existing Unescape logic

/// - If escapes exist and unescape=true, reprocess using existing [Unescape] logic

iffyio · 2025-12-01T13:16:22Z

src/tokenizer.rs

+                    // If no unescaping needed, return borrowed (without quotes)
+                    if !unescape || !has_escapes {
+                        // Strip opening and closing quotes
+                        return Some(Cow::Borrowed(&full_content[1..full_content.len() - 1]));


Similar comment here re ensureing that the indexing is safe

Add internal _borrowed() functions that return Cow<'a, str> to prepare for zero-copy tokenization. When the source string needs no transformation (no escaping), return Cow::Borrowed. When transformation is required, return Cow::Owned. The Token enum still uses String, so borrowed values are converted via to_owned() for now. This maintains API compatibility while preparing the codebase for a future refactor where Token can hold borrowed strings. Optimized: comments, quoted strings, dollar-quoted strings, quoted identifiers. Add safe_slice() helper function for bounds-checked string slicing.

eyalleshem force-pushed the reduce-string-copies-cow branch from fa0db77 to d92e39a Compare October 27, 2025 10:15

eyalleshem force-pushed the reduce-string-copies-cow branch 2 times, most recently from b5e4533 to 954176e Compare November 7, 2025 08:19

eyalsatori mentioned this pull request Nov 10, 2025

Improve performance by reducing string copies #2036

Open

eyalleshem force-pushed the reduce-string-copies-cow branch from 954176e to 075e2f0 Compare November 21, 2025 10:53

alamb marked this pull request as ready for review November 25, 2025 14:08

iffyio changed the base branch from main to reduce-string-copying November 27, 2025 14:10

eyalleshem force-pushed the reduce-string-copies-cow branch from 075e2f0 to 58891ae Compare November 29, 2025 21:51

iffyio reviewed Dec 1, 2025

View reviewed changes

eyalleshem force-pushed the reduce-string-copies-cow branch from 58891ae to 2f9416a Compare December 1, 2025 16:12

	/// - If no escapes exist (or unescape=false), return Cow::Borrowed
	/// - If no escapes exist (or unescape=false), return [Cow::Borrowed]

	/// - If escapes exist and unescape=true, reprocess using existing Unescape logic
	/// - If escapes exist and unescape=true, reprocess using existing [Unescape] logic

Reduce string copies cow #2075

Are you sure you want to change the base?

Reduce string copies cow #2075

Uh oh!

Conversation

eyalleshem commented Oct 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alamb commented Oct 29, 2025

Uh oh!

eyalsatori commented Nov 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

iffyio commented Nov 27, 2025

Uh oh!

eyalleshem commented Nov 29, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

eyalleshem commented Oct 27, 2025 •

edited

Loading

eyalsatori commented Nov 25, 2025 •

edited

Loading