Skip to content

Conversation

@eyalleshem
Copy link
Contributor

@eyalleshem eyalleshem commented Oct 27, 2025

This continues the preparation for a borrowed Tokenizer (#2036) and adds more internal functions that borrow strings during tokenization.
This commit also handles places where the Tokenizer could modify the original strings. In such cases, the strategy is to create functions that return Cow<'a, str> and create an owned version of the string when modification is needed.
This commit is rebased on PR #2073 and will remain in draft until #2073 is reviewed.

@eyalleshem eyalleshem force-pushed the reduce-string-copies-cow branch from fa0db77 to d92e39a Compare October 27, 2025 10:15
@alamb
Copy link
Contributor

alamb commented Oct 29, 2025

FYI @iffyio and @yoavcloud

@eyalleshem eyalleshem force-pushed the reduce-string-copies-cow branch 2 times, most recently from b5e4533 to 954176e Compare November 7, 2025 08:19
@eyalleshem eyalleshem force-pushed the reduce-string-copies-cow branch from 954176e to 075e2f0 Compare November 21, 2025 10:53
@eyalsatori
Copy link
Contributor

eyalsatori commented Nov 25, 2025

Hi @iffyio , could you mark this PR as ready for review? (This option is only available to those with write access to the repo)

@alamb alamb marked this pull request as ready for review November 25, 2025 14:08
@iffyio iffyio changed the base branch from main to reduce-string-copying November 27, 2025 14:10
@iffyio
Copy link
Contributor

iffyio commented Nov 27, 2025

@eyalleshem I updated the target branch for the PR could you update the PR to only include the new diff? I think its still picking up some of the earlier tokenizer changes after all

@eyalleshem eyalleshem force-pushed the reduce-string-copies-cow branch from 075e2f0 to 58891ae Compare November 29, 2025 21:51
@eyalleshem
Copy link
Contributor Author

rebased :)

chars: &mut State<'a>,
tag_prefix: Option<&'a str>,
) -> Result<(Cow<'a, str>, Option<Cow<'a, str>>), TokenizerError> {
chars.next(); // consume $ after tag (or second $ for $$)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we explicitly check here that we have a $ token being consumed?i.e. something like

if chars.peek() != Some('$') {
  return Err(...)
}
chars.next(); // Consume '$'

It looks incorrect otherwise from the method's pov that its potentially blindly skipping a token

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

&self,
chars: &mut State<'a>,
) -> Result<Token, TokenizerError> {
chars.next(); // consume first $
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment as below that we can explicitly check what the token is before consuming it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

src/tokenizer.rs Outdated
chars.next(); // consume final $
// content_end is before the first $ of $$
let content_end = chars.byte_pos - 2;
let value = &chars.source[content_start..content_end];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we do a pass through the places where we index into the buffer, to ensure that the indexing is safe and we return an error otherwise. Maybe we could have a helper function for it for reuse

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a safe_slice method that returns an error when the index is out of bounds, and replaced otherquery slicing operations to use it

src/tokenizer.rs Outdated
/// Scans a single-quoted string and returns either a borrowed slice or an unescaped owned string.
///
/// Strategy: Scan once to find the end and detect escape sequences.
/// - If no escapes exist (or unescape=false), return Cow::Borrowed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// - If no escapes exist (or unescape=false), return Cow::Borrowed
/// - If no escapes exist (or unescape=false), return [Cow::Borrowed]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

src/tokenizer.rs Outdated
///
/// Strategy: Scan once to find the end and detect escape sequences.
/// - If no escapes exist (or unescape=false), return Cow::Borrowed
/// - If escapes exist and unescape=true, reprocess using existing Unescape logic
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// - If escapes exist and unescape=true, reprocess using existing Unescape logic
/// - If escapes exist and unescape=true, reprocess using existing [Unescape] logic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// If no unescaping needed, return borrowed (without quotes)
if !unescape || !has_escapes {
// Strip opening and closing quotes
return Some(Cow::Borrowed(&full_content[1..full_content.len() - 1]));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment here re ensureing that the indexing is safe

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Add internal _borrowed() functions that return Cow<'a, str> to prepare for
zero-copy tokenization. When the source string needs no transformation
(no escaping), return Cow::Borrowed. When transformation is required,
return Cow::Owned.

The Token enum still uses String, so borrowed values are converted via
to_owned() for now. This maintains API compatibility while preparing the
codebase for a future refactor where Token can hold borrowed strings.

Optimized: comments, quoted strings, dollar-quoted strings, quoted identifiers.

Add safe_slice() helper function for bounds-checked string slicing.
@eyalleshem eyalleshem force-pushed the reduce-string-copies-cow branch from 58891ae to 2f9416a Compare December 1, 2025 16:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants