-
Notifications
You must be signed in to change notification settings - Fork 673
Reduce string copies cow #2075
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: reduce-string-copying
Are you sure you want to change the base?
Reduce string copies cow #2075
Conversation
fa0db77 to
d92e39a
Compare
|
FYI @iffyio and @yoavcloud |
b5e4533 to
954176e
Compare
954176e to
075e2f0
Compare
|
Hi @iffyio , could you mark this PR as ready for review? (This option is only available to those with write access to the repo) |
|
@eyalleshem I updated the target branch for the PR could you update the PR to only include the new diff? I think its still picking up some of the earlier tokenizer changes after all |
075e2f0 to
58891ae
Compare
|
rebased :) |
| chars: &mut State<'a>, | ||
| tag_prefix: Option<&'a str>, | ||
| ) -> Result<(Cow<'a, str>, Option<Cow<'a, str>>), TokenizerError> { | ||
| chars.next(); // consume $ after tag (or second $ for $$) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we explicitly check here that we have a $ token being consumed?i.e. something like
if chars.peek() != Some('$') {
return Err(...)
}
chars.next(); // Consume '$'
It looks incorrect otherwise from the method's pov that its potentially blindly skipping a token
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| &self, | ||
| chars: &mut State<'a>, | ||
| ) -> Result<Token, TokenizerError> { | ||
| chars.next(); // consume first $ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment as below that we can explicitly check what the token is before consuming it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/tokenizer.rs
Outdated
| chars.next(); // consume final $ | ||
| // content_end is before the first $ of $$ | ||
| let content_end = chars.byte_pos - 2; | ||
| let value = &chars.source[content_start..content_end]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we do a pass through the places where we index into the buffer, to ensure that the indexing is safe and we return an error otherwise. Maybe we could have a helper function for it for reuse
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a safe_slice method that returns an error when the index is out of bounds, and replaced otherquery slicing operations to use it
src/tokenizer.rs
Outdated
| /// Scans a single-quoted string and returns either a borrowed slice or an unescaped owned string. | ||
| /// | ||
| /// Strategy: Scan once to find the end and detect escape sequences. | ||
| /// - If no escapes exist (or unescape=false), return Cow::Borrowed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| /// - If no escapes exist (or unescape=false), return Cow::Borrowed | |
| /// - If no escapes exist (or unescape=false), return [Cow::Borrowed] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
src/tokenizer.rs
Outdated
| /// | ||
| /// Strategy: Scan once to find the end and detect escape sequences. | ||
| /// - If no escapes exist (or unescape=false), return Cow::Borrowed | ||
| /// - If escapes exist and unescape=true, reprocess using existing Unescape logic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| /// - If escapes exist and unescape=true, reprocess using existing Unescape logic | |
| /// - If escapes exist and unescape=true, reprocess using existing [Unescape] logic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| // If no unescaping needed, return borrowed (without quotes) | ||
| if !unescape || !has_escapes { | ||
| // Strip opening and closing quotes | ||
| return Some(Cow::Borrowed(&full_content[1..full_content.len() - 1])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment here re ensureing that the indexing is safe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Add internal _borrowed() functions that return Cow<'a, str> to prepare for zero-copy tokenization. When the source string needs no transformation (no escaping), return Cow::Borrowed. When transformation is required, return Cow::Owned. The Token enum still uses String, so borrowed values are converted via to_owned() for now. This maintains API compatibility while preparing the codebase for a future refactor where Token can hold borrowed strings. Optimized: comments, quoted strings, dollar-quoted strings, quoted identifiers. Add safe_slice() helper function for bounds-checked string slicing.
58891ae to
2f9416a
Compare
This continues the preparation for a borrowed Tokenizer (#2036) and adds more internal functions that borrow strings during tokenization.
This commit also handles places where the Tokenizer could modify the original strings. In such cases, the strategy is to create functions that return
Cow<'a, str>and create an owned version of the string when modification is needed.This commit is rebased on PR #2073 and will remain in draft until #2073 is reviewed.