API: microsecond resolution for Timedelta strings #63196
base: main
Conversation
rhshadrach left a comment:
lgtm
pandas/_libs/tslibs/timedeltas.pyx
Outdated
```python
# TODO: more performant way of doing this check?
if ival % 1000 != 0:
    return True
return re.search(r"\.\d{9}", item) or "ns" in item or "nano" in item
```
Is the 9 correct here? Shouldn't 7 be enough to have something below microseconds?
(although it is not entirely clear what this is exactly used for, will first check the rest of the PR, but some additional docstring might be useful)
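For context on the digit count: the seventh fractional digit of a seconds string already represents steps of 100 ns, which is below microsecond resolution. A quick sanity check:

```python
# The n-th fractional digit of seconds steps in units of 10**(9 - n) nanoseconds.
# At 7 digits the step is 100 ns, which cannot be represented in microseconds.
frac_digits = 7
step_ns = 10 ** (9 - frac_digits)  # nanoseconds per step of the 7th digit
print(step_ns)         # 100
print(step_ns % 1000)  # 100 -> not a whole number of microseconds
```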
Ah, is this meant for something that ends on 0's to still infer as nanosecond?
i.e.
```python
>>> pd.Timedelta("1.123456000 seconds").unit
'ns'
```
i.e. that could be microseconds (no loss of data), but because the string included the nanos, we infer as nanosecond unit?
I suppose 7 digits should then also still count as nanoseconds?
Right now we get:
```python
>>> pd.Timedelta("1.1234561 seconds").unit
'ns'
>>> pd.Timedelta("1.1234560 seconds").unit
'us'
```
good point. 7 digits should be enough. will update
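With the 7-digit update, the string-side heuristic could look like the following sketch (a hypothetical standalone version of the pyx helper above, not the exact implementation):

```python
import re


def needs_nano_unit(item: str) -> bool:
    # Sketch of the heuristic discussed above: a fractional part with 7 or
    # more digits reaches below microsecond resolution, and an explicit
    # "ns"/"nano" unit in the string always implies nanoseconds.
    return bool(re.search(r"\.\d{7}", item)) or "ns" in item or "nano" in item


print(needs_nano_unit("1.1234561 seconds"))  # True
print(needs_nano_unit("1.123456 seconds"))   # False
print(needs_nano_unit("1.1234560 seconds"))  # True: the string spells out the 100ns digit
```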
jorisvandenbossche left a comment:
Looks good!
pandas/tests/groupby/test_groupby.py
Outdated
```diff
 def test_groupby_timedelta_median():
     # issue 57926
-    expected = Series(data=Timedelta("1D"), index=["foo"])
+    expected = Series(data=Timedelta("1D"), index=["foo"], dtype="m8[ns]")
```
I was wondering why ns, but it is the other PR that will preserve the unit of the Timedelta object when converting to an array?
Adding the dtype here preserves the current dtype for expected. The other PR will change the dtype for df["timedelta"] below, so an update will be needed after one of them gets merged
```diff
     "ufunc 'divide' cannot use operands",
     "Invalid dtype object for __floordiv__",
     r"unsupported operand type\(s\) for /: 'int' and 'str'",
+    r"unsupported operand type\(s\) for /: 'datetime.timedelta' and 'str'",
```
Just curious, how is this caused by the changes here?
It was only on the dev builds and when `box=True`, so we are dividing by `np.array(["1"], dtype=object)`; I suspect that when `td.to_timedelta64()` has a "us" unit it tries casting to a pytimedelta to operate.
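That would explain the new error pattern above: plain Python raises exactly that message when a `datetime.timedelta` is divided by a string, as a quick check shows.

```python
import datetime as dt

# Dividing a stdlib timedelta by a str element (as in the object-dtype
# array case) produces the TypeError message added to the test above.
td = dt.timedelta(hours=2)
try:
    td / "1"
except TypeError as e:
    print(e)  # unsupported operand type(s) for /: 'datetime.timedelta' and 'str'
```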
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Updating this to also handle the array case (now that #63018 is merged) breaks a bunch of tests. I spent some time today updating those.
pandas/_libs/tslibs/timedeltas.pyx
Outdated
```python
# TODO: more performant way of doing this check?
if ival % 1000 != 0:
    return True
return re.search(r"\.\d{7}", item) or "ns" in item or "nano" in item
```
Suggested change:
```diff
-    return re.search(r"\.\d{7}", item) or "ns" in item or "nano" in item
+    return re.search(r"\.\d{7}", item) or "ns" in item or "nano" in item.lower()
```
(or add `or "Nano" in item`)?
I was checking the Timedelta tests for the other PR, and seeing that we allow title case
Sure. I'm still chasing down the failures caused when we add this to the array inference; there are some actual bugs that this uncovered.
```python
    isinstance(two_hours, Timedelta)
    and two_hours.unit == "ns"
    and box_with_array is not pd.array
):
    expected = expected.as_unit("ns")
```
This is because with a TimedeltaArray the in-place `+=` actually operates in place, preserving the dtype, while for the other objects it just does `rng = rng + two_hours` under the hood?
Is that an intentional difference?
yes and yes
Could you add a comment about that to the test? (It is not obvious to a reader that this is the reason for the check.)
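The in-place vs. out-of-place distinction mirrors how NumPy itself handles timedelta arithmetic (a sketch of the general dtype behavior, not the pandas test itself):

```python
import numpy as np

a = np.array([3600], dtype="timedelta64[s]")

# Out-of-place addition promotes to the finer unit of the two operands.
b = a + np.timedelta64(1, "ns")

# In-place addition must write back into `a`, so its dtype is preserved
# (the sub-second part is truncated away when casting back to seconds).
a += np.timedelta64(1, "ns")

print(b.dtype)  # timedelta64[ns]
print(a.dtype)  # timedelta64[s]
```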
The docstrings seem to have uncovered a bug: the parsed integer for the first element is not cast from micro to nano when the second element infers it needs nanos (or should it just reparse the full array?).
Looks like it is reparsing the full array, but not respecting the `creso` arg when we do. Should now be fixed.
Hmm, some conda issue. Restarted, hopefully it will pass now.

That last commit before your latest update also has one remaining mypy failure, not sure if that is fixed now:
```diff
 >>> pd.to_timedelta(np.arange(5), unit="D")
 TimedeltaIndex(['0 days', '1 days', '2 days', '3 days', '4 days'],
-               dtype='timedelta64[ns]', freq=None)
+               dtype='timedelta64[us]', freq=None)
```
Suggested change:
```diff
-               dtype='timedelta64[us]', freq=None)
+               dtype='timedelta64[ns]', freq=None)
```
Ideally we'd do the check for `needs_nano_unit` at a lower level and avoid effectively re-parsing the string, but the existing `parse_timedelta_string` function is a tough nut to crack.