-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't use numeric IDs for internal PHP serialization. #716
Conversation
pinging @manicki @thiemowmde |
src/Entity/EntityIdValue.php
Outdated
@@ -31,8 +31,8 @@ public function __construct( EntityId $entityId ) { | |||
*/ | |||
public function serialize() { | |||
return json_encode( [ | |||
$this->entityId->getEntityType(), | |||
$this->getNumericId() | |||
get_class( $this->entityId ), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change? Kinda scary to do new $userInput()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's not pretty. But all we use this for is cloning statements.
The change is here because the old code assumes two things: a) the ID can be represented as a number (it can't) and b) we have a well known set of entity types (we don't).
So, having the choice to try and access some kind of entity id type registry via a global, or putting the class name into the serialization, I opted for the latter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My remark is not about prettiness, it is about security.
But all we use this for is cloning statements.
That does no reassure me. Kinda like saying "no need to escape this SQL argument since right now only we call this function with non-user-input".
So, having the choice to try and access some kind of entity id type registry via a global, or putting the class name into the serialization, I opted for the latter.
Why not just (de)serialize the ID?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't deserialize the ID without access to a service. How do we get the right service instance? This is particularly tricky for entity parsing, since we use different EntityIdParsers for data coming from different repos.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I appreciate your concern for security. I'm not very happy about this solution. I just don't see an alternative, apart from not using serialize/unserialize for cloning statements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't deserialize the ID without access to a service.
Huh?
deserialize( $entityId );
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I uploaded a draft utilizing this approach as #718.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For what is worth: if we didn't want to use PHP serialization for EntityIdValue that @thiemowmde is suggesting in #718 (I am not sure what are implications with that approach, if this serialization is never stored) another option (theoretically) could be using EntityIdComposer
which is meant as a entity-type-agnostic (kinda) replacement for LegacyIdInterpreter
. The obvious problem is how to pull EntityIdComposer
in here, so it makes sense (it is now bound to Wikibase git repo and entity type definitions out there). I cannot think of good way to achieve this, so this comment is mostly noise.
Also theoretically, another option would be to use value of EntityId::getSerialization in EntityIdValue::serialize, and use EntityIdParser in EntityIdValue::unserialize but that seems so wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JeroenDeDauw Yes, I realized this morning that I misunderstood. deserialize( $this->entityId ) will work, sure.
src/Entity/EntityIdValue.php
Outdated
@@ -31,8 +31,8 @@ public function __construct( EntityId $entityId ) { | |||
*/ | |||
public function serialize() { | |||
return json_encode( [ | |||
$this->entityId->getEntityType(), | |||
$this->getNumericId() | |||
get_class( $this->entityId ), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is something we are heavily trying to avoid, because this makes implementation details like the (entire!) namespace and the class name leak into the database. That's why we have entity type identifiers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I definitely do not want this in the database, or in our external representation! As far as I am aware, this is not used in any way with our JSON encoding. We only use the php-serialization temporarily, to do deep clones. PHP-Serialization is extremely brittle (sensitive against changes to private members, etc), so it should never be used for persistence.
The alternative would be to change the way we do cloning, and avoid copying immutable objects when we clone. That would be nice, but I can't think of a reliable and easy way of doing this. serialize/unserialize is a very convenient method to do deep cloning safely. But it's somewhat wasteful, since it also copies immutable objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's correct that these serialize
formats are not used in our JSON encoding, but they are used to build summaries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/Entity/EntityIdValue.php
Outdated
|
||
try { | ||
$entityId = LegacyIdInterpreter::newIdFromTypeAndNumber( $entityType, $numericId ); | ||
if ( is_string( $id ) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs at least a class_exists
check and a fallback solution (or return null
?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the class doesn't exist, this should fail hard. Ok, we could throw an exception instead of a fatal error.
src/Snak/SnakObject.php
Outdated
return PropertyId::newFromNumber( $unserialized ); | ||
} elseif ( is_string( $unserialized ) ) { | ||
return new PropertyId( $unserialized ); | ||
} elseif ( $unserialized instanceof PropertyId ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which code path needs this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably none atm, it's here for completeness. I don't have strong feelings about having it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still want me to remove this? This is currently blocking the pull request...
I talked to Thiemo about this. We decided to use his patch #718 for EntityIdValue and the EntityId classes, and my changes for SnakObject and PropertyValueSnak. I will remove the changes to EntityIdValue from this pull request to avoid conflicts. The two pull requests do not depend on each other, but are conceptually related, and should become part of the same release. Note that this is a breaking change only if we consider the php serialization to be a stable external interface. I don't think we have an explicite statement about this anywhere. |
db964fb
to
2d0fff1
Compare
NOTE: this changes snak hashes, and consequently reference hashes! This may break reference update/removal! |
src/Snak/SnakObject.php
Outdated
@@ -102,7 +102,7 @@ public function equals( $target ) { | |||
* @return string | |||
*/ | |||
public function serialize() { | |||
return serialize( $this->propertyId->getNumericId() ); | |||
return serialize( $this->propertyId->getSerialization() ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit pointless. Please replace with a straight return $this->propertyId->getSerialization();
.
NOTE: once this is merged, we should make a release. |
src/Snak/SnakObject.php
Outdated
} | ||
|
||
/** | ||
* @param string $serialized |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string|int
src/Snak/PropertyValueSnak.php
Outdated
list( $numericId, $dataValue ) = unserialize( $serialized ); | ||
$this->__construct( $numericId, $dataValue ); | ||
list( $propertyId, $dataValue ) = unserialize( $serialized ); | ||
$this->__construct( self::newPropertyId( $propertyId ), $dataValue ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mark the date in your calendar. ;-) This might be the first time I don't like the fact that this relies on a protected method from a base class. Can be much simpler:
$this->__construct( is_int( $propertyId ) ? $propertyId : new PropertyId( $propertyId ), $dataValue );
src/Snak/SnakObject.php
Outdated
} | ||
|
||
$unserialized = unserialize( $serialized ); | ||
if ( is_int( $unserialized ) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, but this code is weird. It does an is_int
check twice? The reason is that this method is used for two very different use cases:
- In Value snaks this can be called with either
1
or"P1"
. - In NoValue and SomeValue snaks this can be called with
"i:42;"
or"P1"
.
The problem is: there is no need to make these two use cases compatible with each other.
NOTE: it's unclear if this completely fixes T157442 Bug: T157442
8cede90
to
ae3f139
Compare
Thanks @thiemowmde for refactoring the PropertyId instantiation! +1 for that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I squashed a few commits and added one that simplifies the code quite a lot. I did not touched the tests. +2 from my side. I will leave this here for others to see and merge this tomorrow.
+1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable to me. I think I've spotted one cosmetic thing. I'll fix it and then merge this.
Please also note my comment on @SInCE 7.0 tag. @thiemowmde it was you who changed, right? I think it is OK to possibly change this tag back if needed before the release, I am not willing to block merging this pull request on this discussion.
return [ | ||
'legacy' => [ | ||
new PropertyValueSnak( $p2, $value ), | ||
'a:2:{i:0;i:2;i:1;C:22:"DataValues\StringValue":1:{b}}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not affect test result but I believe to be 100% accurate "i:2" here should become "i:1". Minor thing, I am going to fix this, and squash into other commits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please ignore this comment. I looked at wrong spot in the old code. All good here.
@@ -97,12 +97,12 @@ public function equals( $target ) { | |||
/** | |||
* @see Serializable::serialize | |||
* | |||
* @since 0.1 | |||
* @since 7.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder is this correct to change this number here. Change here is indeed breaking change but the interface of the methods does not change. So I guess version number change is to show that serialization format has changed in the release 7.0. I am find with that I just wonder if this is "how it should be done" (tm).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the returned value is a crucial part of the contract of a method. If the format of the return value changes, it's a different method. The idea of the since tag is to make this obvious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have some places where we bumped the number but then mention the method was already there in some other form or something along those lines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, something like this seems reaonable.
This changes Snak serialization to use the full string representation of property IDs.
NOTE: serialize/unserialize is used mainly when cloning Statements.
NOTE: it's unclear if this completely fixes T157442
NOTE: this changes snak hashes, and consequently reference hashes! This may break reference update/removal!
Bug: T157442