Types of CoS Descriptions

The data show that turkers generally described changes of state in one of three ways. (1) They describe the attribute directly, e.g. cut-cucumber: “The size of the cucumber changes”. (2) They describe the change of state with a resultative phrase, e.g. cut-cucumber: “The cucumber is cut into small pieces.” Here, the CoS is indicated by the semantics of small pieces. (3) They describe the CoS with another verb that denotes the same or a similar CoS as the verb presented to the turker, e.g. stir-ingredients: “The ingredients are mixed together”.

Table 2: Attributes and result values for change of state

Multiple CoS Labels per Verb

A turker’s change of state description does not necessarily contain only a single change of state. All the descriptions contained between 0 and 3 changes of state, as seen in Figure 2. The largest share of descriptions (43%) contained a single change of state, and a large percentage (36%) contained no change of state. Some of the descriptions annotated as containing no change of state did in fact describe changes of state, but with high-level attributes that do not fit into our categories (e.g. Cleanliness). Others contained verbal descriptions of CoS (e.g. stir-ingredients: “The ingredients are mixed together.”). We did not annotate descriptions that contained these circular definitions.

Figure 2: Percentage of samples describing 0 to 3 changes of state (pilot and cucumber datasets)

For each verb, we calculated the distribution of CoS annotations over each attribute. Figure 3 shows the CoS attribute distributions for two verbs, clean and rinse. The CoS distributions are closely related to the semantics of the verbs they label. For example, CoS labels with the attributes Wetness and PresenceOfObject (referring to dirt that is removed) are more frequent for the verbs clean and rinse than CoS labels with other attributes. This is because the semantics of these verbs indicate that some object is cleaned away, possibly with water. Notice that clean, the result verb, has a much lower frequency of the Wetness attribute (which is related more to the manner of cleaning) and a higher frequency of PresenceOfObject (which is related to the intended result). The manner verb rinse, on the other hand, shows the opposite pattern.

Figure 3: CoS distributions over attributes for clean and rinse (pilot dataset)

Another observation regarding the CoS distributions in the pilot dataset is that not all the descriptions describe the same attributes. For example, for clean, most of the descriptions describe Wetness and PresenceOfObject, but some of the distribution also falls on the attributes Texture, OcclusionBySecondObject, Color, etc. One possible reason is that when a verb-object pair is presented to a turker without an accompanying scene, the turker may rely more on their imagination when describing the change of state, whereas when the scene is shown, they can see the change directly. For example, if the verb-object pair is shake-broccoli but the turker does not see that the broccoli is covered with water, they will not describe the water droplets that come off as it is shaken. Moreover, even when the scene is shown, turkers may describe the same change of state in different ways, resulting in different CoS annotations. For example, clean-dishes: “Food is removed from the dishes” describes the PresenceOfObject attribute of the AssociatedObject, while clean-dishes: “Dish surface is cleared of debris and/or muck.” describes the OcclusionBySecondObject attribute of the DirectObject.
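The per-verb distributions discussed above can be illustrated with a minimal sketch. It assumes each CoS annotation is available as a (verb, attribute) pair; the function name `cos_distributions` and the toy labels are hypothetical and are not the paper's data or code.

```python
from collections import Counter, defaultdict


def cos_distributions(annotations):
    """Map each verb to a relative-frequency distribution over CoS attributes.

    `annotations` is an iterable of (verb, attribute) pairs, one per CoS label.
    """
    counts = defaultdict(Counter)
    for verb, attribute in annotations:
        counts[verb][attribute] += 1

    distributions = {}
    for verb, counter in counts.items():
        total = sum(counter.values())
        distributions[verb] = {attr: n / total for attr, n in counter.items()}
    return distributions


# Made-up labels for illustration only; these are NOT the pilot dataset.
toy_annotations = [
    ("clean", "PresenceOfObject"),
    ("clean", "PresenceOfObject"),
    ("clean", "Wetness"),
    ("clean", "Color"),
    ("rinse", "Wetness"),
    ("rinse", "Wetness"),
    ("rinse", "PresenceOfObject"),
]
toy_distributions = cos_distributions(toy_annotations)
```

Normalizing counts per verb makes verbs with different numbers of annotations directly comparable.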

The Role of Visual Context

To determine how the presence of a scene, or the object of a verb, affects the types of CoS turkers describe in their responses, we defined variability, a metric based on the Jensen-Shannon divergence (JSD). The JSD of two distributions P and Q is given by the formula below.

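$$\mathrm{JSD}(P, Q) = \tfrac{1}{2}\, D(P \parallel M) + \tfrac{1}{2}\, D(Q \parallel M), \qquad M = \tfrac{1}{2}(P + Q)$$

$$D(P \parallel Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$

D is the Kullback-Leibler divergence, a non-symmetric measure of the difference between two distributions.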

The advantage of using JSD is that it is a symmetric measure. It quantifies how dissimilar two distributions are: it equals 0 when the distributions are identical and, with base-2 logarithms, approaches 1 as they become more different.
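The properties just stated can be checked with a minimal sketch of the standard JSD definition. It assumes both distributions are given as equal-length lists of probabilities over the same attribute order, and it uses base-2 logarithms so that the value stays within [0, 1]; the function names are illustrative.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    Terms with p_i == 0 contribute 0 by convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence of p and q via the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    # Wherever p_i > 0 (or q_i > 0), m_i > 0, so the KL terms are well defined.
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Unlike the raw Kullback-Leibler divergence, this stays finite even when one distribution assigns zero probability to an attribute that the other uses.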

Table 3: Variability between CoS frequencies per object and scene conditions (pilot dataset)

The variability describes how the CoS distribution of a verb differs depending on a certain variable (+/-scene or object). We compute the variability by averaging the JSD over all pairs of CoS distributions of the verb, where the two distributions in each pair are taken over different values of the variable. For example, the variability of the verb shake over the scene conditions is found by dividing the JSD of the CoS distributions in the +scene and -scene conditions by 1 (the number of pairs of conditions). Similarly, the variability over the object conditions is found by summing the JSD of the CoS distributions for each pair of object conditions (three unique pairs) and dividing by 3. The general variability formula is shown in Equation 4.
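The equation itself is not reproduced in this copy of the text. Going only by the verbal description in the paragraph above, it presumably has a form like the following. The symbols are assumptions introduced here: C is the set of values of the variable (e.g. {+scene, -scene} or the three objects), and P_v^a is the CoS distribution of verb v under condition a.

```latex
\mathrm{variability}(v) \;=\;
  \frac{1}{\binom{|C|}{2}}
  \sum_{\substack{\{a,\,b\} \subseteq C \\ a \neq b}}
  \mathrm{JSD}\!\left(P_v^{a},\; P_v^{b}\right)
```

With two scene conditions the sum has one term and the denominator is 1. With three object conditions it has three terms and the denominator is 3, matching the worked examples in the text.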

The variabilities between various conditions for each verb in the pilot dataset are shown in Table 3. The variability metric shows that there is indeed some difference between the CoS distributions in the +scene and -scene conditions. Moreover, the variability is much higher for some verbs (shake 0.18, add 0.12) than others (cut 0.01, rinse 0.01). This may be because for verbs like shake and add, without the accompanying scene it is not clear how the state of the object will change.

Table 4: Label cardinality per verb (pilot dataset)

Table 4 shows, for each verb, the average number of CoS labels per turker description. For most verbs (6 of 10), more changes of state are described when the scene is shown, indicating that the scene presents more information about CoS to the turker for these verbs. Taken together, these data show that visual context is important for determining the change of state denoted by a verb.
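Label cardinality as described above is an average count, which the following sketch makes concrete. It assumes each description has already been annotated with a set of CoS labels; the function name and the toy annotations are hypothetical and do not reproduce Table 4.

```python
def label_cardinality(label_sets):
    """Average number of CoS labels per description."""
    if not label_sets:
        raise ValueError("need at least one description")
    return sum(len(labels) for labels in label_sets) / len(label_sets)


# Made-up annotations for a single verb, for illustration only.
with_scene = [{"Wetness", "PresenceOfObject"}, {"Wetness"}, set()]
without_scene = [{"Wetness"}, set(), set()]

cardinality_with_scene = label_cardinality(with_scene)
cardinality_without_scene = label_cardinality(without_scene)
```

Descriptions with no annotated change of state still count in the denominator, so they lower the average.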

Effects of Direct Object on CoS

Table 3 also shows the variability for each verb in the pilot dataset computed over the three object conditions. The variability among objects differs from verb to verb, showing that for some verbs in this kitchen domain the CoS depends more on the object of the verb than for others. The verb with the highest variability is again shake (0.42). The state change that results from shaking a (wet) piece of broccoli is very different from that of shaking a container of spices over food, or a bowl filled with eggs. This shows that even though the verb sense is the same in all these descriptions, the CoS indicated by the verb may depend on the object of the verb.
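The per-object variability above averages the JSD over the three unique pairs of object conditions. A minimal sketch of that computation follows. It assumes each condition's CoS distribution is a list of probabilities in a shared attribute order, and it uses base-2 logarithms; the helper names and the toy distributions are illustrative and are not the paper's data.

```python
import math
from itertools import combinations


def jsd(p, q):
    """Jensen-Shannon divergence (base 2) of two aligned probability lists."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute 0 by convention.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def variability(distributions):
    """Mean JSD over all unordered pairs of condition-specific distributions."""
    pairs = list(combinations(distributions, 2))
    if not pairs:
        raise ValueError("need at least two conditions")
    return sum(jsd(p, q) for p, q in pairs) / len(pairs)


# Made-up distributions over two attributes for three object conditions.
toy_object_conditions = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
toy_variability = variability(toy_object_conditions)
```

The same function covers the scene comparison: with two conditions there is a single pair, so the variability reduces to one JSD value.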

Verb Semantic Similarity based on CoS

To compare the CoS distributions between each pair of verbs in the pilot dataset, we computed the Jensen-Shannon divergence of each pair. Table 5 shows that verbs with similar semantics have very similar CoS distributions (e.g. JSD(cut,chop)=0.01 and JSD(mix,stir)=0.03), whereas semantically dissimilar verbs diverge strongly (e.g. JSD(cut,shake)=0.59 and JSD(rinse,chop)=0.68). This suggests that the CoS frame captures relevant semantic information.

Table 5: Jensen-Shannon divergence between CoS distributions for verb pairs (pilot dataset)
