Table 6 
Feature set for our axiom parsing model.
Content Span Similarity Proportion of (a) words, (b) geometry relations, and (c) relation-arguments shared by the two spans. 
Number of Relations Number of geometry relations represented in the two spans. We use the Lexicon Map from GEOS to compute the number of expressed geometry relations. 
Span Lengths The distribution of the two text spans is typically dependent on their lengths. We use the ratio of the length of the two spans as an additional feature. 
Relative Position Relative position of the two lexical heads and the text split in the discourse element sentence. We use the difference between the lexical head position and the text split position as the feature. 
 
Discourse (Typography) Discourse Markers Discourse markers (connectives, cue-words, or cue-phrases) have been shown to be good indicators of discourse structure (Marcu 2000). We build a list of discourse markers from the training set by considering the first and last tokens of each span and keeping the 100 most frequent. We use these 100 discourse markers as features. We repeat the same procedure with part-of-speech (POS) tags instead of words and use the results as features. 
Punctuation Punctuation at the segment border is another excellent cue for the segmentation. We include indicator features to show whether there is punctuation at the segment border. 
Text Organization Indicator that the two text spans are part of the same (a) sentence, (b) paragraph. 
RST Parse We use an off-the-shelf RST parser (Feng and Hirst 2014) and include an indicator feature that shows that the segmentation matches the parse segmentation. We also include the RST label as a feature. 
Soricut and Marcu Segmenter Soricut and Marcu (2003) (section 3.1) presented a statistical model for deciding elementary discourse unit boundaries. We use the probability given by this model retrained on our training set as a feature. This feature uses both lexical and syntactic information. 
Head/Common Ancestor/Attachment Node The head node is defined as the word with the highest occurrence as a lexical head in the lexicalized tree among all the words in the text span. The attachment node is the parent of the head node. We use features for the head words of the left and right spans, the common ancestor (if any), the attachment node, and the conjunction of the two head node words. We repeat these features with part-of-speech (POS) tags instead of words. 
Syntax Distance to (a) root, and (b) common ancestor for the nodes spanning the respective spans. We use these distances and the difference in the distances as features. 
Dominance Dominance (Soricut and Marcu 2003) is a key idea in discourse that looks at syntax trees and studies sub-trees for each span to infer a logical nesting order between the two. We use the dominance relationship as a feature. See Soricut and Marcu (2003) for details. 
JSON structure Indicator that the two spans are in the same node in the JSON hierarchy. This indicator is conjoined with the indicator feature that shows that the two spans are part of the same paragraph. 
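
As a rough illustration of how a few of the simpler features in Table 6 could be computed, the following Python sketch covers word-level span similarity, the span length ratio, the punctuation indicator, and the top-k discourse marker list. It is not the authors' implementation: the whitespace tokenizer, the Jaccard normalization of the overlap, the direction of the length ratio, and all function names are assumptions made for illustration.

```python
from collections import Counter
import string


def tokenize(span):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return span.lower().split()


def word_overlap(left, right):
    # Proportion of distinct words shared by the two spans.
    # Jaccard normalization is an assumption; the table does not specify one.
    a, b = set(tokenize(left)), set(tokenize(right))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def length_ratio(left, right):
    # Ratio of token counts (left over right); 0.0 if the right span is empty.
    n_right = len(tokenize(right))
    return len(tokenize(left)) / n_right if n_right else 0.0


def punctuation_at_border(left, right):
    # True if the left span ends with, or the right span starts with,
    # an ASCII punctuation character.
    l, r = left.strip(), right.strip()
    ends = bool(l) and l[-1] in string.punctuation
    starts = bool(r) and r[0] in string.punctuation
    return ends or starts


def build_marker_list(training_spans, k=100):
    # Count the first and last token of every training span and keep the
    # k most frequent. A token that is both first and last is counted once.
    counts = Counter()
    for span in training_spans:
        tokens = tokenize(span)
        if tokens:
            counts.update({tokens[0], tokens[-1]})
    return [tok for tok, _ in counts.most_common(k)]


def marker_features(span, markers):
    # One indicator per marker: is it the first or last token of the span?
    tokens = tokenize(span)
    edge = {tokens[0], tokens[-1]} if tokens else set()
    return [int(m in edge) for m in markers]
```

For example, `build_marker_list(train_spans)` would be called once on the training spans, and `marker_features(span, markers)` would then yield a fixed-length indicator vector for each span. The same two functions could be reused for the POS variant by passing POS-tag sequences in place of words.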