Data-driven Parsing Evaluation for Child-Parent Interactions

Abstract We present a syntactic dependency treebank for naturalistic child and child-directed spoken English. Our annotations largely follow the guidelines of the Universal Dependencies project (UD [Zeman et al., 2022]), with detailed extensions to lexical and syntactic structures unique to spontaneous spoken language, as opposed to written texts or prepared speech. Compared to existing UD-style spoken treebanks and other dependency corpora of child-parent interactions specifically, our dataset is much larger (44,744 utterances; 233,907 words) and contains data from 10 children covering a wide age range (18–66 months). We conduct thorough dependency parser evaluations using both graph-based and transition-based parsers, trained on three different types of out-of-domain written texts: news, tweets, and learner data. Out-of-domain parsers demonstrate reasonable performance for both child and parent data. In addition, parser performance for child data increases along children’s developmental paths, especially between 18 and 48 months, and gradually approaches the performance for parent data. These results are further validated with in-domain training.


Introduction
Research on syntactic dependency parsing has undoubtedly enjoyed tremendous progress with the continuous development of the Universal Dependencies project (Zeman et al., 2022) (hereafter UD). That said, of the 228 treebanks in the latest version of UD (v2.10 at the time of writing), only 12 are treebanks of fully spoken data (see also Dobrovoljc (2022)), while the rest focus on different genres within the written domain (e.g., news, Wikipedia texts). This means that most (if not all) off-the-shelf dependency parsers that are considered state-of-the-art are oriented towards written texts, rather than tailored specifically for spontaneous speech. Therefore a natural question arises: how well would parsing systems developed for written data perform when it comes to spontaneous speech?
Over the past decade, there have been efforts devoted to dependency parsing for the spoken domain, especially for (a subset of) the Switchboard corpus (Godfrey et al., 1992), which contains transcripts of telephone conversations in English; while some focused on parsing the full subset (Yoshikawa et al., 2016; Rasooli and Tetreault, 2013; Miller and Schuler, 2008), others attended to specific phenomena common in spoken data, such as speech repairment (Miller, 2009a,b). In addition to English, dependency treebanks have been developed for speech in other languages, including but not limited to French (Gerdes and Kahane, 2009; Bazillon et al., 2012), Czech (Mikulová et al., 2017), Russian (Kovriguina et al., 2018), Japanese (Ohno et al., 2005) and Mandarin Chinese (He et al., 2018). These treebanks, including Switchboard, however, were not always built on the basis of the UD guidelines, and the annotations and trained parsers are not always publicly available, making it less straightforward to perform parsing evaluation with the data, especially given that the majority of dependency parsers are specific to UD-formatted treebanks.
This paper presents a wide-coverage dataset of spontaneous child-parent interactions (MacWhinney, 2000), annotated with syntactic dependencies largely following the UD standards. We lay out careful annotation guidelines that we hope will be useful for the development of other spoken dependency treebanks. Compared to previous studies, our work goes further in several respects (see Section 2). First, in contrast to most other spoken dependency treebanks, which contain telephone conversations (Bechet et al., 2014), interactions between adults (Dobrovoljc and Martinc, 2018; Dobrovoljc and Nivre, 2016), or user-generated content (Davidson et al., 2019), our dataset attends to child and child-directed speech.
Second, in contrast to other spoken treebanks in the UD project, our dataset as a whole is of considerable size: there are 26,098 utterances (N of words = 116,428) for child speech, and 18,646 utterances (N of words = 117,479) for parent speech.
Third, while there are some dependency corpora of child-parent interactions in English (Sagae et al., 2010), Japanese (Miyata et al., 2013), and Hebrew (Gretz et al., 2013), they only include data from one or two children, whereas we provide annotations for the speech of 10 children across a wider age range, thereby covering in more detail lexical/syntactic phenomena that are more unique to child speech (e.g., repetition, speech disfluency) and to spoken language more broadly.
With this dataset, in ongoing work, we ask two additional questions: (1) How do state-of-the-art dependency parsing architectures trained on out-of-domain data perform when it comes to the naturalistic speech of separate interlocutors? To address this question, we evaluate parsers trained on three genres within the written domain: news texts, tweets and learner data, all in English. (2) What is the relationship, if any, between parser performance and the developmental stage of the child? One might foresee a positive correlation between the two, with the expectation that as the child continues to develop their language skills, they would utilize more and more cohesive syntactic structures, instead of, for example, producing utterances with unintelligible words or word omissions. On the other hand, it is possible that parser performance would increase until the child reaches a certain developmental stage, then start decreasing, since the child might start articulating sentences with more complex or expressive syntactic structures, which are potentially harder to analyze.

Related Work
Earlier work on dependency parsing of child-parent conversations (Sagae et al., 2001, 2004) focused on the Eve corpus (Brown, 1973) from the CHILDES project (MacWhinney, 2000), though the annotations did not follow those of UD. Subsequent research extended the annotation guidelines for the Eve corpus (Sagae et al., 2010) to child and child-directed speech in Japanese (Miyata et al., 2013) and Hebrew (Gretz et al., 2013).
We note two studies that carried out UD-style dependency annotations for child and/or child-directed speech. Liu and Prud'hommeaux (2021) took a semi-automatic approach to convert part of the existing dependency parses from the Eve corpus (Brown, 1973) to UD standards; specifically, they focused on child and parent utterances from when the child was 18-27 months old. Concurrent work by Szubert et al. (2021) annotated dependency parses and semantic logical forms for two languages: English (a large portion from the Adam corpus (Brown, 1973)) and Hebrew (the Hagar corpus (Berman, 1990)), although they only looked at child-directed speech.

Meet the Data
For dataset construction, we borrowed transcripts of English naturalistic parent-child interactions from the childes-db interface (Sanchez et al., 2019), which contains data from the CHILDES database (MacWhinney, 2000). As we are interested in how parser evaluation results change at different developmental stages of a child, we used age as a proxy for developmental stage and used 6-month age bins. For every unique child from each corpus of English, we calculated the total number of words produced by the child (N_child) and by the parent(s) (N_parent; excluding data from other caregivers) within each age bin of the child. From each age bin, for both child and parent speech, we randomly sampled a number of utterances (mostly without replacement) that amounted to approximately 2,000 words; the criterion was relaxed somewhat in order to include data across a wide range of age bins. This resulted in spoken data of ten children from 6 corpora (Table 1). In what follows, we briefly describe each of the selected corpora.

Kuczaj (Kuczaj II, 1977) The Kuczaj corpus includes speech from a diary study of the child, Abe; each original recording lasts ∼30 minutes.

Brown (Brown, 1973) From the Brown corpus we included naturalistic speech from Adam and Sarah.

Thomas (Lieven et al., 2009) The Thomas corpus contains spoken interactions between the child, Thomas, and primarily his mother at their house; each initial audio recording is about an hour long.

Weist (Weist and Zevenbergen, 2008) The data of Emma and Roman came from the Weist corpus, which includes caregiver-child interactions recorded either in a laboratory setting or in their own homes twice a month for around 30 minutes.

Braunwald We used recorded longitudinal speech of Laura when interacting with different interlocutors; the speech was produced in naturalistic environments.

Providence We took data of three children, Naima, Lily and Violet, from the Providence corpus (Demuth et al., 2006); the data contains longitudinal spontaneous interactions at home between the children and mostly their mothers.
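The per-bin sampling procedure described above can be sketched as follows. This is a simplified illustration: the ~2,000-word target follows the description in the text, but the function and data layout are our own, not taken from childes-db.

```python
import random

def sample_age_bin(utterances, target_words=2000, seed=0):
    """Randomly sample utterances from one 6-month age bin until
    the running word count reaches roughly `target_words`."""
    rng = random.Random(seed)
    pool = list(utterances)   # each item: a list of word tokens
    rng.shuffle(pool)         # sampling without replacement
    sampled, n_words = [], 0
    for utt in pool:
        if n_words >= target_words:
            break
        sampled.append(utt)
        n_words += len(utt)
    return sampled, n_words

# Toy bin: 1,500 three-word utterances.
utts = [["you", "like", "that"] for _ in range(1500)]
sampled, n = sample_age_bin(utts)  # n lands just past 2,000 words
```

In practice the paper relaxes this criterion for sparse bins (e.g., allowing some resampling), which the sketch above does not model.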

Annotation process
Our annotations largely followed those of UD (Zeman et al., 2022). Annotator A, who has advanced training in dependency syntax, initially annotated the data of ages 18-24 and 24-30 months from Abe and Sarah, taking notes of any domain-specific phenomena or cases that might not be straightforward to annotate (see Section 5). These guidelines were discussed with annotator B and modified where needed. Then, for each age bin of every child, the two annotators annotated 10% of the data from both child and parent speech. We calculated agreement scores using Cohen's Kappa (Artstein and Poesio, 2008). The overall agreement score taking into account all syntactic head and dependency relation annotations of all data is 0.97; the average agreement score across dependency parses is 0.96. (We also computed agreement scores for each child, and the results are around 0.97.) Final annotations of all data were performed and checked by annotator A.
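Cohen's Kappa over head and relation annotations can be computed as in the following minimal sketch; the helper and the toy (head, deprel) pairs are our own illustration, not the paper's actual agreement script.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' label sequences
    (here, one (head, deprel) pair per token)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[l] * cb[l] for l in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two annotators disagree on the head and relation of one token.
a = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "obj")]
b = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "nmod")]
```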

Annotation guidelines
Here we describe in detail our approach to transcription orthography, tokenization, and dependency annotations for syntactic constructions that are unique to, or more common in, child speech and spoken data more broadly.

Orthography and tokenization
Regarding the orthography of the transcripts, we made four decisions, all on the basis of a principle that we call "annotate what is actually there". First, we did not perform any orthographic normalization of most intelligible words in the speech (e.g., she wana eat); in other words, these words stayed true to their original forms taken from CHILDES. That said, the tokenization of certain cases was updated following UD. These cases include: (1) possessives (e.g., Daddy's → Daddy 's); (2) shortened copulas (e.g., I'm eating → I 'm eating); (3) combined conjunctives (e.g., in_spite_of → in spite of); (4) combined adverbs (e.g., as_well → as well); (5) negation (e.g., don't → do n't); (6) other informal contractions (e.g., gonna → gon na); (7) childish expressions (e.g., poo_poo, choo_choo). Second, unintelligible words (e.g., xxx, yyy) were removed, as it is hard to tell whether the words existed in the first place and whether they have any syntactic roles in the utterances.
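The tokenization updates above can be illustrated with a small lookup-based sketch. The SPLITS table is a toy excerpt covering the listed cases, not the full rule set used for the treebank.

```python
import re

# Illustrative token splits following the UD-style conventions in the text;
# the mapping is a sketch, not the complete inventory.
SPLITS = {
    "daddy's": ["daddy", "'s"],            # (1) possessives
    "i'm": ["i", "'m"],                    # (2) shortened copulas
    "in_spite_of": ["in", "spite", "of"],  # (3) combined conjunctives
    "as_well": ["as", "well"],             # (4) combined adverbs
    "don't": ["do", "n't"],                # (5) negation
    "gonna": ["gon", "na"],                # (6) informal contractions
}

def tokenize(utterance):
    out = []
    for tok in utterance.lower().split():
        tok = re.sub(r"[^\w']", "", tok)  # strip stray punctuation, keep ' and _
        out.extend(SPLITS.get(tok, [tok]))
    return out
```

For example, `tokenize("Daddy's gonna eat")` splits both the possessive and the informal contraction while leaving other words untouched.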
Third, we kept the original capitalization in the transcripts, since most of the time only proper names or words such as Mommy were capitalized.
Lastly, we omitted all punctuation with the exception of apostrophes (to abide by UD's standards), since punctuation tends not to be explicitly articulated in actual spontaneous speech; it is instead added manually during the transcription process based on the annotators' judgments. Admittedly, under certain circumstances it is relatively easy to choose a punctuation mark; for instance, one could use a question mark at the end of an utterance with a rising pitch contour. That said, deciding which punctuation mark to use is not always easy and can be quite time-consuming; for spoken data, for example, there is always the problem of setting utterance boundaries (see Section 5.2), and accordingly, whether to insert, say, a period or a semicolon is not as simple as one might think. The process of adding punctuation to transcripts is therefore quite subjective in itself.

(Vague) utterance boundary
Each instance in the dataset was originally treated as one utterance in CHILDES; we therefore tried to annotate each as one sentence most of the time. That said, the initial utterance boundaries are not always adequate. This means that one extracted instance can be considered as having "side-by-side" sentences (Figure 1a)1; we abided by the instructions of UD and annotated the first sentence as the root; later sentences in the instance were then treated as parataxis of the root.

Creative lexical usage
1 Examples presented in this paper are often modified from the original utterances for ease of presentation.

It is common for children to make certain lexical choices that do not necessarily follow the standards of parent speech or (formal) written data (e.g., mine pillow; my mummy telephones me). On the other hand, these cases in a way reflect children's world, in the sense that they capture children's own understanding of these words and their lexical (and syntactic) development at different stages. Therefore we refer to such cases as creative usage; for each case, we analyzed its syntactic usage given the remaining structure of the sentence (Lee et al., 2017; Santorini, 1990), then assigned dependency parses accordingly. For example, in Figure 2a, the word magicked is creatively used as a verb that links the subject I and the object it. For some instances, it is relatively difficult to decide whether the utterance contains the child's creative usage of some lexical item, or a potential transcription error. Given that transcribing spoken data manually requires large amounts of time and energy, it is unsurprising that the resulting transcriptions might contain errors. For instance, with Figure 2b, it is not exactly clear whether the child really said I wan na pencil, or the transcription should have been I want a pencil. Our approach was to compare how often each alternative occurs in our dataset, then make the final decision. Between the two alternatives above, we therefore chose to annotate na as the determiner of pencil.

Possible lexical omission
As we are taking a data-driven approach, we tried to perform annotations based on how words or phrases are used within an utterance; this means that we avoided assuming potential word omissions as much as possible.We assigned dependency structures to an utterance if a reasonable parse (given the context) could be derived without the assumption that certain words are missing.
In other cases, we deemed the utterance as having a lexical omission if the omitted word is automatically retrievable; this way other researchers can formulate a different analysis for the utterance as they see fit. In our annotations, we considered two types of lexical omission. The first type is copula omission, where the syntactic head of the copula is mostly a noun (Figure 3a), an adjective (Figure 3b), or an adposition (Figure 4a); other times we assumed that a copula is omitted if the utterance can be interpreted as an expletive structure (e.g., there book, with there as the expl dependent of book). The second type is adposition omission, mostly the adposition that is the functional head of an oblique phrase (Figure 4a) or the infinitival to in a complement clause (Figure 4b).

Nominal phrases
For certain nominal phrases that serve as adverbial modifiers in a given utterance (e.g., Figure 5a), and/or express times and dates, we tried to annotate them more precisely using subtypes of specific dependency relations (e.g., nmod:tmod or obl:tmod) (Schneider and Zeldes, 2021). For example, in Figure 5b, depending on their respective roles, morning is an oblique phrase of go, whereas tomorrow modifies morning.

Ambiguity
The syntactic structure of a sentence can be ambiguous when looking at the sentence by itself. Therefore we took into account the surrounding context of an utterance when performing annotations. In some cases, context could be helpful; for example, in (1), like can be treated as the verb of the sentence, rather than an adposition.
(1) Parent: do you like this

Child: like this
In other (rare) cases, context might not be as useful; for instance, in (2), it is not clear whether rain should be a verb, with the relation between the two words being obl:tmod, or a noun, with the dependency relation being nmod:tmod. For these examples we opted for the simpler analysis given the characteristics of child speech and treated rain as a noun.
(2) Parent: eat your soup please

Child: rain tonight

Another source of ambiguity comes from whether to treat proper names or words like mommy and daddy as vocatives, e.g., Momma try it. For these cases, we decided to treat them as vocatives whenever this interpretation is reasonable, since subject omission is common in early child speech (Hughes and Allen, 2006).

Non-canonical word order
Oftentimes in child speech, and in spoken data more broadly, an utterance does not have the canonical word order that is presumed in (formal) written data. In our dataset, such cases usually involve a post-posed subject and (copula) verb (Figure 6a). For these cases, we assigned dependency relations based on the syntactic function of a word/phrase, rather than on their relative order.

Speech repairment
For speech repair, which captures one type of disfluency (Ferreira and Bailey, 2004), we used reparandum in the same way as suggested by the UD guidelines; that is, the speech repair is the syntactic head of the subtree that constitutes the disfluent speech (e.g., seven in Figure 7a). If the disfluent speech contains discourse fillers or editing terms such as uh or um, these elements are annotated as syntactic dependents of the repair with the relation discourse (which also avoids unnecessary crossing dependencies). In some more complicated cases, the disfluency subtree consists of word fragments that do not form a coherent phrase together (e.g., grab the in Figure 7b); for these cases, we used the principle of promotion to analyze elements within the disfluency subtree where needed. For example, in Figure 7b, the word grab is most likely the head within the disfluency subtree; we therefore promoted the following the to be the object of grab, then analyzed the dependency relations of the residual structure in the instance.
To separate repairment from speech restart or abandonment (Section 5.9), we only categorized an utterance as having speech repairment if the repair occurs sentence-medially (as opposed to at the beginning of the sentence).

Speech restart
Another type of disfluency is speech restart. We generally considered an instance as having a speech restart or abandonment if the abandoned elements occur at the beginning of the instance and do not form a coherent phrase together; in addition, the abandoned elements need to be different from the restarted speech. For these cases, given that speech restart falls broadly under the umbrella of disfluency, and in order to distinguish restarts from the repairs above, we extended reparandum with a new dependency relation subtype, reparandum:restart, to connect the abandoned elements as dependents of the speech restart. This way the dependency relation also goes "right-to-left" (Dobrovoljc, 2022), following the usage of reparandum.

Repetition
Overall we identify three major kinds of repetitions.For the first type, an utterance consists of repetitions of the same dependency subtree and the repeated subtree is a coherent phrase by itself.
Examples include cases such as discursive repetition (e.g., no no mommy; Figure 9a), onomatopoeia (e.g., honk honk), or repetition of other kinds of words or phrases (e.g., this is my truck my truck; Figure 9b). For these cases, we treated the first appearance of the repeated subtree as the syntactic head, with the following repetitions as dependents connected via the relation conj. A special case is when the instance repeats a full sentence (or just a verbal phrase), e.g., I did it I did it (Figure 9c); for these examples we used parataxis to adhere to the annotations of side-by-side sentences noted by UD.
For the second type, repetition is used to emphasize the characteristics of certain objects or conditions (or serves as an intensifier (Szubert et al., 2021)); in these cases the repeated element is usually a single word with a POS tag of adjective or adverb. These cases were annotated similarly to those of the first type above (Figure 9d).
The third type of repetition pertains to disfluency. Whether to interpret an instance as containing disfluent repetition can be challenging when the instance does not have a corresponding audio recording, where prosody could help guide the interpretation. Therefore, to distinguish this type from the two mentioned before, we considered an instance as having disfluent repetition if the repetition appears at the beginning or in the middle of a sentence; in addition, the repeated element has to be either a word fragment that does not form a coherent phrase when taking the sentential context into account (Figure 9e), or a single word whose POS tag is neither an adjective nor an adverb (Figure 9f). For these cases, in order not to go against the fact that conj is usually applied left-to-right (i.e., the syntactic head precedes its dependents), we again extended the usage of reparandum and applied a new dependency relation subtype, reparandum:repetition, to describe repetition in disfluency; speech repairment, restarts and disfluent repetitions can hence be easily distinguished in an automatic fashion.
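The positional and POS-based criteria distinguishing the three repetition types could be approximated by a heuristic along these lines. This is a toy sketch under our own conventions (token indices, UPOS tag names); the paper's annotation decisions were made manually, not by such a rule.

```python
def repetition_type(tokens, rep_start, rep_end, pos_tags):
    """Toy heuristic mirroring the criteria in the text: a repetition is
    disfluent if it is sentence-initial or sentence-medial AND the
    repeated span is either a multi-word fragment or a single word that
    is neither ADJ nor ADV; otherwise it is coordination or emphasis."""
    sentence_final = rep_end == len(tokens)
    single_word = rep_end - rep_start == 1
    if not sentence_final and (
        not single_word or pos_tags[rep_start] not in {"ADJ", "ADV"}
    ):
        return "reparandum:repetition"   # disfluent repetition (type 3)
    return "conj"                        # phrase or emphatic repetition (types 1-2)

# "I I did it": sentence-initial single PRON repeat -> disfluent.
# "big big truck": single ADJ repeat -> emphatic, kept as conj.
```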

Other structures
The last three types of structures to be mentioned are serial verb constructions (SVC), tag questions, and unintelligible structures for which we could not decide on a clear (or sometimes any) interpretation. For SVC, we followed Szubert et al. (2021); for an utterance such as he came see me play, see was treated as the dependent of came and the relation is compound:svc. Tag questions are mostly used in parent speech (e.g., you like that book do n't you); for these examples we abided by the UD annotations and used parataxis to connect the tag question to the main clause of the utterance.
Lastly, for utterances without a clear syntactic structure, we used dep to connect the individual word fragments. Given that UD advises avoiding this relation as much as possible, we restricted its usage mostly to utterances produced in early developmental stages of the child (e.g., 18-24 months), where the utterances usually consist of two to three words in total.
For each of the aforementioned datasets, we trained the parsers described in Section 6.1 with their default parameters. We calculated micro-averaged unlabeled attachment scores (UAS) and labeled attachment scores (LAS) to evaluate parser performance. Due to space limitations, throughout this paper we focus on reporting LAS. Parser evaluation results across three random seeds for all out-of-domain datasets are reported in Table 3.
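Micro-averaged attachment scores can be computed as in this minimal sketch; representing each sentence as a list of (head, deprel) pairs is our own simplification of the CoNLL-U format.

```python
def attachment_scores(gold, pred):
    """Micro-averaged UAS/LAS over a list of sentences; each sentence is
    a list of (head, deprel) pairs, one per word."""
    total = uas = las = 0
    for g_sent, p_sent in zip(gold, pred):
        for (gh, gr), (ph, pr) in zip(g_sent, p_sent):
            total += 1
            if gh == ph:          # correct head -> counts toward UAS
                uas += 1
                if gr == pr:      # correct head AND label -> counts toward LAS
                    las += 1
    return uas / total, las / total

# One three-word sentence; the parser gets one label wrong.
gold = [[(2, "nsubj"), (0, "root"), (2, "obj")]]
pred = [[(2, "nsubj"), (0, "root"), (2, "nmod")]]
uas, las = attachment_scores(gold, pred)
```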

Evaluation of out-of-domain parsers
We first performed automatic part-of-speech tagging for all child and parent speech using the open-source NLP library Stanza (Qi et al., 2020).
We then applied each of the out-of-domain parsers to child and parent speech within each age range of the child; parser performance was again indexed by LAS across 3 random seeds. We foresaw two possible directions for the parsing results. On the one hand, EWT has substantially more data than Tweebank and ESL, which might lead to overall better results on child and parent speech. On the other hand, among the three out-of-domain datasets, the domains of Tweebank and ESL are possibly the most relevant or similar to child-parent interactions, in the sense that they are less "formal" than EWT; this potentially means that parsers trained on Tweebank or ESL might outperform those based on EWT. Note that, to be fair to the out-of-domain parsers, these evaluations did not consider the new dependency relation subtypes introduced in our annotations, namely reparandum:restart and reparandum:repetition.

Results and analysis
We present analysis mainly for results based on MaChamp; we performed the same analysis for the other two parsers as well, and there were no noticeable discrepancies in the patterns (though there are differences in the numerical results, given that UUParser was not trained with neural LMs).

Parent's speech
Looking at parser performance for all parents across different age ranges of their children, on average the parsers trained on the out-of-domain ESL dataset with twitter achieved the best result (83.88). By contrast, the best parsers from EWT, which were also trained with twitter, performed notably worse; the average difference in LAS ranges from 2.11 for the 48-54 month age range to 3.95 for the 18-24 month range. On the other hand, the best parsers from Tweebank, trained with bert, achieved performance (83.48) comparable to the best parsers based on the ESL dataset. What is worth noting here is that, comparing the parsing results of different LMs, the discrepancies in LAS for Tweebank are smaller than those for ESL. For example, the biggest difference in LAS for Tweebank is 0.80, at the 60-66 month age range, between bert and roberta; whereas for ESL, the smallest discrepancy is 1.51 (48-54 months), and can be as large as 3.42 (18-24 months). This means that parsers trained on Tweebank, which contains less than 1/8 of the amount of data in EWT, and around 1/3 of that in ESL, yield good and reliable performance.
So where do the discrepancies between parsers trained on different out-of-domain treebanks come from, especially during the early ages of the children? Comparing the best performing parsers based on EWT with those from Tweebank and ESL, it appears that for some utterances where the copula takes the form of 's, parsers trained on EWT erroneously annotated the copula as the subject, thereby assigning two subjects to the same syntactic head; this accounts for around 17.07% of all differences between the parsers' and our manual annotations. This raises the question of why a structure in which the syntactic head has two subjects would arise at all. The most plausible answer is that such structures exist in the training set. To check, we looked into the training data of the EWT treebank and found five relevant sentences. In these cases, the first of the two subjects was incorrectly annotated as headed by the verb of the subordinate clause at a lower level (e.g., in it is not about how much you earn (adapted), it was annotated as the subject of earn). Similarly, in cases where 's has a dependency relation of aux, the parsers trained on EWT also tended to assign it a relation of nsubj instead; this led to around 7.69% of all the discrepancies between parser and manual annotations. The patterns described above remain the main and consistent explanation for the performance discrepancies between the best parsers trained on different out-of-domain treebanks even as the children get older: for the 54-60 month age range, around 11.30% of all differences between the parsers' annotations and our manual ones came from the parsers assigning nsubj instead of cop, and the number is approximately 11.99% for the 60-66 month range. Now let us turn to the question of where the best-performing parsers trained on out-of-domain treebanks fall short in general. We note four cases here. During early ages of the children, for parent speech,
the dependency relation that results in the biggest discrepancy between parser and manual annotations is nmod:poss, e.g., my book, where the parsers annotated the relation as nmod, which is dispreferred in the latest annotation guidelines of UD. The second case that caused confusion for the parsers is when the sentences contain elements that should be annotated as discourse and/or vocative (e.g., hahaha I see; Roman oh that is beautiful), whereas the parsers more consistently annotated the first word in these instances as the root of the sentence.
One other noteworthy example is when the parsers treat a vocative as the subject (e.g., Adam eat your soup) or sometimes the object (e.g., sit Sarah) of the utterance, since there is no punctuation in the annotations and the main clause of the sentence is subjectless or should not take an object at all. The last case is conj, which we used for repetition or when the speaker appears to be listing individual nouns (e.g., orange grape apple); the parsers, however, annotated the relation between pairs of nouns as compound, which was common in the training sets of the out-of-domain treebanks given that they were annotated (largely) based on UD guidelines.
As the children age, the aforementioned dependency relations still account for most parser errors, though to lesser extents. Note this was not because the parsers started correctly annotating these relations, but because the relevant instances occur less frequently in parent speech at later age ranges of the children, at least in the data samples that we selected and annotated. In addition, another dependency relation that the parsers tended to annotate incorrectly at later developmental stages of the children is compound:prt, in cases such as hurry up; instead the parsers assigned compound, which is not entirely wrong, but less precise.

Child's speech
Overall, parser performance for child speech demonstrates patterns similar to those found for parent speech. Across all age ranges, again the best parsers trained on the ESL treebank using twitter and the best parsers trained on Tweebank with bert arrived at comparable performance (79.35 vs. 79.39); these parsers also outperformed the best parsers based on the EWT treebank (76.48), indicating that ESL and Tweebank, especially the latter, suffice for yielding reasonable parser performance despite their much smaller scales compared to EWT.
In terms of the dependency relations that caused discrepancies between parser output and manual annotations, we identified four main categories applicable especially during early ages of the children, which again are similar to those noted for parent speech. The first one is conj; the parsers tended to assign compound for an utterance consisting mostly of individual nouns (book table pencil), where none of the nouns modifies another.
The other three categories appear to be mainly caused by the utterances potentially having word omissions or ambiguous structures. For instance, when an utterance consists of two words, where the first was treated as the subject of the second (Adam home) in our annotations, given the length of the utterance, the parsers again preferred to annotate Adam as a compound of home. Another case pertains to nmod:poss; in addition to assigning nmod for utterances that contain possessives such as my, the parsers also used nmod for utterances that possibly involve the omission of the possessive marker 's. The last one concerns the treatment of vocative, which again was sometimes annotated by the parsers as nsubj or obj.
Notice there is a small fluctuation in LAS in the 48-54 month age range. When looking at the differences between parser and manual annotations for this age range, we see an additional type of instance where the parsers committed errors, involving a head verb with an adverbial modifier, such as bring them back; rather than parsing back as the advmod of bring, the parsers assigned aux.
In all, contrasting results for child speech with those for parent speech, unsurprisingly, parser performance is better overall for the latter. That said, when children reach later ages, the parsers' evaluation results (∼87.83) approach those for parent speech (∼89.13), suggesting that children's syntactic structures are becoming more parent-like. While parser scores for parent speech slowly increase across the 18-66 month age range, this progress is much more pronounced for child speech; in other words, we see an overall improvement of parser performance as children progress along the syntactic developmental trajectory, particularly within 18-42 months.

Ongoing in-domain parser training and evaluation
After observing the performance of out-of-domain parsers, we now turn to evaluating the performance of parsers trained with in-domain data of child-parent interactions; in particular, these parsers are expected to pick up, at least to some extent, the new dependency relations for disfluencies that we introduced. Based on the results from Section 7, on average, parsers trained with MaChamp using the twitter LM seemed to achieve the most stable performance; we therefore adopted the same parser setup here.
Our training scheme is as follows. Say we want to evaluate parser performance for Adam from the Brown corpus. We first trained parsers using all the data from the other nine child-parent pairs; we then measured parser performance for both child and parent speech at each age range of the child, using LAS averaged across 3 random seeds.
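The leave-one-child-out scheme above can be sketched as follows; the names and data structures are illustrative only.

```python
def leave_one_out_splits(corpora):
    """For each target child, train on the data of all other
    child-parent pairs and hold out the target pair for evaluation.
    `corpora` maps child name -> that pair's annotated data."""
    for held_out in corpora:
        train = [data for child, data in corpora.items() if child != held_out]
        yield held_out, train

# Toy example with three of the ten children; each data list is a stand-in
# for that pair's annotated utterances.
corpora = {"Adam": ["..."], "Sarah": ["..."], "Abe": ["..."]}
splits = list(leave_one_out_splits(corpora))
```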

Conclusion
We present a wide-coverage dataset of child-parent interactions annotated with syntactic dependencies, along with detailed annotation guidelines extending those of the Universal Dependencies project. The dataset covers child and parent speech from children aged 18-66 months. Evaluations of graph-based and transition-based dependency parsers with varying hyperparameters demonstrate that parsers trained on a relatively small amount of English tweets (Tweebank) are able to match or even outperform parsers trained on much larger dependency treebanks. In addition, we observed a general trend that, on average, parser performance increases as the children reach older ages, indicating that as children progress along their syntactic developmental trajectory, they start producing structures that are more cohesive yet not too complex for the parsers to handle. We aim to verify this further in our ongoing work on in-domain parser training and evaluation.

Figure 3: Examples of possible lexical omission: omission of copula.

Figure 7: Examples of repairment; the dependency relation for speech repair in each example is in blue.

Figure 9: Examples of repetition; the dependency relation between repeated elements in each example is in teal.

Figure 10: Evaluation of out-of-domain Tweebank parsers; at each age range of the children, the parser score for child (or parent) speech was averaged across all children (or parents) within that age range.

Table 1: N of words for child and parent speech at different age ranges of the children.

Table 2: Descriptive statistics for out-of-domain datasets.

Table 3: Parser evaluation results for out-of-domain datasets.