Abstract
Lack of encyclopedic text contributors, especially on Wikipedia, makes automated text generation for low-resource (LR) languages a critical problem. Existing work on Wikipedia text generation has focused on English only, where English reference articles are summarized to generate English Wikipedia pages. However, for low-resource languages, the scarcity of reference articles makes monolingual summarization ineffective in solving this problem. Hence, in this work, we propose XWikiGen, the task of cross-lingual multi-document summarization of text from multiple reference articles, written in various languages, to generate Wikipedia-style text. Accordingly, we contribute a benchmark dataset, XWikiRef, spanning ∼69K Wikipedia articles covering five domains and eight languages. We harness this dataset to train a two-stage system where the input is a set of citations and a section title and the output is a section-specific LR summary. The proposed system is based on a novel idea of neural unsupervised extractive summarization to coarsely identify salient information, followed by a neural abstractive model to generate the section-specific text. Extensive experiments show that multi-domain training is better than the multi-lingual setup on average. We make our code and dataset publicly available.
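To make the two-stage extract-then-abstract design concrete, the sketch below shows one way such a pipeline could be wired together: an unsupervised salience step selects sentences from the cited reference text, and a multilingual seq2seq model generates the section from the section title plus the selected sentences. The specific components here (a multilingual sentence encoder, centroid-based salience scoring, and mT5 as the generator) are illustrative assumptions, not the paper's exact models.

```python
# Minimal, hypothetical sketch of a two-stage extract-then-abstract pipeline.
# The concrete models and scoring scheme are assumptions for illustration.
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Stage 1: unsupervised extractive step -- rank reference sentences by
# similarity to the centroid of all sentence embeddings and keep the top-k.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def extract_salient(sentences, k=20):
    emb = encoder.encode(sentences, convert_to_tensor=True)
    centroid = emb.mean(dim=0, keepdim=True)
    scores = util.cos_sim(emb, centroid).squeeze(-1)
    top = scores.topk(min(k, len(sentences))).indices.tolist()
    return [sentences[i] for i in sorted(top)]  # keep original order

# Stage 2: abstractive step -- condition a multilingual seq2seq model on the
# section title plus the extracted sentences to generate the section text.
# (An off-the-shelf mT5 checkpoint would need fine-tuning on XWikiRef-style
# data to produce useful output; it stands in here for any such generator.)
tok = AutoTokenizer.from_pretrained("google/mt5-small")
gen = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

def generate_section(section_title, salient_sentences, max_new_tokens=256):
    prompt = section_title + " </s> " + " ".join(salient_sentences)
    inputs = tok(prompt, return_tensors="pt", truncation=True, max_length=1024)
    output = gen.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=4)
    return tok.decode(output[0], skip_special_tokens=True)
```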