IIITH

Multilingual Multi-Domain NMT for Indian languages

Association for Computational Linguistics and the 11th International Joint Conference on Natural Lan, ACL-IJCNLP SRW, 2021

@inproceedings{bib_Mult_2021, AUTHOR = {Kumar, Sourav and Aggarwal, Salil and Sharma, Dipti Mishra}, TITLE = {Multilingual Multi-Domain NMT for Indian languages}, BOOKTITLE = {Association for Computational Linguistics and the 11th International Joint Conference on Natural Lan}, YEAR = {2021}}

Abstract

India is known as the land of many tongues. There is no single language called Indian; India speaks hundreds of languages and dialects. Some are extinct, while others are still in use with a considerable number of speakers. Despite using many different scripts, most Indian languages share many lexical features, which can be used to improve the quality of Multilingual NMT systems trained on them. In this paper, we present an extensive study of Multilingual as well as Multilingual Multi-Domain NMT involving languages of the Indian subcontinent. We draw four major conclusions from our experiments: (i) Multilingual Multi-Domain models can significantly improve the accuracy of all the individual languages within their domains, improving the overall performance of the Multilingual Multi-Domain system; (ii) encoder representation of different languages based on their family helps Multilingual models gain an average improvement of 3.25 BLEU points; (iii) our new technique of incorporating domain information into the language tokens yields a significant average improvement of 6 BLEU points over the baselines; (iv) Multistage Fine-tuning gives a further improvement of 1-1.5 BLEU points.
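The idea of folding domain information into the language token can be illustrated with a small preprocessing sketch. This is not the authors' code: the token formats `<2hi>` and `<2hi_health>`, the function name `tag_source`, and the domain label are all assumptions chosen for illustration.

```python
from typing import Optional


def tag_source(sentence: str, target_lang: str, domain: Optional[str] = None) -> str:
    """Prepend a target-language token to a source sentence.

    Without a domain, the token is a plain language tag such as <2hi>.
    With a domain, the domain label is fused into the same token, e.g.
    <2hi_health>. Both formats are hypothetical, not taken from the paper.
    """
    if domain is None:
        token = f"<2{target_lang}>"
    else:
        token = f"<2{target_lang}_{domain}>"
    return f"{token} {sentence}"


# Plain multilingual tagging versus domain-aware tagging of the same input.
plain = tag_source("the patient has a fever", "hi")
domain_aware = tag_source("the patient has a fever", "hi", "health")
print(plain)
print(domain_aware)
```

Under this assumption the model sees a single control token per sentence, so the vocabulary grows by one entry per language-domain pair rather than requiring a separate domain token.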

Fine-grained domain classification using Transformers

International Conference on Natural Language Processing (ICON): TechDOfication: Shared Task, ICON-W, 2020

Core Rank : - Google Rank :-

@inproceedings{bib_Fine_2020, AUTHOR = {Gahoi, Akshat and Chhajer, Akshat and Sharma, Dipti Mishra}, TITLE = {Fine-grained domain classification using Transformers}, BOOKTITLE = {International Conference on Natural Language Processing (ICON): TechDOfication: Shared Task}, YEAR = {2020}}

Abstract

The introduction of transformers in 2017 and subsequently BERT in 2018 brought about a revolution in the field of natural language processing. Such models are pretrained on vast amounts of data and are easily extended to a wide variety of tasks through transfer learning. Continual work on transformer-based architectures has led to a variety of new models with state-of-the-art results. RoBERTa (Liu et al., 2019) is one such model, which makes a series of changes to the BERT (Devlin et al., 2018) architecture and is capable of producing better quality embeddings at the expense of functionality. In this paper, we attempt to solve the well-known text classification task of fine-grained domain classification using BERT and RoBERTa and perform a comparative analysis of the two. We also evaluate the impact of data preprocessing, especially in the context of fine-grained domain classification. The results obtained outperformed all the other models at the ICON TechDOfication 2020 (subtask-2a) Fine-grained domain classification task and ranked first, which demonstrates the effectiveness of our approach.

N-Grams TextRank : A Novel Domain Keyword Extraction Technique

International Conference on Natural Language Processing, ICON, 2020

Core Rank : - Google Rank :5

@inproceedings{bib_N-Gr_2020, AUTHOR = {Rajput, Saransh and Gahoi, Akshat and Reddy, Manvith Muthukuru and Sharma, Dipti Mishra}, TITLE = {N-Grams TextRank : A Novel Domain Keyword Extraction Technique}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2020}}

Abstract

The rapid growth of the internet has given us a wealth of information and data spread across the web. However, as the data grows, we simultaneously face the grave problem of an Information Explosion. An abundance of data can lead to large-scale data management problems as well as the loss of the true meaning of the data. In this paper, we present an advanced domain-specific keyword extraction algorithm to tackle this problem. Our algorithm is based on a modified version of the TextRank (Mihalcea and Tarau, 2004) algorithm - an algorithm based on PageRank (Page et al., 1998) - to determine the keywords of a domain-specific document. The modification to the traditional TextRank algorithm takes bigrams and trigrams into account and returns results with extremely high precision. We observe that the precision and F1-score of this model outperform other models in many domains, and that the recall can be easily increased by increasing the number of results without affecting the precision. We also discuss about

Enhanced Urdu Word Segmentation using Conditional Random Fields and Morphological Context Features

Widening Natural Language Processing Workshop, WiNLP, 2020

Core Rank : - Google Rank :-

@inproceedings{bib_Enha_2020, AUTHOR = {Farhan, Aamir and Islam, Mashrukh and Sharma, Dipti Mishra}, TITLE = {Enhanced Urdu Word Segmentation using Conditional Random Fields and Morphological Context Features}, BOOKTITLE = {Widening Natural Language Processing Workshop}, YEAR = {2020}}

Abstract

Word segmentation is a fundamental task for most NLP applications. Urdu adopts the Nastalique writing style, which does not have a concept of space. Furthermore, the inherent non-joining attributes of certain characters in Urdu create spaces within a word when writing in digital format. Thus, Urdu has not only space omission but also space insertion issues, which make the word segmentation task challenging. In this paper, we improve upon the results of Zia, Raza and Athar (2018) by using a manually annotated corpus of 19,651 sentences along with morphological context features. Using the Conditional Random Field sequence modeler, our model achieves an F1 score of 0.98 for word boundary identification and 0.92 for sub-word boundary identification tasks. The results demonstrated in this paper outperform the state-of-the-art methods.
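To make the sequence-labelling setup more concrete, the sketch below builds per-character feature dictionaries of the kind a CRF toolkit typically consumes. It is only an illustration: the feature names, the context window, and the `NON_JOINERS` set are assumptions, the set is not claimed to be exhaustive, and the paper's morphological context features are not reproduced here.

```python
# Letters that do not join to the following letter (illustrative subset):
# alef, dal, ddal, thal, reh, rreh, zain, jeh, waw, yeh barree.
NON_JOINERS = set(
    "\u0627\u062f\u0688\u0630\u0631\u0691\u0632\u0698\u0648\u06d2"
)


def char_features(chars, i):
    """Return a feature dict for the character at position i."""
    ch = chars[i]
    return {
        "char": ch,
        "prev": chars[i - 1] if i > 0 else "<BOS>",
        "next": chars[i + 1] if i < len(chars) - 1 else "<EOS>",
        "is_space": ch == " ",
        # A preceding non-joiner can produce a visual gap inside a word.
        "prev_non_joiner": i > 0 and chars[i - 1] in NON_JOINERS,
    }


def sentence_features(text):
    """Turn a string into one feature dict per character."""
    chars = list(text)
    return [char_features(chars, i) for i in range(len(chars))]


# Two-character example: alef followed by beh.
features = sentence_features("\u0627\u0628")
print(features[1]["prev_non_joiner"])
```

Each feature dict would be paired with a boundary label (for example, word-initial versus word-internal) before training; the labelling scheme is left open here because the abstract does not specify it.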

NMT based Similar Language Translation for Hindi - Marathi

Conference on Machine Translation, WMT, 2020

Core Rank : - Google Rank :-

@inproceedings{bib_NMT__2020, AUTHOR = {Mujadia, Vandan and Sharma, Dipti Mishra}, TITLE = {NMT based Similar Language Translation for Hindi - Marathi}, BOOKTITLE = {Conference on Machine Translation}, YEAR = {2020}}

Abstract

This paper describes the participation of team F1toF6 (LTRC, IIIT-Hyderabad) in the WMT 2020 similar language translation task. We experimented with an attention-based recurrent neural network architecture (seq2seq) for this task. We explored the use of different linguistic features such as POS and Morph, along with back translation, for Hindi-Marathi and Marathi-Hindi machine translation.

Cross-Lingual Transfer for Hindi Discourse Relation Identification

Speech and Dialogue Conference, TSD, 2020

Core Rank : - Google Rank :-

@inproceedings{bib_Cros_2020, AUTHOR = {Dahiya, Anirudh and Srivastava, Manish and Sharma, Dipti Mishra}, TITLE = {Cross-Lingual Transfer for Hindi Discourse Relation Identification}, BOOKTITLE = {Speech and Dialogue Conference}, YEAR = {2020}}

Abstract

Discourse relations between two textual spans in a document attempt to capture the coherent structure which emerges in language use. Automatic classification of these relations remains a challenging task, especially in the case of implicit discourse relations, where no explicit textual cue marks the discourse relation. In low-resource languages, this motivates the exploration of transfer learning approaches, particularly cross-lingual techniques, for discourse relation classification. In this work, we explore various cross-lingual transfer techniques on the Hindi Discourse Relation Bank (HDRB), a Penn Discourse Treebank-style dataset for discourse analysis in Hindi, and observe performance gains in both zero-shot and fine-tuning settings on the Hindi Discourse Relation Classification task. To the best of our knowledge, this is the first effort towards exploring transfer learning for Hindi discourse relation classification.

MEE: An Automatic Metric for Evaluation Using Embeddings for Machine Translation

International Conference on Data Science and Advanced Analytics, DSAA, 2020

Core Rank : B Google Rank :26

@inproceedings{bib_MEE:_2020, AUTHOR = {Mukherjee, Ananya and Hema, Ala and Srivastava, Manish and Sharma, Dipti Mishra}, TITLE = {MEE: An Automatic Metric for Evaluation Using Embeddings for Machine Translation}, BOOKTITLE = {International Conference on Data Science and Advanced Analytics}, YEAR = {2020}}

Abstract

We propose MEE, an approach for automatic Machine Translation (MT) evaluation which leverages the similarity between embeddings of words in candidate and reference sentences to assess translation quality. Unigrams are matched based on their surface forms, root forms and meanings, which helps capture lexical, morphological and semantic equivalence. We perform experiments for MT from English to four Indian languages (Telugu, Marathi, Bengali and Hindi) on a robust dataset comprising simple and complex sentences with good and bad translations. We observe that the proposed metric correlates better with human judgements than the existing widely used metrics.
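The general idea of matching unigrams first by surface form and then by embedding similarity can be sketched as follows. This is not the MEE formula: the two-pass greedy matching, the `threshold` value, the F-measure combination, and the toy two-dimensional embeddings are all assumptions, and root-form matching is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def soft_unigram_score(candidate, reference, emb, threshold=0.5):
    """F-measure over exact matches (credit 1.0) and embedding matches."""
    if not candidate or not reference:
        return 0.0
    ref_left = list(reference)
    unmatched = []
    credit = 0.0
    # Pass 1: exact surface-form matches.
    for word in candidate:
        if word in ref_left:
            ref_left.remove(word)
            credit += 1.0
        else:
            unmatched.append(word)
    # Pass 2: best remaining reference word by cosine similarity.
    for word in unmatched:
        best_sim, best_ref = 0.0, None
        for ref_word in ref_left:
            if word in emb and ref_word in emb:
                sim = cosine(emb[word], emb[ref_word])
                if sim > best_sim:
                    best_sim, best_ref = sim, ref_word
        if best_ref is not None and best_sim >= threshold:
            ref_left.remove(best_ref)
            credit += best_sim
    precision = credit / len(candidate)
    recall = credit / len(reference)
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy embeddings: "big" and "large" point in nearly the same direction.
toy_emb = {"big": [1.0, 0.0], "large": [0.9, 0.1], "cat": [0.0, 1.0]}
score = soft_unigram_score(
    ["the", "big", "cat"], ["the", "large", "cat"], toy_emb
)
print(score)
```

A near-synonym substitution therefore costs only a small amount of credit, whereas a purely surface-based unigram match would treat it as a full miss.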

Linguistically Informed Hindi-English Neural Machine Translation

International Conference on Language Resources and Evaluation, LREC, 2020

Core Rank : B Google Rank :59

@inproceedings{bib_Ling_2020, AUTHOR = {Goyal, Vikrant and Mishra, Pruthwik and Sharma, Dipti Mishra}, TITLE = {Linguistically Informed Hindi-English Neural Machine Translation}, BOOKTITLE = {International Conference on Language Resources and Evaluation}, YEAR = {2020}}

Abstract

Hindi-English Machine Translation is a challenging problem, owing to multiple factors including the morphological complexity and relatively free word order of Hindi, in addition to the lack of sufficient parallel training data. Neural Machine Translation (NMT) is a rapidly advancing MT paradigm and has shown promising results for many language pairs, especially in large training data scenarios. To overcome the data sparsity issue caused by the lack of large parallel corpora for Hindi-English, we propose a method to employ additional linguistic knowledge which is encoded by different phenomena depicted by Hindi. We generalize the embedding layer of the state-of-the-art Transformer model to incorporate linguistic features like POS tag, lemma and morph features to improve the translation performance. We compare the results obtained on incorporating this knowledge with the baseline systems and demonstrate significant performance improvements. Although Transformer NMT models have a strong capacity to learn language constructs, we show that the use of specific features further helps in improving the translation performance.
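One common way to feed linguistic features into an embedding layer is to keep a separate lookup table per feature and concatenate the looked-up vectors for each token. The sketch below shows that idea in plain Python. Concatenation, the table sizes, the helper names, and the romanized toy tokens are all assumptions; the abstract does not state how the features are combined.

```python
import random


def make_table(symbols, dim, seed):
    """Build a toy embedding table with small random vectors."""
    rng = random.Random(seed)
    return {s: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for s in symbols}


def embed_token(features, tables):
    """Concatenate one vector per feature (word, POS tag, lemma)."""
    vector = []
    for value, table in zip(features, tables):
        vector.extend(table[value])
    return vector


# One table per feature stream; dimensions are arbitrary illustration values.
word_table = make_table(["ladke", "khelte"], dim=8, seed=0)
pos_table = make_table(["NOUN", "VERB"], dim=2, seed=1)
lemma_table = make_table(["ladka", "khel"], dim=4, seed=2)
tables = [word_table, pos_table, lemma_table]

# Each token is a (word, POS tag, lemma) triple.
sentence = [("ladke", "NOUN", "ladka"), ("khelte", "VERB", "khel")]
embedded = [embed_token(token, tables) for token in sentence]
print(len(embedded), len(embedded[0]))
```

With these sizes each token is represented by a 14-dimensional vector (8 + 2 + 4), which would then be passed to the encoder in place of the plain word embedding.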

Checkpoint Reranking: An Approach To Select Better Hypothesis For Neural Machine Translation Systems

Conference of the Association for Computational Linguistics Workshops, ACL-W, 2020

Core Rank : - Google Rank :-

@inproceedings{bib_Chec_2020, AUTHOR = {Vinay, Pandramish and Sharma, Dipti Mishra}, TITLE = {Checkpoint Reranking: An Approach To Select Better Hypothesis For Neural Machine Translation Systems}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}, YEAR = {2020}}

Abstract

In this paper, we propose a method of reranking the outputs of Neural Machine Translation (NMT) systems. After decoding, we take the outputs of the last few training iterations as the N-best list. After training an NMT baseline system, we observe that these iteration outputs have an oracle score up to 1.01 BLEU points higher than the last iteration of the trained system. We propose a ranking mechanism that focuses solely on the decoder’s ability to generate distinct tokens, without using any language model or data. With this method, we achieve a translation improvement of up to +0.16 BLEU points over the baseline. We also evaluate our approach by applying the coverage penalty to the training process. With a moderate coverage penalty, the oracle scores are up to +0.99 BLEU points higher than the final iteration, and our algorithm gives an improvement of up to +0.17 BLEU points. With an excessive penalty, translation quality decreases compared to the baseline system. Still, oracle scores increase by up to +1.30 BLEU points, and the re-ranking algorithm gives an improvement of up to +0.15 BLEU points in this case. The proposed re-ranking method is generic and can be extended to other language pairs as well.
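A ranking rule driven only by token distinctness can be sketched as below. The exact scoring function used in the paper is not given in the abstract, so the distinct-token ratio, the tie-break in favour of the later checkpoint, and the function names are assumptions made for illustration.

```python
def distinct_ratio(hypothesis: str) -> float:
    """Fraction of whitespace-separated tokens that are unique."""
    tokens = hypothesis.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def rerank(hypotheses):
    """Pick one hypothesis from a list ordered oldest to newest checkpoint.

    The hypothesis with the highest distinct-token ratio wins; ties go to
    the later checkpoint.
    """
    best_index = max(
        range(len(hypotheses)),
        key=lambda i: (distinct_ratio(hypotheses[i]), i),
    )
    return hypotheses[best_index]


# Outputs of three consecutive checkpoints for the same source sentence.
n_best = ["the the cat sat", "the cat sat down", "the cat cat sat"]
chosen = rerank(n_best)
print(chosen)
```

The rule needs no language model or extra data, which matches the constraint stated in the abstract; it simply penalises hypotheses in which the decoder repeats itself.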

Efficient Neural Machine Translation for Low-Resource Languages via Exploiting Related Languages

Conference of the Association for Computational Linguistics Workshops, ACL-W, 2020

Core Rank : - Google Rank :-

@inproceedings{bib_Effi_2020, AUTHOR = {Goyal, Vikrant and Kumar, Sourav and Sharma, Dipti Mishra}, TITLE = {Efficient Neural Machine Translation for Low-Resource Languages via Exploiting Related Languages}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}, YEAR = {2020}}

Abstract

A large percentage of the world’s population speaks a language of the Indian subcontinent, comprising languages from both Indo-Aryan (e.g. Hindi, Punjabi, Gujarati, etc.) and Dravidian (e.g. Tamil, Telugu, Malayalam, etc.) families. A universal characteristic of Indian languages is their complex morphology, which, when combined with the general lack of sufficient quantities of high-quality parallel data, can make developing machine translation (MT) systems for these languages difficult. Neural Machine Translation (NMT) is a rapidly advancing MT paradigm and has shown promising results for many language pairs, especially in large training data scenarios. Since the condition of large parallel corpora is not met for Indian-English language pairs, we present our efforts towards building efficient NMT systems between Indian languages (specifically Indo-Aryan languages) and English by efficiently exploiting parallel data from related languages. We propose a technique called Unified Transliteration and Subword Segmentation to leverage language similarity while exploiting parallel data from related language pairs. We also propose a Multilingual Transfer Learning technique to leverage parallel data from multiple related languages to assist translation for the low-resource language pair of interest. Our experiments demonstrate an overall average improvement of 5 BLEU points over the standard Transformer-based NMT baselines.
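One simple way to bring related Indian languages into a common script is to exploit the parallel layout of the Indic Unicode blocks, each of which spans 128 code points. The sketch below maps characters from those blocks onto Devanagari by code-point offset. It is an approximation and not necessarily the transliteration used in the paper: the function name is hypothetical, and offset mapping ignores script-specific characters that have no Devanagari counterpart.

```python
DEVANAGARI_START = 0x0900
INDIC_END = 0x0D7F  # end of the Malayalam block
BLOCK_SIZE = 0x80   # each Indic script block spans 128 code points


def to_devanagari(text: str) -> str:
    """Map characters in the Bengali-to-Malayalam blocks onto Devanagari.

    Characters outside the range U+0900..U+0D7F are returned unchanged,
    and Devanagari characters map to themselves.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if DEVANAGARI_START <= cp <= INDIC_END:
            offset = (cp - DEVANAGARI_START) % BLOCK_SIZE
            out.append(chr(DEVANAGARI_START + offset))
        else:
            out.append(ch)
    return "".join(out)


# Gujarati KA (U+0A95) and Gurmukhi KA (U+0A15) both map to Devanagari KA.
gujarati_ka = to_devanagari("\u0a95")
gurmukhi_ka = to_devanagari("\u0a15")
print(gujarati_ka == gurmukhi_ka == "\u0915")
```

After such a mapping, a single subword model trained on the combined corpus can share segments across languages whose cognates now have the same spelling, which is the kind of lexical overlap the abstract aims to exploit.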