Studying machine translation technologies for large-data CLIR tasks: a patent prior-art search case study

Walid Magdy, Gareth J F Jones

Research output: Contribution to journal (Article)

2 Citations (Scopus)

Abstract

Prior-art search in patent retrieval is concerned with finding all existing patents relevant to a patent application. Since patents often appear in different languages, cross-language information retrieval (CLIR) is an essential component of effective patent search. In recent years machine translation (MT) has become the dominant approach to translation in CLIR. Standard MT systems focus on generating proper translations that are morphologically and syntactically correct. Development of effective MT systems of this type requires large training resources and high computational power for training and translation. This is an important issue for patent CLIR, where queries are typically very long, sometimes taking the form of a full patent application, meaning that query translation using MT systems can be very slow. However, in contrast to MT, the focus for information retrieval (IR) is on the conceptual meaning of the search words regardless of their surface form or the linguistic structure of the output. Thus, much of the complexity of MT is not required for effective CLIR. We present an adapted MT technique specifically designed for CLIR. In this method, IR text pre-processing in the form of stop word removal and stemming is applied to the MT training corpus prior to the training phase. Applying this step leads to a significant decrease in the computational and training resource requirements of MT. Experimental application of the new approach to the cross-language patent retrieval task from CLEF-IP 2010 shows the new technique to be up to 23 times faster than standard MT for query translation, while maintaining IR effectiveness statistically indistinguishable from standard MT when large training resources are used. Furthermore, the new method is significantly better than standard MT when only limited translation training resources are available, which can be a significant issue for translation in specialized domains. The new MT technique also enables patent document translation in a practical amount of time, with a resulting significant improvement in retrieval effectiveness.
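
The pre-processing step described in the abstract (stop word removal and stemming applied to the MT training corpus before training) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the stop list, the suffix-stripping `stem` function, and the names `preprocess` and `preprocess_corpus` are all hypothetical stand-ins, and a real system would presumably use a full per-language stop list and an established stemmer on each side of the parallel corpus.

```python
import re

# Tiny illustrative stop list (assumption); a real system would use a
# complete stop list for each language in the parallel corpus.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to"}

# Naive suffix stripping standing in for a real stemmer (assumption).
# Longer suffixes are listed first so they are tried before shorter ones.
SUFFIXES = ("ations", "ation", "ings", "ing", "ed", "es", "s")


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(sentence):
    """Lowercase, tokenize, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return " ".join(stem(t) for t in tokens if t not in STOP_WORDS)


def preprocess_corpus(sentences):
    """Apply the same pre-processing to every sentence of one corpus side."""
    return [preprocess(s) for s in sentences]
```

With these toy rules, `preprocess("The translation of the queries is slow")` yields `"transl queri slow"`: the sentence shrinks from seven tokens to three, and inflected variants collapse to a shared stem. A smaller vocabulary and shorter sentences of this kind are the plausible source of the reduced training and translation cost the abstract reports.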

Original language: English
Pages (from-to): 492-519
Number of pages: 28
Journal: Information Retrieval
Volume: 17
Issue number: 5-6
DOI: 10.1007/s10791-013-9231-6
Publication status: Published - 2015

Keywords

  • Cross-language information retrieval
  • Cross-language patent retrieval
  • Large-data CLIR
  • Machine translation
  • Prior-art Patent search

ASJC Scopus subject areas

  • Information Systems
  • Library and Information Sciences

Cite this

Studying machine translation technologies for large-data CLIR tasks: a patent prior-art search case study. / Magdy, Walid; Jones, Gareth J F.

In: Information Retrieval, Vol. 17, No. 5-6, 2015, p. 492-519.

@article{e71f84a907464efe83d1de24a0061fb0,
title = "Studying machine translation technologies for large-data CLIR tasks: a patent prior-art search case study",
abstract = "Prior-art search in patent retrieval is concerned with finding all existing patents relevant to a patent application. Since patents often appear in different languages, cross-language information retrieval (CLIR) is an essential component of effective patent search. In recent years machine translation (MT) has become the dominant approach to translation in CLIR. Standard MT systems focus on generating proper translations that are morphologically and syntactically correct. Development of effective MT systems of this type requires large training resources and high computational power for training and translation. This is an important issue for patent CLIR, where queries are typically very long, sometimes taking the form of a full patent application, meaning that query translation using MT systems can be very slow. However, in contrast to MT, the focus for information retrieval (IR) is on the conceptual meaning of the search words regardless of their surface form or the linguistic structure of the output. Thus, much of the complexity of MT is not required for effective CLIR. We present an adapted MT technique specifically designed for CLIR. In this method, IR text pre-processing in the form of stop word removal and stemming is applied to the MT training corpus prior to the training phase. Applying this step leads to a significant decrease in the computational and training resource requirements of MT. Experimental application of the new approach to the cross-language patent retrieval task from CLEF-IP 2010 shows the new technique to be up to 23 times faster than standard MT for query translation, while maintaining IR effectiveness statistically indistinguishable from standard MT when large training resources are used. Furthermore, the new method is significantly better than standard MT when only limited translation training resources are available, which can be a significant issue for translation in specialized domains. The new MT technique also enables patent document translation in a practical amount of time, with a resulting significant improvement in retrieval effectiveness.",
keywords = "Cross-language information retrieval, Cross-language patent retrieval, Large-data CLIR, Machine translation, Prior-art Patent search",
author = "Magdy, Walid and Jones, {Gareth J F}",
year = "2015",
doi = "10.1007/s10791-013-9231-6",
language = "English",
volume = "17",
pages = "492--519",
journal = "Information Retrieval",
issn = "1386-4564",
publisher = "Springer Netherlands",
number = "5-6",
}

TY - JOUR

T1 - Studying machine translation technologies for large-data CLIR tasks

T2 - a patent prior-art search case study

AU - Magdy, Walid

AU - Jones, Gareth J F

PY - 2015

Y1 - 2015

AB - Prior-art search in patent retrieval is concerned with finding all existing patents relevant to a patent application. Since patents often appear in different languages, cross-language information retrieval (CLIR) is an essential component of effective patent search. In recent years machine translation (MT) has become the dominant approach to translation in CLIR. Standard MT systems focus on generating proper translations that are morphologically and syntactically correct. Development of effective MT systems of this type requires large training resources and high computational power for training and translation. This is an important issue for patent CLIR, where queries are typically very long, sometimes taking the form of a full patent application, meaning that query translation using MT systems can be very slow. However, in contrast to MT, the focus for information retrieval (IR) is on the conceptual meaning of the search words regardless of their surface form or the linguistic structure of the output. Thus, much of the complexity of MT is not required for effective CLIR. We present an adapted MT technique specifically designed for CLIR. In this method, IR text pre-processing in the form of stop word removal and stemming is applied to the MT training corpus prior to the training phase. Applying this step leads to a significant decrease in the computational and training resource requirements of MT. Experimental application of the new approach to the cross-language patent retrieval task from CLEF-IP 2010 shows the new technique to be up to 23 times faster than standard MT for query translation, while maintaining IR effectiveness statistically indistinguishable from standard MT when large training resources are used. Furthermore, the new method is significantly better than standard MT when only limited translation training resources are available, which can be a significant issue for translation in specialized domains. The new MT technique also enables patent document translation in a practical amount of time, with a resulting significant improvement in retrieval effectiveness.

KW - Cross-language information retrieval

KW - Cross-language patent retrieval

KW - Large-data CLIR

KW - Machine translation

KW - Prior-art Patent search

UR - http://www.scopus.com/inward/record.url?scp=84943588649&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84943588649&partnerID=8YFLogxK

U2 - 10.1007/s10791-013-9231-6

DO - 10.1007/s10791-013-9231-6

M3 - Article

AN - SCOPUS:84943588649

VL - 17

SP - 492

EP - 519

JO - Information Retrieval

JF - Information Retrieval

SN - 1386-4564

IS - 5-6

ER -