Generation of Regular Expression from Aligned Sequences of Text Snippets

Girishkumar K. Patnaik, Dinesh D. Puri, Akash D. Waghmare


Objectives: The widespread adoption of electronic medical records for patients has made effective data management a necessity. In medicine and healthcare, and particularly for web-based healthcare providers, computer-assisted categorization of such information into functional classes such as medical issues or illnesses can save time and effort. However, a substantial volume of data in the healthcare domain is still unstructured.

Methods: Unstructured text datasets are harder to use than structured ones, and researchers in various domains have proposed converting unstructured text into structured text. For easy comparison, matching, and classification, the text needs to follow a fixed, aligned pattern. Classification accuracy is a challenge when the text is unstructured, whereas text with a fixed pattern is easy to classify. Regular expressions are sequences of characters that describe such text patterns, and the proposed text classification builds on this strength. Sequence alignment makes it possible to find patterns in unstructured text. Although sequence alignment is a concept from biomedical research, it is used here to align text snippets that are similar but not identical to each other.

Results: The sequence alignment technique uses the similarity score assigned to each potential alignment to choose among the numerous local alignments of the sequences. The proposed local pairwise alignment method produces sequence alignments from which regular expressions are generated.
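
As an illustration of score-based local pairwise alignment, the following is a minimal sketch of the classic Smith-Waterman algorithm applied to word tokens rather than biological residues. It is not the authors' implementation: the function name local_align, the scoring values (match=2, mismatch=-1, gap=-1), and the example snippets are assumptions chosen for illustration only.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment of two token lists.

    Scoring values are illustrative assumptions, not taken from the article.
    Returns (best_score, aligned_a, aligned_b); None marks a gap.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; a cell never drops below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append(None)
            i -= 1
        else:
            out_a.append(None); out_b.append(b[j - 1])
            j -= 1
    return best, out_a[::-1], out_b[::-1]


# Hypothetical snippets: the second one lacks the word "the".
best, row_a, row_b = local_align(
    "pain in the left chest".split(),
    "pain in left chest".split(),
)
# best == 7
# row_a == ['pain', 'in', 'the', 'left', 'chest']
# row_b == ['pain', 'in', None, 'left', 'chest']
```

The similarity score (7 here) is what lets one alignment be preferred over the other candidate local alignments of the same pair.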

Conclusions: A regular expression is a string of characters that describes a text pattern. Regular expressions are typically written manually by domain experts. The proposed work generates them automatically from aligned sequences using a bottom-up approach. The generated regular expressions then serve as a dataset to which various machine learning algorithms are applied for text classification.
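
To make the idea of deriving a regular expression from an aligned pair concrete, here is a hedged sketch of one possible column-by-column construction. The abstract does not specify the authors' generation rules, so everything here is an assumption: the function name regex_from_alignment, the use of None for a gap, and the three rules (identical tokens become a literal, differing tokens become an alternation, a gap makes the token optional). It also assumes the first aligned column is not a gap.

```python
import re


def regex_from_alignment(row_a, row_b):
    """Build a regex from two aligned token rows (None marks a gap).

    Illustrative rules, assumed for this sketch:
      same token       -> escaped literal
      different tokens -> non-capturing alternation
      gap on one side  -> optional token
    Assumes the first column is not a gap.
    """
    pieces = []
    for x, y in zip(row_a, row_b):
        optional = x is None or y is None
        if optional:
            core = re.escape(x if y is None else y)
        elif x == y:
            core = re.escape(x)
        else:
            alternatives = sorted({re.escape(x), re.escape(y)})
            core = "(?:" + "|".join(alternatives) + ")"
        # Every column after the first is preceded by whitespace.
        piece = (r"\s+" if pieces else "") + core
        pieces.append(f"(?:{piece})?" if optional else piece)
    return "".join(pieces)


# Hypothetical aligned rows, e.g. the output of a local alignment step.
pattern = regex_from_alignment(
    ["pain", "in", "the", "left", "chest"],
    ["pain", "in", None, "left", "chest"],
)
# pattern == r"pain\s+in(?:\s+the)?\s+left\s+chest"
matches_both = all(
    re.fullmatch(pattern, s)
    for s in ("pain in the left chest", "pain in left chest")
)
```

A pattern built this way matches both source snippets, and such patterns could then be used as features for a classifier, in the spirit of the approach described above.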
