Hosted by NIH.AI and the National Library of Medicine (NLM), this highly interactive workshop will offer opportunities to exchange expertise and collaborate with NIH researchers at all career levels who are using natural language processing technologies in their work. The four-hour workshop will include targeted presentations and will offer time for open discussion among peers and across disciplines.

The workshop will be held at the NLM Visitors Center, NIH Building 38A, Room 127, on Thursday, May 9, 2019, from 1 to 5 PM. A recording of the workshop will be made available following the event.

Can't attend in person? WebEx is available via https://cbiit.webex.com/cbiit/j.php?MTID=m9552e7d1d54c51a0710326402c4f355c
1:00 – 1:40 pm
Text Mining and Deep Learning for Biology and Healthcare: An Introduction – Lana Yeganova/Qingyu Chen, NCBI/NLM
This session covers the fundamentals of NLP and deep learning. We will start with basic natural language processing components such as tokenization and stemming, discuss the tf-idf term weighting scheme, and touch on the BM25 retrieval function. We will then transition from traditional word representations to word embeddings and demonstrate how they advance our ability to analyze relationships across words, sentences, and documents. We will discuss popular word embedding techniques, including Word2Vec, GloVe, and FastText, and show how computed word embeddings can be extended to create sentence embeddings. Finally, we will discuss deep neural networks such as CNNs and LSTMs. Running examples will be provided for each topic.
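To make the tf-idf portion concrete, here is a minimal sketch of term weighting and document similarity using scikit-learn; the three-document corpus is purely illustrative and not part of the session materials.

```python
# Toy tf-idf example: weight terms, then compare documents by cosine
# similarity of their tf-idf vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "BRCA1 mutations increase breast cancer risk.",
    "Deep learning models classify pathology reports.",
    "Word embeddings capture relationships between biomedical terms.",
]

# Tokenize, build the vocabulary, and weight each term by tf-idf.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
tfidf = vectorizer.fit_transform(corpus)

# Document-to-document relatedness from the weighted term vectors.
print(cosine_similarity(tfidf[0], tfidf[1]))
```

Ranking documents by this kind of similarity to a query vector is the lexical baseline that BM25 refines and that embedding-based representations later improve upon.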
1:40 – 2:15 pm
Automatic Information Extraction from Free-Text Pathology Reports Using Multi-Task Convolutional Neural Networks – Hong-Jun Yoon, Oak Ridge National Laboratory
We introduce two information extraction techniques for free-text cancer pathology reports: the Multi-Task Convolutional Neural Network (MT-CNN) and the Multi-Task Hierarchical Convolutional Attention Network (MT-HCAN). The models tackle document information extraction by learning to identify multiple characteristics simultaneously. We will demonstrate how the models are trained, how the latent representation captures key phrases and concepts, and how inference is performed.
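As an illustration of the multi-task idea (a shared text encoder with one classification head per report characteristic), here is a minimal PyTorch sketch in the spirit of MT-CNN; it is not the authors' implementation, and the vocabulary size, filter settings, and task heads are assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskCNN(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=100, num_filters=64,
                 kernel_sizes=(3, 4, 5), task_classes=(25, 4)):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        shared_dim = num_filters * len(kernel_sizes)
        # One classification head per extraction task (e.g. site, grade).
        self.heads = nn.ModuleList(
            [nn.Linear(shared_dim, n) for n in task_classes])

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed, seq)
        # Convolve and max-pool over time to get n-gram features.
        feats = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        shared = torch.cat(feats, dim=1)               # shared document vector
        return [head(shared) for head in self.heads]   # logits per task

model = MultiTaskCNN()
logits = model(torch.randint(0, 5000, (2, 200)))       # two dummy reports
targets = torch.zeros(2, dtype=torch.long)             # dummy gold labels
loss = sum(nn.functional.cross_entropy(l, targets) for l in logits)
```

Summing the per-task losses is what lets the shared convolutional features be learned from all report characteristics at once.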
2:15 – 2:30 pm
Break
2:30 – 3:05 pm
Biomedical Named Entity Recognition and Information Extraction – Robert Leaman/Shankai Yan, NCBI/NLM
Biomedical text mining applications require locating and identifying concepts of interest: the tasks of named entity recognition (NER) and normalization (NEN). Both tasks have a long history in biomedical text mining, using techniques that have evolved from primarily lexical and rule-based approaches, to machine learning with rich feature sets, to deep learning with learned feature representations. Our PubTator Central (PTC) system provides on-demand NER and NEN annotations for six biomedical concept types (genes/proteins, genetic variants, diseases, chemicals, species, and cell lines) in both biomedical abstracts and full-text articles. PTC processes input text through multiple NER/NEN systems and combines their output with a disambiguation module based on deep learning. The module uses a convolutional neural network (CNN) to determine the most likely concept type for overlapping annotations, based on the syntax and semantics of both the span being classified and the surrounding context. The disambiguation model is trained with a weakly supervised approach and provides a significant accuracy improvement. We are currently benchmarking deep learning methods for NER and NEN. Deep learning methods for NER have matured significantly, primarily using variations of long short-term memory networks (LSTMs). Normalization with deep learning remains an area of active development, and we will describe some recent progress.
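For a concrete picture of the LSTM-based NER methods mentioned above, the following is a minimal BiLSTM token tagger in PyTorch; it is a generic sketch with assumed sizes and a toy BIO label set, not the PTC pipeline itself.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=100,
                 hidden_dim=128, num_tags=3):  # e.g. B-Disease, I-Disease, O
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):      # (batch, seq_len)
        x = self.embedding(token_ids)
        h, _ = self.lstm(x)            # (batch, seq_len, 2 * hidden_dim)
        return self.classifier(h)      # per-token tag logits

tagger = BiLSTMTagger()
logits = tagger(torch.randint(0, 5000, (1, 12)))  # one dummy sentence
tags = logits.argmax(dim=-1)                      # predicted tag per token
```

Full NER systems typically add a CRF layer on top of the per-token logits so that predicted tag sequences respect BIO constraints.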
3:05 – 3:40 pm
Neural Approaches to Medical Question Understanding – Asma Ben Abacha/Yassine Mrabet, LHC/NLM
Online resources are increasingly used by consumers to meet their health information needs. According to surveys from the Pew Research Center, one in three U.S. adults (35%) looks for information on a medical condition online, and 15% of internet users have posted questions, comments, or information about health-related issues on the web. Consumer health questions are often challenging for automated processing and answering due to their proximity to open-domain language, high rates of misspellings and ungrammatical sentences, and the frequent insertion of background information. In this talk, we will describe our approaches to automatically understanding and answering consumer health questions. We will present our efforts to summarize long consumer health questions into short questions that are more efficient for answer retrieval, and to infer entailment relations between new user questions and existing, already answered questions. We will then discuss our approaches to extracting key information from the user question, such as the main topic and the question type, and how to use this information in answer retrieval. Finally, we will present a first prototype that combines several of these approaches to tackle question understanding and answer retrieval.
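As a simplified stand-in for matching new user questions against existing, already answered ones, the sketch below ranks candidate questions by tf-idf cosine similarity; the FAQ entries and the consumer question are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

answered = [
    "What are the symptoms of type 2 diabetes?",
    "How is high blood pressure treated?",
    "What causes migraine headaches?",
]
new_question = ("my mother was told she has sugar diabetes, what signs "
                "should we watch for")

# Fit a shared vocabulary, then score each answered question against
# the new consumer question.
vectorizer = TfidfVectorizer(stop_words="english").fit(answered + [new_question])
scores = cosine_similarity(vectorizer.transform([new_question]),
                           vectorizer.transform(answered))[0]
best = scores.argmax()
print(answered[best], scores[best])
```

A purely lexical matcher like this is exactly what struggles with misspellings, paraphrases ("signs" vs. "symptoms"), and background narrative, which is what motivates the neural summarization and entailment models presented in the talk.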
3:40 – 3:50 pm
Break
3:50 – 4:25 pm
Transfer Learning in Biomedical NLP: A Case Study with BERT – Yifan Peng, NCBI/NLM
BERT (Bidirectional Encoder Representations from Transformers) is a recent language representation model proposed by researchers at Google AI Language. It has achieved state-of-the-art results on a wide variety of NLP tasks. Here we introduce how to pre-train the BERT model on large-scale biomedical and clinical corpora (PubMed and MIMIC-III) and how to fine-tune it for specific tasks such as named entity recognition and relation extraction.
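Below is a minimal sketch of the fine-tuning step using the Hugging Face transformers library; the generic bert-base-uncased checkpoint stands in for a model pre-trained on PubMed and MIMIC-III, and the disease BIO label set and dummy labels are assumptions for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-Disease", "I-Disease"]  # toy NER tag set
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels))

batch = tokenizer(["Mutations in BRCA1 are linked to breast cancer."],
                  return_tensors="pt")
# Dummy per-token gold labels; a real run aligns BIO tags to subwords.
gold = torch.zeros(batch["input_ids"].shape, dtype=torch.long)

outputs = model(**batch, labels=gold)  # forward pass computes the loss
outputs.loss.backward()                # one fine-tuning gradient step
print(outputs.logits.shape)            # (batch, seq_len, num_labels)
```

Swapping in a biomedical checkpoint and replacing the dummy labels with annotated data is essentially what the fine-tuning for NER and relation extraction described above amounts to.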
4:25 – 5:00 pm
Guided Discussion
For questions, please contact George Zaki (george.zaki@nih.gov), Miles Kimbrough (miles.kimbrough@nih.gov), or Yifan Peng (yifan.peng@nih.gov).