About ExpressionGenesis

🧬 ExpressionGenesis in a Nutshell

ExpressionGenesis automatically adds standardized disease annotations to GEO Series (GSEs) using large language models and the Disease Ontology to give you clean, validated disease tags for quick dataset discovery or to reuse in your own research.

🔍 Search

Find GEO Series by disease, keyword, or organism.

Browse datasets →

📥 Download & Reuse

Get validated disease annotations (GSE → DOID mappings) for your research.

Download CSV →

📊 Explore

Visualize GEO submission trends by organism and sample size.

View trends →

Abstract

Public repositories, such as the Gene Expression Omnibus (GEO), host hundreds of thousands of transcriptomic datasets; however, inconsistent and unstructured metadata limit their reuse and integration. We developed ExpressionGenesis, an automated platform that generates standardized, ontology-linked metadata for GEO Series (GSE) records using large language models (LLMs) and retrieval-augmented generation (RAG).

ExpressionGenesis extracts structured information, including disease names, experimental design, study summaries, and keywords, from free-text GEO metadata and validates disease annotations using the Disease Ontology (DO). The system is implemented on Amazon Web Services (AWS) with a fully serverless architecture combining Lambda, Step Functions, and Athena for data processing, and a Next.js web interface on AWS Amplify for interactive exploration.

Evaluation of 200 GEO Series demonstrated that ExpressionGenesis achieves higher accuracy and F1-scores than previous NLP-based annotation methods. The public web application provides a searchable interface for enriched GEO metadata and submission trends, and a downloadable CSV of disease annotations for all indexed GEO Series.

ExpressionGenesis demonstrates the potential of LLMs, combined with ontology-grounded validation, to enhance the accessibility and reusability of publicly available gene expression data.

Keywords: Gene Expression Omnibus; metadata curation; disease annotation; large language models; retrieval-augmented generation; Disease Ontology.

Methods

Implementation Overview

For scalability, ExpressionGenesis was built and deployed on Amazon Web Services (AWS). All processing runs on a serverless infrastructure to reduce maintenance. The processing code is written in Python and executes in AWS Lambda and AWS Step Functions, eliminating the need for server management. Data integration is performed using AWS Athena and SQL. The web application is a Next.js application that runs on AWS Amplify with data served by DynamoDB.

Gene Expression Omnibus (GEO) Data

ExpressionGenesis processes GEO Series (GSE) and sample (GSM) data. This data is retrieved using Biopython and GEOparse. Biopython accesses the NCBI Entrez API to obtain a list of publicly available GSE entries. GEOparse downloads and parses the associated SOFT files from the NCBI SFTP server to extract detailed metadata for individual GEO Series, including titles, submission dates, summaries, overall experimental design, and associated PubMed article IDs.

LLM-Based Metadata Generation

ExpressionGenesis uses large language models (LLMs) to extract structured metadata from the free-text fields of GEO entries. These foundational models are accessed via AWS Bedrock and are prompted to generate standardized summaries, disease annotations, experimental design details, and keyword lists in a JSON format. For production metadata generation, we use the Meta Llama 3.3 70B instruct model via Bedrock.

A standardized prompt is used to ensure consistency across LLM output responses. The prompt specifies the required fields and includes formatting instructions to help ensure consistent JSON output, including summary of the experiment, keywords, experimental design (study type, groups, sample size, comparisons), and disease information (name, DOID, stage).

Disease Ontology Mapping

The Disease Ontology (DO) database offers standardized human disease terms, unique identifiers (DOIDs), and a hierarchical structure that categorizes diseases. DO terms enable consistent annotation for disease information. ExpressionGenesis validates disease annotations using retrieval-augmented generation (RAG), which links LLM predictions to the Disease Ontology database, significantly reducing hallucinations and inconsistent terminology.

Evaluation

Evaluation of 200 GEO Series demonstrated that ExpressionGenesis achieves higher accuracy and F1-scores than previous NLP-based annotation methods. The LLM-based pipeline outperforms prior approaches in both precision and recall, resulting in fewer false positives and false negatives.

Technical Architecture

Data Processing Pipeline

1

Ingest GEO Data

Biopython queries the NCBI Entrez API for publicly available GSE entries. GEOparse downloads and parses SOFT files from NCBI's SFTP server to extract titles, summaries, experimental design, sample metadata, and PubMed IDs.

2

LLM Metadata Generation

Meta Llama 3.3 70B (via AWS Bedrock) processes free-text GEO metadata using a standardized prompt to extract structured JSON output: study summaries, keywords, experimental design (study type, groups, sample sizes), and candidate disease annotations.

3

RAG Validation

Retrieval-Augmented Generation validates disease candidates against the Disease Ontology database. This maps predictions to standardized DOIDs, significantly reducing hallucinations and inconsistent terminology.

4

Store & Serve

Enriched metadata is stored in S3 and DynamoDB. AWS Athena enables SQL-based data integration. The Next.js web application runs on AWS Amplify for interactive exploration.

AWS Serverless Stack

Lambda

Compute

Step Functions

Orchestration

Bedrock

LLM Access

S3

Storage

DynamoDB

Database

Athena

Analytics

Cost & Scalability

~$50

monthly operating cost

includes processing, LLM inference, RAG validation, and web hosting

$0.0025

per GEO Series annotation

~$5/month for new submissions

267K+

GEO Series indexed

auto-scales to handle daily submission volumes

Data Sources

Gene Expression Omnibus (GEO)

GEO is a public functional genomics data repository at NCBI supporting MIAME-compliant data submissions. It contains array- and sequence-based data from the research community.

Visit GEO →

Disease Ontology (DO)

The Disease Ontology provides a standardized ontology for human disease terms, phenotype characteristics, and related medical vocabulary, enabling cross-database interoperability.

Visit Disease Ontology →

Citation

Spohn, D. R. (2026). ExpressionGenesis: Automated disease annotation and metadata generation for the Gene Expression Omnibus using large language models. [Manuscript in preparation]. Brandeis University.

Data Availability

ExpressionGenesis and all enriched annotations are freely available:

Web Application

The ExpressionGenesis web application provides a searchable interface for enriched GEO metadata and submission trends.

Browse ExpressionGenesis →

Downloadable Dataset

Disease annotations for all indexed GEO Series are available as a downloadable CSV file containing GEO Series accessions, disease names, and Disease Ontology identifiers.

Download Data →

🔬 Use Our Annotations in Your Research

We encourage researchers to reuse the disease annotations generated by ExpressionGenesis. All annotations are validated against the Disease Ontology, providing clean, standardized disease tags that can be integrated into your own workflows.

Potential use cases include:

📊 Benchmarking Datasets

Build disease-specific benchmark sets for method development or validation.

🔗 Meta-analyses

Quickly identify and aggregate GEO datasets for large-scale meta-analyses.

🧠 Machine Learning

Use disease labels as training data for ML models or classification tasks.

📚 Literature Reviews

Find relevant expression studies for systematic reviews by disease.

🧬 Reproducibility

Identify datasets with similar experimental conditions for replication studies.

🔍 Data Integration

Link GEO datasets to other resources using standardized DOIDs.

Download the full annotation dataset (CSV) →

Acknowledgments

The author thanks Karol Estrada, Ph.D., Brandeis University, for mentorship and guidance throughout this project.

Contact

Daniel R. Spohn, M.S.

Graduate Program in Bioinformatics, Brandeis University

Email: dspohn@gmail.com

LinkedIn Profile