The next session will take place on May 26-29, 2015. Below is a list of the courses that will be offered. Stay tuned for more details and to register!

The Center for Computational Biology and Bioinformatics (CCBB) at The University of Texas at Austin is proud to host an annual Summer School for Big Data in Biology. This event is organized in collaboration with the Genome Sequencing and Analysis Facility (GSAF) and the Center for Systems and Synthetic Biology (CSSB) at UT Austin and aims to:


The UT Summer School for Big Data in Biology in 2014 offered 11 intensive, hands-on four-day workshops across diverse topics. The course offerings were designed for students with all levels of experience, from those who are new to computational biology, bioinformatics, genomics, and proteomics all the way to advanced practitioners.

In general, each course meets for four half-days, in either the mornings or the afternoons, for a total of twelve hours. Participants have access to a course website where instructors post lectures, datasets, exercises, and other useful course information. There are no examinations or tests. Upon request, participants receive a certificate for each course they complete. Academic credit is not issued.

All courses are held on the UT Campus in the GDC Building.

Offered courses for 2014:

MORNING 9 a.m. - 12 p.m.    AFTERNOON 1:30 - 4:30 p.m.

Bioinformatics for Beginners

Introduction to Python

Core NGS Analysis Tools at TACC

Genome Variant Analysis

Introduction to Biological Networks

Introduction to Phylogenomic Analysis

Computational Proteomics

Introduction to RNA-seq

Open Source SoftWARe: What's it good for?

Protein Modeling Using Rosetta

Introduction to Proteomics and Metabolomics

For more information on a course, click its title.

Each Summer School course runs for four 3-hour sessions, costs $175 per student, and is taught by one or more instructors.


Registration is open Feb. 1 – May 3, 2014


All major credit cards accepted. We cannot accept IDT as payment at this time. Thank you.

Affiliations with UT will be confirmed by our staff. Non-UT students must send us a copy of their current student identification. Please email a copy of your ID to CCBB Administration.

*All registrants must have a UT EID to register. The EID is needed to set up computer lab access if a course requires it. Click here to obtain a UT EID.

● Bioinformatics for Beginners

Using common bioinformatics tools effectively requires comfort with a UNIX command line, as well as basic programming skills. Taking advantage of the superclusters at TACC requires some basic understanding of what a computing cluster is. This course offers an introduction to UNIX, TACC, Python, and R basics, which are necessary to use bioinformatics tools on one of the world's most powerful computing systems. The goal of the course is not to make you a command-line Jedi; rather, the focus is on providing a conceptual framework (using real-world examples) that lets you explore further on your own.


● Core NGS Analysis Tools at TACC

  • Time: 9:00 a.m. - 12:00 p.m.
  • Location: FAC 101B
  • Instructors: Anna Battenhouse (Research Associate, Iyer Lab), Daechan Park (Graduate Student, Iyer Lab), Nathan Abell (Research Assistant, Iyer Lab)

This workshop provides an introduction to common analysis tools and file formats currently used in NGS, with emphasis on read mapping (bwa, bowtie2), the Sequence Alignment Map (SAM) format, and tools for manipulating BAM files (samtools, bedtools). Participants will gain hands-on experience using these and other NGS tools in the Linux command line environment at TACC, as well as exposure to the many bioinformatics resources TACC makes available.


● Introduction to Biological Networks

  • Time: 9:00 a.m. - 12:00 p.m.
  • Location: RLM 7.122
  • Instructor: Dr. Kris McGary (Research Associate, CSSB)

A guide to biological networks: construction, analysis, and application. In the post-genomic era, data sets in biology have grown ever larger and harder to interpret. Biological networks provide conceptual tools for understanding large amounts of experimental data and curated annotation within a coherent framework. This course will introduce the foundational concepts and tools needed to create biological networks and use them to understand the biological meaning concealed in large data sets. The course will cover networks based on gene expression, protein-protein interactions, phenotype profiles, etc., as well as methods for integrating multiple types of data into a single network.

Prerequisites: Prior experience with Linux/Unix is required.

Other requirements: A computer lab has been reserved so that everyone can work through defined problems without complication, but power users (those who don't need technical support) may bring their own laptops.


● Computational Proteomics

  • Time: 9:00 a.m. - 12:00 p.m.
  • Location: GDC 4.304
  • Instructor: Dr. Daniel Boutz (Research Associate, CSSB)

This course will teach the fundamental methods and skill sets needed to understand, analyze, and interpret mass spectrometry-based proteomic data. It is designed for researchers with a limited-to-intermediate background in protein mass spectrometry and covers key concepts and practical applications of experimental design, instrumentation, data processing, and interpretation of results. Through tutorials, participants will learn how to build a standard data analysis pipeline from freely available software and use it to process real datasets from start to finish. The course is by no means exhaustive; its goal is to establish a foundational understanding of proteomic data and a working knowledge of standard tools and methods on which to build.


● Introduction to Phylogenomic Analysis

  • Time: 1:30 p.m. - 4:30 p.m.
  • Location: GDC 4.302
  • Instructors: Dr. Tandy Warnow (PI, Department of Computer Science), Dr. Luay Nakhleh (PI, Computer Science, Rice University)

Phylogenomics is the estimation of evolutionary histories (trees or networks) using genes sampled from throughout genomes. Estimating species trees is challenging because gene trees can differ from each other due to biological processes such as incomplete lineage sorting. In this course, we will teach the foundations of phylogenetic analysis, beginning with multiple sequence alignment, continuing to gene tree estimation and species tree estimation from multiple gene trees, and ending with phylogenetic network estimation. We will provide hands-on training in the most accurate software for each step, so that course participants will be able to analyze their datasets with the best accuracy. We will also provide a brief introduction to metagenomic analysis, focusing on taxon identification and phylogenetic profiling of short reads from a metagenomic dataset.

Prerequisites: Prior experience with multiple sequence alignment and phylogeny estimation software.

Other requirements: Course participants should bring their own laptops so they can install the course software on them.


● Open Source SoftWARe: What's it good for?

  • Time: 9:00 a.m. - 12:00 p.m.
  • Location: This course has been dropped; please check back in the fall.
  • Instructor: Dr. Chris Simmons (Research Associate, ICES)

Absolutely Everything! However, it can be quite daunting to start using open source software. The tools and the language of the FLOSS (Free/Libre Open Source Software) movement might be quite different from those that you use for other work. Additionally, many open source software communities are notoriously rude to newcomers who don’t meet their “standards”, don’t do things their way, or don’t RTFM. Consequently, many first-time FLOSS users simply stop trying to use these powerful tools. In this course, we will bridge the gap between tool developers and tool consumers. We will cover the basics of open source software: what it is, where to get it, how to build it, how to use it, how to distinguish “good” software from “poor” software, and how to contribute back to the project. Even nonprogrammers can contribute to ANY PROJECT! We will discuss the basic elements of software engineering and what a healthy software community should look like. We will also introduce several community codes in bioinformatics (mappers/aligners, genome assemblers, phylogenetic analysis, and more) that fully embrace the FLOSS methodology, are resilient to a few developers leaving, and should be around for years to come.


● Introduction to Python

  • Time: 1:30 p.m. - 4:30 p.m.
  • Location: GDC 6.202
  • Instructors: April Wright (Graduate Student, David Hillis Lab), Stephanie Spielman (Graduate Student, Claus Wilke Lab)

This course is aimed at novice programmers and will give an introduction to programming in the Python language. The concepts covered are also applicable to many other modern languages. Basic topics will include assigning and manipulating variables and lists, control flow for modifying and managing large data sets, and basic file input and output for flexible file management. Other topics will include the handling of Excel data and plotting with Matplotlib. Further topic choices will be determined by participant interest.

Requirements: No programming experience is necessary. Participants are expected to provide their own computer, and will be instructed to download several free software packages.
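As a rough taste of the basic topics listed above (variables and lists, control flow, and file input and output), here is a minimal sketch. It is not taken from the course materials; the gene names, expression values, file name, and helper functions are all hypothetical and chosen only for illustration.

```python
import os
import tempfile

# Hypothetical sample data: gene names paired with made-up expression values.
gene_names = ["geneA", "geneB", "geneC"]
expression = [12.5, 0.0, 7.25]


def write_table(path, names, values):
    """Write one tab-separated 'name<TAB>value' line per gene."""
    with open(path, "w") as handle:
        for name, value in zip(names, values):
            handle.write(f"{name}\t{value}\n")


def read_expressed(path, threshold=1.0):
    """Return the names of genes whose value exceeds the threshold."""
    expressed = []
    with open(path) as handle:
        for line in handle:
            name, value = line.rstrip("\n").split("\t")
            if float(value) > threshold:
                expressed.append(name)
    return expressed


# Use a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    table_path = os.path.join(tmp_dir, "expression.tsv")
    write_table(table_path, gene_names, expression)
    expressed_genes = read_expressed(table_path)

print(expressed_genes)  # ['geneA', 'geneC']
```

The same loop-and-filter pattern extends to much larger tab-separated files, since the file is read one line at a time.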


● Introduction to RNA-seq

  • Time: 1:30 p.m. - 4:30 p.m.
  • Location: FAC 101B
  • Instructor: Dhivya Arasappan, MS (Bioinformatics Consultant, CCBB)

This workshop provides an introduction to methods for analysis of RNA-seq data. It assumes familiarity and comfort with the Linux command line and TACC. A typical RNA-seq workflow will be featured, starting from quality assessment of raw data and continuing through mapping (bwa, tophat2), differential expression analysis (DESeq, cuffdiff), splice variant analysis (cufflinks), and downstream analysis and visualization (cummeRbund). The course also describes analysis methods for non-traditional RNA-seq experiments such as RIP-seq. Participants will gain hands-on experience using these tools in a Linux command line environment at TACC.

Prerequisite Courses: Introduction to UNIX and Introduction to TACC, or a working knowledge of the Unix command line and TACC.


● Genome Variant Analysis

  • Time: 1:30 p.m. - 4:30 p.m.
  • Location: RLM 7.306
  • Instructors: Dr. Scott Hunicke-Smith (Director, GSAF), Jeff Barrick (PI, ICMB and CCBB)

This workshop is designed for researchers with next-generation DNA sequencing data who are assessing genetic variation in one or many samples relative to a reference genome. We will discuss the various types of genome variants (SNPs, in/dels, structural variants) and then interactively explore real datasets with current-generation tools. Participants will gain first-hand knowledge of mappers (BWA, Bowtie), variant callers for both "pure" (e.g. patient trios or bacterial strains) and "mixed" (e.g. cancer and microbial communities) samples (GATK, Samtools/bcftools, freeBayes, deepSNV, SVDetect), variant annotation tools (annovar, VAT, and others), variant visualization, and integrated variant prediction pipelines (such as breseq). Participants will gain a working knowledge of the theory and practice of variant calling and are encouraged to bring their own datasets. A working knowledge of Linux is strongly recommended, but there are no other prerequisites.


● Introduction to Proteomics and Metabolomics

  • Time: 9:00 a.m. - 12:00 p.m.
  • Location: GDC 4.304
  • Instructor: Dr. Viswanadham Sridhara (Bioinformatics Consultant, CCBB)

This workshop is designed to help both undergraduate and graduate students understand the basic simulation and modeling techniques in computational proteomics and systems biology. The topics covered include (a) setting up structural models of lipid bilayers for molecular dynamics simulations (~ns); (b) analyzing raw mass spectrometry-based proteomics data to identify proteins and their post-translational modifications; (c) an introduction to flux balance analysis (FBA) methods using genome-scale metabolic network models (for metabolic engineering purposes); and (d) the basics of interpreting high-dimensional proteomics data by plotting heat maps and scatter plots and by using simple regression techniques.

Tools used in the course include GROMACS, VMD, TPP, and the COBRA/SBML toolboxes.

By the end of the course, attendees will be able to set up their own MD simulations of lipid bilayers with their program of interest (GROMACS, NAMD, CHARMM, etc.), analyze mass-spec data, and computationally predict cellular phenotypes for metabolic engineering purposes.

Prerequisites: Basic knowledge of MATLAB and/or R is required.

● Protein Modeling Using Rosetta

This course is intended to give students a strong foundation and working knowledge for understanding computational protein design, a tool of growing importance in protein science. Computational protein design has grown in prominence as successful applications of this tool, such as the design of enzymes that carry out reactions not seen in nature, and the redesign of protein surfaces for enhanced stability, have been published. The course will introduce students to working with the Rosetta suite of software for macromolecular structure prediction and design, the premier software suite for computational protein design. Course topics will include visualizing and assessing protein structures; understanding key concepts of scoring and sampling in computational protein design; using various protein design strategies; loop modeling; and homology modeling.

Prerequisites: Knowledge of general biochemistry and a basic understanding of X-ray crystallography and NMR methods for protein structure determination.

Computer requirements: Familiarity with Linux or Unix terminals, PyMOL, and some coding is strongly encouraged but not required.


