
Scraping Enrollment Data From the UCLA Registrar, Part Two


In part one of this series, I discussed my initial exploration into scraping the various pages and APIs of the UCLA Registrar’s online Schedule of Classes in order to extract enrollment data on classes at UCLA. After experimenting with a couple of different languages and libraries, I settled on writing the scraper in Go. I finished part one by writing the code to scrape all of the subject areas offered for a term and all of the courses for a given subject area, but did not finish the code to scrape a section, as I realized that scraping sections meant I’d have to store the courses I previously scraped somewhere.

Part two of the post finishes the discussion of the scraper’s development by covering how I stored the data I scraped and finally finished scraping all the sections. I’ll then discuss how I deployed the scraper onto AWS Lambda, various post-deployment bugs I encountered, and future improvements that could be made. That’s a lot, so let’s hop right on in!

Storing the data and designing the schema

It was finally time to put the data I was scraping into a database. I had heard a lot of great things about PostgreSQL, so I decided to try it.

Although I was initially only interested in the data of individual sections, I realized it’d make sense to have three tables: one for subject areas, one for courses, and one for sections. This way, I could use the information from one table in scraping the data for another, reducing the number of network requests I’d have to make to the site.

My initial stab at writing the schema looked like this:

CREATE TABLE subject_areas (
  id SERIAL PRIMARY KEY,
  label TEXT,
  value TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE courses (
  id SERIAL PRIMARY KEY,
  subject_area_id INTEGER REFERENCES subject_areas(id),
  title TEXT,
  number TEXT,
  model TEXT UNIQUE,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE sections (
  id SERIAL PRIMARY KEY,
  -- section_id is a unique id assigned by the registrar, different from the autogenerated postgres id
  section_id TEXT UNIQUE,
  course_id INTEGER REFERENCES courses(id),
  days TEXT,
  time TEXT,
  location TEXT,
  units TEXT,
  instructor TEXT,
  enrollment_status TEXT,
  enrollment_count INT,
  enrollment_capacity INT,
  waitlist_status TEXT,
  waitlist_count INT,
  waitlist_capacity INT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

An early issue I ran into was the size of the sections table: because I was scraping sections every hour, I had millions of rows in the table after a couple weeks. On the advice of Alex, a coworker at Keybase and SQL wizard, I ended up splitting the sections table into two tables: one of section information (titled sections) and one of just enrollment data (titled enrollment_data). sections would contain information about a section’s time, location, instructor, etc. – things that don’t change frequently – while enrollment_data would track the ever-changing enrollment data. The split gave me:

CREATE TABLE sections (
  id SERIAL PRIMARY KEY,
  -- section_id is unique per term
  section_id TEXT UNIQUE,
  course_id INTEGER REFERENCES courses(id),
  days TEXT,
  time TEXT,
  location TEXT,
  units TEXT,
  instructor TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE enrollment_data (
  id SERIAL PRIMARY KEY,
  section_id TEXT REFERENCES sections(section_id),
  enrollment_status TEXT,
  enrollment_count INT,
  enrollment_capacity INT,
  waitlist_status TEXT,
  waitlist_count INT,
  waitlist_capacity INT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Actually using Postgres with Go turned out to be very simple; there’s a popular pq driver for Go’s built-in sql package. Jon Calhoun has a great tutorial series on how to get it set up.
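For reference, wiring up pq is just a blank import alongside database/sql; the driver registers itself under the “postgres” name that sql.Open expects:

import (
	"database/sql"

	// Imported for its side effects: pq's init function registers the
	// "postgres" driver with database/sql.
	_ "github.com/lib/pq"
)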

Most of my database logic is encapsulated in a couple of functions. The ConnectToDatabase function is shared across the various scraping functions.

func ConnectToDatabase() (*sql.DB, error) {
	host := os.Getenv("DB_HOST")
	port := 5432
	user := os.Getenv("DB_USER")
	password := os.Getenv("DB_PASS")
	dbname := os.Getenv("DB_NAME")
	var psqlInfo string
	if password != "" {
		psqlInfo = fmt.Sprintf("host=%s port=%d user=%s password=%s dbname=%s sslmode=disable",
			host, port, user, password, dbname)
	} else {
		psqlInfo = fmt.Sprintf("host=%s port=%d user=%s dbname=%s sslmode=disable",
			host, port, user, dbname)
	}
	db, err := sql.Open("postgres", psqlInfo)
	if err != nil {
		return nil, err
	}
	err = db.Ping()
	if err != nil {
		return nil, err
	}
	return db, nil
}

Saving subject areas and courses

Armed with a way to persistently save data, I went back to ScrapeSubjectAreas and ScrapeCourses to save their data.

The result was two new functions, SaveSubjectAreas and SaveCourses. The code for SaveCourses is below; SaveSubjectAreas is pretty similar (a sketch of it follows the SaveCourses code).

func SaveCourses(db *sql.DB, courses []Course) {
	insertCourse := `
	INSERT INTO courses (subject_area_id, title, number)
	VALUES ($1, $2, $3)
	ON CONFLICT (subject_area_id, title, number)
	DO UPDATE
	SET updated_at = NOW()
	RETURNING id
	`
	for _, course := range courses {
		var id int
		err := db.QueryRow(insertCourse, course.SubjectAreaID, course.Title, course.Number).Scan(&id)
		if err != nil {
			log.Error(err)
		}
		// insertTerm and term are defined elsewhere in the scraper; this records
		// which term the course was seen in (more on terms later in the post).
		_, err = db.Exec(insertTerm, id, term)
		if err != nil {
			log.Error(err)
		}
	}
}
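And here’s a minimal sketch of what SaveSubjectAreas might look like. It mirrors SaveCourses and assumes a unique constraint on subject_areas(value) for the upsert to conflict on (the schema above doesn’t show one):

// A sketch of SaveSubjectAreas; assumes subject_areas.value has a unique constraint.
func SaveSubjectAreas(db *sql.DB, subjectAreas []SubjectArea) {
	insertSubjectArea := `
	INSERT INTO subject_areas (label, value)
	VALUES ($1, $2)
	ON CONFLICT (value)
	DO UPDATE
	SET updated_at = NOW()
	`
	for _, subjectArea := range subjectAreas {
		if _, err := db.Exec(insertSubjectArea, subjectArea.Label, subjectArea.Value); err != nil {
			log.Error(err)
		}
	}
}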

Retrieving subject areas

Previously, in the ScrapeCourses function, there was the following line used to get all of the subject areas to scrape courses from:

var subjectAreas = ScrapeSubjectAreas()

Now that subject areas were stored in the database, I went back and replaced this line with a function that retrieves the subject areas from the database instead of rescraping them.

func RetrieveSubjectAreas(db *sql.DB) (subjectAreas []SubjectArea, err error) {
	rows, err := db.Query("SELECT id, label, value FROM subject_areas")
	if err != nil {
		log.Error(err)
		return subjectAreas, err
	}
	defer rows.Close()
	for rows.Next() {
		var id string
		var label string
		var value string
		err = rows.Scan(&id, &label, &value)
		if err != nil {
			log.Error(err)
		}
		subjectAreas = append(subjectAreas, SubjectArea{
			ID:    id,
			Label: label,
			Value: value,
		})
	}
	err = rows.Err()
	if err != nil {
		log.Error(err)
		return subjectAreas, err
	}
	return subjectAreas, nil
}

db, err := ConnectToDatabase()
if err != nil {
  return err
}
subjectAreas, err := RetrieveSubjectAreas(db)
if err != nil {
  return err
}

Note that the function reads out the unique ID that Postgres creates for each subject area, which can then be used to relate courses and subject areas when SaveCourses is called.
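As a rough sketch of that flow (ScrapeCoursesForSubjectArea here is a hypothetical stand-in for the course-scraping logic from part one):

// Hypothetical glue showing how the Postgres-assigned subject area ID ends up
// on each course before SaveCourses runs.
for _, subjectArea := range subjectAreas {
	courses, err := ScrapeCoursesForSubjectArea(subjectArea) // hypothetical helper
	if err != nil {
		log.Error(err)
		continue
	}
	for i := range courses {
		courses[i].SubjectAreaID = subjectArea.ID
	}
	SaveCourses(db, courses)
}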

Finally scraping sections

Now that courses were being saved by ScrapeCourses, I could retrieve them in a RetrieveCourses function that queried the database and returned a Go slice, similar to the previous RetrieveSubjectAreas.
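Here’s a rough sketch of RetrieveCourses; the column list is an assumption based on what the section-scraping code below needs (the course’s database id and its model blob):

// A sketch of RetrieveCourses, mirroring RetrieveSubjectAreas.
func RetrieveCourses(db *sql.DB) (courses []Course, err error) {
	rows, err := db.Query("SELECT id, model FROM courses")
	if err != nil {
		log.Error(err)
		return courses, err
	}
	defer rows.Close()
	for rows.Next() {
		var course Course
		if err = rows.Scan(&course.ID, &course.Model); err != nil {
			log.Error(err)
		}
		courses = append(courses, course)
	}
	if err = rows.Err(); err != nil {
		log.Error(err)
		return courses, err
	}
	return courses, nil
}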

Once I had the courses, scraping sections would be similar to how I scraped courses last time.

The top-level ScrapeSections function established the same parallel scraping that was in ScrapeCourses.

func ScrapeSections() error {
	db, err := ConnectToDatabase()
	if err != nil {
		return err
	}

	courses, err := RetrieveCourses(db)
	if err != nil {
		return err
	}

	var wg sync.WaitGroup
	for _, course := range courses {
		wg.Add(1)
		go func(course Course) {
			// sem is the request-limiting semaphore from the parallel-scraping setup
			// in part one; the lambda-ized version later in the post shows how it gets sized.
			sem <- struct{}{}
			defer func() { <-sem }()
			defer wg.Done()
			err := FetchAndSaveSections(course, db)
			if err != nil {
				log.Error(err)
			}
		}(course)
	}
	wg.Wait()
	return nil
}

I’m not going to go over RetrieveCourses in detail – the sketch above and RetrieveSubjectAreas give the idea. Besides, the interesting stuff is in FetchAndSaveSections. In FetchAndSaveSections, I:

  1. Make an HTTP request to the endpoint with the relevant section information
  2. Parse the response
  3. Iterate through each row of the section table
  4. For each row, extract the relevant section information
  5. Store the extracted info into the database
func FetchAndSaveSections(course Course, db *sql.DB) error {
	// Step 1: make HTTP request to /GetCourseSummary endpoint
	const queryURL = "https://sa.ucla.edu/ro/Public/SOC/Results/GetCourseSummary"
	req, err := http.NewRequest("GET", queryURL, nil)
	if err != nil {
		return err
	}
	params := req.URL.Query()
	params.Add("model", course.Model)
	params.Add("FilterFlags", FilterFlags)
	req.URL.RawQuery = params.Encode()
	response, err := client.Do(req)
	if err != nil {
		return err
	}
	defer response.Body.Close()

	// Step 2: parse the response into a goquery object
	doc, err := goquery.NewDocumentFromReader(response.Body)
	if err != nil {
		return err
	}

	// Step 3: iterate through each row of the section table
	table := doc.Find("div[id$=-children]")
	rows := table.ChildrenFiltered("div")
	for i := range rows.Nodes {
		row := rows.Eq(i)
		rowID, exists := row.Attr("id")
		if !exists {
			return errors.New("could not find row id")
		}
		idRegex := regexp.MustCompile(`([0-9]+)_`)
		matches := idRegex.FindStringSubmatch(rowID)

		// Step 4: extract the section information
		sectionID := matches[1]
		enrollmentData := strings.TrimSpace(row.Find("div[id$=-status_data]").Text())
		enrollmentStatus, enrollmentCount, enrollmentCapacity := ParseEnrollmentData(sectionID, enrollmentData)
		days := strings.TrimSpace(row.Find("div[id$=-days_data]").First().Text())
		// ...

		// Step 5: save section to database
		section := Section{
			SectionID:        sectionID,
			CourseID:         course.ID,
			EnrollmentStatus:   enrollmentStatus,
			EnrollmentCount:    enrollmentCount,
			EnrollmentCapacity: enrollmentCapacity,
			Days:             days,
			// ...
		}
		if err := SaveSection(section, db); err != nil {
			log.Error(err)
		}
	}
	return nil
}

ParseEnrollmentData employs a lot of conditional logic and regexes to properly parse the numbers and status of a class.1 I’ll get to it in a bit, but just pretend it works for now.

The final bit was to define SaveSection, which both upserts a row in the sections table and creates a new enrollment_data row.

func SaveSection(section Section, db *sql.DB) error {
	insertSection := `
	INSERT INTO sections (
		section_id,
		course_id,
		days,
		time,
		...
	)
	VALUES (...)
	ON CONFLICT (section_id)
	DO UPDATE SET
	updated_at = NOW(),
	days = EXCLUDED.days,
	time = EXCLUDED.time,
	...
	`

	insertEnrollmentData := `
	INSERT INTO enrollment_data (
		section_id,
		enrollment_count,
		waitlist_count,
		...
	)
	VALUES (...)
	`

	var id int
	err := db.QueryRow(insertSection,
		section.SectionID,
		section.CourseID,
		...
	).Scan(&id)
	if err != nil {
		log.Error(err)
		return err
	}

	// enrollment_data.section_id references the registrar's section_id, not the serial id
	_, err = db.Exec(insertEnrollmentData,
		section.SectionID,
		section.EnrollmentCount,
		section.WaitlistCount,
		...
	)
	if err != nil {
		log.Error(err)
		return err
	}

	return nil
}

And with that, the scraper was complete: it could scrape subjects, courses, and sections and save them all in a local Postgres database. I ran it locally and it worked!

It just now needed to be run every hour.

Going serverless

Running the scraper every hour meant I’d need to do some kind of job scheduling on a remote server – it wasn’t practical to keep my laptop running every hour and scrape from my home network. My initial thought was to use cron on a Digital Ocean droplet or similar, but I didn’t like the idea of paying for a server to run 24/7 just so it could execute a ~5 minute script every hour. JavaScript Twitter loves to talk about the magic of running semi-occasional jobs in a serverless manner, so I decided to look into it as a cheaper and simpler option.

There are a lot of different “Functions as a Service” providers out there, but I figured I’d go with AWS Lambda since they had a generous free plan, supported Go, and seemed to be pretty popular. The AWS console is a daunting webpage, so I was also looking for a tool that could allow me to write my infrastructure as code. There are a ton of different options in this space: Serverless, Terraform, and CloudFormation were just some of the options I looked at.

Ultimately, I went with AWS SAM because it came with good examples, was easy to set up, was highly integrated with AWS services, and had support for offline testing. This was my first foray into working with AWS and I found it immensely helpful to use a simpler tool that shared all its terminology with AWS. The biggest con of SAM I found was that it’s less popular than some of the bigger tools, and it was tough to figure out how to utilize some of its lesser known features.

Setting up a lambda function is pretty easy, once you have a solid grasp on the various components of AWS. There are AWS-provided libraries for all the supported lambda languages/runtimes that allow you to define the logical start function, which AWS calls a “handler function”. In Go, you have to call lambda.Start in your main function with the handler function as an argument. Here’s what ScrapeSections looked like, once lambda-ized:

import "github.com/aws/aws-lambda-go/lambda"

func ScrapeSections() error {
	db, err := ConnectToDatabase()
	if err != nil {
		return err
	}
	courses, err := RetrieveCourses(db)
	if err != nil {
		return err
	}

	// Set up semaphore to cap concurrent connections
	maxConnections, err := strconv.Atoi(os.Getenv("DB_MAX_CONNECTIONS"))
	if err != nil {
		return err
	}
	sem := make(chan struct{}, maxConnections)

	var wg sync.WaitGroup
	for _, course := range courses {
		wg.Add(1)
		go func(course Course) {
			sem <- struct{}{}
			defer func() { <-sem }()
			defer wg.Done()
			err := FetchAndSaveSections(course, db)
			if err != nil {
				log.Error(err)
			}
		}(course)
	}
	wg.Wait()
	return nil
}

func main() {
	lambda.Start(ScrapeSections)
}

From a design perspective, I decided it made sense to make each of my three main functions (ScrapeSubjectAreas, ScrapeCourses, and ScrapeSections) its own lambda function. This allowed me to set a different schedule for each: sections would be scraped hourly, but courses and subject areas could be scraped less frequently.2 Beyond scheduling, the split clarified each lambda’s role and kept the scraper from becoming too monolithic.

Once your function is written, SAM requires a template.yaml file. This file defines your function, any other AWS resources it needs, environment variables, etc. SAM uses this file locally in order to create a Docker environment so that you can test your function, as well as to simplify production deployments.

Here’s what the template.yaml looks like for ScrapeSections:

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Globals:
  Function:
    Timeout: 360
Resources:
  ScrapeSectionsFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: s3://ucla-registrar-scraper/5338dd93d471ce67be27b01475911bd2
      Handler: scrape-sections
      Runtime: go1.x
      Tracing: Active
      Events:
        ScrapeWebsiteHourly:
          Type: Schedule
          Properties:
            Schedule: cron(0 * * * ? *)
            Name: Hourly
            Description: Top of every hour
            Enabled: True
      Environment:
        Variables:
          DB_MAX_CONNECTIONS: 75
          DB_USER: nathan
          DB_PASS: password
          DB_HOST: my-db.example.com
          DB_NAME: registrar_data
Outputs:
  ScrapeSectionsFunction:
    Description: 'ScrapeSections ARN'
    Value: !GetAtt ScrapeSectionsFunction.Arn
  ScrapeSectionsFunctionIamRole:
    Description: 'Implicit IAM Role created for ScrapeSections function'
    Value: !GetAtt ScrapeSectionsFunctionRole.Arn

A lot of this is boilerplate that I generated from the AWS Go example template.yaml. The interesting bits are:

  • I set environment variables via the Environment.Variables property on the resource. Environment variables include the max connection limit and DB login info.
  • The Events section is where I schedule the scraping, which happens hourly for sections.2 AWS supports cron syntax.

From that, I was ready to test locally (and then deploy!). To locally test your code, SAM provides a command, sam local invoke. invoke takes a --event argument that is a path to a JSON file representing event data passed into the lambda. Since I don’t pass any data into the lambdas, I just made a simple empty JSON file to test with.

For Go, you also need to compile your changes before you can invoke them. I found myself using go build and sam local invoke a lot, so I started a simple makefile.

build:
	GOOS=linux GOARCH=amd64 go build -o scrape-sections ./main.go

local: build
	sam local invoke --event event.json

Now I could just type make local and run the scraper locally.

The makefile also ended up being useful for deployment, where you have to first package the build, then deploy it.3

S3_BUCKET=my-s3-bucket-url

package: build
	sam package \
		--output-template-file packaged.yaml \
		--s3-bucket $(S3_BUCKET)

deploy: package
	sam deploy \
		--template-file packaged.yaml \
		--stack-name fetch-sections \
		--capabilities CAPABILITY_IAM \
		--region us-west-1

I chose us-west-1 as my region, as it was the closest to the UCLA servers, which are in Southern California.

Deploying the database

Since my scrapers would be running every hour, I needed a database that was always available. Amazon RDS for PostgreSQL was perfect for my needs. The database runs on a db.t2.micro instance because it’s free-tier eligible and works well enough.

Logging and error detection

After deploying the functions and letting them run for a couple of days, I soon realized that there was no error detection for the functions.

Since I’m pretty familiar with Datadog and they offer a great free student plan, using it was a no-brainer. For Go logging, they recommend Logrus, which was super easy to set up.

An interesting quirk I found is that Logrus seems to only be initializable in Go’s init function, not main. So every lambda function has an init that looks like:

func init() {
	InitializeLogging()
}

InitializeLogging is pretty straightforward. I use a LOGGING environment variable to distinguish between the environments I test in and production. In production, I want to log JSON because the output is sent to and parsed by Datadog. Locally, it’s easier to see log output as plain text on my screen.

func InitializeLogging() {
	log.SetOutput(os.Stdout)

	// In production, LOGGING is set to "json" so Datadog can parse the output.
	if os.Getenv("LOGGING") == "json" {
		log.SetFormatter(&log.JSONFormatter{})
		log.SetLevel(log.WarnLevel)
	} else {
		log.SetLevel(log.InfoLevel)
	}

	log.Info("Logging Initialized")
}

Fine tuning the regular expressions

Over the first few months of scraping the Registrar, I ran into some edge cases that my original regular expressions couldn’t handle. For example: a class that is closed by a department but has enrolled students.

Comm 171, which is closed even with an enrollment of over 100 students. A lot of upper div Comm classes are closed like this; perhaps department policy is to close these sections after second or third week?

Originally, my scraper was just marking this section as closed with an enrollment count of 0! That wouldn’t work at all.

After many tweaks, I’ve settled on the following regular expressions for enrollment and waitlist statuses.

// Enrollment regexes
var tenativeRegex = regexp.MustCompile(`^Tenative`)
var canceledRegex = regexp.MustCompile(`^Cancelled`)
var closedByDeptRegex = regexp.MustCompile(`^Closed by Dept[a-zA-Z,/ ]*(\((?P<Capacity>\d+) capacity, (?P<EnrolledCount>\d+) enrolled, (?P<WaitlistedCount>\d+) waitlisted\))?`)
var classFullRegex = regexp.MustCompile(`ClosedClass Full \((?P<Capacity>\d+)\)(, Over Enrolled By (?P<OverenrolledCount>\d+))?`)
var classOpenRegex = regexp.MustCompile(`Open(\d+) of (\d+) Enrolled(\d+) Spots? Left`)
var waitlistOnlyRegex = regexp.MustCompile(`^Waitlist$`)
var waitlistFullRegex = regexp.MustCompile(`^WaitlistClass Full \((?P<Capacity>\d+)\)(, Over Enrolled By (?P<OverenrolledCount>\d+))?`)
// Waitlist regexes
var waitlistOpenRegex = regexp.MustCompile(`(\d+) of (\d+) Taken`)
var noWaitlistRegex = regexp.MustCompile(`No Waitlist`)
var waitlistClosedRegex = regexp.MustCompile(`Waitlist Full \((\d+)\)`)

Like most regular expressions, these can seem daunting at first but actually don’t use many regular expression features.

  • Beginning of line ^ and end of line $
  • Character ranges []
  • Match 0 or 1 ?
  • Match 0 or more * and 1 or more +
  • Match digits \d
  • Capturing groups () and named capturing groups (?P<name>...)

The named capturing groups are especially useful, as I use a function that extracts all named matches out of a string into a map:

func CreateRegexMatchMap(re *regexp.Regexp, str string) (matchMap map[string]string) {
	match := re.FindStringSubmatch(str)
	matchMap = make(map[string]string)
	for i, name := range re.SubexpNames() {
		if i > 0 && i <= len(match) {
			matchMap[name] = match[i]
		}
	}
	return matchMap
}
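To make that concrete, here’s what it returns for a made-up “Closed by Dept” status string:

// Made-up status string in the format closedByDeptRegex expects.
statusText := "Closed by Dept (120 capacity, 104 enrolled, 0 waitlisted)"
matchMap := CreateRegexMatchMap(closedByDeptRegex, statusText)
// matchMap["Capacity"] == "120"
// matchMap["EnrolledCount"] == "104"
// matchMap["WaitlistedCount"] == "0"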

This makes parsing the info from one of the above regexes much easier to understand. For example, here’s the code for a function that parses and returns the numbers from a section that is closed, like Comm 171.

func ParseClosedStatus(sectionID, enrollmentData string) (enrollmentStatus string, enrollmentCount, enrollmentCapacity int) {
	enrollmentStatus = "Closed"
	matchMap := CreateRegexMatchMap(closedByDeptRegex, enrollmentData)

	enrollmentCapacity, err := strconv.Atoi(matchMap["Capacity"])
	if err != nil {
		log.WithFields(log.Fields{"sectionID": sectionID}).Error("cannot parse capacity")
	}

	enrollmentCount, err = strconv.Atoi(matchMap["EnrolledCount"])
	if err != nil {
		log.WithFields(log.Fields{"sectionID": sectionID}).Error("cannot parse enrollment count")
	}

	return enrollmentStatus, enrollmentCount, enrollmentCapacity
}

Remember the ParseEnrollmentData function that appeared earlier? It basically just determines which ParseXStatus to run.

func ParseEnrollmentData(sectionID, enrollmentData string) (enrollmentStatus string, enrollmentCount, enrollmentCapacity int) {
	if tenativeRegex.MatchString(enrollmentData) {
		enrollmentStatus = "Tenative"
	} else if canceledRegex.MatchString(enrollmentData) {
		enrollmentStatus = "Cancelled"
	} else if closedByDeptRegex.MatchString(enrollmentData) {
		return ParseClosedStatus(sectionID, enrollmentData)
	} else if classFullRegex.MatchString(enrollmentData) {
		return ParseFullStatus(sectionID, enrollmentData)
	} else if classOpenRegex.MatchString(enrollmentData) {
		return ParseOpenStatus(sectionID, enrollmentData)
	} else if waitlistFullRegex.MatchString(enrollmentData) {
		return ParseWaitlistStatus(sectionID, enrollmentData)
	} else if waitlistOnlyRegex.MatchString(enrollmentData) {
		enrollmentStatus = "Waitlist"
	} else {
		log.WithFields(log.Fields{"enrollmentData": enrollmentData}).Error("unrecognized string")
	}
	return enrollmentStatus, enrollmentCount, enrollmentCapacity
}

I’ve found parsing info from regexes like this to be pretty robust.

Handling multiple locations/times/professors

I initially assumed that a section could only have one location, time, or professor. Not true. For example, take Theater 134E: Dance for Musical Theater III. 4 instructors, 3 days, 3 times, and 2 locations!

Yikes! How would the scraper handle that?

It turns out each new line was created by a <br> tag in the HTML. So, if I split on <br>, then I’d get an array of the times/locations/days/instructors. Since the same logic is reused in four places, I ended up extracting it into a function.

func ParseEntryToArray(rawText string) []string {
	tagMatcher := regexp.MustCompile(`<wbr/>|<(/)?p>|<(/)?a[^>]*>`)
	text := tagMatcher.ReplaceAllString(rawText, "")
	text = strings.TrimSpace(text)
	text = html.UnescapeString(text)
	arr := strings.Split(text, "<br/>")
	log.WithFields(log.Fields{"array": arr, "length": len(arr)}).Info("Parsed array")
	return arr
}

The first few lines clean up the text by stripping extraneous HTML tags, removing whitespace, and unescaping any raw HTML characters. Then the array is created and returned.
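For example, an instructor cell with two made-up names comes back as a two-element slice:

// Hypothetical instructor cell: two names separated by a <br/> tag.
instructors := ParseEntryToArray("Smith, J.<br/>Jones, K.")
// instructors == []string{"Smith, J.", "Jones, K."}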

Postgres has support for arrays, so all I needed to do then was update the database schema and wrap the arrays in pq.Array() when inserting them.

The database migration ended up looking like this; it uses Postgres’ string_to_array function.

BEGIN;
  ALTER TABLE sections
  RENAME COLUMN time TO times;

  ALTER TABLE sections
  RENAME COLUMN location TO locations;

  ALTER TABLE sections
  ALTER COLUMN times TYPE TEXT[] USING string_to_array(times, '');

  ALTER TABLE sections
  ALTER COLUMN locations TYPE TEXT[] USING string_to_array(locations, '');

  ALTER TABLE sections
  ALTER COLUMN days TYPE TEXT[] USING string_to_array(days, '');
END;

The corresponding code change was simple after the migration.

err := db.QueryRow(insertSection,
	section.SectionID,
	section.Term,
	section.CourseID,
	pq.Array(section.Days),
	pq.Array(section.Times),
	pq.Array(section.Locations),
	section.Units,
	pq.Array(section.Instructors),
)

Scraping multiple terms

I originally wrote the scraper last summer, to scrape Fall 2019 classes at UCLA. As winter rolled around, I realized I needed to update the scraper and database to accommodate multiple terms.

My initial approach to multiple terms was to create a new database for each term: my original database was renamed to fall2019 and the new database was named winter2020. However, this didn’t seem like a sustainable option, especially if I wanted to compare data across terms.

Around this time, my friend Richard also pointed out an interesting tidbit to me: even though they weren’t listed on the main Schedule of Classes page, you can create course search page URLs for any term back to 1999 by modifying the t query parameter.4 The webpages were all in the same format as the current page, so I could scrape 20 years of course enrollment data. This was exciting.
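As a sketch, building one of those URLs for an arbitrary term and subject area looks something like this (the function name is just illustrative; the format follows the example URL in footnote 4):

import "net/url"

// Build a Schedule of Classes results URL for a term code like "02S" (Spring 2002)
// and a subject area like "COM SCI".
func resultsURL(term, subject string) string {
	params := url.Values{}
	params.Set("t", term)
	params.Set("sBy", "subject")
	params.Set("subj", subject)
	return "https://sa.ucla.edu/ro/Public/SOC/Results?" + params.Encode()
}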

Subject area changes

I set out to think about how multiple terms would affect my current data, and how the schema could grow to handle more courses/sections. I figured the easiest place to start would be with subject areas, which should be constant throughout terms.

Remember that very first footnote where I hinted that this wasn’t the case? It turns out not all of the subject areas were the same – some subject areas offered in fall, like African Studies, were not offered in winter. Furthermore, I had no idea how to get subject areas for previous terms, which proved to be a problem because subject area names can change over time: a good example is the Electrical Engineering department changing its name to Electrical and Computer Engineering.

I was pretty stumped on this, so I sent an email over to the Registrar asking if they had a list of subject areas. I got a very nice response from them and they sent me an Excel spreadsheet of all the subject areas UCLA has ever offered.

Back in business!

After converting the spreadsheet into a CSV and writing a quick Python script to parse the CSV and insert it into the database5, I was ready to scrape courses.

Course changes

I figured it made sense to have courses be independent of terms, as course titles, numbers, and curriculum don’t change that often. Each section row would point to the correct course via the course_id column, similar to how multiple sections per course were already handled, except that sections could now span terms.

However, I quickly ran into an issue: models were term dependent. Let’s look at the example from earlier, for CS 111.

{
  "Term": "19F",
  "SubjectAreaCode": "COM SCI",
  "CatalogNumber": "0111    ",
  "IsRoot": true,
  "SessionGroup": "%",
  "ClassNumber": "%",
  "SequenceNumber": null,
  "Path": "COMSCI0111",
  "MultiListedClassFlag": "n",
  "Token": "MDExMSAgICBDT01TQ0kwMTEx"
}

I’d need to generate this blob dynamically to fetch sections across terms and relate them all to the same course. For the most part, this didn’t seem too bad. Term would obviously change, and I knew what the possible values would be. SubjectAreaCode, CatalogNumber, and Path were just strings that needed to be formatted from a subject area/course number. IsRoot seemed to always be true across the requests I tested; similarly, SequenceNumber was always null and SessionGroup was always %. MultiListedClassFlag just limited the result to being a multi-listed class, but multi-listed classes were still listed when it was "n", so I could always keep it at "n".

The tricky properties would be Token and ClassNumber.

For Token, the challenge was that every value was unique among the requests I tested and I had no idea how it was created. It took me a lot of time playing with strings and formatting before I discovered that the Token was just a Base64 encoding of the course subject area and number. (MDExMSAgICBDT01TQ0kwMTEx decodes to 0111 COMSCI0111.)
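That means the Token can be rebuilt from fields the scraper already has. Here’s a sketch (buildToken is just an illustrative name; the padding comes along for free because CatalogNumber is already padded):

import "encoding/base64"

// Rebuild the Token by base64-encoding the padded catalog number followed by the
// path, matching the decoded example above ("0111    " + "COMSCI0111").
func buildToken(catalogNumber, path string) string {
	return base64.StdEncoding.EncodeToString([]byte(catalogNumber + path))
}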

For ClassNumber, there was a more subtle issue. There are some courses at UCLA that have variable topics. In the Computer Science department, these courses are usually numbered 188; the number varies by department. In the Management department, I found something interesting:

Two sections of “Applying Science of Happiness to Life Design”.

There were multiple sections of the same course, listed as if they were different courses. The two sections of “Applying Science of Happiness to Life Design” had ever so slightly different models:

{
  "Term": "20W",
  "SubjectAreaCode": "MGMT   ",
  "CatalogNumber": "0298D   ",
  "IsRoot": true,
  "SessionGroup": "%",
  "ClassNumber": " 010  ",
  "SequenceNumber": null,
  "Path": "MGMT0298D010",
  "MultiListedClassFlag": "n",
  "Token": "MDI5OEQgICBNR01UMDI5OEQwMTA="
}
{
  "Term": "20W",
  "SubjectAreaCode": "MGMT   ",
  "CatalogNumber": "0298D   ",
  "IsRoot": true,
  "SessionGroup": "%",
  "ClassNumber": " 009  ",
  "SequenceNumber": null,
  "Path": "MGMT0298D009",
  "MultiListedClassFlag": "n",
  "Token": "MDI5OEQgICBNR01UMDI5OEQwMDk="
}

I wondered what would happen if I set ClassNumber to % instead of a number. Initially, I got a 404 error, but with some updating of Path and Token so that they didn’t refer to a specific section number, I got a response. It was the markup for all sections of MGMT 298D. Unfortunately, however, there was no title or distinguishing features for each row, just the section’s number. When scraping courses, I’d need to associate each course title listed with a section number, then, when scraping sections, recall these numbers to parse out which section belongs to which course.

So I made a new table in the database, course_section_indices, specifically for this. It keeps track of which courses are offered under which section number for a term. Note that only courses with variable titles, like MGMT 298D, end up in this table.

CREATE TABLE course_section_indices (
  id SERIAL PRIMARY KEY,
  course_id INTEGER REFERENCES courses(id) NOT NULL,
  term TEXT NOT NULL,
  indices TEXT[] NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  UNIQUE (course_id, term)
);
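As a rough sketch of how the lookup might work on the section-scraping side (the names here are illustrative, not the actual code):

// Given the section number parsed out of a row (e.g. "010"), find which course's
// indices claim it for the current term. indicesByCourse maps a course id to the
// indices stored in course_section_indices.
func matchCourseBySectionIndex(indicesByCourse map[int][]string, sectionNumber string) (courseID int, ok bool) {
	for id, indices := range indicesByCourse {
		for _, index := range indices {
			if index == sectionNumber {
				return id, true
			}
		}
	}
	return 0, false
}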

Section changes

After all of these course changes, updating the section scraping was pretty easy. Since sections are concrete offerings of a course, it made sense to add a term column. The section_id is the id given to the section by the Registrar. That id is unique only for a term, so the uniqueness constraint on sections had to be updated.

ALTER TABLE sections DROP CONSTRAINT sections_section_id_key;
ALTER TABLE sections ADD UNIQUE (section_id, term);

Scraping 20 years of data

Even though my scraper was pretty fast compared to the other solutions I knew of, it took me just over an hour to scrape all of the courses from 1999 to present. (I did all the scraping locally, and not on AWS, because I only needed to scrape this data once and not every hour.) I ended up scraping all the sections in batches over the period of a few days.

What are you doing with this data?

As I mentioned in part one, I’m currently doing a digital humanities project exploring enrollment trends over the past 20 years. I’ll be publishing my dataset (this data) as part of that project.

I’ve also heard from my friends who run services such as Bruinwalk or Radius that they’d be interested in using this data to show more accurate section offerings for a given quarter. I’m exploring creating an API service so that they and other UCLA student developers can create awesome new things.

My original plan with this data was to create a course alert notification service, so that users are notified if a spot in a course or waitlist opens up. Stay tuned for updates on that.

Future improvements

There is still a lot of interesting work that could be done to improve the scraper. Here are some ideas for what I’ll work on next.

More data

Even though I’m currently scraping a lot of data from the Registrar, there’s still more that could be obtained.

For instance, there’s an endpoint, /ClassDetailTooltip, that’s triggered every time a detail tooltip is opened on a course. This tooltip provides info about prerequisites, enrollment restrictions, grading type, the class webpage, and final exam time.

The class tooltip UI.

I also would love to have more professor information than just the provided “LastName, FirstInitial”, but I’m not sure where this data would be available.

I’m currently working on some improvements to scrape data about departments, divisions, schools, buildings, and classrooms. Stay tuned – there may be another blog post about that.

Summer courses

Another source of data that the scraper doesn’t collect is data on summer courses at UCLA. This is mainly because summer courses vary in length, which makes it hard to classify what a summer term is. Is it all the same term? Are A and C sessions different terms? What about 6 vs 8 vs 10 week courses?

Multi-listed courses

The same section at UCLA can be listed under two different departments. Right now, the scraper can discern this only by the fact that multi-listed courses share the same name and have a course number prefixed with an “M”. It’d be nice to have a better way to denote which multi-listed courses are related in the database.
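That heuristic boils down to something like this (illustrative only):

// Multi-listed course numbers are prefixed with "M" in the Schedule of Classes.
func isMultiListed(courseNumber string) bool {
	return strings.HasPrefix(strings.TrimSpace(courseNumber), "M")
}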

Better storage of values

There are also probably better ways the current data could be stored in the database. Days are stored as an array of strings, although since there are only seven days (and a couple of other values, such as “Varies”, “Tentative”, “Not Scheduled”, etc.), I’ve wondered if this could become an enum. The tricky part about converting it to an enum would be handling courses that have multiple days/times.

I also wonder if course times could be better stored – as an array of ranges of times, perhaps, instead of as a string array.

Concurrency through lambda functions instead of goroutines

I spent a good amount of time in part one of this series talking about how Go appealed to me as a language because of its accessible concurrency features, which allowed me to scrape endpoints in parallel. While this is true for local runs, it turns out that lambda functions generally only have one CPU core. As Luc Hendriks put it,

Instead of having 1 100-core computer, you have 100 1-core computers.

I think a much better way of lambda-izing my scraper would be to have a lambda function that is given a specific course and just scrapes the sections for that particular course, rather than retrieving and scraping all courses. Of course, this design presents its own problems, such as requiring some type of connection pooling6 to ensure the database isn’t overwhelmed. However, it would also mean I could get rid of the maxConnections code I currently use for the same purpose.
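Here’s a sketch of what that per-course handler might look like (a design idea, not code that exists in the scraper today):

import (
	"context"

	"github.com/aws/aws-lambda-go/lambda"
)

// Hypothetical per-course lambda: the invocation event is a single course, and the
// handler scrapes just that course's sections.
func HandleCourse(ctx context.Context, course Course) error {
	db, err := ConnectToDatabase()
	if err != nil {
		return err
	}
	defer db.Close()

	return FetchAndSaveSections(course, db)
}

func main() {
	lambda.Start(HandleCourse)
}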

Conclusion

Overall, this has been one of my favorite technical projects I’ve worked on. I’m super excited to continue developing it and I hope you found this writeup of it interesting and that it inspires you to go out and scrape a dataset of your own.


  1. There’s also a ParseWaitlistData function that works similarly. I omitted it from the code sample for brevity. ↩︎

  2. Courses are updated every 24 hours. Subject areas are scraped on a pretty ad-hoc basis. ↩︎ ↩︎

  3. In newer versions of SAM, you actually don’t need to package – deploy implicitly does it for you. ↩︎

  4. E.g., computer science courses from Spring 2002: https://sa.ucla.edu/ro/Public/SOC/Results?t=02S&sBy=subject&subj=COM+SCI ↩︎

  5. I realize there’s probably a better way to get a CSV into a Postgres table, but it was the first way that occurred to me. ↩︎

  6. I’ve been told PgBouncer is the connection pooler to use for Postgres. ↩︎