Spanish in Boston: Sociolinguistic Dataset & Analysis

Overview

This project involved designing and analyzing sociolinguistic datasets to investigate variation in Spanish speech. It demonstrates end-to-end experience in data creation, annotation design, quality assurance, and statistical analysis.

My Role

Designed annotation guidelines for novel linguistic variables
Managed dataset collection, curation, and QA workflows
Supervised and trained student annotators
Led the full research lifecycle, from data design to statistical modeling

Data & Methods

Built and analyzed datasets of 70k+ tokens
Conducted coding, extraction, and statistical analysis in R
Applied probabilistic modeling to investigate linguistic variation
Developed workflows for annotation consistency and data quality
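
One common way to check annotation consistency is to have two annotators code the same tokens and compute a chance-corrected agreement score. The sketch below is illustrative only: the source does not say which agreement measure the project used, so Cohen's kappa is an assumption here, as are the function name and the variant labels "a" and "b". It is written in Python for brevity, whereas the project's own analysis was done in R.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who coded the same tokens.

    Hypothetical helper for illustration; not the project's actual QA code.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same non-empty token list.")

    n = len(labels_a)

    # Observed agreement: share of tokens given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    # If both annotators used one identical label throughout, kappa is undefined;
    # treat that case as perfect agreement.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1 - expected)


# Made-up labels for four tokens (not data from the project).
annotator_1 = ["a", "a", "b", "b"]
annotator_2 = ["a", "a", "b", "a"]
print(cohen_kappa(annotator_1, annotator_2))
```

A score near 1 indicates strong agreement and a score near 0 indicates agreement no better than chance. In a workflow like the one described, low scores on a given variable would typically prompt a revision of the guidelines or further annotator training.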

Outcome

Produced structured datasets for analyzing Spanish variation
Generated findings contributing to dissertation research
Demonstrated scalable approaches to linguistic data annotation and QA

Authors

Lee-Ann Vidal Covas (she/her)

Language Scientist (PhD, Boston University) with expertise in sociolinguistic research, dataset curation, and applied data science.
