Spanish in Boston: Sociolinguistic Dataset & Analysis

Overview
This project involved designing and analyzing sociolinguistic datasets to investigate variation in Spanish as spoken in Boston. It demonstrates end-to-end experience in data creation, annotation design, quality assurance, and statistical analysis.

My Role
- Designed annotation guidelines for novel linguistic variables
- Managed dataset collection, curation, and QA workflows
- Supervised and trained student annotators
- Led the full research lifecycle, from data design to statistical modeling

Data & Methods
- Built and analyzed datasets of 70k+ tokens
- Conducted coding, extraction, and statistical analysis in R
- Applied probabilistic modeling to investigate linguistic variation
- Developed workflows for annotation consistency and data quality
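
One common way to check annotation consistency is to have two annotators code the same tokens and compute a chance-corrected agreement score such as Cohen's kappa. The project does not state which measure its workflows used, and its analysis was done in R, so the following is only a minimal Python sketch of the general idea. The function name `cohens_kappa` and the labels "A" and "B" are hypothetical placeholders, not the project's actual coding scheme.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who coded the same tokens.

    labels_a and labels_b are equal-length sequences of category labels,
    one label per token, in the same token order.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty token set")

    n = len(labels_a)

    # Observed agreement: share of tokens given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement under chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    # Both annotators used one identical category throughout:
    # agreement is perfect, but kappa's denominator would be zero.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four tokens coded by two annotators.
annotator_1 = ["A", "A", "B", "B"]
annotator_2 = ["A", "B", "B", "B"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In a QA workflow of this kind, a low score on a shared sample would typically prompt a review of the annotation guidelines or further annotator training before coding continues.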

Outcome
- Produced structured datasets for analyzing Spanish variation
- Generated findings contributing to dissertation research
- Demonstrated scalable approaches to linguistic data annotation and QA

Authors
Lee-Ann Vidal Covas
(she/her)
Language Scientist (PhD, Boston University) with expertise in sociolinguistic research, dataset curation, and applied data science.