Julia Silge

Julia Silge https://juliasilge.com/ Recent content on Julia Silge Tue, 20 Jan 2026 00:00:00 +0000 Use Positron Assistant with GitHub Copilot https://juliasilge.com/blog/copilot-african-languages/ Tue, 20 Jan 2026 00:00:00 +0000 https://juliasilge.com/blog/copilot-african-languages/ The newest monthly release of Positron has a revamp of the integration between GitHub Copilot and Positron Assistant. Explore #TidyTuesday literary prizes with Positron’s Data Explorer https://juliasilge.com/blog/literary-prizes/ Mon, 13 Oct 2025 00:00:00 +0000 https://juliasilge.com/blog/literary-prizes/ The newest monthly release of Positron delivers some fresh new features for the Data Explorer. Release an R package with Positron https://juliasilge.com/blog/r-pkg-release/ Wed, 19 Mar 2025 00:00:00 +0000 https://juliasilge.com/blog/r-pkg-release/ See how you can use Positron, a new, next-generation data science IDE, for R package development tasks and releasing a new version of an R package. Positron in action with #TidyTuesday orca encounters https://juliasilge.com/blog/orcas-positron/ Mon, 14 Oct 2024 00:00:00 +0000 https://juliasilge.com/blog/orcas-positron/ Get to know Positron, a new, next-generation data science IDE, using this week’s Tidy Tuesday data on encounters with orcas. Educational attainment in #TidyTuesday UK towns https://juliasilge.com/blog/educational-attainment/ Wed, 24 Apr 2024 00:00:00 +0000 https://juliasilge.com/blog/educational-attainment/ Let’s walk through the ML lifecycle from EDA to model development to deployment, using tidymodels, vetiver, and Posit Team. Changes in #TidyTuesday US polling places https://juliasilge.com/blog/polling-places/ Wed, 17 Jan 2024 00:00:00 +0000 https://juliasilge.com/blog/polling-places/ Let’s use summarization and visualization to explore how the numbers of polling places have changed in the United States. 
Empirical Bayes for #TidyTuesday Doctor Who episodes https://juliasilge.com/blog/doctor-who-bayes/ Wed, 29 Nov 2023 00:00:00 +0000 https://juliasilge.com/blog/doctor-who-bayes/ Which writers of Doctor Who episodes are rated the most highly? Let’s use empirical Bayes to find out. Logistic regression modeling for #TidyTuesday US House Elections https://juliasilge.com/blog/house-elections/ Tue, 07 Nov 2023 00:00:00 +0000 https://juliasilge.com/blog/house-elections/ Today is Election Day in the United States, so let’s use logistic regression modeling to explore vote share in US House elections. Topic modeling for #TidyTuesday Taylor Swift lyrics https://juliasilge.com/blog/taylor-swift/ Mon, 23 Oct 2023 00:00:00 +0000 https://juliasilge.com/blog/taylor-swift/ Learn how to fit and interpret an unsupervised text model for all of Taylor Swift’s ERAS. Where are #TidyTuesday haunted cemeteries compared to haunted schools? https://juliasilge.com/blog/haunted-places/ Wed, 11 Oct 2023 00:00:00 +0000 https://juliasilge.com/blog/haunted-places/ Use tidy log odds to compare which US states are more likely to have haunted cemeteries or haunted schools. How often does Roy Kent say "F*CK"? https://juliasilge.com/blog/roy-kent/ Wed, 27 Sep 2023 00:00:00 +0000 https://juliasilge.com/blog/roy-kent/ He’s here, he’s there, he’s every f*cking where, and we’re finding bootstrap confidence intervals. Evaluate multiple modeling approaches for #TidyTuesday spam email https://juliasilge.com/blog/spam-email/ Fri, 01 Sep 2023 00:00:00 +0000 https://juliasilge.com/blog/spam-email/ Use workflowsets to evaluate multiple possible models to predict whether email is spam. Classification metrics for #TidyTuesday GPT detectors https://juliasilge.com/blog/gpt-detectors/ Wed, 19 Jul 2023 00:00:00 +0000 https://juliasilge.com/blog/gpt-detectors/ Learn about different kinds of metrics for evaluating classification models, and how to compute, compare, and visualize them. What tokens are used more vs. 
less in #TidyTuesday place names? https://juliasilge.com/blog/place-names/ Wed, 05 Jul 2023 00:00:00 +0000 https://juliasilge.com/blog/place-names/ Let’s use byte pair encoding tokenization along with Poisson regression to understand which tokens are used more often (or less often) in US place names. Predict the magnitude of #TidyTuesday tornadoes with effect encoding and xgboost https://juliasilge.com/blog/tornadoes/ Sat, 20 May 2023 00:00:00 +0000 https://juliasilge.com/blog/tornadoes/ How well can we predict the magnitude of tornadoes in the US? Let’s use xgboost along with effect encoding to fit our model. Tune an xgboost model with early stopping and #TidyTuesday childcare costs https://juliasilge.com/blog/childcare-costs/ Thu, 11 May 2023 00:00:00 +0000 https://juliasilge.com/blog/childcare-costs/ Can we predict childcare costs in the US using an xgboost model? In this blog post, learn how to use early stopping for hyperparameter tuning. Deploy a model on AWS SageMaker with vetiver https://juliasilge.com/blog/vetiver-sagemaker/ Thu, 04 May 2023 00:00:00 +0000 https://juliasilge.com/blog/vetiver-sagemaker/ Learn how to train and deploy a model with R and vetiver on AWS SageMaker infrastructure. Use OpenAI text embeddings with #TidyTuesday horror movie descriptions https://juliasilge.com/blog/horror-embeddings/ Wed, 05 Apr 2023 00:00:00 +0000 https://juliasilge.com/blog/horror-embeddings/ High quality text embeddings are becoming more available from companies like OpenAI. Learn how to obtain them and then use them for text analysis. Resampling to understand gender in #TidyTuesday art history data https://juliasilge.com/blog/art-history/ Wed, 08 Feb 2023 00:00:00 +0000 https://juliasilge.com/blog/art-history/ Artists who are women are underrepresented in art history textbooks, and we can use resampling to robustly understand more about this imbalance. 
To downsample imbalanced data or not, with #TidyTuesday bird feeders https://juliasilge.com/blog/project-feederwatch/ Wed, 18 Jan 2023 00:00:00 +0000 https://juliasilge.com/blog/project-feederwatch/ Will squirrels come eat from your bird feeder? Let’s fit a model both with and without downsampling to find out. High cardinality predictors for #TidyTuesday museums in the UK https://juliasilge.com/blog/uk-museums/ Fri, 25 Nov 2022 00:00:00 +0000 https://juliasilge.com/blog/uk-museums/ Learn how to handle predictors with high cardinality using tidymodels for accreditation data on UK museums. Delete all your tweets using rtweet https://juliasilge.com/blog/delete-tweets/ Thu, 10 Nov 2022 00:00:00 +0000 https://juliasilge.com/blog/delete-tweets/ Worried about how a certain social media platform is going and want to start removing yourself? Learn how to delete all your tweets. Find high FREX and high lift words for #TidyTuesday Stranger Things dialogue https://juliasilge.com/blog/stranger-things/ Thu, 20 Oct 2022 00:00:00 +0000 https://juliasilge.com/blog/stranger-things/ New functionality in tidytext supports identifying high FREX and high lift words from topic modeling results. Predict the status of #TidyTuesday Bigfoot sightings https://juliasilge.com/blog/bigfoot/ Fri, 23 Sep 2022 00:00:00 +0000 https://juliasilge.com/blog/bigfoot/ Learn how to use vetiver to set up different types of prediction endpoints for your deployed model. Use Docker to deploy a model for #TidyTuesday LEGO sets https://juliasilge.com/blog/lego-sets/ Thu, 08 Sep 2022 00:00:00 +0000 https://juliasilge.com/blog/lego-sets/ After you train a model, you can use vetiver to prepare a Dockerfile and deploy your model in a flexible way. 
Sliding windows for #TidyTuesday rents in San Francisco https://juliasilge.com/blog/sf-rent/ Thu, 04 Aug 2022 00:00:00 +0000 https://juliasilge.com/blog/sf-rent/ The slider package provides support for flexible sliding window aggregation, and we can use these kinds of sliding windows to analyze rents over time. Three ways to look at #TidyTuesday UK pay gap data https://juliasilge.com/blog/pay-gap-uk/ Thu, 30 Jun 2022 00:00:00 +0000 https://juliasilge.com/blog/pay-gap-uk/ Use summarization, a single linear model, and bootstrapping to understand what economic activities involve a larger pay gap for women. Use resampling to understand #TidyTuesday drought in TX https://juliasilge.com/blog/drought-in-tx/ Wed, 15 Jun 2022 00:00:00 +0000 https://juliasilge.com/blog/drought-in-tx/ The spatialsample package is gaining many new methods this summer, and we can use spatially aware resampling to understand how drought is related to other quantities across Texas. Predict #TidyTuesday NYT bestsellers https://juliasilge.com/blog/nyt-bestsellers/ Wed, 11 May 2022 00:00:00 +0000 https://juliasilge.com/blog/nyt-bestsellers/ Will a book be on the NYT bestseller list a long time, or a short time? We walk through how to use wordpiece tokenization for the author names, and how to deploy your model as a REST API. Handling model coefficients for #TidyTuesday collegiate sports https://juliasilge.com/blog/college-sports/ Sat, 09 Apr 2022 00:00:00 +0000 https://juliasilge.com/blog/college-sports/ Understand how much money colleges spend on sports using linear modeling and bootstrap intervals. Poisson regression for #TidyTuesday counts of R package vignettes https://juliasilge.com/blog/rstats-vignettes/ Wed, 16 Mar 2022 00:00:00 +0000 https://juliasilge.com/blog/rstats-vignettes/ The tidymodels framework provides extension packages for specialized tasks such as Poisson regression. 
Learn how to fit a zero-inflated model for understanding how R package releases are related to number of vignettes. Inference for #TidyTuesday aircraft and rank of Tuskegee airmen https://juliasilge.com/blog/tuskegee-airmen/ Fri, 11 Feb 2022 00:00:00 +0000 https://juliasilge.com/blog/tuskegee-airmen/ The infer package is part of tidymodels and provides an expressive statistical grammar. Understand how to use infer, and celebrate Black History Month by learning more about the Tuskegee airmen. Predict ratings for #TidyTuesday board games https://juliasilge.com/blog/board-games/ Fri, 28 Jan 2022 00:00:00 +0000 https://juliasilge.com/blog/board-games/ Use custom feature engineering for board game categories, tune an xgboost model with racing methods, and use explainability methods for deeper understanding. Text predictors for #TidyTuesday chocolate ratings https://juliasilge.com/blog/chocolate-ratings/ Fri, 21 Jan 2022 00:00:00 +0000 https://juliasilge.com/blog/chocolate-ratings/ Get started with feature engineering for text data, transforming text to be used in machine learning algorithms. Topic modeling for #TidyTuesday Spice Girls lyrics https://juliasilge.com/blog/spice-girls/ Wed, 15 Dec 2021 00:00:00 +0000 https://juliasilge.com/blog/spice-girls/ Learn how to train, explore, and understand an unsupervised topic model for text data. Predicting viewership for #TidyTuesday Doctor Who episodes https://juliasilge.com/blog/doctor-who/ Sat, 27 Nov 2021 00:00:00 +0000 https://juliasilge.com/blog/doctor-who/ Using a tidymodels workflow can make many modeling tasks more convenient, but sometimes you want more flexibility and control of how to handle your modeling objects. Learn how to handle resampled workflow results and extract the quantities you are interested in. 
Spatial resampling for #TidyTuesday and the #30DayMapChallenge https://juliasilge.com/blog/map-challenge/ Fri, 05 Nov 2021 00:00:00 +0000 https://juliasilge.com/blog/map-challenge/ Use spatial resampling to more accurately estimate model performance for geographic data. Predict #TidyTuesday giant pumpkin weights with workflowsets https://juliasilge.com/blog/giant-pumpkins/ Fri, 22 Oct 2021 00:00:00 +0000 https://juliasilge.com/blog/giant-pumpkins/ Get started with tidymodels workflowsets to handle and evaluate multiple preprocessing and modeling approaches simultaneously, using pumpkin competitions. Multiclass predictive modeling for #TidyTuesday NBER papers https://juliasilge.com/blog/nber-papers/ Wed, 29 Sep 2021 00:00:00 +0000 https://juliasilge.com/blog/nber-papers/ Tune and evaluate a multiclass model with lasso regularization for economics working papers. Dimensionality reduction for #TidyTuesday Billboard Top 100 songs https://juliasilge.com/blog/billboard-100/ Wed, 15 Sep 2021 00:00:00 +0000 https://juliasilge.com/blog/billboard-100/ Songs on the Billboard Top 100 have many audio features. We can use data preprocessing recipes to implement dimensionality reduction and understand how these features are related. Fit and predict with tidymodels for #TidyTuesday bird baths in Australia https://juliasilge.com/blog/bird-baths/ Wed, 01 Sep 2021 00:00:00 +0000 https://juliasilge.com/blog/bird-baths/ In this screencast, focus on some tidymodels basics such as how to put together feature engineering and a model algorithm, and how to fit and predict. Modeling human/computer interactions on Star Trek from #TidyTuesday with workflowsets https://juliasilge.com/blog/star-trek/ Tue, 24 Aug 2021 00:00:00 +0000 https://juliasilge.com/blog/star-trek/ Learn how to evaluate multiple feature engineering and modeling approaches with workflowsets, predicting whether a person or the computer spoke a line on Star Trek. 
Predict housing prices in Austin TX with tidymodels and xgboost https://juliasilge.com/blog/austin-housing/ Sun, 15 Aug 2021 00:00:00 +0000 https://juliasilge.com/blog/austin-housing/ More xgboost with tidymodels! Learn about feature engineering to incorporate text information as indicator variables for boosted trees. Supervised Machine Learning for Text Analysis in R is now complete https://juliasilge.com/blog/smltar-complete/ Fri, 13 Aug 2021 00:00:00 +0000 https://juliasilge.com/blog/smltar-complete/ Our new book in the Chapman & Hall/CRC Data Science Series is now complete and available for preorder! Tune xgboost models with early stopping to predict shelter animal status https://juliasilge.com/blog/shelter-animals/ Sat, 07 Aug 2021 00:00:00 +0000 https://juliasilge.com/blog/shelter-animals/ Early stopping can keep an xgboost model from overfitting. Use racing methods to tune xgboost models and predict home runs https://juliasilge.com/blog/baseball-racing/ Thu, 29 Jul 2021 00:00:00 +0000 https://juliasilge.com/blog/baseball-racing/ Models like xgboost have many tuning hyperparameters, but racing methods can help identify parameter combinations that are not performing well. Predict which #TidyTuesday Scooby Doo monsters are REAL with a tuned decision tree model https://juliasilge.com/blog/scooby-doo/ Tue, 13 Jul 2021 00:00:00 +0000 https://juliasilge.com/blog/scooby-doo/ Which Scooby Doo monsters are REAL?! Walk through how to tune and then choose a decision tree model, as well as how to visualize and evaluate the results. Create a custom metric with tidymodels and NYC Airbnb prices https://juliasilge.com/blog/nyc-airbnb/ Wed, 30 Jun 2021 00:00:00 +0000 https://juliasilge.com/blog/nyc-airbnb/ Predict prices for Airbnb listings in NYC with a data set from a recent episode of SLICED, with a focus on two specific aspects of this model analysis: creating a custom metric to evaluate the model and combining both tabular and unstructured text data in one model. 
Class imbalance and classification metrics with aircraft wildlife strikes https://juliasilge.com/blog/sliced-aircraft/ Mon, 21 Jun 2021 00:00:00 +0000 https://juliasilge.com/blog/sliced-aircraft/ Handling class imbalance in modeling affects classification metrics in different ways. Learn how to use tidymodels to subsample for class imbalance, and how to estimate model performance using resampling. Partial dependence plots with tidymodels and DALEX for #TidyTuesday Mario Kart world records https://juliasilge.com/blog/mario-kart/ Fri, 28 May 2021 00:00:00 +0000 https://juliasilge.com/blog/mario-kart/ Tune a decision tree model to predict whether a Mario Kart world record used a shortcut, and explore partial dependence profiles for the world record times. Predict availability in #TidyTuesday water sources with random forest models https://juliasilge.com/blog/water-sources/ Thu, 06 May 2021 00:00:00 +0000 https://juliasilge.com/blog/water-sources/ Walk through a tidymodels analysis from beginning to end to predict whether water is available at a water source in Sierra Leone. Estimate change in #TidyTuesday CEO departures with bootstrap resampling https://juliasilge.com/blog/ceo-departures/ Wed, 28 Apr 2021 00:00:00 +0000 https://juliasilge.com/blog/ceo-departures/ Are more CEO departures involuntary now than in the past? We can use tidymodels' bootstrap resampling and generalized linear models to understand change over time. Which #TidyTuesday Netflix titles are movies and which are TV shows? https://juliasilge.com/blog/netflix-titles/ Fri, 23 Apr 2021 00:00:00 +0000 https://juliasilge.com/blog/netflix-titles/ Use tidymodels to build features for modeling from Netflix description text, then fit and evaluate a support vector machine model. Which #TidyTuesday post offices are in Hawaii? 
https://juliasilge.com/blog/hawaii-post-offices/ Wed, 14 Apr 2021 00:00:00 +0000 https://juliasilge.com/blog/hawaii-post-offices/ Use tidymodels to predict post office location with subword features and a support vector machine model. Dimensionality reduction of #TidyTuesday United Nations voting patterns https://juliasilge.com/blog/un-voting/ Wed, 24 Mar 2021 00:00:00 +0000 https://juliasilge.com/blog/un-voting/ Explore country-level UN voting with a tidymodels approach to unsupervised machine learning. Bootstrap confidence intervals for #TidyTuesday Super Bowl commercials https://juliasilge.com/blog/superbowl-conf-int/ Thu, 04 Mar 2021 00:00:00 +0000 https://juliasilge.com/blog/superbowl-conf-int/ Estimate how commercial characteristics like humor and patriotic themes change with time using tidymodels functions for bootstrap confidence intervals. Getting started with k-means and #TidyTuesday employment status https://juliasilge.com/blog/kmeans-employment/ Wed, 24 Feb 2021 00:00:00 +0000 https://juliasilge.com/blog/kmeans-employment/ Use tidy data principles to understand which kinds of occupations are most similar in terms of demographic characteristics. Understand your models with #TidyTuesday inequality in student debt https://juliasilge.com/blog/student-debt/ Fri, 12 Feb 2021 00:00:00 +0000 https://juliasilge.com/blog/student-debt/ Explore results of models with convenient tidymodels functions. Learn tidytext with my new learnr course https://juliasilge.com/blog/learn-tidytext-learnr/ Tue, 02 Feb 2021 00:00:00 +0000 https://juliasilge.com/blog/learn-tidytext-learnr/ I am happy to announce that this free, open source, interactive course on text mining with tidy data principles is now published! 
Explore art media over time in the #TidyTuesday Tate collection dataset https://juliasilge.com/blog/tate-collection/ Fri, 15 Jan 2021 00:00:00 +0000 https://juliasilge.com/blog/tate-collection/ Check residuals and other model diagnostics for regression models trained on text features, all with tidymodels functions. Predicting injuries for Chicago traffic crashes https://juliasilge.com/blog/chicago-traffic-model/ Mon, 04 Jan 2021 00:00:00 +0000 https://juliasilge.com/blog/chicago-traffic-model/ Download up-to-date city data from Chicago’s open data portal and predict whether a traffic crash involved an injury with a bagged tree model. Upcoming changes to tidytext: threat of COLLAPSE https://juliasilge.com/blog/tidytext-collapse-change/ Wed, 16 Dec 2020 00:00:00 +0000 https://juliasilge.com/blog/tidytext-collapse-change/ The current development version of tidytext has changes that may affect your analyses. Tune random forests for #TidyTuesday IKEA prices https://juliasilge.com/blog/ikea-prices/ Thu, 03 Dec 2020 00:00:00 +0000 https://juliasilge.com/blog/ikea-prices/ Use tidymodels scaffolding functions for getting started quickly with commonly used models like random forests. Tune and interpret decision trees for #TidyTuesday wind turbines https://juliasilge.com/blog/wind-turbine/ Thu, 29 Oct 2020 00:00:00 +0000 https://juliasilge.com/blog/wind-turbine/ Use tidymodels to predict capacity for Canadian wind turbines with decision trees. Predicting class membership for the #TidyTuesday Datasaurus Dozen https://juliasilge.com/blog/datasaurus-multiclass/ Wed, 14 Oct 2020 00:00:00 +0000 https://juliasilge.com/blog/datasaurus-multiclass/ Which of the Datasaurus Dozen are easier or harder for a random forest model to identify? Learn how to use multiclass evaluation metrics to find out. 
Modeling #TidyTuesday NCAA women's basketball tournament seeds https://juliasilge.com/blog/ncaa-tuning/ Wed, 07 Oct 2020 00:00:00 +0000 https://juliasilge.com/blog/ncaa-tuning/ Tune a hyperparameter and then understand how to choose the best value afterward, using tidymodels for modeling the relationship between expected wins and tournament seed. Handle class imbalance in #TidyTuesday climbing expedition data with tidymodels https://juliasilge.com/blog/himalayan-climbing/ Wed, 23 Sep 2020 00:00:00 +0000 https://juliasilge.com/blog/himalayan-climbing/ Use tidymodels for feature engineering steps like imputing missing data and subsampling for class imbalance, and build predictive models to predict the probability of survival for Himalayan climbers. Introducing our new book, Tidy Modeling with R https://juliasilge.com/blog/tidymodels-book/ Thu, 17 Sep 2020 00:00:00 +0000 https://juliasilge.com/blog/tidymodels-book/ An initial version of the first eleven chapters is available today! Look for more chapters to be released in the near future. Train and analyze many models for #TidyTuesday crop yields https://juliasilge.com/blog/crop-yields/ Wed, 02 Sep 2020 00:00:00 +0000 https://juliasilge.com/blog/crop-yields/ Learn how to use tidyverse and tidymodels functions to fit and analyze many models at once. Build a #TidyTuesday predictive text model for The Last Airbender https://juliasilge.com/blog/last-airbender/ Tue, 11 Aug 2020 00:00:00 +0000 https://juliasilge.com/blog/last-airbender/ Use text features and tidymodels to predict the speaker of individual lines from the show, and learn how to compute model-agnostic variable importance for any kind of model. Get started with tidymodels and #TidyTuesday Palmer penguins https://juliasilge.com/blog/palmer-penguins/ Tue, 28 Jul 2020 00:00:00 +0000 https://juliasilge.com/blog/palmer-penguins/ Build two kinds of classification models and evaluate them using resampling. 
Supervised Machine Learning for Text Analysis in R https://juliasilge.com/blog/smltar-announce/ Fri, 24 Jul 2020 00:00:00 +0000 https://juliasilge.com/blog/smltar-announce/ Announcing our new book, to be published in the Chapman & Hall/CRC Data Science Series! Bagging with tidymodels and #TidyTuesday astronaut missions https://juliasilge.com/blog/astronaut-missions-bagging/ Wed, 15 Jul 2020 00:00:00 +0000 https://juliasilge.com/blog/astronaut-missions-bagging/ Learn how to use bootstrap aggregating to predict the duration of astronaut missions. The Bechdel test and the X-Mansion with tidymodels and #TidyTuesday https://juliasilge.com/blog/uncanny-xmen/ Tue, 30 Jun 2020 00:00:00 +0000 https://juliasilge.com/blog/uncanny-xmen/ Explore data from the Claremont Run Project on Uncanny X-Men with bootstrap resampling. Impute missing data for #TidyTuesday voyages of captive Africans with tidymodels https://juliasilge.com/blog/captive-africans-voyages/ Wed, 17 Jun 2020 00:00:00 +0000 https://juliasilge.com/blog/captive-africans-voyages/ Understand more about the forced transport of African people using the Slave Voyages database. PCA and UMAP with tidymodels and #TidyTuesday cocktail recipes https://juliasilge.com/blog/cocktail-recipes-umap/ Wed, 27 May 2020 00:00:00 +0000 https://juliasilge.com/blog/cocktail-recipes-umap/ Use tidymodels for unsupervised dimensionality reduction. tidylo is now on CRAN! 🎉 https://juliasilge.com/blog/tidylo-cran/ Tue, 26 May 2020 00:00:00 +0000 https://juliasilge.com/blog/tidylo-cran/ Measure how the frequency of some feature differs across some group or set, using the weighted log odds. Tune XGBoost with tidymodels and #TidyTuesday beach volleyball https://juliasilge.com/blog/xgboost-tune-volleyball/ Thu, 21 May 2020 00:00:00 +0000 https://juliasilge.com/blog/xgboost-tune-volleyball/ Learn how to tune hyperparameters for an XGBoost classification model to predict wins and losses. 
Learn tidymodels with my supervised machine learning course https://juliasilge.com/blog/tidymodels-ml-course/ Fri, 15 May 2020 00:00:00 +0000 https://juliasilge.com/blog/tidymodels-ml-course/ I am happy to announce that a new version of my free, online, interactive course has been published! Multinomial classification with tidymodels and #TidyTuesday volcano eruptions https://juliasilge.com/blog/multinomial-volcano-eruptions/ Wed, 13 May 2020 00:00:00 +0000 https://juliasilge.com/blog/multinomial-volcano-eruptions/ Lately I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to evaluate complex models. Today’s screencast demonstrates how to implement multiclass or multinomial classification using this week’s #TidyTuesday dataset on volcanoes. 🌋 Here is the code I used in the video, for those who prefer reading instead of or in addition to video. Explore the data Our modeling goal is to predict the type of volcano from this week’s #TidyTuesday dataset based on other volcano characteristics like latitude, longitude, tectonic setting, etc. Sentiment analysis with tidymodels and #TidyTuesday Animal Crossing reviews https://juliasilge.com/blog/animal-crossing/ Wed, 06 May 2020 00:00:00 +0000 https://juliasilge.com/blog/animal-crossing/ A lot has been happening in the tidymodels ecosystem lately! There are many possible projects we on the tidymodels team could focus on next; we are interested in gathering community feedback to inform our priorities. If you are interested in sharing your opinion on next steps in tidymodels development, please take this short survey. Lately I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to tune more complex models. 
Modeling #TidyTuesday GDPR violations with tidymodels https://juliasilge.com/blog/gdpr-violations/ Wed, 22 Apr 2020 00:00:00 +0000 https://juliasilge.com/blog/gdpr-violations/ This is an exciting week for us on the tidymodels team; we launched tidymodels.org, a new central location with resources and documentation for tidymodels packages. There is a TON to explore and learn there! 🚀 You can check out the official blog post for more details. Today, I’m publishing here on my blog another screencast demonstrating how to use tidymodels. This is a good video for folks getting started with tidymodels, using this week’s #TidyTuesday dataset on GDPR violations. PCA and the #TidyTuesday best hip hop songs ever https://juliasilge.com/blog/best-hip-hop/ Tue, 14 Apr 2020 00:00:00 +0000 https://juliasilge.com/blog/best-hip-hop/ Lately I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to tune more complex models. Today, I’m exploring a different part of the tidymodels framework; I’m showing how to implement principal component analysis via recipes with this week’s #TidyTuesday dataset on the best hip hop songs of all time as determined by a BBC poll of music critics. Here is the code I used in the video, for those who prefer reading instead of or in addition to video. Bootstrap resampling with #TidyTuesday beer production data https://juliasilge.com/blog/beer-production/ Thu, 02 Apr 2020 00:00:00 +0000 https://juliasilge.com/blog/beer-production/ I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to tune more complex models. Today, I’m using this week’s #TidyTuesday dataset on beer production to show how to use bootstrap resampling to estimate model parameters. Here is the code I used in the video, for those who prefer reading instead of or in addition to video. 
Tuning random forest hyperparameters with #TidyTuesday trees data https://juliasilge.com/blog/sf-trees-random-tuning/ Thu, 26 Mar 2020 00:00:00 +0000 https://juliasilge.com/blog/sf-trees-random-tuning/ I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to tune more complex models. Today, I’m using a #TidyTuesday dataset from earlier this year on trees around San Francisco to show how to tune the hyperparameters of a random forest model and then use the final best model. Here is the code I used in the video, for those who prefer reading instead of or in addition to video. LASSO regression using tidymodels and #TidyTuesday data for The Office https://juliasilge.com/blog/lasso-the-office/ Tue, 17 Mar 2020 00:00:00 +0000 https://juliasilge.com/blog/lasso-the-office/ I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to tune more complex models. Today, I’m using this week’s #TidyTuesday dataset on The Office to show how to build a lasso regression model and choose regularization parameters! Here is the code I used in the video, for those who prefer reading instead of or in addition to video. Preprocessing and resampling using #TidyTuesday college data https://juliasilge.com/blog/tuition-resampling/ Tue, 10 Mar 2020 00:00:00 +0000 https://juliasilge.com/blog/tuition-resampling/ I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first getting started to how to tune machine learning models. Today, I’m using this week’s #TidyTuesday dataset on college tuition and diversity at US colleges to show some data preprocessing steps and how to use resampling! Here is the code I used in the video, for those who prefer reading instead of or in addition to video. 
Hyperparameter tuning and #TidyTuesday food consumption https://juliasilge.com/blog/food-hyperparameter-tune/ Tue, 18 Feb 2020 00:00:00 +0000 https://juliasilge.com/blog/food-hyperparameter-tune/ Last week I published a screencast demonstrating how to use the tidymodels framework and specifically the recipes package. Today, I’m using this week’s #TidyTuesday dataset on food consumption around the world to show hyperparameter tuning! Here is the code I used in the video, for those who prefer reading instead of or in addition to video. Explore the data Our modeling goal here is to predict which countries are Asian countries and which countries are not, based on their patterns of food consumption in the eleven categories from the #TidyTuesday dataset. #TidyTuesday hotel bookings and recipes https://juliasilge.com/blog/hotels-recipes/ Tue, 11 Feb 2020 00:00:00 +0000 https://juliasilge.com/blog/hotels-recipes/ Last week I published my first screencast showing how to use the tidymodels framework for machine learning and modeling in R. Today, I’m using this week’s #TidyTuesday dataset on hotel bookings to show how to use one of the tidymodels packages recipes with some simple models! Here is the code I used in the video, for those who prefer reading instead of or in addition to video. #TidyTuesday and tidymodels https://juliasilge.com/blog/intro-tidymodels/ Wed, 05 Feb 2020 00:00:00 +0000 https://juliasilge.com/blog/intro-tidymodels/ This week I started my new job as a software engineer at RStudio, working with Max Kuhn and other folks on tidymodels. I am really excited about tidymodels because my own experience as a practicing data scientist has shown me some of the areas for growth that still exist in open source software when it comes to modeling and machine learning. 
Almost nothing has had the kind of dramatic impact on my productivity that the tidyverse and other RStudio investments have had; I am enthusiastic about contributing to that kind of user-focused transformation for modeling and machine learning. Modeling salary and gender in the tech industry https://juliasilge.com/blog/salary-gender/ Tue, 31 Dec 2019 00:00:00 +0000 https://juliasilge.com/blog/salary-gender/ One of the biggest projects I have worked on over the past several years is the Stack Overflow Developer Survey, and one of the most unique aspects of this survey is the extensive salary data that is collected. This salary data is used to power the Stack Overflow Salary Calculator, and has been used by various folks to explore how people who use spaces make more than those who use tabs, whether that’s just a proxy for open source contributions, and more. Opioid prescribing habits in Texas https://juliasilge.com/blog/texas-opioids/ Sat, 12 Oct 2019 00:00:00 +0000 https://juliasilge.com/blog/texas-opioids/ A paper I worked on was just published in a medical journal. This is quite an odd thing for me to be able to say, given my academic background and the career path I have had, but there you go! The first author of this paper is a long-time friend of mine working in anesthesiology and pain management, and he obtained data from the Texas Prescription Drug Monitoring Program (PDMP) about controlled substance prescriptions from April 2015 to 2018. (Re)Launching my supervised machine learning course https://juliasilge.com/blog/supervised-ml-course/ Mon, 23 Sep 2019 00:00:00 +0000 https://juliasilge.com/blog/supervised-ml-course/ Today I am happy to announce a new(-ish), free, online, interactive course that I have developed, Supervised Machine Learning: Case Studies in R! 💫 Supervised machine learning in R Predictive modeling, or supervised machine learning, is a powerful tool for using data to make predictions about the world around us. 
Once you understand the basic ideas of supervised machine learning, the next step is to practice your skills so you know how to apply these techniques wisely and appropriately. Practice using lubridate... THEATRICALLY https://juliasilge.com/blog/lubridate-london-stage/ Mon, 26 Aug 2019 00:00:00 +0000 https://juliasilge.com/blog/lubridate-london-stage/ I am so pleased to now be an RStudio-certified tidyverse trainer! 🎉 I have been teaching technical content for decades, whether in a university classroom, developing online courses, or leading workshops, but I still found this program valuable for my own professional development. I learned a lot that is going to make my teaching better, and I am happy to have been a participant. If you are looking for someone to lead trainings or workshops in your organization, you can check out this list of trainers to see who might be conveniently located to you! Introducing tidylo https://juliasilge.com/blog/introducing-tidylo/ Mon, 08 Jul 2019 00:00:00 +0000 https://juliasilge.com/blog/introducing-tidylo/ Today I am so pleased to introduce a new package for calculating weighted log odds ratios, tidylo. Often in data analysis, we want to measure how the usage or frequency of some feature, such as words, differs across some group or set, such as documents. One statistic often used to find these kinds of differences in text data is tf-idf. Another option is to use the log odds ratio, but the log odds ratio alone does not account for sampling variability. Reordering and facetting for ggplot2 https://juliasilge.com/blog/reorder-within/ Mon, 01 Jul 2019 00:00:00 +0000 https://juliasilge.com/blog/reorder-within/ I recently wrote about the release of tidytext 0.2.1, and one of the most useful new features in this release is a couple of helper functions for making plots with ggplot2. These helper functions address a class of challenges that often arises when dealing with text data, so we’ve included them in the tidytext package. 
Let’s work through an example To show how to use these new functions, let’s walk through a more general example that does not deal with results that come from unstructured, free text. Fixing your mistakes: sentiment analysis edition https://juliasilge.com/blog/sentiment-lexicons/ Fri, 14 Jun 2019 00:00:00 +0000 https://juliasilge.com/blog/sentiment-lexicons/ Today tidytext 0.2.1 is available on CRAN! This new release of tidytext has a collection of nice new features. Bug squashing! 🐛 Improvements to error messages and documentation. 📃 Switching from broom to generics for lighter dependencies. Addition of some helper plotting functions I look forward to blogging about soon. An additional change is significant and may be felt by you, the user, so I want to share a bit about it. Relaunching the qualtRics package https://juliasilge.com/blog/qualtrics-relaunch/ Tue, 30 Apr 2019 00:00:00 +0000 https://juliasilge.com/blog/qualtrics-relaunch/ Note: cross-posted with the rOpenSci blog. rOpenSci is one of the first organizations in the R community I ever interacted with, when I participated in the 2016 rOpenSci unconf. I have since reviewed several rOpenSci packages and been so happy to be connected to this community, but I have never submitted or maintained a package myself. All that changed when I heard the call for a new maintainer for the qualtRics package. Writing a letter to DataCamp https://juliasilge.com/blog/datacamp-misconduct/ Tue, 16 Apr 2019 00:00:00 +0000 https://juliasilge.com/blog/datacamp-misconduct/ Since 2017 I have been an instructor for DataCamp, the VC-backed online data science education platform. What this means is that I am not an employee, but I have developed content for the company as a contractor. I have two courses there, one on text mining and one on practical supervised machine learning. About two weeks ago, DataCamp published a blog post outlining an incident of sexual misconduct at the company. Read all about it! 
Navigating the R Package Universe https://juliasilge.com/blog/r-journal-navigating/ Sun, 24 Feb 2019 00:00:00 +0000 https://juliasilge.com/blog/r-journal-navigating/ In the most recent issue of the R Journal, I have a new paper out with coauthors John Nash and Spencer Graves. Check out the abstract: Today, the enormous number of contributed packages available to R users outstrips any given user’s ability to understand how these packages work, their relative merits, or how they are related to each other. We organized a plenary session at useR!2017 in Brussels for the R community to think through these issues and ways forward. Feeling the rstudio::conf ❤️ https://juliasilge.com/blog/rstudio-conf-2019/ Sun, 20 Jan 2019 00:00:00 +0000 https://juliasilge.com/blog/rstudio-conf-2019/ I am heading home from my third year of attending rstudio::conf! If you weren’t there, watch for the videos to be released so you can check out the talks; I know I will do the same so I can see the talks I was forced to miss by scheduling constraints. I love this conference, and once again this year, the organizers have succeeded in building an impactful, valuable, inclusive conference. Text classification with tidy data principles https://juliasilge.com/blog/tidy-text-classification/ Mon, 24 Dec 2018 00:00:00 +0000 https://juliasilge.com/blog/tidy-text-classification/ I am an enthusiastic proponent of using tidy data principles for dealing with text data. This kind of approach offers a fluent and flexible option not just for exploratory data analysis, but also for machine learning for text, including both unsupervised machine learning and supervised machine learning. I haven’t written much about supervised machine learning for text, i.e. predictive modeling, using tidy data principles, so let’s walk through an example workflow for a text classification task. 
Word associations from the Small World of Words https://juliasilge.com/blog/word-associations/ Sun, 16 Dec 2018 00:00:00 +0000 https://juliasilge.com/blog/word-associations/ Do you subscribe to the Data is Plural newsletter from Jeremy Singer-Vine? You probably should, because it is a treasure trove of interesting datasets arriving in your email inbox. In the November 28 edition, Jeremy linked to the Small World of Words project, and I was entranced. I love stuff like that, all about words and how people think of them. I have been mulling around a blog post ever since, and today I finally have my post done, so let’s see what’s up! TensorFlow, Jane Austen, and Text Generation https://juliasilge.com/blog/tensorflow-generation/ Thu, 04 Oct 2018 00:00:00 +0000 https://juliasilge.com/blog/tensorflow-generation/ I remember the first time I saw a deep learning text generation project that was truly compelling and delightful to me. It was in 2016 when Andy Herd generated new Friends scenes by training a recurrent neural network on all the show’s episodes. Herd’s work went pretty viral at the time and I thought: via GIPHY And also: via GIPHY At the time I dabbled a bit with Andrej Karpathy’s tutorials for character-level RNNs; his work and tutorials undergird a lot of the kind of STUNT TEXT GENERATION work we see in the world. Training, evaluating, and interpreting topic models https://juliasilge.com/blog/evaluating-stm/ Sat, 08 Sep 2018 00:00:00 +0000 https://juliasilge.com/blog/evaluating-stm/ At the beginning of this year, I wrote a blog post about how to get started with the stm and tidytext packages for topic modeling. I have been doing more topic modeling in various projects, so I wanted to share some workflows I have found useful for training many topic models at one time, evaluating topic models and understanding model diagnostics, and exploring and interpreting the content of topic models. 
Amazon Alexa and Accented English https://juliasilge.com/blog/amazon-alexa/ Thu, 19 Jul 2018 00:00:00 +0000 https://juliasilge.com/blog/amazon-alexa/ Earlier this spring, one of my data science friends here in SLC got in contact with me about some fun analysis. My friend Dylan Zwick is a founder at Pulse Labs, a voice-testing startup, and they were chatting with the Washington Post about a piece on how devices like Amazon Alexa deal with accented English. The piece is published today in the Washington Post and turned out really interesting! Let’s walk through the analysis I did for Dylan and Pulse Labs. Punctuation in literature https://juliasilge.com/blog/punctution-literature/ Sat, 30 Jun 2018 00:00:00 +0000 https://juliasilge.com/blog/punctution-literature/ This morning I was scrolling through Twitter and noticed Alberto Cairo share this lovely data visualization piece by Adam J. Calhoun about the varying prevalence of punctuation in literature. I thought, “I want to do that!” It also offers me the opportunity to chat about a few of the new options available for tokenizing in tidytext via updates to the tokenizers package. Adam’s original piece explores how punctuation is used in nine novels, including my favorite Pride and Prejudice. Public Data Release of Stack Overflow’s 2018 Developer Survey https://juliasilge.com/blog/stack-survey-2018/ Wed, 30 May 2018 00:00:00 +0000 https://juliasilge.com/blog/stack-survey-2018/ Note: Cross-posted with the Stack Overflow blog. Starting today, you can access the public data release for Stack Overflow’s 2018 Developer Survey. Over 100,000 developers from around the world shared their opinions about everything from their favorite technologies to job preferences, and this data is now available for you to analyze yourself. This year, we are partnering with Kaggle to publish and highlight this dataset. 
This means you can access the data both here on our site and on Kaggle Datasets, and that on Kaggle, you can explore the dataset using Kernels. Understanding PCA using Stack Overflow data https://juliasilge.com/blog/stack-overflow-pca/ Fri, 18 May 2018 00:00:00 +0000 https://juliasilge.com/blog/stack-overflow-pca/ This year, I have given some talks about understanding principal component analysis using what I spend day in and day out with, Stack Overflow data. You can see a recording of one of these talks from rstudio::conf 2018. When I have given these talks, I’ve focused a lot on understanding PCA. This blog post walks through how I implemented PCA and how I made the plots I used in my talk. Stack Overflow questions around the world https://juliasilge.com/blog/stack-questions-cities/ Wed, 11 Apr 2018 00:00:00 +0000 https://juliasilge.com/blog/stack-questions-cities/ I am so lucky to work with so many generous, knowledgeable, and amazing people at Stack Overflow, including Ian Allen and Kirti Thorat. Both Ian and Kirti are part of biweekly sessions we have at Stack Overflow where several software developers join me in practicing R, data science, and modeling skills. This morning, the two of them went to a high school outreach event in NYC for students who have been studying computer science, equipped with Stack Overflow ✨ SWAG ✨, some coding activities based on Stack Overflow internal tools and packages, and a Shiny app that I developed to share a bit about who we are and what we do. The game is afoot! Topic modeling of Sherlock Holmes stories https://juliasilge.com/blog/sherlock-holmes-stm/ Thu, 25 Jan 2018 00:00:00 +0000 https://juliasilge.com/blog/sherlock-holmes-stm/ In a recent release of tidytext, we added tidiers and support for building Structural Topic Models from the stm package. 
This is my current favorite implementation of topic modeling in R, so let’s walk through an example of how to get started with this kind of modeling, using The Adventures of Sherlock Holmes. via GIPHY You can watch along as I demonstrate how to start with the raw text of these short stories, prepare the data, and then implement topic modeling in this video tutorial! tidytext 0.1.6 https://juliasilge.com/blog/tidytext-0-1-6/ Wed, 10 Jan 2018 00:00:00 +0000 https://juliasilge.com/blog/tidytext-0-1-6/ I am pleased to announce that tidytext 0.1.6 is now on CRAN! Most of this release, as well as the 0.1.5 release which I did not blog about, was for maintenance, updates to align with API changes from tidytext’s dependencies, and bugs. I just spent a good chunk of effort getting tidytext to pass R CMD check on older versions of R despite the fact that some of the packages in tidytext’s Suggests require recent versions of R. One year as a data scientist at Stack Overflow https://juliasilge.com/blog/one-year-at-stack/ Wed, 27 Dec 2017 00:00:00 +0000 https://juliasilge.com/blog/one-year-at-stack/ I recently passed my one-year anniversary of working at Stack Overflow as a data scientist. I have some very exciting news! I am joining the data team at @StackOverflow. ✨📊✨📊✨ — Julia Silge (@juliasilge) December 13, 2016 Coming to Stack Overflow has been an adventure for me. This is my first time to work at an actual tech company. I have been what I like to think of as “tech adjacent” my whole career, writing code and working on technical questions but never before working at a straight-up web company. Tidy word vectors, take 2! https://juliasilge.com/blog/word-vectors-take-two/ Mon, 27 Nov 2017 00:00:00 +0000 https://juliasilge.com/blog/word-vectors-take-two/ A few weeks ago, I wrote a post about finding word vectors using tidy data principles, based on an approach outlined by Chris Moody on the StitchFix tech blog. 
I’ve been pondering how to improve this approach, and whether it would be nice to wrap up some of these functions in a package, so here is an update! Like in my previous post, let’s download half a million posts from the Hacker News corpus using the bigrquery package. New sports from random emoji https://juliasilge.com/blog/emoji-sports/ Sat, 25 Nov 2017 00:00:00 +0000 https://juliasilge.com/blog/emoji-sports/ I love emoji ❤️ and I love xkcd, so this recent comic from Randall Munroe was quite a delight for me. I sat there, enjoying the thought of these new sports like horse hole and multiplayer avocado and I thought, “I can make more of these in just the barest handful of lines of code”. This is largely thanks to the emo package by Hadley Wickham, which if you haven’t installed and started using yet, WHY NOT? Word Vectors with tidy data principles https://juliasilge.com/blog/tidy-word-vectors/ Mon, 30 Oct 2017 00:00:00 +0000 https://juliasilge.com/blog/tidy-word-vectors/ Last week I saw Chris Moody’s post on the Stitch Fix blog about calculating word vectors from a corpus of text using word counts and matrix factorization, and I was so excited! This blog post illustrates how to implement that approach to find word vector representations in R using tidy data principles and sparse matrices. Word vectors, or word embeddings, are typically calculated using neural networks; that is what word2vec is. From Power Calculations to P-Values: A/B Testing at Stack Overflow https://juliasilge.com/blog/ab-testing/ Tue, 17 Oct 2017 00:00:00 +0000 https://juliasilge.com/blog/ab-testing/ Note: cross-posted with the Stack Overflow blog. If you hang out on Meta Stack Overflow, you may have noticed news from time to time about A/B tests of various features here at Stack Overflow. 
We use A/B testing to compare a new version to a baseline for a design, a machine learning model, or practically any feature of what we do here at Stack Overflow; these tests are part of our decision-making process. Mapping ecosystems of software development https://juliasilge.com/blog/tag-network/ Tue, 03 Oct 2017 00:00:00 +0000 https://juliasilge.com/blog/tag-network/ I have a new post on the Stack Overflow blog today about the complex, interrelated ecosystems of software development. On the data team at Stack Overflow, we spend a lot of time and energy thinking about tech ecosystems and how technologies are related to each other. One way to get at this idea of relationships between technologies is tag correlations, how often technology tags at Stack Overflow appear together relative to how often they appear separately. tidytext 0.1.4 https://juliasilge.com/blog/tidytext-0-1-4/ Sat, 30 Sep 2017 00:00:00 +0000 https://juliasilge.com/blog/tidytext-0-1-4/ I am pleased to announce that tidytext 0.1.4 is now on CRAN! This release of our package for text mining using tidy data principles has an excellent collection of delightfulness in it. First off, all the important functions in tidytext now support non-standard evaluation through the tidyeval framework. library(janeaustenr) library(tidytext) library(dplyr) input_var <- quo(text) output_var <- quo(word) data_frame(text = prideprejudice) %>% unnest_tokens(!! output_var, !! input_var) ## # A tibble: 122,204 x 1 ## word ## <chr> ## 1 pride ## 2 and ## 3 prejudice ## 4 by ## 5 jane ## 6 austen ## 7 chapter ## 8 1 ## 9 it ## 10 is ## # . Sentiment analysis using tidy data principles at DataCamp https://juliasilge.com/blog/sentiment-datacamp/ Thu, 24 Aug 2017 00:00:00 +0000 https://juliasilge.com/blog/sentiment-datacamp/ NOTE: Read more here about why I no longer recommend taking my courses at DataCamp. 
I’ve been developing a course at DataCamp over the past several months, and I am happy to announce that it is now launched! The course is Sentiment Analysis in R: the Tidy Way and I am excited that it is now available for you to explore and learn from. This course focuses on digging into the emotional and opinion content of text using sentiment analysis, and it does this from the specific perspective of using tools built for handling tidy data. Understanding gender roles in movies with text mining https://juliasilge.com/blog/women-in-film/ Tue, 22 Aug 2017 00:00:00 +0000 https://juliasilge.com/blog/women-in-film/ I have a new visual essay up at The Pudding today, using text mining to explore how women are portrayed in film. The R code behind this analysis is publicly available on GitHub. I was so glad to work with the talented Russell Goldenberg and Amber Thomas on this project, and many thanks to Matt Daniels for inviting me to contribute to The Pudding. I’ve been a big fan of their work for a long time! Seeking guidance in choosing and evaluating R packages https://juliasilge.com/blog/package-guidance/ Tue, 08 Aug 2017 00:00:00 +0000 https://juliasilge.com/blog/package-guidance/ At useR!2017 in Brussels last month, I contributed to an organized session focused on navigating the 11,000+ packages on CRAN. My collaborators on this session and I recently put together an overall summary of the session and our goals, and now I’d like to talk more about the specific issue of learning about R packages and deciding which ones to use. John and Spencer will write more soon about the two other issues of our focus: Navigating the R Package Universe https://juliasilge.com/blog/navigating-packages/ Wed, 26 Jul 2017 00:00:00 +0000 https://juliasilge.com/blog/navigating-packages/ Earlier this month, I, along with John Nash, Spencer Graves, and Ludovic Vannoorenberghe, organized a session at useR!2017 focused on discovering, learning about, and evaluating R packages. 
You can check out the recording of the session. There are more than 11,000 packages on CRAN, and R users must approach this abundance of packages with effective strategies to find what they need and choose which packages to invest time in learning how to use. Text Mining of Stack Overflow Questions https://juliasilge.com/blog/text-mining-stack-overflow/ Thu, 06 Jul 2017 00:00:00 +0000 https://juliasilge.com/blog/text-mining-stack-overflow/ Note: Cross-posted with the Stack Overflow blog. This week, my fellow Stack Overflow data scientist David Robinson and I are happy to announce the publication of our book Text Mining with R with O’Reilly. We are so excited to see this project out in the world, and so relieved to finally be finished with it! Text data is being generated all the time around us, in healthcare, finance, tech, and beyond; text mining allows us to transform that unstructured text data into real insight that can increase understanding and inform decision-making. Using tidycensus and leaflet to map Census data https://juliasilge.com/blog/using-tidycensus/ Sat, 24 Jun 2017 00:00:00 +0000 https://juliasilge.com/blog/using-tidycensus/ Recently, I have been following the development and release of Kyle Walker’s tidycensus package. I have been filled with amazement, delight, and well, perhaps another feeling… There should be a word for “the regret felt when an R 📦, which would have saved untold hours of your life, is released”… #rstats 🤔 https://t.co/2THN4MwedO — Mara Averick (@dataandme) May 31, 2017 But seriously, I have worked with US Census data a lot in the past and this package tidytext 0.1.3 https://juliasilge.com/blog/tidytext-0-1-3/ Sun, 18 Jun 2017 00:00:00 +0000 https://juliasilge.com/blog/tidytext-0-1-3/ I am pleased to announce that tidytext 0.1.3 is now on CRAN! 
In this release, my collaborator David Robinson and I have fixed a handful of bugs, added tidiers for LDA models from the mallet package, and updated functions for changes to quanteda’s API. You can check out the NEWS for more details on changes. One enhancement in this release is the addition of the Loughran and McDonald sentiment lexicon of words specific to financial reporting. Mining CRAN DESCRIPTION Files https://juliasilge.com/blog/mining-cran-description/ Thu, 04 May 2017 00:00:00 +0000 https://juliasilge.com/blog/mining-cran-description/ A couple of weeks ago, I saw on Dirk Eddelbuettel’s blog that R 3.4.0 was going to include a function for obtaining information about packages currently on CRAN, including basically everything in DESCRIPTION files. When R 3.4.0 was released, this was one of the things I was most immediately excited about exploring, because although I recently dabbled in scraping CRAN to try to get this kind of information, it was rather onerous. Gender Roles with Text Mining and N-grams https://juliasilge.com/blog/gender-pronouns/ Sat, 15 Apr 2017 00:00:00 +0000 https://juliasilge.com/blog/gender-pronouns/ Today is the one year anniversary of the janeaustenr package’s appearance on CRAN, its cranniversary, if you will. I think it’s time for more Jane Austen here on my blog. via GIPHY I saw this paper by Matthew Jockers and Gabi Kirilloff a number of months ago and the ideas in it have been knocking around in my head ever since. The authors of that paper used text mining to examine a corpus of 19th century novels and explore how gendered pronouns (he/she/him/her) are associated with different verbs. How Do You Discover R Packages? https://juliasilge.com/blog/package-search/ Mon, 20 Mar 2017 00:00:00 +0000 https://juliasilge.com/blog/package-search/ Like I mentioned in my last blog post, I am contributing to a session at useR! 2017 this coming July that will focus on discovering and learning about R packages. 
This is an increasingly important issue for R users as we all decide which of the 10,000+ packages to invest time in understanding and then use in our work. library(dplyr) available.packages() %>% tbl_df() ## # A tibble: 10,276 × 17 ## Package Version Priority Depends ## <chr> <chr> <chr> <chr> ## 1 A3 1. Scraping CRAN with rvest https://juliasilge.com/blog/scraping-cran/ Mon, 06 Mar 2017 00:00:00 +0000 https://juliasilge.com/blog/scraping-cran/ I am one of the organizers for a session at useR! 2017 this coming July that will focus on discovering and learning about R packages. How do R users find packages that meet their needs? Can we make this process easier? As somebody who is relatively new to the R world compared to many, this is a topic that resonates with me and I am happy to be part of the discussion. What Programming Languages Are Used Most on Weekends? https://juliasilge.com/blog/weekends-weekdays/ Tue, 07 Feb 2017 00:00:00 +0000 https://juliasilge.com/blog/weekends-weekdays/ Note: Cross-posted with the Stack Overflow blog. Check out the code for this analysis on Kaggle. For me, the weekends are mostly about spending time with my family, reading for leisure, and working on the open-source projects I am involved in. These weekend projects overlap with the work that I do in my day job here at Stack Overflow, but are not exactly the same. Many developers tinker with side projects for learning or career development (or just for fun! Women in the 2016 Stack Overflow Survey https://juliasilge.com/blog/women-survey/ Thu, 19 Jan 2017 00:00:00 +0000 https://juliasilge.com/blog/women-survey/ Note: Cross-posted with the Stack Overflow blog The 2017 Stack Overflow Developer Survey opened last week, and we on the Data Team are looking forward to analyzing the survey results to better understand our developer community. 
I am particularly interested in women in tech, for probably obvious reasons, and recently I explored last year’s survey data to see what we can learn about women developers. How many women took the developer survey? Text Mining in R: A Tidy Approach https://juliasilge.com/blog/rstudio-conf/ Sat, 14 Jan 2017 00:00:00 +0000 https://juliasilge.com/blog/rstudio-conf/ I spoke on approaching text mining tasks using tidy data principles at rstudio::conf yesterday. I was so happy to have the opportunity to speak and the conference has been a great experience. If you want to catch up on what has been going on at rstudio::conf, Karl Broman put together a GitHub repo of slides and Sharon Machlis has been live-blogging the conference at Computerworld. A highlight for me was Andrew Flowers' talk on data journalism and storytelling; I don’t work in data journalism but I think I can apply almost everything he said to how I approach what I do. Reddit Responds to the Election https://juliasilge.com/blog/reddit-responds/ Tue, 06 Dec 2016 00:00:00 +0000 https://juliasilge.com/blog/reddit-responds/ It’s been about a month since the U.S. presidential election, with Donald Trump’s victory over Hillary Clinton coming as a surprise to most. Reddit user Jason Baumgartner collected and published every submission and comment posted to Reddit on the day of (and a bit surrounding) the U.S. election; let’s explore this data set and see what kinds of things we can learn. Data wrangling This first bit was the hardest part of this analysis for me, probably because I am not the most experienced JSON person out there. Measuring Gobbledygook https://juliasilge.com/blog/gobbledygook/ Fri, 25 Nov 2016 00:00:00 +0000 https://juliasilge.com/blog/gobbledygook/ In learning more about text mining over the past several months, one aspect of text that I’ve been interested in is readability. 
A text’s readability measures how hard or easy it is for a reader to read and understand what a text is saying; it depends on how sentences are written, what words are chosen, and so forth. I first became really aware of readability scores of books through my kids’ reading tracking websites for school, but it turns out there are lots of frameworks for measuring readability. Mapping Election Results in Utah https://juliasilge.com/blog/election-mapping/ Fri, 11 Nov 2016 00:00:00 +0000 https://juliasilge.com/blog/election-mapping/ My adopted home state of Utah has been a weird place this election cycle. For the unfamiliar, Utah is extremely conservative when it comes to politics; it is one of the reddest of the red states and has backed the Republican candidate for president for the past many decades. In 2012, about 3/4 of the popular vote went to Mitt Romney (who is LDS, like many here in the state) and there were no counties where Mitt Romney did not win. Tidy Text Mining with R https://juliasilge.com/blog/2016-10-28-tidy-text-mining/ Fri, 28 Oct 2016 00:00:00 +0000 https://juliasilge.com/blog/2016-10-28-tidy-text-mining/ I am so pleased to announce that tidytext 0.1.2 is now available on CRAN. This release of tidytext, a package for text mining using tidy data principles by Dave Robinson and me, includes some bug fixes and performance improvements, as well as some new functionality. There is now a handy function for accessing the various lexicons in the sentiments dataset without the columns that are not used in that particular dataset; this makes these datasets even easier to use with pipes and joins from dplyr. Non-Academic Careers for Astronomers and Physicists https://juliasilge.com/blog/non-academic-careers/ Fri, 30 Sep 2016 00:00:00 +0000 https://juliasilge.com/blog/non-academic-careers/ Today I’m giving a talk for the Department of Physics and Astronomy at the University of Utah about careers outside academia for astronomers and physicists. 
Check out my slides here, and links/references from my talk below. STEM PhD finishers in the UK (from the Royal Society) Science & Engineering PhDs awarded and faculty positions (from Nature) Declining proposal success rates in astronomy (from the National Science Foundation) Pamela Gay on soft money positions Katie Mack’s Twitter Josh Wills’ Twitter Drew Conway Oliver Keyes on data science David Robinson on building public artifacts Singing the Bayesian Beginner Blues https://juliasilge.com/blog/bayesian-blues/ Wed, 28 Sep 2016 00:00:00 +0000 https://juliasilge.com/blog/bayesian-blues/ Earlier this week, I published a post about song lyrics and how different U.S. states are mentioned at different rates, and at different rates relative to their populations. That was a very fun post to work on, but you can tell from that paragraph near the end that I am a little bothered by the uncertainty involved in calculating the rates by just dividing two numbers. David Robinson suggested on Twitter that I might try using empirical Bayes methods to estimate the rates. Song Lyrics Across the United States https://juliasilge.com/blog/song-lyrics-across/ Mon, 26 Sep 2016 00:00:00 +0000 https://juliasilge.com/blog/song-lyrics-across/ The inspiration for this post is a joint venture by both me and my husband, and its genesis lies more than 15 years in our past. One of the recurring conversations we have in our relationship (all long-term relationships have these, right?!) is about song lyrics and place names. I think the first time we ever had this conversation was in the late 1990s and was about Baltimore. “Why do so many songs talk about Baltimore? We Are Not Very Evenly Distributed https://juliasilge.com/blog/evenly-distributed/ Fri, 19 Aug 2016 00:00:00 +0000 https://juliasilge.com/blog/evenly-distributed/ I saw this tweet making the rounds this past week. 
Half of all Americans live in the red counties, half live in the orange counties pic.twitter.com/ptBXNbzSFQ — Conrad Hackett (@conradhackett) August 8, 2016 Interesting! I saw people using this map to make the argument that the Electoral College was super important, or a terrible idea, or any of a number of other sociopolitical thoughts. This map certainly caught my attention and made me want to know more about this kind of population density distribution. Something Strange in the Neighborhood https://juliasilge.com/blog/something-strange/ Fri, 05 Aug 2016 00:00:00 +0000 https://juliasilge.com/blog/something-strange/ Today I was so pleased to see a new data package hit CRAN, and how wonderful to see such accomplished women writing R packages. What a great new data package on CRAN! And always great to see more women authors in #rstats https://t.co/nROMibqPxX pic.twitter.com/UEayWgx9bz — Julia Silge (@juliasilge) August 5, 2016 The ghostr package includes a dataset of over 800 ghost sightings in Kentucky, with information on city, latitude, and longitude, along with URLs for finding more information about the ghost sightings. Return of the NEISS Data https://juliasilge.com/blog/return-of-neiss/ Fri, 22 Jul 2016 00:00:00 +0000 https://juliasilge.com/blog/return-of-neiss/ Almost six months ago (!) I wrote a blog post about the NEISS data set, a sample of accidents reported to emergency rooms in the U.S. that are related to consumer products. Ever since I did that exploration, I have been wanting to ask a bit of a different question from that sample of accidents. How do the accidents that people suffer depend on their demographic characteristics? We can get a bit of a sense of that from looking at the plot with age on the x-axis (or exploring Hadley Wickham’s NEISS Shiny app) but the NEISS data set includes quite a bit more demographic information to interact with. Fatal Police Shootings Across the U.S. 
https://juliasilge.com/blog/fatal-shootings/ Thu, 07 Jul 2016 00:00:00 +0000 https://juliasilge.com/blog/fatal-shootings/ I have been full of grief and sadness and some anger in the wake of yet more videos going viral in the past couple days showing black men being killed by police officers. I am not an expert on what it means to be a person of color in the United States or what is or isn’t wrong with policing today here, but it sure feels like something is deeply broken. Term Frequency and tf-idf Using Tidy Data Principles https://juliasilge.com/blog/term-frequency-tf-idf/ Mon, 27 Jun 2016 00:00:00 +0000 https://juliasilge.com/blog/term-frequency-tf-idf/ At the end of last week, Dave Robinson and I released a new version of tidytext on CRAN, our R package for text mining using tidy data principles. You can check out my first blog post about tidytext to learn a bit about the philosophy of the package and see some of the ways to use it, or see the package on GitHub. In this new release (tidytext 0.1.1), we have added more documentation, fixed some bugs, developed better testing/CI, and added some new functionality. A Beginner's Guide to Travis-CI for R https://juliasilge.com/blog/beginners-guide-to-travis/ Fri, 20 May 2016 00:00:00 +0000 https://juliasilge.com/blog/beginners-guide-to-travis/ Have you seen all those attractive green badges on other people’s R packages and thought, “I want a lovely green badge!” OF COURSE YOU DO. Well, let’s give it a shot, because today I am going to attempt a beginner’s guide to using Travis-CI for continuous integration for R packages. It is going to be a beginner’s guide because that is all I could possibly write; my knowledge and experience with Travis is limited. 
The Life-Changing Magic of Tidying Text https://juliasilge.com/blog/life-changing-magic/ Fri, 29 Apr 2016 00:00:00 +0000 https://juliasilge.com/blog/life-changing-magic/ When I went to the rOpenSci unconference about a month ago, I started work with Dave Robinson on a package for text mining using tidy data principles. What is this tidy data you keep hearing so much about? As described by Hadley Wickham, tidy data has a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. This means we end up with a data set that is in a long, skinny format instead of a wide format. How I Learned to Stop Worrying and Love R CMD Check https://juliasilge.com/blog/how-i-stopped/ Mon, 18 Apr 2016 00:00:00 +0000 https://juliasilge.com/blog/how-i-stopped/ Last week, I officially became the maintainer of a CRAN package! My package for the texts of Jane Austen’s 6 completed, published novels, janeaustenr, was released on CRAN and my Twitter feed was filled with congratulatory Jane Austen GIFs. I think this might be my favorite. .@juliasilge *clears schedule* *opens @rstudio * pic.twitter.com/Hu7V2E0ULJ — Andrew MacDonald 🌈 (@polesasunder) April 15, 2016 It was a good day. During the process of getting janeaustenr ready to submit to CRAN, I was pointed to some resources that were very helpful to me as a first-time maintainer. Who Came to Vote in Utah's Caucuses? https://juliasilge.com/blog/who-came-to-vote/ Fri, 08 Apr 2016 00:00:00 +0000 https://juliasilge.com/blog/who-came-to-vote/ Late last month, I analyzed results from Utah’s Republican and Democratic caucuses to show how the different presidential candidates fared across Utah. That was fun work to do, but I realized there was one more map I wanted to make; I want to compare the Republican and Democratic voter turnout across the counties in Utah. 
Utah is a politically conservative state and we know from the last plot I made in that post that many more people voted in the Republican caucus than the Democratic caucus, but I would like to see how voter turnout was distributed across the state. I Went to ROpenSci Unconference and All I Got Were These Lousy Hex Stickers https://juliasilge.com/blog/i-went-to-ropensci/ Wed, 06 Apr 2016 00:00:00 +0000 https://juliasilge.com/blog/i-went-to-ropensci/ Just kidding; it was amazing. Last week, I traveled to San Francisco to participate in an unconference/hackathon organized and hosted by ROpenSci. This was my first R conference or meeting, and it was such a great experience. I am still feeling a bit at a loss for words about what a tremendous time I had, actually, but I will make an attempt to share a bit about what it was like and what we did. Trump Losing and Feeling the Bern in Utah https://juliasilge.com/blog/mapping-utah-caucus/ Fri, 25 Mar 2016 00:00:00 +0000 https://juliasilge.com/blog/mapping-utah-caucus/ Well, it’s been an interesting election season so far, right? Everybody holding up OK? Utah held its caucuses this past Tuesday on March 22 and I thought I would do a bit of plotting to show the results. We can get the JSON data from CNN, as pointed out by Bob Rudis in his post here. Utah’s results were not available when he wrote that post but I was able to poke around and find them using the guidance he provided there. If I Loved Natural Language Processing Less, I Might Be Able to Talk About It More https://juliasilge.com/blog/if-i-loved-nlp-less/ Fri, 18 Mar 2016 00:00:00 +0000 https://juliasilge.com/blog/if-i-loved-nlp-less/ In my last post, I did some natural language processing and sentiment analysis for Jane Austen’s most well-known novel, Pride and Prejudice. It was just so much fun that I wanted to extend some of that work and compare across her body of writing. 
I decided to make an R package for her texts, for easy access for myself and anybody else who would like to do some text analysis on a nice sample of prose. You Must Allow Me To Tell You How Ardently I Admire and Love Natural Language Processing https://juliasilge.com/blog/you-must-allow-me/ Tue, 08 Mar 2016 00:00:00 +0000 https://juliasilge.com/blog/you-must-allow-me/ It is a truth universally acknowledged that sentiment analysis is super fun, and Pride and Prejudice is probably my very favorite book in all of literature, so let’s do some Jane Austen natural language processing. Project Gutenberg makes e-texts available for many, many books, including Pride and Prejudice which is available here. I am using the plain text UTF-8 file available at that link for this analysis. Let’s read the file and get it ready for analysis. My Baby Boomer Name Might Have Been "Debbie" https://juliasilge.com/blog/my-baby-boomer-name/ Mon, 29 Feb 2016 00:00:00 +0000 https://juliasilge.com/blog/my-baby-boomer-name/ I have always loved learning and thinking about names, how they are chosen and used, and how people feel about their names and the names around them. We had a traditional baby name book at our house when I was growing up (you know, lists of names with meanings), and I remember poring over it to find unusual or appealing names for my pretend play or the stories I wrote. As an adult, I read Laura Wattenberg’s excellent book on baby names when we were expecting our second baby, and I also discovered the NameVoyager on Wattenberg’s website. Your Floor Is the Most Dangerous Thing In Your House https://juliasilge.com/blog/your-floor/ Wed, 17 Feb 2016 00:00:00 +0000 https://juliasilge.com/blog/your-floor/ I saw this analysis at Flowing Data about the most common consumer products involved in hospital ER visits and was delighted, interested, etc. Nathan’s next related post is, um, also super interesting, if entirely horrifying. 
Apparently, I am not the only one who thought this data set was compelling, because this week Hadley Wickham took the NEISS data set that these beautiful analyses are based on and made an R package for them. A Tall Drink of Water https://juliasilge.com/blog/tall-drink-of-water/ Thu, 11 Feb 2016 00:00:00 +0000 https://juliasilge.com/blog/tall-drink-of-water/ In a previous post, I used water consumption data from Utah’s Open Data Catalog to explore what kind of users consume the most water in my home here in Salt Lake City, what the annual pattern of water use is, and how the drought of the past few years has affected water use. I made a predictive model for the total aggregate water use of the city and tested how drought affected the accuracy of such a model. Death Comes to Us All https://juliasilge.com/blog/death-comes/ Fri, 05 Feb 2016 00:00:00 +0000 https://juliasilge.com/blog/death-comes/ I have been working with a data set on causes of death in my adopted home state of Utah for a little while now, and I had been struggling with the best way to visualize it. This week, David Robinson released the gganimate package to create animated ggplot2 plots and I thought “AH HA! This is what I have been needing.” The data on causes of death in Utah is available here via Utah’s Open Data Catalog and can be accessed via Socrata Open Data API. Connecting Religion and Demographics https://juliasilge.com/blog/connecting-religion/ Mon, 01 Feb 2016 00:00:00 +0000 https://juliasilge.com/blog/connecting-religion/ I have my second guest post up today at Ari Lamstein’s blog where I conclude my exploration of the Religious Congregations and Membership Study at the ARDA. In this post I show how we can look at the relationships between a data set like the religion census and demographic data to gain context and understanding. Go over there to read the details! 
More Fun with Choropleth Maps https://juliasilge.com/blog/more-fun-with-maps/ Mon, 25 Jan 2016 00:00:00 +0000 https://juliasilge.com/blog/more-fun-with-maps/ I have a guest post up today at Ari Lamstein’s blog where I show some more fun things that can be done with the Religious Congregations and Membership Study at the ARDA that I used to look at Utah. I looked in some detail at Iowa ahead of their caucus in a few days, in light of all the news lately about Republican presidential candidates courting evangelical voters. Go take a look to read more! Water World https://juliasilge.com/blog/2016-01-19-water-world/ Tue, 19 Jan 2016 00:00:00 +0000 https://juliasilge.com/blog/2016-01-19-water-world/ I live in Utah, an extremely dry state. Like much of the western United States, Utah is experiencing water stress from increasing demand, episodes of drought, and conflict over water rights. At the same time, Utahns use a lot of water per capita compared to residents of other states. According to the United States Geological Survey, in 2014 people in Utah used more water per person than in any other state, and in years before and after, Utah’s per capita water use is always near the very top in the U. Health Care Indicators in Utah Counties https://juliasilge.com/blog/health-care-indicators/ Mon, 11 Jan 2016 00:00:00 +0000 https://juliasilge.com/blog/health-care-indicators/ The state of Utah (my adopted home) has an Open Data Catalog with lots of interesting data sets, including a collection of health care indicators from 2014 for the 29 counties in Utah. The observations for each county include measurements such as the infant mortality rate, the percent of people who don’t have insurance, what percent of people have diabetes, and so forth. Let’s see how these health care indicators are related to each other and if we can use these data to cluster Utah counties into similar groups. 
This Is the Place, Apparently https://juliasilge.com/blog/this-is-the-place/ Sun, 03 Jan 2016 00:00:00 +0000 https://juliasilge.com/blog/this-is-the-place/ My family and I moved to Utah about 5 years ago and we have found ourselves thoroughly in love with our new home state. I didn’t know much about it before we began the process of contemplating a move here, and I find that is often true of many people. Let’s use some choropleth maps and demographic exploration to learn a bit more about this place I call home now. Joy to the World, and also Anticipation, Disgust, Surprise... https://juliasilge.com/blog/joy-to-the-world/ Tue, 22 Dec 2015 00:00:00 +0000 https://juliasilge.com/blog/joy-to-the-world/ In my previous blog post, I analyzed my Twitter archive and explored some aspects of my tweeting behavior. When do I tweet, how much do I retweet people, do I use hashtags? These are examples of one kind of question, but what about the actual verbal content of my tweets, the text itself? What kinds of questions can we ask and answer about the text in some programmatic way? This is what is called natural language processing, and I’ll give a first shot at it here. Ten Thousand Tweets https://juliasilge.com/blog/ten-thousand-tweets/ Tue, 08 Dec 2015 00:00:00 +0000 https://juliasilge.com/blog/ten-thousand-tweets/ I started learning the statistical programming language R this past summer, and discovering Hadley Wickham’s data visualization package ggplot2 has been a joy and a revelation. When I think back to how I made all the plots for my astronomy dissertation in the early 2000s (COUGH SUPERMONGO COUGH), I feel a bit in awe of what ggplot2 can do and how easy and, might I even say, delightful it is to use. License https://juliasilge.com/license/ Mon, 01 Jan 0001 00:00:00 +0000 https://juliasilge.com/license/ My blog posts are released under a Creative Commons Attribution-ShareAlike 4.0 International License.