Welcome to my data page! Below you will find my data linked as a file, a Data Dictionary for some of the key terms, and a guide with tips and tricks.
Data Dictionary:
Corpus: A corpus, when referred to in digital spaces, is an electronically accessible collection of sources, often text but also including images, statistics, and other digital data. The plural is corpora, which I use here to describe more than one corpus.
Here are some examples of corpora, quoted from the USC Libraries research guide, which may help you frame your considerations: https://libguides.usc.edu/c.php?g=1443977&p=10726956
“Literary corpus: A collection of digitized texts from a specific era or genre of literature, used to study stylistic changes or authorial influences.
Historical document corpus: A compilation of digitized historical documents like letters, newspapers, and government records, enabling analysis of social trends across time.
Visual corpus: A collection of digitized images (paintings, photographs) categorized by subject matter or style, allowing for comparative analysis of visual representations.”
Best Practices
Building your Corpus!
When you choose texts from the internet or from a database, it is important that you collect your data carefully. Start by identifying the type of material, i.e., whether your corpus is text, images, or other digital data. Name the items in your corpus systematically, making sure you name, order, and categorize each individual piece of data or document uniformly. Here’s an example of how a mismatched corpus could look compared to a consistently named and organized one…
Textbook_2
PearsonHistoryBook1
Pearson_textbook3_
Or…
1_Pearson
2_Pearson
3_Pearson
Keep documentation of your choices and, importantly, of your sources!
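The naming and documentation advice above can be sketched in a few lines of Python. This is only a minimal sketch, not the method I used for my own corpus: the function names `plan_names` and `plan_to_csv` are made up for illustration, the label "Pearson" comes from the example above, and the numbering simply follows the order in which you list the original names.

```python
import csv
import io

def plan_names(original_names, label="Pearson"):
    """Pair each original name with a uniform `<number>_<label>` name.

    Assumption: numbers are assigned in list order, so put the
    originals in the order you want before calling this.
    """
    # Zero-pad the number so that 10 sorts after 09 rather than after 1.
    width = len(str(len(original_names)))
    return [
        (old, f"{i:0{width}d}_{label}")
        for i, old in enumerate(original_names, start=1)
    ]

def plan_to_csv(plan):
    """Render the plan as CSV text so the naming choices stay documented."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["original_name", "new_name"])
    writer.writerows(plan)
    return buffer.getvalue()

messy = ["Textbook_2", "PearsonHistoryBook1", "Pearson_textbook3_"]
plan = plan_names(messy)
print(plan_to_csv(plan))
```

Saving the printed table next to your corpus gives you a record of which original file became which new name, so the choice can always be traced back to its source.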
OpenRefine!
One tool I did not necessarily use in the example you saw, but use in most digital projects, is OpenRefine. It helps you wrangle large amounts of data and smooth over inconsistencies. When doing digital work, you must keep your data “clean”, meaning free of extra material that is not part of the source itself. You can do this in OpenRefine, or in Voyant you can use the “reader” tool to get a closer look at your text. If you see a word you don’t want counted, you can add it as a “stop” word, which leaves it out of the results. Or, if the reader tool shows you a word that is unfamiliar and doesn’t belong, you can remove it and fix your corpus more easily. You mustn’t change the written words of the speaker within a historical document. In the State of the Union speeches, however, there were a bunch of unnecessary notes on when to applaud. These stage directions were added in, they almost change the document, and they aren’t necessary for my research, so removing them cleans the data without altering what the speaker said.
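If you would rather strip applause notes before uploading anything, a short Python script can do it. This is a rough sketch and not what I actually ran: it assumes the notes appear as short parenthesized or bracketed remarks beginning with "Applause" or "Laughter", which you should check against your own files, and the sample sentence is invented for the demonstration.

```python
import re

# Assumption: audience reactions are transcribed as a short note in
# parentheses or square brackets that starts with "Applause" or "Laughter",
# for example "(Applause.)" or "[Laughter]". Adjust this to your transcripts.
STAGE_DIRECTION = re.compile(
    r"\s*[\(\[](?:applause|laughter)[^\)\]]*[\)\]]",
    re.IGNORECASE,
)

def strip_stage_directions(text):
    """Remove applause/laughter notes without touching the speaker's words."""
    return STAGE_DIRECTION.sub("", text)

# Made-up sample text, not a quotation from a real speech.
sample = "Thank you all. (Applause.) Please be seated. [Laughter] Let us begin."
cleaned = strip_stage_directions(sample)
print(cleaned)
```

Keep the original file untouched and save the cleaned text as a separate copy, so you can always show what was removed and why.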
Consistency Is Key!
Another example, one OpenRefine is better suited for, is the way I NAMED my documents. At first, I had inconsistent titles: the last name and first name of the author for some suffragette texts, and only the year of the speech for the State of the Union addresses. This made it confusing and difficult for me to organize and understand the data. What’s even more significant? A computer has no way of understanding what I meant by labeling my data that way. Tools like the “trends” tool on Voyant could not show the documents in chronological order, so I had to be more intentional about my choices. Data is all choices, so the first questions to ask yourself are: What are you using the data for? Why is it important? After answering those, you can adjust how you edit your data. If you were doing text analysis of Shakespeare, for example, you might keep the stage directions. If you were doing, let’s say, a comparative project with a bunch of documents, you might consider putting what makes each document distinct in its title, and keeping the titles consistent by including the year of publication.
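To see why a consistent, year-first title helps a computer, here is a small Python sketch. The `document_name` helper and the three records are hypothetical placeholders, not my real file names. The idea is only that when every title starts with a four-digit year, plain alphabetical sorting is also chronological sorting.

```python
def document_name(year, label):
    """Build a `<year>_<label>` title; the year comes first so titles sort by date."""
    # Collapse any run of spaces into a single underscore.
    safe_label = "_".join(label.split())
    return f"{year:04d}_{safe_label}"

# Hypothetical records: (year, what makes this document distinct).
records = [
    (1913, "Author B speech"),
    (1790, "State of the Union"),
    (1872, "Author A speech"),
]

# Sorting the titles as ordinary text now puts them in date order.
names = sorted(document_name(year, label) for year, label in records)
print(names)
```

The part of the title after the year is where the unique aspect of each document goes, while the year prefix is the piece that stays consistent across the whole corpus.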