\documentclass{article}
\title{Voynich Manuscript - Latin Texts}
\author{Sarah Goslee}
\date{2006-10-22}

\begin{document}
\maketitle

\section{Introduction}

Statistical analysis of an unknown text is useless without some other known text to compare it to. Latin was the language of scientific writing throughout the medieval period. While vernacular manuscripts are known from the likely time period of the Voynich MS, Latin seems like a reasonable starting point. So as not to bias the comparison toward a particular authorial style or topic, I chose three Latin texts on different topics, all available online as full texts.

\begin{itemize}
\item l1: Apuleius - de Mundo \linebreak
\verb!(http://www.gmu.edu/departments/fld/CLASSICS/apuleius.mundo.html)!
\item l2: TITI LVCRETI CARI DE RERVM NATVRA LIBER PRIMVS \linebreak
\verb!(http://www.thelatinlibrary.com/lucretius1.html)!
\item l3: Isidorus Hispalensis - De natura rerum \linebreak
\verb!(http://www.forumromanum.org/literature/isidorus_hispalensis/natura.html)!
\end{itemize}

Before analysis I removed all punctuation and converted all capital letters to lowercase (except those that were parts of Roman numerals). None of the texts were paginated, so I treated the paragraph as the basic unit. Spaces were replaced with \verb!.! and paragraph beginnings and endings were marked with \verb!=!. The texts did not contain line breaks, so those were not marked. I also removed the lines of poetry from l1 and the chapter headings from l3, since I've only been using paragraph text from the VMS.
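The preparation steps above can be sketched in R roughly as follows. This is only an illustration under stated assumptions, not the procedure actually used: \verb!prep.paragraph! is a hypothetical helper, the input is an arbitrary Latin phrase rather than a line from l1, l2 or l3, the \verb!=! markers are assumed to sit directly against the first and last words, and the Roman numeral exception is not handled.

\begin{verbatim}
# hypothetical helper, not part of the original analysis;
# the Roman numeral exception is NOT handled here
prep.paragraph <- function(x) {
    x <- tolower(x)                            # capitals to lowercase
    x <- gsub("[[:punct:]]", "", x)            # remove punctuation
    x <- gsub("[[:space:]]+", ".", trimws(x))  # spaces become "."
    paste("=", x, "=", sep="")                 # mark paragraph start and end
}

# arbitrary example phrase, not taken from the three texts
prep.paragraph("Arma virumque cano, Troiae qui primus ab oris.")
\end{verbatim}

Because every space becomes a \verb!.!, each prepared paragraph is a single whitespace-free token, so it can be read as one field per paragraph.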
<>=
library(ecodist)

l1 <- read.table("l1.txt", as.is=TRUE)[[1]]
l2 <- read.table("l2.txt", as.is=TRUE)[[1]]
l3 <- read.table("l3.txt", as.is=TRUE)[[1]]

# split l1s into characters
l1.chars <- sapply(l1, function(x)strsplit(x, ""))

# make character pairs
l1.pairs <- l1.chars
for(i in 1:length(l1.chars)) {
    temp <- l1.chars[[i]]
    if(length(temp) > 0) {
        l1.pairs[[i]] <- paste(temp[1:(length(temp)-1)], temp[2:length(temp)], sep="")
    }
}
rm(i, temp)

# split l1s into words
l1.words <- sapply(l1, function(x)sub("^=", "", x))
l1.words <- sapply(l1.words, function(x)sub("=$", "", x))
l1.words <- sapply(l1.words, function(x)gsub("=", "\\.", x))
l1.words <- sapply(l1.words, function(x)gsub("-", "\\.", x))
l1.words <- sapply(l1.words, function(x)strsplit(x, "\\."))

# visualize character pairs by l1 across entire ms
l1.pairs.table <- table(unlist(l1.pairs))
l1.pairs.table <- data.frame(c1=substring(names(l1.pairs.table),1,1), c2=substring(names(l1.pairs.table),2,2), l1.pairs.table)
l1.pairs.table <- crosstab(l1.pairs.table$c1, l1.pairs.table$c2, l1.pairs.table$Freq)

# create chars by l1 and words by l1 tables
l1.chars.table <- cbind(rep(1, length(l1.chars[[1]])), l1.chars[[1]])
l1.words.table <- cbind(rep(1, length(l1.words[[1]])), l1.words[[1]])
for(i in 2:length(l1)) {
    l1.chars.table <- rbind(l1.chars.table, cbind(rep(i, length(l1.chars[[i]])), l1.chars[[i]]))
    l1.words.table <- rbind(l1.words.table, cbind(rep(i, length(l1.words[[i]])), l1.words[[i]]))
}
rm(i)
l1.chars.table <- crosstab(as.numeric(l1.chars.table[,1]), l1.chars.table[,2], rep(1, nrow(l1.chars.table)))
l1.words.table <- crosstab(as.numeric(l1.words.table[,1]), l1.words.table[,2], rep(1, nrow(l1.words.table)))

# drop non-letter characters . =
l1.chars.table <- l1.chars.table[,-c(1,2)]

# scale tables by rowsum
l1.chars.rowsum <- apply(l1.chars.table, 1, sum)
l1.words.rowsum <- apply(l1.words.table, 1, sum)
l1.chars.table <- sweep(l1.chars.table, 1, l1.chars.rowsum, "/")
l1.words.table <- sweep(l1.words.table, 1, l1.words.rowsum, "/")

# principal coordinates analysis
l1.chars.pco <- pco(dist(l1.chars.table))
l1.words.pco <- pco(dist(l1.words.table))

###

# split l2s into characters
l2.chars <- sapply(l2, function(x)strsplit(x, ""))

# make character pairs
l2.pairs <- l2.chars
for(i in 1:length(l2.chars)) {
    temp <- l2.chars[[i]]
    if(length(temp) > 0) {
        l2.pairs[[i]] <- paste(temp[1:(length(temp)-1)], temp[2:length(temp)], sep="")
    }
}
rm(i, temp)

# split l2s into words
l2.words <- sapply(l2, function(x)sub("^=", "", x))
l2.words <- sapply(l2.words, function(x)sub("=$", "", x))
l2.words <- sapply(l2.words, function(x)gsub("=", "\\.", x))
l2.words <- sapply(l2.words, function(x)gsub("-", "\\.", x))
l2.words <- sapply(l2.words, function(x)strsplit(x, "\\."))

# visualize character pairs by l2 across entire ms
l2.pairs.table <- table(unlist(l2.pairs))
l2.pairs.table <- data.frame(c1=substring(names(l2.pairs.table),1,1), c2=substring(names(l2.pairs.table),2,2), l2.pairs.table)
l2.pairs.table <- crosstab(l2.pairs.table$c1, l2.pairs.table$c2, l2.pairs.table$Freq)

# create chars by l2 and words by l2 tables
l2.chars.table <- cbind(rep(1, length(l2.chars[[1]])), l2.chars[[1]])
l2.words.table <- cbind(rep(1, length(l2.words[[1]])), l2.words[[1]])
for(i in 2:length(l2)) {
    l2.chars.table <- rbind(l2.chars.table, cbind(rep(i, length(l2.chars[[i]])), l2.chars[[i]]))
    l2.words.table <- rbind(l2.words.table, cbind(rep(i, length(l2.words[[i]])), l2.words[[i]]))
}
rm(i)
l2.chars.table <- crosstab(as.numeric(l2.chars.table[,1]), l2.chars.table[,2], rep(1, nrow(l2.chars.table)))
l2.words.table <- crosstab(as.numeric(l2.words.table[,1]), l2.words.table[,2], rep(1, nrow(l2.words.table)))

# drop non-letter characters . =
l2.chars.table <- l2.chars.table[,-c(1,2)]

# scale tables by rowsum
l2.chars.rowsum <- apply(l2.chars.table, 1, sum)
l2.words.rowsum <- apply(l2.words.table, 1, sum)
l2.chars.table <- sweep(l2.chars.table, 1, l2.chars.rowsum, "/")
l2.words.table <- sweep(l2.words.table, 1, l2.words.rowsum, "/")

# principal coordinates analysis
l2.chars.pco <- pco(dist(l2.chars.table))
l2.words.pco <- pco(dist(l2.words.table))

###

# split l3s into characters
l3.chars <- sapply(l3, function(x)strsplit(x, ""))

# make character pairs
l3.pairs <- l3.chars
for(i in 1:length(l3.chars)) {
    temp <- l3.chars[[i]]
    if(length(temp) > 0) {
        l3.pairs[[i]] <- paste(temp[1:(length(temp)-1)], temp[2:length(temp)], sep="")
    }
}
rm(i, temp)

# split l3s into words
l3.words <- sapply(l3, function(x)sub("^=", "", x))
l3.words <- sapply(l3.words, function(x)sub("=$", "", x))
l3.words <- sapply(l3.words, function(x)gsub("=", "\\.", x))
l3.words <- sapply(l3.words, function(x)gsub("-", "\\.", x))
l3.words <- sapply(l3.words, function(x)strsplit(x, "\\."))

# visualize character pairs by l3 across entire ms
l3.pairs.table <- table(unlist(l3.pairs))
l3.pairs.table <- data.frame(c1=substring(names(l3.pairs.table),1,1), c2=substring(names(l3.pairs.table),2,2), l3.pairs.table)
l3.pairs.table <- crosstab(l3.pairs.table$c1, l3.pairs.table$c2, l3.pairs.table$Freq)

# create chars by l3 and words by l3 tables
l3.chars.table <- cbind(rep(1, length(l3.chars[[1]])), l3.chars[[1]])
l3.words.table <- cbind(rep(1, length(l3.words[[1]])), l3.words[[1]])
for(i in 2:length(l3)) {
    l3.chars.table <- rbind(l3.chars.table, cbind(rep(i, length(l3.chars[[i]])), l3.chars[[i]]))
    l3.words.table <- rbind(l3.words.table, cbind(rep(i, length(l3.words[[i]])), l3.words[[i]]))
}
rm(i)
l3.chars.table <- crosstab(as.numeric(l3.chars.table[,1]), l3.chars.table[,2], rep(1, nrow(l3.chars.table)))
l3.words.table <- crosstab(as.numeric(l3.words.table[,1]), l3.words.table[,2], rep(1, nrow(l3.words.table)))

# drop non-letter characters . =
l3.chars.table <- l3.chars.table[,-c(1,2)]

# scale tables by rowsum
l3.chars.rowsum <- apply(l3.chars.table, 1, sum)
l3.words.rowsum <- apply(l3.words.table, 1, sum)
l3.chars.table <- sweep(l3.chars.table, 1, l3.chars.rowsum, "/")
l3.words.table <- sweep(l3.words.table, 1, l3.words.rowsum, "/")

# principal coordinates analysis
l3.chars.pco <- pco(dist(l3.chars.table))
l3.words.pco <- pco(dist(l3.words.table))

write.table(table(unlist(l1.chars)), "l1chartable.csv", quote=FALSE, row.names=FALSE)
write.table(table(unlist(l2.chars)), "l2chartable.csv", quote=FALSE, row.names=FALSE)
write.table(table(unlist(l3.chars)), "l3chartable.csv", quote=FALSE, row.names=FALSE)

write.table(table(unlist(l1.words)), "l1wordtable.csv", quote=FALSE, row.names=FALSE)
write.table(table(unlist(l2.words)), "l2wordtable.csv", quote=FALSE, row.names=FALSE)
write.table(table(unlist(l3.words)), "l3wordtable.csv", quote=FALSE, row.names=FALSE)

write.table(l1.pairs.table, "l1charpair.csv", quote=FALSE)
write.table(l2.pairs.table, "l2charpair.csv", quote=FALSE)
write.table(l3.pairs.table, "l3charpair.csv", quote=FALSE)

l1.word.pairs <- l1.words
for(i in 1:length(l1.words)) {
    temp <- l1.words[[i]]
    l1.word.pairs[[i]] <- paste(temp[1:(length(temp)-1)], temp[2:length(temp)], sep=".")
}
rm(i, temp)

l2.word.pairs <- l2.words
for(i in 1:length(l2.words)) {
    temp <- l2.words[[i]]
    l2.word.pairs[[i]] <- paste(temp[1:(length(temp)-1)], temp[2:length(temp)], sep=".")
}
rm(i, temp)

l3.word.pairs <- l3.words
for(i in 1:length(l3.words)) {
    temp <- l3.words[[i]]
    l3.word.pairs[[i]] <- paste(temp[1:(length(temp)-1)], temp[2:length(temp)], sep=".")
}
rm(i, temp)

write.table(table(unlist(l1.word.pairs)), "l1wordpair.csv", quote=FALSE)
write.table(table(unlist(l2.word.pairs)), "l2wordpair.csv", quote=FALSE)
write.table(table(unlist(l3.word.pairs)), "l3wordpair.csv", quote=FALSE)

###
# do the same for Voynich paragraphs
myevt <- read.table("newall.evt", sep="\t")
pageinfo <- read.table("pages.txt", sep="\t",
    header=TRUE)

lineinfo <- sapply(myevt[,1], function(x)substring(x, 2, nchar(x)-1))
lineinfo <- sapply(lineinfo, function(x)strsplit(x, "\\."))
lineinfo <- data.frame(do.call("rbind", lineinfo))
colnames(lineinfo) <- c("page", "para", "line")

lines <- as.character(myevt[,2])

# separate out paragraph text from labels, other text
paraevt <- lines[substring(lineinfo$para, 1, 1) == "P" | substring(lineinfo$para, 1, 1) == "Q"]
lineinfo <- merge(lineinfo, pageinfo, all.x = TRUE, all.y = FALSE)
paralineinfo <- lineinfo[substring(lineinfo$para, 1, 1) == "P" | substring(lineinfo$para, 1, 1) == "Q",]

# merge lines into their paragraphs
paragraphs <- rep("", length(paraevt))
temp <- rep(0, length(paragraphs))
currline <- 0
for(i in 1:length(paraevt)) {
    if(substring(paraevt[i], 1, 1) == "=") {
        currline <- currline + 1
        paragraphs[currline] <- paraevt[i]
        temp[i] <- 1
    } else {
        paragraphs[currline] <- paste(paragraphs[currline], substring(paraevt[i], 2, nchar(paraevt[i])), sep="")
    }
}
paragraphs <- paragraphs[1:currline]
paragraphinfo <- paralineinfo[temp == 1,]
rm(i, currline, temp)

# split paragraphs into characters
paragraph.chars <- sapply(paragraphs, function(x)strsplit(x, ""))

# make character pairs
paragraph.pairs <- paragraph.chars
for(i in 1:length(paragraph.chars)) {
    temp <- paragraph.chars[[i]]
    if(length(temp) > 0) {
        paragraph.pairs[[i]] <- paste(temp[1:(length(temp)-1)], temp[2:length(temp)], sep="")
    }
}
rm(i, temp)

# split paragraphs into words
paragraph.words <- sapply(paragraphs, function(x)sub("^=", "", x))
paragraph.words <- sapply(paragraph.words, function(x)sub("=$", "", x))
paragraph.words <- sapply(paragraph.words, function(x)gsub("=", "\\.", x))
paragraph.words <- sapply(paragraph.words, function(x)gsub("-", "\\.", x))
paragraph.words <- sapply(paragraph.words, function(x)strsplit(x, "\\."))

# visualize character pairs by paragraph across entire ms
paragraph.pairs.table <- table(unlist(paragraph.pairs))
paragraph.pairs.table <-
    data.frame(c1=substring(names(paragraph.pairs.table),1,1), c2=substring(names(paragraph.pairs.table),2,2), paragraph.pairs.table)
paragraph.pairs.table <- crosstab(paragraph.pairs.table$c1, paragraph.pairs.table$c2, paragraph.pairs.table$Freq)

# create chars by paragraph and words by paragraph tables
paragraph.chars.table <- cbind(rep(1, length(paragraph.chars[[1]])), paragraph.chars[[1]])
paragraph.words.table <- cbind(rep(1, length(paragraph.words[[1]])), paragraph.words[[1]])
for(i in 2:length(paragraphs)) {
    paragraph.chars.table <- rbind(paragraph.chars.table, cbind(rep(i, length(paragraph.chars[[i]])), paragraph.chars[[i]]))
    paragraph.words.table <- rbind(paragraph.words.table, cbind(rep(i, length(paragraph.words[[i]])), paragraph.words[[i]]))
}
rm(i)
paragraph.chars.table <- crosstab(as.numeric(paragraph.chars.table[,1]), paragraph.chars.table[,2], rep(1, nrow(paragraph.chars.table)))
paragraph.words.table <- crosstab(as.numeric(paragraph.words.table[,1]), paragraph.words.table[,2], rep(1, nrow(paragraph.words.table)))

# drop non-letter characters - * . =
paragraph.chars.table <- paragraph.chars.table[,-c(1,2,3,4)]

# scale tables by rowsum
paragraph.chars.rowsum <- apply(paragraph.chars.table, 1, sum)
paragraph.words.rowsum <- apply(paragraph.words.table, 1, sum)
paragraph.chars.table <- sweep(paragraph.chars.table, 1, paragraph.chars.rowsum, "/")
paragraph.words.table <- sweep(paragraph.words.table, 1, paragraph.words.rowsum, "/")

# principal coordinates analysis, minus the very short paragraphs 302 and 713
paragraph.chars.pco <- pco(dist(paragraph.chars.table[-c(302, 713),]))
paragraph.words.pco <- pco(dist(paragraph.words.table[-c(302, 713),]))

### pooled latin texts
lall <- c(l1, l2, l3)
lall.index <- c(rep(1, length(l1)), rep(2, length(l2)), rep(3, length(l3)))

# split lalls into characters
lall.chars <- sapply(lall, function(x)strsplit(x, ""))

# make character pairs
lall.pairs <- lall.chars
for(i in 1:length(lall.chars)) {
    temp <- lall.chars[[i]]
    if(length(temp) > 0) {
        lall.pairs[[i]] <- paste(temp[1:(length(temp)-1)], temp[2:length(temp)], sep="")
    }
}
rm(i, temp)

# split lalls into words
lall.words <- sapply(lall, function(x)sub("^=", "", x))
lall.words <- sapply(lall.words, function(x)sub("=$", "", x))
lall.words <- sapply(lall.words, function(x)gsub("=", "\\.", x))
lall.words <- sapply(lall.words, function(x)gsub("-", "\\.", x))
lall.words <- sapply(lall.words, function(x)strsplit(x, "\\."))

# visualize character pairs by lall across entire ms
lall.pairs.table <- table(unlist(lall.pairs))
lall.pairs.table <- data.frame(c1=substring(names(lall.pairs.table),1,1), c2=substring(names(lall.pairs.table),2,2), lall.pairs.table)
lall.pairs.table <- crosstab(lall.pairs.table$c1, lall.pairs.table$c2, lall.pairs.table$Freq)

# create chars by lall and words by lall tables
lall.chars.table <- cbind(rep(1, length(lall.chars[[1]])), lall.chars[[1]])
lall.words.table <- cbind(rep(1, length(lall.words[[1]])), lall.words[[1]])
for(i in 2:length(lall)) {
    lall.chars.table <- rbind(lall.chars.table, cbind(rep(i,
        length(lall.chars[[i]])), lall.chars[[i]]))
    lall.words.table <- rbind(lall.words.table, cbind(rep(i, length(lall.words[[i]])), lall.words[[i]]))
}
rm(i)
lall.chars.table <- crosstab(as.numeric(lall.chars.table[,1]), lall.chars.table[,2], rep(1, nrow(lall.chars.table)))
lall.words.table <- crosstab(as.numeric(lall.words.table[,1]), lall.words.table[,2], rep(1, nrow(lall.words.table)))

# drop non-letter characters . =
lall.chars.table <- lall.chars.table[,-c(1,2)]

# scale tables by rowsum
lall.chars.rowsum <- apply(lall.chars.table, 1, sum)
lall.words.rowsum <- apply(lall.words.table, 1, sum)
lall.chars.table <- sweep(lall.chars.table, 1, lall.chars.rowsum, "/")
lall.words.table <- sweep(lall.words.table, 1, lall.words.rowsum, "/")

# principal coordinates analysis
lall.chars.pco <- pco(dist(lall.chars.table))
lall.words.pco <- pco(dist(lall.words.table))
@

\section{Ordination}

\begin{figure}
\begin{center}
<>=
par(mfrow=c(2,2))
plot(l1.chars.pco$vectors[,1:2], xlab="PCO 1", ylab="PCO 2", main="l1", pch=16, col="red")
plot(l2.chars.pco$vectors[,1:2], xlab="PCO 1", ylab="PCO 2", main="l2", pch=16, col="darkred")
plot(l3.chars.pco$vectors[,1:2], xlab="PCO 1", ylab="PCO 2", main="l3", pch=16, col="darkorange")
@
\caption{Ordination of character frequencies for three Latin texts, by paragraph.}
\label{fig:charPCO}
\end{center}
\end{figure}

None of the three Latin texts show any clear internal groupings based on character frequencies (Fig. \ref{fig:charPCO}).
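For reference, the quantities behind these ordinations can be written out explicitly. The symbols are notation introduced here for illustration only: $n_{ij}$ is the number of times character (or word) $j$ occurs in paragraph $i$, and $p_{ij}$ is the corresponding row-scaled frequency. Each table is divided by its row sums, and \verb!dist! is called with its default Euclidean metric, so the distance between paragraphs $i$ and $i'$ is

\[
p_{ij} = \frac{n_{ij}}{\sum_{k} n_{ik}},
\qquad
d(i, i') = \sqrt{\sum_{j} \left(p_{ij} - p_{i'j}\right)^{2}} .
\]

The principal coordinates analysis (\verb!pco!) is then run on the matrix of these distances, and the figures plot the first two axes. For the character tables, the non-letter columns are dropped before the row sums are taken.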
\begin{figure}
\begin{center}
<>=
par(mfrow=c(2,2))
plot(l1.words.pco$vectors[,1:2], xlab="PCO 1", ylab="PCO 2", main="l1", pch=16, col="red")
plot(l2.words.pco$vectors[,1:2], xlab="PCO 1", ylab="PCO 2", main="l2", pch=16, col="darkred")
plot(l3.words.pco$vectors[,1:2], xlab="PCO 1", ylab="PCO 2", main="l3", pch=16, col="darkorange")
@
\caption{Ordination of word frequencies for three Latin texts, by paragraph.}
\label{fig:wordPCO}
\end{center}
\end{figure}

Ordination on word frequencies picks out some topical outliers in l2, but shows no other consistent groupings (Fig. \ref{fig:wordPCO}).

\begin{figure}
\begin{center}
<>=
par(mfrow=c(1,1))
plot(paragraph.chars.pco$vectors[,1:2], xlab="PCO 1", ylab="PCO 2", main="VMS", type="n")
points(paragraph.chars.pco$vectors[paragraphinfo[-c(302, 713),]$lang == "A", 1:2], pch=16, col="red")
points(paragraph.chars.pco$vectors[paragraphinfo[-c(302, 713),]$lang == "B", 1:2], pch=16, col="blue")
@
\caption{Ordination of character frequencies for Voynich text, by paragraph.}
\label{fig:charPooledA}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
<>=
par(mfrow=c(1,1))
plot(lall.chars.pco$vectors[,1:2], xlab="PCO 1", ylab="PCO 2", main="Latin", type="n")
points(lall.chars.pco$vectors[lall.index == 3,1:2], pch=16, col="darkorange")
points(lall.chars.pco$vectors[lall.index == 1,1:2], pch=16, col="red")
points(lall.chars.pco$vectors[lall.index == 2,1:2], pch=16, col="darkred")
@
\caption{Ordination of character frequencies for three pooled Latin texts, by paragraph.}
\label{fig:charPooledB}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
<>=
par(mfrow=c(1,1))
plot(paragraph.words.pco$vectors[,1:2], xlab="PCO 1", ylab="PCO 2", main="VMS", type="n")
points(paragraph.words.pco$vectors[paragraphinfo[-c(302, 713),]$lang == "A", 1:2], pch=16, col="red")
points(paragraph.words.pco$vectors[paragraphinfo[-c(302, 713),]$lang == "B", 1:2], pch=16, col="blue")
@
\caption{Ordination of word frequencies for Voynich text, by paragraph.}
\label{fig:wordPooledA}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
<>=
par(mfrow=c(1,1))
plot(lall.words.pco$vectors[,1:2], xlab="PCO 1", ylab="PCO 2", main="Latin", type="n")
points(lall.words.pco$vectors[lall.index == 3,1:2], pch=16, col="darkorange")
points(lall.words.pco$vectors[lall.index == 1,1:2], pch=16, col="red")
points(lall.words.pco$vectors[lall.index == 2,1:2], pch=16, col="darkred")
@
\caption{Ordination of word frequencies for three pooled Latin texts, by paragraph.}
\label{fig:wordPooledB}
\end{center}
\end{figure}

Question: are the VMS sections more or less different than can be explained by topic?

The A/B language distinction in the VMS shows up reasonably well in an ordination on character frequencies (Fig. \ref{fig:charPooledA}). No similar division is apparent in the Latin texts. An ordination on the character frequencies of all three texts together demonstrates that they all have very similar character frequencies (Fig. \ref{fig:charPooledB}).

The A/B language distinction in the VMS shows up even more clearly in an ordination on word frequencies (Fig. \ref{fig:wordPooledA}). The Latin texts show some separation, but there is considerable overlap (Fig. \ref{fig:wordPooledB}).

\section{Character Frequencies}

The ordination analyses suggested that there was little difference in character frequency patterns among the three Latin texts. The overall character frequencies confirm this: unlike the VMS subsets, the Latin texts differ very little from one another (Fig. \ref{fig:chardist}).
\begin{figure}
\begin{center}
<>=
lchars <- sort(unique(c(unlist(l1.chars), unlist(l2.chars), unlist(l3.chars))))

l1ct <- table(factor(unlist(l1.chars), levels=lchars))
l1ct <- l1ct[-c(1,2)]
l1ct.sum <- sum(l1ct)
l1ct <- l1ct/l1ct.sum

l2ct <- table(factor(unlist(l2.chars), levels=lchars))
l2ct <- l2ct[-c(1,2)]
l2ct.sum <- sum(l2ct)
l2ct <- l2ct/l2ct.sum

l3ct <- table(factor(unlist(l3.chars), levels=lchars))
l3ct <- l3ct[-c(1,2)]
l3ct.sum <- sum(l3ct)
l3ct <- l3ct/l3ct.sum

par(mfrow=c(1,1))
plot(1:length(l1ct), l1ct, type="b", col="red", pch=names(l1ct), xlab="Character", ylab="Frequency", ylim=c(0, .15))
lines(1:length(l2ct), l2ct, type="b", col="darkred", pch=names(l2ct))
lines(1:length(l3ct), l3ct, type="b", col="darkorange", pch=names(l3ct))
@
\end{center}
\caption{Character distribution in the three Latin texts.}
\label{fig:chardist}
\end{figure}

\begin{figure}
\begin{center}
<>=
l1wt <- table(unlist(l1.words))
l1wt.sum <- sum(l1wt)
l1wt.count <- length(l1wt)
l1wt <- l1wt[l1wt > 1]
l1wt <- l1wt/l1wt.sum
l1wt <- sort(l1wt, decreasing=TRUE)

l2wt <- table(unlist(l2.words))
l2wt.sum <- sum(l2wt)
l2wt.count <- length(l2wt)
l2wt <- l2wt[l2wt > 1]
l2wt <- l2wt/l2wt.sum
l2wt <- sort(l2wt, decreasing=TRUE)

l3wt <- table(unlist(l3.words))
l3wt.sum <- sum(l3wt)
l3wt.count <- length(l3wt)
l3wt <- l3wt[l3wt > 1]
l3wt <- l3wt/l3wt.sum
l3wt <- sort(l3wt, decreasing=TRUE)

par(mfrow=c(1,1))
plot(1:50, l1wt[1:50], type="b", col="red", xlab="Word rank", ylab="Frequency", main="50 most common words", pch=16)
lines(1:50, l2wt[1:50], type="b", col="darkred", pch=16)
lines(1:50, l3wt[1:50], type="b", col="darkorange", pch=16)
@
\caption{Word frequencies in the three Latin texts.}
\label{fig:wordline}
\end{center}
\end{figure}

The word frequency distributions are not as extreme as that of set H of the VMS (Fig. \ref{fig:wordline}). In l1 and l2, \verb!et! has more than three times as many occurrences as the next most common word.
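The comparison of the most common word with the runner-up can be sketched as follows. This is a minimal illustration on made-up words, not the l1 or l2 data, and \verb!toy.words!, \verb!toy.wt! and \verb!toy.ratio! are hypothetical names used only here.

\begin{verbatim}
# toy data for illustration only; not taken from any of the texts
toy.words <- c("et", "et", "et", "et", "et", "et", "et",
               "in", "in", "est")

# word counts, most common first
toy.wt <- sort(table(toy.words), decreasing=TRUE)

# ratio of the most common word to the next most common
toy.ratio <- as.numeric(toy.wt[1]) / as.numeric(toy.wt[2])
toy.ratio
\end{verbatim}

The ratio is the same whether it is taken on raw counts or on counts divided by the total number of word occurrences, since the total cancels.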

Ten most common words in l1 (percentage occurrence):

<>=
round(100*(head(l1wt, 10)))
@

Ten most common words in l2:

<>=
round(100*(head(l2wt, 10)))
@

Ten most common words in l3:

<>=
round(100*(head(l3wt, 10)))
@

The same word is most frequent in all three Latin texts, and another appears in the top four in all texts, but the same is not true of the Voynich sections. Set H shows extreme skew, and no word is found in the top four in any two sets.

Word length is longer in the Latin texts than in the VMS, and the mean number of occurrences of each word is lower (Table \ref{tab:setchar}). The percentage of words that occur only once is similar in the Latin texts and VMS, although l1 is lower than the others.

\begin{center}
<>=
library(xtable)

# columns: Paras, Chars, Word occ., Word length, Words, N occ., Pct. Unique Words
x <- data.frame(matrix(
    c(length(l1), l1ct.sum, l1wt.sum, l1ct.sum/l1wt.sum, l1wt.count, l1wt.sum/l1wt.count, 100*length(l1wt)/l1wt.count,
      length(l2), l2ct.sum, l2wt.sum, l2ct.sum/l2wt.sum, l2wt.count, l2wt.sum/l2wt.count, 100*length(l2wt)/l2wt.count,
      length(l3), l3ct.sum, l3wt.sum, l3ct.sum/l3wt.sum, l3wt.count, l3wt.sum/l3wt.count, 100*length(l3wt)/l3wt.count)
    , nrow=3, byrow=TRUE))
dimnames(x) <- list(c("l1", "l2", "l3"), c("Paras", "Chars", "Word occ.", "Word length", "Words", "N occ.", "Pct. Unique Words"))

xtable(x, caption="Text Characteristics", label="tab:setchar", digits=c(0,0,0,0,1,0,1,0))
@
\end{center}

\end{document}