\documentclass{article} \title{Voynich Manuscript - Basic Analyses} \author{Sarah Goslee} \date{2006-10-22} \begin{document} \maketitle \section{Introduction} I am working from a modified EVA transcription of the VMS (Reeds/Landini's interlinear file in EVA, version 1.6e6 - http://www.ic.unicamp.br/~stolfi/voynich/98-12-28-interln16e6/). Starting with that file, I constructed a consensus version, and I've fixed some apparent transcription errors by comparing the transcription with the high-resolution SIDs available courtesy of the Beinecke Rare Book and Manuscript Library of Yale University, current owners of the manuscript. (I downloaded them from the links available at www.voynich.nu.) I've also made some other changes to match my conception of the symbol set (in no particular order): \begin{itemize} \item ch -> c because I think the c-ligature-c combination is one character. \item sh -> C I'm not sure of the meaning of the c-ligatureswirl-c - it seems to behave very much like the regular c - but am keeping it separate. \item paragraph initial gallows (fkpt) -> capitals FKPT - I think these have some other meaning and are not part of the following word. \item All possible word breaks (,) have been marked as definite word breaks (.). I'm continuing to check these as I proofread the transcription. \item = used to mark paragraph / label beginnings and endings, and - used to mark other line beginning and endings. \item Internal - marking intruding images have been replaced with word breaks (.) \item I've "untangled" the ch-gallows combinations, putting the gallows letter first. I think they are scribal conceits, and this idea is supported by the appearance of gallows-ch combinations at the beginning of paragraphs. \item "weirdos" are all denoted by X \end{itemize} A complete list of changes suitable for running as a source file in vim is contained in appendix 1. <>= library(ecodist) myevt <- read.table("newall.evt", sep="\t") pageinfo <- read.table("pages.txt", sep="\t", header=TRUE) lineinfo <- sapply(myevt[,1], function(x)substring(x, 2, nchar(x)-1)) lineinfo <- sapply(lineinfo, function(x)strsplit(x, "\\.")) lineinfo <- data.frame(do.call("rbind", lineinfo)) colnames(lineinfo) <- c("page", "para", "line") lines <- as.character(myevt[,2]) # separate out paragraph text from labels, other text paraevt <- lines[substring(lineinfo$para, 1, 1) == "P" | substring(lineinfo$para, 1, 1) == "Q"] paralineinfo <- lineinfo[substring(lineinfo$para, 1, 1) == "P" | substring(lineinfo$para, 1, 1) == "Q",] # merge lines into their pages pages <- rep("", nrow(pageinfo)) currline <- which(pageinfo$page == paralineinfo$page[1]) pages[currline] <- paraevt[currline] for(i in 2:length(paraevt)) { if(paralineinfo$page[i] != paralineinfo$page[i-1]) { currline <- which(pageinfo$page == paralineinfo$page[i]) pages[currline] <- paraevt[i] } else { pages[currline] <- paste(pages[currline], substring(paraevt[i], 2, nchar(paraevt[i])), sep="") } } rm(i, currline) # split pages into characters page.chars <- sapply(pages, function(x)strsplit(x, "")) # make character pairs page.pairs <- page.chars for(i in 1:length(page.chars)) { temp <- page.chars[[i]] if(length(temp) > 0) { page.pairs[[i]] <- paste(temp[1:(length(temp)-1)], temp[2:length(temp)], sep="") } } rm(i, temp) # split pages into words page.words <- sapply(pages, function(x)sub("^=", "", x)) page.words <- sapply(page.words, function(x)sub("=$", "", x)) page.words <- sapply(page.words, function(x)gsub("=", "\\.", x)) page.words <- sapply(page.words, function(x)gsub("-", "\\.", x)) page.words <- sapply(page.words, function(x)strsplit(x, "\\.")) # visualize character pairs across entire ms page.pairs.table <- table(unlist(page.pairs)) page.pairs.table <- data.frame(c1=substring(names(page.pairs.table),1,1), c2=substring(names(page.pairs.table),2,2), page.pairs.table) page.pairs.table <- crosstab(page.pairs.table$c1, page.pairs.table$c2, page.pairs.table$Freq) #$ # create chars by page and words by page tables page.chars.table <- cbind(rep(1, length(page.chars[[1]])), page.chars[[1]]) page.words.table <- cbind(rep(1, length(page.words[[1]])), page.words[[1]]) for(i in 2:227) { page.chars.table <- rbind(page.chars.table, cbind(rep(i, length(page.chars[[i]])), page.chars[[i]])) page.words.table <- rbind(page.words.table, cbind(rep(i, length(page.words[[i]])), page.words[[i]])) } rm(i) page.chars.table <- crosstab(as.numeric(page.chars.table[,1]), page.chars.table[,2], rep(1, nrow(page.chars.table))) page.words.table <- crosstab(as.numeric(page.words.table[,1]), page.words.table[,2], rep(1, nrow(page.words.table))) # drop non-letter characters - * . = page.chars.table <- page.chars.table[,-c(1,2,3,4)] pageinfo.table <- pageinfo[unlist(lapply(page.words, length)) > 0,] # scale tables by rowsum page.chars.rowsum <- apply(page.chars.table, 1, sum) page.words.rowsum <- apply(page.words.table, 1, sum) page.chars.table <- sweep(page.chars.table, 1, page.chars.rowsum, "/") page.words.table <- sweep(page.words.table, 1, page.words.rowsum, "/") # principal coordinates analysis # drop 204, f116v, because it has only 3 words page.chars.pco <- pco(dist(page.chars.table[-204,])) page.words.pco <- pco(dist(page.words.table[-204,])) @ \pagebreak \section{Ordination} First, I looked at overall pattern of character frequencies (Fig. \ref{fig:charAB}) and word frequencies (Fig. \ref{fig:wordAB}) between Currier languages and section types, using principal coordinates ordination on Euclidean distances of row-standardized frequencies. Labeling the same ordinations with section types as well as Currier language (Figs. \ref{fig:charSec}, \ref{fig:wordSec}) showed that the language groups separate for both character and word frequencies (although not cleanly), but that thematic (topical(?) section) groups separate more clearly for word frequencies. This suggests to me that character frequency is determined by A/B "encoding", while word frequencies are more closely related to theme. \begin{figure} \begin{center} <>= plot(page.chars.pco$vectors[,1:2], xlab="PCO 1", ylab="PCO 2", main="Character frequencies") #$ points(page.chars.pco$vectors[pageinfo.table$lang[-204] == "A", 1:2], col="red", pch=16) points(page.chars.pco$vectors[pageinfo.table$lang[-204] == "B", 1:2], col="blue", pch=16) legend("bottomright", legend=c("A", "B"), col=c("red", "blue"), pch=16) @ \caption{Ordination of character frequencies in Currier A and Currier B pages (paragraph text only).} \label{fig:charAB} \end{center} \end{figure} \begin{figure} \begin{center} <>= plot(page.words.pco$vectors[,1:2], xlab="PCO 1", ylab="PCO 2", main="Word frequencies") #$ points(page.words.pco$vectors[pageinfo.table$lang[-204] == "A", 1:2], col="red", pch=16) points(page.words.pco$vectors[pageinfo.table$lang[-204] == "B", 1:2], col="blue", pch=16) legend("bottomright", legend=c("A", "B"), col=c("red", "blue"), pch=16) @ \caption{Ordination of word frequencies in Currier A and Currier B pages (paragraph text only).} \label{fig:wordAB} \end{center} \end{figure} \begin{figure} \begin{center} <>= plot(page.chars.pco$vectors[,1:2], xlab="PCO 1", ylab="PCO 2", main="Character frequencies", type="n") points(page.chars.pco$vectors[pageinfo.table$lang[-204] == "A", 1:2], col="red", pch=16, cex=.2) points(page.chars.pco$vectors[pageinfo.table$lang[-204] == "B", 1:2], col="blue", pch=16, cex=.2) text(page.chars.pco$vectors[pageinfo.table$section[-204] == "h", 1:2], "h", col="green", cex=.8, pos=2) text(page.chars.pco$vectors[pageinfo.table$section[-204] == "h2", 1:2], "h2", col="darkgreen", cex=.8, pos=2) text(page.chars.pco$vectors[pageinfo.table$section[-204] == "r", 1:2], "r", col="darkgoldenrod2", cex=.8, pos=2) text(page.chars.pco$vectors[pageinfo.table$section[-204] == "a", 1:2], "a", col="black", cex=.8, pos=2) text(page.chars.pco$vectors[pageinfo.table$section[-204] == "b", 1:2], "b", col="purple", cex=.8, pos=2) text(page.chars.pco$vectors[pageinfo.table$section[-204] == "c", 1:2], "c", col="black", cex=.8, pos=2) text(page.chars.pco$vectors[pageinfo.table$section[-204] == "m", 1:2], "m", col="black", cex=.8, pos=2) @ \caption{Ordination of character frequencies labeled by section and Currier language (paragraph text only).} \label{fig:charSec} \end{center} \end{figure} \begin{figure} \begin{center} <>= plot(page.words.pco$vectors[,1:2], xlab="PCO 1", ylab="PCO 2", main="Word frequencies", type="n") points(page.words.pco$vectors[pageinfo.table$lang[-204] == "A", 1:2], col="red", pch=16, cex=.2) points(page.words.pco$vectors[pageinfo.table$lang[-204] == "B", 1:2], col="blue", pch=16, cex=.2) text(page.words.pco$vectors[pageinfo.table$section[-204] == "h", 1:2], "h", col="green", cex=.8, pos=2) text(page.words.pco$vectors[pageinfo.table$section[-204] == "h2", 1:2], "h2", col="darkgreen", cex=.8, pos=2) text(page.words.pco$vectors[pageinfo.table$section[-204] == "r", 1:2], "r", col="darkgoldenrod2", cex=.8, pos=2) text(page.words.pco$vectors[pageinfo.table$section[-204] == "a", 1:2], "a", col="black", cex=.8, pos=2) text(page.words.pco$vectors[pageinfo.table$section[-204] == "b", 1:2], "b", col="purple", cex=.8, pos=2) text(page.words.pco$vectors[pageinfo.table$section[-204] == "c", 1:2], "c", col="black", cex=.8, pos=2) text(page.words.pco$vectors[pageinfo.table$section[-204] == "m", 1:2], "m", col="black", cex=.8, pos=2) @ \caption{Ordination of word frequencies labeled by section and Currier language (paragraph text only).} \label{fig:wordSec} \end{center} \end{figure} \pagebreak \section{Analysis of Distinct Subsets} I've decided to concentrate my initial analyses on three subsets of data chosen to be distinct based on supposed content and on separation in the ordination diagrams (especially word frequencies). Set R is the recipe section, set B is the balneological section, and Set H is the herbal section in the right tail of the word-frequency PCO. This gives two thematically-different sections in language B, and only one in language A. These groups are shown overlaid on the character ordination (Fig. \ref{fig:charSub}) and the word ordination (Fig. \ref{fig:wordSub}). <>= subset.index <- data.frame(pageinfo$page, rep(0, nrow(pageinfo))) subset.index[pageinfo$section == "r", 2] <- 2 subset.index[pageinfo$section == "b", 2] <- 3 subset.temp <- pageinfo.table[-204,] subset.temp <- data.frame(subset.temp[,1], rep(0, nrow(subset.temp))) subset.temp[page.words.pco$vectors[,1]> (.09) & page.words.pco$vectors[,2] > 0, 2] <- 1 subset.index <- merge(subset.index, subset.temp, by.x = 1, by.y = 1, all.x = TRUE) subset.index[,2] <- subset.index[,2] + subset.index[,3] subset.index <- subset.index[,1:2] colnames(subset.index) <- c("page", "index") rm(subset.temp) seth.pages <- pages[subset.index$index == 1] seth.pages <- seth.pages[!is.na(seth.pages)] setr.pages <- pages[subset.index$index == 2] setr.pages <- setr.pages[!is.na(setr.pages)] setb.pages <- pages[subset.index$index == 3] setb.pages <- setb.pages[!is.na(setb.pages)] seth.page.chars <- page.chars[subset.index$index == 1] setr.page.chars <- page.chars[subset.index$index == 2] setb.page.chars <- page.chars[subset.index$index == 3] seth.page.words <- page.words[subset.index$index == 1] setr.page.words <- page.words[subset.index$index == 2] setb.page.words <- page.words[subset.index$index == 3] write.table(table(unlist(seth.page.chars)), "hchartable.csv", quote=FALSE, row.names=FALSE) write.table(table(unlist(setr.page.chars)), "rchartable.csv", quote=FALSE, row.names=FALSE) write.table(table(unlist(setb.page.chars)), "bchartable.csv", quote=FALSE, row.names=FALSE) write.table(table(unlist(seth.page.words)), "hwordtable.csv", quote=FALSE, row.names=FALSE) write.table(table(unlist(setr.page.words)), "rwordtable.csv", quote=FALSE, row.names=FALSE) write.table(table(unlist(setb.page.words)), "bwordtable.csv", quote=FALSE, row.names=FALSE) @ \begin{figure} \begin{center} <>= temp <- unlist(lapply(page.chars, length)) subset.index.short <- subset.index[temp > 0,] rm(temp) plot(page.chars.pco$vectors[,1:2], xlab="PCO 1", ylab="PCO 2", main="Character frequencies", pch=16, cex=.2) points(page.chars.pco$vectors[pageinfo.table$lang[-204] == "A", 1:2], col="red", pch=16, cex=.2) points(page.chars.pco$vectors[pageinfo.table$lang[-204] == "B", 1:2], col="blue", pch=16, cex=.2) points(page.chars.pco$vectors[subset.index.short[-204,2] == 1, 1:2], col="green", pch=0, cex=.8) points(page.chars.pco$vectors[subset.index.short[-204,2] == 2, 1:2], col="darkgoldenrod2", pch=1, cex=.8) points(page.chars.pco$vectors[subset.index.short[-204,2] == 3, 1:2], col="purple", pch=2, cex=.8) @ \caption{Ordination of character frequencies with group membership superimposed.} \label{fig:charSub} \end{center} \end{figure} \begin{figure} \begin{center} <>= plot(page.words.pco$vectors[,1:2], xlab="PCO 1", ylab="PCO 2", main="Word frequencies", pch=16, cex=.2) points(page.words.pco$vectors[pageinfo.table$lang[-204] == "A", 1:2], col="red", pch=16, cex=.2) points(page.words.pco$vectors[pageinfo.table$lang[-204] == "B", 1:2], col="blue", pch=16, cex=.2) points(page.words.pco$vectors[subset.index.short[-204,2] == 1, 1:2], col="green", pch=0, cex=.8) points(page.words.pco$vectors[subset.index.short[-204,2] == 2, 1:2], col="darkgoldenrod2", pch=1, cex=.8) points(page.words.pco$vectors[subset.index.short[-204,2] == 3, 1:2], col="purple", pch=2, cex=.8) @ \caption{Ordination of word frequencies with group membership superimposed.} \label{fig:wordSub} \end{center} \end{figure} As suggested by the ordination diagram, the character distribution is very similar in sets B and R (Fig. \ref{fig:chardist}). Set H differs in its frequencies of c and e, and also o, t and V. \begin{figure} \begin{center} <>= cnames <- names(table(unlist(page.chars))) hct <- table(factor(unlist(seth.page.chars), levels=cnames)) hct <- hct[-c(1,2,3,4)] hct <- hct[-match("X", names(hct))] hct.sum <- sum(hct) hct <- hct/sum(hct) rct <- table(factor(unlist(setr.page.chars), levels=cnames)) rct <- rct[-c(1,2,3,4)] rct <- rct[-match("X", names(rct))] rct.sum <- sum(rct) rct <- rct/sum(rct) bct <- table(factor(unlist(setb.page.chars), levels=cnames)) bct <- bct[-c(1,2,3,4)] bct <- bct[-match("X", names(bct))] bct.sum <- sum(bct) bct <- bct/sum(bct) plot(1:length(hct), hct, type="b", col="green", pch=names(hct), xlab="Character", ylab="Frequency") lines(1:length(rct), rct, type="b", col="darkgoldenrod2", pch=names(rct)) lines(1:length(bct), bct, type="b", col="purple", pch=names(bct)) @ \end{center} \caption{Character distribution in the three chosen subsets.} \label{fig:chardist} \end{figure} \begin{figure} \begin{center} <>= hwt <- table(unlist(seth.page.words)) hwt <- hwt[-1] hwt.sum <- sum(hwt) hwt.count <- length(hwt) hwt <- hwt[hwt > 1] hwt <- hwt/sum(hwt) hwt <- sort(hwt, decreasing=TRUE) rwt <- table(unlist(setr.page.words)) rwt <- rwt[-1] rwt.sum <- sum(rwt) rwt.count <- length(rwt) rwt <- rwt[rwt > 1] rwt <- rwt/sum(rwt) rwt <- sort(rwt, decreasing=TRUE) bwt <- table(unlist(setb.page.words)) bwt <- bwt[-1] bwt.sum <- sum(bwt) bwt.count <- length(bwt) bwt <- bwt[bwt > 1] bwt <- bwt/sum(bwt) bwt <- sort(bwt, decreasing=TRUE) par(mfrow=c(1,1)) plot(1:50, hwt[1:50], type="b", col="green", xlab="Word rank", ylab="Frequency", main="50 most common words", pch=16) lines(1:50, rwt[1:50], type="b", col="darkgoldenrod2", pch=16) lines(1:50, bwt[1:50], type="b", col="purple", pch=16) @ \caption{Word frequencies in the three chosen subsets.} \label{fig:wordline} \end{center} \end{figure} Word frequency distribution is extremely skewed in set H, with \verb!daVn! being extremely common (Fig. \ref{fig:wordline}). Neither set R nor set B show such an extreme pattern. The ten most common words in set H (by percent occurrence): <>= round(head(hwt * 100, 10)) @ The ten most common words in set R: <>= round(head(rwt * 100, 10)) @ The ten most common words in set B: <>= round(head(bwt * 100, 10)) @ \begin{center} <>= library(xtable) # npages, nchars, nwords, noccurrences, pctunique x <- data.frame(matrix( c(length(seth.pages), hct.sum, hwt.sum, hct.sum/hwt.sum, hwt.count, hwt.sum/hwt.count, 100*length(hwt)/hwt.count, length(setr.pages), rct.sum, rwt.sum, rct.sum/rwt.sum, rwt.count, rwt.sum/rwt.count, 100*length(rwt)/rwt.count, length(setb.pages), bct.sum, bwt.sum, bct.sum/bwt.sum, bwt.count, bwt.sum/bwt.count, 100*length(bwt)/bwt.count) , nrow=3, byrow=TRUE)) dimnames(x) <- list(c("H", "R", "B"), c("Pages", "Chars", "Word occ.", "Word length", "Words", "N occ.", "Pct. Unique Words")) xtable(x, caption="Subset Characteristics", label="tab:setchar", digits=c(0,0,0,0,1,0,1,0)) @ \end{center} The mean word length is slightly greater in set R than in set H (4.7 vs 4.0; Table \ref{tab:setchar}). The mean number of times a word occurs is different among the groups as well - lowest in H, highest in B, with R having an intermediate position. The percentage of words appearing only once is similar in sets H and R, and a bit higher in set B. \pagebreak \section{Appendix 1: Changes to EVA 2006-09-13} \begin{verbatim} %s/!\+//g %s/,/\./g %s/ch/c/g %s/sh/C/g %s/iiii/VV/g %s/iii/W/g %s/ii/V/g %s/cfh/fc/g %s/ckh/kc/g %s/cph/pc/g %s/cth/tc/g %s/\.\+/\./g %s/ \+/ -/ gg s/-/=/ # don't have a good way to do the next step # currently, alternate # /=$ # map j^f^Ilr= %s/=f\(.*\)-/=F\1-/ %s/=k\(.*\)-/=K\1-/ %s/=p\(.*\)-/=P\1-/ %s/=t\(.*\)-/=T\1-/ g/%/d %s/{.\{-}}/X/g %s/h//g %s/-/\./g %s/ \./ -/ %s/\.$/-/ %s/f\([0-9]\)\([rv]\)/f00\1\2/ %s/f\([0-9][0-9]\)\([rv]\)/f0\1\2/ \end{verbatim} \end{document}