Tuesday, April 24, 2012

Online contributions

The digital media I used during this module (Digital Histories) were Twitter, StudyNet, Blogger, Weebly and Tumblr. In assessing these, I feel I used Twitter and Blogger most frequently. StudyNet was useful for posting my blog address, as it made the blog more accessible to others in the class, but most discussions took place on Twitter. I opened a Tumblr account at the beginning of the semester and intended to use it as a visual blog; however, I only managed to upload a few posts on the topics studied this year. I followed other Tumblrs that displayed historical images and videos which I found interesting. Another aspect of my online contributions was Weebly: my essay on data mining was produced as a website, which allowed me to be creative with widgets and to present information interactively.
Additionally, I created a blog to expand on topics covered in class, or on subjects related to them. The blog was written more informally and proved more appealing to readers, as shown by the visitor statistics and the comments left. I used academics' blogs to formulate my essay, as the information gave me a better understanding of each historian's view. My own blog served as a way of publishing work rather than of gathering opinions on the posts: I wanted my work online so that there was a possibility of it being viewed internationally.
Discussions were instigated through Twitter, where I expressed my opinions on some of the debates and comments. Moreover, I used Twitter as a way to state immediate thoughts: a miniature form of blogging on what had been discussed in class. Overall, I feel that throughout the progression of this module, online contributions have assisted my understanding of the methods used to present historical information online.

Sunday, April 22, 2012

What is data mining, and does it encourage the creation of a specific kind of history?



Kurt Thearling defines data mining in its simplest form as 'the process of efficient discovery of nonobvious valuable patterns from a large collection of data.'[1] His definition was intended for business companies. Even though the same definition holds for academics, historians and scholars, the notion is expanded further by describing data mining as the ''mining of knowledge' as distinct from the 'retrieval of data.''[2] This is a general consensus which goes beyond historians and has added more depth to the process of data mining. It shows the difference between methodologies such as 'keyword' searches, which highlight a specific piece of data (a word), and approaches such as the Semantic Web, which highlight an implied meaning within the results.[3] Data mining can be summarised in three steps: classification, clustering and regression. These processes will demonstrate the different methodologies within them. Furthermore, methods such as Ngrams and Topic Modeling will be used to evaluate how the results of data mining are presented, as well as to show how historians interpret them. Finally, the question of whether analysing data online creates a different kind of history will be developed further on.


      
The concept of data mining is that relevant information is extracted from a large corpus of data and then used to formulate a conclusion. However, the complexity of the process varies. One can organise data into tables and charts and from there draw one's own conclusion based on a pattern in the information given. This is feasible with small amounts of data, yet when dealing with Big Data, the data sets are too large and complex for one person to analyse. Data mining accomplishes this by using algorithms to find consistencies and compare them with inconsistent data; this is done in marketing, usually to make a prediction.[4] In addition, data mining draws on databases that are not apparent to the user, because they provide a larger variety of information. When variables are associated with each other, predictions can be made which either contradict or confirm the user's theory. Data mining can form links between different variables; implementing some of these can be risky, because the user does not know how the mining process reached its conclusion. The data is presented in different ways, and visual presentation is a key part of that: it aids the user in understanding the relations more coherently, whether through a graph, table or image. Data mining consists of many tools which analyse information from different perspectives, and it is used particularly to compress large databases which span different fields.[5] These tools process data in order to make it usable.
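As a rough illustration of the idea above, here is a minimal sketch in Python (the column names and figures are invented purely for this example, and the pandas library is assumed): with only a handful of rows a person can read the pattern straight from the table, but the same correlation calculation is what an algorithm applies when the data set is far too large to inspect by hand.

import pandas as pd

# Hypothetical counts of two variables recorded per year (invented figures)
records = pd.DataFrame({
    "year":           [1850, 1860, 1870, 1880, 1890],
    "pamphlets":      [120, 150, 210, 260, 310],
    "court_mentions": [30, 42, 55, 70, 88],
})

# With a handful of rows a person can spot the pattern by eye;
# with millions of rows an algorithm computes the same relationship.
print(records)
print("Correlation:", records["pamphlets"].corr(records["court_mentions"]))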

Classification is the first step in data mining; it contains different techniques for separating data depending on the variables. The tree method is notably used to draw out relations between these variables. When using the tree method, the facts gathered from a range of databases are divided into categories. These are divided further until each group contains only a few factors, creating many branches in the process. A limit is needed to prevent 'overfitting', where categories divide repeatedly until as little as one factor is left in each; this is one of the dangers and disadvantages of the decision tree methodology. For analytical evaluation the tree primarily highlights key variables.[6] This means it is quick and easy to locate variables that branched out first, showing their immediate importance in the relation, rather than those further down on the second or third branch. Interestingly, Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID) are decision tree techniques that accept unclassified data when creating new predictions.[7] Another type of classification is the Naïve Bayesian, which is predominantly reliable and deals specifically with textual data.[8] This differs from the decision tree in one fundamental way: the Naïve Bayesian is based on conditional probability. This means that instead of categorising words based on their association with each other, the Bayesian system relies on the number of times a word appears or does not appear in the text. The extent to which this type of text mining is useful for historians or researchers is debatable. The algorithms would create more results when searching for a word, yet the relevance of the search may not always be practical, because words may have more than one significant connotation, meaning some results will be unrelated to the question. Matt Kirschenbaum applied this kind of analysis to poems written by Emily Dickinson. He used the Naïve Bayesian method to see whether the system could detect an erotic poem, compared with scholarly judgements. He argued that if the technology proved effective, it would confirm information scholars already possessed.[9] This may be the case, but it does not produce new and exciting relationships to debate, or spark a new outlook to be discussed. It might, however, give the reader more confidence in using the method to evaluate other poems or literature in this way.
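As a toy sketch of the Naïve Bayesian approach described above, the snippet below uses Python with scikit-learn (an assumption on my part; this is not Kirschenbaum's actual experiment, and the training texts and labels are invented). It shows the core idea: each text is reduced to word counts, and the classifier judges a new line purely from how often those words appeared in each category.

# Toy Naïve Bayesian text classification; scikit-learn assumed installed.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "burning heart and trembling hands",
    "wild nights of longing and desire",
    "the church bells rang over the quiet town",
    "snow settled softly on the meadow",
]
train_labels = ["erotic", "erotic", "not_erotic", "not_erotic"]

# Represent each text purely by how often each word appears
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts)

# Fit the classifier and score an unseen line
model = MultinomialNB()
model.fit(X, train_labels)
new_line = ["trembling with longing in the night"]
print(model.predict(vectorizer.transform(new_line)))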

David Blei's paper on Topic Modeling caused quite a commotion. With the rise of its popularity, people applied the methodology with ease yet had not quite grasped the concept.[10] In one of Blei's lectures, Topic Modeling is shown to relate to the Bayesian system, but as a 'hierarchical Bayesian system', which provides the most relevant information first. Essentially, Topic Modeling means selecting topics from a body of data; Blei mentioned Wikipedia as an example. Then, by selecting a file, you can connect the topics, which means 'annotating' the file with the use of algorithms that locate the different topics.[11] This is equivalent to the classification process in data mining, but using a different method. Once located, the topics can be presented in different ways. For example, seeing topics change over time can be depicted through a graph, similar to an Ngram. To see connections or relations between topics, branching would assist with this. Finally, images can also be annotated by the algorithms, or through gridding, so that an image is treated like a document. Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) are some of the forms of modeling and, by association, of classification in data mining. LDA helps associate words with a topic, but the topic itself is not named.
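A minimal sketch of LDA in Python follows, assuming a recent version of scikit-learn (my own choice of tooling; Blei's examples use far larger corpora such as Wikipedia, and the documents below are invented). Note that the discovered topics come out numbered rather than named, exactly as described above.

# Toy LDA topic modelling; scikit-learn assumed, documents invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "the army crossed the river before the battle",
    "soldiers and generals planned the siege",
    "the harvest failed and grain prices rose",
    "farmers traded grain and cattle at the market",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print the most probable words for each (unnamed) topic
words = vectorizer.get_feature_names_out()
for topic_id, weights in enumerate(lda.components_):
    top = [words[i] for i in weights.argsort()[-4:][::-1]]
    print(f"Topic {topic_id}: {', '.join(top)}")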

Secondly, clustering techniques are used to group data to detect a possible outcome. This stage causes further complications within data mining, because the results need to be grouped logically before being analysed.[12] K-means is commonly used due to its manageability, and the main factor is the K centroids, one of which sits in each cluster. The position of each centroid is vital during the process, as the centroids move from their locations and in doing so generate different results; keeping a fair distance between centroids optimizes the outcome. Once in place, data points are assigned to the closest centroid; new centroids are then recalculated from the assigned data and continue to move until they reach their final positions. This creates an algorithm for 'minimizing an objective function.'[13] Instead of K-means, search engines use TF*IDF in order to return the documents most significant to the input data (the question) first. Historians have credited Inverse Document Frequency (IDF) by highlighting its experimental nature; however, when it is combined with the frequency of the term (TF) it has advanced 'text retrieval into methods for retrieval of other media, and into language processing techniques for other purposes.'[14] This proves to be a good way of correlating data to provide maximum results.
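The short sketch below ties the two ideas in this step together: documents are weighted with TF*IDF and then grouped by K-means (Python with scikit-learn assumed; the documents are invented for illustration).

# Toy TF*IDF weighting followed by K-means clustering; scikit-learn assumed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "parliament debated the new tax bill",
    "the tax act passed through parliament",
    "the plague spread rapidly through the city",
    "physicians recorded deaths during the plague",
]

# TF*IDF gives rarer, more distinctive words a higher weight
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(documents)

# K-means moves two centroids until each document sits near its closest one
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
for doc, label in zip(documents, labels):
    print(label, doc)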



Lastly, regression in data mining usually uses mathematical formulas: algorithms that determine the prediction or outcome of the results and establish connections between them.[15] Regression is used for quantitative, numerical data, yet textual data can also be drawn on and effectively used for conclusions and interpretations through the Semantic Web. Within the Semantic Web there are web agents; one of these tools is called Armadillo, which exhibits its findings in a Resource Description Framework (RDF). Despite this useful tool, humanities resources are equipped to function without it. This does not mean that they are equipped to deal with the 'black box' problem in data mining.[16] Unfortunately, the 'black box' problem arises when some output data does not correspond with the input data and thus presents unsatisfactory results. In some ways it is similar to the Bayesian system from the earlier classification step, where on some occasions the output data is not relevant to the input data. This is especially impractical for historians.
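As a minimal sketch of regression on quantitative data, the snippet below fits a straight line and uses it to predict a future value (Python with scikit-learn assumed; the figures are invented and do not come from any study mentioned here).

# Toy linear regression: fit a line to yearly figures and extrapolate.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical yearly counts of digitised records (invented figures)
years = np.array([[2005], [2006], [2007], [2008], [2009]])
counts = np.array([110, 135, 160, 190, 220])

model = LinearRegression()
model.fit(years, counts)

# Predict the count for a year outside the observed range
print("Predicted for 2012:", model.predict([[2012]])[0])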


Historian Ben Schmidt created an Ngram on Google, but like most historians he found it difficult to extract much information from it. Ngrams present data in a graph format: the relationship between the two axes, producing a linear sequence of data, can be interpreted visually. From the rise and fall of the information displayed on the graph, obvious correlations can be seen, but further interpretation seems quite difficult to deduce.[17] Nevertheless, more than one variable at a time can be analysed, and the data can be displayed as a direct comparison or similarity depending on the lines representing each word on the graph. For these reasons Ngrams would not be entirely useful for historians, as no further analysis is made. The horizontal axis always displays the year; therefore, in searching a specific question, for example 'when was there a depression in America?', the words 'depression' cross-referenced against 'America' would be analysed as separate entities, counting how many times each was mentioned rather than the significance the words have together. Matthew Hurst analysed Ngrams from a language perspective: he accumulated data on the words themselves rather than correlating them visually. He compared words from different versions of English, American English and British English, to see how they had changed over the years, which was hardly at all. Subsequently, he compared the same word beginning with a capital letter and with a lower-case letter. From the results he gathered that the words with capital letters were used at the start of sentences and appeared more often than those in lower case. Matthew Hurst clearly enjoyed coming up with these types of conclusions and seemed quite excited about the topic.[18]
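A rough sketch of what an Ngram viewer does under the hood is shown below (written in plain Python; the tiny 'corpus' is invented): each word is counted per year as a separate entity, with no sense of whether the two words ever occur together, which is exactly the limitation noted above.

# Counting the relative frequency of each word per year, word by word.
from collections import Counter

corpus = {
    1928: "america prospered while markets grew in america",
    1931: "depression gripped america and depression deepened",
    1933: "recovery from the depression began slowly in america",
}

for word in ("depression", "america"):
    series = {}
    for year, text in corpus.items():
        tokens = text.split()
        # relative frequency: occurrences of the word / total words that year
        series[year] = Counter(tokens)[word] / len(tokens)
    print(word, series)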




It can be argued that digitising history has ushered in a new era in the way information is researched. Stephen Ramsay credits this in his appraisal of big data, but he states that it would nonetheless further encourage traditional humanist research. Methodologies in data mining have assisted historians by categorising information and producing new, perhaps previously unthought-of, perspectives behind the information displayed. Furthermore, Tim Hitchcock and some of his colleagues consider historians who discredit this type of history to be fairly old-fashioned, isolated people within archives.[19] Intriguingly, this interpretation of historians may have stemmed from the debate of books versus Internet books and journals: arguing whether reading a physical book is better than reading text off a screen, and vice versa. In relation to data mining, it could be argued that digital history is easier and more precise for research. On the other hand, books were the original form of information, and digital copies cannot replace the personal interaction and sentiment that one feels with the document. This is a further point to discuss; nonetheless, everyone has their own preference, but what can be noted is that more and more data is being uploaded.

To conclude, data mining is a mixture of classification, clustering and regression. Classification organises the data, which can be done through a number of methods; for text mining in particular, the decision tree or Naïve Bayesian would be appropriate. When the data is passed on to clustering it is grouped through systems such as K-means, and lastly, during regression, algorithms use mathematical methods to correlate the data for future predictions. In comparison, Topic Modeling has been credited as a well-known form of classification; it has the potential to develop the way we perceive data and to form new correlations between it. The Ngram may be considered a progression that opens new questions, yet most scholars would state that it is limited in what information can be extracted from it. Furthermore, this is the case for some historians: military historians would argue that 'this ontology does not represent their knowledge.'[20] This shows that categorised data cannot replace the deeper meaning that historians derive when analysing documents, rather than merely forming parallels.



[1] A Data Mining Glossary, http://www.thearling.com/glossary.htm; consulted 15 April 2012
[2] Fabio Ciravegna, Mark Greengrass, Tim Hitchcock, Sam Chapman, Jamie Mc Laughlin and Ravish Bhagdev, 'Finding Needles in the Haystacks: Data-mining in Distributed Historical Datasets' in Mark Greengrass and Lorna Hughes, The Virtual Representation of the Past (Surrey, 2008), p. 66
[3] Fabio Ciravegna, Mark Greengrass, Tim Hitchcock, Sam Chapman, Jamie Mc Laughlin and Ravish Bhagdev, 'Finding Needles in the Haystacks', pp. 65–67
[4] Data mining techniques, http://www.obgyn.cam.ac.uk/cam-only/statsbook/stdatmin.html; consulted 15 April 2012  
[5] Data Mining: What is Data Mining?, http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm; consulted 14 April 2012
[6] Data mining for process improvement, http://www.crosstalkonline.org/storage/issue-archives/2011/201101/201101-Below.pdf; consulted 15 April 2012
[7] Data Mining: What is Data Mining?, http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm; consulted 14 April 2012
[8] New Methods for Humanities Research, http://people.lis.illinois.edu/~unsworth/lyman.htm; consulted 15 April 2012
[9] New Methods for Humanities Research, http://people.lis.illinois.edu/~unsworth/lyman.htm; consulted 15 April 2012
[10] Topic Modeling and Network Analysis, http://www.scottbot.net/HIAL/?p=221; consulted 18 April 2012
[11] Topic models, http://videolectures.net/mlss09uk_blei_tm/; consulted 18 April 2012
[12] New Methods for Humanities Research, http://people.lis.illinois.edu/~unsworth/lyman.htm; consulted 15 April 2012
[13] New Methods for Humanities Research, http://people.lis.illinois.edu/~unsworth/lyman.htm; consulted 15 April 2012
[14] Stephen Robertson, 'Understanding Inverse Document Frequency: On Theoretical Arguments for IDF', Journal of Documentation, vol. 60, no. 5, pp. 503–520
[15] Introduction to data mining, http://www.youtube.com/watch?v=_QH4oIOd9nc; consulted 13 April 2012 
[16] Fabio Ciravegna, Mark Greengrass, Tim Hitchcock, Sam Chapman, Jamie Mc Laughlin and Ravish Bhagdev, 'Finding Needles in the Haystacks', pp. 67–78
[17]  Sapping Attention, http://sappingattention.blogspot.co.uk/; consulted 18 April 2012
[19]  With Criminal Intent, http://criminalintent.org/; consulted 18 April 2012
[20] Fabio Ciravegna, Mark Greengrass, Tim Hitchcock, Sam Chapman, Jamie Mc Laughlin and Ravish Bhagdev, 'Finding Needles in the Haystacks', p. 72

Saturday, April 21, 2012

Tucker's Websites


For my digital histories essay, I wanted to create a website. Through Weebly I thought this would be pretty straightforward, because you can place text and images wherever you want them with the click of a button. I thought I would go a step further and speak to a friend of mine who does digital media, to see how he goes about creating websites using different programs. These programs include Photoshop, Flash and Dreamweaver. He is a student at the University of Hertfordshire and his name is Chris Tucker; no, not the actor, but yes, the web designer.
  
  
The program he used for his first website, which was about his 'past, present and future' and therefore quite personal to him, was Photoshop. According to him, Photoshop has two good qualities for creating a website.

Firstly, Photoshop is fairly simple to use. Secondly, when placing imagery in Photoshop, the images stay in the same position once the site is uploaded online.


However, like all things it has its bad qualities too. 

Firstly, the page is exported as a single picture, so the text cannot be embedded in the HTML. This means that when the website is uploaded online, Search Engine Optimisation (SEO) cannot function. SEO is when search engines place a website at the top of the results because it has the highest number of words relating to the search query, or, in some cases, because the website has paid to be placed first. He said that not having this would hamper promotion of the website if he decided to use it for a business.
 
Secondly, the fact that the website is internally an image, rather than being uploaded as separate image and text entities, means that an error would cause the whole web page to be taken down for repair rather than just the affected section.
 
From looking at the website you can see that it is very minimalistic. This is mainly because it reflects his personality as an easygoing guy. However, the contrast of a greyscale background with colour pictures really catches the eye. The layout is very simple and easy to use, but to be fair it is only made up of three interlinked pages. Moreover, I think the text could have been slightly bigger and set in a lighter frame, because personally I find the thick black border surrounding it emphasises the small size.



The second website that Chris created was an invite to a fictional birthday party. This is my favourite of the three, and he used Flash instead of Photoshop to create it. Once again he incorporated his minimalistic style, yet used animated colour objects. I feel that the grey design he used for this particular website is not entirely inviting for a birthday party. Similarly, he has not added a cover page stating that this is his birthday invitation. That said, the website contains three pages, which I feel is appropriate for a birthday invitation. He also used a different font for the title, which gives it a stylish appeal. He adds much more information this time and uses a light grey box rather than an outline, which is good as it doesn't close off the text. The only other thing that would make the website complete would be placing the content in the centre of the page.

The highlight of using Flash is ...


The animation features. These made the website more interactive and, as Chris said, brought more 'to the character of the website.' The balloons and the birthday badge were animated to come onto the screen, rather like how animation is used in PowerPoint. What's more, the buttons at the top of the page (links to other pages within the website) are animated to show a white ring around them when the cursor is near. This feature is not available in the other two programs, thus giving Flash that edge.

The downside to using Flash...

Flash is a video-based program, which means that people will have to wait longer for it to download, especially if they have a poor Internet connection. Also, Flash is not compatible with phones.

Chris found this website the most time-consuming. He said that the process was difficult and the smallest fault would stop the website from functioning. This is a major problem compared to Photoshop, because at least with Photoshop a correction can be made and uploaded again, but with Flash this is not the case. In addition, SEO is once again affected, as the whole site is embedded in a video, which prevents keyword searches.




Finally, Chris used Dreamweaver for his last website and found it the most convenient. The design has changed more than in the other two because this time he has used a brown colour scheme instead of grey. He also has a tribal-like logo which appears four times at the top of the page. A scroll bar has been enabled within the website rather than sitting on the far right outside the margins. In addition, he has a separate page for contact information rather than displaying it on the homepage. Consequently, the website has two extra pages; the reason for this is that, structurally, it contains a style sheet which makes it easier to add more pages and keep them in place.
Lastly, even though animation could not be added, unlike the other two it managed to separate the text data from the images. This meant that SEO could finally operate.

One disadvantage amid all these wonderful features is that if you make an error with the margins and padding at the beginning, it is extremely difficult to rectify, which can block the whole creation process.

  
Overall, Chris stated that he would prefer Dreamweaver as it had more benefits, such as being less time-consuming than the rest. However, I would have to disagree: it may be less time-consuming, but the animation was far more effective. As these three websites have shown, Chris has a lot of potential to uncover in future websites. As far as I can see, the use of colour has finally come out, and this is only the beginning.

Friday, April 20, 2012

The Suez Crisis


The two weeks I had off (if you can call it that) for Easter were mostly spent reading up on the Suez Crisis.




The essay title that I had to complete was...

 


‘The US government actually quite liked the British holding a dominant position in the Middle East. The 1957 Eisenhower Doctrine was unforeseen and the result of botched Anglo-American diplomacy during the Suez crisis’. Discuss.
  
These were my conclusions...

The debate on whether the US liked Britain holding a dominant position in the Middle East was illustrated through a number of factors, such as their 'special friendship', Britain's responsibilities and its location. These factors initially showed America approving of Britain's position. However, comparing views before and after the Suez Crisis revealed how America, after the crisis, disliked Britain's position due to its association with colonialism. This had implications for US interests such as anti-communist defence and, possibly, the flow of oil secured by good Arab relations. Anti-communist legislation in the form of the Eisenhower Doctrine was to be expected, in response to Britain being connected to colonialism and its economy and military strength being weakened, and also in order to fill the power vacuum. Contrary to this belief, the British did not anticipate the US responding in this way; they had hoped the US would support Britain in maintaining her position through the Baghdad Pact and thus share her responsibilities in the Middle East.

Moreover, some Americans did not foresee the Doctrine because it failed to disclose how much US finance and how many resources were to be given, not to mention that the speed at which it was finalised was unexpected. Overall, the Eisenhower Doctrine was largely unforeseen, but the American government was under the impression that, at the end of the Suez Crisis, the doctrine was necessary to prevent communism spreading to the Middle East. Finally, the Eisenhower Doctrine was seen as another method of preventing communism, as well as an indirect attempt to topple Nasser. From what can be gathered, the doctrine was tailored to benefit the US more than other countries. Aiding King Saud to gain a dominant position in the Middle East would have counterbalanced Nasser and maintained the flow of oil as well as a good relationship between the two countries.

Typically, the lecture that covered the Eisenhower Doctrine took place after the Easter holidays. I was quite annoyed that I couldn't put in one more theory that Tony Shaw had mentioned: America felt that it had to step in to control the mess that the British had made of the Suez Crisis.
 
            
Creative Commons Licence
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.