<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cyberborean Chronicles &#187; Information Retrieval</title>
	<atom:link href="http://blog.cyberborean.org/tag/information-retrieval/feed" rel="self" type="application/rss+xml" />
	<link>http://blog.cyberborean.org</link>
	<description>by Alex Alishevskikh</description>
	<lastBuildDate>Wed, 18 Jan 2012 07:52:32 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=</generator>
		<item>
		<title>SCAN Version 1.3 released</title>
		<link>http://blog.cyberborean.org/2008/05/12/scan-version-13-released</link>
		<comments>http://blog.cyberborean.org/2008/05/12/scan-version-13-released#comments</comments>
		<pubDate>Mon, 12 May 2008 19:41:24 +0000</pubDate>
		<dc:creator>Alex Alishevskikh</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[desktop]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[SCAN]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[tagging]]></category>

		<guid isPermaLink="false">http://cyberborean.wordpress.com/?p=258</guid>
		<description><![CDATA[What&#8217;s new in version 1.3 »]]></description>
			<content:encoded><![CDATA[<p><img class="alignright size-full wp-image-259" src="http://cyberborean.org/blog/wp-content/uploads/2008/05/scan.png" alt="" width="64" height="64" /><a href="http://scan.sourceforge.net/?page_id=30">What&#8217;s new in version 1.3 »</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.cyberborean.org/2008/05/12/scan-version-13-released/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SCAN FAQ updated</title>
		<link>http://blog.cyberborean.org/2008/02/11/scan-faq-updated</link>
		<comments>http://blog.cyberborean.org/2008/02/11/scan-faq-updated#comments</comments>
		<pubDate>Mon, 11 Feb 2008 13:31:05 +0000</pubDate>
		<dc:creator>Alex Alishevskikh</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[desktop]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[SCAN]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[tagging]]></category>

		<guid isPermaLink="false">http://cyberborean.wordpress.com/?p=249</guid>
		<description><![CDATA[A new version of the SCAN Frequently Asked Questions page is available. &#8220;How does SCAN help me?&#8221;, &#8220;Why should I use it?&#8221;, &#8220;Who are the users?&#8221;, &#8220;Why it is smart?&#8221;, &#8220;Can it replace a &#8230;?&#8221;, &#8220;What is autotagging?&#8221;, technical tips-n-tricks, development questions and a lot of other things you would want to know about Smart Content Aggregation [...]]]></description>
			<content:encoded><![CDATA[<p>A new version of the <a href="http://scan.sourceforge.net/?page_id=20">SCAN Frequently Asked Questions</a> page is available.</p>
<p>&#8220;How does SCAN help me?&#8221;, &#8220;Why should I use it?&#8221;, &#8220;Who are the users?&#8221;, &#8220;Why it is smart?&#8221;, &#8220;Can it replace a &#8230;?&#8221;, &#8220;What is autotagging?&#8221;, technical tips-n-tricks, development questions and a lot of other things you would want to know about Smart Content Aggregation &amp; Navigation technology.</p>
<p><a href="http://scan.sourceforge.net/?page_id=20">Read SCAN FAQ »</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.cyberborean.org/2008/02/11/scan-faq-updated/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>More automation</title>
		<link>http://blog.cyberborean.org/2008/01/23/more-automation</link>
		<comments>http://blog.cyberborean.org/2008/01/23/more-automation#comments</comments>
		<pubDate>Wed, 23 Jan 2008 00:07:32 +0000</pubDate>
		<dc:creator>Alex Alishevskikh</dc:creator>
				<category><![CDATA[Essays]]></category>
		<category><![CDATA[email]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[metadata]]></category>
		<category><![CDATA[SCAN]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[tagging]]></category>

		<guid isPermaLink="false">http://cyberborean.wordpress.com/2008/01/23/more-automation/</guid>
		<description><![CDATA[I&#8217;m thinking about a new feature for SCAN: conditional actions to be executed individually or in a batch on selected documents. It would be useful for automating metadata setting or for defining custom autotagging rules. The idea is borrowed from e-mail clients, where a similar feature has existed for decades as user-defined filters [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m thinking about a new feature for <a href="http://scan.sf.net">SCAN</a>: conditional actions to be executed individually or in a batch on selected documents. It would be useful for automating metadata setting or for defining custom autotagging rules.</p>
<p><span id="more-200"></span></p>
<p>The idea is borrowed from e-mail clients, where a similar feature has existed for decades as user-defined <a href="http://en.wikipedia.org/wiki/E-mail_filtering">filters</a> for processing messages. This is how it looks in KMail:</p>
<p><img src="http://cyberborean.org/blog/wp-content/uploads/2008/01/kmail-filters.png" alt="kmail-filters.png" /></p>
<p>In general, a filter checks whether a document matches specific criteria (a rule) and performs some action if it does. For instance, if the condition &#8220;text contains &#8216;viagra&#8217; or &#8216;cialis&#8217;&#8221; is true, then some action (&#8220;move to spam&#8221; or &#8220;send assassins to the author&#8221;) is executed. What is especially good is that it&#8217;s an old, popular and intuitive user experience.</p>
<p>In a content aggregator like SCAN, this concept would let a user define custom automation rules that set document metadata properties. For instance,</p>
<p><code>IF (url starts with "http://cyberborean.wordpress.com") SET author = "me"</code></p>
<p>Another use I have in mind is to augment the &#8220;artificial intelligence&#8221; of autotagging with human intelligence, through user-defined tagging rules:</p>
<p><code>IF (text contains "latte") ADD TAG "coffee"</code></p>
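<p>The two rules above could be modelled roughly as condition/action pairs. This is only a minimal sketch in Python, not SCAN code: the <code>Document</code> and <code>Rule</code> classes, the <code>apply_rules</code> function and the case-insensitive text match are all assumptions made for illustration.</p>

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Document:
    # Hypothetical document record: a URL, extracted text, metadata and tags.
    url: str
    text: str
    metadata: dict = field(default_factory=dict)
    tags: set = field(default_factory=set)


@dataclass
class Rule:
    condition: Callable  # takes a Document, returns True or False
    action: Callable     # takes a Document, modifies it in place


def apply_rules(rules, documents):
    """Check every rule against every document and run the action on a match."""
    for doc in documents:
        for rule in rules:
            if rule.condition(doc):
                rule.action(doc)


def set_author(doc):
    doc.metadata["author"] = "me"


def add_coffee_tag(doc):
    doc.tags.add("coffee")


rules = [
    # IF (url starts with "http://cyberborean.wordpress.com") SET author = "me"
    Rule(lambda d: d.url.startswith("http://cyberborean.wordpress.com"), set_author),
    # IF (text contains "latte") ADD TAG "coffee"  (case-insensitive by assumption)
    Rule(lambda d: "latte" in d.text.lower(), add_coffee_tag),
]

docs = [
    Document("http://cyberborean.wordpress.com/2008/01/23/more-automation/", "Filters and rules"),
    Document("http://example.org/cafe", "A double Latte, please"),
]
apply_rules(rules, docs)
```

<p>Keeping the condition and the action as separate callables means the same rule list can be run on a single document or on a batch, which is the behaviour described above.</p>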
<p>My only doubt is about terminology: &#8220;filter&#8221; might be confusing, as it is already used in the SCAN vocabulary (URI filters include or exclude documents by their URI pattern). Any ideas?</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.cyberborean.org/2008/01/23/more-automation/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SCAN 1.2 released</title>
		<link>http://blog.cyberborean.org/2008/01/08/scan-12-released</link>
		<comments>http://blog.cyberborean.org/2008/01/08/scan-12-released#comments</comments>
		<pubDate>Tue, 08 Jan 2008 08:58:33 +0000</pubDate>
		<dc:creator>Alex Alishevskikh</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[desktop]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[SCAN]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[tagging]]></category>

		<guid isPermaLink="false">http://cyberborean.wordpress.com/2008/01/08/scan-12-released/</guid>
		<description><![CDATA[Along with lots of minor bugfixes and performance tweaks, SCAN 1.2 introduces a few essential improvements, mainly in the search experience and plugin management&#8230; (Read more on SCAN website)]]></description>
			<content:encoded><![CDATA[<p>Along with lots of minor bugfixes and performance tweaks, SCAN 1.2 introduces a few essential improvements, mainly in the search experience and plugin management&#8230;</p>
<p><a href="http://scan.sourceforge.net/?page_id=28">(Read more on SCAN website)</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.cyberborean.org/2008/01/08/scan-12-released/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SCAN Mail plugin</title>
		<link>http://blog.cyberborean.org/2007/11/14/scan-mail-plugin</link>
		<comments>http://blog.cyberborean.org/2007/11/14/scan-mail-plugin#comments</comments>
		<pubDate>Wed, 14 Nov 2007 14:06:54 +0000</pubDate>
		<dc:creator>Alex Alishevskikh</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[email]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[javamail]]></category>
		<category><![CDATA[SCAN]]></category>
		<category><![CDATA[search]]></category>

		<guid isPermaLink="false">http://cyberborean.wordpress.com/2007/11/14/scan-mail-plugin/</guid>
		<description><![CDATA[I&#8217;d like to announce that SCAN is now able to work with your email. The mail plugin released yesterday introduces a new type of location for the SCAN repository &#8211; mailbox locations. The plugin&#8217;s purpose is to crawl the specified local email folders and aggregate the email messages as documents in the SCAN repository. It also [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;d like to announce that SCAN is now able to work with your email. The <a href="http://scan.sourceforge.net/?page_id=27">mail plugin</a> released yesterday introduces a new type of location for the SCAN repository &#8211; mailbox locations.</p>
<p><span id="more-188"></span></p>
<p>The plugin&#8217;s purpose is to crawl the specified local email folders and aggregate the email messages as documents in the SCAN repository. It also introduces a new document type, &#8220;<code>message/rfc822</code>&#8221;, for emails and uses the email message headers to set document metadata. Attached files are extracted from messages and processed as separate documents with the appropriate parsers, depending on their content type.</p>
<p>As there is no common convention for identifying an individual message so that it can be opened with an external application, the mail plugin implements its own message viewer UI to open messages by default. The messages are identified with the &#8220;<code>mid:</code>&#8221; URI scheme, which <a href="http://www.ietf.org/rfc/rfc2392.txt">is a standard</a> but, unfortunately, does not seem to be supported by any known MUA so far. However, it is implemented in the hope that future MUAs will support this standard scheme for opening specific messages from their command line (something like &#8220;<code>thunderbird mid:<em>message-id</em></code>&#8221;).</p>
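<p>To illustrate the &#8220;<code>mid:</code>&#8221; scheme: per RFC 2392, the URI is the Message-ID header value without its angle brackets, with reserved characters percent-encoded. The sketch below uses Python&#8217;s standard library rather than the plugin&#8217;s JavaMail code; the function name <code>mid_uri</code> and the choice to leave only &#8220;@&#8221; unescaped are assumptions.</p>

```python
from email import message_from_string
from urllib.parse import quote


def mid_uri(raw_message):
    """Build a mid: URI from the Message-ID header, or return None if it is missing."""
    msg = message_from_string(raw_message)
    message_id = msg["Message-ID"]
    if message_id is None:
        return None
    # Drop surrounding whitespace and the angle brackets around the id.
    addr_spec = message_id.strip().strip("<>")
    # Percent-encode reserved characters; "@" is kept as in the RFC examples.
    return "mid:" + quote(addr_spec, safe="@")


raw = "Message-ID: <960830.1639@XIson.com>\nSubject: hello\n\nbody text\n"
uri = mid_uri(raw)
```

<p>For the sample message, <code>uri</code> is <code>mid:960830.1639@XIson.com</code>, the same form RFC 2392 uses in its examples.</p>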
<p>The mail plugin uses the <a href="http://www.gnu.org/software/classpathx/javamail/javamail.html">GNU JavaMail</a> implementation because it includes JavaMail providers for local mail stores &#8211; <a href="http://www.qmail.org/man/man5/mbox.html">mboxes</a> and <a href="http://www.qmail.org/qmail-manual-html/man5/maildir.html">maildirs</a>. In addition, Outlook Express mail is supported via a separate JavaMail provider made by the <a href="http://sourceforge.net/projects/jmbox">jmbox project</a>. Thus, the mail plugin supports a wide range of popular MUAs, including mbox-based (Mozilla family), maildir-based, mixed mbox/maildir (KMail and Evolution) and Outlook Express.</p>
<p>The plugin has been tested on Linux with Mozilla Thunderbird 1.5 and KMail 1.9.6 (in maildir mode) and on Windows XP with Mozilla Thunderbird 1.5 and MS Outlook Express 5.0.</p>
<h3>Problems?</h3>
<p>Yep, there are. The issues with the mail plugin stem both from GNU JavaMail limitations and from particular MUAs, which seem to interpret the mbox/maildir conventions loosely for their own convenience. For instance, Thunderbird uses non-standard &#8220;.sbd&#8221; directories to keep mail subfolders, which the mbox provider is unaware of. So, recursive crawling of Thunderbird mail folders does not work. The only solution to these problems is to develop MUA-specific JavaMail providers that know how to deal with a concrete mbox or maildir implementation with all its peculiarities.</p>
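<p>As a rough illustration of what an MUA-aware crawler would need to do for Thunderbird, the sketch below walks a mail directory and descends into &#8220;.sbd&#8221; directories to find subfolder mboxes. It is Python, not a JavaMail provider, and the suffix list used to skip non-mail files is a guess, not a complete description of Thunderbird&#8217;s layout.</p>

```python
import os
import tempfile

# Assumed heuristic: index and settings files that are not mbox folders.
NON_MAIL_SUFFIXES = (".msf", ".dat", ".html")


def find_mbox_files(folder):
    """Collect mbox files in folder, descending into '.sbd' subfolder directories."""
    found = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isdir(path):
            # Only '.sbd' directories hold subfolders; ignore any other directory.
            if name.endswith(".sbd"):
                found.extend(find_mbox_files(path))
        elif not name.endswith(NON_MAIL_SUFFIXES):
            found.append(path)
    return found


# Build a small fake mail directory to demonstrate the traversal.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "Inbox.sbd", "Work.sbd"))
    for rel in ("Inbox", "Inbox.msf",
                os.path.join("Inbox.sbd", "Work"),
                os.path.join("Inbox.sbd", "Work.msf"),
                os.path.join("Inbox.sbd", "Work.sbd", "Deep")):
        open(os.path.join(root, rel), "w").close()
    relative = [os.path.relpath(p, root).replace(os.sep, "/")
                for p in find_mbox_files(root)]
```

<p>Here <code>relative</code> lists the top-level &#8220;Inbox&#8221; together with the nested &#8220;Work&#8221; and &#8220;Deep&#8221; folders, which a plain mbox reader that ignores &#8220;.sbd&#8221; directories would miss.</p>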
<p>The Outlook Express provider, as an example of such an MUA-specific implementation, works rather well; however, we noticed a bug with localized MSOE versions, where translated folder names (&#8220;Inbox&#8221;, &#8220;Sent&#8221;, &#8220;Trash&#8221;, etc.) with non-Latin letters were not processed.</p>
<p>Reports on testing the mail plugin with other MUAs/platforms are more than welcome.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.cyberborean.org/2007/11/14/scan-mail-plugin/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SCAN 1.1 released</title>
		<link>http://blog.cyberborean.org/2007/10/22/scan-11-released</link>
		<comments>http://blog.cyberborean.org/2007/10/22/scan-11-released#comments</comments>
		<pubDate>Mon, 22 Oct 2007 02:03:09 +0000</pubDate>
		<dc:creator>Alex Alishevskikh</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[desktop]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[SCAN]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[tagging]]></category>

		<guid isPermaLink="false">http://cyberborean.wordpress.com/2007/10/22/scan-11-released/</guid>
		<description><![CDATA[See what&#8217;s new in version 1.1.]]></description>
			<content:encoded><![CDATA[<p>See <a href="http://scan.sourceforge.net/?page_id=26">what&#8217;s new in version 1.1</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.cyberborean.org/2007/10/22/scan-11-released/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tags beauty</title>
		<link>http://blog.cyberborean.org/2007/10/05/tags-beauty</link>
		<comments>http://blog.cyberborean.org/2007/10/05/tags-beauty#comments</comments>
		<pubDate>Fri, 05 Oct 2007 17:04:25 +0000</pubDate>
		<dc:creator>Alex Alishevskikh</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[cluster]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[SCAN]]></category>
		<category><![CDATA[tagging]]></category>
		<category><![CDATA[taxonomy]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://cyberborean.wordpress.com/2007/10/05/tags-beauty/</guid>
		<description><![CDATA[The forthcoming SCAN 1.1 will be released together with a new plugin: TagClusters Panel. TagClusters is a user interface extension (like Dates Panel) for taxonomy visualization. TagClusters visualizes the selected tags as overlapping clusters of documents. With that cluster map, it is easy to see how the tags relate to each other via the documents [...]]]></description>
			<content:encoded><![CDATA[<p>The forthcoming <a href="http://scan.sf.net">SCAN</a> 1.1 will be released together with a new plugin: TagClusters Panel. TagClusters is a user interface extension (like <a href="http://scan.sourceforge.net/?page_id=18">Dates Panel</a>) for taxonomy visualization.</p>
<p><span id="more-184"></span></p>
<p><img src="http://cyberborean.org/blog/wp-content/uploads/2007/10/tagclusters_tmb.jpg" alt="TagClusters" /></p>
<p>TagClusters visualizes the selected tags as overlapping clusters of documents. With that cluster map, it is easy to see how the tags relate to each other via the documents they have in common. The plugin uses Tags Grouping &#8211; a new core SCAN feature for finding groups of interrelated tags. In TagClusters, this feature is used to automatically expand a selected tag into a group of its semantic neighbors. By clicking a single tag, a user sees a map visualizing this tag plus all related tags, so the whole taxonomy can be explored with just a few mouse clicks.</p>
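<p>One simple way to think about tags that are related &#8220;via the documents they have in common&#8221; is to measure how much their document sets overlap. The sketch below uses the Jaccard index for that. This is an assumption for illustration only: the post does not say how Tags Grouping actually scores relatedness, and the function name, threshold and sample data are made up.</p>

```python
def related_tags(tag_docs, selected, min_overlap=0.2):
    """Return tags whose document sets overlap enough with the selected tag's set."""
    base = tag_docs[selected]
    related = {}
    for tag, docs in tag_docs.items():
        if tag == selected:
            continue
        union = base.union(docs)
        # Jaccard index: shared documents divided by all documents of both tags.
        score = len(base.intersection(docs)) / len(union) if union else 0.0
        if score >= min_overlap:
            related[tag] = score
    return related


# Hypothetical tag-to-document-id mapping.
tag_docs = {
    "frodo": {1, 2, 3, 4},
    "sam": {2, 3, 4, 5},
    "gollum": {4, 5, 6},
    "ents": {7, 8},
}
neighbors = related_tags(tag_docs, "frodo")
```

<p>With the default threshold only &#8220;sam&#8221; is returned as a neighbor of &#8220;frodo&#8221;; lowering <code>min_overlap</code> to 0.1 also brings in &#8220;gollum&#8221;, which shares a single document.</p>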
<p>These colorful amoebae are drawn by the <a href="http://www.aduna-software.com/technologies/clustermap/overview.view">Aduna Cluster Map</a> library. It&#8217;s just eye candy &#8211; it&#8217;s going to be the sexiest SCAN panel, I think.</p>
<p>What else is new in 1.1? Well, hm, a plugin for scanning Del.icio.us accounts, for instance. It&#8217;s coming soon &#8211; watch for announcements <a href="http://sourceforge.net/export/rss2_projnews.php?group_id=189359"><img src="http://scan.sourceforge.net/wp-includes/images/rss.png" alt="RSS" /></a> !</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.cyberborean.org/2007/10/05/tags-beauty/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tagging &#8220;The Lord of the Rings&#8221;</title>
		<link>http://blog.cyberborean.org/2007/09/18/tagging-the-lord-of-the-rings</link>
		<comments>http://blog.cyberborean.org/2007/09/18/tagging-the-lord-of-the-rings#comments</comments>
		<pubDate>Tue, 18 Sep 2007 07:42:40 +0000</pubDate>
		<dc:creator>Alex Alishevskikh</dc:creator>
				<category><![CDATA[Essays]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[SCAN]]></category>
		<category><![CDATA[tagging]]></category>
		<category><![CDATA[tolkien]]></category>

		<guid isPermaLink="false">http://cyberborean.wordpress.com/2007/09/18/tagging-the-lord-of-the-rings/</guid>
		<description><![CDATA[Here are the results of an experiment in using SCAN for text analysis and autotagging of J.R.R. Tolkien&#8217;s &#8220;The Lord of the Rings&#8221;. Process: The canonical LotR corpus consists of 62 chapters (arranged into six books), the author&#8217;s foreword, the Prologue and 6 appendices (A-F) &#8211; thus, 70 items. This document collection (as OpenDocument files) has been indexed and autotagged [...]]]></description>
			<content:encoded><![CDATA[<p>Here are the results of an experiment in using <a href="http://scan.sourceforge.net">SCAN</a> for text analysis and autotagging of J.R.R. Tolkien&#8217;s &#8220;The Lord of the Rings&#8221;.<br />
<span id="more-181"></span><br />
Process:<br />
The canonical LotR corpus consists of 62 chapters (arranged into six books), the author&#8217;s foreword, the Prologue and 6 appendices (A-F) &#8211; thus, 70 items. This document collection (as OpenDocument files) was indexed and autotagged with <a href="http://scan.sourceforge.net">SCAN 1.0</a>. SCAN autotagging was set to use 10 tags per document and maximal tag specificity (because of the relatively small collection size). Additionally, each of the six books was tagged separately.</p>
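<p>For readers curious how a fixed number of &#8220;specific&#8221; tags per document can be picked automatically, here is a minimal TF-IDF sketch in Python. It is not SCAN&#8217;s algorithm, which this post does not describe; treating &#8220;specificity&#8221; as inverse document frequency is an assumption, and the tiny sample texts are invented.</p>

```python
import math
import re
from collections import Counter


def autotag(documents, tags_per_doc=10):
    """Pick the highest-scoring TF-IDF terms of each document as its tags."""
    # Lowercase and split into runs of letters (accented letters included).
    tokenized = {name: re.findall(r"[^\W\d_]+", text.lower())
                 for name, text in documents.items()}
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter()
    for words in tokenized.values():
        doc_freq.update(set(words))
    n_docs = len(documents)
    result = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        # Words found in every document get a score of zero.
        scores = {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[name] = ranked[:tags_per_doc]
    return result


sample = {
    "party": "bilbo birthday party fireworks bilbo the ring",
    "forest": "tom willow forest path the tom",
    "ford": "strider ford horse the ring",
}
tags = autotag(sample, tags_per_doc=2)
```

<p>A word like &#8220;the&#8221; occurs in every sample text, so it scores zero and never becomes a tag, while names repeated within one text rise to the top.</p>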
<p>&#8220;Foreword&#8221;<br />
<em>chapters corrected edition experience glimpses information process readers reference story</em></p>
<p>&#8220;Prologue&#8221;<br />
<em>bilbo book copy families history hobbits shire smials thain westmarch</em></p>
<h3>Book I</h3>
<p><em>baggins bilbo bree butterbur frodo goldberry ponies sam strider tom</em></p>
<ol>
<li>&#8220;A Long-expected Party&#8221;<br />
<em>bag baggins bagginses bilbo birthday fireworks joke lobelia presents sackville</em></li>
<li>&#8220;The Shadow of the Past&#8221;<br />
<em>bilbo birthday déagol frodo gandalf gollum grandmother hated ring ted</em></li>
<li>&#8220;Three is Company&#8221;<br />
<em>bag frodo gildor lane lobelia pippin sackville sam sir sold</em></li>
<li>&#8220;A Short Cut to Mushrooms&#8221;<br />
<em>dogs farm farmer ferry frodo lane maggot mushrooms pippin waggon</em></li>
<li>&#8220;A Conspiracy Unmasked&#8221;<br />
<em>bath brandy brandybucks bucklanders fatty ferry fredegar frodo maggot merry</em></li>
<li>&#8220;The Old Forest&#8221;<br />
<em>bonfire forest hedge lilies merry path ponies tom willow withywindle</em></li>
<li>&#8220;In the House of Tom Bombadil&#8221;<br />
<em>barrows bombadil candle chimney creak derry goldberry tom wights willow</em></li>
<li>&#8220;Fog on the Barrow-Downs&#8221;<br />
<em>barrow bombadil fog goldberry hoy jogging lumpkin ponies tom wight</em></li>
<li>&#8220;At the Sign of The Prancing Pony&#8221;<br />
<em>barliman bree butterbur comer cow frodo inn landlord nob underhill</em></li>
<li>&#8220;Strider&#8221;<br />
<em>baggins bree butterbur ferny frodo landlord letter nob strider underhill</em></li>
<li>&#8220;A Knife in the Dark&#8221;<br />
<em>beren bob bree dell ferny galad gil strider tinúviel weathertop</em></li>
<li>&#8220;Flight to the Ford&#8221;<br />
<em>bone boot ford frodo glorfindel horse strider troll trolls weathertop</em></li>
</ol>
<h3>Book II</h3>
<p><em>aragorn boats boromir elrond frodo gandalf gimli haldir legolas sam</em></p>
<ol>
<li>&#8220;Many Meetings&#8221;<br />
<em>bilbo dúnadan elrond flood frodo glorfindel glóin mortals rivendell strider</em></li>
<li>&#8220;The Council of Elrond&#8221;<br />
<em>boromir elrond erestor galad gil glóin isildur radagast ring saruman</em></li>
<li>&#8220;The Ring Goes South&#8221;<br />
<em>aragorn bill boromir caradhras drift elrond frodo gandalf redhorn snow</em></li>
<li>&#8220;A Journey in the Dark&#8221;<br />
<em>bill boromir doors durin gandalf gimli howls mines moria wolves</em></li>
<li>&#8220;The Bridge of Khazad-dûm&#8221;<br />
<em>balin balrog beats boromir chamber doom door drum hall mazarbul</em></li>
<li>&#8220;Lothlórien&#8221;<br />
<em>flet frodo galadhrim gimli haldir ladder legolas lothlórien nimrodel silverlode</em></li>
<li>&#8220;The Mirror of Galadriel&#8221;<br />
<em>basin celeborn frodo galadhrim galadriel haldir lady magic mirror pedestal</em></li>
<li>&#8220;Farewell to Lórien&#8221;<br />
<em>boat boats boromir cakes celeborn galadhrim galadriel gimli lady lórien</em></li>
<li>&#8220;The Great River&#8221;<br />
<em>aragorn boat boats frodo gebir log paddle paddles rapids sarn</em></li>
<li>&#8220;The Breaking of the Fellowship&#8221;<br />
<em>amon aragorn boromir brandir frodo hen lawn lhaw sam tol</em></li>
</ol>
<h3>Book III</h3>
<p><em>aragorn ents gimli isengard legolas saruman théoden treebeard uglúk éomer</em></p>
<ol>
<li>&#8220;The Departure of Boromir&#8221;<br />
<em>aragorn boat boromir galen gimli gulls legolas orcs rauros tire</em></li>
<li>&#8220;The Riders of Rohan&#8221;<br />
<em>aragorn downs entwash fangorn gimli horses legolas orcs trail éomer</em></li>
<li>&#8220;The Uruk-Hai&#8221;<br />
<em>ankles grishnákh isengarders knoll lugbúrz merry orcs pippin uglúk wrists</em></li>
<li>&#8220;Treebeard&#8221;<br />
<em>ent entish ents entwives hasty hoom isengard rowan treebeard trees</em></li>
<li>&#8220;The White Rider&#8221;<br />
<em>aragorn fangorn gandalf gimli haft legolas man old saruman treebeard</em></li>
<li>&#8220;The King of the Golden Hall&#8221;<br />
<em>aragorn gandalf gríma hall háma neighed théoden wormtongue éomer éowyn</em></li>
<li>&#8220;Helm&#8217;s Deep&#8221;<br />
<em>aragorn coomb deeping dike erkenbrand gamling helm hornburg westfold éomer</em></li>
<li>&#8220;The Road to Isengard&#8221;<br />
<em>caverns caves fords gandalf gimli isen isengard legolas saruman théoden</em></li>
<li>&#8220;Flotsam and Jetsam&#8221;<br />
<em>barrels ent ents gimli huorns isengard pipe quickbeam saruman treebeard</em></li>
<li>&#8220;The Voice of Saruman&#8221;<br />
<em>dismissed ents gandalf injuries orthanc rail saruman spell théoden treebeard</em></li>
<li>&#8220;The Palantír&#8221;<br />
<em>ball bracken gandalf globe orthanc palantíri pippin saruman shadowfax wizard</em></li>
</ol>
<h3>Book IV</h3>
<p><em>faramir frodo gollum nice precious rope sam shagrat shelob sméagol</em></p>
<ol>
<li>&#8220;The Taming of Sméagol&#8221;<br />
<em>cliff fix frodo gollum knot precious rope sam sméagol yess</em></li>
<li>&#8220;The Passage of the Marshes&#8221;<br />
<em>fish frodo gollum gully marshes mires nice precious sam sméagol</em></li>
<li>&#8220;The Black Gate is Closed&#8221;<br />
<em>frodo gollum helps lithui master nice oliphaunt oliphaunts sam sméagol</em></li>
<li>&#8220;Of Herbs and Stewed Rabbit&#8221;<br />
<em>coney fern gollum herbs mablung pans rabbits sam sméagol taters</em></li>
<li>&#8220;The Window on the West&#8221;<br />
<em>arts blindfold boromir damrod faramir frodo isildur mablung men mithrandir</em></li>
<li>&#8220;The Forbidden Pool&#8221;<br />
<em>anborn faramir fish fissh frodo gollum precious protection sly sméagol</em></li>
<li>&#8220;Journey to the Cross-roads&#8221;<br />
<em>brewing decent faramir frodo glare gollum mould sam staves thorns</em></li>
<li>&#8220;The Stairs of Cirith Ungol&#8221;<br />
<em>dad frodo gollum morgul sam sméagol sneaking stairway wink wraith</em></li>
<li>&#8220;Shelob&#8221;<br />
<em>frodo glass gollum lair phial sam shelob sméagol stench tunnel</em></li>
<li>&#8220;The Choices of Master Samwise&#8221;<br />
<em>cords frodo gorbag lads ladyship lugbúrz sam shagrat shelob tunnel</em></li>
</ol>
<h3>Book V</h3>
<p><em>aragorn beregond city denethor faramir ghân men pippin éomer éowyn</em></p>
<ol>
<li>&#8220;Minas Tirith&#8221;<br />
<em>beregond bergil citadel city denethor forlong ingold mithrandir pippin street</em></li>
<li>&#8220;The Passing of the Grey Company&#8221;<br />
<em>aragorn burg elladan erech gimli halbarad king legolas ride théoden</em></li>
<li>&#8220;The Muster of Rohan&#8221;<br />
<em>dwimorberg edoras harrowdale hirgon king merry ride théoden éomer éowyn</em></li>
<li>&#8220;The Siege of Gondor&#8221;<br />
<em>beregond citadel city denethor faramir gandalf garrison pelennor pippin retreat</em></li>
<li>&#8220;The Ride of the Rohirrim&#8221;<br />
<em>buri dernhelm elfhelm ghân men stonewain théoden wild éomer éored</em></li>
<li>&#8220;The Battle of the Pelennor Fields&#8221;<br />
<em>dernhelm harlond king knights mûmakil snowmane southrons théoden éomer éowyn</em></li>
<li>&#8220;The Pyre of Denethor&#8221;<br />
<em>beregond bier burn city denethor faramir gandalf healing oil thou</em></li>
<li>&#8220;The Houses of Healing&#8221;<br />
<em>aragorn athelas bergil faramir healing imrahil ioreth sick éomer éowyn</em></li>
<li>&#8220;The Last Debate&#8221;<br />
<em>aragorn gimli gulls imrahil lamedon lebennin legolas pelargir ships thousands</em></li>
<li>&#8220;The Black Gate Opens&#8221;<br />
<em>army captains heralds imrahil lieutenant messenger morannon sauron terms thou</em></li>
</ol>
<h3>Book VI</h3>
<p><em>cotton faramir frodo lotho rosie ruffians sam shagrat sharkey éowyn</em></p>
<ol>
<li>&#8220;The Tower of Cirith Ungol&#8221;<br />
<em>court frodo gorbag ladder orc sam shagrat shelob snaga turret</em></li>
<li>&#8220;The Land of Shadow&#8221;<br />
<em>frodo isenmouthe morgai orc range sam shagrat spur tracker whip</em></li>
<li>&#8220;Mount Doom&#8221;<br />
<em>cone frodo gaping gollum load masster mountain sam sammath tearing</em></li>
<li>&#8220;The Field of Cormallen&#8221;<br />
<em>captains eagles frodo gwaihir isle knights merriment praise pure sam</em></li>
<li>&#8220;The Steward and the King&#8221;<br />
<em>aragorn city faramir healer healing ioreth king lady warden éowyn</em></li>
<li>&#8220;Many Partings&#8221;<br />
<em>arwen beggar celeborn hoom orthanc théoden treebeard wain éomer éowyn</em></li>
<li>&#8220;Homeward Bound&#8221;<br />
<em>barliman bill bree butterbur deadmen ferny gandalf nob pony troubles</em></li>
<li>&#8220;The Scouring of the Shire&#8221;<br />
<em>cotton farmer hob lotho ruffians sharkey sheds shirriff shirriffs village</em></li>
<li>&#8220;The Grey Havens&#8221;<br />
<em>bywater cotton deputy elanor frodo married mayor rosie ruffians sam</em></li>
</ol>
<h3>Appendices</h3>
<ul>
<li>A. &#8220;Annals of the kings and rulers&#8221;<br />
<em>arthedain arvedui eldacar eärnil gondor king náin tar thráin witch</em></li>
<li>B. &#8220;The Tale of Years&#8221;<br />
<em>begins birth guldur king meets reaches samwise sauron sets thranduil</em></li>
<li>C. &#8220;Family Trees&#8221;<br />
<em>addition appendix birth blanco family guests names party recorded recounted</em></li>
<li>D. &#8220;Shire Calendar&#8221;<br />
<em>calendar corresponded eldar lithe months names quenya reckoning week yule</em></li>
<li>E. &#8220;Writing and spelling&#8221;<br />
<em>english languages letters pronounced quenya represented represents series sindarin vowel</em></li>
<li>F. &#8220;The languages and peoples of the Third Age&#8221;<br />
<em>english languages letters pronounced quenya represented represents series sindarin vowel</em></li>
</ul>
<h3>Resulting Tags Cloud</h3>
<p><a href='http://cyberborean.org/blog/wp-content/uploads/2007/09/lotrtags.png' title='lotrtags.png'><img src='http://cyberborean.org/blog/wp-content/uploads/2007/09/lotrtags.png' alt='lotrtags.png' /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.cyberborean.org/2007/09/18/tagging-the-lord-of-the-rings/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SCAN project announce</title>
		<link>http://blog.cyberborean.org/2007/09/14/scan-project-announce</link>
		<comments>http://blog.cyberborean.org/2007/09/14/scan-project-announce#comments</comments>
		<pubDate>Fri, 14 Sep 2007 12:36:54 +0000</pubDate>
		<dc:creator>Alex Alishevskikh</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[desktop]]></category>
		<category><![CDATA[IA]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[metadata]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[SCAN]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[tagging]]></category>
		<category><![CDATA[taxonomy]]></category>
		<category><![CDATA[Tools]]></category>

		<guid isPermaLink="false">http://cyberborean.wordpress.com/2007/09/14/scan-project-announce/</guid>
		<description><![CDATA[ViceVersa Technologies presents the first public release of the SCAN (Smart Content Aggregation and Navigation) platform. SCAN is a personal Information Retrieval framework, combining search, text analysis, tagging and metadata functions to provide a new user experience of desktop navigation and document management. About SCAN &#8220;&#8230; the abundance of information will be such that either you have [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://scan.sourceforge.net"><img src='http://cyberborean.org/blog/wp-content/uploads/2007/09/scan100i.png' alt='SCAN' align='left' hspace='5'></a><em>ViceVersa Technologies presents the first public release of the <a href="http://scan.sourceforge.net">SCAN (Smart Content Aggregation and Navigation)</a> platform. SCAN is a personal Information Retrieval framework, combining search, text analysis, tagging and metadata functions to provide a new user experience of desktop navigation and document management.</em></p>
<p><span id="more-179"></span></p>
<h3>About SCAN</h3>
<blockquote><p>&#8220;&#8230; the abundance of information will be such that either you have reached such a level of maturity that you are able to be your own filter, or you will desperately need a filter&#8230; some professional filter.&#8221;<br />
<em>Umberto Eco: A Conversation on Information<br />
(<a href="http://carbon.cudenver.edu/~mryder/itc_data/eco/eco.html">an interview by Patrick Coppock</a>, February, 1995)</em></p></blockquote>
<p style="margin-top:0;">SCAN aims to solve the major problems of content organization and findability in the era of information overload.</p>
<p style="margin-top:0;"><a href='http://scan.sourceforge.net/uploads/images/browse.png' title='Browse documents'><img src='http://scan.sourceforge.net/uploads/images/browse_tmb.png' alt='Browse documents' align='left' hspace='5' vspace='5' /></a>SCAN aggregates content from different sources into a single document collection. This repository can keep records of thousands of documents independently of their original locations and formats. Every document record contains a number of metadata properties (such as title, description, author, creation date, etc.) which can be either set automatically or edited manually.</p>
<p style="margin-top:0;">       Adding documents to the repository is an automated operation. A user        only need to point SCAN to a location and the application will find and        add every document from there. Added document locations will be        monitored for changes (new, modified or deleted documents) to keep the        repository up-to-date.</p>
<p style="margin-top:0;">       The documents content is indexed for search and text analysis. You can        search the documents either by simple text queries, or by using special        forms to make complex queries for searching on document text and        properties. The queries can be saved for repeatable use.</p>
<p style="margin-top:0;"><a href='http://scan.sourceforge.net/uploads/images/tags.png' title='Tags panel'><img src='http://scan.sourceforge.net/uploads/images/tags_tmb.png' alt='Tags panel' align='left' hspace='5' vspace='5' /></a>The document collection is structured with a system of tags, similar to services like <a href="http://del.icio.us/">del.icio.us</a> or <a href="http://flickr.com/">Flickr</a>. Tags are keywords or labels attached to items to identify them for quick navigation and finding. All tags together form a <em>taxonomy</em> representing the semantics of the document collection. The taxonomy can be viewed as a &#8220;tag cloud&#8221; for navigating through the document repository.</p>
<p style="margin-top:0;">SCAN&#8217;s text analysis mechanism simplifies the process of tagging. It analyzes the content of a document and suggests the most relevant words as candidate tags, which makes manual tagging as simple as selecting tags from the proposed candidates. It can also take over the whole tagging process, either by automatically assigning tags to documents or by finding the documents relevant to a specific tag. Another application of text analysis is searching for documents similar to a specific one (search by pattern).</p>
<p style="margin-top:0;">SCAN is component-based software that uses a number of plugins for specific features. The basic SCAN platform can be easily extended with plugins for new document formats, document locations (RSS feeds, web sites, e-mail, etc.) and language analyzers. Whole new areas of functionality can be added with user interface extensions. An example of such an extension is the plugin for browsing the repository with a calendar (grouping the documents by their creation dates).</p>
<p style="margin-top:0;">SCAN is a <a href="http://java.sun.com/">Java</a> application, so it works on any Java-enabled platform. SCAN is free open source software, distributed under the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.</p>
<h3>See also:</h3>
<ul>
<li><a href="http://scan.sourceforge.net/?page_id=19">List of current features</a></li>
<li><a href="http://sourceforge.net/project/showfiles.php?group_id=189359">Download SCAN</a></li>
<li><a href="http://scan.sourceforge.net/?page_id=7">How to obtain SCAN sources from the SVN repository</a></li>
<li><a href="http://scan.sourceforge.net/?page_id=4">User&#8217;s Manual</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://blog.cyberborean.org/2007/09/14/scan-project-announce/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What do the tags mean from IR perspective?</title>
		<link>http://blog.cyberborean.org/2007/06/04/what-do-the-tags-mean-from-ir-perspective</link>
		<comments>http://blog.cyberborean.org/2007/06/04/what-do-the-tags-mean-from-ir-perspective#comments</comments>
		<pubDate>Mon, 04 Jun 2007 08:49:25 +0000</pubDate>
		<dc:creator>Alex Alishevskikh</dc:creator>
				<category><![CDATA[Essays]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[tagging]]></category>
		<category><![CDATA[tfidf]]></category>

		<guid isPermaLink="false">http://cyberborean.wordpress.com/2007/06/04/what-do-the-tags-mean-from-ir-perspective/</guid>
		<description><![CDATA[I&#8217;m trying to define the popular concept of &#8220;tags&#8221; in terms of Information Retrieval and to find an appropriate algorithmic strategy for automating the process of choosing tags for text documents. Tags are a handy, intuitive concept for labeling and describing pieces of information. They are usually assigned informally, and [...]]]></description>
			<content:encoded><![CDATA[<p><em>I&#8217;m trying to define the popular concept of &#8220;tags&#8221; in terms of Information Retrieval and to find an appropriate algorithmic strategy for automating the process of choosing tags for text documents.</em></p>
<p><span id="more-169"></span>Tags are a handy, intuitive concept for labeling and describing pieces of information. They are usually assigned informally, and as with any matter of &#8220;common sense&#8221;, it is difficult to define them formally. It is even more difficult to reproduce the mental process of choosing tags as an algorithm.</p>
<p>To avoid that fruitless work, I will consider some IR approaches that may be used to evaluate document terms as possible tag candidates. To make that evaluation possible, two important purposes of tags can be identified:</p>
<ul>
<li><strong>Differentiation<br />
</strong>To name a thing means to separate it from others. So, a good tag should identify a document by showing its unique feature &#8211; how the document differs from other documents in a corpus.</li>
<li><strong>Unification</strong><br />
A tag should also identify a semantic group to which the document belongs.</li>
</ul>
<p>These two key properties seem totally different, even mutually opposed. Nevertheless, they are both important, and <em>the best</em> tags perhaps integrate both properties in a dialectical way. A brief analysis of human-assigned tags on popular social web services supports that assumption.</p>
<h3>Traditional keyword extraction: TFIDF</h3>
<p>The first autotagging strategy is based on a popular term weighting technique from classical IR. The technique is known as <a href="http://en.wikipedia.org/wiki/TFIDF">TFIDF</a>, an abbreviation that reflects the basic formula for calculating the relevance (or &#8220;weight&#8221;) of a specific term in a given document: <em>Term Frequency x Inverse Document Frequency</em>.<br />
Though there are a number of variations of this formula (differing in their normalization techniques), the basic idea behind them is the same: the weight of a term in a document is directly proportional to the number of times the term appears in the document (<em>Term Frequency, TF</em>) and inversely proportional to the total number of documents containing this term (<em>Document Frequency, DF</em>).</p>
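<p>As a rough illustration, here is one common variant of the formula in Python. The logarithmic scaling of the inverse document frequency and the names <code>tfidf</code>, <code>tf</code>, <code>df</code> and <code>n_docs</code> are assumptions made for this sketch, not a fixed definition.</p>

```python
import math


def tfidf(tf: int, df: int, n_docs: int) -> float:
    """Weight of a term in a document: TF times log-scaled inverse DF.

    tf     - number of times the term appears in the document
    df     - number of documents in the corpus containing the term
    n_docs - total number of documents in the corpus
    """
    if df == 0:
        # The term does not occur anywhere in the corpus
        return 0.0
    return tf * math.log(n_docs / df)
```

<p>With this variant, a term that occurs in every document gets a weight of zero whatever its TF.</p>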
<p>This weighting scheme makes it possible to filter out the so-called &#8220;common words&#8221; (with high DF) as well as rare terms and misspellings (with both low TF and low DF). Terms with high TF and relatively low DF get the highest &#8220;importance&#8221; ranks and can be considered as candidate tags.</p>
<p>Thus, a TFIDF-based document tagging strategy may be described as follows:</p>
<ol>
<li>Build a list of all unique terms of a given document (the term vector).</li>
<li>Calculate the TFIDF weight for each term in the list.</li>
<li>Sort the terms by their TFIDF weights.</li>
<li>Select the <em>N</em> topmost terms, where <em>N</em> is the number of tags we want to assign to each document. We can also limit the number of tags in another way, e.g. by selecting the terms above a specified TFIDF threshold.</li>
</ol>
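<p>The steps above can be sketched in Python as follows. This is only a minimal sketch: it assumes the documents are already tokenized into lists of words, it uses the log-scaled TFIDF variant, and the name <code>suggest_tags</code> and its parameters are hypothetical.</p>

```python
import math
from collections import Counter


def suggest_tags(doc_tokens, corpus, n_tags=3):
    """Return the n_tags terms of doc_tokens with the highest TFIDF weight.

    corpus is a list of token lists (assumed pre-tokenized input) and is
    expected to include the document itself.
    """
    # Step 1: term vector of the document (unique terms with their TF)
    term_vector = Counter(doc_tokens)
    # DF: number of corpus documents containing each term
    doc_freq = Counter(term for tokens in corpus for term in set(tokens))
    n_docs = len(corpus)
    # Step 2: TFIDF weight for each term (terms unknown to the corpus are skipped)
    weights = {
        term: tf * math.log(n_docs / doc_freq[term])
        for term, tf in term_vector.items()
        if doc_freq[term]
    }
    # Steps 3 and 4: sort by weight (ties broken alphabetically), keep the top N
    ranked = sorted(weights, key=lambda term: (-weights[term], term))
    return ranked[:n_tags]


docs = [
    ["the", "cat", "sat", "on", "the", "mat", "cat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]
print(suggest_tags(docs[0], docs, n_tags=2))
```

<p>A threshold-based cut-off would replace the final slice with a filter on <code>weights</code>.</p>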
<h3>Improved autotagging strategies</h3>
<p>The problem with the simple TFIDF-based analysis reviewed above is that it conflicts with the second part of our definition of the purpose of tags: <em>unification</em>. The approach is good when we need to know which terms make a document different from others, but it doesn&#8217;t help if we need to select the terms that connect the document with its counterparts.</p>
<p>With this technique, the &#8220;tag cloud&#8221; tends to grow in proportion to the number of indexed documents, while the number of documents per tag shrinks. In the limiting case, each tag corresponds to a single document. Such a level of granularity is not what we want if we are going to control the size of our taxonomy and care about the grouping function of our tags. The tags that would contribute to unifying and grouping documents are overlooked by the algorithm because their DFs are higher than those of the terms that contribute to a document&#8217;s uniqueness. We could smooth that effect by reducing the influence of the DF parameter, but that increases the risk of common words appearing as document tags. Clearly, tweaking the basic TFIDF formula is not a solution to this problem.</p>
<p>To implement a better autotagging strategy (one that takes the document grouping factor into account), we need to analyze a document together with its semantic context, that is, a cluster of documents made up of the initial document plus the documents similar to it. The TFIDF ranking algorithm applied to that cluster gives us the terms relevant to the whole group, while common words are still effectively suppressed. The documents inside the cluster are very likely to be similar to one another, so this strategy tends to reuse the same tag set for semantically related documents wherever possible.</p>
<p>An easy way to find similar documents is to build the document term vector (as we did before) and submit it (or its highest-ranked part) to a search system as a query. Some search engines implement a &#8220;more-like-this&#8221; feature that works the same way. Better results may be achieved if the search system allows setting a &#8220;boost factor&#8221; for individual query terms, proportional to their weights in the term vector. Then we just select the search results above a specific threshold as the documents similar to the given one.</p>
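<p>Here is a minimal sketch of turning a weighted term vector into such a query. It assumes a search engine that accepts Lucene-style query syntax, where <code>term^boost</code> raises the weight of a term, and it assumes the terms are plain words that need no escaping. The function name <code>build_boosted_query</code> is hypothetical.</p>

```python
def build_boosted_query(weights, max_terms=10):
    """Turn a weighted term vector into a query string with per-term boosts.

    weights maps each term to its weight in the document term vector.
    """
    # Keep only the highest-ranked part of the term vector
    top = sorted(weights.items(), key=lambda item: (-item[1], item[0]))[:max_terms]
    # term^boost is Lucene-style syntax (an assumption); other engines may differ.
    # Terms with a zero weight carry no information and are left out.
    return " ".join(f"{term}^{weight:.2f}" for term, weight in top if weight > 0)


print(build_boosted_query({"cat": 2.2, "mat": 1.1, "the": 0.0}))
```

<p>The resulting string would then be passed to the search system, and the hits scoring above the chosen threshold taken as the similar documents.</p>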
<p>The whole strategy may look like this:</p>
<ol>
<li>Find a set of documents similar to a given document.</li>
<li>Build a <em>composite term vector</em> as a union of the term vectors of the given document and the documents similar to it.</li>
<li>Calculate TFIDF term weights against the composite term vector.</li>
<li>Sort the terms by their TFIDF weights.</li>
<li>Select the topmost terms as the document tags.</li>
</ol>
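<p>The strategy above can be sketched as follows. The similarity search (step 1) is assumed to have been done already, so the similar documents are passed in as an argument. Summing the term counts is my reading of the &#8220;union&#8221; of term vectors, DF is still counted over the whole corpus, and the name <code>suggest_group_tags</code> and its parameters are hypothetical.</p>

```python
import math
from collections import Counter


def suggest_group_tags(doc_tokens, similar_docs, corpus, n_tags=3):
    """Suggest tags for a document using the cluster of documents similar to it.

    similar_docs - token lists of the documents found by the similarity search
    corpus       - token lists of all indexed documents, cluster included
    """
    # Step 2: composite term vector, summing term counts over the cluster
    composite = Counter(doc_tokens)
    for tokens in similar_docs:
        composite.update(tokens)
    # DF is counted over the whole corpus, so common words stay suppressed
    doc_freq = Counter(term for tokens in corpus for term in set(tokens))
    n_docs = len(corpus)
    # Step 3: TFIDF weights against the composite term vector
    weights = {
        term: tf * math.log(n_docs / doc_freq[term])
        for term, tf in composite.items()
        if doc_freq[term]
    }
    # Steps 4 and 5: sort by weight (ties broken alphabetically), keep the top terms
    ranked = sorted(weights, key=lambda term: (-weights[term], term))
    return ranked[:n_tags]


corpus = [
    ["java", "search", "lucene", "the"],
    ["java", "search", "swing", "the"],
    ["cooking", "recipe", "the"],
    ["garden", "flowers", "the"],
    ["music", "jazz", "the"],
]
print(suggest_group_tags(corpus[0], [corpus[1]], corpus, n_tags=2))
```

<p>In this toy corpus the terms shared by the two related documents outrank the terms unique to either of them, while the word that occurs in every document still gets a zero weight.</p>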
<p>By tweaking the similarity thresholds and the TFIDF normalization (e.g. by taking similarity coefficients into account in the TFIDF calculation), we can adjust the results to get a &#8220;more specific&#8221; or &#8220;more general&#8221; set of tags (the differentiation/unification balance), depending on our goals.</p>
<p>Another improved autotagging strategy may be implemented as an adaptive technique, where the TFIDF evaluation takes into account whether a given term is already a tag or not. But this approach depends strictly on the initial tag set and will likely require some manual tagging before it can be used in automatic mode.</p>
<blockquote><p><em>P.S. If you found this article useful, please consider participating in <a href="http://cyberborean.wordpress.com/2007/03/26/survey-how-do-you-find-your-documents/">this survey</a>. It will help the author in his further work on this subject. Thanks!</em></p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://blog.cyberborean.org/2007/06/04/what-do-the-tags-mean-from-ir-perspective/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

