Semantic Tagging in Search
In automated search and retrieval (enterprise search engines, content management systems, and related discovery and data mining products that do not rely on human indexing), semantic tagging plays a smaller role. Nevertheless, some of these vendors claim to offer semantic capabilities. In the competitive enterprise search space, new technologies are often based on either autocategorization (automatic indexing/tagging) or various text analytics techniques, such as pattern recognition or entity extraction. Most text analytics is not semantic because it does not discern the meaning of words; rather, it may classify words by part of speech (grammar). Various forms of autocategorization, on the other hand, may or may not involve a degree of semantic technology.
In cases where autocategorization search solutions or content management software come prepackaged with taxonomies or have a feature to build or automatically generate taxonomies (which only some vendors offer), there is a potential for what may be called semantic tagging. A simple taxonomy as used in information architecture, with a hierarchy of category terms, is not sufficient for effective autocategorization. What is needed is more of a "thesaurus" style of taxonomy, whereby there is a cluster of synonyms or other equivalent terms (abbreviations, acronyms, spelling variations, grammatical variations, etc.) for each concept in the taxonomy. Thus, the taxonomy is composed not merely of words, but of concepts that derive meaning ("semantics") from their cluster of synonyms. Autocategorization products that provide integrated taxonomies include Interwoven, Inc.’s MetaTagger; Teragram Corp.’s Categorizer and Taxonomy Manager; and Northern Light Group, LLC’s Enterprise Search Engine, MI Analyst, and Analyst Direct. Northern Light supports what it calls "meaning extraction."
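The idea of a concept defined by its cluster of equivalent terms can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any of the products named above work: the concepts, the term clusters, and the function names `build_term_index` and `tag_text` are all hypothetical, and real autocategorization relies on far more than whole-word matching.

```python
import re

# Hypothetical mini-taxonomy: each concept (the key) carries a cluster of
# equivalent terms: synonyms, acronyms, and spelling variations.
TAXONOMY = {
    "Customer relationship management": ["customer relationship management", "crm"],
    "Human resources": ["human resources", "hr", "personnel"],
    "Color": ["color", "colour", "colors", "colours"],
}


def build_term_index(taxonomy):
    """Map every equivalent term (lowercased) to the concept it belongs to."""
    index = {}
    for concept, terms in taxonomy.items():
        for term in terms:
            index[term.lower()] = concept
    return index


def tag_text(text, taxonomy):
    """Return the sorted list of concepts whose terms occur in the text.

    Matching is a naive, case-insensitive whole-word search; it is an
    assumption made for this sketch only.
    """
    index = build_term_index(taxonomy)
    lowered = text.lower()
    found = set()
    for term, concept in index.items():
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            found.add(concept)
    return sorted(found)
```

Calling `tag_text("Our HR team rolled out a new CRM.", TAXONOMY)` tags the sentence with the concepts "Customer relationship management" and "Human resources" even though neither phrase appears in the text, which is the practical difference between tagging with concepts and tagging with words.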
While much of text analytics does not involve semantic analysis, the specialty of natural language processing (NLP) is often involved when semantic analysis is attempted. NLP has many applications beyond semantic analysis and tagging, but it is being applied in that area as well. At the fourth annual Semantic Technology conference in San Jose, Calif., in May, the topic of semantic tagging was presented by TextWise, a developer of text extraction, search, categorization, and classification technologies using both NLP and statistics. In the presentation "Applying Trainable Semantic Vectors to Tagging, Search/Discovery, Bookmarking and Matching," a panel of TextWise speakers explained how its Semantic Signatures function as tags for bookmarking or in generating tags to map/link an existing tag set.
Semantic tagging’s integration with search technologies is also being applied in niche service areas. For example, Relevad, whose tagline is "semantic keyword analytics," provides a hosted web service for online advertisement placement. Relevad claims a growing database of more than 8 million keywords and more than 500 million neighbor keyword meanings. Trovix, meanwhile, provides a web service that matches jobs to resumes, utilizing complex scoring algorithms in combination with a "hierarchical knowledgebase" of U.S. cities, skills, positions, industries, and companies.
Semantic Social Tagging
The term "tagging" is most strongly associated these days with social tagging or social bookmarking, whereby people assign tags (terms or keywords) of their own choice to documents, blog posts, or webpages that they have created or viewed, to assist in locating the documents later, whether by themselves or by others. Better-known tagging websites and services include Delicious, Flickr, and Technorati. There is generally no taxonomy or controlled vocabulary involved, as any words can be used as tags, although this is changing in some applications.
Fundamentally, this type of tagging is "semantic" as well, because humans manually tag content for what it means. The problem is that this tagging is done based on what the document means to the tagger at the time of tagging, not necessarily what it means to other users or even to the initial tagger at a later time. Furthermore, any list of the occurrences of a tag can be long, undifferentiated, and ambiguous. The term "semantic tagging" within the sphere of social tagging, therefore, is being used to refer to a method of imposing consistent and more refined meaning, which in practice means utilizing some kind of taxonomy. Such semantic social tags are also being called "rich tags." Not only are the tags’ meanings clarified by synonyms, but there also may be links to related-term tags and glossary definitions for tags. In other words, semantic tags or rich tags are essentially terms in what is known to librarians as a thesaurus.
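A rich tag of this kind can be pictured as a small record rather than a bare string. The sketch below is a minimal illustration only; the class names `RichTag` and `TagSet`, the `resolve` method, and the sample tag are hypothetical and are not taken from any tagging service.

```python
from dataclasses import dataclass, field


@dataclass
class RichTag:
    """A tag that carries meaning beyond its label (hypothetical structure)."""
    label: str                                   # preferred form of the tag
    definition: str = ""                         # glossary definition
    synonyms: set = field(default_factory=set)   # equivalent terms
    related: set = field(default_factory=set)    # labels of related-term tags


class TagSet:
    """A shared tag set that resolves whatever a user types to one rich tag."""

    def __init__(self):
        self._tags = {}     # preferred label -> RichTag
        self._lookup = {}   # any known form, lowercased -> preferred label

    def add(self, tag):
        self._tags[tag.label] = tag
        for name in {tag.label, *tag.synonyms}:
            self._lookup[name.lower()] = tag.label

    def resolve(self, user_tag):
        """Return the RichTag for a user-entered tag, or None if unknown."""
        label = self._lookup.get(user_tag.strip().lower())
        return self._tags[label] if label else None
```

With a tag added as `RichTag("automobile", "A passenger road vehicle.", {"car", "auto"}, {"road transport"})`, a user who types "Car" and another who types "auto" are both resolved to the same "automobile" tag, along with its definition and related terms. That consistency is what a free-form tag lacks.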
Social tagging sites/services that offer what they call semantic tagging include Zigtag, a Canadian startup, and individual-led projects Faviki and Fuzzzy (yes, with three z’s). Zigtag (in private beta as of this writing) is a sidebar plug-in that differentiates itself from other tagging services by providing a "semantic dictionary" of more than 2 million tags. Tags are defined and synonyms are linked together. Faviki is a social bookmarking tool that provides terms from Wikipedia, extracted by the open DBpedia tool. This provides not only consistency but also extensive definitions for each of more than 2.18 million Wikipedia resources. Fuzzzy, on the other hand, did not start with a prebuilt taxonomy; instead, user-created terms are entered into a shared tag set (thesaurus), and various relationships (broader, narrower, related) are supported. Thus, Fuzzzy "enables global distributed tagging." The organic tag set of Fuzzzy is built upon the Topic Map ISO standard and an underlying infrastructure with Web Services.
It isn’t just new kids with extra consonants pursuing social tagging, however. Big, established content players are also getting involved. Thomson Reuters offers its open Calais Web Service, which ingests unstructured text and, using NLP, returns RDF-formatted results identifying entities, facts, and events within the text. In May, Calais was made available as plug-in software for the Drupal publishing platform, Yahoo!’s new SearchMonkey service, and the WordPress blogging platform. The Calais plug-in for WordPress, called Tagaroo, returns tag suggestions based on text typed into a blog but gives users the option of choosing which ones they want to apply. Calais also offers licensed code to make one’s site part of the "Semantic Web."