blog.poucet.org

Organizing information

Looking over the huge directory of papers that I have collected, I notice that it gets harder and harder to categorize papers in a purely hierarchical system. As my supervisor said, papers often cover several orthogonal subjects at once. On top of that, you often come across a paper that looks interesting but that you do not have time to read, so it ends up in Unsorted, which slowly grows and grows.

Therefore, I think it would be ideal if there were some system that allowed one to easily categorize information. Currently I’m not aware of any. The sort of features that I would look for are:

  • The ability to annotate files with information (be they pdf or ps, but why not html-links as well?).
  • Have some easy way of calling up the files and reorganizing them without having to duplicate the work in the tool as well as in the filesystem.
  • Have a hierarchical tag system. By that I mean the ability to organize tags in a hierarchy (and possibly even visualize them graphically to see where the bulk of your papers resides, indicating that perhaps that area needs to be categorized further), while each item managed by the system can still carry several different tags. This, I think, combines the power of tagging with the power of a hierarchical system. Besides, I do believe that in the world of research it is possible to organize tags into a hierarchy. If anyone disagrees, of course, I would definitely be interested in their point of view.
  • Have the ability to annotate other paper-specific items, such as the author list, possibly a list of references (linked to the other papers in the system if they exist, and otherwise creating an empty entry), the abstract and personal comments.
  • Lastly, of course, the ability to search by author, institute, tag, or keyword in the abstract or comments.
  • A desirable, though not necessary, feature would be the ability to have this application communicate with others (not necessarily online, but maybe through some patch system) to allow people to share their own categorization.
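To make the wish list above more concrete, here is a minimal sketch in Python of how hierarchical tags, multiple tags per paper, and search could fit together. Everything in it (the `Paper` and `Library` names, their fields, the sample tags and file names) is hypothetical and invented for illustration; it is one possible data model under those assumptions, not an existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    """One managed item; all field names are illustrative assumptions."""
    path: str                                   # file on disk (pdf, ps) or a link
    title: str
    authors: list[str] = field(default_factory=list)
    tags: set[str] = field(default_factory=set)  # a paper may carry many tags
    comments: str = ""


class Library:
    def __init__(self):
        self.parent = {}   # tag -> parent tag (None for a top-level tag)
        self.papers = []

    def add_tag(self, tag, parent=None):
        self.parent[tag] = parent

    def add_paper(self, paper):
        self.papers.append(paper)

    def descendants(self, tag):
        """Return the tag itself plus every tag below it in the hierarchy."""
        found = {tag}
        changed = True
        while changed:
            changed = False
            for child, par in self.parent.items():
                if par in found and child not in found:
                    found.add(child)
                    changed = True
        return found

    def by_tag(self, tag):
        """Papers tagged with the given tag or any of its subtags."""
        wanted = self.descendants(tag)
        return [p for p in self.papers if p.tags & wanted]

    def by_author(self, name):
        return [p for p in self.papers if name in p.authors]

    def counts(self):
        """Papers per tag (subtags included), to spot areas needing triage."""
        return {t: len(self.by_tag(t)) for t in self.parent}


lib = Library()
lib.add_tag("languages")
lib.add_tag("type systems", parent="languages")
lib.add_tag("compilers")
lib.add_paper(Paper("a.pdf", "Paper A", ["Smith"], {"type systems", "compilers"}))
lib.add_paper(Paper("b.ps", "Paper B", ["Jones"], {"languages"}))

print([p.title for p in lib.by_tag("languages")])  # ['Paper A', 'Paper B']
print(lib.counts())  # {'languages': 2, 'type systems': 1, 'compilers': 1}
```

Searching on a parent tag also returns papers filed under its subtags, which is what gives the hierarchy its value on top of flat tagging. The per-tag counts are the raw numbers a graphical view could use to show which areas have grown large enough to need finer categories.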

If anyone has any ideas regarding such a system, I would definitely be interested in hearing about it.

11 Comments

  1. Christophe (vincenz) says:

    I have to admit that I’ve desired such a system so much I almost feel tempted to write it myself. I still haven’t committed myself to it, as I am not sure I’d ever get it done. However, if others are interested we could definitely discuss it. I am also not sure which language I’d use: Haskell, Scheme, Ruby, C++, Java?

  2. Anonymous says:

    researchers at the university of waikato and others have applied good ol’ naive bayes for keyword extraction – see, e.g., the webpage for the Kea project at http://www.nzdl.org/Kea/ . you could then apply classic clustering algorithms (k-means or whatever) to the list of keywords.

    if you implemented this you could group together papers in a completely automated fashion that wouldn’t require any input (e.g., you wouldn’t have to tag your papers). this would be an easy web app to implement in java or whatever because of the availability of excellent machine learning/data mining libraries for it like Weka, YALE and so on.
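As a rough illustration of the clustering step suggested in this comment, the sketch below runs plain k-means over binary keyword vectors in pure Python. It does not use Kea or Weka: the keyword sets are written by hand and stand in for whatever a keyword extractor would produce, and the paper names, the function names, and the fixed starting centroids (chosen to keep the run reproducible) are all assumptions made for the example.

```python
def vectorize(keyword_sets):
    """Turn each keyword set into a 0/1 vector over the shared vocabulary."""
    vocab = sorted(set().union(*keyword_sets))
    return [[1.0 if w in ks else 0.0 for w in vocab] for ks in keyword_sets]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, init, iterations=20):
    """Plain k-means; `init` lists the indices of the starting centroids."""
    centroids = [list(points[i]) for i in init]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hand-written keywords standing in for extracted ones (hypothetical data).
papers = {
    "p1": {"monad", "haskell", "types"},
    "p2": {"haskell", "types", "inference"},
    "p3": {"cache", "memory", "dsp"},
    "p4": {"memory", "cache", "energy"},
}

labels = kmeans(vectorize(list(papers.values())), k=2, init=[0, 2])
print(dict(zip(papers, labels)))  # {'p1': 0, 'p2': 0, 'p3': 1, 'p4': 1}
```

A real collection would need a better choice of starting centroids and of the number of clusters, and the resulting groups are unnamed, so this complements rather than replaces manual tagging.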

  3. Anonymous says:

    not exactly what you suggest, but have you tried google notes?
    (works in the same way as opera notes, and is currently only available as a firefox-extension)

    skip skip skip copy to note skip

    http://www.google.com/notebook

  4. Christophe (vincenz) says:

    Hello anonymous (wkh@#haskell). I think your suggestion is an interesting one and not something I had considered. Indeed, it might be useful to automatically categorize papers that haven’t been read yet, which would make it easier to take in unread material. However, I do believe that I would still need the manually annotated system. I believe, and this is not my invention but something my supervisor has shown me a lot about, that it is necessary to orthogonalize research information in a consistent manner. Hence the need for the manual tagging system.

    Of course if other information, such as references, title, abstract etc… as well as other metadata such as the bayesian information you mention could be extracted automatically, that would definitely be a bonus.

  5. Christophe (vincenz) says:

    A colleague of mine pointed me to the application known as JabRef. It looks quite impressive though it’s not -quite- there yet. I would like to have a categorization of papers, which it does not do.

  6. Anonymous says:

    I suggest you consider google mail. You can use a separate account for this purpose and just mail pdfs to it, including some keywords in the mail body, and if you feel like it you can copy paste the abstract and add some comments.

    The search engine is powerful enough to make categorization redundant, but if you want to use categories, the labels are perfect for it: different labels can be applied to a single paper and you can add more detailed labels as the collection grows.

    The storage is large enough to store a lot of papers…

    One more added benefit: you can access your papers from any computer !

    Have fun !

    PS: I don’t work for google …

  7. Anonymous says:

    Look up citeulike. Enough people post papers there that the tags there provide you with your non-hierarchical system, and you can tag the paper multiple ways when you add it to your list.

    -Edward Kmett

  8. Christophe (vincenz) says:

    Regarding the suggestion about Gmail, that certainly seems like an interesting approach. As for citeulike, it will not do, as it does not organize the papers I have on my computer. In addition, not all of the information is public, and therefore it would have to be a private system.

    So far the best suggestions seem to be GMail or JabRef, though I admit I wish there were some extra features, like a graphical hierarchical view of the tag-set, so that you can notice what needs more triage.

  9. Andre Pang says:

    BibDesk on Mac OS X is designed for exactly this task: it’s iTunes, but for papers instead of music. You’ll have to have a Mac, though…

  10. Bart Masschelein says:

    >> I am also not sure which language I’d use: Haskell, Scheme, Ruby, C++, Java?

    Like we discussed, this sounds like an interesting project to work on. My first choice would be to use Rails, as it nicely avoids the mumbo-jumbo you might have when dealing with databases, while being more than powerful enough to use them. In fact, I don’t think it would be that difficult to implement, if only there were 30 hours in a day.

    For the time being, this GMail proposal sounds like a nice temporary solution…

  11. Yang says:

    I may be way off mark, but this is what I always thought that systems like WinFS and GNOME Storage were meant to do. I don’t know how flexible (easily extensible) they are (since the only thing I’ve seen them do is replace iTunes in categorizing your media files). Alas, neither project seems to be ‘ready.’

    There’s also DoXFS, which uses XFS to store meta-data about files. That’s about all I know about it – I don’t know if this will let you classify them.

    http://sourceforge.net/projects/doxfs/

    I’ve in fact been working on a simple piece of database software to manage my information and notes in a structured, typed manner. The data model focuses on extensibility, integration, and consistency, and I have every intention of using it to classify papers I come across. I think that such a platform could be a first step before trying to do something more ambitious such as machine classification or “social” classification. (I have kept the former in mind since the start of my project; the latter only occurred to me when I realized that more than one person may be interested in using my software.)
