Monday, 20 February 2017
Thing 19: Text and data mining
I've seen Georgina's post on this topic. Its most important advice is likely to be this:
"If you wish to use TDM in your work, we highly
recommend that you ensure you are doing so legally and that you contact likeminded folk such as
the team at ContentMine to ask for advice."
Will do. My own post, like those of other participants, has to be short for want of experience. I have not had occasion to use data-mining in my own work, but I now know that any research query that sounds as though data mining would help towards the answer is a matter for ContentMine.
Meanwhile, I suppose I get a frisson of what data mining is like when I dabble in Google Books' Ngram Viewer. This enables the user to search vast numbers of books for the occurrence of phrases. With it I have satisfied my idle curiosity as to the frequency of use of the locution "And Oh!" (it seems to have peaked in 1842 and then slowly declined), and the relative frequency of the phrases "railway station" and "train station" (the latter overtook the former in 1994, and peaked in 2000; they now seem to be rapidly converging again). But I am not an expert user of this site, and I increased my knowledge of it by around 150% in the past hour, revisiting it for this post.
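For the curious, the same comparison can be pulled programmatically rather than through the web page. The sketch below, in Python, uses the unofficial JSON endpoint that appears to sit behind the Ngram Viewer; the endpoint, its parameters, and the corpus name are assumptions rather than a documented API, so treat it as an illustration only.

```python
# Sketch: query the (unofficial, undocumented) JSON endpoint behind Google
# Books' Ngram Viewer for the relative frequency of two phrases.
# The URL, parameters, and corpus name are assumptions and may change.
import requests

def ngram_frequencies(phrases, year_start=1900, year_end=2019,
                      corpus="en-2019", smoothing=3):
    """Return {phrase: [yearly frequencies]} from the Ngram Viewer endpoint."""
    resp = requests.get(
        "https://books.google.com/ngrams/json",
        params={
            "content": ",".join(phrases),
            "year_start": year_start,
            "year_end": year_end,
            "corpus": corpus,
            "smoothing": smoothing,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return {series["ngram"]: series["timeseries"] for series in resp.json()}

if __name__ == "__main__":
    data = ngram_frequencies(["railway station", "train station"])
    for phrase, series in data.items():
        print(phrase, "peak year:", 1900 + series.index(max(series)))
```

Each phrase comes back as one frequency series per year, which is essentially what the Viewer plots.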
Labels: 23researchthingscam, data mining, Ngram Viewer, text mining
Monday, 13 February 2017
Thing 18: Research data management (RDM)
This is where I hit the '23 research things' hard stuff. Research data management is an area where I clearly have more to learn than I do about things like Twitter. I draw comfort from other participants' blogs, in which similar confessions are made.
I'd better start by copying out those four types of data that Georgina names in the post under that link above:
- Observational – so data captured in real-time that is usually irreplaceable and can include anything from survey data to images of someone’s brain.
- Experimental – this can be data from lab work which can be reproducible such as gene sequences.
- Simulation – this can be data generated from test models where the models themselves are sometimes more important than the results, such as climate or economic models.
- Derived or compiled data – this is data that is reproducible and can include 3D models, text and data mining, and compiled databases.
A waterfall of consciousness notes that data management begins with "personal diaries, work e-mails, holiday snapshots and, even, home videos". Emboldened by that, let me tell how I have managed data at such a level.
Personal diaries. Mine go back, in an unbroken series, to 1 January 1969. I keep them all together, in a reasonably consistent compromise between size and chronological order, and I usually have no difficulty finding a particular year. I say "usually" because my latest search of them failed. The unfound years are presumably buried on my desk somewhere, and maybe it's time for a spring-clean.
Work emails. For these I have many folders. When I answer an email, I incorporate the incoming email into my reply, save that thread, and delete the incoming email. This works, though not as well as it used to, by reason of the sheer volume of email to deal with.
Word documents and spreadsheets. In the days when my Word documents were mostly letters, naming them was an easy matter of applying the date, written yymmdd, plus a sequential number for letters written on that date, and the file type extension ("17021201.doc"). An adaptation of that applies to things that fit into a regular sequence ("170212minutes.doc"). But I'm a bit indecisive with other files, leaving the form of the date and the relative position of date and content liable to variation ("170212members.xls", "Events Feb 2017.doc"). I need to get a grip.
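If I ever do get a grip, the convention is simple enough to automate. Here is a minimal sketch of the yymmdd-plus-sequence naming scheme described above, written in Python; the function name and folder handling are hypothetical, purely to illustrate the convention.

```python
# Sketch of the yymmdd naming convention described above: date first,
# then a two-digit sequence number for files created on the same date.
# The function name and directory handling are hypothetical.
from datetime import date
from pathlib import Path

def next_dated_name(folder, ext=".doc", when=None):
    """Return the next free 'yymmddNN<ext>' name in folder for the given date."""
    when = when or date.today()
    stamp = when.strftime("%y%m%d")          # e.g. "170212"
    existing = {p.stem for p in Path(folder).glob(stamp + "??" + ext)}
    for seq in range(1, 100):
        candidate = f"{stamp}{seq:02d}"      # e.g. "17021201"
        if candidate not in existing:
            return candidate + ext
    raise RuntimeError("More than 99 files for one date")

# Example: next_dated_name("letters", when=date(2017, 2, 12)) -> "17021201.doc"
```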
Money statements. Domestic money statements I've got more or less under control. They're still on paper, for the most part, and I shred them after set intervals (three months for those relating to food and cash withdrawals, two years for those relating to utilities, six years for all others). Bank statements and similar I keep in separate files for each account; bills &c from other organisations I keep in an A/Z sequence by organisation, and within each organisation in reverse chronological order. All this is a consequence of reading Taming the paper tiger at home by Barbara Hemphill, which I must have read around 2002.
Poetic output. This I keep track of using a card-index system I devised in 1994. The card fronts show poem title, number of lines, and year of composition; the backs show where I've submitted the poem and when, and the outcome of the submission. While a poem's being considered by an editor or competition adjudicator, I flag the card with a yellow sticker. If an unpublished poem is between submissions, I flag the card with a blue sticker. If I'm lucky enough to have the poem published, or placed in a competition, I mark the card front with a diagonal red line and the place of this success. And all this information is necessary. Poetry competitions often have rules about the number of lines, and about the ineligibility of poems that have already been published.
So all the above attempts at data management are creaking, and the paper-based ones will have to be replaced with electronic equivalents sooner or later. The poetry card index might be worth replacing with a database -- something I've had some training in, but have never actually built. An alternative might be to record the information in the files' properties, but I can see two disadvantages to that: the risk of information loss between file versions, and the amount of digging that would be needed to get at the information.
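For what it's worth, the card index maps quite naturally onto a small database. The sketch below is a guess at what that might look like using Python's standard sqlite3 module; the table and column names are invented here, mirroring the card fields described above.

```python
# Sketch: the poetry card index as a small SQLite database.
# Table and column names are invented, mirroring the card fields:
# title, number of lines, year of composition, plus a submissions history.
import sqlite3

conn = sqlite3.connect("poems.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS poems (
    id        INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    num_lines INTEGER NOT NULL,
    year      INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS submissions (
    id        INTEGER PRIMARY KEY,
    poem_id   INTEGER NOT NULL REFERENCES poems(id),
    sent_to   TEXT NOT NULL,   -- editor or competition
    sent_on   TEXT NOT NULL,   -- ISO date
    outcome   TEXT             -- NULL while under consideration
);
""")

# The yellow sticker (poem under consideration) becomes a simple query:
under_consideration = conn.execute("""
    SELECT p.title, s.sent_to, s.sent_on
    FROM poems p JOIN submissions s ON s.poem_id = p.id
    WHERE s.outcome IS NULL
""").fetchall()
```

The stickers and the red line then become queries rather than physical flags, which would also answer the worry about digging through file properties.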
Further research needed. What a thing for Love Your Data Week!
Labels: data management, diaries, email, Love Your Data Week, money, poetry card index
Monday, 6 February 2017
Thing 17: Survey tools (ii)
An update on what I posted on 8 January.
I said then I'd created a survey, using Google Forms, on the subject of superhero powers. It has drawn a single response, which is obviously not an adequate statistical sample. 100% of respondents named telekinesis as the superpower they hadn't got, and television as the source of their hearing about it; they were disappointed not to have this power, had aspired to it, and, for the story behind their aspiration, gave the following statement:
"Moving equipment would be a gesture away! Computers would be instantly fixed because I willed it to happen."
Let us press on to Georgina's recommended application, Qualtrics. This is now live for me, and I've had a go at creating a survey with it. It's a spoof, I admit, and neither an illuminating survey nor particularly funny, but it had me exploring large areas of Qualtrics that I shall be able to evaluate properly as I use them, or not, when creating surveys in earnest for the Haddon. They will be no worse for this dash of prior Qualtrics experience.